Files
sci-gui-agent-benchmark/evaluation_examples
Tianbao Xie 4e11eafd1d Robust Evaluation, Blocking File Open, Grader Sensitivity, and LibreOffice Writer Fixes (#217)
* Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix.

* More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solutions.

* Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files.

* Clean debug code

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-16 21:37:19 +08:00
..
2025-06-10 13:23:03 +00:00
2024-11-25 16:30:59 +08:00

Evaluation examples

Here we put the data examples to benchmark the ability of agents when interacting with GUI. The examples are stored in ./examples where each data item formatted as:

{
    "id": "uid", # unique id
    "snapshot": "snapshot_id", # the snapshot id of the environment, with some data already there and apps already opened, or just desktop
    "instruction": "natural_language_instruction", # the natural language instruction of the task, what we want the agent to do
    "source": "website_url", # where we know this example, some forum, or some website, or some paper
    "config": {xxx}, # the scripts to setup the donwload and open files actions, as the initial state of a task
    # (coming in next project) "trajectory": "trajectory_directory", # the trajectory directory, which contains the action sequence file, the screenshots and the recording video
    "related_apps": ["app1", "app2", ...], # the related apps, which are opened during the task
    "evaluator": "evaluation_dir", # the directory of the evaluator, which contains the evaluation script for this example
…
}

The ./trajectories file contains the annotated trajectories for each data item in ./examples for finishing the task.

For now, it is under construction, and only tested on Windows 10. Please:

  • Modify the path accordingly to run the evaluation;
  • Remind us if some parts are overfit to our environment.