Evaluation examples

This directory holds the data examples used to benchmark how well agents interact with a GUI. The examples are stored in ./examples, where each data item is formatted as:

{
    "id": "uid", # unique id
    "snapshot": "snapshot_id", # the snapshot id of the environment, either with some data already present and apps already opened, or just the desktop
    "instruction": "natural_language_instruction", # the natural language instruction of the task, what we want the agent to do
    "source": "website_url", # where this example came from: a forum, a website, or a paper
    "config": {xxx}, # the scripts that set up the download and open-file actions, forming the initial state of a task
    # (coming in next project) "trajectory": "trajectory_directory", # the trajectory directory, which contains the action sequence file, the screenshots and the recording video
    "related_apps": ["app1", "app2", ...], # the related apps, which are opened during the task
    "evaluator": "evaluation_dir", # the directory of the evaluator, which contains the evaluation script for this example
…
}
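
As a rough sketch of how an item in this format might be read and sanity-checked, the Python snippet below loads one example file and reports any documented keys that are missing. The helper names `missing_keys` and `load_example` are hypothetical and not part of this repository. The key list is copied from the listing above, which ends in an ellipsis, so real items may contain further fields. The sketch also assumes the files on disk are plain JSON without the `#` annotations shown above.

```python
from __future__ import annotations

import json
from pathlib import Path

# Keys taken from the format listing above. The listing ends with an
# ellipsis, so real items may carry additional fields; extras are allowed.
DOCUMENTED_KEYS = (
    "id",
    "snapshot",
    "instruction",
    "source",
    "config",
    "related_apps",
    "evaluator",
)


def missing_keys(item: dict) -> list[str]:
    """Return the documented keys that are absent from one data item."""
    return [key for key in DOCUMENTED_KEYS if key not in item]


def load_example(path: str | Path) -> dict:
    """Load one example file and raise if a documented key is absent.

    Hypothetical helper for illustration, not the benchmark's own loader.
    """
    with open(path, encoding="utf-8") as handle:
        item = json.load(handle)
    missing = missing_keys(item)
    if missing:
        raise ValueError(f"{path}: missing keys {missing}")
    return item
```

Calling `load_example` on a file under ./examples returns the parsed dict, or raises `ValueError` naming the missing keys.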

The ./trajectories file contains the annotated trajectories that complete the task for each data item in ./examples.

This is still under construction and has only been tested on Windows 10. Please:

  • Modify the paths as needed to run the evaluation;
  • Let us know if any parts are overfit to our environment.