Files

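lizhanyuan 252d2f79ce fix(eval): fix vllm_eval screenshot sorting bug and align with reeval logic

- Fix a bug in _load_screenshots_from_dir where screenshots were sorted as strings, causing step_9 to be mistaken for the final frame; sort numerically instead
- Align with the prompt logic in reeval.py: explicitly require the model to check the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower the evaluation temperature from 0.7 to 0.2 to improve consistency
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- Add avogadro (11 tasks) and origin (8 tasks) to test_final.json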

2026-03-27 14:34:32 +08:00
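
The sorting bug described in the commit message above can be illustrated with a minimal sketch. It is not the repository's actual `_load_screenshots_from_dir` code: the `step_N.png` file names and the `step_index` helper are assumptions made for illustration only.

```python
import re


def step_index(filename: str) -> int:
    """Return the first integer found in the file name, or -1 if there is none.

    Hypothetical helper; the real loader may parse names differently.
    """
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else -1


# Assumed naming scheme: step_<number>.png
names = ["step_1.png", "step_10.png", "step_2.png", "step_9.png"]

# Plain string sorting compares character by character, so "step_9.png"
# sorts after "step_10.png" and is wrongly treated as the last frame.
print(sorted(names)[-1])                  # step_9.png

# Sorting on the parsed integer puts the true final frame last.
print(sorted(names, key=step_index)[-1])  # step_10.png
```

Using a `key` function leaves the file names untouched and only changes the comparison, so the rest of a loader would not need to change.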

data

Add origin data files

2026-03-19 18:05:13 +08:00

examples

fix(eval): fix vllm_eval screenshot sorting bug and align with reeval logic

2026-03-27 14:34:32 +08:00

examples_windows

refactor: update URLs in multiple JSON files to ensure proper encoding of special characters

2025-06-07 17:26:45 +00:00

settings

fix some multi_apps task (#243)

2025-07-08 18:59:00 +08:00

extract_instructions_v2.py

feat: add refine_metadata script, update extract_instructions_v2

2026-03-04 10:43:14 +08:00

extract_instructions.py

feat(tools): add instructions extraction script for generating test cases

2026-02-09 17:47:02 +08:00

README.md

Server setup readme revision (#108)

2024-11-25 16:30:59 +08:00

refine_metadata.py

feat: add refine_metadata script, update extract_instructions_v2

2026-03-04 10:43:14 +08:00

test_all.json

config: update test task configuration files

2026-03-04 10:44:00 +08:00

test_curated.json

config: update test task configuration files

2026-03-04 10:44:00 +08:00

test_custom.json

config: update test task configuration files

2026-03-04 10:44:00 +08:00

test_each_domain_a11y_tree.json

feat: improve a11y tree support for scientific software

2026-02-26 15:04:28 +08:00

test_final.json

fix(eval): fix vllm_eval screenshot sorting bug and align with reeval logic

2026-03-27 14:34:32 +08:00

test_nogdrive.json

add nogdrive json (#281)

2025-07-23 19:12:42 +08:00

test_single.json

feat: vllm_eval keyframe sampling + Gemini OpenAI proxy support

2026-03-04 16:39:24 +08:00

test_small.json

Fix Duplicate ids; Remove unused JSON files across multiple applications

2025-02-10 15:49:54 +08:00

README.md

Evaluation examples

This directory holds the data examples used to benchmark how well agents interact with GUIs. The examples are stored in ./examples, where each data item is formatted as:

{
    "id": "uid", # unique id
    "snapshot": "snapshot_id", # the snapshot id of the environment: either with some data already in place and apps already opened, or just the desktop
    "instruction": "natural_language_instruction", # the natural language instruction of the task, i.e. what we want the agent to do
    "source": "website_url", # where this example comes from, e.g. a forum, a website, or a paper
    "config": {xxx}, # the scripts that set up the initial state of the task, such as downloading and opening files
    # (coming in next project) "trajectory": "trajectory_directory", # the trajectory directory, which contains the action sequence file, the screenshots, and the recording video
    "related_apps": ["app1", "app2", ...], # the related apps, which are opened during the task
    "evaluator": "evaluation_dir", # the directory of the evaluator, which contains the evaluation script for this example
…
}
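
As a rough sketch of how a data item in this format might be checked before use, the snippet below parses one item and reports which of the fields listed above are missing. The `REQUIRED_KEYS` tuple and the `missing_keys` helper are hypothetical, not part of the repository. The sketch assumes the items are plain JSON without the inline `#` comments, uses an empty `config` as a placeholder, and ignores any further fields covered by the elided part of the format.

```python
import json

# Field names taken from the format described above. Treating all of them
# as required is an assumption made for this sketch.
REQUIRED_KEYS = (
    "id",
    "snapshot",
    "instruction",
    "source",
    "config",
    "related_apps",
    "evaluator",
)


def missing_keys(item: dict) -> list:
    """Return the required keys that are absent from a data item."""
    return [key for key in REQUIRED_KEYS if key not in item]


# Placeholder values mirroring the template; "config" is left empty here.
raw = """
{
    "id": "uid",
    "snapshot": "snapshot_id",
    "instruction": "natural_language_instruction",
    "source": "website_url",
    "config": {},
    "related_apps": ["app1", "app2"],
    "evaluator": "evaluation_dir"
}
"""

item = json.loads(raw)
print(missing_keys(item))                 # []
print(missing_keys({"id": "uid"}))        # every key except "id"
```

A check like this only confirms that the keys exist; it does not validate the contents of `config` or `evaluator`.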

The ./trajectories directory contains the annotated trajectories for completing the task of each data item in ./examples.

This is still under construction and has only been tested on Windows 10. Please:

- Modify the paths as needed to run the evaluation;
- Let us know if any parts are overfitted to our environment.