f2d40ed181
Update JADE benchmark task JSON files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 10:39:33 +08:00
7eb8c4000c
data: add Origin input data files and allow the data/origin directory to be tracked
Add example_with_graph.opju and example_with_graph_scaled_from2_to_10.opju, and add an exception rule for **/data/origin/ to .gitignore.
2026-03-30 18:02:22 +08:00
581ccc4dfd
data: revise instruction and steps for Origin task2/3/4/5/8/9/11/12
Replace infeasible paths in some tasks (e.g. View→Formula Bar) with paths that can actually be executed (e.g. Window→Script Window), and update the steps descriptions and sleep durations to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 17:58:52 +08:00
d9986142b4
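fix(origin): fix missing input files in task configs and a design error in task11
- task2/3/5/9/11: add upload_file + launch to pass in example.xlsx
- task11: correct the menu path in instruction; add the prerequisite graph-creation step to steps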
2026-03-27 16:49:24 +08:00
252d2f79ce
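fix(eval): fix screenshot sorting bug in vllm_eval and align with reeval logic
- Fix a bug in _load_screenshots_from_dir where screenshots were sorted as strings, causing step_9 to be mistaken for the final frame; sort numerically instead
- Align with the prompt logic of reeval.py: explicitly require the model to check the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower the evaluation temperature from 0.7 to 0.2 to improve consistency
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- test_final.json: add avogadro (11 tasks) and origin (8 tasks)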
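
The string-versus-numeric sorting issue described in this commit can be illustrated with a minimal sketch. The `step_N.png` naming pattern and the helper name `step_index` are assumptions for illustration; the actual code in `_load_screenshots_from_dir` is not shown in this log.

```python
import re


def step_index(filename: str) -> int:
    """Return the integer step number from a name like 'step_12.png'.

    The 'step_<N>' pattern is an assumed naming scheme; names that do not
    match sort first (index -1).
    """
    match = re.search(r"step_(\d+)", filename)
    return int(match.group(1)) if match else -1


names = ["step_1.png", "step_10.png", "step_2.png", "step_9.png"]

# Plain string sorting compares character by character, so "step_10.png"
# sorts before "step_2.png" and the last element is "step_9.png".
string_sorted = sorted(names)

# Sorting on the parsed integer puts "step_10.png" last, as intended.
numeric_sorted = sorted(names, key=step_index)

print(string_sorted[-1])
print(numeric_sorted[-1])
```

With string ordering the evaluator would treat `step_9.png` as the final frame; with the numeric key the final frame is `step_10.png`.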
2026-03-27 14:34:32 +08:00
4e192cf013
Save local changes before pulling
2026-03-26 10:52:22 +08:00
04089fa218
Save local changes before pulling
2026-03-26 10:50:07 +08:00
a38d2faec3
Update the origin launch address
2026-03-26 10:46:29 +08:00
c9912ad54c
data: remove ovito remote_file_access/rendering tasks and update test_final.json
2026-03-25 23:27:47 +08:00
970d430dcf
feat: local changes to agent.py / run_proxmox / chrome tasks
2026-03-25 23:27:47 +08:00
fe1bdae9a6
update avogadro examples
2026-03-24 11:54:17 +08:00
adb66ef972
update avogadro examples
2026-03-24 11:36:51 +08:00
3f0ef4849a
Add origin data files
2026-03-19 18:05:13 +08:00
ae202be7b9
Update origin task
2026-03-19 17:58:11 +08:00
64e19ba17e
Add ovito data examples
2026-03-19 10:08:25 +08:00
ae92e80a0b
Update ovito evaluation examples
2026-03-19 10:00:27 +08:00
0e2702fb5b
Merge branch 'lzy/data-processing' of https://git.matai.center/lzy/sci-gui-agent-benchmark into lzy/data-processing
2026-03-18 17:33:11 +08:00
16ea3641bd
Add ovito example files
2026-03-18 16:49:11 +08:00
dc5fd173f1
data: update avogadro building-metal-complexes task1 & task3
2026-03-13 17:19:44 +08:00
a943c1e961
feat: update Jade/VESTA task definitions + final evaluation list
- Jade: update 15 task JSONs (refined instruction + fully expanded metadata.steps)
- VESTA: restructure 10 task JSONs (standardize on NaCl.cif/anatase_TiO2.cif + rewrite steps)
- VESTA: remove task1, add 2 CIF data files
- Add test_final.json (11 jade + 10 vesta = 21 tasks)
- run_proxmox.sh: MODEL→gpt-5.4, MAX_STEPS→35, TEST_META→test_final.json
2026-03-11 11:02:26 +08:00
d71f1f976d
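feat: vllm_eval key-frame sampling + Gemini support via OpenAI proxy
- vllm_eval.py: add the _sample_key_frames key-frame sampling function
- vllm_eval.py: sample uniformly when the number of screenshots exceeds max_eval_images
- vllm_eval.py: support calling Gemini models through an OpenAI-compatible proxy
- test_single.json: update test task configuration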
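
Uniform sampling of screenshots down to a fixed budget, as this commit describes, might look roughly like the sketch below. The function name `sample_key_frames`, its signature, and the choice to always keep the first and last frames are assumptions; the real `_sample_key_frames` in vllm_eval.py is not shown in this log.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_key_frames(frames: Sequence[T], max_images: int) -> List[T]:
    """Pick at most max_images frames, evenly spaced over the sequence.

    Hypothetical helper: keeps the first and last frame whenever
    max_images >= 2, and returns all frames when they fit the budget.
    """
    if max_images <= 0:
        return []
    if len(frames) <= max_images:
        return list(frames)
    if max_images == 1:
        # With a budget of one, keep the final frame only.
        return [frames[-1]]
    last = len(frames) - 1
    # Integer arithmetic gives evenly spaced, strictly increasing indices
    # from 0 to last inclusive.
    indices = [(i * last) // (max_images - 1) for i in range(max_images)]
    return [frames[i] for i in indices]


print(sample_key_frames(list(range(10)), 4))
```

Keeping the last frame is a deliberate choice in this sketch, since the final screenshot usually carries the end state that an evaluator needs to see.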
2026-03-04 16:39:24 +08:00
e70f1335f0
config: update test task configuration files
2026-03-04 10:44:00 +08:00
9431bd5bfc
data: refine metadata steps of existing avogadro/imagej/origin/ovito/pymol/vesta tasks
2026-03-04 10:43:49 +08:00
b1052c79cf
data: add jade/avogadro/ovito/pymol evaluation task data
2026-03-04 10:43:29 +08:00
ac3f38ed58
feat: add refine_metadata script, update extract_instructions_v2
2026-03-04 10:43:14 +08:00
e4b039fc02
refine jade metadata steps: add shortcuts & merge menu operations to avoid timeout
2026-02-27 18:19:04 +08:00
b75f6bf341
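feat: strengthen task-step injection and a11y state representation to improve the stability of tree interactions
- Wire up the metadata.steps passing chain and inject task steps into the agent's prediction context
- Improve the linearized a11y tree output: use center coordinates and add a states column (expanded/collapsed/selected, etc.)
- Relax the node-retention condition to keep input controls that have no text (edit/textfield/searchbox, etc.)
- Tighten output constraints: a single turn may contain only action code or WAIT/DONE/FAIL; returning an action together with DONE in the same turn is forbidden
- Add avogadro example steps: expand aromatics and select benzene.cjson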
2026-02-26 18:56:53 +08:00
07e66490dd
feat: improve a11y tree support for scientific software
- Extend the heuristic_retrieve.py whitelist to cover the GUI frameworks used by scientific software:
  - Add prefix rules: sunawt (Java Swing), qt5q/qt6q (Qt), ovito, pymol,
    contentspanel, wx (wxWidgets), afx (MFC), thunderrt (VB6)
  - Add endswith rules: edit, widget, box, dialog, view, frame, menuitem,
    menubar, toolbar, tabitem, treeitem, window
  - Add exact matches for Qt controls and Win32 controls
- Add debug logging of the raw a11y tree in agent.py
- Fix the missing platform='windows' in agent initialization in run.py
- Add NO_PROXY to bypass local/VM IPs (compatible with Clash global proxy)
- Increase the application startup wait time in lib_run_single.py to 15 seconds
- Add test_each_domain_a11y_tree.json (one task per domain, for a11y verification)
2026-02-26 15:04:28 +08:00
9899d4a0c7
feat: add benchmark task data for scientific software
- Add task JSONs for scientific software, including avogadro/imagej/jade/origin/ovito/pymol/vesta
- vllm_eval.py: rename image files to "step x"
- desktop_env.py: add extra data parameters config and metadata
2026-02-25 15:19:36 +08:00
cui0711
613f55f0da
feat(tools): add instructions extraction script for generating test cases
2026-02-09 17:47:02 +08:00
cui0711
ad46acc5f3
refactor(example): replace check_include_exclude with vllm_eval evaluator
2026-02-05 16:55:03 +08:00
cui0711
231f7a8fbc
feat(eval): add jade test case and update test categories
2026-01-30 16:29:05 +08:00
MillanK
cbc3b590ff
Task fix batch (#383)
* update 873cafdd-a581-47f6-8b33-b9696ddb7b05 task eval
* c1fa57f3-c3db-4596-8f09-020701085416 fix, add tolerance to url matching
* 8df7e444-8e06-4f93-8a1a-c5c974269d82 add more clear instruction to the filename for compress
* add address string normalization for 6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a
---------
Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
2025-11-19 17:24:25 +08:00
yiqilin
6d43dbc532
Update GIMP evaluation examples to replace local file paths with cloud file URLs for consistency and accessibility. (#372)
2025-11-07 21:49:49 +08:00
Timothyxxx
ff6285cfbb
Add safe browsing feature to Chrome evaluator
- Implemented `get_enable_safe_browsing` function to retrieve safe browsing settings based on the operating system.
- Updated the `__init__.py` to include the new function.
- Modified JSON examples to reflect the change from enabling enhanced safety browsing to enabling safe browsing.
- Added necessary commands in the JSON examples for setting up preferences for safe browsing.
2025-10-05 04:56:08 +00:00
Danyang Zhang
afd5952e44
ver Oct3rd (#349)
updated a series of instructions to ask the agent not to do any
unnecessary actions.
2025-10-04 00:13:29 +08:00
Timothyxxx
1572068035
Refactor evaluator functions in JSON examples to use URL pattern matching. Update expected URL formats to regex patterns for better validation in chrome evaluation examples.
2025-10-01 19:20:06 +00:00
Timothyxxx
9be518435c
Update GIMP evaluation examples to replace local file paths with cloud file URLs for consistency and accessibility.
2025-10-01 09:54:52 +00:00
Timothyxxx
756923beea
Update instruction wording in LibreOffice Impress example to clarify text color change requirements. Address https://github.com/xlang-ai/OSWorld/issues/324
2025-09-01 23:29:47 +08:00
Adam Yanxiao Zhao
aa05f6cc26
Add AutoGLM-OS agent (#309)
* autoglm-os initialize
* clean code
* chore: use proxy for download setup
* feat(autoglm-os): add parameter to toggle images
* fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel
* update
* add client_password
* update multienv
* fix
* fix prompt
* fix prompt
* fix prompt
* fix sys prompt
* feat: use proxy in file evaluator
* fix client_password
* fix note_prompt
* fix autoglm agent cmd type
* fix
* revert: fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel
reverts commit bab5473eea1de0e61b0e1d68b23ce324a5b0ee57
* feat(autoglm): setup tools
* fix(autoglm): remove second time of get a11y tree
* add osworld server restart
* Revert "add osworld server restart"
This reverts commit 7bd9d84122e246ce2a26de0e49c25494244c2b3d.
* fix _launch_setup
* fix autoglm agent tools & xml tree
* fix desktop_env
* fix bug for tool name capitalization
* fix: always use proxy for setup download
* add fail after exceeding max turns
* fix(autoglm): avoid adding image to message when screenshot is empty
* fix maximize_window
* fix maximize_window
* fix maximize_window
* fix import browsertools module bug
* fix task proxy config bug
* restore setup
* refactor desktop env
* restore image in provider
* restore file.py
* refactor desktop_env
* quick fix
* refactor desktop_env.step
* fix our env reset
* add max turns constraint
* clean run script
* clean lib_run_single.py
---------
Co-authored-by: hanyullai <hanyullai@outlook.com>
Co-authored-by: JingBh <jingbohao@yeah.net>
2025-08-17 12:08:40 +08:00
Danyang Zhang
7364a720a6
Calc eval fix (#297)
* ver Jun17th
updating annotations
* ver Jun17th
corrected annotation of 1d17
added check for cell merge
* ver Jun17th
updated several annotations
* ver Jun20th
fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08
* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.
* ver Jun21st
updating calc evals
* ver Jun22nd
fixed an impress task
* ver Jun22ndv2
adjusted several calc tasks
* Clean scaffolds
* ver Jul18th
added two try-excepts to handle possible formula parsing and calculation
failures
* ver Jul19th
added supports for cellIs and some other new types of conditional
formatting for calc evaluation
* ver Aug4th
updated some instructions
* ver Aug4thv2
fixed a typo
---------
Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-08-04 12:39:35 +08:00
yuanmengqi
a5b51e8010
refactor: update command in JSON example to use placeholder for client password
- Replaced the hardcoded password in the command with a placeholder `{CLIENT_PASSWORD}` for improved security and flexibility.
- Ensured that the overall structure of the JSON remains unchanged while enhancing the example's usability.
2025-07-31 05:20:04 +00:00
Jiaqi
23b81993fa
os task fix: set the default dim screen time to be 300s
2025-07-24 08:13:02 +00:00
ChenYXxxx
873f8a0359
Update 10a730d5-d414-4b40-b479-684bed1ae522.json
change the ight 2 the night
2025-07-24 15:44:52 +08:00
Yuan Mengqi
d128edbbc1
add nogdrive json (#281)
* add uitars agent code
* improve claude
* improve claude
* improve claude
* improve claude
* improve claude
* add nogdrive json
2025-07-23 19:12:42 +08:00
yuanmengqi
d6f2190a9f
fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine
2025-07-18 18:51:01 +00:00
Danyang Zhang
53ffc05042
Calc eval fix (#272)
* ver Jun17th
updating annotations
* ver Jun17th
corrected annotation of 1d17
added check for cell merge
* ver Jun17th
updated several annotations
* ver Jun20th
fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08
* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.
* ver Jun21st
updating calc evals
* ver Jun22nd
fixed an impress task
* ver Jun22ndv2
adjusted several calc tasks
* Clean scaffolds
* ver Jul18th
added two try-excepts to handle possible formula parsing and calculation
failures
---------
Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-18 21:28:48 +08:00
shenzhennan
c7017a476d
fix impress instruction 0a211154
2025-07-18 07:14:35 +00:00
yuanmengqi
0fb625e4fd
Update instruction in OS evaluation example to include a restriction against shutting down the machine.
2025-07-18 05:28:43 +00:00
yuanmengqi
2c51950e73
feat: enhance evaluator configuration for Chrome with post-execution commands
- Added postconfig commands to multiple JSON files for Chrome evaluation examples.
- Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing.
- Updated logging messages in the AWS manager to improve clarity and user experience.
These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
2025-07-17 10:50:10 +00:00