sci-gui-agent-benchmark

Author	SHA1	Message	Date
lizhanyuan	f2d40ed181	Update JADE benchmark task JSON files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 10:39:33 +08:00
lizhanyuan	7eb8c4000c	data: add Origin input data files and allow the data/origin directory to be tracked. Adds example_with_graph.opju and example_with_graph_scaled_from2_to_10.opju, and adds an exception rule for **/data/origin/ to .gitignore.	2026-03-30 18:02:22 +08:00
lizhanyuan	581ccc4dfd	data: revise instruction and steps for Origin task2/3/4/5/8/9/11/12. Changes some tasks from infeasible paths (e.g. View→Formula Bar) to operation paths that can actually be executed (e.g. Window→Script Window), and updates the steps descriptions and sleep durations to match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 17:58:52 +08:00
lizhanyuan	04dfd5a89a	fix: write traj.jsonl with UTF-8 encoding and preserve non-ASCII characters. All run_single_example variants now pass encoding="utf-8" and ensure_ascii=False when writing traj.jsonl, so that Chinese and other characters are not escaped and writes do not fail. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 17:58:44 +08:00
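lizhanyuan	b37e4d4372	feat: auto-maximize and force to the foreground on .exe launch (WaitForInputIdle+SetForegroundWindow)	2026-03-29 14:26:56 +08:00
lizhanyuan	d9986142b4	fix(origin): fix task configs missing input files and the task11 design error - task2/3/5/9/11: add upload_file+launch passing example.xlsx - task11: correct the menu path in instruction; add the prerequisite graph-creation step to steps	2026-03-27 16:49:24 +08:00
lizhanyuan	6a5fe8e8ca	config: change the evaluation domain to all to support batch evaluation across all software	2026-03-27 14:35:35 +08:00
lizhanyuan	252d2f79ce	fix(eval): fix the vllm_eval screenshot sorting bug and align with the reeval logic - Fix the bug in _load_screenshots_from_dir where sorting screenshots as strings caused step_9 to be mistaken for the final frame; sort numerically instead - Align with the prompt logic of reeval.py: explicitly require the model to check the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST) - Lower the evaluation temperature from 0.7 to 0.2 to improve consistency - Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json - Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation) - test_final.json: add avogadro (11 tasks) and origin (8 tasks)	2026-03-27 14:34:32 +08:00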
kingyang0	4e192cf013	Save local changes before pulling	2026-03-26 10:52:22 +08:00
kingyang0	04089fa218	Save local changes before pulling	2026-03-26 10:50:07 +08:00
kingyang0	a38d2faec3	Update the origin launch path	2026-03-26 10:46:29 +08:00
lizhanyuan	b1ed0a4785	add a11y_tree recording to trajectory output	2026-03-25 23:27:47 +08:00
lizhanyuan	c9912ad54c	data: remove the ovito remote_file_access/rendering tasks and update test_final.json	2026-03-25 23:27:47 +08:00
lizhanyuan	970d430dcf	feat: local changes to agent.py / run_proxmox / chrome tasks	2026-03-25 23:27:47 +08:00
kingyang0	fe1bdae9a6	update avogadro examples	2026-03-24 11:54:17 +08:00
kingyang0	adb66ef972	update avogadro examples	2026-03-24 11:36:51 +08:00
kingyang0	3f0ef4849a	Add origin data files	2026-03-19 18:05:13 +08:00
kingyang0	ae202be7b9	Update origin task	2026-03-19 17:58:11 +08:00
kingyang0	64e19ba17e	Add ovito data examples	2026-03-19 10:08:25 +08:00
kingyang0	ae92e80a0b	Update ovito evaluation examples	2026-03-19 10:00:27 +08:00
kingyang0	0e2702fb5b	Merge branch 'lzy/data-processing' of https://git.matai.center/lzy/sci-gui-agent-benchmark into lzy/data-processing	2026-03-18 17:33:11 +08:00
kingyang0	16ea3641bd	Add ovito example files	2026-03-18 16:49:11 +08:00
lizhanyuan	dc5fd173f1	data: update avogadro building-metal-complexes task1 & task3	2026-03-13 17:19:44 +08:00
lizhanyuan	19795a674b	chore: add demo_task3 recording artifacts to gitignore	2026-03-11 11:13:23 +08:00
lizhanyuan	349f2142fb	fix: vllm_eval uses the original resolution for evaluation by default	2026-03-11 11:06:01 +08:00
lizhanyuan	a943c1e961	feat: update Jade/VESTA task definitions + final evaluation list - Jade: update 15 task JSONs (refined instruction + fully expanded metadata.steps) - VESTA: restructure 10 task JSONs (consistently use NaCl.cif/anatase_TiO2.cif + rewritten steps) - VESTA: remove task1, add 2 CIF data files - Add test_final.json (11 jade + 10 vesta = 21 tasks) - run_proxmox.sh: MODEL→gpt-5.4, MAX_STEPS→35, TEST_META→test_final.json	2026-03-11 11:02:26 +08:00
lizhanyuan	d71f1f976d	feat: vllm_eval key-frame sampling + Gemini OpenAI proxy support - vllm_eval.py: add the _sample_key_frames key-frame sampling function - vllm_eval.py: sample uniformly when the number of screenshots exceeds max_eval_images - vllm_eval.py: Gemini models can be called through an OpenAI-compatible proxy - test_single.json: update the test task config	2026-03-04 16:39:24 +08:00
lizhanyuan	4bde685bbd	feat: add Proxmox provider support and the inject_steps parameter - Add desktop_env/providers/proxmox/ (manager + provider) - desktop_env.py: add proxmox to the provider name list - providers/__init__.py: register the proxmox provider in the factory function - run.py: add the --inject_steps/--no_inject_steps arguments - run_proxmox.sh: Proxmox run script	2026-03-04 16:39:08 +08:00
lizhanyuan	e70f1335f0	config: update test task config files	2026-03-04 10:44:00 +08:00
lizhanyuan	9431bd5bfc	data: refine the metadata steps of existing avogadro/imagej/origin/ovito/pymol/vesta tasks	2026-03-04 10:43:49 +08:00
lizhanyuan	b1052c79cf	data: add jade/avogadro/ovito/pymol evaluation task data	2026-03-04 10:43:29 +08:00
lizhanyuan	ac3f38ed58	feat: add the refine_metadata script, update extract_instructions_v2	2026-03-04 10:43:14 +08:00
lizhanyuan	e4b039fc02	refine jade metadata steps: add shortcuts & merge menu operations to avoid timeout	2026-02-27 18:19:04 +08:00
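lizhanyuan	b75f6bf341	feat: strengthen task-step injection and a11y state representation to improve the stability of tree interactions - Wire metadata.steps through the full pipeline and inject task steps into the agent's prediction context - Improve the linearized a11y tree output: use center coordinates and add a states column (expanded/collapsed/selected, etc.) - Relax the node-retention condition to keep input controls that have no text (edit/textfield/searchbox, etc.) - Tighten output constraints: each turn may contain only action code or WAIT/DONE/FAIL; returning an action together with DONE in the same turn is forbidden - Add avogadro example steps: expand aromatics and select benzene.cjson	2026-02-26 18:56:53 +08:00
lizhanyuan	07e66490dd	feat: improve a11y tree support for scientific software - Extend the heuristic_retrieve.py whitelist to cover the GUI frameworks used by scientific software: - Add prefix rules: sunawt (Java Swing), qt5q/qt6q (Qt), ovito, pymol, contentspanel, wx (wxWidgets), afx (MFC), thunderrt (VB6) - Add endswith rules: edit, widget, box, dialog, view, frame, menuitem, menubar, toolbar, tabitem, treeitem, window - Add exact matches for Qt controls and Win32 controls - Add debug logging of the raw a11y tree in agent.py - Fix the missing platform='windows' in agent initialization in run.py - Add NO_PROXY to bypass local/VM IPs (compatible with the Clash global proxy) - Increase the application launch wait time in lib_run_single.py to 15 seconds - Add test_each_domain_a11y_tree.json (one task per domain for a11y verification)	2026-02-26 15:04:28 +08:00
lizhanyuan	9899d4a0c7	feat: add scientific-software benchmark task data - Add task JSONs for scientific software such as avogadro/imagej/jade/origin/ovito/pymol/vesta - Modify vllm_eval.py to rename image files to "step x" - desktop_env.py: add the extra data parameters config and metadata	2026-02-25 15:19:36 +08:00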
cui0711	613f55f0da	feat(tools): add instructions extraction script for generating test cases	2026-02-09 17:47:02 +08:00
cui0711	ba03784196	fix(env): handle None result_getter for vllm_eval evaluator	2026-02-09 17:46:05 +08:00
cui0711	3890ee5fc3	fix(vllm_eval): add image compression to prevent 413 error with large max_steps	2026-02-09 14:24:59 +08:00
cui0711	9bc54c0a66	feat(vllm_eval): add structured JSON response format with step analysis	2026-02-09 13:58:14 +08:00
cui0711	1e9281a1ab	feat(cli): add eval_model argument	2026-02-05 16:56:39 +08:00
cui0711	63484c7b7b	fix(runner): pass result_dir to evaluate and re-enable environment reset	2026-02-05 16:55:49 +08:00
cui0711	ad46acc5f3	refactor(example): replace check_include_exclude with vllm_eval evaluator	2026-02-05 16:55:03 +08:00
cui0711	58d411bf86	feat(evaluator): export vllm_eval module	2026-02-05 16:54:16 +08:00
cui0711	be24e77d93	feat(env): add eval_model parameter and result_dir support for vllm evaluation	2026-02-05 16:53:12 +08:00
cui0711	dd58a1de03	feat(evaluator): add vision-language model evaluator	2026-02-05 16:52:35 +08:00
cui0711	231f7a8fbc	feat(eval): add jade test case and update test categories	2026-01-30 16:29:05 +08:00
cui0711	716d82f4d1	feat: add flexible recording control and improve execution logging	2026-01-30 16:28:13 +08:00
cui0711	47bcfc0f0b	feat(agent): add screenshot compression and dynamic resolution support	2026-01-30 16:28:02 +08:00
cui0711	7e9090e115	fix(prompts): fix template variable syntax and add dynamic resolution	2026-01-30 16:28:02 +08:00
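
Note on commit 252d2f79ce: the screenshot-ordering bug it describes (string sorting placing step_9 after step_10) can be illustrated with a minimal sketch. The file-name pattern `step_N.png` and the helper `sort_screenshots` are assumptions made for illustration; this is not the repository's `_load_screenshots_from_dir` code.

```python
import re


def sort_screenshots(names):
    """Sort screenshot file names by the first integer they contain (hypothetical helper)."""
    def step_number(name):
        match = re.search(r"(\d+)", name)
        # Names without a number sort first rather than raising an error.
        return int(match.group(1)) if match else -1

    return sorted(names, key=step_number)


names = ["step_10.png", "step_9.png", "step_2.png"]

# Plain string sorting compares character by character, so "9" > "1"
# and step_9.png ends up last, i.e. it is mistaken for the final frame.
print(sorted(names)[-1])

# Sorting on the parsed integer puts step_10.png last, as intended.
print(sort_screenshots(names)[-1])
```

With numeric ordering, whichever frame has the highest step number is treated as the final one, which matters for an evaluator that anchors its judgment on the last screenshot.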

1420 Commits