sci-gui-agent-benchmark

Author	SHA1	Message	Date
yuanmengqi	150234307e	Merge branch 'fix_chrome'	2025-07-17 04:14:47 +00:00
yuanmengqi	9eeabfc52d	Improve the parallel logic	2025-07-17 04:14:20 +00:00
yuanmengqi	cb070307ee	merge code	2025-07-15 14:57:14 +00:00
yuanmengqi	175b4b46c2	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-15 14:50:48 +00:00
yuanmengqi	7912880d16	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-15 07:24:38 +00:00
yuanmengqi	451bbf5fc2	Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example.	2025-07-15 07:24:33 +00:00
Yuan Mengqi	af47ed8fb1	fix infeasible&chrome tasks (#258) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks * Merge branch 'fix_chrome' * fix infeasible&chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-15 13:02:42 +08:00
yuanmengqi	68a9f647f4	fix: address https://github.com/xlang-ai/OSWorld/issues/257 by implementing a fix for the PyAutoGUI '<' character bug in command execution. Introduced a new function to handle typewrite and press calls, ensuring correct behavior when using '<' in commands. Updated command execution logic to apply this fix before executing user commands.	2025-07-15 04:17:34 +00:00
yuanmengqi	8e18b7839a	fix infeasible&chrome tasks	2025-07-15 02:19:27 +00:00
yuanmengqi	1bf0730dce	Merge remote-tracking branch 'upstream/main'	2025-07-15 02:14:41 +00:00
yuanmengqi	756ef96850	Merge branch 'fix_chrome'	2025-07-15 02:13:58 +00:00
yuanmengqi	08b4cf2c2f	fix infeasible&chrome tasks	2025-07-15 02:09:40 +00:00
ChenYXxxx	698483390a	"Could you turn my image into CYMK mode?" add "within GIMP"	2025-07-14 23:53:20 +08:00
ChenYXxxx	9242becd87	"Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP"	2025-07-14 23:52:41 +08:00
ChenYXxxx	56b2fe9cc4	Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json	2025-07-14 23:23:11 +08:00
ChenYXxxx	b481c794c5	Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json	2025-07-14 23:22:50 +08:00
ChenYXxxx	7f973a391c	Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json	2025-07-14 23:22:14 +08:00
shenzhennan	7f96cc0633	Merge branch 'main' of https://github.com/xlang-ai/OSWorld	2025-07-14 12:35:00 +00:00
shenzhennan	53983db9cb	fix impress eval: extend sleep time to ensure save	2025-07-14 12:34:43 +00:00
Xinyuan Wang	db83b9cb2c	Wxy/opencua (#256) * OpenCUA Agent code base * update url * debug, modify url input	2025-07-14 20:26:39 +08:00
Danyang Zhang	2339db20ca	ver Jul7th (#255) pip-installing directly from PyPI fails mysteriously in postconfig execution, possibly owing to proxy configuration in the VM; adjusted strategy by downloading the wheel on host and pip-installing it locally on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77	2025-07-14 20:26:29 +08:00
shenzhennan	60e26d2d0d	fix impress compare use gold file	2025-07-14 11:35:06 +00:00
yuanmengqi	90c4e894a4	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-14 07:14:19 +00:00
yuanmengqi	5d90faa548	run operator	2025-07-14 07:13:17 +00:00
Zilong Zhou	74b7c189af	Feat/monitor (#254) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py * fix: update logger usage to use global logger and improve error handling * feat&fix: add configuration management API endpoints and update UI for configuration selection * feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness * feat&fix: add configuration toggle button in UI and improve task loading performance * feat&fix: add accuracy percentage display to score and style updates for UI	2025-07-14 13:43:41 +08:00
yuanmengqi	0651495d88	fix: Enhance error handling and logging across multiple evaluators - Added logging for file retrieval and error handling in file.py, improving robustness during file operations. - Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing. - Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading. - Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison. - Updated utils.py to include file existence checks and detailed error logging during cell value reading.	2025-07-14 05:43:17 +00:00
yuanmengqi	b8b026f817	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-13 13:16:48 +00:00
yuanmengqi	7c807d4f3e	Merge remote-tracking branch 'upstream/main'	2025-07-13 13:12:26 +00:00
Zilong Zhou	349f2fd9fe	Feat/claude cua support (#253) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py	2025-07-13 21:10:49 +08:00
Yuan Mengqi	38a30734a6	Improve code logic for password & resolution (#252) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 21:04:07 +08:00
yuanmengqi	9f806d425d	fix chrome final	2025-07-13 12:46:00 +00:00
yuanmengqi	7279469d23	fix chrome tasks	2025-07-13 12:41:27 +00:00
yuanmengqi	572a94b6df	Merge branch 'main' into fix_chrome	2025-07-13 10:16:08 +00:00
yuanmengqi	a16b54c175	edit	2025-07-13 10:14:41 +00:00
yuanmengqi	94ea30cb45	Merge remote-tracking branch 'upstream/main'	2025-07-13 07:05:54 +00:00
yuanmengqi	d3bf4823cb	Improve code logic for password & resolution	2025-07-13 07:01:28 +00:00
yuanmengqi	a070ddda7e	Improve code logic for password & resolution	2025-07-13 06:59:45 +00:00
yuanmengqi	97ed6f99b0	Final review multi_apps fix the rest part	2025-07-12 20:28:55 +00:00
yuanmengqi	dbecf46057	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-12 16:35:02 +00:00
yuanmengqi	877e75a013	Final review multi_apps fix Xinzhuang part	2025-07-12 16:34:55 +00:00
Yuan Mengqi	27319ce1e3	fix password&resolution (#251) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 00:25:37 +08:00
yuanmengqi	3b698aa3c0	Merge branch 'fix_chrome'	2025-07-12 15:13:44 +00:00
yuanmengqi	08bbf77511	fix password&resolution	2025-07-12 15:11:42 +00:00
yuanmengqi	fb0c301e14	Merge branch 'fix_chrome'	2025-07-11 12:17:42 +00:00
yuanmengqi	37c56533f0	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-11 12:16:23 +00:00
yuanmengqi	fe3bb2fd92	fix password&resolution	2025-07-11 12:15:03 +00:00
yuanmengqi	6f0382c0c2	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-10 22:35:42 +00:00
yuanmengqi	6897e5320d	Enhance image text comparison functionality with detailed logging - Added logging for OCR results and text matching outcomes in compare_image_text function. - Updated JSON examples to support multiple expected results and improved structure for evaluator functions. - Enhanced handling of expected text rules to include multiple variations for better matching accuracy.	2025-07-10 22:32:53 +00:00
st2rb8g	61f265a082	fix some multi_apps tasks (#245) * fix chrome * fix some multi_apps tasks. * fix some multiapps tasks * fix some multiapps tasks --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-11 06:32:13 +08:00
Yan98	4e3446d6fe	Fix Name (#249) * init * init	2025-07-11 00:15:46 +08:00

1195 Commits