sci-gui-agent-benchmark

Author	SHA1	Message	Date
yuanmengqi	e433f35c1f	feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.	2025-07-16 13:45:34 +00:00
yuanmengqi	b9df320f31	Enhance check_python_file_by_test_suite function with robust error handling and logging. Added validation for file existence, module loading, and function execution. Improved resource cleanup and working directory management to ensure stability and reliability during test execution.	2025-07-16 11:44:46 +00:00
Xinyuan Wang	0f2655249c	Wxy/opencua (#260 ) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines	2025-07-16 17:53:12 +08:00
yuanmengqi	5e5058c1f2	fix: standardize provider interface parameters across all implementations - Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080) - Add os_type parameter to start_emulator() for Azure and VirtualBox providers - Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers - Use *args, **kwargs for better extensibility and parameter consistency - Add documentation comments explaining ignored parameters for interface consistency - Prevents TypeError exceptions when AWS-specific parameters are passed to other providers This ensures all providers can handle the same parameter sets while maintaining backward compatibility and avoiding interface fragmentation.	2025-07-15 21:38:34 +00:00
yuanmengqi	cb070307ee	merge code	2025-07-15 14:57:14 +00:00
yuanmengqi	175b4b46c2	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-15 14:50:48 +00:00
yuanmengqi	7912880d16	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-15 07:24:38 +00:00
yuanmengqi	451bbf5fc2	Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example.	2025-07-15 07:24:33 +00:00
Yuan Mengqi	af47ed8fb1	fix infeasible&chrome tasks (#258 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks * Merge branch 'fix_chrome' * fix insensible&chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-15 13:02:42 +08:00
yuanmengqi	68a9f647f4	fix: address https://github.com/xlang-ai/OSWorld/issues/257 by implement fix for PyAutoGUI '<' character bug in command execution. Introduced a new function to handle typewrite and press calls, ensuring correct behavior when using '<' in commands. Updated command execution logic to apply this fix before executing user commands.	2025-07-15 04:17:34 +00:00
yuanmengqi	8e18b7839a	fix insensible&chrome tasks	2025-07-15 02:19:27 +00:00
yuanmengqi	1bf0730dce	Merge remote-tracking branch 'upstream/main'	2025-07-15 02:14:41 +00:00
yuanmengqi	756ef96850	Merge branch 'fix_chrome'	2025-07-15 02:13:58 +00:00
yuanmengqi	08b4cf2c2f	fix infeasible&chome tasks	2025-07-15 02:09:40 +00:00
ChenYXxxx	698483390a	"Could you turn my image into CYMK mode?" add "within GIMP"	2025-07-14 23:53:20 +08:00
ChenYXxxx	9242becd87	"Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP"	2025-07-14 23:52:41 +08:00
ChenYXxxx	56b2fe9cc4	Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json	2025-07-14 23:23:11 +08:00
ChenYXxxx	b481c794c5	Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json	2025-07-14 23:22:50 +08:00
ChenYXxxx	7f973a391c	Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json	2025-07-14 23:22:14 +08:00
shenzhennan	7f96cc0633	Merge branch 'main' of https://github.com/xlang-ai/OSWorld	2025-07-14 12:35:00 +00:00
shenzhennan	53983db9cb	fix impress eval : extending sleep time to ensure save	2025-07-14 12:34:43 +00:00
Xinyuan Wang	db83b9cb2c	Wxy/opencua (#256 ) * OpenCUA Agent code base * update url * debug, modify url input	2025-07-14 20:26:39 +08:00
Danyang Zhang	2339db20ca	ver Jul7th (#255 ) pip-installing directly from PyPI fails misteriously in postconfig execution, possible owing to proxy configuration in the VM, adjusted strategy by downloading the wheel on host and pip-installing it locally on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77	2025-07-14 20:26:29 +08:00
shenzhennan	60e26d2d0d	fix impress compare use gold file	2025-07-14 11:35:06 +00:00
yuanmengqi	90c4e894a4	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-14 07:14:19 +00:00
yuanmengqi	5d90faa548	run operagor	2025-07-14 07:13:17 +00:00
Zilong Zhou	74b7c189af	Feat/monitor (#254 ) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py * fix: update logger usage to use global logger and improve error handling * feat&fix: add configuration management API endpoints and update UI for configuration selection * feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness * feat&fix: add configuration toggle button in UI and improve task loading performance * feat&fix: add accuracy percentage display to score and style updates for UI	2025-07-14 13:43:41 +08:00
yuanmengqi	0651495d88	fix: Enhance error handling and logging across multiple evaluators - Added logging for file retrieval and error handling in file.py, improving robustness during file operations. - Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing. - Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading. - Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison. - Updated utils.py to include file existence checks and detailed error logging during cell value reading.	2025-07-14 05:43:17 +00:00
yuanmengqi	b8b026f817	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-13 13:16:48 +00:00
yuanmengqi	7c807d4f3e	Merge remote-tracking branch 'upstream/main'	2025-07-13 13:12:26 +00:00
Zilong Zhou	349f2fd9fe	Feat/claude cua support (#253 ) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py	2025-07-13 21:10:49 +08:00
Yuan Mengqi	38a30734a6	Improve code logic for password & resolution (#252 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 21:04:07 +08:00
yuanmengqi	9f806d425d	fix chrome final	2025-07-13 12:46:00 +00:00
yuanmengqi	7279469d23	fix chrome tasks	2025-07-13 12:41:27 +00:00
yuanmengqi	572a94b6df	Merge branch 'main' into fix_chrome	2025-07-13 10:16:08 +00:00
yuanmengqi	a16b54c175	edit	2025-07-13 10:14:41 +00:00
yuanmengqi	94ea30cb45	Merge remote-tracking branch 'upstream/main'	2025-07-13 07:05:54 +00:00
yuanmengqi	d3bf4823cb	Improve code logic for password & resolution	2025-07-13 07:01:28 +00:00
yuanmengqi	a070ddda7e	Improve code logic for password & resolution	2025-07-13 06:59:45 +00:00
yuanmengqi	97ed6f99b0	Final review multi_apps fix the rest part	2025-07-12 20:28:55 +00:00
yuanmengqi	dbecf46057	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-12 16:35:02 +00:00
yuanmengqi	877e75a013	Final review multi_apps fix Xinzhuang part	2025-07-12 16:34:55 +00:00
Yuan Mengqi	27319ce1e3	fix password&resolution (#251 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 00:25:37 +08:00
yuanmengqi	3b698aa3c0	Merge branch 'fix_chrome'	2025-07-12 15:13:44 +00:00
yuanmengqi	08bbf77511	fix password&resolution	2025-07-12 15:11:42 +00:00
yuanmengqi	fb0c301e14	Merge branch 'fix_chrome'	2025-07-11 12:17:42 +00:00
yuanmengqi	37c56533f0	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-11 12:16:23 +00:00
yuanmengqi	fe3bb2fd92	fix password&resolution	2025-07-11 12:15:03 +00:00
yuanmengqi	6f0382c0c2	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-10 22:35:42 +00:00
yuanmengqi	6897e5320d	Enhance image text comparison functionality with detailed logging - Added logging for OCR results and text matching outcomes in compare_image_text function. - Updated JSON examples to support multiple expected results and improved structure for evaluator functions. - Enhanced handling of expected text rules to include multiple variations for better matching accuracy.	2025-07-10 22:32:53 +00:00

1297 Commits