sci-gui-agent-benchmark

Author	SHA1	Message	Date
Adam Yanxiao Zhao	aa05f6cc26	Add AutoGLM-OS agent (#309) * autoglm-os initialize * clean code * chore: use proxy for download setup * feat(autoglm-os): add parameter to toggle images * fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel * update * add client_password * update multienv * fix * fix prompt * fix prompt * fix prompt * fix sys prompt * feat: use proxy in file evaluator * fix client_password * fix note_prompt * fix autoglm agent cmd type * fix * revert: fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel reverts commit bab5473eea1de0e61b0e1d68b23ce324a5b0ee57 * feat(autoglm): setup tools * fix(autoglm): remove second time of get a11y tree * add osworld server restart * Revert "add osworld server restart" This reverts commit 7bd9d84122e246ce2a26de0e49c25494244c2b3d. * fix _launch_setup * fix autoglm agent tools & xml tree * fix desktop_env * fix bug for tool name capitalization * fix: always use proxy for setup download * add fail after exceeding max turns * fix(autoglm): avoid adding image to message when screenshot is empty * fix maximize_window * fix maximize_window * fix maximize_window * fix import browsertools module bug * fix task proxy config bug * restore setup * refactor desktop env * restore image in provider * restore file.py * refactor desktop_env * quick fix * refactor desktop_env.step * fix our env reset * add max turns constraint * clean run script * clean lib_run_single.py --------- Co-authored-by: hanyullai <hanyullai@outlook.com> Co-authored-by: JingBh <jingbohao@yeah.net>	2025-08-17 12:08:40 +08:00
SaiLong Li	c833d03a4b	feat: Update eip charge type to 'PayByTraffic' for volcengine. (#308) Co-authored-by: lisailong <lisailong.ze@bytedance.com>	2025-08-15 20:17:52 +08:00
SaiLong Li	cc6eddb466	feat: Add Volcengine provider support for desktop environment. (#307) Co-authored-by: lisailong <lisailong.ze@bytedance.com>	2025-08-15 18:53:13 +08:00
Timothyxxx	6ecbcf006b	chore: add ag2 dependency to requirements and setup files for CoACT-1 support - Included ag2 version 0.9.7 in requirements.txt and setup.py to ensure proper package installation. - Maintained existing code logic while enhancing dependency management.	2025-08-13 09:25:49 +00:00
Timothyxxx	50388cfe61	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-08-13 09:04:17 +00:00
Timothyxxx	7fb5860da0	feat: enhance run_coact.py and related agents with improved task handling and configuration - Updated TASK_DESCRIPTION in run_coact.py to clarify task-solving steps and requirements. - Modified configuration parameters for provider name and client password for better security and flexibility. - Enhanced OrchestratorUserProxyAgent to include user instruction in the auto-reply and improved screenshot handling. - Adjusted coding_agent.py to ensure proper verification of results before saving changes. - Improved CUA agent prompts to maintain application state and handle user instructions more effectively. - Ensured existing code logic remains unchanged while enhancing functionality and usability.	2025-08-13 09:04:09 +00:00
Quyu Kong	893b059e55	feat: Add Aliyun provider support for desktop environment (#304) * Adding support for aliyun as a provider * feat: enhance Aliyun provider support - Added Aliyun as a new provider in the desktop environment. - Updated the environment configuration guidelines for Aliyun, including prerequisites and environment variables. - Implemented instance allocation and management functions for Aliyun ECS, including signal handling for graceful termination. - Improved logging and error handling during instance creation and status checks. - Adjusted the provider's methods to utilize the new instance management functions.	2025-08-12 14:31:08 +08:00
Timothyxxx	d2ae0f697d	feat: enhance AnthropicAgent with start_coordinate handling and modifier key support - Added support for an optional start_coordinate parameter to facilitate drag actions from a specified starting point. - Implemented validation for start_coordinate to ensure it is a tuple of two integers. - Enhanced click actions to handle modifier keys, allowing for more complex interactions. - Ensured existing code logic remains unchanged while improving functionality and usability.	2025-08-12 05:34:18 +00:00
Timothyxxx	7418f5cf2f	chore: add traceback import for enhanced error handling - Introduced the traceback module to improve error reporting and debugging capabilities. - Ensured that existing code logic remains unchanged while preparing for future enhancements.	2025-08-12 05:15:54 +00:00
Timothyxxx	9e4d717cde	fix: update AMI mappings in AWS manager - Changed the AMI ID for the ap-east-1 region to a new value for better compatibility. - Added comments to clarify the usage of AMIs for CoACT-1 and the need for manual transfer from us-east-1. - Ensured existing logic remains unchanged while improving documentation for future reference.	2025-08-11 12:19:18 +00:00
Timothyxxx	e2d1887662	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-08-10 14:40:19 +00:00
Timothyxxx	bd6efcfc4d	fix: enhance screenshot retrieval in PythonController - Added a static method to validate image responses for PNG and JPEG formats using magic bytes. - Improved error handling in the get_screenshot method to log invalid payloads and retry attempts. - Updated the requests call to include a timeout for better reliability.	2025-08-10 14:40:18 +00:00
Timothyxxx	bc1db8d623	chore: update setup.py for version 1.0.0 release - Bumped version to 1.0.0. - Updated Python requirement to >=3.10. - Upgraded dependencies: numpy, Pillow, pandas, torch, and added new dependencies including pygame, backoff, openai, dashscope, google-generativeai, wandb, gdown, tiktoken, groq, docker, loguru, dotenv, tldextract, and anthropic. - Ensured existing logic remains intact while enhancing package capabilities.	2025-08-05 22:19:42 +08:00
Danyang Zhang	7364a720a6	Calc eval fix (#297) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scaffolds * ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures * ver Jul19th added supports for cellIs and some other new types of conditional formatting for calc evaluation * ver Aug4th updated some instructions * ver Aug4thv2 fixed a typo --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-08-04 12:39:35 +08:00
yuanmengqi	84f407afdd	feat: enhance run_coact.py with logging and configuration options - Added logging configuration to capture runtime logs in both file and console with adjustable log levels. - Introduced new command-line arguments for provider name, region, and client password to improve flexibility and security. - Updated process_task function to accommodate new parameters, ensuring compatibility with existing logic. - Modified prompt templates in coding_agent.py and cua_agent.py to use the client password placeholder for enhanced security.	2025-07-31 05:47:58 +00:00
yuanmengqi	a5b51e8010	refactor: update command in JSON example to use placeholder for client password - Replaced the hardcoded password in the command with a placeholder `{CLIENT_PASSWORD}` for improved security and flexibility. - Ensured that the overall structure of the JSON remains unchanged while enhancing the example's usability.	2025-07-31 05:20:04 +00:00
yuanmengqi	5e24d72da6	fix: correct IP address return logic in AWSProvider - Reverted the return value in the AWSProvider class to use private IP address instead of public IP address. - Ensured that the logic remains intact while addressing the specific requirement for VNC access.	2025-07-31 05:14:00 +00:00
yuanmengqi	b081c328bf	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-31 04:16:42 +00:00
yuanmengqi	acd75476d8	docs: add acknowledgements section in README.md - Included a new section to acknowledge institutions and students who contributed feedback and participated in fixes. - Enhanced recognition of collaborative efforts in the project while maintaining the existing structure of the README.	2025-07-31 04:16:35 +00:00
Yuan Mengqi	239dd37d2e	clean claude run code (#293) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude * add nogdrive json * merge claude code * clean code claude run * clean code claude run * clean code claude run	2025-07-31 12:09:08 +08:00
Linxin Song	b968155757	CoACT initialize (#292)	2025-07-31 10:35:20 +08:00
Xinyuan Wang	862d704b8c	Wxy/opencua (#290) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines * update parallel; clean code; use sleep 3s * ui-tars-0717 * update detail * add system password to system prompt * add running command	2025-07-31 08:53:49 +08:00
Xinyuan Wang	3d32556085	Uitars/dev (#291) * use aws pub ip * os task fix: set the default dim screen time to be 300s * add all the uitars agents: 1. run_multienv_uitars.py: Qwen2VL-based UITARS models 2. run_multienv_uitars15_v1.py: UITARS1.5-7B 3. run_multienv_uitars15_v2.py: SeedVL1.5 thinking/non-thinking --------- Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>	2025-07-31 08:52:27 +08:00
yuanmengqi	dd488c7294	feat: enhance image comparison functionality in gimp.py - Added resizing logic to handle images of different sizes before comparison, ensuring consistent evaluation. - Implemented mode conversion to ensure both images are in the same format for accurate comparison. - Enhanced structure check by MSE to support conversion of numpy arrays to PIL Images, improving compatibility. - Maintained existing logic while improving robustness and accuracy of image comparison methods.	2025-07-30 06:07:49 +00:00
MillanK0817	4ae9d41da4	feat: update jedi agent with support for o3 as planner	2025-07-30 14:06:37 +08:00
yuanmengqi	99fa3b7cb9	docs: refine proxy configuration note in README.md for clarity - Updated the proxy configuration section to specify that some tasks may require proxy settings to function properly, depending on website defenses. - Enhanced user guidance by clarifying the importance of proper proxy configuration for task execution. - Maintained existing content while improving clarity and user understanding of configuration requirements.	2025-07-29 09:59:31 +00:00
yuanmengqi	c3469835f2	docs: update README.md with important configuration requirements for tasks - Added a section detailing essential configuration requirements for Google Account Tasks and proxy settings. - Highlighted the impact of missing configurations on task execution and evaluation scores. - Maintained existing content while enhancing user guidance and clarity in setup instructions.	2025-07-29 09:57:04 +00:00
yuanmengqi	00804f8118	feat: update provider and action space in DesktopEnv class - Changed the default provider name from "aws" to "vmware" to reflect new environment requirements. - Updated the action space from "computer_13" to "pyautogui" for improved interaction capabilities. - Maintained existing class structure and logic while implementing these updates for better functionality.	2025-07-29 06:48:41 +00:00
yuanmengqi	af64f4ef49	docs: update README.md with font download link and VSCode trust settings - Replaced the font download link for LibreOffice with a new source. - Added instructions for configuring VSCode to disable workspace trust prompts, enhancing user experience. - Maintained existing content while improving clarity and providing additional setup guidance.	2025-07-28 15:13:37 +00:00
yuanmengqi	70cf3e6982	docs: enhance AWS section in README.md for clarity and efficiency - Updated the AWS support section to emphasize the benefits of using cloud services for parallel evaluation, including potential time reductions. - Improved clarity in the username and password information for virtual machines, ensuring security measures are highlighted. - Maintained existing content while enhancing the overall readability and user guidance in the documentation.	2025-07-28 15:12:18 +00:00
yuanmengqi	0eb3a3d6d7	docs: update README.md with new evaluation sections and guidelines - Added a new section for Local Evaluation, clarifying the import process for `run_multienv.py`. - Introduced a Public Evaluation section detailing the process for verifying results on the leaderboard and requirements for sharing agent implementations. - Included links to the Public Evaluation Guideline for user reference. - Maintained existing content while enhancing clarity and providing additional resources for users.	2025-07-28 08:39:09 +00:00
yuanmengqi	0dc78937d0	docs: update README.md with enhanced OSWorld-Verified details and AWS support - Expanded the OSWorld-Verified update entry to include new model results and a comparison with previous benchmarks. - Added a new section on AWS support, detailing the benefits of using cloud services for parallel evaluation and providing links to setup guides. - Corrected the baseline agent command example to reflect the updated model name and added a new example for parallel execution. - Clarified the username and password information for virtual machines, emphasizing security measures for cloud services. - Maintained existing content while enhancing clarity and providing additional resources for users.	2025-07-28 08:22:25 +00:00
yuanmengqi	a37fe86925	feat: enhance logging and signal handling in run_multienv_claude.py - Refactored logging configuration to support dynamic log levels via command-line arguments, allowing for better control over log verbosity. - Introduced a new signal handler for graceful shutdown of environments and processes, improving robustness during termination. - Added functionality to save command-line arguments to a JSON file for better traceability of execution parameters. - Maintained existing logic while enhancing the overall structure and error handling capabilities of the script.	2025-07-28 07:43:13 +00:00
yuanmengqi	78651040e7	docs: update README.md with new OSWorld-Verified announcement and minor text corrections - Added a new update entry for the introduction of OSWorld-Verified highlighting major updates and community fixes. - Corrected the spelling of "VirtualBox" in the environment refactor entry. - Enhanced clarity in the Docker section title for better readability.	2025-07-28 07:19:39 +00:00
yuanmengqi	0f00788c4d	feat: add run_multienv_o3.py script for multi-environment evaluation - Introduced a new script `run_multienv_o3.py` to facilitate end-to-end evaluation across multiple environments. - Implemented command-line argument parsing for various configurations including environment settings, logging levels, and AWS parameters. - Integrated signal handling for graceful shutdown of environments and processes. - Enhanced logging capabilities for better traceability during execution. - Maintained existing logic from previous scripts while introducing new functionalities for improved evaluation processes.	2025-07-27 16:47:24 +00:00
yuanmengqi	1342bfe5ce	delete: remove show_result_opencua.py file and its associated functions	2025-07-27 16:37:40 +00:00
yuanmengqi	5fa490adf4	fix: update Flask port configuration to support environment variable - Modified the Flask application to allow the port to be set via the `FLASK_PORT` environment variable, defaulting to 8080 if not specified. - Ensured existing application logic remains unchanged while enhancing configurability for deployment environments.	2025-07-27 16:14:07 +00:00
yuanmengqi	523d553e88	feat: add client password argument to multiple agents and scripts - Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility. - Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration. - Modified evaluation guidelines to reflect the new client password requirement. - Ensured existing logic remains intact while enhancing functionality for better user experience.	2025-07-27 16:11:23 +00:00
yuanmengqi	122b16742b	fix: improve EPUB processing by checking for file existence before reading - Added checks for the presence of "toc.ncx" and "content.opf" in the EPUB file before attempting to process them. - Introduced debug logging to notify when these files are not found, enhancing error handling and traceability. - Maintained existing logic while improving robustness of the EPUB processing function.	2025-07-26 20:42:18 +00:00
yuanmengqi	b25854edba	feat: introduce DummyAgent class for enhanced coordinate handling - Added DummyAgent class to facilitate coordinate generation and action assignment. - Updated GTA1Agent to utilize DummyAgent for improved planning and execution. - Increased max_steps and N_SEQ parameters for better performance. - Enhanced logging for planning and execution processes. - Maintained existing logic while integrating new functionality.	2025-07-26 08:26:23 +00:00
yuanmengqi	d49ca9cc2d	fix: enhance handling of '<' characters in pyautogui commands - Refactor _fix_pyautogui_less_than_bug to improve handling of press('<') and typewrite calls. - Introduce Unicode escape decoding for typewrite content to ensure proper '<' character processing. - Maintain existing logic while enhancing functionality for better compatibility.	2025-07-26 07:59:37 +00:00
yuanmengqi	123f51ea4a	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-26 07:28:39 +00:00
yuanmengqi	73caf53880	delete: remove img_utils.py and update imports in jedi_3b_agent.py and jedi_7b_agent.py to use qwen_vl_utils	2025-07-26 07:28:31 +00:00
张逸群	2ed0436c21	fix: update DockerVMManager method signatures for interface compatibility (#287) - Fix delete_vm() method to accept region and kwargs parameters - Fix occupy_vm() method to accept pid, region and kwargs parameters - Ensures consistency with base VMManager interface and other providers - Resolves runtime argument mismatch errors when calling these methods This maintains backward compatibility while fixing the interface contract.	2025-07-26 01:18:00 +08:00
yuanmengqi	40fdc6266f	chore: update default AWS instance type from t3.xlarge to t3.medium	2025-07-25 15:56:42 +00:00
yuanmengqi	39e5baf5ae	fix: remove unnecessary sleep and observation retrieval in run_single_example function	2025-07-25 15:51:20 +00:00
yuanmengqi	f5595df71c	delete: remove gat1_agent.py file	2025-07-25 07:11:55 +00:00
Zilong Zhou	b8b9e9b166	feat: add proxy handling logic and clean up imports (#285)	2025-07-24 16:27:56 +08:00
Zilong Zhou	cbe650d0bb	refactor&delete: simplify AWS VM allocation and remove proxy support (#284)	2025-07-24 16:27:18 +08:00
Jiaqi	23b81993fa	os task fix: set the default dim screen time to be 300s	2025-07-24 08:13:02 +00:00
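
Commit bd6efcfc4d above mentions validating screenshot responses as PNG or JPEG by their magic bytes. A minimal sketch of what such a check could look like follows; the helper name `is_valid_image_response` is hypothetical and this is not the repository's actual code, only an illustration of the technique under that assumption.

```python
# Hypothetical helper, not taken from the repository.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG start-of-image marker followed by the next marker prefix


def is_valid_image_response(payload) -> bool:
    """Return True if the payload begins with a PNG or JPEG signature."""
    if not isinstance(payload, (bytes, bytearray)):
        return False
    data = bytes(payload)
    return data.startswith(PNG_MAGIC) or data.startswith(JPEG_MAGIC)
```

A caller fetching a screenshot could run this check on the response body and retry when it fails, for instance when an HTML error page comes back in place of an image.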


1300 Commits
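
Commit 5fa490adf4 in the log above describes reading the Flask port from the `FLASK_PORT` environment variable with a default of 8080. The sketch below illustrates that pattern using only the standard library; the function name `resolve_port` and the range check are assumptions added for illustration, not the project's implementation.

```python
import os


def resolve_port(environ=None, default=8080):
    """Return the port from FLASK_PORT, or the default when it is unset or empty."""
    environ = os.environ if environ is None else environ
    raw = environ.get("FLASK_PORT")
    if raw is None or raw.strip() == "":
        return default
    port = int(raw)  # raises ValueError for non-numeric input
    if not 0 < port < 65536:
        # Range check is an assumption of this sketch, not stated in the commit.
        raise ValueError(f"FLASK_PORT out of range: {port}")
    return port
```

In a Flask application the result would typically be passed along as `app.run(port=resolve_port())`.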