sci-gui-agent-benchmark

Author	SHA1	Message	Date
yuanmengqi	a37fe86925	feat: enhance logging and signal handling in run_multienv_claude.py - Refactored logging configuration to support dynamic log levels via command-line arguments, allowing for better control over log verbosity. - Introduced a new signal handler for graceful shutdown of environments and processes, improving robustness during termination. - Added functionality to save command-line arguments to a JSON file for better traceability of execution parameters. - Maintained existing logic while enhancing the overall structure and error handling capabilities of the script.	2025-07-28 07:43:13 +00:00
yuanmengqi	78651040e7	docs: update README.md with new OSWorld-Verified announcement and minor text corrections - Added a new update entry for the introduction of OSWorld-Verified highlighting major updates and community fixes. - Corrected the spelling of "VirtualBox" in the environment refactor entry. - Enhanced clarity in the Docker section title for better readability.	2025-07-28 07:19:39 +00:00
yuanmengqi	0f00788c4d	feat: add run_multienv_o3.py script for multi-environment evaluation - Introduced a new script `run_multienv_o3.py` to facilitate end-to-end evaluation across multiple environments. - Implemented command-line argument parsing for various configurations including environment settings, logging levels, and AWS parameters. - Integrated signal handling for graceful shutdown of environments and processes. - Enhanced logging capabilities for better traceability during execution. - Maintained existing logic from previous scripts while introducing new functionalities for improved evaluation processes.	2025-07-27 16:47:24 +00:00
yuanmengqi	1342bfe5ce	delete: remove show_result_opencua.py file and its associated functions	2025-07-27 16:37:40 +00:00
yuanmengqi	5fa490adf4	fix: update Flask port configuration to support environment variable - Modified the Flask application to allow the port to be set via the `FLASK_PORT` environment variable, defaulting to 8080 if not specified. - Ensured existing application logic remains unchanged while enhancing configurability for deployment environments.	2025-07-27 16:14:07 +00:00
yuanmengqi	523d553e88	feat: add client password argument to multiple agents and scripts - Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility. - Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration. - Modified evaluation guidelines to reflect the new client password requirement. - Ensured existing logic remains intact while enhancing functionality for better user experience.	2025-07-27 16:11:23 +00:00
yuanmengqi	122b16742b	fix: improve EPUB processing by checking for file existence before reading - Added checks for the presence of "toc.ncx" and "content.opf" in the EPUB file before attempting to process them. - Introduced debug logging to notify when these files are not found, enhancing error handling and traceability. - Maintained existing logic while improving robustness of the EPUB processing function.	2025-07-26 20:42:18 +00:00
yuanmengqi	b25854edba	feat: introduce DummyAgent class for enhanced coordinate handling - Added DummyAgent class to facilitate coordinate generation and action assignment. - Updated GTA1Agent to utilize DummyAgent for improved planning and execution. - Increased max_steps and N_SEQ parameters for better performance. - Enhanced logging for planning and execution processes. - Maintained existing logic while integrating new functionality.	2025-07-26 08:26:23 +00:00
yuanmengqi	d49ca9cc2d	fix: enhance handling of '<' characters in pyautogui commands - Refactor _fix_pyautogui_less_than_bug to improve handling of press('<') and typewrite calls. - Introduce Unicode escape decoding for typewrite content to ensure proper '<' character processing. - Maintain existing logic while enhancing functionality for better compatibility.	2025-07-26 07:59:37 +00:00
yuanmengqi	123f51ea4a	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-26 07:28:39 +00:00
yuanmengqi	73caf53880	delete: remove img_utils.py and update imports in jedi_3b_agent.py and jedi_7b_agent.py to use qwen_vl_utils	2025-07-26 07:28:31 +00:00
张逸群	2ed0436c21	fix: update DockerVMManager method signatures for interface compatibility (#287) - Fix delete_vm() method to accept region and kwargs parameters - Fix occupy_vm() method to accept pid, region and kwargs parameters - Ensures consistency with base VMManager interface and other providers - Resolves runtime argument mismatch errors when calling these methods This maintains backward compatibility while fixing the interface contract.	2025-07-26 01:18:00 +08:00
yuanmengqi	40fdc6266f	chore: update default AWS instance type from t3.xlarge to t3.medium	2025-07-25 15:56:42 +00:00
yuanmengqi	39e5baf5ae	fix: remove unnecessary sleep and observation retrieval in run_single_example function	2025-07-25 15:51:20 +00:00
yuanmengqi	f5595df71c	delete: remove gat1_agent.py file	2025-07-25 07:11:55 +00:00
Zilong Zhou	b8b9e9b166	feat: add proxy handling logic and clean up imports (#285)	2025-07-24 16:27:56 +08:00
Zilong Zhou	cbe650d0bb	refactor&delete: simplify AWS VM allocation and remove proxy support (#284)	2025-07-24 16:27:18 +08:00
Jiaqi	23b81993fa	os task fix: set the default dim screen time to be 300s	2025-07-24 08:13:02 +00:00
ChenYXxxx	873f8a0359	Update 10a730d5-d414-4b40-b479-684bed1ae522.json: change "the ight" to "the night"	2025-07-24 15:44:52 +08:00
Jiarui Yao	4fd8b5be0a	support docker without kvm (#282)	2025-07-24 12:31:43 +08:00
张逸群	bf78b6d05e	Add OPENAI_BASE_URL support for custom OpenAI-compatible endpoints (#283) Enables GPT models to use custom API endpoints through OPENAI_BASE_URL environment variable. This addresses the limitation where only Azure OpenAI supported custom endpoints while standard GPT models were hardcoded to api.openai.com. - Add intelligent URL handling to avoid duplicate /v1 paths - Maintain backward compatibility with default OpenAI API - Update README with configuration instructions - Non-breaking change preserving existing functionality Fixes API integration issues for users with custom OpenAI-compatible services.	2025-07-24 12:31:08 +08:00
yuanmengqi	0b0f8413df	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-23 17:19:54 +00:00
yuanmengqi	0a2929137b	Simplify the logic for Docker provider	2025-07-23 17:19:47 +00:00
Yan98	2f3a6c48f6	Fix Typos (#275) * init * init * fix typo	2025-07-24 00:06:04 +08:00
yuanmengqi	fd7381210e	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-23 16:05:42 +00:00
yuanmengqi	5d219e7a5b	Clean code	2025-07-23 16:05:39 +00:00
Yuan Mengqi	d128edbbc1	add nogdrive json (#281) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude * add nogdrive json	2025-07-23 19:12:42 +08:00
张逸群	4d6e0fd031	Add --provider_name parameter to run.py and fix Docker provider initialization (#277) - Add command-line argument --provider_name to support flexible provider selection - Default provider remains vmware for backward compatibility - Fix Docker provider controller initialization issue with delayed setup - Add safety checks for controller existence in error handling This enables users to specify different virtualization providers directly from the command line and resolves Docker container lifecycle issues.	2025-07-23 04:09:36 +08:00
yuanmengqi	73de48af75	Update Public Evaluation Guidelines and README to require Python 3.10 and enhance installation instructions. Added troubleshooting tips for environment issues and clarified access key creation process in AWS for better security practices.	2025-07-22 19:57:55 +00:00
yuanmengqi	82c3cdd590	feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management - Introduced signal handling for graceful shutdown of environments and processes. - Enhanced logging configuration to support dynamic log levels and structured output. - Updated argument parsing to include new parameters for model selection and task execution. - Refactored task distribution logic to streamline environment task management. - Improved error handling during task execution and environment cleanup. - Adjusted Qwen25VLAgent initialization to support new model and thought prefix options. - Reduced max tries for LLM calls to optimize performance.	2025-07-22 19:46:42 +00:00
张逸群	4a5d48000f	feat: add HuggingFace mirror support for VM providers (#278) Add support for HuggingFace mirror (hf-mirror.com) to improve download speeds in regions where huggingface.co access is slow. - Support HF_ENDPOINT environment variable detection - Automatically switch to hf-mirror.com when HF_ENDPOINT is set - Apply to Docker, VMware, and VirtualBox providers - Maintain backward compatibility with default huggingface.co URLs Users can now set HF_ENDPOINT=https://hf-mirror.com to use the mirror.	2025-07-23 03:40:35 +08:00
Yuan Mengqi	0a37cccd53	update claude (#280) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude	2025-07-23 03:35:49 +08:00
Dunjie Lu	53fb96298a	support_qwen25vl (#276) Co-authored-by: root <ludunjie1219@github.com>	2025-07-22 16:33:03 +08:00
yuanmengqi	921321c5df	Update Public Evaluation Guidelines to clarify proxy settings. Added information on automatic proxy wrapping for proxy-sensitive tasks and retained the recommendation for users to disable the proxy if not needed. Ensured existing content structure remains intact.	2025-07-22 05:59:57 +00:00
yuanmengqi	2727696835	Enhance Public Evaluation Guidelines by adding new images for AWS setup and monitoring instructions. Included additional contact information for leaderboard updates and error reporting. Ensured clarity and usability for users while preserving existing content structure.	2025-07-22 05:53:33 +00:00
yuanmengqi	05e25ba1b7	Enhance Public Evaluation Guidelines with detailed AWS setup instructions and security configurations. Added new sections for host and client machine setup, including recommended instance types, storage considerations, and security group rules. Updated existing content for clarity and added a new image for Google Drive authentication. Ensure all changes maintain original logic while improving usability for users with varying AWS experience.	2025-07-22 05:35:58 +00:00
yuanmengqi	feaebbc2ec	Update AWS guidance	2025-07-20 16:42:14 +00:00
yuanmengqi	46c9407879	Clean up older version of opencua experiment runner	2025-07-20 07:57:27 +00:00
yuanmengqi	91bc6bb6ce	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-20 07:55:57 +00:00
yuanmengqi	88d5639a2a	Compatible with agents that cannot use runtime log	2025-07-20 07:55:53 +00:00
Xinyuan Wang	e10dd9267c	Wxy/opencua (#274) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines * update parallel; clean code; use sleep 3s * ui-tars-0717	2025-07-20 15:52:23 +08:00
Danyang Zhang	bec7129fff	Calc eval fix (#273) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scaffolds * ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures * ver Jul19th added support for cellIs and some other new types of conditional formatting for calc evaluation --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-19 17:15:40 +08:00
yuanmengqi	c6c62c52d7	feat: add X11 image handling and enhanced OCR processing - Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats. - Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools. - Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images. - Maintained existing logic while expanding functionality for better image processing reliability.	2025-07-18 19:26:29 +00:00
yuanmengqi	d6f2190a9f	fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine	2025-07-18 18:51:01 +00:00
yuanmengqi	d52f3b1fca	feat: add safe image opening function with retry mechanism - Introduced a new function `safe_open_image_with_retry` to handle image file opening with retries for truncated or corrupted files. - Enhanced error handling and logging for image processing in `check_palette_and_structure_sim`. - Updated the logic to safely open both source and target images, ensuring robust evaluation without altering existing functionality. These changes improve the reliability of image handling in the GIMP evaluator while maintaining the original code logic.	2025-07-18 18:36:09 +00:00
yuanmengqi	4fa59ebba2	feat: enhance URL comparison logic and Chrome debugging configuration - Added a new function to ensure URLs have a scheme, defaulting to 'http://' if missing. - Integrated tldextract to normalize URLs by extracting domain parts and handling 'www' subdomains. - Updated the compare_urls function to include logging for better traceability during URL comparisons. - Added tldextract to requirements.txt to support the new functionality. - Updated the AWS manager with a new AMI ID for the specified resolution. - Modified Chrome desktop launcher to include --remote-debugging-port=1337 for GUI debugging support. These changes improve the robustness of URL handling and enable consistent Chrome debugging capabilities without altering existing logic.	2025-07-18 17:55:45 +00:00
yuanmengqi	1ade6fe439	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-18 14:17:37 +00:00
yuanmengqi	44bd66fc9a	Increase timeout for page load stability in Chrome evaluator - Updated the timeout for the page load state from 10 seconds to 60 seconds to ensure better stability during page processing. - Removed redundant retry mechanisms from the active tab checks to streamline the code while maintaining existing functionality. - Enhanced logging to provide clearer insights into the page loading process. These changes aim to improve the reliability of the Chrome evaluator without altering the core logic.	2025-07-18 14:16:16 +00:00
Danyang Zhang	53ffc05042	Calc eval fix (#272) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scaffolds * ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-18 21:28:48 +08:00
Zilong Zhou	7fb1cee575	fix: img path error (#271) * feat&style: add task status configuration and clear cache functionality; enhance UI styles * feat&refactor: enhance current configuration API and improve cache clearing logic * refactor&style: simplify task status update logic and improve page refresh mechanism * refactor&feat: streamline default configuration retrieval and enhance cache initialization logic * feat&refactor: add caching to default configuration retrieval and streamline task status logic * feat&style: add collapsible section for additional model parameters and enhance styling for config items * refactor&style: remove floating action button and clean up related styles * fix: update video and screenshot sources to include action space, observation type, and model name parameters	2025-07-18 19:52:03 +08:00


1268 Commits
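
Commit bf78b6d05e describes reading `OPENAI_BASE_URL` and avoiding duplicate `/v1` paths. The repository's actual code is not shown in this log, so the following is only a minimal sketch of how such handling might look. The function name `chat_completions_url` and the constant `DEFAULT_BASE_URL` are hypothetical names chosen for illustration.

```python
import os

# Assumption: the default mirrors the public OpenAI endpoint named in the commit message.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def chat_completions_url(base_url=None):
    """Build a chat-completions URL without duplicating the /v1 segment.

    Hypothetical helper: prefers the explicit argument, then the
    OPENAI_BASE_URL environment variable, then the default.
    """
    base = base_url or os.environ.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    base = base.rstrip("/")          # tolerate a trailing slash
    if not base.endswith("/v1"):     # append /v1 only when it is missing
        base += "/v1"
    return base + "/chat/completions"
```

With this sketch, a base URL given with or without a trailing `/v1` resolves to the same endpoint, which is the behavior the commit message calls "intelligent URL handling".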