sci-gui-agent-benchmark

Author	SHA1	Message	Date
yuanmengqi	73de48af75	Update Public Evaluation Guidelines and README to require Python 3.10 and enhance installation instructions. Added troubleshooting tips for environment issues and clarified access key creation process in AWS for better security practices.	2025-07-22 19:57:55 +00:00
yuanmengqi	82c3cdd590	feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management - Introduced signal handling for graceful shutdown of environments and processes. - Enhanced logging configuration to support dynamic log levels and structured output. - Updated argument parsing to include new parameters for model selection and task execution. - Refactored task distribution logic to streamline environment task management. - Improved error handling during task execution and environment cleanup. - Adjusted Qwen25VLAgent initialization to support new model and thought prefix options. - Reduced max tries for LLM calls to optimize performance.	2025-07-22 19:46:42 +00:00
张逸群	4a5d48000f	feat: add HuggingFace mirror support for VM providers (#278 ) Add support for HuggingFace mirror (hf-mirror.com) to improve download speeds in regions where huggingface.co access is slow. - Support HF_ENDPOINT environment variable detection - Automatically switch to hf-mirror.com when HF_ENDPOINT is set - Apply to Docker, VMware, and VirtualBox providers - Maintain backward compatibility with default huggingface.co URLs Users can now set HF_ENDPOINT=https://hf-mirror.com to use the mirror.	2025-07-23 03:40:35 +08:00
Yuan Mengqi	0a37cccd53	update claude (#280 ) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude	2025-07-23 03:35:49 +08:00
Dunjie Lu	53fb96298a	support_qwen25vl (#276) Co-authored-by: root <ludunjie1219@github.com>	2025-07-22 16:33:03 +08:00
yuanmengqi	921321c5df	Update Public Evaluation Guidelines to clarify proxy settings. Added information on automatic proxy wrapping for proxy-sensitive tasks and retained the recommendation for users to disable the proxy if not needed. Ensured existing content structure remains intact.	2025-07-22 05:59:57 +00:00
yuanmengqi	2727696835	Enhance Public Evaluation Guidelines by adding new images for AWS setup and monitoring instructions. Included additional contact information for leaderboard updates and error reporting. Ensured clarity and usability for users while preserving existing content structure.	2025-07-22 05:53:33 +00:00
yuanmengqi	05e25ba1b7	Enhance Public Evaluation Guidelines with detailed AWS setup instructions and security configurations. Added new sections for host and client machine setup, including recommended instance types, storage considerations, and security group rules. Updated existing content for clarity and added a new image for Google Drive authentication. Ensure all changes maintain original logic while improving usability for users with varying AWS experience.	2025-07-22 05:35:58 +00:00
yuanmengqi	feaebbc2ec	Update AWS guidance	2025-07-20 16:42:14 +00:00
yuanmengqi	46c9407879	Clean up older version of opencua experiment runner	2025-07-20 07:57:27 +00:00
yuanmengqi	91bc6bb6ce	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-20 07:55:57 +00:00
yuanmengqi	88d5639a2a	Compatible with agents that cannot use runtime log	2025-07-20 07:55:53 +00:00
Xinyuan Wang	e10dd9267c	Wxy/opencua (#274 ) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines * update parallel; clean code; use sleep 3s * ui-tars-0717	2025-07-20 15:52:23 +08:00
Danyang Zhang	bec7129fff	Calc eval fix (#273) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scaffolds * ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures * ver Jul19th added support for cellIs and some other new types of conditional formatting for calc evaluation --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-19 17:15:40 +08:00
yuanmengqi	c6c62c52d7	feat: add X11 image handling and enhanced OCR processing - Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats. - Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools. - Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images. - Maintained existing logic while expanding functionality for better image processing reliability.	2025-07-18 19:26:29 +00:00
yuanmengqi	d6f2190a9f	fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine	2025-07-18 18:51:01 +00:00
yuanmengqi	d52f3b1fca	feat: add safe image opening function with retry mechanism - Introduced a new function `safe_open_image_with_retry` to handle image file opening with retries for truncated or corrupted files. - Enhanced error handling and logging for image processing in `check_palette_and_structure_sim`. - Updated the logic to safely open both source and target images, ensuring robust evaluation without altering existing functionality. These changes improve the reliability of image handling in the GIMP evaluator while maintaining the original code logic.	2025-07-18 18:36:09 +00:00
yuanmengqi	4fa59ebba2	feat: enhance URL comparison logic and Chrome debugging configuration - Added a new function to ensure URLs have a scheme, defaulting to 'http://' if missing. - Integrated tldextract to normalize URLs by extracting domain parts and handling 'www' subdomains. - Updated the compare_urls function to include logging for better traceability during URL comparisons. - Added tldextract to requirements.txt to support the new functionality. - Updated the AWS manager with a new AMI ID for the specified resolution. - Modified Chrome desktop launcher to include --remote-debugging-port=1337 for GUI debugging support. These changes improve the robustness of URL handling and enable consistent Chrome debugging capabilities without altering existing logic.	2025-07-18 17:55:45 +00:00
yuanmengqi	1ade6fe439	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-18 14:17:37 +00:00
yuanmengqi	44bd66fc9a	Increase timeout for page load stability in Chrome evaluator - Updated the timeout for the page load state from 10 seconds to 60 seconds to ensure better stability during page processing. - Removed redundant retry mechanisms from the active tab checks to streamline the code while maintaining existing functionality. - Enhanced logging to provide clearer insights into the page loading process. These changes aim to improve the reliability of the Chrome evaluator without altering the core logic.	2025-07-18 14:16:16 +00:00
Danyang Zhang	53ffc05042	Calc eval fix (#272) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scaffolds * ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-18 21:28:48 +08:00
Zilong Zhou	7fb1cee575	fix: img path error (#271 ) * feat&style: add task status configuration and clear cache functionality; enhance UI styles * feat&refactor: enhance current configuration API and improve cache clearing logic * refactor&style: simplify task status update logic and improve page refresh mechanism * refactor&feat: streamline default configuration retrieval and enhance cache initialization logic * feat&refactor: add caching to default configuration retrieval and streamline task status logic * feat&style: add collapsible section for additional model parameters and enhance styling for config items * refactor&style: remove floating action button and clean up related styles * fix: update video and screenshot sources to include action space, observation type, and model name parameters	2025-07-18 19:52:03 +08:00
shenzhennan	1378c745e1	Merge branch 'main' of https://github.com/xlang-ai/OSWorld	2025-07-18 07:15:18 +00:00
shenzhennan	c7017a476d	fix impress instruction 0a211154	2025-07-18 07:14:35 +00:00
yuanmengqi	fcaefe7bb4	Enhance Chrome evaluator with improved error handling and retry mechanisms - Added robust error handling for page processing, including checks for closed pages and HTTP status codes. - Implemented retry logic for page loads and active tab checks to improve reliability. - Enhanced logging throughout the process to capture detailed information about failures and successes. - Preserved existing logic while ensuring better maintainability and robustness in the Chrome evaluator functions.	2025-07-18 07:13:13 +00:00
yuanmengqi	0fb625e4fd	Update instruction in OS evaluation example to include a restriction against shutting down the machine.	2025-07-18 05:28:43 +00:00
yuanmengqi	b9a646c11d	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-17 18:19:19 +00:00
yuanmengqi	d1ddd3eacd	feat: enhance VM wallpaper retrieval and image similarity checks - Added logging to the VM wallpaper retrieval function to capture errors and warnings related to content retrieval and file creation. - Implemented checks for None, empty, and invalid content types to ensure robustness in wallpaper handling. - Enhanced the SSIM structure check function with size validation and improved error handling for image processing. - Added logging for image size discrepancies and exceptions during SSIM computation to aid in debugging. These changes improve error handling and logging, ensuring better maintainability and reliability of the evaluators.	2025-07-17 18:19:09 +00:00
Zilong Zhou	66694c663d	Feat/monitor cache (#267 ) * feat&style: add task status configuration and clear cache functionality; enhance UI styles * feat&refactor: enhance current configuration API and improve cache clearing logic * refactor&style: simplify task status update logic and improve page refresh mechanism * refactor&feat: streamline default configuration retrieval and enhance cache initialization logic * feat&refactor: add caching to default configuration retrieval and streamline task status logic * feat&style: add collapsible section for additional model parameters and enhance styling for config items * refactor&style: remove floating action button and clean up related styles	2025-07-18 01:58:20 +08:00
yuanmengqi	e70cf0bd93	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-17 11:15:53 +00:00
yuanmengqi	9d04624e41	feat: enhance Chrome evaluator with improved retry logic and logging - Implemented retry mechanism for connecting to Chrome instances, allowing up to two attempts before failure. - Increased timeout settings for page navigation and loading to enhance reliability. - Added detailed logging for connection attempts, page loading status, and error handling to improve debugging and user experience. - Ensured existing logic is preserved while enhancing error handling and operational robustness. These changes improve the overall reliability and maintainability of the Chrome evaluator functions.	2025-07-17 11:15:47 +00:00
yuanmengqi	2c51950e73	feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.	2025-07-17 10:50:10 +00:00
Yuan Mengqi	5ca516ac7a	add uitars agent code (#265)	2025-07-17 18:17:13 +08:00
Xinyuan Wang	24fbad9015	Merge pull request #264 from yuanmengqi/main Improve the parallel logic	2025-07-17 12:28:48 +08:00
yuanmengqi	fe40011b5d	Improve the parallel logic	2025-07-17 04:21:42 +00:00
yuanmengqi	6788c58aa3	Improve the parallel logic	2025-07-17 04:20:59 +00:00
yuanmengqi	bb8b0b2582	Improve the parallel logic	2025-07-17 04:19:44 +00:00
yuanmengqi	150234307e	Merge branch 'fix_chrome'	2025-07-17 04:14:47 +00:00
yuanmengqi	9eeabfc52d	Improve the parallel logic	2025-07-17 04:14:20 +00:00
yuanmengqi	2a48500691	refactor: improve code readability and error handling in table.py - Reformatted import statements for better organization. - Enhanced error handling in file reading functions to provide clearer logging. - Introduced a new `_safe_read_file` function to handle multiple encoding attempts when reading files. - Updated the `compare_csv` function to utilize the new file reading method, improving robustness. - Ensured consistent return values across functions, replacing `0.` with `0.0` for clarity. These changes maintain existing logic while improving code maintainability and readability.	2025-07-16 18:11:05 +00:00
yuanmengqi	a9c1c6135a	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-16 17:38:39 +00:00
yuanmengqi	0939226020	feat: enhance evaluator configuration with post-execution commands for Chrome - Added a series of postconfig commands to the evaluator section in the JSON file. - Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs. - Introduced sleep intervals to ensure proper execution timing between commands. This update improves the automation capabilities of the evaluation examples while maintaining existing logic.	2025-07-16 17:37:37 +00:00
Zilong Zhou	dc164d5269	feat&fix: update configuration management to save model arguments and enhance UI display for model args (#262 )	2025-07-16 21:46:35 +08:00
yuanmengqi	e433f35c1f	feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.	2025-07-16 13:45:34 +00:00
yuanmengqi	b9df320f31	Enhance check_python_file_by_test_suite function with robust error handling and logging. Added validation for file existence, module loading, and function execution. Improved resource cleanup and working directory management to ensure stability and reliability during test execution.	2025-07-16 11:44:46 +00:00
Xinyuan Wang	0f2655249c	Wxy/opencua (#260 ) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines	2025-07-16 17:53:12 +08:00
yuanmengqi	5e5058c1f2	fix: standardize provider interface parameters across all implementations - Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080) - Add os_type parameter to start_emulator() for Azure and VirtualBox providers - Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers - Use *args, **kwargs for better extensibility and parameter consistency - Add documentation comments explaining ignored parameters for interface consistency - Prevents TypeError exceptions when AWS-specific parameters are passed to other providers This ensures all providers can handle the same parameter sets while maintaining backward compatibility and avoiding interface fragmentation.	2025-07-15 21:38:34 +00:00
yuanmengqi	cb070307ee	merge code	2025-07-15 14:57:14 +00:00
yuanmengqi	175b4b46c2	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-15 14:50:48 +00:00
yuanmengqi	7912880d16	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-15 07:24:38 +00:00


1240 Commits