Commit Graph

1240 Commits

Author SHA1 Message Date
yuanmengqi
73de48af75 Update Public Evaluation Guidelines and README to require Python 3.10 and enhance installation instructions. Added troubleshooting tips for environment issues and clarified access key creation process in AWS for better security practices. 2025-07-22 19:57:55 +00:00
yuanmengqi
82c3cdd590 feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management
- Introduced signal handling for graceful shutdown of environments and processes.
- Enhanced logging configuration to support dynamic log levels and structured output.
- Updated argument parsing to include new parameters for model selection and task execution.
- Refactored task distribution logic to streamline environment task management.
- Improved error handling during task execution and environment cleanup.
- Adjusted Qwen25VLAgent initialization to support new model and thought prefix options.
- Reduced max tries for LLM calls to optimize performance.
2025-07-22 19:46:42 +00:00
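The graceful-shutdown idea in commit 82c3cdd590 could be sketched roughly as follows. This is a minimal illustration, not the code from `run_multienv_qwen25vl.py`: the names `shutdown_requested`, `handle_shutdown`, `install_handlers`, and `run_tasks` are all hypothetical.

```python
import signal
import threading

# Hypothetical flag; the real runner's state handling is not shown in the log.
shutdown_requested = threading.Event()


def handle_shutdown(signum, frame):
    """Only record the request; cleanup is left to the main loop."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler for Ctrl+C and termination signals (main thread only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handle_shutdown)


def run_tasks(tasks, run_one):
    """Run tasks in order, stopping before the next task once shutdown is requested."""
    completed = []
    for task in tasks:
        if shutdown_requested.is_set():
            break
        completed.append(run_one(task))
    return completed
```

Keeping the handler down to setting a flag means environment cleanup runs in ordinary code rather than inside the signal handler.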
张逸群
4a5d48000f feat: add HuggingFace mirror support for VM providers (#278)
Add support for HuggingFace mirror (hf-mirror.com) to improve download
speeds in regions where huggingface.co access is slow.

- Support HF_ENDPOINT environment variable detection
- Automatically switch to hf-mirror.com when HF_ENDPOINT is set
- Apply to Docker, VMware, and VirtualBox providers
- Maintain backward compatibility with default huggingface.co URLs

Users can now set HF_ENDPOINT=https://hf-mirror.com to use the mirror.
2025-07-23 03:40:35 +08:00
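The `HF_ENDPOINT` detection described in commit 4a5d48000f might look something like the sketch below. The helper name `resolve_hf_url` and the example path are assumptions made for illustration; the providers' actual code is not shown in this log.

```python
import os

DEFAULT_HF_ENDPOINT = "https://huggingface.co"


def resolve_hf_url(url, env=None):
    """Rewrite a huggingface.co URL to the endpoint named by HF_ENDPOINT, if set.

    `env` defaults to os.environ; it is a parameter only so the sketch is easy to test.
    """
    env = os.environ if env is None else env
    endpoint = env.get("HF_ENDPOINT", "").strip().rstrip("/")
    if not endpoint:
        # Backward compatible: leave the default URL untouched.
        return url
    if url.startswith(DEFAULT_HF_ENDPOINT):
        return endpoint + url[len(DEFAULT_HF_ENDPOINT):]
    return url
```

With `HF_ENDPOINT=https://hf-mirror.com` exported, a hypothetical URL such as `https://huggingface.co/datasets/example/vm.zip` would be rewritten to use the mirror host, and with the variable unset it would pass through unchanged.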
Yuan Mengqi
0a37cccd53 update claude (#280)
* add uitars agent code

* improve claude

* improve claude

* improve claude

* improve claude

* improve claude
2025-07-23 03:35:49 +08:00
Dunjie Lu
53fb96298a support_qwen25vl (#276)
Co-authored-by: root <ludunjie1219@github.com>
2025-07-22 16:33:03 +08:00
yuanmengqi
921321c5df Update Public Evaluation Guidelines to clarify proxy settings. Added information on automatic proxy wrapping for proxy-sensitive tasks and retained the recommendation for users to disable the proxy if not needed. Ensured existing content structure remains intact. 2025-07-22 05:59:57 +00:00
yuanmengqi
2727696835 Enhance Public Evaluation Guidelines by adding new images for AWS setup and monitoring instructions. Included additional contact information for leaderboard updates and error reporting. Ensured clarity and usability for users while preserving existing content structure. 2025-07-22 05:53:33 +00:00
yuanmengqi
05e25ba1b7 Enhance Public Evaluation Guidelines with detailed AWS setup instructions and security configurations. Added new sections for host and client machine setup, including recommended instance types, storage considerations, and security group rules. Updated existing content for clarity and added a new image for Google Drive authentication. Ensure all changes maintain original logic while improving usability for users with varying AWS experience. 2025-07-22 05:35:58 +00:00
yuanmengqi
feaebbc2ec Update AWS guidance 2025-07-20 16:42:14 +00:00
yuanmengqi
46c9407879 Clean up older version of the OpenCUA experiment runner 2025-07-20 07:57:27 +00:00
yuanmengqi
91bc6bb6ce Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-20 07:55:57 +00:00
yuanmengqi
88d5639a2a Add compatibility with agents that cannot use the runtime log 2025-07-20 07:55:53 +00:00
Xinyuan Wang
e10dd9267c Wxy/opencua (#274)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines

* update parallel; clean code; use sleep 3s

* ui-tars-0717
2025-07-20 15:52:23 +08:00
Danyang Zhang
bec7129fff Calc eval fix (#273)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scaffolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

* ver Jul19th

added support for cellIs and some other new types of conditional
formatting for calc evaluation

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-19 17:15:40 +08:00
yuanmengqi
c6c62c52d7 feat: add X11 image handling and enhanced OCR processing
- Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats.
- Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools.
- Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images.
- Maintained existing logic while expanding functionality for better image processing reliability.
2025-07-18 19:26:29 +00:00
yuanmengqi
d6f2190a9f fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine 2025-07-18 18:51:01 +00:00
yuanmengqi
d52f3b1fca feat: add safe image opening function with retry mechanism
- Introduced a new function `safe_open_image_with_retry` to handle image file opening with retries for truncated or corrupted files.
- Enhanced error handling and logging for image processing in `check_palette_and_structure_sim`.
- Updated the logic to safely open both source and target images, ensuring robust evaluation without altering existing functionality.

These changes improve the reliability of image handling in the GIMP evaluator while maintaining the original code logic.
2025-07-18 18:36:09 +00:00
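A retry wrapper in the spirit of `safe_open_image_with_retry` from commit d52f3b1fca could be sketched as below. It is not the evaluator's implementation: the name `open_with_retry`, its parameters, and the choice to pass the opener in as a callable are assumptions. With Pillow, `opener` would typically be something like `PIL.Image.open` followed by `load()`, which raises `OSError` on truncated files.

```python
import time


def open_with_retry(opener, path, max_retries=3, delay_seconds=0.0):
    """Call opener(path), retrying on OSError up to max_retries times.

    Re-raises the last OSError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return opener(path)
        except OSError as exc:
            last_error = exc
            if attempt < max_retries:
                # Give a file that is still being written time to finish.
                time.sleep(delay_seconds)
    raise last_error
```

Retrying only helps when the file is incomplete at read time, for example because it is still being copied out of the VM; a file that is genuinely corrupted will still fail after the last attempt.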
yuanmengqi
4fa59ebba2 feat: enhance URL comparison logic and Chrome debugging configuration
- Added a new function to ensure URLs have a scheme, defaulting to 'http://' if missing.
- Integrated tldextract to normalize URLs by extracting domain parts and handling 'www' subdomains.
- Updated the compare_urls function to include logging for better traceability during URL comparisons.
- Added tldextract to requirements.txt to support the new functionality.
- Updated the AWS manager with a new AMI ID for the specified resolution.
- Modified Chrome desktop launcher to include --remote-debugging-port=1337 for GUI debugging support.

These changes improve the robustness of URL handling and enable consistent Chrome debugging capabilities without altering existing logic.
2025-07-18 17:55:45 +00:00
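The scheme-defaulting and `www` handling described in commit 4fa59ebba2 can be illustrated with the standard library alone. The commit itself uses tldextract; this sketch swaps in `urllib.parse` as a simplification, and the names `ensure_scheme`, `normalize_host`, and `same_site` are hypothetical.

```python
from urllib.parse import urlparse


def ensure_scheme(url, default="http://"):
    """Prefix the URL with a default scheme when it has none."""
    return url if "://" in url else default + url


def normalize_host(url):
    """Return the lowercased hostname with a leading 'www.' removed."""
    host = urlparse(ensure_scheme(url)).hostname or ""
    return host[4:] if host.startswith("www.") else host


def same_site(url_a, url_b):
    """Compare host and path only; scheme and trailing slashes are ignored."""
    path_a = urlparse(ensure_scheme(url_a)).path.rstrip("/")
    path_b = urlparse(ensure_scheme(url_b)).path.rstrip("/")
    return normalize_host(url_a) == normalize_host(url_b) and path_a == path_b
```

Unlike tldextract, this version has no knowledge of public suffixes, so it only strips a literal `www.` prefix and does not split registered domains from subdomains.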
yuanmengqi
1ade6fe439 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-18 14:17:37 +00:00
yuanmengqi
44bd66fc9a Increase timeout for page load stability in Chrome evaluator
- Updated the timeout for the page load state from 10 seconds to 60 seconds to ensure better stability during page processing.
- Removed redundant retry mechanisms from the active tab checks to streamline the code while maintaining existing functionality.
- Enhanced logging to provide clearer insights into the page loading process.

These changes aim to improve the reliability of the Chrome evaluator without altering the core logic.
2025-07-18 14:16:16 +00:00
Danyang Zhang
53ffc05042 Calc eval fix (#272)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scaffolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-18 21:28:48 +08:00
Zilong Zhou
7fb1cee575 fix: img path error (#271)
* feat&style: add task status configuration and clear cache functionality; enhance UI styles

* feat&refactor: enhance current configuration API and improve cache clearing logic

* refactor&style: simplify task status update logic and improve page refresh mechanism

* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic

* feat&refactor: add caching to default configuration retrieval and streamline task status logic

* feat&style: add collapsible section for additional model parameters and enhance styling for config items

* refactor&style: remove floating action button and clean up related styles

* fix: update video and screenshot sources to include action space, observation type, and model name parameters
2025-07-18 19:52:03 +08:00
shenzhennan
1378c745e1 Merge branch 'main' of https://github.com/xlang-ai/OSWorld 2025-07-18 07:15:18 +00:00
shenzhennan
c7017a476d fix impress instruction 0a211154 2025-07-18 07:14:35 +00:00
yuanmengqi
fcaefe7bb4 Enhance Chrome evaluator with improved error handling and retry mechanisms
- Added robust error handling for page processing, including checks for closed pages and HTTP status codes.
- Implemented retry logic for page loads and active tab checks to improve reliability.
- Enhanced logging throughout the process to capture detailed information about failures and successes.
- Preserved existing logic while ensuring better maintainability and robustness in the Chrome evaluator functions.
2025-07-18 07:13:13 +00:00
yuanmengqi
0fb625e4fd Update instruction in OS evaluation example to include a restriction against shutting down the machine. 2025-07-18 05:28:43 +00:00
yuanmengqi
b9a646c11d Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-17 18:19:19 +00:00
yuanmengqi
d1ddd3eacd feat: enhance VM wallpaper retrieval and image similarity checks
- Added logging to the VM wallpaper retrieval function to capture errors and warnings related to content retrieval and file creation.
- Implemented checks for None, empty, and invalid content types to ensure robustness in wallpaper handling.
- Enhanced the SSIM structure check function with size validation and improved error handling for image processing.
- Added logging for image size discrepancies and exceptions during SSIM computation to aid in debugging.

These changes improve error handling and logging, ensuring better maintainability and reliability of the evaluators.
2025-07-17 18:19:09 +00:00
Zilong Zhou
66694c663d Feat/monitor cache (#267)
* feat&style: add task status configuration and clear cache functionality; enhance UI styles

* feat&refactor: enhance current configuration API and improve cache clearing logic

* refactor&style: simplify task status update logic and improve page refresh mechanism

* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic

* feat&refactor: add caching to default configuration retrieval and streamline task status logic

* feat&style: add collapsible section for additional model parameters and enhance styling for config items

* refactor&style: remove floating action button and clean up related styles
2025-07-18 01:58:20 +08:00
yuanmengqi
e70cf0bd93 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-17 11:15:53 +00:00
yuanmengqi
9d04624e41 feat: enhance Chrome evaluator with improved retry logic and logging
- Implemented retry mechanism for connecting to Chrome instances, allowing up to two attempts before failure.
- Increased timeout settings for page navigation and loading to enhance reliability.
- Added detailed logging for connection attempts, page loading status, and error handling to improve debugging and user experience.
- Ensured existing logic is preserved while enhancing error handling and operational robustness.

These changes improve the overall reliability and maintainability of the Chrome evaluator functions.
2025-07-17 11:15:47 +00:00
yuanmengqi
2c51950e73 feat: enhance evaluator configuration for Chrome with post-execution commands
- Added postconfig commands to multiple JSON files for Chrome evaluation examples.
- Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing.
- Updated logging messages in the AWS manager to improve clarity and user experience.

These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
2025-07-17 10:50:10 +00:00
Yuan Mengqi
5ca516ac7a add uitars agent code (#265) 2025-07-17 18:17:13 +08:00
Xinyuan Wang
24fbad9015 Merge pull request #264 from yuanmengqi/main
Improve the parallel logic
2025-07-17 12:28:48 +08:00
yuanmengqi
fe40011b5d Improve the parallel logic 2025-07-17 04:21:42 +00:00
yuanmengqi
6788c58aa3 Improve the parallel logic 2025-07-17 04:20:59 +00:00
yuanmengqi
bb8b0b2582 Improve the parallel logic 2025-07-17 04:19:44 +00:00
yuanmengqi
150234307e Merge branch 'fix_chrome' 2025-07-17 04:14:47 +00:00
yuanmengqi
9eeabfc52d Improve the parallel logic 2025-07-17 04:14:20 +00:00
yuanmengqi
2a48500691 refactor: improve code readability and error handling in table.py
- Reformatted import statements for better organization.
- Enhanced error handling in file reading functions to provide clearer logging.
- Introduced a new `_safe_read_file` function to handle multiple encoding attempts when reading files.
- Updated the `compare_csv` function to utilize the new file reading method, improving robustness.
- Ensured consistent return values across functions, replacing `0.` with `0.0` for clarity.

These changes maintain existing logic while improving code maintainability and readability.
2025-07-16 18:11:05 +00:00
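The multiple-encoding approach behind `_safe_read_file` in commit 2a48500691 might resemble the sketch below. The function names and the encoding list are assumptions; the encodings actually tried in `table.py` are not given in this log.

```python
def decode_with_fallback(raw, encodings=("utf-8-sig", "latin-1")):
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode data with any of: {encodings}")


def read_text_with_fallback(path, encodings=("utf-8-sig", "latin-1")):
    """Read a file as bytes, then decode it with decode_with_fallback."""
    with open(path, "rb") as handle:
        return decode_with_fallback(handle.read(), encodings)
```

Here `utf-8-sig` is tried first because it accepts plain UTF-8 and also strips a byte-order mark, while `latin-1` comes last because it accepts any byte sequence and would mask failures of earlier encodings.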
yuanmengqi
a9c1c6135a Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-16 17:38:39 +00:00
yuanmengqi
0939226020 feat: enhance evaluator configuration with post-execution commands for Chrome
- Added a series of postconfig commands to the evaluator section in the JSON file.
- Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs.
- Introduced sleep intervals to ensure proper execution timing between commands.

This update improves the automation capabilities of the evaluation examples while maintaining existing logic.
2025-07-16 17:37:37 +00:00
Zilong Zhou
dc164d5269 feat&fix: update configuration management to save model arguments and enhance UI display for model args (#262) 2025-07-16 21:46:35 +08:00
yuanmengqi
e433f35c1f feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field

This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
2025-07-16 13:45:34 +00:00
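The field standardization in commit e433f35c1f amounts to filling in defaults without overwriting existing values. A rough sketch, assuming each example config is a dict with an `id` key and that the IDs needing a fixed IP are supplied as a set (both are assumptions; the helper `standardize_example` is hypothetical):

```python
def standardize_example(config, fixed_ip_ids=frozenset()):
    """Return a copy of config with the two standard fields present."""
    updated = dict(config)
    # fixed_ip is true only for examples explicitly listed as needing it.
    updated["fixed_ip"] = updated.get("id") in fixed_ip_ids
    # Preserve an existing value such as "medium" or "high"; default to "low".
    updated.setdefault("possibility_of_env_change", "low")
    return updated
```

Using `setdefault` for `possibility_of_env_change` is what keeps previously assigned values intact, matching the backward-compatibility note in the commit message.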
yuanmengqi
b9df320f31 Enhance check_python_file_by_test_suite function with robust error handling and logging. Added validation for file existence, module loading, and function execution. Improved resource cleanup and working directory management to ensure stability and reliability during test execution. 2025-07-16 11:44:46 +00:00
Xinyuan Wang
0f2655249c Wxy/opencua (#260)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines
2025-07-16 17:53:12 +08:00
yuanmengqi
5e5058c1f2 fix: standardize provider interface parameters across all implementations
- Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080)
- Add os_type parameter to start_emulator() for Azure and VirtualBox providers
- Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers
- Use *args, **kwargs for better extensibility and parameter consistency
- Add documentation comments explaining ignored parameters for interface consistency
- Prevents TypeError exceptions when AWS-specific parameters are passed to other providers

This ensures all providers can handle the same parameter sets while maintaining
backward compatibility and avoiding interface fragmentation.
2025-07-15 21:38:34 +00:00
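The interface fix in commit 5e5058c1f2 can be pictured with a toy provider that accepts, and ignores, parameters meant for other backends. The class `LocalProvider`, its attributes, and its return values are hypothetical; only the parameter names `os_type`, `region`, and `screen_size` and the 1920x1080 default come from the commit message.

```python
class LocalProvider:
    """Hypothetical non-AWS provider, for illustration only."""

    def __init__(self):
        self.running = set()

    def get_vm_path(self, os_type="Ubuntu", screen_size=(1920, 1080), *args, **kwargs):
        # screen_size is accepted for interface consistency; extra arguments are ignored.
        return f"{os_type}-{screen_size[0]}x{screen_size[1]}"

    def start_emulator(self, path_to_vm, headless=True, os_type=None, *args, **kwargs):
        # os_type is accepted for interface consistency and ignored here.
        self.running.add(path_to_vm)

    def stop_emulator(self, path_to_vm, region=None, *args, **kwargs):
        # region only matters for cloud providers; ignoring it avoids a TypeError.
        self.running.discard(path_to_vm)
```

A caller written against the AWS provider can then pass `region="us-east-1"` to any provider without the call failing on an unexpected keyword argument.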
yuanmengqi
cb070307ee merge code 2025-07-15 14:57:14 +00:00
yuanmengqi
175b4b46c2 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-15 14:50:48 +00:00
yuanmengqi
7912880d16 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-15 07:24:38 +00:00