Commit Graph

1268 Commits

Author SHA1 Message Date
yuanmengqi
a37fe86925 feat: enhance logging and signal handling in run_multienv_claude.py
- Refactored logging configuration to support dynamic log levels via command-line arguments, allowing for better control over log verbosity.
- Introduced a new signal handler for graceful shutdown of environments and processes, improving robustness during termination.
- Added functionality to save command-line arguments to a JSON file for better traceability of execution parameters.
- Maintained existing logic while enhancing the overall structure and error handling capabilities of the script.
2025-07-28 07:43:13 +00:00
yuanmengqi
78651040e7 docs: update README.md with new OSWorld-Verified announcement and minor text corrections
- Added a new update entry for the introduction of **OSWorld-Verified** highlighting major updates and community fixes.
- Corrected the spelling of "VirtualBox" in the environment refactor entry.
- Enhanced clarity in the Docker section title for better readability.
2025-07-28 07:19:39 +00:00
yuanmengqi
0f00788c4d feat: add run_multienv_o3.py script for multi-environment evaluation
- Introduced a new script `run_multienv_o3.py` to facilitate end-to-end evaluation across multiple environments.
- Implemented command-line argument parsing for various configurations including environment settings, logging levels, and AWS parameters.
- Integrated signal handling for graceful shutdown of environments and processes.
- Enhanced logging capabilities for better traceability during execution.
- Maintained existing logic from previous scripts while introducing new functionalities for improved evaluation processes.
2025-07-27 16:47:24 +00:00
yuanmengqi
1342bfe5ce delete: remove show_result_opencua.py file and its associated functions 2025-07-27 16:37:40 +00:00
yuanmengqi
5fa490adf4 fix: update Flask port configuration to support environment variable
- Modified the Flask application to allow the port to be set via the `FLASK_PORT` environment variable, defaulting to 8080 if not specified.
- Ensured existing application logic remains unchanged while enhancing configurability for deployment environments.
2025-07-27 16:14:07 +00:00
yuanmengqi
523d553e88 feat: add client password argument to multiple agents and scripts
- Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility.
- Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration.
- Modified evaluation guidelines to reflect the new client password requirement.
- Ensured existing logic remains intact while enhancing functionality for better user experience.
2025-07-27 16:11:23 +00:00
yuanmengqi
122b16742b fix: improve EPUB processing by checking for file existence before reading
- Added checks for the presence of "toc.ncx" and "content.opf" in the EPUB file before attempting to process them.
- Introduced debug logging to notify when these files are not found, enhancing error handling and traceability.
- Maintained existing logic while improving robustness of the EPUB processing function.
2025-07-26 20:42:18 +00:00
yuanmengqi
b25854edba feat: introduce DummyAgent class for enhanced coordinate handling
- Added DummyAgent class to facilitate coordinate generation and action assignment.
- Updated GTA1Agent to utilize DummyAgent for improved planning and execution.
- Increased max_steps and N_SEQ parameters for better performance.
- Enhanced logging for planning and execution processes.
- Maintained existing logic while integrating new functionality.
2025-07-26 08:26:23 +00:00
yuanmengqi
d49ca9cc2d fix: enhance handling of '<' characters in pyautogui commands
- Refactor _fix_pyautogui_less_than_bug to improve handling of press('<') and typewrite calls.
- Introduce Unicode escape decoding for typewrite content to ensure proper '<' character processing.
- Maintain existing logic while enhancing functionality for better compatibility.
2025-07-26 07:59:37 +00:00
yuanmengqi
123f51ea4a Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-26 07:28:39 +00:00
yuanmengqi
73caf53880 delete: remove img_utils.py and update imports in jedi_3b_agent.py and jedi_7b_agent.py to use qwen_vl_utils 2025-07-26 07:28:31 +00:00
张逸群
2ed0436c21 fix: update DockerVMManager method signatures for interface compatibility (#287)
- Fix delete_vm() method to accept region and **kwargs parameters
- Fix occupy_vm() method to accept pid, region and **kwargs parameters
- Ensures consistency with base VMManager interface and other providers
- Resolves runtime argument mismatch errors when calling these methods

This maintains backward compatibility while fixing the interface contract.
2025-07-26 01:18:00 +08:00
yuanmengqi
40fdc6266f chore: update default AWS instance type from t3.xlarge to t3.medium 2025-07-25 15:56:42 +00:00
yuanmengqi
39e5baf5ae fix: remove unnecessary sleep and observation retrieval in run_single_example function 2025-07-25 15:51:20 +00:00
yuanmengqi
f5595df71c delete: remove gat1_agent.py file 2025-07-25 07:11:55 +00:00
Zilong Zhou
b8b9e9b166 feat: add proxy handling logic and clean up imports (#285) 2025-07-24 16:27:56 +08:00
Zilong Zhou
cbe650d0bb refactor&delete: simplify AWS VM allocation and remove proxy support (#284) 2025-07-24 16:27:18 +08:00
Jiaqi
23b81993fa os task fix: set the default dim screen time to be 300s 2025-07-24 08:13:02 +00:00
ChenYXxxx
873f8a0359 Update 10a730d5-d414-4b40-b479-684bed1ae522.json
change the ight 2 the night
2025-07-24 15:44:52 +08:00
Jiarui Yao
4fd8b5be0a support docker without kvm (#282) 2025-07-24 12:31:43 +08:00
张逸群
bf78b6d05e Add OPENAI_BASE_URL support for custom OpenAI-compatible endpoints (#283)
Enables GPT models to use custom API endpoints through OPENAI_BASE_URL environment variable. This addresses the limitation where only Azure OpenAI supported custom endpoints while standard GPT models were hardcoded to api.openai.com.

- Add intelligent URL handling to avoid duplicate /v1 paths
- Maintain backward compatibility with default OpenAI API
- Update README with configuration instructions
- Non-breaking change preserving existing functionality

Fixes API integration issues for users with custom OpenAI-compatible services.
2025-07-24 12:31:08 +08:00
yuanmengqi
0b0f8413df Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-23 17:19:54 +00:00
yuanmengqi
0a2929137b Simplify the logic for Docker provider 2025-07-23 17:19:47 +00:00
Yan98
2f3a6c48f6 Fix Typos (#275)
* init

* init

* fix typo
2025-07-24 00:06:04 +08:00
yuanmengqi
fd7381210e Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-23 16:05:42 +00:00
yuanmengqi
5d219e7a5b Clean code 2025-07-23 16:05:39 +00:00
Yuan Mengqi
d128edbbc1 add nogdrive json (#281)
* add uitars agent code

* improve claude

* improve claude

* improve claude

* improve claude

* improve claude

* add nogdrive json
2025-07-23 19:12:42 +08:00
张逸群
4d6e0fd031 Add --provider_name parameter to run.py and fix Docker provider initialization (#277)
- Add command-line argument --provider_name to support flexible provider selection
- Default provider remains vmware for backward compatibility
- Fix Docker provider controller initialization issue with delayed setup
- Add safety checks for controller existence in error handling

This enables users to specify different virtualization providers directly
from the command line and resolves Docker container lifecycle issues.
2025-07-23 04:09:36 +08:00
yuanmengqi
73de48af75 Update Public Evaluation Guidelines and README to require Python 3.10 and enhance installation instructions. Added troubleshooting tips for environment issues and clarified access key creation process in AWS for better security practices. 2025-07-22 19:57:55 +00:00
yuanmengqi
82c3cdd590 feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management
- Introduced signal handling for graceful shutdown of environments and processes.
- Enhanced logging configuration to support dynamic log levels and structured output.
- Updated argument parsing to include new parameters for model selection and task execution.
- Refactored task distribution logic to streamline environment task management.
- Improved error handling during task execution and environment cleanup.
- Adjusted Qwen25VLAgent initialization to support new model and thought prefix options.
- Reduced max tries for LLM calls to optimize performance.
2025-07-22 19:46:42 +00:00
张逸群
4a5d48000f feat: add HuggingFace mirror support for VM providers (#278)
Add support for HuggingFace mirror (hf-mirror.com) to improve download
speeds in regions where huggingface.co access is slow.

- Support HF_ENDPOINT environment variable detection
- Automatically switch to hf-mirror.com when HF_ENDPOINT is set
- Apply to Docker, VMware, and VirtualBox providers
- Maintain backward compatibility with default huggingface.co URLs

Users can now set HF_ENDPOINT=https://hf-mirror.com to use the mirror.
2025-07-23 03:40:35 +08:00
Yuan Mengqi
0a37cccd53 update claude (#280)
* add uitars agent code

* improve claude

* improve claude

* improve claude

* improve claude

* improve claude
2025-07-23 03:35:49 +08:00
Dunjie Lu
53fb96298a support_qwen25vl (#276)
Co-authored-by: root <ludunjie1219@github.com>
2025-07-22 16:33:03 +08:00
yuanmengqi
921321c5df Update Public Evaluation Guidelines to clarify proxy settings. Added information on automatic proxy wrapping for proxy-sensitive tasks and retained the recommendation for users to disable the proxy if not needed. Ensured existing content structure remains intact. 2025-07-22 05:59:57 +00:00
yuanmengqi
2727696835 Enhance Public Evaluation Guidelines by adding new images for AWS setup and monitoring instructions. Included additional contact information for leaderboard updates and error reporting. Ensured clarity and usability for users while preserving existing content structure. 2025-07-22 05:53:33 +00:00
yuanmengqi
05e25ba1b7 Enhance Public Evaluation Guidelines with detailed AWS setup instructions and security configurations. Added new sections for host and client machine setup, including recommended instance types, storage considerations, and security group rules. Updated existing content for clarity and added a new image for Google Drive authentication. Ensure all changes maintain original logic while improving usability for users with varying AWS experience. 2025-07-22 05:35:58 +00:00
yuanmengqi
feaebbc2ec Update AWS guidance 2025-07-20 16:42:14 +00:00
yuanmengqi
46c9407879 Clean elder version of opencua experiment runner 2025-07-20 07:57:27 +00:00
yuanmengqi
91bc6bb6ce Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-20 07:55:57 +00:00
yuanmengqi
88d5639a2a Compatible with agents that cannot use runtime log 2025-07-20 07:55:53 +00:00
Xinyuan Wang
e10dd9267c Wxy/opencua (#274)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines

* update parallel; clean code; use sleep 3s

* ui-tars-0717
2025-07-20 15:52:23 +08:00
Danyang Zhang
bec7129fff Calc eval fix (#273)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scalfolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

* ver Jul19th

added supports for cellIs and some other new types of conditional
formatting for calc evaluation

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-19 17:15:40 +08:00
yuanmengqi
c6c62c52d7 feat: add X11 image handling and enhanced OCR processing
- Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats.
- Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools.
- Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images.
- Maintained existing logic while expanding functionality for better image processing reliability.
2025-07-18 19:26:29 +00:00
yuanmengqi
d6f2190a9f fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine 2025-07-18 18:51:01 +00:00
yuanmengqi
d52f3b1fca feat: add safe image opening function with retry mechanism
- Introduced a new function `safe_open_image_with_retry` to handle image file opening with retries for truncated or corrupted files.
- Enhanced error handling and logging for image processing in `check_palette_and_structure_sim`.
- Updated the logic to safely open both source and target images, ensuring robust evaluation without altering existing functionality.

These changes improve the reliability of image handling in the GIMP evaluator while maintaining the original code logic.
2025-07-18 18:36:09 +00:00
yuanmengqi
4fa59ebba2 feat: enhance URL comparison logic and Chrome debugging configuration
- Added a new function to ensure URLs have a scheme, defaulting to 'http://' if missing.
- Integrated tldextract to normalize URLs by extracting domain parts and handling 'www' subdomains.
- Updated the compare_urls function to include logging for better traceability during URL comparisons.
- Added tldextract to requirements.txt to support the new functionality.
- Updated the AWS manager with a new AMI ID for the specified resolution.
- Modified Chrome desktop launcher to include --remote-debugging-port=1337 for GUI debugging support.

These changes improve the robustness of URL handling and enable consistent Chrome debugging capabilities without altering existing logic.
2025-07-18 17:55:45 +00:00
yuanmengqi
1ade6fe439 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-18 14:17:37 +00:00
yuanmengqi
44bd66fc9a Increase timeout for page load stability in Chrome evaluator
- Updated the timeout for the page load state from 10 seconds to 60 seconds to ensure better stability during page processing.
- Removed redundant retry mechanisms from the active tab checks to streamline the code while maintaining existing functionality.
- Enhanced logging to provide clearer insights into the page loading process.

These changes aim to improve the reliability of the Chrome evaluator without altering the core logic.
2025-07-18 14:16:16 +00:00
Danyang Zhang
53ffc05042 Calc eval fix (#272)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scalfolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-18 21:28:48 +08:00
Zilong Zhou
7fb1cee575 fix: img path error (#271)
* feat&style: add task status configuration and clear cache functionality; enhance UI styles

* feat&refactor: enhance current configuration API and improve cache clearing logic

* refactor&style: simplify task status update logic and improve page refresh mechanism

* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic

* feat&refactor: add caching to default configuration retrieval and streamline task status logic

* feat&style: add collapsible section for additional model parameters and enhance styling for config items

* refactor&style: remove floating action button and clean up related styles

* fix: update video and screenshot sources to include action space, observation type, and model name parameters
2025-07-18 19:52:03 +08:00