Commit Graph

1215 Commits

Author SHA1 Message Date
yuanmengqi
0fb625e4fd Update instruction in OS evaluation example to include a restriction against shutting down the machine. 2025-07-18 05:28:43 +00:00
yuanmengqi
b9a646c11d Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-17 18:19:19 +00:00
yuanmengqi
d1ddd3eacd feat: enhance VM wallpaper retrieval and image similarity checks
- Added logging to the VM wallpaper retrieval function to capture errors and warnings related to content retrieval and file creation.
- Implemented checks for None, empty, and invalid content types to ensure robustness in wallpaper handling.
- Enhanced the SSIM structure check function with size validation and improved error handling for image processing.
- Added logging for image size discrepancies and exceptions during SSIM computation to aid in debugging.

These changes improve error handling and logging, ensuring better maintainability and reliability of the evaluators.
2025-07-17 18:19:09 +00:00
Zilong Zhou
66694c663d Feat/monitor cache (#267)
* feat&style: add task status configuration and clear cache functionality; enhance UI styles

* feat&refactor: enhance current configuration API and improve cache clearing logic

* refactor&style: simplify task status update logic and improve page refresh mechanism

* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic

* feat&refactor: add caching to default configuration retrieval and streamline task status logic

* feat&style: add collapsible section for additional model parameters and enhance styling for config items

* refactor&style: remove floating action button and clean up related styles
2025-07-18 01:58:20 +08:00
yuanmengqi
e70cf0bd93 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-17 11:15:53 +00:00
yuanmengqi
9d04624e41 feat: enhance Chrome evaluator with improved retry logic and logging
- Implemented retry mechanism for connecting to Chrome instances, allowing up to two attempts before failure.
- Increased timeout settings for page navigation and loading to enhance reliability.
- Added detailed logging for connection attempts, page loading status, and error handling to improve debugging and user experience.
- Ensured existing logic is preserved while enhancing error handling and operational robustness.

These changes improve the overall reliability and maintainability of the Chrome evaluator functions.
2025-07-17 11:15:47 +00:00
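The retry behaviour described in 9d04624e41 (up to two connection attempts, with logging for each one) can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: `connect_with_retry`, its parameters, and the `connect` callable are hypothetical names introduced here.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Call `connect` until it succeeds or `max_attempts` is exhausted.

    Hypothetical helper: `connect` stands in for whatever opens the
    connection to the Chrome instance.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            logger.info("Connection attempt %d/%d", attempt, max_attempts)
            return connect()
        except Exception as exc:
            last_error = exc
            logger.warning("Attempt %d failed: %s", attempt, exc)
            # Only wait if another attempt is still to come.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error
```

Chaining the final error with `from last_error` keeps the original exception visible in the traceback, which fits the commit's emphasis on easier debugging.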
yuanmengqi
2c51950e73 feat: enhance evaluator configuration for Chrome with post-execution commands
- Added postconfig commands to multiple JSON files for Chrome evaluation examples.
- Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing.
- Updated logging messages in the AWS manager to improve clarity and user experience.

These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
2025-07-17 10:50:10 +00:00
Yuan Mengqi
5ca516ac7a add uitars agent code (#265) 2025-07-17 18:17:13 +08:00
Xinyuan Wang
24fbad9015 Merge pull request #264 from yuanmengqi/main
Improve the parallel logic
2025-07-17 12:28:48 +08:00
yuanmengqi
fe40011b5d Improve the parallel logic 2025-07-17 04:21:42 +00:00
yuanmengqi
6788c58aa3 Improve the parallel logic 2025-07-17 04:20:59 +00:00
yuanmengqi
bb8b0b2582 Improve the parallel logic 2025-07-17 04:19:44 +00:00
yuanmengqi
150234307e Merge branch 'fix_chrome' 2025-07-17 04:14:47 +00:00
yuanmengqi
9eeabfc52d Improve the parallel logic 2025-07-17 04:14:20 +00:00
yuanmengqi
2a48500691 refactor: improve code readability and error handling in table.py
- Reformatted import statements for better organization.
- Enhanced error handling in file reading functions to provide clearer logging.
- Introduced a new `_safe_read_file` function to handle multiple encoding attempts when reading files.
- Updated the `compare_csv` function to utilize the new file reading method, improving robustness.
- Ensured consistent return values across functions, replacing `0.` with `0.0` for clarity.

These changes maintain existing logic while improving code maintainability and readability.
2025-07-16 18:11:05 +00:00
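A file reader that tries several encodings in turn, as 2a48500691 describes for `_safe_read_file`, might look something like the sketch below. The real function's signature and encoding list are not shown in the log, so `read_text_with_fallback` and its default encodings are assumptions made for illustration.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def read_text_with_fallback(
    path: str,
    encodings: Sequence[str] = ("utf-8", "latin-1"),  # assumed defaults
) -> Optional[str]:
    """Return the file's text using the first encoding that decodes it.

    Returns None if the file cannot be opened or no encoding works.
    """
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read()
        except UnicodeDecodeError:
            # Wrong encoding: fall through to the next candidate.
            logger.debug("Could not decode %s as %s", path, encoding)
        except OSError as exc:
            # Missing or unreadable file: another encoding will not help.
            logger.error("Could not read %s: %s", path, exc)
            return None
    logger.error("No candidate encoding could decode %s", path)
    return None
```

Note that `latin-1` accepts any byte sequence, so placing it last makes it a catch-all; whether that is desirable depends on how strict the comparison needs to be.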
yuanmengqi
a9c1c6135a Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-16 17:38:39 +00:00
yuanmengqi
0939226020 feat: enhance evaluator configuration with post-execution commands for Chrome
- Added a series of postconfig commands to the evaluator section in the JSON file.
- Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs.
- Introduced sleep intervals to ensure proper execution timing between commands.

This update improves the automation capabilities of the evaluation examples while maintaining existing logic.
2025-07-16 17:37:37 +00:00
Zilong Zhou
dc164d5269 feat&fix: update configuration management to save model arguments and enhance UI display for model args (#262) 2025-07-16 21:46:35 +08:00
yuanmengqi
e433f35c1f feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field

This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
2025-07-16 13:45:34 +00:00
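The schema change in e433f35c1f amounts to setting `fixed_ip` on every example and adding `possibility_of_env_change` only where it is missing. A minimal sketch, assuming each example is a JSON object with an `id` key and that the set of fixed-IP task ids is supplied by the caller (`standardize_example` and `fixed_ip_ids` are hypothetical names):

```python
from typing import Any, Dict, Set


def standardize_example(config: Dict[str, Any], fixed_ip_ids: Set[str]) -> Dict[str, Any]:
    """Return a copy of `config` with both standard fields present."""
    updated = dict(config)  # leave the caller's dict untouched
    # fixed_ip is true only for the listed tasks (assumes an "id" key).
    updated["fixed_ip"] = updated.get("id") in fixed_ip_ids
    # Keep an existing value; default to "low" when the field is absent.
    updated.setdefault("possibility_of_env_change", "low")
    return updated
```

Using `setdefault` is what preserves the existing "medium" and "high" values mentioned in the commit message.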
yuanmengqi
b9df320f31 Enhance check_python_file_by_test_suite function with robust error handling and logging. Added validation for file existence, module loading, and function execution. Improved resource cleanup and working directory management to ensure stability and reliability during test execution. 2025-07-16 11:44:46 +00:00
Xinyuan Wang
0f2655249c Wxy/opencua (#260)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines
2025-07-16 17:53:12 +08:00
yuanmengqi
5e5058c1f2 fix: standardize provider interface parameters across all implementations
- Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080)
- Add os_type parameter to start_emulator() for Azure and VirtualBox providers
- Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers
- Use *args, **kwargs for better extensibility and parameter consistency
- Add documentation comments explaining ignored parameters for interface consistency
- Prevents TypeError exceptions when AWS-specific parameters are passed to other providers

This ensures all providers can handle the same parameter sets while maintaining
backward compatibility and avoiding interface fragmentation.
2025-07-15 21:38:34 +00:00
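The idea behind 5e5058c1f2 is that a provider accepts parameters it does not need, so a caller can pass the same arguments to every provider without triggering a TypeError. The sketch below illustrates that pattern only: the class names, the `path_to_vm` and `headless` parameters, and the returned path are hypothetical, while `get_vm_path`, `start_emulator`, `stop_emulator`, `screen_size`, `os_type`, and `region` are the names mentioned in the commit message.

```python
from typing import Any, List, Optional, Tuple


class SketchLocalProvider:
    """Hypothetical provider with no notion of cloud regions."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def start_emulator(self, path_to_vm: str, headless: bool = True,
                       os_type: Optional[str] = None, *args: Any, **kwargs: Any) -> None:
        # os_type is accepted for interface consistency and ignored here.
        self.events.append(f"start:{path_to_vm}")

    def stop_emulator(self, path_to_vm: str, region: Optional[str] = None,
                      *args: Any, **kwargs: Any) -> None:
        # region only matters to cloud providers; ignored here.
        self.events.append(f"stop:{path_to_vm}")


class SketchLocalVMManager:
    """Hypothetical manager showing the screen_size default."""

    def get_vm_path(self, os_type: str = "Ubuntu", region: Optional[str] = None,
                    screen_size: Tuple[int, int] = (1920, 1080),
                    **kwargs: Any) -> str:
        # region and screen_size are accepted but unused in this sketch.
        return f"/vms/{os_type}.vmx"
```

With these signatures, a call such as `provider.stop_emulator("vm", region="us-east-1")` succeeds even though the local provider has no use for the region.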
yuanmengqi
cb070307ee merge code 2025-07-15 14:57:14 +00:00
yuanmengqi
175b4b46c2 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-15 14:50:48 +00:00
yuanmengqi
7912880d16 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-15 07:24:38 +00:00
yuanmengqi
451bbf5fc2 Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example. 2025-07-15 07:24:33 +00:00
Yuan Mengqi
af47ed8fb1 fix infeasible&chrome tasks (#258)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

* Merge branch 'fix_chrome'

* fix infeasible&chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-15 13:02:42 +08:00
yuanmengqi
68a9f647f4 fix: address https://github.com/xlang-ai/OSWorld/issues/257 by implementing a fix for the PyAutoGUI '<' character bug in command execution. Introduced a new function to handle typewrite and press calls, ensuring correct behavior when using '<' in commands. Updated command execution logic to apply this fix before executing user commands. 2025-07-15 04:17:34 +00:00
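One way to apply such a fix before execution is to rewrite the command string so that '<' is sent as a key combination instead of a literal character. The sketch below is an assumption about the general approach, not the repository's function: `rewrite_less_than` is a hypothetical name, it only handles simple string literals without escapes, and it assumes a US keyboard layout where '<' is Shift plus comma. It only transforms text and never imports or calls PyAutoGUI itself.

```python
import re

# Replacement statement; assumes '<' is Shift+',' (US layout).
_HOTKEY = 'pyautogui.hotkey("shift", ",")'

# pyautogui.press('<') or pyautogui.press("<")
_PRESS_LT = re.compile(r"""pyautogui\.press\(\s*(['"])<\1\s*\)""")

# pyautogui.typewrite('...') with a plain literal (no quotes or backslashes inside)
_TYPEWRITE = re.compile(r"""pyautogui\.typewrite\(\s*(['"])([^'"\\]*)\1\s*\)""")


def _expand_typewrite(match: "re.Match[str]") -> str:
    text = match.group(2)
    if "<" not in text:
        return match.group(0)  # nothing to fix
    statements = []
    for index, chunk in enumerate(text.split("<")):
        if index > 0:
            statements.append(_HOTKEY)  # stands in for the '<' itself
        if chunk:
            statements.append(f"pyautogui.typewrite({chunk!r})")
    return "; ".join(statements)


def rewrite_less_than(command: str) -> str:
    """Rewrite press/typewrite calls so '<' is sent via a hotkey."""
    command = _PRESS_LT.sub(_HOTKEY, command)
    return _TYPEWRITE.sub(_expand_typewrite, command)
```

For example, `rewrite_less_than('pyautogui.typewrite("a<b")')` yields three statements: type `a`, press the hotkey, type `b`.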
yuanmengqi
8e18b7839a fix infeasible&chrome tasks 2025-07-15 02:19:27 +00:00
yuanmengqi
1bf0730dce Merge remote-tracking branch 'upstream/main' 2025-07-15 02:14:41 +00:00
yuanmengqi
756ef96850 Merge branch 'fix_chrome' 2025-07-15 02:13:58 +00:00
yuanmengqi
08b4cf2c2f fix infeasible&chrome tasks 2025-07-15 02:09:40 +00:00
ChenYXxxx
698483390a "Could you turn my image into CYMK mode?" add "within GIMP" 2025-07-14 23:53:20 +08:00
ChenYXxxx
9242becd87 "Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP" 2025-07-14 23:52:41 +08:00
ChenYXxxx
56b2fe9cc4 Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json 2025-07-14 23:23:11 +08:00
ChenYXxxx
b481c794c5 Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json 2025-07-14 23:22:50 +08:00
ChenYXxxx
7f973a391c Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json 2025-07-14 23:22:14 +08:00
shenzhennan
7f96cc0633 Merge branch 'main' of https://github.com/xlang-ai/OSWorld 2025-07-14 12:35:00 +00:00
shenzhennan
53983db9cb fix impress eval: extending sleep time to ensure save 2025-07-14 12:34:43 +00:00
Xinyuan Wang
db83b9cb2c Wxy/opencua (#256)
* OpenCUA Agent code base

* update url

* debug, modify url input
2025-07-14 20:26:39 +08:00
Danyang Zhang
2339db20ca ver Jul7th (#255)
pip-installing directly from PyPI fails mysteriously in postconfig
execution, possibly owing to proxy configuration in the VM; adjusted
strategy by downloading the wheel on host and pip-installing it locally
on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77
2025-07-14 20:26:29 +08:00
shenzhennan
60e26d2d0d fix impress compare use gold file 2025-07-14 11:35:06 +00:00
yuanmengqi
90c4e894a4 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-14 07:14:19 +00:00
yuanmengqi
5d90faa548 run operator 2025-07-14 07:13:17 +00:00
Zilong Zhou
74b7c189af Feat/monitor (#254)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py

* fix: update logger usage to use global logger and improve error handling

* feat&fix: add configuration management API endpoints and update UI for configuration selection

* feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness

* feat&fix: add configuration toggle button in UI and improve task loading performance

* feat&fix: add accuracy percentage display to score and style updates for UI
2025-07-14 13:43:41 +08:00
yuanmengqi
0651495d88 fix: Enhance error handling and logging across multiple evaluators
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
2025-07-14 05:43:17 +00:00
yuanmengqi
b8b026f817 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-13 13:16:48 +00:00
yuanmengqi
7c807d4f3e Merge remote-tracking branch 'upstream/main' 2025-07-13 13:12:26 +00:00
Zilong Zhou
349f2fd9fe Feat/claude cua support (#253)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py
2025-07-13 21:10:49 +08:00
Yuan Mengqi
38a30734a6 Improve code logic for password & resolution (#252)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 21:04:07 +08:00