Commit Graph

1157 Commits

Author SHA1 Message Date
yuanmengqi
e433f35c1f feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field

This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
2025-07-16 13:45:34 +00:00
yuanmengqi
b9df320f31 Enhance check_python_file_by_test_suite function with robust error handling and logging. Added validation for file existence, module loading, and function execution. Improved resource cleanup and working directory management to ensure stability and reliability during test execution. 2025-07-16 11:44:46 +00:00
Xinyuan Wang
0f2655249c Wxy/opencua (#260)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines
2025-07-16 17:53:12 +08:00
yuanmengqi
5e5058c1f2 fix: standardize provider interface parameters across all implementations
- Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080)
- Add os_type parameter to start_emulator() for Azure and VirtualBox providers
- Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers
- Use *args, **kwargs for better extensibility and parameter consistency
- Add documentation comments explaining ignored parameters for interface consistency
- Prevents TypeError exceptions when AWS-specific parameters are passed to other providers

This ensures all providers can handle the same parameter sets while maintaining
backward compatibility and avoiding interface fragmentation.
2025-07-15 21:38:34 +00:00
yuanmengqi
7912880d16 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-15 07:24:38 +00:00
yuanmengqi
451bbf5fc2 Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example. 2025-07-15 07:24:33 +00:00
Yuan Mengqi
af47ed8fb1 fix infeasible&chrome tasks (#258)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

* Merge branch 'fix_chrome'

* fix insensible&chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-15 13:02:42 +08:00
yuanmengqi
68a9f647f4 fix: address https://github.com/xlang-ai/OSWorld/issues/257 by implement fix for PyAutoGUI '<' character bug in command execution. Introduced a new function to handle typewrite and press calls, ensuring correct behavior when using '<' in commands. Updated command execution logic to apply this fix before executing user commands. 2025-07-15 04:17:34 +00:00
ChenYXxxx
698483390a "Could you turn my image into CYMK mode?" add "within GIMP" 2025-07-14 23:53:20 +08:00
ChenYXxxx
9242becd87 "Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP" 2025-07-14 23:52:41 +08:00
ChenYXxxx
56b2fe9cc4 Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json 2025-07-14 23:23:11 +08:00
ChenYXxxx
b481c794c5 Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json 2025-07-14 23:22:50 +08:00
ChenYXxxx
7f973a391c Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json 2025-07-14 23:22:14 +08:00
shenzhennan
7f96cc0633 Merge branch 'main' of https://github.com/xlang-ai/OSWorld 2025-07-14 12:35:00 +00:00
shenzhennan
53983db9cb fix impress eval : extending sleep time to ensure save 2025-07-14 12:34:43 +00:00
Xinyuan Wang
db83b9cb2c Wxy/opencua (#256)
* OpenCUA Agent code base

* update url

* debug, modify url input
2025-07-14 20:26:39 +08:00
Danyang Zhang
2339db20ca ver Jul7th (#255)
pip-installing directly from PyPI fails misteriously in postconfig
execution, possible owing to proxy configuration in the VM, adjusted
strategy by downloading the wheel on host and pip-installing it locally
on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77
2025-07-14 20:26:29 +08:00
shenzhennan
60e26d2d0d fix impress compare use gold file 2025-07-14 11:35:06 +00:00
Zilong Zhou
74b7c189af Feat/monitor (#254)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py

* fix: update logger usage to use global logger and improve error handling

* feat&fix: add configuration management API endpoints and update UI for configuration selection

* feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness

* feat&fix: add configuration toggle button in UI and improve task loading performance

* feat&fix: add accuracy percentage display to score and style updates for UI
2025-07-14 13:43:41 +08:00
yuanmengqi
0651495d88 fix: Enhance error handling and logging across multiple evaluators
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
2025-07-14 05:43:17 +00:00
Zilong Zhou
349f2fd9fe Feat/claude cua support (#253)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py
2025-07-13 21:10:49 +08:00
Yuan Mengqi
38a30734a6 Improve code logic for password & resolution (#252)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 21:04:07 +08:00
yuanmengqi
97ed6f99b0 Final review multi_apps fix the rest part 2025-07-12 20:28:55 +00:00
yuanmengqi
dbecf46057 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-12 16:35:02 +00:00
yuanmengqi
877e75a013 Final review multi_apps fix Xinzhuang part 2025-07-12 16:34:55 +00:00
Yuan Mengqi
27319ce1e3 fix password&resolution (#251)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 00:25:37 +08:00
yuanmengqi
6f0382c0c2 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-10 22:35:42 +00:00
yuanmengqi
6897e5320d Enhance image text comparison functionality with detailed logging
- Added logging for OCR results and text matching outcomes in compare_image_text function.
- Updated JSON examples to support multiple expected results and improved structure for evaluator functions.
- Enhanced handling of expected text rules to include multiple variations for better matching accuracy.
2025-07-10 22:32:53 +00:00
st2rb8g
61f265a082 fix some multi_apps tasks (#245)
* fix chrome

* fix some multi_apps tasks.

* fix some multiapps tasks

* fix some multiapps tasks

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-11 06:32:13 +08:00
Yan98
4e3446d6fe Fix Name (#249)
* init

* init
2025-07-11 00:15:46 +08:00
Shenzhennan
29caebb765 Impress check and fix (all font compare issue) (#247)
* Enhance PPTX comparison logic in slides.py

- Improved alignment comparison to treat None and LEFT as equivalent.
- Added special handling for font bold and italic properties to consider None and False as equivalent.
- Introduced a new bullet comparison function that allows for minor differences and tolerates formatting variations.
- Updated JSON examples to support multiple file comparisons and results.

* fix all fonts json file f23ac

* fix clean the shape examination in unrelevatn part-top position check

* Refactor JSON structure for PPTX comparison

- Updated the instruction formatting for clarity.
- Modified the comparison logic to support multiple expected and result files, enhancing flexibility in evaluations.
- Changed the function key to an array to accommodate multiple comparison functions.
- Introduced a conjunction key to specify logical relationships between comparisons.

* fix impress-e4ef0baf by adding all fonts gold file

* update impress bf4e9888 task ins

* fix impress b8adbc24 font size

* Enhance PPTX comparison functionality in slides.py

- Introduced a debug logger for detailed output during PPTX comparisons.
- Added a new function to recursively retrieve all text shapes, including those within groups.
- Enabled debug logging to provide insights on slide and shape comparisons.
- Updated JSON examples to support multiple expected and result files for enhanced evaluation flexibility.

* Enable debug logging by default in PPTX comparison and enhance debug output for shape mismatches. Updated JSON examples to support multiple expected and result files for improved evaluation consistency.

* fix impress all fons compare file

* Refactor PPTX comparison logic and JSON examples for height modification tasks

- Added critical notes in slides.py to clarify the execution order of shape examination and height modification checks.
- Updated JSON examples to support multiple expected and result files, enhancing evaluation consistency.
- Ensured that examine_shape must be set to False for examine_modify_height to function correctly, preventing premature termination of comparisons.

* Enhance debug logging in PPTX comparison for detailed font attribute mismatches

- Added debug logging for differences in font color, bold, italic, and underline attributes during table cell comparisons.
- Improved clarity of debug output by including specific slide, shape, and cell indices for mismatches.
- Ensured that existing comparison logic remains intact while enhancing debugging capabilities.

* Enhance debug logging for font attribute mismatches in PPTX comparison

- Added detailed debug logging for font name and size mismatches during PPTX comparisons, including specific slide, shape, and paragraph indices.
- Updated JSON examples to support multiple expected and result files, improving evaluation consistency.
- Maintained existing comparison logic while enhancing debugging capabilities.

* fix impress 3161de json file

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-10 00:36:32 +08:00
Yan98
0a5058342d init (#246) 2025-07-10 00:29:42 +08:00
Yuan Mengqi
093679b90d fix some multi_apps task (#243)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-08 18:59:00 +08:00
Zeyi Sun
b6d9a804fa fix compare_videos in vlc.py (#242)
fix result in the same format of float number.
2025-07-08 16:25:00 +08:00
yuanmengqi
7d0ad02706 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-06 19:38:26 +00:00
yuanmengqi
a68d6f7ab6 Enhance GIMP metrics evaluator with logging and transparency handling
- Replaced print statements with logging for better traceability in gimp.py.
- Added handling for transparent images in structure checks and size evaluations.
- Updated JSON examples to include delays in pyautogui commands for improved execution reliability.
- Changed image URL in example to a more accessible source.
2025-07-06 19:38:22 +00:00
shuyhere
3afc01f1fe fix chrome examples (#240) 2025-07-07 02:25:59 +08:00
MillanK
8facb285a1 VS Code Task Fix (#237)
* vscode fix

* vscode task fix
2025-07-06 16:47:23 +08:00
yuanmengqi
a1891f7d88 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-06 07:52:42 +00:00
yuanmengqi
9be6fcd688 Check and fix on Chrome tasks
- Added `pytz` dependency to `requirements.txt` for timezone handling.
- Introduced `get_macys_product_url_parse` function to replace the old `get_url_path_parse` for better clarity and maintain backward compatibility.
- Enhanced logging throughout the `get_active_tab_html_parse` and `get_rule_relativeTime` functions for improved debugging and traceability.
- Updated JSON examples to reflect changes in expected keys and added new fields for better evaluation context.
- Removed deprecated execution commands from JSON examples to streamline the evaluation process.
2025-07-06 07:52:37 +00:00
Danyang Zhang
30a346eb68 Merge pull request #239 from xlang-ai/thbd_eval_fix
Thbd eval fix
fix w.r.t. a11y tree check function
2025-07-04 23:24:33 +08:00
zdy023
690f6ed6e7 ver Jul4th
fixed check_accessibility_tree function, updated the namespace
definitons according the values defined in server/main.py
2025-07-04 23:20:51 +08:00
zdy023
bfa796dc45 Merge branch 'main' into thbd_eval_fix 2025-07-04 23:14:00 +08:00
XXZ
c8a6a22aad Fix VLC task design (#238)
* fix: fix multiapp tasks

* fix: update instructions for VLC evaluation examples

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-04 20:39:48 +08:00
Shenzhennan
1b40a458de Impress eval fix (#226)
* fix compare_pptx

* Fix impress-4ed5abd0-8b5d-47bd-839f-cacfa15ca37a eval script:Fix temporarily by ignoring the contaminated  To fix completely, compare source file needs to be updated

* fix impress domain

* fix a53 by changing gold

* fix impress a53

* fix impress b8d origin file

* add table font color check

* fix left pane check

---------

Co-authored-by: chenjix <3107760494@qq.com>
Co-authored-by: moonshot <moonshot@moonshotznshenMacBook-Pro.local>
Co-authored-by: Shen Zhennan <shenzhennan@moonshot.cn>
2025-07-04 13:32:02 +08:00
Zilong Zhou
587f929567 fix: proxy setup (#234) 2025-07-04 13:31:51 +08:00
Zilong Zhou
1308a80029 Update 5990457f-2adb-467b-a4af-5c857c92d762.json (#235) 2025-07-04 13:31:18 +08:00
yuanmengqi
3cd79c9830 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-03 16:57:49 +00:00
yuanmengqi
a651b04e49 Update AWS AMI ID, enhance directory creation logic in file upload, modify osworld service configuration, and refine JSON evaluation examples for improved clarity and functionality. 2025-07-03 16:57:41 +00:00
Danyang Zhang
adc9ad88c2 Thunderbird eval fix (#233)
* ver Jul2nd

updated task requiring set up new email account

* ver Jul3rd

fixed several tasks
2025-07-03 21:55:55 +08:00