Commit Graph

1160 Commits

Author SHA1 Message Date
yuanmengqi
5d90faa548 run operator 2025-07-14 07:13:17 +00:00
yuanmengqi
b8b026f817 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-13 13:16:48 +00:00
Zilong Zhou
349f2fd9fe Feat/claude cua support (#253)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py
2025-07-13 21:10:49 +08:00
Yuan Mengqi
38a30734a6 Improve code logic for password & resolution (#252)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 21:04:07 +08:00
yuanmengqi
7279469d23 fix chrome tasks 2025-07-13 12:41:27 +00:00
yuanmengqi
572a94b6df Merge branch 'main' into fix_chrome 2025-07-13 10:16:08 +00:00
yuanmengqi
a16b54c175 edit 2025-07-13 10:14:41 +00:00
yuanmengqi
a070ddda7e Improve code logic for password & resolution 2025-07-13 06:59:45 +00:00
yuanmengqi
97ed6f99b0 Final review multi_apps fix the rest part 2025-07-12 20:28:55 +00:00
yuanmengqi
dbecf46057 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-12 16:35:02 +00:00
yuanmengqi
877e75a013 Final review multi_apps fix Xinzhuang part 2025-07-12 16:34:55 +00:00
Yuan Mengqi
27319ce1e3 fix password&resolution (#251)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 00:25:37 +08:00
yuanmengqi
08bbf77511 fix password&resolution 2025-07-12 15:11:42 +00:00
yuanmengqi
37c56533f0 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-11 12:16:23 +00:00
yuanmengqi
fe3bb2fd92 fix password&resolution 2025-07-11 12:15:03 +00:00
yuanmengqi
6f0382c0c2 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-10 22:35:42 +00:00
yuanmengqi
6897e5320d Enhance image text comparison functionality with detailed logging
- Added logging for OCR results and text matching outcomes in compare_image_text function.
- Updated JSON examples to support multiple expected results and improved structure for evaluator functions.
- Enhanced handling of expected text rules to include multiple variations for better matching accuracy.
2025-07-10 22:32:53 +00:00
st2rb8g
61f265a082 fix some multi_apps tasks (#245)
* fix chrome

* fix some multi_apps tasks.

* fix some multiapps tasks

* fix some multiapps tasks

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-11 06:32:13 +08:00
Yan98
4e3446d6fe Fix Name (#249)
* init

* init
2025-07-11 00:15:46 +08:00
Shenzhennan
29caebb765 Impress check and fix (all font compare issue) (#247)
* Enhance PPTX comparison logic in slides.py

- Improved alignment comparison to treat None and LEFT as equivalent.
- Added special handling for font bold and italic properties to consider None and False as equivalent.
- Introduced a new bullet comparison function that allows for minor differences and tolerates formatting variations.
- Updated JSON examples to support multiple file comparisons and results.

* fix all fonts json file f23ac

* fix clean the shape examination in irrelevant part-top position check

* Refactor JSON structure for PPTX comparison

- Updated the instruction formatting for clarity.
- Modified the comparison logic to support multiple expected and result files, enhancing flexibility in evaluations.
- Changed the function key to an array to accommodate multiple comparison functions.
- Introduced a conjunction key to specify logical relationships between comparisons.

* fix impress-e4ef0baf by adding all fonts gold file

* update impress bf4e9888 task ins

* fix impress b8adbc24 font size

* Enhance PPTX comparison functionality in slides.py

- Introduced a debug logger for detailed output during PPTX comparisons.
- Added a new function to recursively retrieve all text shapes, including those within groups.
- Enabled debug logging to provide insights on slide and shape comparisons.
- Updated JSON examples to support multiple expected and result files for enhanced evaluation flexibility.

* Enable debug logging by default in PPTX comparison and enhance debug output for shape mismatches. Updated JSON examples to support multiple expected and result files for improved evaluation consistency.

* fix impress all fonts compare file

* Refactor PPTX comparison logic and JSON examples for height modification tasks

- Added critical notes in slides.py to clarify the execution order of shape examination and height modification checks.
- Updated JSON examples to support multiple expected and result files, enhancing evaluation consistency.
- Ensured that examine_shape must be set to False for examine_modify_height to function correctly, preventing premature termination of comparisons.

* Enhance debug logging in PPTX comparison for detailed font attribute mismatches

- Added debug logging for differences in font color, bold, italic, and underline attributes during table cell comparisons.
- Improved clarity of debug output by including specific slide, shape, and cell indices for mismatches.
- Ensured that existing comparison logic remains intact while enhancing debugging capabilities.

* Enhance debug logging for font attribute mismatches in PPTX comparison

- Added detailed debug logging for font name and size mismatches during PPTX comparisons, including specific slide, shape, and paragraph indices.
- Updated JSON examples to support multiple expected and result files, improving evaluation consistency.
- Maintained existing comparison logic while enhancing debugging capabilities.

* fix impress 3161de json file

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-10 00:36:32 +08:00
Yan98
0a5058342d init (#246) 2025-07-10 00:29:42 +08:00
Yuan Mengqi
093679b90d fix some multi_apps task (#243)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-08 18:59:00 +08:00
yuanmengqi
349c31fa55 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-08 10:45:57 +00:00
yuanmengqi
5778078596 fix some multi_apps tasks 2025-07-08 10:35:47 +00:00
yuanmengqi
8d670df32d fix some multi_apps tasks 2025-07-08 10:31:10 +00:00
Zeyi Sun
b6d9a804fa fix compare_videos in vlc.py (#242)
fix result in the same format of float number.
2025-07-08 16:25:00 +08:00
yuanmengqi
7d0ad02706 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-06 19:38:26 +00:00
yuanmengqi
a68d6f7ab6 Enhance GIMP metrics evaluator with logging and transparency handling
- Replaced print statements with logging for better traceability in gimp.py.
- Added handling for transparent images in structure checks and size evaluations.
- Updated JSON examples to include delays in pyautogui commands for improved execution reliability.
- Changed image URL in example to a more accessible source.
2025-07-06 19:38:22 +00:00
shuyhere
3afc01f1fe fix chrome examples (#240) 2025-07-07 02:25:59 +08:00
MillanK
8facb285a1 VS Code Task Fix (#237)
* vscode fix

* vscode task fix
2025-07-06 16:47:23 +08:00
yuanmengqi
a1891f7d88 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-06 07:52:42 +00:00
yuanmengqi
9be6fcd688 Check and fix on Chrome tasks
- Added `pytz` dependency to `requirements.txt` for timezone handling.
- Introduced `get_macys_product_url_parse` function to replace the old `get_url_path_parse` for better clarity while maintaining backward compatibility.
- Enhanced logging throughout the `get_active_tab_html_parse` and `get_rule_relativeTime` functions for improved debugging and traceability.
- Updated JSON examples to reflect changes in expected keys and added new fields for better evaluation context.
- Removed deprecated execution commands from JSON examples to streamline the evaluation process.
2025-07-06 07:52:37 +00:00
Danyang Zhang
30a346eb68 Merge pull request #239 from xlang-ai/thbd_eval_fix
Thbd eval fix
fix w.r.t. a11y tree check function
2025-07-04 23:24:33 +08:00
zdy023
690f6ed6e7 ver Jul4th
fixed check_accessibility_tree function, updated the namespace
definitions according to the values defined in server/main.py
2025-07-04 23:20:51 +08:00
zdy023
bfa796dc45 Merge branch 'main' into thbd_eval_fix 2025-07-04 23:14:00 +08:00
XXZ
c8a6a22aad Fix VLC task design (#238)
* fix: fix multiapp tasks

* fix: update instructions for VLC evaluation examples

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-04 20:39:48 +08:00
yuanmengqi
8aa686a2a9 fix multiapps 2025-07-04 07:18:15 +00:00
yuanmengqi
66e669b50b fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 2025-07-04 07:14:54 +00:00
Shenzhennan
1b40a458de Impress eval fix (#226)
* fix compare_pptx

* Fix impress-4ed5abd0-8b5d-47bd-839f-cacfa15ca37a eval script: Fix temporarily by ignoring the contaminated  To fix completely, compare source file needs to be updated

* fix impress domain

* fix a53 by changing gold

* fix impress a53

* fix impress b8d origin file

* add table font color check

* fix left pane check

---------

Co-authored-by: chenjix <3107760494@qq.com>
Co-authored-by: moonshot <moonshot@moonshotznshenMacBook-Pro.local>
Co-authored-by: Shen Zhennan <shenzhennan@moonshot.cn>
2025-07-04 13:32:02 +08:00
Zilong Zhou
587f929567 fix: proxy setup (#234) 2025-07-04 13:31:51 +08:00
Zilong Zhou
1308a80029 Update 5990457f-2adb-467b-a4af-5c857c92d762.json (#235) 2025-07-04 13:31:18 +08:00
yuanmengqi
3cd79c9830 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-03 16:57:49 +00:00
yuanmengqi
a651b04e49 Update AWS AMI ID, enhance directory creation logic in file upload, modify osworld service configuration, and refine JSON evaluation examples for improved clarity and functionality. 2025-07-03 16:57:41 +00:00
Danyang Zhang
adc9ad88c2 Thunderbird eval fix (#233)
* ver Jul2nd

updated task requiring setting up a new email account

* ver Jul3rd

fixed several tasks
2025-07-03 21:55:55 +08:00
XXZ
ac24ccce99 fix: fix multiapp tasks (#229)
Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-03 21:53:58 +08:00
yuanmengqi
7b2120c843 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-03 13:50:35 +00:00
yuanmengqi
cb4bed20a0 Refactor compare_python_pure_text function for improved normalization and error handling. Update JSON example to clarify instruction for extracting Python code from Colab, changing output file names for consistency. 2025-07-03 13:50:21 +00:00
Yuan Mengqi
b2fb8b4222 fix chrome tasks (#230)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-03 21:32:41 +08:00
zdy023
3cf80eaab8 ver Jul3rd
fixed several tasks
2025-07-03 20:55:30 +08:00
ChenYXxxx
bdaf37e0e5 fix_os&gimp (#220)
* Update ec4e3f68-9ea4-4c18-a5c9-69f89d1178b3.json

* Update c288e301-e626-4b98-a1ab-159dcb162af5.json

* Update 3ce045a0-877b-42aa-8d2c-b4a863336ab8.json

* Update b3d4a89c-53f2-4d6b-8b6a-541fb5d205fa.json

* Update 2e6f678f-472d-4c55-99cc-8e7c5c402a71.json

Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually.

* Update 5ca86c6f-f317-49d8-b6a7-b527541caae8.json

* Update a746add2-cab0-4740-ac36-c3769d9bfb46.json

* Update a746add2-cab0-4740-ac36-c3769d9bfb46.json

* Update 62f7fd55-0687-4a43-b6e1-3eda16fc6252.json

* Update d52d6308-ec58-42b7-a2c9-de80e4837b2b.json

* Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json

* Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json

* Update 58d3eeeb-e9d0-499f-962e-fd0db2a744d8.json
2025-07-03 16:59:05 +08:00