Commit Graph

491 Commits

Author SHA1 Message Date
yuanmengqi
9be6fcd688 Check and fix on Chrome tasks
- Added `pytz` dependency to `requirements.txt` for timezone handling.
- Introduced `get_macys_product_url_parse` function to replace the old `get_url_path_parse` for better clarity while maintaining backward compatibility.
- Enhanced logging throughout the `get_active_tab_html_parse` and `get_rule_relativeTime` functions for improved debugging and traceability.
- Updated JSON examples to reflect changes in expected keys and added new fields for better evaluation context.
- Removed deprecated execution commands from JSON examples to streamline the evaluation process.
2025-07-06 07:52:37 +00:00
Shenzhennan
1b40a458de Impress eval fix (#226)
* fix compare_pptx

* Fix impress-4ed5abd0-8b5d-47bd-839f-cacfa15ca37a eval script: fix temporarily by ignoring the contaminated. To fix completely, the compare source file needs to be updated

* fix impress domain

* fix a53 by changing gold

* fix impress a53

* fix impress b8d origin file

* add table font color check

* fix left pane check

---------

Co-authored-by: chenjix <3107760494@qq.com>
Co-authored-by: moonshot <moonshot@moonshotznshenMacBook-Pro.local>
Co-authored-by: Shen Zhennan <shenzhennan@moonshot.cn>
2025-07-04 13:32:02 +08:00
Zilong Zhou
1308a80029 Update 5990457f-2adb-467b-a4af-5c857c92d762.json (#235) 2025-07-04 13:31:18 +08:00
yuanmengqi
3cd79c9830 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-03 16:57:49 +00:00
yuanmengqi
a651b04e49 Update AWS AMI ID, enhance directory creation logic in file upload, modify osworld service configuration, and refine JSON evaluation examples for improved clarity and functionality. 2025-07-03 16:57:41 +00:00
Danyang Zhang
adc9ad88c2 Thunderbird eval fix (#233)
* ver Jul2nd

updated task requiring set up new email account

* ver Jul3rd

fixed several tasks
2025-07-03 21:55:55 +08:00
XXZ
ac24ccce99 fix: fix multiapp tasks (#229)
Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-03 21:53:58 +08:00
yuanmengqi
7b2120c843 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-03 13:50:35 +00:00
yuanmengqi
cb4bed20a0 Refactor compare_python_pure_text function for improved normalization and error handling. Update JSON example to clarify instruction for extracting Python code from Colab, changing output file names for consistency. 2025-07-03 13:50:21 +00:00
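As a rough illustration of the kind of normalization and error handling this commit message describes, the sketch below unifies line endings and strips trailing whitespace before comparing two texts. The names `normalize_text` and `texts_match` and the 1.0/0.0 scoring are assumptions made for illustration; this is not the repository's actual `compare_python_pure_text` implementation.

```python
def normalize_text(text: str) -> str:
    """Unify line endings, strip trailing whitespace, drop outer blank lines."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    stripped = [line.rstrip() for line in lines]
    return "\n".join(stripped).strip("\n")


def texts_match(result, expected) -> float:
    """Return 1.0 if the normalized texts are identical, else 0.0 (assumed scoring)."""
    try:
        return 1.0 if normalize_text(result) == normalize_text(expected) else 0.0
    except AttributeError:
        # Non-string input (for example None) counts as a mismatch, not a crash.
        return 0.0
```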
Yuan Mengqi
b2fb8b4222 fix chrome tasks (#230)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-03 21:32:41 +08:00
ChenYXxxx
bdaf37e0e5 fix_os&gimp (#220)
* Update ec4e3f68-9ea4-4c18-a5c9-69f89d1178b3.json

* Update c288e301-e626-4b98-a1ab-159dcb162af5.json

* Update 3ce045a0-877b-42aa-8d2c-b4a863336ab8.json

* Update b3d4a89c-53f2-4d6b-8b6a-541fb5d205fa.json

* Update 2e6f678f-472d-4c55-99cc-8e7c5c402a71.json

Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually.

* Update 5ca86c6f-f317-49d8-b6a7-b527541caae8.json

* Update a746add2-cab0-4740-ac36-c3769d9bfb46.json

* Update a746add2-cab0-4740-ac36-c3769d9bfb46.json

* Update 62f7fd55-0687-4a43-b6e1-3eda16fc6252.json

* Update d52d6308-ec58-42b7-a2c9-de80e4837b2b.json

* Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json

* Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json

* Update 58d3eeeb-e9d0-499f-962e-fd0db2a744d8.json
2025-07-03 16:59:05 +08:00
Tianbao Xie
bba367b8bc fix: fix multiapps tasks (#231)
* Update JSON example for multi_apps: change snapshot name and specify presenter in instructions for clarity.

* Enhance PDF image comparison in chrome.py by adding existence checks for input files and improving image extraction logic. Introduce image hashing for similarity scoring with a configurable threshold. Update docs.py to support fuzzy matching in DOCX file comparisons, allowing for similarity scoring based on text content. Modify example JSON to enable fuzzy matching option.

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-03 16:58:43 +08:00
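The fuzzy matching with a similarity threshold mentioned in this commit can be sketched with the standard library's `difflib.SequenceMatcher`. This is a minimal sketch under assumptions: the function name `fuzzy_text_score`, the default threshold of 0.9, and the choice of `difflib` are all hypothetical, and the actual `docs.py` code may use a different similarity measure.

```python
from difflib import SequenceMatcher


def fuzzy_text_score(result_text: str, expected_text: str, threshold: float = 0.9) -> float:
    """Return the similarity ratio if it reaches the threshold, else 0.0."""
    ratio = SequenceMatcher(None, result_text, expected_text).ratio()
    # Below the (assumed) threshold the texts are treated as not matching at all.
    return ratio if ratio >= threshold else 0.0
```

A caller could lower `threshold` for tasks where small wording differences are acceptable and raise it where near-exact text is required.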
Tianbao Xie
e9c657b714 fix: Libreoffice writer fix (#232)
* Refactor LibreOffice Writer example JSON to support multiple expected and result files for line spacing comparison, enhancing evaluation flexibility. Updated function calls and added additional expected file paths.

* Update source link in LibreOffice Writer example JSON to a more relevant help page for inserting tables, improving instructional clarity.

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-03 16:58:08 +08:00
Zilong Zhou
595a704aff fix: fix proxy setup (#227)
* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example
2025-07-02 01:36:32 +08:00
Danyang Zhang
d4273d992e Calc eval fix (#225)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scaffolds

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-30 18:23:09 +08:00
Tianbao Xie
30138c5db1 VLC fix (#224)
* Enhance SetupController with improved logging and error handling during setup and file upload processes. Update instance type to t3.xlarge and AMI ID for AWS configuration. Add download progress logging and exception handling for better debugging.

* Enhance VLC status evaluation by adding multiple paths for file and URL information extraction, improving robustness against varying VLC XML structures. Implement detailed logging for better debugging and error handling in case of mismatches or missing data. Update example JSON for VLC evaluation to use a valid HLS stream URL.

* Improve audio comparison robustness in VLC evaluator by adding error handling for audio file loading and extraction. Implement detailed logging for empty or corrupt files, and normalize DTW distance calculation for more accurate similarity scoring. Remove deprecated audio fingerprint comparison function.

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-29 20:18:44 +08:00
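To make the idea of a normalized DTW (dynamic time warping) distance concrete, here is a small pure-Python sketch. The normalization by the combined sequence length and the mapping to a similarity in (0, 1] are assumptions chosen for illustration; the VLC evaluator's actual feature extraction and normalization are not shown in this log.

```python
def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        # Empty input (for example from a corrupt audio file) is rejected explicitly.
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def normalized_dtw_similarity(a, b):
    """Divide the DTW distance by len(a) + len(b), then map it into (0, 1]."""
    distance = dtw_distance(a, b) / (len(a) + len(b))
    return 1.0 / (1.0 + distance)
```

Dividing by the combined length keeps long recordings from being penalized merely for having more frames.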
Tianbao Xie
0cc93543a8 Environment is_used flag; OS domain fix (#219)
* Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix.

* More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solution.

* Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files.

* Clean debug code

* Enhance DesktopEnv to track environment usage for optimized snapshot management. Introduce is_environment_used flag to determine if a snapshot revert is necessary based on provider type. Update setup and step methods to mark environment usage appropriately. Add new execute_with_verification method in SetupController for command execution with result verification, improving reliability. Change AWS instance type to m5.large for better performance and update AMI ID for compatibility. Update file opening logic in main.py to handle both file paths and application commands more effectively.

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-28 00:45:53 +08:00
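The usage-tracking idea in this commit can be sketched as follows: a flag records whether the environment has been touched, and a reset skips the snapshot revert when nothing has changed. This is a hedged sketch only. The class name `DesktopEnvSketch`, the `SKIP_WHEN_UNUSED` set, and the rule that only the "aws" provider may skip the revert are assumptions, not the real `DesktopEnv` logic.

```python
class DesktopEnvSketch:
    # Assumption: providers for which an unused environment need not be reverted.
    SKIP_WHEN_UNUSED = {"aws"}

    def __init__(self, provider_name, revert_snapshot):
        self.provider_name = provider_name
        self._revert_snapshot = revert_snapshot  # callable that performs the revert
        self.is_environment_used = False

    def step(self, action):
        # Any action may change VM state, so mark the environment as used.
        self.is_environment_used = True

    def reset(self):
        """Revert only when needed; return True if a revert was performed."""
        must_revert = (
            self.is_environment_used
            or self.provider_name not in self.SKIP_WHEN_UNUSED
        )
        if must_revert:
            self._revert_snapshot()
        self.is_environment_used = False
        return must_revert
```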
MillanK
48ac57697a VSCode fix (#222) 2025-06-24 17:08:09 +08:00
Tianbao Xie
4e11eafd1d Robust Evaluation, Blocking File Open, Grader Sensitivity, and LibreOffice Writer Fixes (#217)
* Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix.

* More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solution.

* Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files.

* Clean debug code

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-16 21:37:19 +08:00
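A configurable retry limit of the kind this commit adds to `SetupController` might look like the sketch below. The helper name `call_with_retries` and its parameters are hypothetical and chosen only for illustration; the actual controller code is not shown in this log.

```python
import time


def call_with_retries(func, max_retries=3, delay_seconds=0.0):
    """Call func() up to max_retries times, re-raising after the final failure."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose: this is a generic sketch
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error
```

Chaining the last exception with `from` keeps the original error visible in the traceback, which helps the kind of debugging the commit message mentions.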
yuanmengqi
2bae228803 merge upstream 2025-06-10 13:23:03 +00:00
yuanmengqi
7315aec6e6 clean code 2025-06-10 04:06:54 +00:00
yuanmengqi
630f92fd7c fix: correct URL encoding in JSON examples for invoice paths 2025-06-09 08:06:27 +00:00
yuanmengqi
aee1207fff fix error 2025-06-09 04:20:59 +00:00
yuanmengqi
3e541bb393 Merge remote-tracking branch 'upstream/feat/aws-provider-support' 2025-06-08 04:01:35 +00:00
yuanmengqi
8853671220 fix: enhance instruction clarity and adjust timing in automation script for LibreOffice Impress example 2025-06-07 21:17:00 +00:00
yuanmengqi
9fa768d24d refactor: update URLs in multiple JSON files to ensure proper encoding of special characters 2025-06-07 17:26:45 +00:00
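Encoding special characters in a URL path, as this commit describes, can be done with Python's `urllib.parse`. The sketch below is an assumption about how such a fix could be scripted; the helper name `encode_url_path` and the example URL are hypothetical, and the commit itself may have edited the JSON files by hand.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode special characters in the path component of a URL."""
    parts = urlsplit(url)
    # "%" is kept safe so sequences that are already encoded are not encoded twice.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )
```

For example, `encode_url_path("https://example.com/files/my invoice.pdf")` yields `https://example.com/files/my%20invoice.pdf`, and applying the function a second time leaves the result unchanged.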
yuanmengqi
8471394cc1 add branch feat/aws-provider-support 2025-06-07 15:57:18 +00:00
yuanmengqi
f48d80002f Merge remote-tracking branch 'upstream/feat/aws-provider-support' 2025-06-07 13:22:53 +00:00
yuanmengqi
8d0ff7c99c refactor: update VLC command configurations to suppress audio and video title display across multiple JSON examples 2025-06-07 09:02:49 +00:00
yuanmengqi
e61acece84 problems from the community 2025-06-07 05:30:40 +00:00
yuanmengqi
a146c1e0b7 edit prompt 2025-06-07 05:21:04 +00:00
yuanmengqi
4ea24ddfd3 add proxy 2025-06-06 09:41:22 +00:00
yuanmengqi
a6300e05c9 Merge remote-tracking branch 'upstream/feat/aws-provider-support' 2025-06-05 13:31:42 +00:00
yuanmengqi
71578d994e edit 2025-06-05 13:29:16 +00:00
Timothyxxx
fb7bafb885 feat: Add proxy configuration to all 369 evaluation examples - 55 with proxy, 314 without 2025-06-05 18:46:53 +08:00
yuanmengqi
b211df3385 fix timeout 2025-06-04 10:23:45 +00:00
yuanmengqi
98a810d31e edit operator 2025-06-02 12:11:25 +00:00
Timothyxxx
34748567a5 feat: Migrate OSWorld files to HuggingFace cache with comprehensive documentation
- Add detailed README for file cache repository
- Implement migration script with retry logic and browser simulation
- Support automatic file type detection and deduplication
- Ensure reliable hosting for OSWorld evaluation files
2025-05-28 04:29:37 +08:00
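Deduplication during a file migration is often done by hashing file contents. The sketch below shows that approach under assumptions: the function name `deduplicate_by_content` and the use of SHA-256 are illustrative choices, and the actual migration script may deduplicate differently.

```python
import hashlib
from typing import Dict


def deduplicate_by_content(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map every file name to the first name seen with identical content."""
    first_by_digest: Dict[str, str] = {}
    canonical: Dict[str, str] = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first name registered for each digest.
        canonical[name] = first_by_digest.setdefault(digest, name)
    return canonical
```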
Danyang Zhang
7bf99cb823 Update 15c3b339-88f7-4a86-ab16-e71c58dcb01e.json 2025-05-06 16:29:35 +08:00
Danyang Zhang
e4097783bb Update dfac9ee8-9bc4-4cdc-b465-4a4bfcd2f397.json 2025-05-06 16:28:52 +08:00
Thomas Kuntz
af993b3a3d fix: Broken profile path in 3 Thunderbird tasks 2025-05-04 14:03:06 +02:00
Xubin Ren
1d10514125 Fix Search Engine Detection Discrepancy in Chrome Evaluation (#172)
* Update bb5e4c0d-f964-439c-97b6-bdb9747de3f4.json

* Update __init__.py

* Update general.py
2025-04-10 17:24:50 +08:00
Parth A. Patel
bbfeecb475 fix: af2d657a-e6b3-4c6a-9f67-9e3ed015974c task config has typo (#169)
Typo in the "examine_alignment" option results in false negatives
2025-04-06 02:20:51 +08:00
Timothyxxx
d373817edb Modify VLC launch command and fullscreen detection
- Add VLC_VERBOSE=-1 to suppress verbose logging in VLC launch commands across multiple example files
- Update is_vlc_fullscreen function to handle cases where screen size or window size is None
- Improve robustness of VLC-related metrics and example configurations
2025-03-06 22:11:42 +08:00
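Handling a missing screen or window size, as described for `is_vlc_fullscreen`, amounts to a guard before the comparison. The sketch below is hypothetical: the function name `is_fullscreen`, its signature, and the 1.0/0.0 return values are assumptions, since the real function's interface is not shown in this log.

```python
from typing import Optional, Tuple

Size = Tuple[int, int]


def is_fullscreen(window_size: Optional[Size], screen_size: Optional[Size]) -> float:
    """Return 1.0 when the window covers the screen, 0.0 otherwise."""
    if window_size is None or screen_size is None:
        # A size that could not be read is treated as "not fullscreen".
        return 0.0
    return 1.0 if tuple(window_size) == tuple(screen_size) else 0.0
```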
Timothyxxx
13127de01e Fix id 2025-03-03 18:26:32 +08:00
Timothyxxx
2f0f3f31aa Fix Duplicate ids; Remove unused JSON files across multiple applications 2025-02-10 15:49:54 +08:00
MillanK
983283a86a patch: minor bug fixes for evaluator and task configurations, documentation update (#121)
* fix: /cursor_position api return format fix

* chore: update README.md to remove deprecated command

* fix: add base score for evaluators and minor bug fixes

* fix: add base score for setup configurations

---------

Co-authored-by: Jiaqi Deng <jiaqideng@Jiaqis-MacBook-Pro.local>
2025-01-18 22:25:18 +08:00
YangJL2003
3148973ce9 Update c1fa57f3-c3db-4596-8f09-020701085416.json 2025-01-14 22:56:32 +08:00
Timothyxxx
63e69cab08 Fix one instruction error in chrome 6766f2b8-8a72-417f-a9e5-56fcaa735837 2024-12-09 12:35:02 +08:00
Tianbao Xie
afba17b510 Server setup readme revision (#108)
* Initialize

* add note for resolution

* Organize

* draft version and todos

* ver Nov24th

supplemented socat installation and switching off automatic suspend and screen-off

* Finish Tianbao todos

* Finish Tianbao todos

* Fix typos

* update font install

* Finish Xiaochuan's Part

* Finish Xiaochuan's Part update

* Update README.md

* Fix format

---------

Co-authored-by: zdy023 <zdy004007@126.com>
Co-authored-by: tsuky_chen <3107760494@qq.com>
Co-authored-by: Jason Lee <lixiaochuan20@gmail.com>
Co-authored-by: Siheng Zhao <77528902+sihengz02@users.noreply.github.com>
2024-11-25 16:30:59 +08:00