Commit Graph

53 Commits

Author SHA1 Message Date
Tianbao Xie
4e11eafd1d Robust Evaluation, Blocking File Open, Grader Sensitivity, and LibreOffice Writer Fixes (#217)
* Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility.

* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.

* Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix.

* More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solution.

* Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files.

* Clean debug code

---------

Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-16 21:37:19 +08:00
yuanmengqi
2bae228803 merge upstream 2025-06-10 13:23:03 +00:00
yuanmengqi
7315aec6e6 clean code 2025-06-10 04:06:54 +00:00
adlsdztony
bfae51d74d fix: enhance setup method with retry logic and return status 2025-06-09 16:07:13 +00:00
adlsdztony
493abdeeab feat&refactor: add proxy setup functionality and update .gitignore for proxy config file 2025-06-07 11:24:49 +00:00
adlsdztony
71e9a1ead8 fix&refactor: improve error handling in download process and enhance start_emulator method signature 2025-06-06 09:08:14 +00:00
adlsdztony
0ca0085b18 fix: improve connection logging in SetupController 2025-06-05 11:04:33 +08:00
adlsdztony
d8ae209162 fix&refactor: improve connection retry logic and remove unnecessary wait time for AWS instance readiness 2025-05-28 13:05:32 +08:00
adlsdztony
431a762421 feat&fix: add logging for setup function calls and include snapshot name in AWS provider configuration 2025-05-26 20:37:20 +08:00
Tianbao Xie
20442244fa [Feature] Initialize and Implement Aguvis Evaluation on OSWorld (#98)
* Initialize Aguvis eval on OSWorld

* Debug

* Debug

* v1, internal version

* Add experiments script

* Fix minor bugs

* Update new endpoint

* Update ip

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Fix model name

* Fix docker close issues; update prompting

* Fix missed

* Fix the default port to avoid crashing on examples like '_update_browse_history_setup'

* Fix server and chromium ports in setup

* Revert and add missed dependency

* Add VLC port for docker

* Update

* Clean

---------

Co-authored-by: Tianbao Xie <tianbaoxie@U-492FC39R-0217.local>
Co-authored-by: FredWuCZ <fredwucz@outlook.com>
2024-11-11 12:36:16 +08:00
Pierre Carrier
b35dc40ff4 SetupController: no server_port for chrome (#96) 2024-11-07 00:33:03 +08:00
HappySix
6419d707bc Support Docker VM manager and provider (#75)
* Add docker provider framework

* Update VM download link

* Add stop container

* Update docker manager & provider

* Update

* Update

* Update provider
2024-09-28 21:10:40 +08:00
Timothyxxx
df231889c9 Fix minor bug 2024-08-04 11:35:44 +08:00
Jason Lee
fcdaf7ce0b Update setup.py for update_browse_history function 2024-07-04 09:37:13 -05:00
Timothyxxx
97b567a287 Update README and ROADMAP; Fix typos; optimize the code for llm calling in agent.py 2024-04-26 13:32:41 +08:00
Timothyxxx
9c75df5dce Clean code; Refactor environment to pass screenshot content instead of path 2024-04-13 23:34:01 +08:00
rhythmcao
da0dafc32c add multi-apps 5 examples by ruisheng 2024-03-06 2024-03-06 21:20:26 +08:00
David Chang
c39926fc57 Merge branch 'main' into zdy 2024-02-15 22:27:10 +08:00
Timothyxxx
fdb5655c89 Update chrome examples 2024-02-08 13:49:29 +08:00
David Chang
c46fcbfcbe ver Feb2ndv3
working on human eval for multi_apps
2024-02-02 09:30:10 +08:00
David Chang
5ee9621e0d ver Feb2nd
human evaluation as non-expert on chrome tasks
2024-02-02 05:13:12 +08:00
Timothyxxx
d65b6994d3 Fix minor bugs of multiple apps examples 2024-01-31 19:40:41 +08:00
tsuky_chen
932b73c67d load libreoffice writer eval -batch 2 2024-01-26 02:15:42 +08:00
tsuky_chen
3e7cfa8699 load libreoffice writer eval -batch 2 2024-01-26 02:07:26 +08:00
rhythmcao
5ac80dc309 update examples 2024-01-26 00:53:35 +08:00
rhythmcao
5a5309c0fd add multi-app example, fix googledrive functions 2024-01-25 20:30:54 +08:00
Timothyxxx
b9ae4174b1 Fix OS examples annotated by Yitao 2024-01-25 19:57:32 +08:00
rhythmcao
f194fb8d75 add multi_apps; update chrome utilities 2024-01-25 13:53:19 +08:00
David Chang
ffc4c32bac ver Jan17th
updated the existing task configs
2024-01-17 17:27:08 +08:00
David Chang
fc289a3427 Merge branch 'main' into zdy 2024-01-15 12:12:05 +08:00
David Chang
59fdd9f1a2 ver Jan14th
setup method for Thunderbird composing tasks
2024-01-14 23:16:54 +08:00
Timothyxxx
d52b692ee5 Finish loading the vscode examples v1; Improve on the infra: Add accessibility tree into the observation; Add activate window function, etc 2024-01-14 18:30:49 +08:00
Timothyxxx
2228f346a9 Fix minor bugs caused by merging in setupcontroller; Initialize vscode example loading 2024-01-14 00:51:26 +08:00
Timothyxxx
186df65683 Merge remote-tracking branch 'origin/main'
# Conflicts:
#	desktop_env/controllers/setup.py
#	desktop_env/evaluators/metrics/utils.py
2024-01-12 17:30:15 +08:00
Timothyxxx
5a93a32958 Update on Chrome examples; Refactor on logic of controlling 2024-01-12 17:24:47 +08:00
David Chang
27eaf2f5d5 ver Jan11th
finally set up a simple task, or one that should be simple
2024-01-11 20:03:33 +08:00
David Chang
cebae4b183 Merge branch 'main' into zdy 2024-01-10 22:16:25 +08:00
David Chang
1515b05666 ver Jan10thv2
a new example config for Thunderbird
fixed several bugs
2024-01-10 21:58:29 +08:00
Timothyxxx
abcafce750 VLC updates, and some infra bugs fix 2024-01-09 23:14:06 +08:00
Timothyxxx
fa84b20ea5 VLC updates, and some infra bugs fix 2024-01-09 09:30:11 +08:00
David Chang
26b7d9010d Merge branch 'zdy' 2024-01-05 15:55:41 +08:00
David Chang
eeb8a120d6 ver Jan5th
debugged
2024-01-05 15:20:47 +08:00
David Chang
5fedf5b891 ver Jan4th
updated interfaces for thunderbird evaluation, not tested
2024-01-04 22:41:57 +08:00
Timothyxxx
ab71ebb2ba Initialize VLC getters and metrics, fix some bugs in infra logic, needs to be refactored later on 2024-01-04 17:05:17 +08:00
David Chang
15a63074bc Merge branch 'zdy' 2023-12-25 21:05:44 +08:00
David Chang
ade9002da4 Merge branch 'main' into zdy 2023-12-25 20:29:20 +08:00
David Chang
82e3353f65 ver Dec25th
added cache and upload function for setup
2023-12-25 14:40:30 +08:00
Timothyxxx
236fcb0938 Refactor examples; Start to load examples into benchmark; vlc initialization 2023-12-25 00:24:13 +08:00
David Chang
295d09f1b2 ver Dec21stv2
updated usage of tmp and cache directories
added cache function for evaluation resources acquiring
2023-12-21 16:12:32 +08:00
David Chang
4a643abc31 ver Dec21st
updated setup configs from dict-style to list-style to support more
flexible setup steps
2023-12-21 10:30:23 +08:00