Tianbao Xie
4e11eafd1d
Robust Evaluation, Blocking File Open, Grader Sensitivity, and LibreOffice Writer Fixes (#217)
...
* Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility.
* Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities.
* Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix.
* More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solution.
* Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files.
* Clean debug code
---------
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-06-16 21:37:19 +08:00
yuanmengqi
2bae228803
merge upstream
2025-06-10 13:23:03 +00:00
yuanmengqi
7315aec6e6
clean code
2025-06-10 04:06:54 +00:00
adlsdztony
bfae51d74d
fix: enhance setup method with retry logic and return status
2025-06-09 16:07:13 +00:00
adlsdztony
493abdeeab
feat&refactor: add proxy setup functionality and update .gitignore for proxy config file
2025-06-07 11:24:49 +00:00
adlsdztony
71e9a1ead8
fix&refactor: improve error handling in download process and enhance start_emulator method signature
2025-06-06 09:08:14 +00:00
adlsdztony
0ca0085b18
fix: improve connection logging in SetupController
2025-06-05 11:04:33 +08:00
adlsdztony
d8ae209162
fix&refactor: improve connection retry logic and remove unnecessary wait time for AWS instance readiness
2025-05-28 13:05:32 +08:00
adlsdztony
431a762421
feat&fix: add logging for setup function calls and include snapshot name in AWS provider configuration
2025-05-26 20:37:20 +08:00
Tianbao Xie
20442244fa
[Feature] Initialize and Implement Aguvis Evaluation on OSWorld (#98)
...
* Initialize Aguvis eval on OSWorld
* Debug
* Debug
* v1, internal version
* Add experiments script
* Fix minor bugs
* Update new endpoint
* Update ip
* Update
* Update
* Update
* Update
* Update
* Update
* Update
* Update
* Fix model name
* Fix docker close issues; update prompting
* Fix missed
* Fix the default port to avoid crashing on examples like '_update_browse_history_setup'
* Fix server and chromium ports in setup
* Revert and add missed dependency
* Add VLC port for docker
* Update
* Clean
---------
Co-authored-by: Tianbao Xie <tianbaoxie@U-492FC39R-0217.local>
Co-authored-by: FredWuCZ <fredwucz@outlook.com>
2024-11-11 12:36:16 +08:00
Pierre Carrier
b35dc40ff4
SetupController: no server_port for chrome (#96)
2024-11-07 00:33:03 +08:00
HappySix
6419d707bc
Support Docker VM manager and provider (#75)
...
* Add docker provider framework
* Update VM download link
* Add stop container
* Update docker manager & provider
* Update
* Update
* Update provider
2024-09-28 21:10:40 +08:00
Timothyxxx
df231889c9
Fix minor bug
2024-08-04 11:35:44 +08:00
Jason Lee
fcdaf7ce0b
Update setup.py for update_browse_history function
2024-07-04 09:37:13 -05:00
Timothyxxx
97b567a287
Update README and ROADMAP; Fix typos; optimize the code for llm calling in agent.py
2024-04-26 13:32:41 +08:00
Timothyxxx
9c75df5dce
Clean code; Refactor environment to pass screenshot content instead of path
2024-04-13 23:34:01 +08:00
rhythmcao
da0dafc32c
add multi-apps 5 examples by ruisheng 2024-03-06
2024-03-06 21:20:26 +08:00
David Chang
c39926fc57
Merge branch 'main' into zdy
2024-02-15 22:27:10 +08:00
Timothyxxx
fdb5655c89
Update chrome examples
2024-02-08 13:49:29 +08:00
David Chang
c46fcbfcbe
ver Feb2ndv3
...
working on human eval for multi_apps
2024-02-02 09:30:10 +08:00
David Chang
5ee9621e0d
ver Feb2nd
...
human evaluation as non-expert on chrome tasks
2024-02-02 05:13:12 +08:00
Timothyxxx
d65b6994d3
Fix minor bugs of multiple apps examples
2024-01-31 19:40:41 +08:00
tsuky_chen
932b73c67d
load libreoffice writer eval -batch 2
2024-01-26 02:15:42 +08:00
tsuky_chen
3e7cfa8699
load libreoffice writer eval -batch 2
2024-01-26 02:07:26 +08:00
rhythmcao
5ac80dc309
update examples
2024-01-26 00:53:35 +08:00
rhythmcao
5a5309c0fd
add multi-app example, fix googledrive functions
2024-01-25 20:30:54 +08:00
Timothyxxx
b9ae4174b1
Fix OS examples annotated by Yitao
2024-01-25 19:57:32 +08:00
rhythmcao
f194fb8d75
add multi_apps; update chrome utilities
2024-01-25 13:53:19 +08:00
David Chang
ffc4c32bac
ver Jan17th
...
updated the existing task configs
2024-01-17 17:27:08 +08:00
David Chang
fc289a3427
Merge branch 'main' into zdy
2024-01-15 12:12:05 +08:00
David Chang
59fdd9f1a2
ver Jan14th
...
setup method for Thunderbird composing tasks
2024-01-14 23:16:54 +08:00
Timothyxxx
d52b692ee5
Finish loading the vscode examples v1; Improve on the infra: Add accessibility tree into the observation; Add activate window function, etc
2024-01-14 18:30:49 +08:00
Timothyxxx
2228f346a9
Fix minor bugs caused by merging in SetupController; Initialize vscode example loading
2024-01-14 00:51:26 +08:00
Timothyxxx
186df65683
Merge remote-tracking branch 'origin/main'
...
# Conflicts:
# desktop_env/controllers/setup.py
# desktop_env/evaluators/metrics/utils.py
2024-01-12 17:30:15 +08:00
Timothyxxx
5a93a32958
Update on Chrome examples; Refactor on logic of controlling
2024-01-12 17:24:47 +08:00
David Chang
27eaf2f5d5
ver Jan11th
...
finally set up a simple task, or one that should be simple
2024-01-11 20:03:33 +08:00
David Chang
cebae4b183
Merge branch 'main' into zdy
2024-01-10 22:16:25 +08:00
David Chang
1515b05666
ver Jan10thv2
...
a new example config for Thunderbird
fixed several bugs
2024-01-10 21:58:29 +08:00
Timothyxxx
abcafce750
VLC updates, and some infra bugs fix
2024-01-09 23:14:06 +08:00
Timothyxxx
fa84b20ea5
VLC updates, and some infra bugs fix
2024-01-09 09:30:11 +08:00
David Chang
26b7d9010d
Merge branch 'zdy'
2024-01-05 15:55:41 +08:00
David Chang
eeb8a120d6
ver Jan5th
...
debugged
2024-01-05 15:20:47 +08:00
David Chang
5fedf5b891
ver Jan4th
...
updated interfaces for thunderbird evaluation, not tested
2024-01-04 22:41:57 +08:00
Timothyxxx
ab71ebb2ba
Initialize VLC getters and metrics, fix some bugs in infra logic, needs to be refactored later on
2024-01-04 17:05:17 +08:00
David Chang
15a63074bc
Merge branch 'zdy'
2023-12-25 21:05:44 +08:00
David Chang
ade9002da4
Merge branch 'main' into zdy
2023-12-25 20:29:20 +08:00
David Chang
82e3353f65
ver Dec25th
...
added cache and upload function for setup
2023-12-25 14:40:30 +08:00
Timothyxxx
236fcb0938
Refactor examples; Start to load examples into benchmark; vlc initialization
2023-12-25 00:24:13 +08:00
David Chang
295d09f1b2
ver Dec21stv2
...
updated usage of tmp and cache directories
added cache function for acquiring evaluation resources
2023-12-21 16:12:32 +08:00
David Chang
4a643abc31
ver Dec21st
...
updated setup configs from dict-style to list-style to support more
flexible setup steps
2023-12-21 10:30:23 +08:00