sci-gui-agent-benchmark

Files

Zilong Zhou 349f2fd9fe Feat/claude cua support (#253)

* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py

2025-07-13 21:10:49 +08:00

controllers

Improve code logic for password & resolution (#252)

2025-07-13 21:04:07 +08:00

evaluators

Final review of multi_apps: fix the remaining parts

2025-07-12 20:28:55 +00:00

providers

Improve code logic for password & resolution (#252)

2025-07-13 21:04:07 +08:00

server

Update AWS AMI ID, enhance directory creation logic in file upload, modify osworld service configuration, and refine JSON evaluation examples for improved clarity and functionality.