feat: add client password argument to multiple agents and scripts

- Added a `--client_password` argument to `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py`.
- Updated the agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and use `client_password`.
- Updated the evaluation guidelines to document the new client password requirement.
- Left existing logic otherwise unchanged.
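
A minimal sketch of how a runner script might expose `--client_password` and hand it to an agent. Only the flag name, the `client_password` keyword, and the example value come from this commit; `build_parser`, `SketchAgent`, `make_agent`, and the defaults shown are hypothetical stand-ins, not the actual code in `run_multienv_*.py` or the agent classes.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the new flag (the defaults here are assumptions)."""
    parser = argparse.ArgumentParser(description="Hypothetical multi-env runner")
    parser.add_argument("--num_envs", type=int, default=1)
    parser.add_argument("--max_steps", type=int, default=15)
    parser.add_argument(
        "--client_password",
        type=str,
        default="",  # assumed default; the real scripts may differ
        help="Password configured on the client machine",
    )
    return parser


class SketchAgent:
    """Hypothetical stand-in for PromptAgent / AguvisAgent / GTA1Agent."""

    def __init__(self, client_password: str = "") -> None:
        # Stored so later steps (e.g. actions that need the client's
        # credentials) can read it from the agent.
        self.client_password = client_password


def make_agent(argv: List[str]) -> SketchAgent:
    """Parse CLI arguments and forward the password to the agent."""
    args = build_parser().parse_args(argv)
    return SketchAgent(client_password=args.client_password)


agent = make_agent(
    ["--num_envs", "5", "--client_password", "osworld-public-evaluation"]
)
print(agent.client_password)
```

Passing the password as a constructor argument keeps it optional, so callers that omit the flag would get the assumed empty default.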
yuanmengqi
2025-07-27 16:11:23 +00:00
parent 122b16742b
commit 523d553e88
9 changed files with 627 additions and 28 deletions

@@ -270,6 +270,7 @@ Use the `run_multienv_xxx.py` scripts to launch tasks in parallel.
Example (with the OpenAI CUA agent):
```bash
+ # --client_password set to the one you set to the client machine
# Run OpenAI CUA
python run_multienv_openaicua.py \
--headless \
@@ -279,7 +280,8 @@ python run_multienv_openaicua.py \
--test_all_meta_path evaluation_examples/test_all.json \
--region us-east-1 \
--max_steps 50 \
- --num_envs 5
+ --num_envs 5 \
+ --client_password osworld-public-evaluation
# Run Anthropic (via AWS Bedrock), please modify agent if you want Anthropic endpoint
python run_multienv_claude.py \
@@ -291,7 +293,8 @@ python run_multienv_claude.py \
--test_all_meta_path evaluation_examples/test_all.json \
--max_steps 50 \
--num_envs 5 \
- --provider_name aws
+ --provider_name aws \
+ --client_password osworld-public-evaluation
```
Key Parameters:
@@ -330,7 +333,7 @@ For more, see: [MONITOR_README](./monitor/README.md)
### 4.2 VNC Remote Desktop Access
We pre-install VNC on every virtual machine so you can watch it while tasks are running.
You can access it via VNC at `http://<client-public-ip>:5910/vnc.html`
- The password set default is `osworld-public-evaluation` to prevent attack.
+ The password set default is `osworld-public-evaluation` in our AMI to prevent attack.
## 5. Contact the team to update leaderboard and fix errors (optional)