feat: add client password argument to multiple agents and scripts

- Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility.
- Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration.
- Modified evaluation guidelines to reflect the new client password requirement.
- Ensured existing logic remains intact while enhancing functionality for better user experience.
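
For readers skimming the message, the wiring it describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `build_parser`, `SketchAgent`, and `make_agent` are hypothetical names, and the empty-string default for `--client_password` is an assumption, since the actual default is not shown in this part of the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real run_multienv_*.py scripts define many more options."""
    parser = argparse.ArgumentParser(description="Sketch of a run_multienv-style CLI")
    parser.add_argument("--num_envs", type=int, default=1)
    # Assumed default of "": the default used by the real scripts is not shown here.
    parser.add_argument("--client_password", type=str, default="")
    return parser


class SketchAgent:
    """Hypothetical stand-in for PromptAgent, AguvisAgent, or GTA1Agent."""

    def __init__(self, client_password: str = "") -> None:
        # Stored so that later actions needing the client machine's password can use it.
        self.client_password = client_password


def make_agent(argv: list[str]) -> SketchAgent:
    """Parse CLI arguments and forward the password to the agent constructor."""
    args = build_parser().parse_args(argv)
    return SketchAgent(client_password=args.client_password)
```

Calling `make_agent(["--client_password", "osworld-public-evaluation"])` in this sketch yields an agent carrying the same password that the guideline examples below pass on the command line.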
2025-07-27 16:11:23 +00:00
parent 122b16742b
commit 523d553e88
9 changed files with 627 additions and 28 deletions
--- a/PUBLIC_EVALUATION_GUIDELINE.md
+++ b/PUBLIC_EVALUATION_GUIDELINE.md
@@ -270,6 +270,7 @@ Use the `run_multienv_xxx.py` scripts to launch tasks in parallel.
 Example (with the OpenAI CUA agent):

 ```bash
+# --client_password set to the one you set to the client machine
 # Run OpenAI CUA
 python run_multienv_openaicua.py \
 --headless \
@@ -279,7 +280,8 @@ python run_multienv_openaicua.py \
 --test_all_meta_path evaluation_examples/test_all.json \
 --region us-east-1 \
 --max_steps 50 \
---num_envs 5
+--num_envs 5 \
+--client_password osworld-public-evaluation

 # Run Anthropic (via AWS Bedrock), please modify agent if you want Anthropic endpoint
 python run_multienv_claude.py \
@@ -291,7 +293,8 @@ python run_multienv_claude.py \
 --test_all_meta_path evaluation_examples/test_all.json \
 --max_steps 50 \
 --num_envs 5 \
---provider_name aws
+--provider_name aws \
+--client_password osworld-public-evaluation
 ```

 Key Parameters:
@@ -330,7 +333,7 @@ For more, see: [MONITOR_README](./monitor/README.md)
 ### 4.2 VNC Remote Desktop Access
 We pre-install vnc for every virtual machine so you can have a look on it during the running.
 You can access via VNC at`http://<client-public-ip>:5910/vnc.html`
-The password set default is `osworld-public-evaluation` to prevent attack.
+The password set default is `osworld-public-evaluation` in our AMI to prevent attack.

 ## 5. Contact the team to update leaderboard and fix errors (optional)