Public Evaluation Platform User Guide

We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks. The system follows a Host-Client architecture:

  • Host Instance: The central controller that stores code and configurations and manages task execution.
  • Client Instances: Worker nodes automatically launched to perform tasks in parallel.

All instances use a preconfigured AMI to ensure a consistent environment.

1. Platform Deployment & Connection

1.1 Launch the Host Instance

Create an EC2 instance in the AWS Console with the following settings:

| Configuration Item | Value |
| --- | --- |
| AMI ID | ami-0e49e0a70044dde43 |
| Instance Type | t3.medium (recommended for ≤5 parallel tasks) or t3.large (recommended for ≤15 parallel tasks) |
| VPC | vpc-0f207282fe145bcda |
| Subnet | subnet-0a4b0c5b8f6066712 |
| Firewall (security groups) | sg-05f8e79c10a7768e4 |
| Storage | 50 GB |

Notes:

  • Instance Type: the task counts above assume you work through VSCode over SSH. Running from the CLI uses fewer resources, and t3.large then supports up to 20 parallel tasks. For higher parallelism, use a more powerful instance.
  • Storage: consider a larger volume if you store multiple sets of results, to avoid crashes.

Once launched, you will receive an instance ID like i-xxxxxx.
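
If you prefer scripting the launch over clicking through the Console, the settings listed above map onto `aws ec2 run-instances` flags. The sketch below only builds and prints the command so you can review it first. The key pair name and the root device name `/dev/sda1` are assumptions, not values from this guide; check them against your account and the AMI before running anything.

```python
import shlex

# Values copied from the Host settings listed above.
HOST_SETTINGS = {
    "image_id": "ami-0e49e0a70044dde43",
    "subnet_id": "subnet-0a4b0c5b8f6066712",
    "security_group_id": "sg-05f8e79c10a7768e4",
}


def build_run_instances_command(instance_type, key_name, volume_gb=50,
                                root_device="/dev/sda1"):
    """Return an `aws ec2 run-instances` argument list for the Host.

    key_name and root_device are assumptions: pass your own key pair name
    and confirm the AMI's actual root device name.
    """
    # AWS CLI shorthand syntax for a single EBS root volume.
    block_devices = f"DeviceName={root_device},Ebs={{VolumeSize={volume_gb}}}"
    return [
        "aws", "ec2", "run-instances",
        "--image-id", HOST_SETTINGS["image_id"],
        "--instance-type", instance_type,
        "--subnet-id", HOST_SETTINGS["subnet_id"],  # the subnet implies the VPC
        "--security-group-ids", HOST_SETTINGS["security_group_id"],
        "--key-name", key_name,
        "--block-device-mappings", block_devices,
        "--count", "1",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(build_run_instances_command("t3.medium", "osworld-host-key")))
```

The command's JSON output should include the new instance ID in the same i-xxxxxx form mentioned above.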

1.2 Connect to the Host Instance

Step 1: Prepare Your SSH Key

  • When launching the instance, choose "Create new key pair" and download the .pem file (e.g. osworld-host-key.pem). Save it locally.

  • Set appropriate permissions:

    chmod 400 <your_key_file_path>
    
  • Find your instance's public IP and DNS:

    • Go to the EC2 Instances page on the AWS Console.
    • Locate your Host instance by its ID.

  • On the instance detail page:

    • Public IP/DNS: used for browser/VNC access and SSH connection
    • Instance metadata (e.g. storage): can be adjusted after launch

Step 2: Connect via SSH or VSCode

  • SSH:

    ssh -i <your_key_path> ubuntu@<your_public_dns>
    
  • VSCode Remote SSH configuration:

    Host host_example
        HostName <your_public_dns>
        User ubuntu
        IdentityFile <your_key_path>
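
The two manual steps above (restricting the key file and writing the SSH config entry) can also be scripted. This is a minimal sketch; the function names are hypothetical, and it returns the config text rather than editing `~/.ssh/config` for you.

```python
import os
import stat


def restrict_key_permissions(key_path):
    """Make the key readable by its owner only (same effect as `chmod 400`)."""
    os.chmod(key_path, stat.S_IRUSR)


def render_ssh_config(alias, public_dns, key_path, user="ubuntu"):
    """Return an SSH config entry in the same shape as the one shown above."""
    return (
        f"Host {alias}\n"
        f"    HostName {public_dns}\n"
        f"    User {user}\n"
        f"    IdentityFile {key_path}\n"
    )
```

Append the returned text to your SSH config file; after that, `ssh host_example` and VSCode Remote SSH both pick up the entry.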
    

1.3 Get AWS Access Key ID & Secret Access Key

In the AWS Console, open the drop-down menu under your account name in the top-right corner and click Security Credentials.

In the Access keys section, click "Create access key" to generate your own key.

2. Environment Setup

2.1 Google Drive Integration

Follow the instructions in ACCOUNT_GUIDELINE, specifically the section "Generating credentials.json for Public Eval". This step is required when you use the public evaluation platform.

2.2 Proxy Setup

  • Register at DataImpulse.

  • Configure your credentials in OSWorld/evaluation_examples/settings/proxy/dataimpulse.json:

    [
        {
            "host": "gw.dataimpulse.com",
            "port": 823,
            "username": "your_username",
            "password": "your_password",
            "protocol": "http",
            "provider": "dataimpulse",
            "type": "residential",
            "country": "US",
            "note": "Dataimpulse Residential Proxy"
        }
    ] 
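
A malformed proxy file is easier to catch before a run than in the middle of one. The sketch below parses the JSON shown above and builds a standard proxy URL from one entry. The set of required keys and both function names are assumptions made for illustration; the evaluation code may read the file differently.

```python
import json
from urllib.parse import quote

# Assumed minimum set of fields; the other keys in the example are treated as optional.
REQUIRED_KEYS = {"host", "port", "username", "password", "protocol"}


def parse_proxies(text):
    """Parse the proxy JSON text and fail early if an entry is incomplete."""
    proxies = json.loads(text)
    if not isinstance(proxies, list):
        raise ValueError("expected a JSON list of proxy entries")
    for i, entry in enumerate(proxies):
        if not isinstance(entry, dict):
            raise ValueError(f"proxy entry {i} is not a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"proxy entry {i} is missing: {sorted(missing)}")
    return proxies


def proxy_url(entry):
    """Build protocol://user:password@host:port, percent-encoding the credentials."""
    user = quote(str(entry["username"]), safe="")
    password = quote(str(entry["password"]), safe="")
    return f'{entry["protocol"]}://{user}:{password}@{entry["host"]}:{entry["port"]}'
```

To check your own file, read it with `pathlib.Path(...).read_text()` and pass the text to `parse_proxies`.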
    

2.3 Set Environment Variables

export OPENAI_API_KEY_CUA="your_api_key"
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_access_key"
export AWS_REGION="your_aws_region" # e.g. us-east-1
export AWS_SUBNET_ID="subnet-0a4b0c5b8f6066712"
export AWS_SECURITY_GROUP_ID="sg-08a53433e9b4abde6"
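
A missing variable typically surfaces only after instances have started to launch, so a quick preflight check can save time. This is a small sketch; `missing_env_vars` is a hypothetical helper, not part of the repository.

```python
import os

# The variables exported above.
REQUIRED_VARS = (
    "OPENAI_API_KEY_CUA",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_SUBNET_ID",
    "AWS_SECURITY_GROUP_ID",
)


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` in the shell session where you ran the exports should return an empty list.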

3. Running Evaluations

Use the run_multienv_xxx.py scripts to launch tasks in parallel, where xxx stands for the agent you want to evaluate.

Example (with the OpenAI CUA agent):

python run_multienv_openaicua.py \
--headless \
--observation_type screenshot \
--model computer-use-preview \
--result_dir ./results_all \
--test_all_meta_path evaluation_examples/test_all.json \
--region us-east-1 \
--max_steps 150 \
--num_envs 5

Key Parameters:

  • --num_envs: Number of parallel environments
  • --max_steps: Max steps per task
  • --result_dir: Output directory for results
  • --test_all_meta_path: Path to the test set metadata
  • --region: AWS region
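
When launching runs from another script, it can help to build the command in one place and check --num_envs against the Host sizing guidance from section 1.1. The sketch below assumes the flags shown in the example above; `build_eval_command` and the `host_instance_type` argument are hypothetical, and the limits are the VSCode-over-SSH figures.

```python
import sys

# Parallelism guidance from section 1.1 (VSCode-over-SSH figures).
RECOMMENDED_MAX_ENVS = {"t3.medium": 5, "t3.large": 15}


def build_eval_command(num_envs, max_steps=150, result_dir="./results_all",
                       region="us-east-1", host_instance_type=None):
    """Return the argument list for run_multienv_openaicua.py."""
    limit = RECOMMENDED_MAX_ENVS.get(host_instance_type)
    if limit is not None and num_envs > limit:
        raise ValueError(
            f"{num_envs} environments exceeds the recommended {limit} "
            f"for {host_instance_type}"
        )
    return [
        sys.executable, "run_multienv_openaicua.py",
        "--headless",
        "--observation_type", "screenshot",
        "--model", "computer-use-preview",
        "--result_dir", result_dir,
        "--test_all_meta_path", "evaluation_examples/test_all.json",
        "--region", region,
        "--max_steps", str(max_steps),
        "--num_envs", str(num_envs),
    ]
```

From the repository root, `subprocess.run(build_eval_command(5, host_instance_type="t3.medium"), check=True)` would start the same run as the example command.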

4. Viewing Results

4.1 Web Monitoring Tool

cd monitor
pip install -r requirements.txt
python main.py

Then open your Host's public IP on port 8080 in a browser (e.g. http://<host-public-ip>:8080).

For more, see: MONITOR_README

4.2 VNC Remote Desktop Access

You can also access Client instances via VNC at http://<client-public-ip>:5090/vnc.html
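
If a page does not load, a quick TCP check shows whether the port is reachable at all. The sketch below builds the two URLs from this section and tests a port. The helper names are hypothetical, and an unreachable port usually means either the service has not started or the security group does not allow that port.

```python
import socket

MONITOR_PORT = 8080  # monitor web UI, served from the Host
VNC_PORT = 5090      # VNC web page on each Client


def monitor_url(host_ip):
    """URL of the web monitoring tool."""
    return f"http://{host_ip}:{MONITOR_PORT}"


def vnc_url(client_ip):
    """URL of a Client's VNC page."""
    return f"http://{client_ip}:{VNC_PORT}/vnc.html"


def port_is_open(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("<client-public-ip>", VNC_PORT)` tells you whether the VNC page is worth opening in a browser yet.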