diff --git a/PUBLIC_EVALUATION_GUIDELINE.md b/PUBLIC_EVALUATION_GUIDELINE.md new file mode 100644 index 0000000..faa19ea --- /dev/null +++ b/PUBLIC_EVALUATION_GUIDELINE.md @@ -0,0 +1,161 @@ +# Public Evaluation Platform User Guide + +We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks. The system follows a Host-Client architecture: + +- **Host Instance**: The central controller that stores code, configurations, and manages task execution. +- **Client Instances**: Worker nodes automatically launched to perform tasks in parallel. + +All instances use a preconfigured AMI to ensure a consistent environment. + +## 1. Platform Deployment & Connection + +### 1.1 Launch the Host Instance + +Create an EC2 instance in the AWS Console with the following settings: + +| Configuration Item | Value | +| -------------------------- | ------------------------------------------------------------ | +| AMI ID | `ami-0e49e0a70044dde43` | +| Instance Type | - `t3.medium` (Recommended for ≤5 parallel tasks)
- `t3.large` (Recommended for ≤15 parallel tasks)

- These numbers assume you are working through VSCode over SSH. Running from the CLI uses fewer resources: `t3.large` then supports up to 20 parallel tasks.
- For higher parallelism, use a more powerful instance. | +| VPC | `vpc-0f207282fe145bcda` | +| Subnet | `subnet-0a4b0c5b8f6066712` | +| Firewall (security groups) | `sg-05f8e79c10a7768e4` | +| Storage | 50GB
- Consider increasing if storing multiple results to avoid crashes. | + +Once launched, you will receive an instance ID like `i-xxxxxx`. + +### 1.2 Connect to the Host Instance + +#### Step 1: Prepare Your SSH Key + +* When launching the instance, choose "Create new key pair" and download the `.pem` file (e.g. `osworld-host-key.pem`). Save it locally.pubeval1 + +* Set appropriate permissions: + + ```bash + chmod 400 + ``` + +* Find your instance's **public IP** and **DNS**: + + - Go to the EC2 **Instances** page on the AWS Console. + - Locate your Host instance by its ID. + + pubeval2 + + * On the instance detail page: + + - **Public IP/DNS**: used for browser/VNC access and SSH connection + - **Instance metadata**: e.g. storage, can be adjusted post-launch + + pubeval3 + +#### Step 2: Connect via SSH or VSCode + +* SSH: + + ```bash + ssh -i ubuntu@ + ``` + +* VSCode Remote SSH configuration: + + ``` + Host host_example + HostName + User ubuntu + IdentityFile + ``` + +### 1.3 Get AWS Access Keys & Secret Access Key + +Click on **Security Credentials** from the drop-down menu under your account in the top-right corner. + +pubeval4 + +In the **Access keys** section, click **"Create access key"** to generate your own key. + +pubeval5 + +## 2. Environment Setup + +### 2.1 Google Drive Integration + +Follow the instructions in [ACCOUNT_GUIDELINE.md](./ACCOUNT_GUIDELINE.md), specifically the section "Generating `credentials.json` for Public Eval". This part is necessary if using public evaluation. + +### 2.2 Proxy Setup + +- Register at [DataImpulse](https://dataimpulse.com/). 
+ +- Configure your credentials in `OSWorld/evaluation_examples/settings/proxy/dataimpulse.json`: + + ```json + [ + { + "host": "gw.dataimpulse.com", + "port": 823, + "username": "your_username", + "password": "your_password", + "protocol": "http", + "provider": "dataimpulse", + "type": "residential", + "country": "US", + "note": "Dataimpulse Residential Proxy" + } + ] + ``` + +### 2.3 Set Environment Variables + +```bash +export OPENAI_API_KEY_CUA="your_api_key" +export AWS_ACCESS_KEY_ID="your_access_key" +export AWS_SECRET_ACCESS_KEY="your_secret_access_key" +export AWS_REGION="your_aws_region" # e.g. us-east-1 +export AWS_SUBNET_ID="subnet-0a4b0c5b8f6066712" +export AWS_SECURITY_GROUP_ID="sg-08a53433e9b4abde6" +``` + +## 3. Running Evaluations + +Use the `run_multienv_xxx.py` scripts to launch tasks in parallel. + +Example (with the OpenAI CUA agent): + +```bash +python run_multienv_openaicua.py \ +--headless \ +--observation_type screenshot \ +--model computer-use-preview \ +--result_dir ./results_all \ +--test_all_meta_path evaluation_examples/test_all.json \ +--region us-east-1 \ +--max_steps 150 \ +--num_envs 5 +``` + +Key Parameters: + +- `--num_envs`: Number of parallel environments +- `--max_steps`: Max steps per task +- `--result_dir`: Output directory for results +- `--test_all_meta_path`: Path to the test set metadata +- `--region`: AWS region + +## 4. Viewing Results + +### 4.1 Web Monitoring Tool + +```bash +cd monitor +pip install -r requirements.txt +python main.py +``` + +Then, open your Host's **public IP** on port `8080` in a browser. (e.g.
`http://3.80.23.14:8080`) + +For more, see: `OSWorld/monitor/README.md` + +### 4.2 VNC Remote Desktop Access + +You can also access Client instances via VNC at`http://:5090/vnc.html` \ No newline at end of file diff --git a/assets/pubeval1.png b/assets/pubeval1.png new file mode 100644 index 0000000..28d3030 Binary files /dev/null and b/assets/pubeval1.png differ diff --git a/assets/pubeval2.png b/assets/pubeval2.png new file mode 100644 index 0000000..6797e34 Binary files /dev/null and b/assets/pubeval2.png differ diff --git a/assets/pubeval3.png b/assets/pubeval3.png new file mode 100644 index 0000000..4167df8 Binary files /dev/null and b/assets/pubeval3.png differ diff --git a/assets/pubeval4.png b/assets/pubeval4.png new file mode 100644 index 0000000..9ad053d Binary files /dev/null and b/assets/pubeval4.png differ diff --git a/assets/pubeval5.png b/assets/pubeval5.png new file mode 100644 index 0000000..37fe6fd Binary files /dev/null and b/assets/pubeval5.png differ diff --git a/desktop_env/controllers/setup.py b/desktop_env/controllers/setup.py index 073a020..4373600 100644 --- a/desktop_env/controllers/setup.py +++ b/desktop_env/controllers/setup.py @@ -27,8 +27,8 @@ import dotenv # Load environment variables from .env file dotenv.load_dotenv() -CLIENT_PASSWORD = os.getenv("CLIENT_PASSWORD", "password") # Default password for sudo operations -PROXY_CONFIG_FILE = os.getenv("PROXY_CONFIG_FILE", "dataimpulse_proxy_config.json") # Default proxy config file +CLIENT_PASSWORD = os.getenv("CLIENT_PASSWORD", "osworld-public-evaluation") # Default password for sudo operations +PROXY_CONFIG_FILE = os.getenv("PROXY_CONFIG_FILE", "evaluation_examples/settings/proxy/dataimpulse.json") # Default proxy config file logger = logging.getLogger("desktopenv.setup") diff --git a/desktop_env/desktop_env.py b/desktop_env/desktop_env.py index 58a178d..72bea71 100644 --- a/desktop_env/desktop_env.py +++ b/desktop_env/desktop_env.py @@ -60,7 +60,7 @@ class DesktopEnv(gym.Env): 
self.provider_name = provider_name self.enable_proxy = enable_proxy # Store proxy enablement setting - # Default TODO: + # Default self.server_port = 5000 self.chromium_port = 9222 self.vnc_port = 8006 diff --git a/evaluation_examples/settings/proxy/dataimpulse.json b/evaluation_examples/settings/proxy/dataimpulse.json new file mode 100644 index 0000000..2e7e65a --- /dev/null +++ b/evaluation_examples/settings/proxy/dataimpulse.json @@ -0,0 +1,13 @@ +[ + { + "host": "gw.dataimpulse.com", + "port": 823, + "username": "your_username", + "password": "your_password", + "protocol": "http", + "provider": "dataimpulse", + "type": "residential", + "country": "US", + "note": "Dataimpulse Residential Proxy" + } +] \ No newline at end of file
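
The `dataimpulse.json` file added above is a JSON list of proxy entries. As a rough illustration of how such a file could be checked before starting a run, here is a minimal sketch. The helper names `load_proxy_configs` and `proxy_url` are hypothetical and are not part of this change; the repository's own loader may read the file differently. Only the field names are taken from the example entry.

```python
import json
from pathlib import Path
from urllib.parse import quote

# Field names taken from the dataimpulse.json example in this change.
REQUIRED_FIELDS = ("host", "port", "username", "password", "protocol")


def load_proxy_configs(path):
    """Read a JSON list of proxy entries and check required fields (hypothetical helper)."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("proxy config must be a JSON list of objects")
    for index, entry in enumerate(entries):
        missing = [name for name in REQUIRED_FIELDS if name not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing fields: {missing}")
    return entries


def proxy_url(entry):
    """Build protocol://user:password@host:port, percent-encoding the credentials."""
    user = quote(str(entry["username"]), safe="")
    password = quote(str(entry["password"]), safe="")
    return f'{entry["protocol"]}://{user}:{password}@{entry["host"]}:{entry["port"]}'
```

Calling `load_proxy_configs("evaluation_examples/settings/proxy/dataimpulse.json")` from the repository root and passing the first entry to `proxy_url` is one way to catch a missing field or an unescaped character in the password before any Client instance is launched.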