merge upstream
161
PUBLIC_EVALUATION_GUIDELINE.md
Normal file
@@ -0,0 +1,161 @@
# Public Evaluation Platform User Guide

We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks. The system follows a Host-Client architecture:

- **Host Instance**: The central controller that stores code and configurations and manages task execution.
- **Client Instances**: Worker nodes launched automatically to run tasks in parallel.

All instances use a preconfigured AMI to ensure a consistent environment.

## 1. Platform Deployment & Connection

### 1.1 Launch the Host Instance

Create an EC2 instance in the AWS Console with the following settings:

| Configuration Item         | Value                                                        |
| -------------------------- | ------------------------------------------------------------ |
| AMI ID                     | `ami-0e49e0a70044dde43`                                      |
| Instance Type              | - `t3.medium` (recommended for ≤5 parallel tasks)<br />- `t3.large` (recommended for ≤15 parallel tasks)<br /><br />These numbers assume you are using VSCode over SSH. Running from the CLI uses fewer resources: `t3.large` then supports up to 20 tasks.<br />For higher parallelism, use a more powerful instance. |
| VPC                        | `vpc-0f207282fe145bcda`                                      |
| Subnet                     | `subnet-0a4b0c5b8f6066712`                                   |
| Firewall (security groups) | `sg-05f8e79c10a7768e4`                                       |
| Storage                    | 50 GB<br />Consider increasing this if you store results from multiple runs, to avoid crashes. |

Once launched, you will receive an instance ID like `i-xxxxxx`.
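
The settings in the table can also be written down programmatically. The following is a minimal sketch and not part of the OSWorld tooling: `recommend_instance_type` and `build_launch_params` are hypothetical helper names, the task thresholds simply restate the table, and the `/dev/sda1` root device name is an assumption about the AMI. The returned dictionary uses the keyword arguments accepted by boto3's `run_instances`.

```python
from typing import Any, Dict


def recommend_instance_type(num_tasks: int, via_cli: bool = False) -> str:
    """Map a parallel-task count to the instance type suggested in the table."""
    if num_tasks <= 5:
        return "t3.medium"
    # t3.large: up to 15 tasks over VSCode/SSH, up to 20 when run from the CLI.
    limit = 20 if via_cli else 15
    if num_tasks <= limit:
        return "t3.large"
    raise ValueError("Beyond the documented range: choose a more powerful instance type.")


def build_launch_params(num_tasks: int, key_name: str, via_cli: bool = False,
                        storage_gb: int = 50) -> Dict[str, Any]:
    """Build keyword arguments for an EC2 run_instances call (hypothetical helper)."""
    return {
        "ImageId": "ami-0e49e0a70044dde43",
        "InstanceType": recommend_instance_type(num_tasks, via_cli),
        "KeyName": key_name,
        "SubnetId": "subnet-0a4b0c5b8f6066712",  # the subnet determines the VPC
        "SecurityGroupIds": ["sg-05f8e79c10a7768e4"],
        "BlockDeviceMappings": [
            # Assumption: the AMI's root device is /dev/sda1.
            {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": storage_gb}}
        ],
        "MinCount": 1,
        "MaxCount": 1,
    }


if __name__ == "__main__":
    print(build_launch_params(num_tasks=5, key_name="osworld-host-key"))
```

With boto3 installed and credentials configured, passing this dictionary as `boto3.client("ec2", region_name=...).run_instances(**params)` should launch the Host; the response lists the new instance ID under `Instances[0]["InstanceId"]`.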

### 1.2 Connect to the Host Instance

#### Step 1: Prepare Your SSH Key

* When launching the instance, choose "Create new key pair" and download the `.pem` file (e.g., `osworld-host-key.pem`). Save it locally.

  <img src="./assets/pubeval1.png" alt="pubeval1" style="zoom:50%;" />

* Restrict the key file's permissions so that SSH will accept it:

  ```bash
  chmod 400 <your_key_file_path>
  ```

* Find your instance's **public IP** and **DNS**:

  - Go to the EC2 **Instances** page on the AWS Console.
  - Locate your Host instance by its ID.

  <img src="./assets/pubeval2.png" alt="pubeval2" style="zoom:67%;" />

* On the instance detail page, note:

  - **Public IP/DNS**: used for browser/VNC access and for the SSH connection.
  - **Instance metadata** (e.g., storage): can be adjusted after launch.

  <img src="./assets/pubeval3.png" alt="pubeval3" style="zoom:67%;" />

#### Step 2: Connect via SSH or VSCode

* SSH:

  ```bash
  ssh -i <your_key_path> ubuntu@<your_public_dns>
  ```

* VSCode Remote SSH configuration:

  ```
  Host host_example
    HostName <your_public_dns>
    User ubuntu
    IdentityFile <your_key_path>
  ```

### 1.3 Get Your AWS Access Key & Secret Access Key

Click **Security Credentials** in the drop-down menu under your account name in the top-right corner.

<img src="./assets/pubeval4.png" alt="pubeval4" style="zoom: 33%;" />

In the **Access keys** section, click **"Create access key"** to generate your own key.

<img src="./assets/pubeval5.png" alt="pubeval5" style="zoom: 33%;" />

## 2. Environment Setup

### 2.1 Google Drive Integration

Follow the instructions in [ACCOUNT_GUIDELINE.md](./ACCOUNT_GUIDELINE.md), specifically the section "Generating `credentials.json` for Public Eval". This step is required for public evaluation.

### 2.2 Proxy Setup

- Register at [DataImpulse](https://dataimpulse.com/).
- Configure your credentials in `OSWorld/evaluation_examples/settings/proxy/dataimpulse.json`:

```json
[
    {
        "host": "gw.dataimpulse.com",
        "port": 823,
        "username": "your_username",
        "password": "your_password",
        "protocol": "http",
        "provider": "dataimpulse",
        "type": "residential",
        "country": "US",
        "note": "Dataimpulse Residential Proxy"
    }
]
```
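
To sanity-check the file before a run, the entries can be loaded and turned into standard proxy URLs. This is a minimal sketch under stated assumptions: `load_proxies` and `proxy_url` are hypothetical helpers, and how OSWorld itself consumes this file is not shown in this guide. Credentials are percent-encoded so that characters such as `@` or `:` in a password do not break the URL.

```python
import json
from typing import Any, Dict, List
from urllib.parse import quote


def load_proxies(path: str) -> List[Dict[str, Any]]:
    """Read the list of proxy entries from a JSON file shaped like the one above."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def proxy_url(entry: Dict[str, Any]) -> str:
    """Build protocol://user:password@host:port with percent-encoded credentials."""
    user = quote(str(entry["username"]), safe="")
    password = quote(str(entry["password"]), safe="")
    return f'{entry["protocol"]}://{user}:{password}@{entry["host"]}:{entry["port"]}'


# Placeholder entry copied from the template; replace with real credentials.
example = {
    "host": "gw.dataimpulse.com",
    "port": 823,
    "username": "your_username",
    "password": "your_password",
    "protocol": "http",
}

if __name__ == "__main__":
    print(proxy_url(example))
```

A URL of this form is what HTTP clients generally accept as a proxy setting, so it is a convenient way to test the credentials outside of an evaluation run.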

### 2.3 Set Environment Variables

```bash
export OPENAI_API_KEY_CUA="your_api_key"
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_access_key"
export AWS_REGION="your_aws_region" # e.g., us-east-1
export AWS_SUBNET_ID="subnet-0a4b0c5b8f6066712"
export AWS_SECURITY_GROUP_ID="sg-08a53433e9b4abde6"
```
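
A missing variable is easier to catch before any instances are launched. The following is a small sketch, assuming only the six variable names listed above; `missing_env_vars` is a hypothetical helper and not part of OSWorld.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the export commands above.
REQUIRED_VARS = [
    "OPENAI_API_KEY_CUA",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_SUBNET_ID",
    "AWS_SECURITY_GROUP_ID",
]


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.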

## 3. Running Evaluations

Use the `run_multienv_xxx.py` scripts (where `xxx` names the agent) to launch tasks in parallel.

Example (with the OpenAI CUA agent):

```bash
python run_multienv_openaicua.py \
    --headless \
    --observation_type screenshot \
    --model computer-use-preview \
    --result_dir ./results_all \
    --test_all_meta_path evaluation_examples/test_all.json \
    --region us-east-1 \
    --max_steps 150 \
    --num_envs 5
```

Key Parameters:

- `--num_envs`: Number of parallel environments
- `--max_steps`: Max steps per task
- `--result_dir`: Output directory for results
- `--test_all_meta_path`: Path to the test set metadata
- `--region`: AWS region
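
When runs are launched from another script, it can help to assemble the command as an argument list instead of a shell string. This is a minimal sketch: `build_eval_command` is a hypothetical helper, its defaults merely mirror the example above, and only flags that appear in that example are used.

```python
import sys
from typing import List


def build_eval_command(
    num_envs: int,
    max_steps: int = 150,
    result_dir: str = "./results_all",
    region: str = "us-east-1",
    test_all_meta_path: str = "evaluation_examples/test_all.json",
    script: str = "run_multienv_openaicua.py",
    model: str = "computer-use-preview",
) -> List[str]:
    """Return the argv list for one parallel evaluation run."""
    if num_envs < 1:
        raise ValueError("num_envs must be at least 1")
    return [
        sys.executable, script,
        "--headless",
        "--observation_type", "screenshot",
        "--model", model,
        "--result_dir", result_dir,
        "--test_all_meta_path", test_all_meta_path,
        "--region", region,
        "--max_steps", str(max_steps),
        "--num_envs", str(num_envs),
    ]


if __name__ == "__main__":
    # Print the command instead of running it; nothing is launched here.
    print(" ".join(build_eval_command(num_envs=5)))
```

The list can be handed to `subprocess.run(cmd, check=True)` from the repository root. Keep `num_envs` within what the Host instance type supports (see Section 1.1).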

## 4. Viewing Results

### 4.1 Web Monitoring Tool

```bash
cd monitor
pip install -r requirements.txt
python main.py
```

Then open your Host's **public IP** on port `8080` in a browser (e.g., `http://3.80.23.14:8080`).

For more details, see `OSWorld/monitor/README.md`.

### 4.2 VNC Remote Desktop Access

You can also access Client instances via VNC at `http://<client-public-ip>:5090/vnc.html`.
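
If a page does not load, a quick TCP check tells you whether the port is reachable at all. The sketch below assumes only the two ports mentioned in this section (8080 for the monitor, 5090 for VNC); `monitor_url`, `vnc_url`, and `port_is_open` are hypothetical helper names.

```python
import socket

MONITOR_PORT = 8080  # web monitoring tool on the Host
VNC_PORT = 5090      # VNC web client on each Client instance


def monitor_url(host_ip: str) -> str:
    """URL of the web monitor running on the Host."""
    return f"http://{host_ip}:{MONITOR_PORT}"


def vnc_url(client_ip: str) -> str:
    """URL of the VNC page served by a Client instance."""
    return f"http://{client_ip}:{VNC_PORT}/vnc.html"


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(monitor_url("3.80.23.14"))
    print(vnc_url("<client-public-ip>"))
```

A `False` result from `port_is_open` usually means that the service is not running or that the instance's security group does not allow inbound traffic on that port.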

BIN
assets/pubeval1.png
Normal file
Binary file not shown.
Size: 59 KiB
BIN
assets/pubeval2.png
Normal file
Binary file not shown.
Size: 174 KiB
BIN
assets/pubeval3.png
Normal file
Binary file not shown.
Size: 309 KiB
BIN
assets/pubeval4.png
Normal file
Binary file not shown.
Size: 69 KiB
BIN
assets/pubeval5.png
Normal file
Binary file not shown.
Size: 53 KiB
@@ -27,8 +27,8 @@ import dotenv
 # Load environment variables from .env file
 dotenv.load_dotenv()

-CLIENT_PASSWORD = os.getenv("CLIENT_PASSWORD", "password") # Default password for sudo operations
-PROXY_CONFIG_FILE = os.getenv("PROXY_CONFIG_FILE", "dataimpulse_proxy_config.json") # Default proxy config file
+CLIENT_PASSWORD = os.getenv("CLIENT_PASSWORD", "osworld-public-evaluation") # Default password for sudo operations
+PROXY_CONFIG_FILE = os.getenv("PROXY_CONFIG_FILE", "evaluation_examples/settings/proxy/dataimpulse.json") # Default proxy config file

 logger = logging.getLogger("desktopenv.setup")

@@ -60,7 +60,7 @@ class DesktopEnv(gym.Env):
         self.provider_name = provider_name
         self.enable_proxy = enable_proxy # Store proxy enablement setting

-        # Default TODO:
+        # Default
         self.server_port = 5000
         self.chromium_port = 9222
         self.vnc_port = 8006
13
evaluation_examples/settings/proxy/dataimpulse.json
Normal file
@@ -0,0 +1,13 @@
[
    {
        "host": "gw.dataimpulse.com",
        "port": 823,
        "username": "your_username",
        "password": "your_password",
        "protocol": "http",
        "provider": "dataimpulse",
        "type": "residential",
        "country": "US",
        "note": "Dataimpulse Residential Proxy"
    }
]