docs: update README.md with enhanced OSWorld-Verified details and AWS support

- Expanded the OSWorld-Verified update entry to include new model results and a comparison with previous benchmarks.
- Added a new section on AWS support, detailing the benefits of using cloud services for parallel evaluation and providing links to setup guides.
- Corrected the baseline agent command example to reflect the updated model name and added a new example for parallel execution.
- Clarified the username and password information for virtual machines, emphasizing security measures for cloud services.
- Maintained existing content while enhancing clarity and providing additional resources for users.
yuanmengqi
2025-07-28 08:22:25 +00:00
parent a37fe86925
commit 0dc78937d0


@@ -33,7 +33,7 @@
## 📢 Updates
- 2025-07-28: <span style="color: red">Introducing **OSWorld-Verified**!</span> We have made major updates, fixed several issues reported by the community, with more support for AWS, and making the benchmark signals more effective. Check out more in the [report](https://xlang.ai/blog/osworld-verified)!
- 2025-07-28: Introducing **OSWorld-Verified**! We have made major updates, fixed several issues reported by the community, added more support for AWS (which can reduce evaluation time to within 1 hour through parallelization!), and made the benchmark signals more effective. Check out more in the [report](https://xlang.ai/blog/osworld-verified). We have run new model results on the latest version and published them on the [official website](https://os-world.github.io/). When running the latest version, please compare your OSWorld results with the new benchmark results.
- 2025-05-01: If you need pre-downloaded files for init state setup, we downloaded for you [here](https://drive.google.com/file/d/1XlEy49otYDyBlA3O9NbR0BpPfr2TXgaD/view?usp=drive_link).
- 2024-10-22: We supported Docker🐳 for hosting virtual machines on virtualized platforms. Check below for detailed instructions!
- 2024-06-15: We refactored the environment code to decouple the VMware integration, and started to support other platforms such as VirtualBox, AWS, Azure, etc. Hold tight!
@@ -94,6 +94,9 @@ Add the following arguments when initializing `DesktopEnv`:
- `os_type`: `Ubuntu` or `Windows`, depending on the OS of the VM
> **Note**: If the experiment is interrupted abnormally (e.g., by interrupting signals), there may be residual docker containers which could affect system performance over time. Please run `docker stop $(docker ps -q) && docker rm $(docker ps -a -q)` to clean up.
### AWS
Using cloud services for parallel evaluation can significantly speed up evaluation, and the same setup can also serve as infrastructure for training. We provide comprehensive AWS support with a Host-Client architecture that enables large-scale parallel evaluation of OSWorld tasks. For detailed setup instructions, see the [Public Evaluation Guideline](https://github.com/xlang-ai/OSWorld/blob/main/PUBLIC_EVALUATION_GUIDELINE.md) and the [AWS Configuration Guide](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/aws/AWS_GUIDELINE.md).
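
To illustrate why parallel environments shorten wall-clock time, here is a minimal sketch of fanning tasks out over a fixed number of workers. It is not OSWorld's actual Host-Client implementation: `run_task` and `evaluate_in_parallel` are hypothetical names, and the placeholder score stands in for a real evaluator result.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id: str) -> tuple[str, float]:
    """Hypothetical stand-in for running one task in its own VM and scoring it."""
    # A real runner would start an environment, let the agent act, and
    # return the evaluator's score. Here we return a placeholder.
    return task_id, 0.0


def evaluate_in_parallel(task_ids: list[str], num_envs: int) -> dict[str, float]:
    """Run tasks on up to `num_envs` concurrent workers and collect scores by task id."""
    with ThreadPoolExecutor(max_workers=num_envs) as pool:
        # pool.map preserves input order and yields (task_id, score) pairs.
        return dict(pool.map(run_task, task_ids))


scores = evaluate_in_parallel([f"task-{i}" for i in range(5)], num_envs=2)
print(len(scores))
```

With `num_envs` workers, total time is bounded roughly by the number of tasks divided by `num_envs`, times the per-task duration, assuming tasks are independent.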
### Others
We are working on supporting more 👷. Please hold tight!
@@ -144,7 +147,7 @@ You will see all the logs of the system running normally, including the successf
## 🧪 Experiments
### Agent Baselines
If you wish to run the baseline agent used in our paper, you can execute the following command as an example under the GPT-4V pure-screenshot setting:
If you wish to run the baseline agent used in our paper, you can execute the following command as an example under the GPT-4o pure-screenshot setting:
Set **OPENAI_API_KEY** environment variable with your API key
```bash
@@ -156,10 +159,35 @@ Optionally, set **OPENAI_BASE_URL** to use a custom OpenAI-compatible API endpoi
export OPENAI_BASE_URL='http://your-custom-endpoint.com/v1' # Optional: defaults to https://api.openai.com
```
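
As a rough sketch of how a script might consume these two variables, the snippet below reads `OPENAI_API_KEY` and falls back to the default endpoint when `OPENAI_BASE_URL` is unset. The helper name `resolve_openai_settings` is hypothetical and is not part of the OSWorld code base; only the variable names and the default URL come from the README.

```python
import os
from typing import Mapping, Optional, Tuple

# Default endpoint as stated in the README comment above.
DEFAULT_BASE_URL = "https://api.openai.com"


def resolve_openai_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (api_key, base_url) from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # An empty or missing OPENAI_BASE_URL falls back to the default endpoint.
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return api_key, base_url


# Example with an explicit mapping so the snippet runs without a real key.
example = resolve_openai_settings({"OPENAI_API_KEY": "sk-example"})
print(example)
```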
Single-threaded execution (deprecated, using `vmware` provider as example)
```bash
python run.py --path_to_vm Ubuntu/Ubuntu.vmx --headless --observation_type screenshot --model gpt-4-vision-preview --result_dir ./results
python run.py \
--provider_name vmware \
--path_to_vm Ubuntu/Ubuntu.vmx \
--headless \
--observation_type screenshot \
--model gpt-4o \
--sleep_after_execution 3 \
--max_steps 15 \
--result_dir ./results \
--client_password password
```
The results, which include screenshots, actions, and video recordings of the agent's task completion, will be saved in the `./results` directory in this case. You can then run the following command to obtain the result:
Parallel execution (example showing switching provider to `docker`)
```bash
python run_multienv.py \
--provider_name docker \
--headless \
--observation_type screenshot \
--model gpt-4o \
--sleep_after_execution 3 \
--max_steps 15 \
--num_envs 10 \
--client_password password
```
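
If you launch runs from a script, the command above can be assembled programmatically. The following is a minimal sketch: the flags are exactly those shown in the example, while the helper `build_multienv_command` is a hypothetical name introduced here for illustration.

```python
import shlex
from typing import List


def build_multienv_command(
    provider_name: str,
    model: str,
    num_envs: int,
    client_password: str,
    observation_type: str = "screenshot",
    sleep_after_execution: int = 3,
    max_steps: int = 15,
) -> List[str]:
    """Build the argument list for run_multienv.py (hypothetical helper)."""
    return [
        "python", "run_multienv.py",
        "--provider_name", provider_name,
        "--headless",
        "--observation_type", observation_type,
        "--model", model,
        "--sleep_after_execution", str(sleep_after_execution),
        "--max_steps", str(max_steps),
        "--num_envs", str(num_envs),
        "--client_password", client_password,
    ]


cmd = build_multienv_command("docker", "gpt-4o", num_envs=10, client_password="password")
# shlex.join quotes arguments safely for display or logging.
print(shlex.join(cmd))
```

Passing the list (rather than a joined string) to `subprocess.run` avoids shell quoting issues.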
The results, which include screenshots, actions, and video recordings of the agent's task completion, are saved in the `./results` directory (or the `result_dir` you specified).
You can then run the following command to obtain the result:
```bash
python show_result.py
```
@@ -171,26 +199,16 @@ Afterward, you can execute a command similar to the one in the previous section
## ❓ FAQ
### What is the username and password for the virtual machines?
The username and password for the virtual machines are as follows:
- **Ubuntu:** `user` / `password`
For the `vmware`, `virtualbox`, and `docker` providers, the Ubuntu account credentials are `user` / `password`. For cloud service providers such as `aws`, the password defaults to `osworld-public-evaluation` to prevent attacks that exploit weak passwords. If you change these defaults, remember to set the `client_password` variable and pass it to `DesktopEnv` and the agent (if supported) when running experiments. Some features, such as proxy setup, require the environment to have the client VM password in order to obtain sudo privileges, and for some OSWorld tasks the agent needs the password to obtain sudo privileges to complete them.
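
The defaults described above can be summarized as a small lookup. This is only a sketch: the mapping and the function `default_client_password` are hypothetical names, not part of the OSWorld API; the provider names and password values are the ones stated in this answer.

```python
# Default client VM passwords per provider, as described in the FAQ answer.
DEFAULT_CLIENT_PASSWORDS = {
    "vmware": "password",
    "virtualbox": "password",
    "docker": "password",
    "aws": "osworld-public-evaluation",
}


def default_client_password(provider_name: str) -> str:
    """Return the documented default password for a provider (hypothetical helper)."""
    try:
        return DEFAULT_CLIENT_PASSWORDS[provider_name]
    except KeyError:
        raise ValueError(f"no documented default password for provider {provider_name!r}") from None


print(default_client_password("aws"))
```

The returned value is what you would pass as `--client_password` (or as `client_password` when constructing the environment) unless you have changed the VM's password.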
### How to setup the account and credentials for Google and Google Drive?
See [Account Guideline](ACCOUNT_GUIDELINE.md).
### How can I configure a proxy for the VM if I'm behind a GFW?
### How can I configure a proxy for the VM (if I'm behind the GFW, or I don't want some of my tasks to be flagged as bot traffic and receive lower scores)?
See [Proxy Guideline](PROXY_GUIDELINE.md).
### What are the running times and costs under different settings?
| Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
| ------------------------------ | -------------- | ------------------------------------------ |
| GPT-4V (screenshot) | 10h | $100 ($10) |
| Gemini-ProV (screenshot) | 15h | $0 ($0) |
| Claude-3 Opus (screenshot) | 15h | $150 ($15) |
| GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
\*No environment parallelism. Calculated in April 2024.
If you want to set it up yourself, please refer to [Proxy Guideline](PROXY_GUIDELINE.md).
We also provide a pre-configured solution based on dataimpulse; see the [proxy-setup section in PUBLIC_EVALUATION_GUIDELINE](https://github.com/xlang-ai/OSWorld/blob/main/PUBLIC_EVALUATION_GUIDELINE.md#22-proxy-setup).
### Open Source Contributors