docs: update README.md with new evaluation sections and guidelines

- Added a new section for Local Evaluation, clarifying the import process for `run_multienv.py`.
- Introduced a Public Evaluation section detailing the process for verifying results on the leaderboard and requirements for sharing agent implementations.
- Included links to the Public Evaluation Guideline for user reference.
- Maintained existing content while enhancing clarity and providing additional resources for users.
yuanmengqi
2025-07-28 08:39:09 +00:00
parent 0dc78937d0
commit 0eb3a3d6d7

@@ -192,11 +192,19 @@ You can then run the following command to obtain the result:
python show_result.py
```
### Evaluation
## Evaluation
### Local Evaluation
Please start by reading through the [agent interface](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/README.md) and the [environment interface](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/README.md).
Correctly implement the agent interface and import your customized version in the `run.py` file.
Correctly implement the agent interface and import your customized version in the `run.py` or `run_multienv.py` file.
Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
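As a rough illustration of what a customized agent might look like, here is a minimal sketch. The class name `EchoAgent`, the `predict(instruction, obs)` / `reset()` method names, the return shape, and the `"WAIT"` / `"FAIL"` action strings are assumptions made for this example; the linked agent interface README is the authoritative reference.

```python
from typing import Any, Dict, List, Tuple


class EchoAgent:
    """Hypothetical placeholder agent (not part of OSWorld).

    Assumed interface: predict() receives the task instruction and the
    latest observation, and returns a text response plus a list of actions.
    """

    def __init__(self, max_steps: int = 15) -> None:
        self.max_steps = max_steps
        self.history: List[str] = []

    def reset(self) -> None:
        # Drop per-task state so the next task starts from a clean slate.
        self.history.clear()

    def predict(self, instruction: str, obs: Dict[str, Any]) -> Tuple[str, List[str]]:
        # A real agent would query a model with `instruction` and `obs` here.
        if len(self.history) >= self.max_steps:
            response, actions = "step budget exhausted", ["FAIL"]
        else:
            response, actions = f"waiting on: {instruction}", ["WAIT"]
        self.history.append(response)
        return response, actions
```

An agent along these lines would then be imported in `run.py` or `run_multienv.py` in place of the default agent.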
### Public Evaluation
If you want your results to be verified and displayed on the verified leaderboard, you need to schedule a meeting with us (current maintainers: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) so that we can run your agent code on our side and report the results.
You need to upload and allow us to disclose your agent implementation under the OSWorld framework (you may choose not to expose your model API to the public), along with a report that allows the public to understand what's happening behind the scenes.
Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us.
Please carefully follow the [Public Evaluation Guideline](https://github.com/xlang-ai/OSWorld/blob/main/PUBLIC_EVALUATION_GUIDELINE.md) to get results.
## ❓ FAQ
### What is the username and password for the virtual machines?
For the `vmware`, `virtualbox`, and `docker` providers, the Ubuntu account credentials are `user` / `password`. For cloud service providers such as `aws`, we default the password to `osworld-public-evaluation` to prevent attacks due to weak passwords. If you change these credentials, remember to set the `client_password` variable and pass it to `DesktopEnv` and the agent (if supported) when running experiments. Some features, such as setting up a proxy, require the client VM password to obtain sudo privileges, and for some OSWorld tasks the agent needs the password to obtain sudo privileges to complete them.
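One way to keep the password consistent across providers is a small lookup with an optional override, sketched below. The helper `resolve_client_password` and the `DEFAULT_PASSWORDS` table are hypothetical and not part of OSWorld; only the default values themselves come from the answer above.

```python
from typing import Optional

# Defaults as stated in this FAQ entry; the table itself is a hypothetical helper.
DEFAULT_PASSWORDS = {
    "vmware": "password",
    "virtualbox": "password",
    "docker": "password",
    "aws": "osworld-public-evaluation",
}


def resolve_client_password(provider: str, override: Optional[str] = None) -> str:
    """Return the VM password for a provider, preferring an explicit override."""
    if override:
        return override
    if provider not in DEFAULT_PASSWORDS:
        raise ValueError(f"no default password known for provider {provider!r}")
    return DEFAULT_PASSWORDS[provider]


# The result is what you would set as client_password and pass on to
# DesktopEnv and the agent (if supported).
client_password = resolve_client_password("aws")
```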