Website • Paper • Doc • Data • Data Viewer • Discord • Cache
📢 Updates
- 2025-05-01: If you need pre-downloaded files for init state setup, we have downloaded them for you here.
- 2024-10-22: We now support Docker🐳 for hosting virtual machines on virtualized platforms. Check below for detailed instructions!
- 2024-06-15: We refactored the environment code to decouple the VMware integration and started supporting other platforms such as VirtualBox, AWS, Azure, etc. Hold tight!
- 2024-04-11: We released our paper, environment and benchmark, and project page. Check it out!
💾 Installation
VMware/VirtualBox (Desktop, Laptop, Bare Metal Machine)
If you are operating on a non-virtualized system (e.g., your desktop, laptop, or bare-metal machine) rather than a virtualized platform like AWS, Azure, or k8s, proceed with the instructions below. If you are on a virtualized platform, please refer to the Docker section instead.
- First, clone this repository and `cd` into it. Then, install the dependencies listed in `requirements.txt`. We recommend using the latest version of Conda to manage the environment, but you can also install the dependencies manually. Please ensure that your Python version is >= 3.9.
```bash
# Clone the OSWorld repository
git clone https://github.com/xlang-ai/OSWorld

# Change directory into the cloned repository
cd OSWorld

# Optional: Create a Conda environment for OSWorld
# conda create -n osworld python=3.9
# conda activate osworld

# Install required dependencies
pip install -r requirements.txt
```
Alternatively, you can install the environment without any benchmark tasks:
```bash
pip install desktop-env
```
- Install VMware Workstation Pro (on systems with Apple chips, install VMware Fusion instead) and configure the `vmrun` command. For the installation process, refer to How to install VMware Workstation Pro. Verify the installation by running the following:

```bash
vmrun -T ws list
```
If the installation and the environment variable are set up correctly, you will see a message listing the currently running virtual machines.
Note: We also support using VirtualBox if you have issues with VMware Pro. However, features such as parallelism and macOS on Apple chips might not be well-supported.
All set! Our setup script will automatically download the necessary virtual machines and configure the environment for you.
Docker (Server, preferably with KVM support)
If you are running on a non-bare metal server, or prefer not to use VMware and VirtualBox platforms, we recommend using our Docker support.
Prerequisite: Check if your machine supports KVM
We recommend running the VM with KVM support. To check whether your hosting platform supports KVM, run the following on Linux:

```bash
egrep -c '(vmx|svm)' /proc/cpuinfo
```

If the return value is greater than zero, the processor supports KVM.
Note: macOS hosts generally do not support KVM. You are advised to use VMware if you would like to run OSWorld on macOS.
Install Docker
If your hosting platform supports a graphical user interface (GUI), you may refer to Install Docker Desktop on Linux or Install Docker Desktop on Windows based on your OS. Otherwise, you may Install Docker Engine.
Running Experiments
Add the following arguments when initializing `DesktopEnv`:
- `provider_name`: `docker`
- `os_type`: `Ubuntu` or `Windows`, depending on the OS of the VM
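Putting the two arguments together, a minimal sketch might look like the following (the `DesktopEnv` call is commented out because it launches a container; `"Ubuntu"` is just one of the two supported values):

```python
# Extra keyword arguments for running OSWorld through the Docker provider.
docker_kwargs = {
    "provider_name": "docker",  # select the Docker provider
    "os_type": "Ubuntu",        # must match the OS of the VM image; use "Windows" otherwise
}

# env = DesktopEnv(action_space="pyautogui", **docker_kwargs)  # requires Docker running
```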
Note: If an experiment is interrupted abnormally (e.g., by an interrupt signal), residual Docker containers may be left behind, which can affect system performance over time. Run the following to clean them up:

```bash
docker stop $(docker ps -q) && docker rm $(docker ps -a -q)
```
Others
We are working on supporting more platforms 👷. Please hold tight!
🚀 Quick Start
Run the following minimal example to interact with the environment:
```python
from desktop_env.desktop_env import DesktopEnv

example = {
    "id": "94d95f96-9699-4208-98ba-3c3119edf9c2",
    "instruction": "I want to install Spotify on my current system. Could you please help me?",
    "config": [
        {
            "type": "execute",
            "parameters": {
                "command": [
                    "python",
                    "-c",
                    "import pyautogui; import time; pyautogui.click(960, 540); time.sleep(0.5);"
                ]
            }
        }
    ],
    "evaluator": {
        "func": "check_include_exclude",
        "result": {
            "type": "vm_command_line",
            "command": "which spotify"
        },
        "expected": {
            "type": "rule",
            "rules": {
                "include": ["spotify"],
                "exclude": ["not found"]
            }
        }
    }
}

env = DesktopEnv(action_space="pyautogui")

obs = env.reset(task_config=example)
obs, reward, done, info = env.step("pyautogui.rightClick()")
```
You will see logs of the system running normally, including the successful creation of the environment, completion of setup, and execution of actions. At the end, you will observe a successful right-click on the screen, which means you are ready to go.
🧪 Experiments
Agent Baselines
If you wish to run the baseline agent used in our paper, you can execute the following command as an example under the GPT-4V pure-screenshot setting:
```bash
# Set the OPENAI_API_KEY environment variable with your API key
export OPENAI_API_KEY='changeme'

python run.py --path_to_vm Ubuntu/Ubuntu.vmx --headless --observation_type screenshot --model gpt-4-vision-preview --result_dir ./results
```
The results, which include screenshots, actions, and video recordings of the agent's task completion, will be saved in the ./results directory in this case. You can then run the following command to obtain the result:
```bash
python show_result.py
```
Evaluation
Please start by reading through the agent interface and the environment interface.
Implement the agent interface correctly and import your customized version in the `run.py` file.
Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
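As a starting point, here is a minimal agent sketch. The method name `predict` and its signature are illustrative assumptions, not the definitive interface, so align them with the actual agent interface when implementing your own:

```python
# Minimal agent sketch. The method name and signature below are
# illustrative assumptions; match them to the actual agent interface.
class FixedClickAgent:
    """Ignores the observation and always clicks the screen center."""

    def predict(self, instruction: str, obs: dict) -> list[str]:
        # A real agent would inspect obs (e.g., the screenshot or a11y tree)
        # and the instruction to decide which pyautogui actions to return.
        return ["import pyautogui; pyautogui.click(960, 540)"]

agent = FixedClickAgent()
actions = agent.predict("Install Spotify.", {"screenshot": b""})
```

Each returned string is a `pyautogui` snippet that can be passed to `env.step(...)` as in the Quick Start example.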
❓ FAQ
What is the username and password for the virtual machines?
The username and password for the virtual machines are as follows:
- Ubuntu: `user` / `password`
How to setup the account and credentials for Google and Google Drive?
See Account Guideline.
How can I configure a proxy for the VM if I'm behind a GFW?
See Proxy Guideline.
What are the running times and costs under different settings?
| Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
|---|---|---|
| GPT-4V (screenshot) | 10h | $100 ($10) |
| Gemini-ProV (screenshot) | 15h | $0 ($0) |
| Claude-3 Opus (screenshot) | 15h | $150 ($15) |
| GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
*No environment parallelism. Calculated in April 2024.
Open Source Contributors
Thanks to all the contributors!
📄 Citation
If you find this environment useful, please consider citing our work:
```bibtex
@misc{OSWorld,
      title={OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments},
      author={Tianbao Xie and Danyang Zhang and Jixuan Chen and Xiaochuan Li and Siheng Zhao and Ruisheng Cao and Toh Jing Hua and Zhoujun Cheng and Dongchan Shin and Fangyu Lei and Yitao Liu and Yiheng Xu and Shuyan Zhou and Silvio Savarese and Caiming Xiong and Victor Zhong and Tao Yu},
      year={2024},
      eprint={2404.07972},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
