# UiPath Screen Agent
### 23 Dec 2025
- Updated the planner model to [Claude Opus 4.5](https://www.anthropic.com/news/claude-opus-4-5)
- Updated the grounder model to an internally finetuned version of [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), allowing it to predict a "refusal" (similar to OSWorld-G) for elements that do not exist
- Added memory for storing relevant information across steps
- Improved utilization of the UI element detector for fine-grained details (such as cell corners)
- Refactoring and various small fixes
### 18 Sep 2025
We propose a simple yet effective implementation of a Computer Use Agent, which achieves **53.6%** on the **OSWorld** benchmark with 50 steps, demonstrating competitive results with a relatively lightweight setup and UI-only actions.
Our system builds upon recent approaches in agentic computer use and follows the literature in adopting a two-stage architecture that separates high-level reasoning from low-level execution. Specifically, the system is composed of:
By combining the current state with the structured interaction history, the Action Planner generates context-aware, informed predictions at every step: it can reconstruct the sequence of actions that led it to the current point, notice any failures, and plan the subsequent steps.
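
A history like this can be represented as an ordinary chat-style message list, with user turns carrying the task and observations and assistant turns carrying the planner's reasoning and chosen action. The schema below is illustrative only; it is not the system's actual message format, and `add_step` is a hypothetical helper.

```python
# Illustrative chat-style history: user turns carry task/observations,
# assistant turns carry the planner's reasoning and chosen action.
history: list[dict[str, str]] = [
    {"role": "user", "content": "Task: rename report.txt to final.txt"},
    {"role": "assistant", "content": "I will right-click the file. Action: right_click('report.txt')"},
    {"role": "user", "content": "Observation: a context menu is open"},
]

def add_step(history: list[dict[str, str]], observation: str, decision: str) -> None:
    """Append one planner step: new observation in, reasoning + action out."""
    history.append({"role": "user", "content": observation})
    history.append({"role": "assistant", "content": decision})

add_step(history,
         "Observation: 'Rename' menu item visible",
         "I will click Rename. Action: click('Rename')")
```

Because each turn is text rather than an image, the planner can condition on a long trajectory cheaply and spot earlier failures by rereading its own actions.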
We support a concise set of actions for interacting with the environment, focusing specifically on UI-related activities:
- Click (left, right, double, triple)
- Type
- Scroll
- Drag
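
An action set like the one above can be modelled as a small tagged vocabulary that the planner emits and an executor validates before running. The sketch below assumes a JSON-like action dict; `VALID_ACTIONS`, `Action`, and `parse_action` are hypothetical names, not the agent's real schema.

```python
from dataclasses import dataclass, field

# Hypothetical action vocabulary covering the UI-only set:
# click variants, type, scroll, drag.
VALID_ACTIONS = {"click", "double_click", "triple_click", "right_click",
                 "type", "scroll", "drag"}

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)

def parse_action(raw: dict) -> Action:
    """Validate a planner-emitted action against the supported set."""
    name = raw.get("name")
    if name not in VALID_ACTIONS:
        # Rejecting unknown actions keeps the executor's surface small
        # and forces the planner back to the supported vocabulary.
        raise ValueError(f"unsupported action: {name!r}")
    return Action(name=name, args=raw.get("args", {}))
```

Keeping the vocabulary this small is part of what makes the pipeline lightweight: there is no app-specific API surface to maintain, only generic UI primitives.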
## Conclusion
Our method offers a clean, simple, yet competitive pipeline for Computer Use tasks. It is cost-effective: it minimizes token usage during planning, avoids parallel planning and reliance on numerous past images, and incorporates only **direct UI actions**, with refined grounding to improve accuracy. With this approach, we achieve **53.6%** accuracy on OSWorld with a 50-step horizon.