# UiPath Screen Agent
### 23 Dec 2025
- Updated the planner model to [Claude Opus 4.5](https://www.anthropic.com/news/claude-opus-4-5)
- Updated the grounder model to an internally finetuned version of [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), allowing it to predict a "refusal" (similar to OSWorld-G) for elements that do not exist
- Added memory for storing relevant information across steps
- Improved utilization of the UI element detector for fine-grained details (such as cell corners)
- Refactoring and various small fixes
### 18 Sep 2025
We propose a simple yet effective implementation of a Computer Use Agent, which achieves **53.6%** on the **OSWorld** benchmark with 50 steps, demonstrating competitive results with a relatively lightweight setup and UI-only actions.
Our system builds upon recent approaches in agentic computer use and follows the literature in adopting a two-stage architecture that separates high-level reasoning from low-level execution. Specifically, the system is composed of:
By combining the current state with the structured interaction history, the Action Planner generates context-aware, informed predictions at every step: it can reconstruct the sequence of actions that led it to the current point, notice any failures, and plan the subsequent steps.
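
A history like this can be represented as an ordinary chat-style message list, with user turns carrying the task and observations and assistant turns carrying the planner's reasoning and chosen action. The schema below is illustrative only; it is not the system's actual message format, and `add_step` is a hypothetical helper.

```python
# Illustrative chat-style history: user turns carry task/observations,
# assistant turns carry the planner's reasoning and chosen action.
history: list[dict[str, str]] = [
    {"role": "user", "content": "Task: rename report.txt to final.txt"},
    {"role": "assistant", "content": "I will right-click the file. Action: right_click('report.txt')"},
    {"role": "user", "content": "Observation: a context menu is open"},
]

def add_step(history: list[dict[str, str]], observation: str, decision: str) -> None:
    """Append one planner step: new observation in, reasoning + action out."""
    history.append({"role": "user", "content": observation})
    history.append({"role": "assistant", "content": decision})

add_step(history,
         "Observation: 'Rename' menu item visible",
         "I will click Rename. Action: click('Rename')")
```

Because each turn is text rather than an image, the planner can condition on a long trajectory cheaply and spot earlier failures by rereading its own actions.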
We support a concise set of actions for interacting with the environment, focusing specifically on UI-related activities:
- Click (left, right, double, triple)
- Type
- Scroll
- Drag
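
An action set like the one above can be modelled as a small tagged vocabulary that the planner emits and an executor validates before running. The sketch below assumes a JSON-like action dict; `VALID_ACTIONS`, `Action`, and `parse_action` are hypothetical names, not the agent's real schema.

```python
from dataclasses import dataclass, field

# Hypothetical action vocabulary covering the UI-only set:
# click variants, type, scroll, drag.
VALID_ACTIONS = {"click", "double_click", "triple_click", "right_click",
                 "type", "scroll", "drag"}

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)

def parse_action(raw: dict) -> Action:
    """Validate a planner-emitted action against the supported set."""
    name = raw.get("name")
    if name not in VALID_ACTIONS:
        # Rejecting unknown actions keeps the executor's surface small
        # and forces the planner back to the supported vocabulary.
        raise ValueError(f"unsupported action: {name!r}")
    return Action(name=name, args=raw.get("args", {}))
```

Keeping the vocabulary this small is part of what makes the pipeline lightweight: there is no app-specific API surface to maintain, only generic UI primitives.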
## Conclusion
Our method offers a clean, simple, yet competitive pipeline for Computer Use tasks. It is cost-effective: it minimizes token usage during planning, avoids parallel planning and reliance on numerous past images, and incorporates only **direct UI actions**, with refined grounding to improve accuracy. With this approach, we achieve **53.6%** accuracy on OSWorld with a 50-step horizon.