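feat: enhance task-step injection and a11y state representation to stabilize tree interaction

- Wire up the metadata.steps propagation path so that task steps are injected into the agent's prediction context
- Improve the a11y tree linearized output: use center coordinates and add a states column (expanded/collapsed/selected, etc.)
- Relax the node-retention criteria to keep input controls that have no text (edit/textfield/searchbox, etc.)
- Tighten the output constraints: each turn may return only action code or WAIT/DONE/FAIL, and an action may not be returned together with DONE in the same turn
- Add example steps for avogadro: expand aromatics and select benzene.cjson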
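
To make the a11y bullets above concrete, here is a minimal sketch of linearizing accessibility nodes into tab-separated rows with center coordinates and a `states` column. The `Node` dataclass, the column order, and the `INPUT_ROLES` set are assumptions made for illustration; they are not taken from the repository's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Hypothetical node shape; the real a11y tree comes from the desktop environment.
@dataclass
class Node:
    role: str
    name: str
    position: Tuple[int, int]  # top-left corner (x, y)
    size: Tuple[int, int]      # (width, height)
    states: List[str] = field(default_factory=list)


# Assumed set of input-like roles that are kept even when they carry no text.
INPUT_ROLES = {"edit", "textfield", "searchbox"}


def keep(node: Node) -> bool:
    """Keep nodes that have text, or input controls that have none."""
    return bool(node.name.strip()) or node.role in INPUT_ROLES


def linearize(nodes: List[Node]) -> str:
    """Emit one tab-separated row per kept node: role, name, center, states."""
    rows = ["role\tname\tcenter\tstates"]
    for n in nodes:
        if not keep(n):
            continue
        # Center of the bounding box rather than its top-left corner.
        cx = n.position[0] + n.size[0] // 2
        cy = n.position[1] + n.size[1] // 2
        rows.append(f"{n.role}\t{n.name}\t({cx}, {cy})\t{','.join(n.states)}")
    return "\n".join(rows)
```

Reporting the center gives the agent a point it can click directly, and the `states` column lets it tell a collapsed tree item from an expanded one without re-reading the screenshot.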
feat: enhance a11y tree support for scientific research software
2026-02-26 18:56:53 +08:00 · 2026-02-26 15:04:28 +08:00 · 2026-02-25 15:19:36 +08:00 · 2026-02-09 17:47:02 +08:00 · 2026-02-09 17:46:05 +08:00 · 2026-02-09 14:24:59 +08:00
112 changed files with 7632 additions and 1111 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -208,7 +208,17 @@ quick_start.py
 result_multi_apps_pengxiang_transformers12evaluation_examples/settings/proxy/dataimpulse.json
 evaluation_examples/settings/proxy/dataimpulse.json

+# Benchmark input data (large binary files - share via cloud storage or Git LFS)
+evaluation_examples/inputs/
+
+# Temporary data processing workspace (scraped docs, intermediate scripts)
+evaluation_examples/sandbox/
+
+# Image cache
+evaluation_examples/inputs/.img_cache/
+
 # Local test configurations (not for public repo)
 evaluation_examples/spiderman.json
 evaluation_examples/test_50_random_proportional.json
 evaluation_examples/test_chrome.json
+evaluation_examples/prepare_input_files.py
--- a/desktop_env/desktop_env.py
+++ b/desktop_env/desktop_env.py
@@ -20,42 +20,42 @@ Metric = Callable[[Any, Any], float]
 Getter = Callable[[gym.Env, Dict[str, Any]], Any]

 MAX_RETRIES = 5 # Maximum retries for environment setup
-            
+


 def _fix_pyautogui_less_than_bug(command: str) -> str:
    """
    Fix PyAutoGUI '<' character bug by converting it to hotkey("shift", ',') calls.
-    
+
    This fixes the known PyAutoGUI issue where typing '<' produces '>' instead.
    References:
    - https://github.com/asweigart/pyautogui/issues/198
    - https://github.com/xlang-ai/OSWorld/issues/257
-    
+
    Args:
        command (str): The original pyautogui command
-        
+
    Returns:
        str: The fixed command with '<' characters handled properly
    """
-    # Pattern to match press('<') or press('\u003c') calls  
+    # Pattern to match press('<') or press('\u003c') calls
    press_pattern = r'pyautogui\.press\(["\'](?:<|\\u003c)["\']\)'

    # Handle press('<') calls
    def replace_press_less_than(match):
        return 'pyautogui.hotkey("shift", ",")'
-    
+
    # First handle press('<') calls
    command = re.sub(press_pattern, replace_press_less_than, command)

    # Pattern to match typewrite calls with quoted strings
    typewrite_pattern = r'pyautogui\.typewrite\((["\'])(.*?)\1\)'
-    
+
    # Then handle typewrite calls
    def process_typewrite_match(match):
        quote_char = match.group(1)
        content = match.group(2)
-        
+
        # Preprocess: Try to decode Unicode escapes like \u003c to actual '<'
        # This handles cases where '<' is represented as escaped Unicode
        try:
@@ -65,15 +65,15 @@ def _fix_pyautogui_less_than_bug(command: str) -> str:
        except UnicodeDecodeError:
            # If decoding fails, proceed with original content to avoid breaking existing logic
            pass  # Graceful degradation: fall back to original content if decoding fails
-        
+
        # Check if content contains '<'
        if '<' not in content:
            return match.group(0)
-        
+
        # Split by '<' and rebuild
        parts = content.split('<')
        result_parts = []
-        
+
        for i, part in enumerate(parts):
            if i == 0:
                # First part
@@ -84,11 +84,11 @@ def _fix_pyautogui_less_than_bug(command: str) -> str:
                result_parts.append('pyautogui.hotkey("shift", ",")')
                if part:
                    result_parts.append(f"pyautogui.typewrite({quote_char}{part}{quote_char})")
-        
+
        return '; '.join(result_parts)
-    
+
    command = re.sub(typewrite_pattern, process_typewrite_match, command)
-    
+
    return command


@@ -101,7 +101,7 @@ class DesktopEnv(gym.Env):
            provider_name: str = "vmware",
            region: str = None,
            path_to_vm: str = None,
-            snapshot_name: str = "init_state",
+            snapshot_name: str = "snapshot",
            action_space: str = "pyautogui",
            cache_dir: str = "cache",
            screen_size: Tuple[int] = (int(os.environ.get("SCREEN_WIDTH", 1920)), int(os.environ.get("SCREEN_HEIGHT", 1080))),
@@ -111,13 +111,14 @@ class DesktopEnv(gym.Env):
            os_type: str = "Ubuntu",
            enable_proxy: bool = False,
            client_password: str = "",
+            eval_model: str = "gpt-5.2-chat-latest"
    ):
        """
        Args:
            provider_name (str): virtualization provider name, default to "vmware"
            region (str): the region in which to allocate machines (for cloud services), default to "us-east-1"
            path_to_vm (str): path to .vmx file
-            snapshot_name (str): snapshot name to revert to, default to "init_state"
+            snapshot_name (str): snapshot name to revert to, default to "snapshot"
            action_space (str): "computer_13" | "pyautogui"
            cache_dir (str): cache directory to cache task-related stuffs like
              reference file for evaluation
@@ -127,6 +128,7 @@ class DesktopEnv(gym.Env):
            require_terminal (bool): whether to require terminal output
            os_type (str): operating system type, default to "Ubuntu"
            enable_proxy (bool): whether to enable proxy support, default to False
+            eval_model (str): evaluation model to use, default to "gpt-5.2-chat-latest"
        """
        # Initialize VM manager and virtualization provider
        self.region = region
@@ -143,12 +145,12 @@ class DesktopEnv(gym.Env):
        self.screen_width = screen_size[0]
        self.screen_height = screen_size[1]

-        # Default 
+        # Default
        self.server_port = 5000
        self.chromium_port = 9222
        self.vnc_port = 8006
        self.vlc_port = 8080
-        
+
        # Initialize with default (no proxy) provider
        self.current_use_proxy = False
        self.manager, self.provider = create_vm_manager_and_provider(provider_name, region, use_proxy=False)
@@ -171,7 +173,7 @@ class DesktopEnv(gym.Env):
                if provider_name in {"vmware", "virtualbox"} else path_to_vm
        else:
            self.path_to_vm = self.manager.get_vm_path(os_type=self.os_type, region=region, screen_size=(self.screen_width, self.screen_height))
-        
+
        self.snapshot_name = snapshot_name
        self.cache_dir_base: str = cache_dir
        # todo: add the logic to get the screen size from the VM
@@ -179,6 +181,9 @@ class DesktopEnv(gym.Env):
        self.require_a11y_tree = require_a11y_tree
        self.require_terminal = require_terminal

+        # Evaluation model
+        self.eval_model = eval_model
+
        # Initialize emulator and controller
        logger.info("Initializing...")
        self._start_emulator()
@@ -224,8 +229,8 @@ class DesktopEnv(gym.Env):
        # due to the fact it could be changed when implemented by cloud services
        path_to_vm = self.provider.revert_to_snapshot(self.path_to_vm, self.snapshot_name)
        if path_to_vm and not path_to_vm == self.path_to_vm:
-            # path_to_vm has to be a new path 
-            
+            # path_to_vm has to be a new path
+
            self.manager.delete_vm(self.path_to_vm, self.region)
            self.manager.add_vm(path_to_vm, self.region)
            self.manager.occupy_vm(path_to_vm, os.getpid(), self.region)
@@ -240,7 +245,7 @@ class DesktopEnv(gym.Env):
        self.provider.stop_emulator(self.path_to_vm)

    def reset(self, task_config: Optional[Dict[str, Any]] = None, seed=None, options=None) -> Dict[str, Any]:
-        
+
        # Reset to certain task in OSWorld
        logger.info("Resetting environment...")
        logger.info("Switching task...")
@@ -253,19 +258,19 @@ class DesktopEnv(gym.Env):
            # Only revert to snapshot if environment has been used (step/setup)
            # This optimization is especially important for cloud providers like AWS
            # where unnecessary snapshot operations are costly and time-consuming
-            
+
            if task_config is not None:
                # Only consider task proxy requirement if proxy is enabled at system level
                task_use_proxy = task_config.get("proxy", False) and self.enable_proxy
                if not self.enable_proxy and task_config.get("proxy", False):
                    logger.info("Task requires proxy but proxy is disabled at system level, ignoring proxy requirement.")
-                
+
                if task_use_proxy != self.current_use_proxy:
                    # keep because get_info_from_website depend on this
                    self.current_use_proxy = task_use_proxy
-            
+
            if self.is_environment_used:
-                logger.info("Environment has been used, reverting to snapshot {}...".format(self.snapshot_name))
+                logger.info("Environment has been used, reverting to snapshot: {}...".format(self.snapshot_name))
                self._revert_to_snapshot()
                logger.info("Starting emulator...")
                self._start_emulator()
@@ -297,7 +302,7 @@ class DesktopEnv(gym.Env):
                    time.sleep(5)
            else:
                break
-            
+
        logger.info("Environment setup complete.")

        observation = self._get_obs()
@@ -328,7 +333,8 @@ class DesktopEnv(gym.Env):
        os.makedirs(self.cache_dir, exist_ok=True)
        self.instruction = task_config["instruction"]
        self.config = task_config["config"] if "config" in task_config else []
-        
+        self.metadata = task_config.get("metadata", {})
+
        self._set_evaluator_info(task_config)

    def _set_evaluator_info(self, task_config: Dict[str, Any]):
@@ -381,7 +387,7 @@ class DesktopEnv(gym.Env):
    def step(self, action, pause=2):
        self._step_no += 1
        self.action_history.append(action)
-        
+
        # Mark environment as used when step is called
        self.is_environment_used = True

@@ -402,6 +408,7 @@ class DesktopEnv(gym.Env):

        if self.action_space == "computer_13":
            # the set of all possible actions defined in the action representation
+            logger.info(f"======executing here======{self.action_space}========================")
            self.controller.execute_action(action)
        elif self.action_space == "pyautogui" or self.action_space == "claude_computer_use":
            if action in ['WAIT', 'FAIL', 'DONE'] or (type(action) == dict and action.get('action_type') in ['WAIT', 'FAIL', 'DONE']):
@@ -411,6 +418,8 @@ class DesktopEnv(gym.Env):
                if type(action) == str:
                    # Fix PyAutoGUI '<' character bug before execution
                    fixed_command = _fix_pyautogui_less_than_bug(action)
+                    logger.info(f"======executing here======{self.action_space}========================")
+                    logger.info(f"Fixed command: {fixed_command}")
                    self.controller.execute_python_command(fixed_command)
                elif type(action) == dict:
                    # Fix PyAutoGUI '<' character bug before execution
@@ -422,7 +431,7 @@ class DesktopEnv(gym.Env):

        return observation, reward, done, info

-    def evaluate(self):
+    def evaluate(self, result_dir: Optional[str] = None):
        """
        Evaluate whether the task is successfully completed.
        """
@@ -445,6 +454,24 @@ class DesktopEnv(gym.Env):
                if last_action == "FAIL" or (type(last_action) == dict and last_action.get('action_type') == 'FAIL'):
                    return 0

+        if self.evaluator['func'] == "vllm_eval":
+            logger.info("Preparing vllm_eval metric options...")
+            screenshot_bytes = self.controller.get_screenshot()
+
+            import base64
+            self.metric_options["instruction"] = self.instruction
+            self.metric_options["eval_model"] = self.eval_model
+
+            # Pass pre-configured environment info and expected steps
+            self.metric_options["config"] = self.config
+            self.metric_options["metadata"] = self.metadata
+
+            if result_dir:
+                self.metric_options["result_dir"] = result_dir
+                logger.info(f"Using result_dir for vllm_eval: {result_dir}")
+
+            logger.info(f"Evaluation options prepared: {self.metric_options.keys()}")
+
        if type(self.metric) == list:
            # Multiple metrics to evaluate whether the task is successfully completed
            results = []
@@ -452,13 +479,18 @@ class DesktopEnv(gym.Env):
            if "expected" in self.evaluator:
                assert len(self.metric) == len(self.expected_getter), "The number of metrics and expected getters must be the same"
            for idx, metric in enumerate(self.metric):
-                try:
-                    config = self.evaluator["result"][idx]
-                    result_state = self.result_getter[idx](self, config)
-                except FileNotFoundError:
-                    logger.error("File not found!")
-                    if self.metric_conj == 'and':
-                        return 0
+                # Skip result state extraction if result_getter is None (e.g., for vllm_eval)
+                if self.result_getter[idx] is not None:
+                    try:
+                        config = self.evaluator["result"][idx]
+                        result_state = self.result_getter[idx](self, config)
+                    except FileNotFoundError:
+                        logger.error("File not found!")
+                        if self.metric_conj == 'and':
+                            return 0
+                else:
+                    # For evaluators that don't need result state (e.g., vllm_eval)
+                    result_state = None

                if "expected" in self.evaluator and self.expected_getter and self.evaluator["expected"]:
                    expected_state = self.expected_getter[idx](self, self.evaluator["expected"][idx])
@@ -476,11 +508,16 @@ class DesktopEnv(gym.Env):
            return sum(results) / len(results) if self.metric_conj == 'and' else max(results)
        else:
            # Single metric to evaluate whether the task is successfully completed
-            try:
-                result_state = self.result_getter(self, self.evaluator["result"])
-            except FileNotFoundError:
-                logger.error("File not found!")
-                return 0
+            # For evaluators like vllm_eval that don't need result_getter, skip result state extraction
+            if self.result_getter is not None:
+                try:
+                    result_state = self.result_getter(self, self.evaluator["result"])
+                except FileNotFoundError:
+                    logger.error("File not found!")
+                    return 0
+            else:
+                # For evaluators that don't need result state (e.g., vllm_eval)
+                result_state = None

            if "expected" in self.evaluator and self.expected_getter and self.evaluator["expected"]:
                expected_state = self.expected_getter(self, self.evaluator["expected"])
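
The last two hunks make the result getter optional, so that metrics such as `vllm_eval` can run without extracting a result state. A minimal, standalone sketch of that control flow for the single-metric branch follows; `run_metric`, `env`, and the toy metrics are hypothetical names used only for illustration, and the sketch simplifies the real `DesktopEnv.evaluate`.

```python
from typing import Any, Callable, Dict, Optional


def run_metric(
    metric: Callable[..., float],
    result_getter: Optional[Callable[[Any, Dict[str, Any]], Any]],
    env: Any,
    result_config: Optional[Dict[str, Any]],
    options: Dict[str, Any],
) -> float:
    """Run one metric, skipping state extraction when no getter is configured."""
    if result_getter is not None:
        try:
            result_state = result_getter(env, result_config or {})
        except FileNotFoundError:
            # Mirrors the single-metric branch: a missing result file scores zero.
            return 0.0
    else:
        # Metrics that judge from screenshots need no result state.
        result_state = None
    return float(metric(result_state, **options))


def exact_match(result_state, expected=None, **_):
    """Toy metric that compares the extracted state with an expected value."""
    return 1.0 if result_state == expected else 0.0


def screenshot_judge(result_state, instruction="", **_):
    """Toy stand-in for a getter-less metric such as vllm_eval."""
    return 1.0 if result_state is None and instruction else 0.0


def missing_file_getter(env, config):
    """Toy getter that always fails, to exercise the error path."""
    raise FileNotFoundError(config.get("path", ""))
```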
--- a/desktop_env/desktop_env_os_symphony.py
+++ b/desktop_env/desktop_env_os_symphony.py
@@ -151,10 +151,9 @@ class DesktopEnv(gym.Env):
        
        # Initialize with default (no proxy) provider
        self.current_use_proxy = False
-        # self.manager, self.provider = create_vm_manager_and_provider(provider_name, region, use_proxy=False)
        self.manager, self.provider = None, None
        self.os_type = os_type
-
+        self.path_to_vm = path_to_vm
        # Track whether environment has been used (step/setup) to optimize snapshot revert
        # docker, aws, gcp, azure are always unused as the emulator starts from a clean state
        # vmware, virtualbox are always used as the emulator starts from a dirty state
@@ -165,24 +164,12 @@ class DesktopEnv(gym.Env):
        else:
            raise ValueError(f"Invalid provider name: {self.provider_name}")

-        # Initialize environment variables
-        if path_to_vm:
-            self.path_to_vm = os.path.abspath(os.path.expandvars(os.path.expanduser(path_to_vm))) \
-                if provider_name in {"vmware", "virtualbox"} else path_to_vm
-        else:
-            self.path_to_vm = self.manager.get_vm_path(os_type=self.os_type, region=region, screen_size=(self.screen_width, self.screen_height))
-        
        self.snapshot_name = snapshot_name
        self.cache_dir_base: str = cache_dir
-        # todo: add the logic to get the screen size from the VM
        self.headless = headless
        self.require_a11y_tree = require_a11y_tree
        self.require_terminal = require_terminal

-        # Initialize emulator and controller
-        # logger.info("Initializing...")
-        # self._start_emulator()
-
        # mode: human or machine
        self.instruction = None
        assert action_space in ["computer_13", "pyautogui", "claude_computer_use", "autoglm_computer_use"]
@@ -199,11 +186,13 @@ class DesktopEnv(gym.Env):
        if not self.manager and not self.provider:
            logger.info("Initializing...")
            self.manager, self.provider = create_vm_manager_and_provider(self.provider_name, self.region, use_proxy=False)
+
            if self.path_to_vm:
                self.path_to_vm = os.path.abspath(os.path.expandvars(os.path.expanduser(self.path_to_vm))) \
                    if self.provider_name in {"vmware", "virtualbox"} else self.path_to_vm
            else:
                self.path_to_vm = self.manager.get_vm_path(os_type=self.os_type, region=self.region, screen_size=(self.screen_width, self.screen_height))
+
            self._start_emulator()

    def _start_emulator(self):
@@ -344,6 +333,8 @@ class DesktopEnv(gym.Env):

    def _set_evaluator_info(self, task_config: Dict[str, Any]):
        """Set evaluator information from task config"""
+        if "evaluator" not in task_config:
+            return
        # evaluator dict
        # func -> metric function string, or list of metric function strings
        # conj -> conjunction of multiple metrics if func is a list with length > 1, "and"/"or"
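
The changes to this file defer creation of the VM manager and provider, and resolution of the VM path, from the constructor to a later initialization step. The pattern can be sketched as follows; `LazyEnv`, `_ensure_started`, the `factory` argument, and the choice of `reset` as the trigger are hypothetical stand-ins rather than names or behavior confirmed by the repository.

```python
import os
from typing import Callable, Optional, Tuple


class LazyEnv:
    """Stores configuration at construction and starts the backend on demand."""

    def __init__(
        self,
        provider_name: str,
        path_to_vm: Optional[str],
        factory: Callable[[str], Tuple[object, object]],
    ):
        self.provider_name = provider_name
        self.path_to_vm = path_to_vm  # kept as given; resolved on first use
        self._factory = factory       # assumed to return (manager, provider)
        self.manager, self.provider = None, None

    def _ensure_started(self) -> None:
        if self.manager or self.provider:
            return  # already initialized, nothing to do
        self.manager, self.provider = self._factory(self.provider_name)
        if self.path_to_vm and self.provider_name in {"vmware", "virtualbox"}:
            # Local providers need an absolute, fully expanded path.
            self.path_to_vm = os.path.abspath(
                os.path.expandvars(os.path.expanduser(self.path_to_vm))
            )

    def reset(self) -> None:
        # Assumed trigger: the first reset performs the deferred initialization.
        self._ensure_started()
```

Constructing the object is then cheap and free of side effects, which helps when many environments are created up front but only some are ever started.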
--- a/desktop_env/evaluators/metrics/init.py
+++ b/desktop_env/evaluators/metrics/init.py
@@ -158,3 +158,5 @@ from .vscode import (

 def infeasible():
    pass
+
+from .vllm_eval import vllm_eval
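
The `vllm_eval` metric exported here picks an LLM backend from the model-name prefix and omits `top_p` for `gpt-5*` models, as the new module that follows shows. A condensed, dependency-free sketch of that routing is given below; `provider_for` and `supported_params` are illustrative names, no API client is created, and raising on an unknown prefix inside `supported_params` is a simplification (in the diff the error is raised when the client is initialized).

```python
from typing import Any, Dict

# Prefix-to-provider table mirroring the branches in UnifiedLLM.__init__.
_PREFIXES = (("gpt", "openai"), ("claude", "anthropic"), ("gemini", "gemini"))


def provider_for(model: str) -> str:
    """Return the provider name for a model, or "unknown" if no prefix matches."""
    for prefix, provider in _PREFIXES:
        if model.startswith(prefix):
            return provider
    return "unknown"


def supported_params(
    model: str,
    temperature: float = 0.7,
    max_tokens: int = 16384,
    top_p: float = 1.0,
) -> Dict[str, Any]:
    """Build sampling parameters, leaving out top_p where the diff leaves it out."""
    provider = provider_for(model)
    if provider == "unknown":
        raise ValueError(f"Unsupported LLM provider for model: {model}")
    params: Dict[str, Any] = {"temperature": temperature, "max_tokens": max_tokens}
    # top_p is omitted only for OpenAI models whose name starts with "gpt-5".
    if not (provider == "openai" and model.startswith("gpt-5")):
        params["top_p"] = top_p
    return params
```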
--- a/desktop_env/evaluators/metrics/vllm_eval.py
+++ b/desktop_env/evaluators/metrics/vllm_eval.py
@@ -0,0 +1,600 @@
+import os
+from typing import Optional, List, Dict, Any
+from dotenv import load_dotenv
+import logging
+import base64
+import glob
+from io import BytesIO
+from PIL import Image
+
+logger = logging.getLogger("desktopenv.vllm_eval")
+load_dotenv()
+
+
+def _compress_image(img_b64: str, max_size: int = 800, quality: int = 85) -> str:
+    """
+    Compress base64 encoded image to reduce size
+
+    Args:
+        img_b64: Base64 encoded image string
+        max_size: Maximum dimension (width or height) in pixels
+        quality: JPEG quality (1-100), lower means smaller file size
+
+    Returns:
+        Compressed base64 encoded image string
+    """
+    try:
+        # Decode base64 to image
+        img_data = base64.b64decode(img_b64)
+        img = Image.open(BytesIO(img_data))
+
+        # Convert to RGB if necessary (for PNG with transparency)
+        if img.mode in ('RGBA', 'LA', 'P'):
+            background = Image.new('RGB', img.size, (255, 255, 255))
+            if img.mode == 'P':
+                img = img.convert('RGBA')
+            background.paste(img, mask=img.split()[-1] if img.mode in ('RGBA', 'LA') else None)
+            img = background
+
+        # Resize if image is too large
+        original_size = img.size
+        if max(img.size) > max_size:
+            ratio = max_size / max(img.size)
+            new_size = tuple(int(dim * ratio) for dim in img.size)
+            img = img.resize(new_size, Image.Resampling.LANCZOS)
+            logger.info(f"Resized image from {original_size} to {new_size}")
+
+        # Compress to JPEG
+        buffer = BytesIO()
+        img.save(buffer, format='JPEG', quality=quality, optimize=True)
+        compressed_data = buffer.getvalue()
+
+        # Encode back to base64
+        compressed_b64 = base64.b64encode(compressed_data).decode('utf-8')
+
+        # Log compression ratio
+        original_size_kb = len(img_b64) * 3 / 4 / 1024  # base64 to bytes to KB
+        compressed_size_kb = len(compressed_b64) * 3 / 4 / 1024
+        compression_ratio = (1 - compressed_size_kb / original_size_kb) * 100
+        logger.info(f"Compressed image: {original_size_kb:.1f}KB -> {compressed_size_kb:.1f}KB ({compression_ratio:.1f}% reduction)")
+
+        return compressed_b64
+
+    except Exception as e:
+        logger.warning(f"Failed to compress image: {e}, using original")
+        return img_b64
+
+
+class UnifiedLLM:
+
+    def __init__(self, model: str):
+        if model.startswith("gpt"):
+            self.provider = "openai"
+        elif model.startswith("claude"):
+            self.provider = "anthropic"
+        elif model.startswith("gemini"):
+            self.provider = "gemini"
+        else:
+            self.provider = "unknown"
+
+        self.model = model
+        self.client = self._init_client()
+
+    def _init_client(self):
+        """Initialize client"""
+        if self.provider == "openai":
+            from openai import OpenAI
+            return OpenAI(
+                base_url=os.getenv("OPENAI_BASE_URL"),
+                api_key=os.getenv("OPENAI_API_KEY")
+            )
+
+        elif self.provider == "anthropic":
+            from anthropic import Anthropic
+            return Anthropic(
+                base_url=os.getenv("ANTHROPIC_BASE_URL"),
+                api_key=os.getenv("ANTHROPIC_API_KEY")
+            )
+
+        elif self.provider == "gemini":
+            logger.warning("Using Google Gemini model, make sure your internet connection is working.")
+            import google.generativeai as genai
+            genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
+            return genai.GenerativeModel(self.model)
+
+        else:
+            logger.error(f"Unsupported LLM provider for model: {self.model}")
+            raise ValueError(f"Unsupported LLM provider for model: {self.model}")
+
+    def _get_supported_params(self, temperature: float, max_tokens: int, top_p: float) -> Dict[str, Any]:
+        """Get supported parameters for each provider"""
+        base_params = {
+            "temperature": temperature,
+            "max_tokens": max_tokens
+        }
+
+        # GPT-5.2 and newer models may not support top_p
+        if self.provider == "openai":
+            # Only add top_p for older models
+            if not self.model.startswith("gpt-5"):
+                base_params["top_p"] = top_p
+        elif self.provider == "anthropic":
+            base_params["top_p"] = top_p
+        elif self.provider == "gemini":
+            base_params["top_p"] = top_p
+
+        return base_params
+
+    def generate(
+        self,
+        prompt: str,
+        temperature: float = 0.7,
+        max_tokens: int = 16384,
+        top_p: float = 1.0,
+        **kwargs
+    ) -> str:
+        """
+        Args:
+            prompt: Input prompt
+            temperature: Temperature (0.0-2.0)
+            max_tokens: Maximum number of tokens
+            top_p: Top-p sampling (0.0-1.0)
+
+        Returns:
+            Generated text
+        """
+        params = self._get_supported_params(temperature, max_tokens, top_p)
+
+        if self.provider == "openai":
+            try:
+                response = self.client.chat.completions.create(
+                    model=self.model,
+                    messages=[{"role": "user", "content": prompt}],
+                    **params
+                )
+                return response.choices[0].message.content
+            except Exception as e:
+                logger.error(f"OpenAI API error: {e}")
+                raise e
+
+        elif self.provider == "anthropic":
+            try:
+                response = self.client.messages.create(
+                    model=self.model,
+                    messages=[{"role": "user", "content": prompt}],
+                    **params
+                )
+                return response.content[0].text
+            except Exception as e:
+                logger.error(f"Anthropic API error: {e}")
+                raise e
+
+        elif self.provider == "gemini":
+            try:
+                import google.generativeai as genai
+                config = genai.GenerationConfig(
+                    temperature=params["temperature"],
+                    max_output_tokens=params["max_tokens"],
+                    top_p=params.get("top_p", 1.0)
+                )
+                response = self.client.generate_content(prompt, generation_config=config)
+                return response.text
+            except Exception as e:
+                logger.error(f"Gemini API error: {e}")
+                raise e
+
+    def generate_with_images(
+        self,
+        prompt: str,
+        images_b64: List[str],
+        temperature: float = 0.7,
+        max_tokens: int = 16384,
+        top_p: float = 1.0,
+        **kwargs
+    ) -> str:
+        """
+        Generate with multiple images in a single request
+
+        Args:
+            prompt: Instruction prompt
+            images_b64: List of base64 encoded images
+            temperature: Temperature (0.0-2.0)
+            max_tokens: Maximum number of tokens
+            top_p: Top-p sampling (0.0-1.0)
+
+        Returns:
+            Generated text
+        """
+        if not images_b64:
+            logger.warning("No images provided, falling back to text-only generation")
+            return self.generate(prompt, temperature, max_tokens, top_p, **kwargs)
+
+        params = self._get_supported_params(temperature, max_tokens, top_p)
+
+        if self.provider == "openai":
+            # Build content with text and all images
+            content = [{"type": "text", "text": prompt}]
+
+            for img_b64 in images_b64:
+                content.append({
+                    "type": "image_url",
+                    "image_url": {
+                        "url": f"data:image/jpeg;base64,{img_b64}"
+                    }
+                })
+
+            try:
+                response = self.client.chat.completions.create(
+                    model=self.model,
+                    messages=[{"role": "user", "content": content}],
+                    **params
+                )
+                return response.choices[0].message.content
+            except Exception as e:
+                logger.error(f"OpenAI API error: {e}")
+                raise e
+
+        elif self.provider == "anthropic":
+            # Build content with text and all images
+            content = [{"type": "text", "text": prompt}]
+
+            for img_b64 in images_b64:
+                content.append({
+                    "type": "image",
+                    "source": {
+                        "type": "base64",
+                        "media_type": "image/jpeg",
+                        "data": img_b64
+                    }
+                })
+
+            try:
+                response = self.client.messages.create(
+                    model=self.model,
+                    messages=[{"role": "user", "content": content}],
+                    **params
+                )
+                return response.content[0].text
+            except Exception as e:
+                logger.error(f"Anthropic API error: {e}")
+                raise e
+
+        elif self.provider == "gemini":
+            import google.generativeai as genai
+
+            config = genai.GenerationConfig(
+                temperature=params["temperature"],
+                max_output_tokens=params["max_tokens"],
+                top_p=params.get("top_p", 1.0)
+            )
+
+            # Build content parts
+            content_parts = [prompt]
+
+            for img_b64 in images_b64:
+                img_data = base64.b64decode(img_b64)
+                img = Image.open(BytesIO(img_data))
+                content_parts.append(img)
+
+            try:
+                response = self.client.generate_content(content_parts, generation_config=config)
+                return response.text
+            except Exception as e:
+                logger.error(f"Gemini API error: {e}")
+                raise e
+
+        else:
+            raise ValueError(f"Unsupported provider: {self.provider}")
+
+
+def _load_screenshots_from_dir(result_dir: str, compress: bool = True, max_size: int = 800, quality: int = 85) -> tuple:
+    """
+    Load all step screenshots from result directory and convert to base64
+
+    Args:
+        result_dir: Path to result directory containing step_*.png files
+        compress: Whether to compress images (default: True)
+        max_size: Maximum dimension for compression (default: 800)
+        quality: JPEG quality for compression (default: 85)
+
+    Returns:
+        Tuple of (list of base64 encoded screenshot strings, list of short filenames like 'step_1', 'step_2', ...)
+    """
+    screenshots = []
+    filenames = []
+
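+    import re as _re
+
+    def _step_index(path: str) -> int:
+        # Numeric step index, so that step_10 sorts after step_9 rather than before step_1
+        m = _re.match(r'step_(\d+)', os.path.basename(path))
+        return int(m.group(1)) if m else -1
+
+    # Find all step screenshot files (e.g., step_1_20240101@120000.png) in step order
+    pattern = os.path.join(result_dir, "step_*.png")
+    screenshot_files = sorted(glob.glob(pattern), key=lambda p: (_step_index(p), p))
+
+    if not screenshot_files:
+        logger.warning(f"No screenshot files found in {result_dir}")
+        return screenshots, filenames
+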
+    for filepath in screenshot_files:
+        try:
+            with open(filepath, "rb") as f:
+                img_data = f.read()
+                img_b64 = base64.b64encode(img_data).decode('utf-8')
+
+                # Compress if enabled
+                if compress:
+                    img_b64 = _compress_image(img_b64, max_size=max_size, quality=quality)
+
+                screenshots.append(img_b64)
+                # Extract short name like 'step_1' from 'step_1_20240101@120000.png'
+                basename = os.path.basename(filepath)
+                match = _re.match(r'(step_\d+)', basename)
+                short_name = match.group(1) if match else basename
+                filenames.append(short_name)
+        except Exception as e:
+            logger.error(f"Error loading screenshot {filepath}: {e}")
+
+    logger.info(f"Loaded {len(screenshots)} screenshots from {result_dir}: {filenames}")
+    return screenshots, filenames
+
+
+def vllm_eval(result_state, **options) -> float:
+    """
+    Evaluate task completion using vision-language model
+
+    Args:
+        result_state: Current state description
+        **options: Additional options including:
+            - result_dir: Path to result directory containing step screenshots (recommended)
+            - screenshots: List of base64 encoded screenshots (deprecated, use result_dir instead)
+            - instruction: Task instruction
+            - eval_model: Model name to use
+            - compress_images: Whether to compress images (default: True)
+            - max_image_size: Maximum image dimension for compression (default: 800)
+            - image_quality: JPEG quality for compression (default: 85)
+            - temperature: Temperature parameter
+            - max_tokens: Maximum tokens
+            - top_p: Top-p parameter
+
+    Returns:
+        Score between 0.0 and 1.0
+    """
+    # Try to load screenshots from result_dir if provided
+    result_dir = options.get("result_dir", None)
+    screenshots = options.get("screenshots", [])
+
+    # Image compression options
+    compress_images = options.get("compress_images", True)
+    max_image_size = options.get("max_image_size", 800)
+    image_quality = options.get("image_quality", 85)
+
+    screenshot_filenames = []  # Short names like 'step_1', 'step_2', ...
+
+    if result_dir and not screenshots:
+        screenshots, screenshot_filenames = _load_screenshots_from_dir(
+            result_dir,
+            compress=compress_images,
+            max_size=max_image_size,
+            quality=image_quality
+        )
+        logger.info(f"Loaded {len(screenshots)} screenshots from result_dir: {result_dir}")
+    elif screenshots:
+        logger.info(f"Using {len(screenshots)} screenshots from options")
+        screenshot_filenames = [f"step_{i+1}" for i in range(len(screenshots))]
+        # Compress screenshots if needed
+        if compress_images:
+            logger.info("Compressing provided screenshots...")
+            screenshots = [_compress_image(img, max_size=max_image_size, quality=image_quality) for img in screenshots]
+
+    instruction = options.get("instruction", "")
+    eval_model = options.get("eval_model", "gpt-4-vision-preview")
+    config = options.get("config", [])
+    metadata = options.get("metadata", {})
+
+    params = {
+        "temperature": options.get("temperature", 0.7),
+        "max_tokens": options.get("max_tokens", 16384),
+        "top_p": options.get("top_p", 1.0)
+    }
+
+    llm = UnifiedLLM(eval_model)
+
+    # Build pre-configured environment description from config
+    preconfig_items = []
+    for cfg in config:
+        if cfg.get("type") == "launch":
+            cmds = cfg.get("parameters", {}).get("command", [])
+            if cmds:  # split on both separators: os.path.basename ignores '\\' on a POSIX host
+                app_name = str(cmds[0]).replace("\\", "/").rsplit("/", 1)[-1]
+                preconfig_items.append(f"Application '{app_name}' was automatically launched before the agent started.")
+        elif cfg.get("type") == "sleep":
+            pass  # not relevant to scoring
+        elif cfg.get("type") == "open":
+            path = cfg.get("parameters", {}).get("path", "")
+            preconfig_items.append(f"File/URL '{path}' was automatically opened before the agent started.")
+
+    preconfig_section = ""
+    if preconfig_items:
+        preconfig_desc = "\n".join(f"  - {item}" for item in preconfig_items)
+        preconfig_section = f"""
+PRE-CONFIGURED ENVIRONMENT (done BEFORE the agent started, NOT the agent's work):
+{preconfig_desc}
+IMPORTANT: The above actions were performed automatically as part of environment setup. The agent did NOT perform these actions. Do NOT give ANY credit for them. For example, if the application was pre-launched, the agent merely having the application open is worth 0 points - that was the starting state."""
+
+    # Build expected steps section from metadata
+    expected_steps_section = ""
+    if metadata.get("steps"):
+        expected_steps_section = f"""
+EXPECTED STEPS for this task (use as reference for what the agent should have done):
+{metadata['steps']}
+NOTE: Evaluate the screenshots against these expected steps. Only give credit for steps that show VISIBLE evidence of completion BEYOND the pre-configured starting state."""
+
+    # Build image list description for the prompt
+    if screenshot_filenames:
+        img_list_str = ", ".join(screenshot_filenames)
+        img_info = f"""\nYou are provided with exactly {len(screenshot_filenames)} screenshots in chronological order: {img_list_str}
+The FIRST screenshot is: {screenshot_filenames[0]}
+The LAST screenshot (final state): {screenshot_filenames[-1]}
+IMPORTANT: Only reference screenshots from the list above. Do NOT reference any screenshot that is not listed."""
+    else:
+        img_info = "\nNo screenshots were provided."
+
+    prompt = f"""You are a STRICT and RIGOROUS evaluator for desktop environment tasks. Your job is to score ONLY based on concrete, visible evidence of task completion in the screenshots.
+
+Task Instruction: {instruction}
+{preconfig_section}
+{expected_steps_section}
+{img_info}
+
+Analyze ONLY the FINAL screenshot ({screenshot_filenames[-1] if screenshot_filenames else 'N/A'}) to determine the end state, while using earlier screenshots for context.
+
+CRITICAL SCORING RULES:
+1. Score ONLY based on what the AGENT actually accomplished. The pre-configured environment (application already launched, files already opened, etc.) is the STARTING STATE and worth 0 points.
+2. Score ONLY based on what is ACTUALLY VISIBLE in the screenshots. Do NOT give credit for assumed or potential progress.
+3. If the screenshots show NO meaningful action beyond the initial pre-configured state, the score MUST be 0.
+4. Do NOT give partial credit for "having the system on", "desktop being visible", "the application being open" (if it was pre-launched), or "the application being installed". These are prerequisites or pre-configured state, NOT progress.
+5. Each point must correspond to a SPECIFIC, VERIFIABLE action that was successfully completed BY THE AGENT toward the task goal.
+
+SCORING GUIDE (0-10):
+- 0: No progress beyond the pre-configured starting state. If the app was pre-launched, merely having it open is 0. If the screenshots only show the desktop or the initial app state without any agent action, score is 0.
+- 1-2: The agent performed one minor action (e.g., clicked on a menu) but did not make meaningful progress toward the task goal.
+- 3-4: Some initial steps toward the task have been taken but the task is far from complete.
+- 5-6: Significant progress - about half the required steps are completed with visible evidence.
+- 7-8: Most steps are completed but the final result is not fully achieved or has minor issues.
+- 9: The task is essentially complete with very minor cosmetic differences.
+- 10: The task is perfectly and completely finished with clear evidence in the final screenshot.
+
+IMPORTANT: You must respond with ONLY a valid JSON object (no additional text before or after). Use the following exact format:
+
+{{
+  "steps_analysis": [
+    {{"step": "Step description", "status": "Success/Fail", "evidence_img": "step_X", "reason": "Brief explanation of VISIBLE evidence"}},
+    {{"step": "Another step", "status": "Success/Fail", "evidence_img": "step_Y", "reason": "Brief explanation of VISIBLE evidence"}}
+  ],
+  "final_completion": "True/False",
+  "score": 0-10
+}}
+
+Where:
+- "steps_analysis": Array of steps you identified from the screenshots. Each step must cite VISIBLE evidence from a specific screenshot. Do NOT include pre-configured actions as agent steps.
+- "status": Either "Success" or "Fail" for each step
+- "evidence_img": The screenshot name from the list above that shows evidence for this step (e.g., "step_2")
+- "reason": Explanation of what is VISUALLY observed in the screenshot as evidence
+- "final_completion": "True" ONLY if the overall task is fully completed with clear visual proof, "False" otherwise
+- "score": Integer from 0 to 10, following the strict scoring guide above
+
+Remember: Return ONLY the JSON object, no additional text. Be STRICT - when in doubt, score LOWER."""
+
+    try:
+        result = llm.generate_with_images(
+            prompt=prompt,
+            images_b64=screenshots,
+            **params
+        )
+
+        # Parse score from result
+        score = _parse_score(result)
+        logger.info(f"Evaluation result: {result}")
+        logger.info(f"Parsed score: {score}")
+
+        # Save raw result to file for reference
+        if result_dir:
+            eval_output_path = os.path.join(result_dir, "vllm_evaluation_result.json")
+            with open(eval_output_path, "w", encoding="utf-8") as f:
+                f.write(result)
+            logger.info(f"Saved evaluation result to {eval_output_path}")
+
+        return score
+    except Exception as e:
+        logger.error(f"Error during evaluation: {e}")
+        return 0.0
+
+
+def _parse_evaluation_response(text: str) -> Dict[str, Any]:
+    """
+    Parse the JSON evaluation response from the model
+
+    Returns:
+        Dictionary containing steps_analysis, final_completion, and score
+    """
+    import re
+    import json
+
+    # Try to extract JSON from the response
+    # Sometimes models wrap JSON in markdown code blocks
+    text = text.strip()
+
+    # Remove markdown code blocks if present
+    if text.startswith("```"):
+        # Extract content between ``` markers
+        match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
+        if match:
+            text = match.group(1)
+        else:
+            # Try to remove opening and closing ```
+            text = re.sub(r'^```(?:json)?\s*', '', text)
+            text = re.sub(r'\s*```$', '', text)
+
+    try:
+        result = json.loads(text)
+
+        # Validate required fields
+        if "steps_analysis" not in result:
+            logger.warning("Missing 'steps_analysis' field in response")
+            result["steps_analysis"] = []
+
+        if "final_completion" not in result:
+            logger.warning("Missing 'final_completion' field in response")
+            result["final_completion"] = "False"
+
+        if "score" not in result:
+            logger.warning("Missing 'score' field in response")
+            result["score"] = 0
+
+        return result
+
+    except json.JSONDecodeError as e:
+        logger.error(f"Failed to parse JSON response: {e}")
+        logger.error(f"Response text: {text[:500]}")
+
+        # Return a default structure
+        return {
+            "steps_analysis": [],
+            "final_completion": "False",
+            "score": 0
+        }
+
+
+def _parse_score(text: str) -> float:
+    """
+    Parse score from model response and convert to 0.0-1.0 range
+
+    Args:
+        text: Raw model response (expected to be JSON format)
+
+    Returns:
+        Score between 0.0 and 1.0
+    """
+    result = _parse_evaluation_response(text)
+
+    # Extract score (0-10) and convert to 0.0-1.0
+    score = result.get("score", 0)
+
+    try:
+        score = float(score)
+        # Clamp to [0, 10] (NaN, which json.loads accepts, counts as 0), then normalize to [0.0, 1.0]
+        score = 0.0 if score != score else max(0.0, min(10.0, score))
+        normalized_score = score / 10.0
+
+        logger.info(f"Final completion: {result.get('final_completion')}")
+        logger.info(f"Raw score (0-10): {score}, Normalized score (0-1): {normalized_score}")
+
+        # Log steps analysis if available
+        steps = result.get("steps_analysis", [])
+        if steps:
+            logger.info(f"Steps analysis ({len(steps)} steps):")
+            for i, step in enumerate(steps):
+                logger.info(f"  Step {i+1}: {step.get('step', 'N/A')} - {step.get('status', 'N/A')}")
+
+        return normalized_score
+
+    except (ValueError, TypeError) as e:
+        logger.warning(f"Could not parse score: {e}")
+        return 0.0
--- a/desktop_env/server/main.py
+++ b/desktop_env/server/main.py
@@ -4,25 +4,27 @@ import platform
 import shlex
 import json
 import subprocess, signal
+import sys
 import time
 from pathlib import Path
 from typing import Any, Optional, Sequence
 from typing import List, Dict, Tuple, Literal
 import concurrent.futures

-import Xlib
 import lxml.etree
 import pyautogui
 import requests
 import re
 from PIL import Image, ImageGrab
-from Xlib import display, X
 from flask import Flask, request, jsonify, send_file, abort  # , send_from_directory
 from lxml.etree import _Element

 platform_name: str = platform.system()

 if platform_name == "Linux":
+    import Xlib
+    from Xlib import display, X
+    from pyxcursor import Xcursor
    import pyatspi
    from pyatspi import Accessible, StateType, STATE_SHOWING
    from pyatspi import Action as ATAction
@@ -39,9 +41,14 @@ elif platform_name == "Windows":
    import win32ui, win32gui

    Accessible = Any
+    Xlib = None
+    display = None
+    X = None
+    Xcursor = None

 elif platform_name == "Darwin":
    import plistlib
+    from pyxcursor import Xcursor

    import AppKit
    import ApplicationServices
@@ -51,13 +58,16 @@ elif platform_name == "Darwin":

    Accessible = Any
    BaseWrapper = Any
+    Xlib = display = X = None  # X11-only names; defined so later references do not raise NameError

 else:
    # Platform not supported
    Accessible = None
    BaseWrapper = Any
-
-from pyxcursor import Xcursor
+    Xlib = None
+    display = None
+    X = None
+    Xcursor = None

 # todo: need to reformat and organize this whole file

@@ -89,6 +99,10 @@ def execute_command():
        if arg.startswith("~/"):
            command[i] = os.path.expanduser(arg)

+    # Replace 'python' with sys.executable to use the same Python interpreter as the server
+    if len(command) > 0 and command[0] in ['python', 'python3', 'python.exe', 'python3.exe']:
+        command[0] = sys.executable
+
    # Execute the command without any safety checks.
    try:
        if platform_name == "Windows":
@@ -262,15 +276,12 @@ def launch_app():

@app.route('/screenshot', methods=['GET'])
 def capture_screen_with_cursor():
-    # fixme: when running on virtual machines, the cursor is not captured, don't know why
-
    file_path = os.path.join(os.path.dirname(__file__), "screenshots", "screenshot.png")
    user_platform = platform.system()

    # Ensure the screenshots directory exists
    os.makedirs(os.path.dirname(file_path), exist_ok=True)

-    # fixme: This is a temporary fix for the cursor not being captured on Windows and Linux
    if user_platform == "Windows":
        def get_cursor():
            hcursor = win32gui.GetCursorInfo()[1]
@@ -303,19 +314,53 @@ def capture_screen_with_cursor():

        ratio = ctypes.windll.shcore.GetScaleFactorForDevice(0) / 100

+        # get logical screen size
+        user32 = ctypes.windll.user32
+        logical_width = user32.GetSystemMetrics(0)
+        logical_height = user32.GetSystemMetrics(1)
+
+        # ===== Key fix: get cursor position before taking screenshot =====
+        # win32gui.GetCursorPos() returns logical coordinates (consistent with pyautogui)
+        pos_win = win32gui.GetCursorPos()
+        logger.info(f"Cursor position (logical coordinates): {pos_win}")
+
+        # Take screenshot immediately to reduce time difference
        img = ImageGrab.grab(bbox=None, include_layered_windows=True)
+        # =============================================
+
+        # ===== DPI scaling fix =====
+        if ratio != 1.0:
+            physical_width, physical_height = img.size
+            logger.info(f"Detected DPI scaling: {ratio}x ({ratio*100}%)")
+            logger.info(f"Physical screenshot size: {physical_width}x{physical_height}")
+            logger.info(f"Logical resolution: {logical_width}x{logical_height}")
+            logger.info("Resizing screenshot to match logical resolution...")
+            img = img.resize((logical_width, logical_height), Image.Resampling.LANCZOS)
+            logger.info(f"Screenshot resized to: {img.size}")
+        # ==========================

        try:
            cursor, (hotspotx, hotspoty) = get_cursor()

-            pos_win = win32gui.GetCursorPos()
-            pos = (round(pos_win[0]*ratio - hotspotx), round(pos_win[1]*ratio - hotspoty))
+            # ===== Cursor position handling =====
+            # win32gui.GetCursorPos() and pyautogui both use logical coordinates
+            # The screenshot has been resized to logical resolution, so use directly
+            logical_cursor_x = pos_win[0]
+            logical_cursor_y = pos_win[1]
+
+            pos = (logical_cursor_x - hotspotx, logical_cursor_y - hotspoty)
+
+            logger.info(f"Cursor position (logical coordinates): ({logical_cursor_x}, {logical_cursor_y})")
+            logger.info(f"Hotspot offset: ({hotspotx}, {hotspoty})")
+            logger.info(f"Final paste position: {pos}")
+            # ===================================

            img.paste(cursor, pos, cursor)
        except Exception as e:
-            logger.warning(f"Failed to capture cursor on Windows, screenshot will not have a cursor. Error: {e}")
+            logger.warning(f"Failed to capture cursor on Windows, screenshot will not include cursor. Error: {e}")

        img.save(file_path)
+
    elif user_platform == "Linux":
        cursor_obj = Xcursor()
        imgarray = cursor_obj.getCursorImageArrayFast()
@@ -324,17 +369,19 @@ def capture_screen_with_cursor():
        cursor_x, cursor_y = pyautogui.position()
        screenshot.paste(cursor_img, (cursor_x, cursor_y), cursor_img)
        screenshot.save(file_path)
+
    elif user_platform == "Darwin":  # (Mac OS)
-        # Use the screencapture utility to capture the screen with the cursor
        subprocess.run(["screencapture", "-C", file_path])
+
    else:
        logger.warning(f"The platform you're using ({user_platform}) is not currently supported")

    return send_file(file_path, mimetype='image/png')


+
 def _has_active_terminal(desktop: Accessible) -> bool:
-    """ A quick check whether the terminal window is open and active.
+    """ A quick check whether the terminal window is open and active (Linux only).
    """
    for app in desktop:
        if app.getRoleName() == "application" and app.name == "gnome-terminal-server":
@@ -344,6 +391,87 @@ def _has_active_terminal(desktop: Accessible) -> bool:
    return False


+def _get_windows_terminal_output() -> Optional[str]:
+    """ Get terminal output on Windows platform.
+    Supports Windows Terminal, PowerShell, Command Prompt, and ConHost.
+    """
+    try:
+        from pywinauto import Desktop
+        from pywinauto.findwindows import ElementNotFoundError
+        
+        desktop = Desktop(backend="uia")
+        
+        # Common terminal applications on Windows
+        terminal_apps = [
+            "WindowsTerminal.exe",  # Windows Terminal
+            "powershell.exe",       # PowerShell
+            "pwsh.exe",             # PowerShell Core
+            "cmd.exe",              # Command Prompt
+            "conhost.exe"           # Console Host
+        ]
+        
+        # Try to find active terminal windows
+        for window in desktop.windows():
+            try:
+                # Check if window is visible and not minimized
+                if not window.is_visible() or window.is_minimized():
+                    continue
+                
+                # Window title, lower-cased (this is the title, not the process name)
+                process_name = (window.element_info.name or "").lower()
+
+                # Check if this is a terminal window: match executable names that can
+                # appear in the title (e.g. a cmd.exe path) or common terminal titles
+                title_hints = ['terminal', 'powershell', 'command prompt', 'cmd']
+                is_terminal = (
+                    any(term_app.lower() in process_name for term_app in terminal_apps)
+                    or any(hint in process_name for hint in title_hints)
+                )
+                
+                if not is_terminal:
+                    continue
+                
+                # Try to get text content from the terminal
+                # First, try to find console/edit controls that contain the output
+                try:
+                    # For Windows Terminal and modern consoles
+                    # Look for Edit or Document controls that contain the text
+                    text_controls = window.descendants(control_type="Edit")
+                    if not text_controls:
+                        text_controls = window.descendants(control_type="Document")
+                    if not text_controls:
+                        text_controls = window.descendants(control_type="Text")
+                    
+                    for control in text_controls:
+                        try:
+                            text = control.window_text()
+                            if text and len(text.strip()) > 0:
+                                return text.rstrip()
+                        except Exception:
+                            pass
+                    
+                    # If no text controls found, try to get the window text directly
+                    window_text = window.window_text()
+                    if window_text and len(window_text.strip()) > 0:
+                        # Filter out just the window title
+                        if window_text not in ['Windows PowerShell', 'Command Prompt', 'PowerShell', 'Administrator: Windows PowerShell']:
+                            return window_text.rstrip()
+                    
+                except Exception as e:
+                    logger.debug(f"Error getting text from window {process_name}: {e}")
+                    continue
+                    
+            except Exception as e:
+                logger.debug(f"Error processing window: {e}")
+                continue
+        
+        return None
+        
+    except Exception as e:
+        logger.error(f"Error in _get_windows_terminal_output: {e}")
+        return None
+
+
@app.route('/terminal', methods=['GET'])
 def get_terminal_output():
    user_platform = platform.system()
@@ -358,8 +486,10 @@ def get_terminal_output():
                xpath = '//application[@name="gnome-terminal-server"]/frame[@st:active="true"]//terminal[@st:focused="true"]'
                terminals: List[_Element] = desktop_xml.xpath(xpath, namespaces=_accessibility_ns_map_ubuntu)
                output = terminals[0].text.rstrip() if len(terminals) == 1 else None
-        else:  # windows and macos platform is not implemented currently
-            # raise NotImplementedError
+        elif user_platform == "Windows":
+            output = _get_windows_terminal_output()
+            logger.debug(f"Terminal output retrieved: {output}")
+        else:  # macOS platform is not implemented currently
            return "Currently not implemented for platform {:}.".format(platform.platform()), 500
        return jsonify({"output": output, "status": "success"})
    except Exception as e:
@@ -989,6 +1119,9 @@ def get_window_size():
    else:
        return jsonify({"error": "app_class_name is required"}), 400

+    if platform_name != "Linux":
+        return jsonify({"error": "window_size is only supported on Linux"}), 501
+
    d = display.Display()
    root = d.screen().root
    window_ids = root.get_full_property(d.intern_atom('_NET_CLIENT_LIST'), X.AnyPropertyType).value
@@ -1505,11 +1638,19 @@ def start_recording():
            logger.error(f"Error removing old recording file: {e}")
            return jsonify({'status': 'error', 'message': f'Failed to remove old recording file: {e}'}), 500

-    d = display.Display()
-    screen_width = d.screen().width_in_pixels
-    screen_height = d.screen().height_in_pixels
-
-    start_command = f"ffmpeg -y -f x11grab -draw_mouse 1 -s {screen_width}x{screen_height} -i :0.0 -c:v libx264 -r 30 {recording_path}"
+    if platform_name == "Linux":
+        d = display.Display()
+        screen_width = d.screen().width_in_pixels
+        screen_height = d.screen().height_in_pixels
+        start_command = f"ffmpeg -y -f x11grab -draw_mouse 1 -s {screen_width}x{screen_height} -i :0.0 -c:v libx264 -r 30 {recording_path}"
+    elif platform_name == "Windows":
+        user32 = ctypes.windll.user32
+        screen_width = user32.GetSystemMetrics(0)
+        screen_height = user32.GetSystemMetrics(1)
+        # Use gdigrab for Windows screen capture
+        start_command = f"ffmpeg -y -f gdigrab -draw_mouse 1 -framerate 30 -video_size {screen_width}x{screen_height} -i desktop -c:v libx264 -r 30 {recording_path}"
+    else:
+        return jsonify({'status': 'error', 'message': f'Recording not supported on {platform_name}'}), 501

    # Use stderr=PIPE to capture potential errors from ffmpeg
    recording_process = subprocess.Popen(shlex.split(start_command),
@@ -1544,11 +1685,22 @@ def end_recording():
    error_output = ""
    try:
        # Send SIGINT for a graceful shutdown, allowing ffmpeg to finalize the file.
-        recording_process.send_signal(signal.SIGINT)
+        # On Windows, ask ffmpeg to quit by writing 'q' to its stdin; on Unix, use SIGINT
+        if platform_name == "Windows":
+            # On Windows, we need to terminate the process gracefully
+            # ffmpeg responds to standard input 'q' to quit gracefully
+            try:
+                recording_process.stdin.write(b'q')
+                recording_process.stdin.flush()
+            except Exception:
+                # stdin is unavailable: fall back to terminate(), which may leave the file unfinalized
+                recording_process.terminate()
+        else:
+            recording_process.send_signal(signal.SIGINT)
        # Wait for ffmpeg to terminate. communicate() gets output and waits.
        _, error_output = recording_process.communicate(timeout=15)
    except subprocess.TimeoutExpired:
-        logger.error("ffmpeg did not respond to SIGINT, killing the process.")
+        logger.error("ffmpeg did not respond to stop signal, killing the process.")
        recording_process.kill()
        # After killing, communicate to get any remaining output.
        _, error_output = recording_process.communicate()
@@ -1589,8 +1741,9 @@ def run_python():
            f.write(code)
        
        # Execute the file using subprocess to capture all output
+        # Use sys.executable to use the same Python interpreter as the Flask server
        result = subprocess.run(
-            ['/usr/bin/python3', temp_filename],
+            [sys.executable, temp_filename],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
--- a/evaluation_examples/examples/avogadro/building-metal-complexes_task1.json
+++ b/evaluation_examples/examples/avogadro/building-metal-complexes_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-metal-complexes_task1",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, use the Template Tool to create the [Co(NH3)6]3+ coordination complex with octahedral coordination geometry.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
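+    "steps": "1. Open the Template Tool (shortcut Ctrl+3, or click the toolbar icon).\n2. Switch to the Centers tab.\n3. Type 'Co' or select cobalt from the pop-up menu.\n4. Click the '+' symbol three times to set the positive charge to +3.\n5. Press the '6' key or select the octahedral geometry.\n6. Click an empty area to place the cobalt center; six hydrogen atoms appear at the coordination positions."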
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-metal-complexes_task3.json
+++ b/evaluation_examples/examples/avogadro/building-metal-complexes_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-metal-complexes_task3",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, use the Template Tool to create the [Ni(en)(NH3)2]2+ coordination complex with square planar coordination geometry.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
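+    "steps": "1. Open the Template Tool.\n2. Switch to the Centers tab.\n3. Type 'Ni' or select nickel from the pop-up menu.\n4. Click the '+' symbol twice to set the positive charge to +2.\n5. Press the keys '44' or select the square planar geometry.\n6. Click an empty area to place the nickel center; four hydrogen atoms appear at the coordination positions."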
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-metal-complexes_task7.json
+++ b/evaluation_examples/examples/avogadro/building-metal-complexes_task7.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-metal-complexes_task7",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, create the ZrCp2Cl2 complex with two cyclopentadienyl (Cp) ligands and two chloro ligands.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
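+    "steps": "1. Open the Template Tool and click the Centers tab.\n2. Type 'Zr' or select zirconium.\n3. Click '+' four times to set the positive charge to +4.\n4. Press the '4' key to select the tetrahedral geometry.\n5. Place the zirconium center in an empty area.\n6. Switch to the Ligands tab and type 'cp' or select cyclopentadienyl.\n7. Click one hydrogen atom to add the first Cp ligand.\n8. Click an adjacent hydrogen to add the second Cp ligand.\n9. Switch to the Draw Tool (shortcut Ctrl+2).\n10. Select the element Cl.\n11. Click each of the two remaining hydrogen atoms; each click replaces it with a chloro ligand."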
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-organic-molecules_task1.json
+++ b/evaluation_examples/examples/avogadro/building-organic-molecules_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-organic-molecules_task1",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, use the Build tools to insert a benzene ring.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
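+    "steps": "1. In the menu bar, click Build(构建) → Insert(插入) → Molecule(分子…) to open the \"插入片段\" (Insert Fragment) dialog.\n2. Type benzene into the \"筛选\" (Filter) input box (note: switch to an English input method before typing).\n3. The filtered results show an aromatics folder (tree structure); double-click or click it to expand the folder.\n4. After expanding, select the benzene.cjson file in the list.\n5. Click the \"插入\" (Insert) button to insert the benzene ring into the workspace.\n6. Close the \"插入片段\" (Insert Fragment) dialog and confirm that the benzene ring is displayed in the main workspace."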
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-organic-molecules_task3.json
+++ b/evaluation_examples/examples/avogadro/building-organic-molecules_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-organic-molecules_task3",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, add a nitro group (-NO2) at the para position of the toluene molecule to produce 4-nitrotoluene.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
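+    "steps": "1. Press the 'N' key to select the nitro group.\n2. Click the position para to the methyl group (a hydrogen atom on the benzene ring) to replace it with -NO2.\n3. Make sure the molecular structure is correct."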
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-organic-molecules_task4.json
+++ b/evaluation_examples/examples/avogadro/building-organic-molecules_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-organic-molecules_task4",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, perform a geometry optimization on the toluene molecule.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Press Ctrl+Alt+O or click the Auto Optimize tool to run the geometry optimization.\n2. Check that the molecule ends up with a reasonable geometry."
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-organic-molecules_task5.json
+++ b/evaluation_examples/examples/avogadro/building-organic-molecules_task5.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-organic-molecules_task5",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, use the Draw Tool to create a single-carbon structure, then add a carboxyl group (-COOH).",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
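+    "steps": "1. Use the Draw Tool to draw a single carbon atom in the workspace.\n2. Activate the Template Tool by pressing Ctrl+3 or clicking its toolbar icon, then go to Groups.\n3. Press 'C' or 'co' to select the carboxyl group.\n4. Click a hydrogen atom on the single-carbon structure to replace it with the carboxyl group."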
+  }
+}
--- a/evaluation_examples/examples/avogadro/building-organic-molecules_task9.json
+++ b/evaluation_examples/examples/avogadro/building-organic-molecules_task9.json
@@ -0,0 +1,44 @@
+{
+  "id": "building-organic-molecules_task9",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro 2, create a 4-methoxy-3-nitrobenzoic acid molecule containing a benzene ring, a carboxyl group, a nitro group, and a methoxy group.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
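+    "steps": "1. Insert a benzene ring.\n2. Press the 'C' key to select the carboxyl group and add it at position 1 of the benzene ring.\n3. Press the 'N' key to select the nitro group and add it at position 3 of the benzene ring.\n4. Type 'om' to select the methoxy group and add it at position 4 of the benzene ring.\n5. Run a geometry optimization with the optimization tool and check that the molecule is correct."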
+  }
+}
--- a/evaluation_examples/examples/avogadro/naming-a-molecule_task1.json
+++ b/evaluation_examples/examples/avogadro/naming-a-molecule_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "naming-a-molecule_task1",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro, use Analysis → Properties → Molecular... to view the current molecule's IUPAC name and related properties.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
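+    "steps": "1. Open Avogadro.\n2. Click Analysis in the menu bar.\n3. Select Properties from the drop-down menu.\n4. Click the Molecular... item.\n5. In the 'Molecular Properties' window that opens, view the molecule's name and related information, such as molecular mass, chemical formula, number of atoms, and number of bonds."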
+  }
+}
--- a/evaluation_examples/examples/avogadro/viewing-electrostatic-potential_task1.json
+++ b/evaluation_examples/examples/avogadro/viewing-electrostatic-potential_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "viewing-electrostatic-potential_task1",
+  "snapshot": "avogadro",
+  "instruction": "In Avogadro, use the Analyze → Create Surfaces menu to create a Van der Waals surface and set up the charge-distribution visualization.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\Avogadro2\\bin\\avogadro2.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "avogadro"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
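+    "steps": "1. Open Avogadro and load the model of the target molecule.\n2. Select Analyze → Create Surfaces from the menu bar.\n3. In the Create Surfaces dialog that opens, set Surface to 'Van der Waals'.\n4. Set Color By to 'Electrostatic Potential'.\n5. Choose a charge model (for example, 'EEM').\n6. Set the color scale to 'Balance'.\n7. Click the 'Calculate' button to start computing the surface.\n8. Wait for the calculation to finish, then click 'Close' to close the dialog."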
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task1.json
+++ b/evaluation_examples/examples/imagej/user-guide_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task1",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use File → New → Image to create a new image named 'Text Image', with the image type set to 8-bit, the background fill set to White, a width of 40 pixels, and a height of 40 pixels.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
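+    "steps": "1. In the menu bar, click File → New → Image.\n2. In the dialog that opens, enter the name 'Text Image'.\n3. Select '8-bit' from the Type drop-down menu.\n4. Select 'White' from the Fill With drop-down menu.\n5. Enter 40 in the Width box.\n6. Enter 40 in the Height box.\n7. Click the OK button to finish."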
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task10.json
+++ b/evaluation_examples/examples/imagej/user-guide_task10.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task10",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Edit → Selection → Restore Selection to restore the previously stored selection.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the menu bar, click Edit → Selection → Restore Selection.\n2. Make sure the selection is visible on the image.\n3. Verify that the selection was restored correctly."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task2.json
+++ b/evaluation_examples/examples/imagej/user-guide_task2.json
@@ -0,0 +1,57 @@
+{
+  "id": "user-guide_task2",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Process → Find Maxima to find maxima in the blobs.gif image, with the noise tolerance set to 50 and Output Type set to 'Single Points'.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/imagej/blobs.gif",
+            "path": "C:\\Users\\user\\Desktop\\blobs.gif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "blobs.gif"
+    ],
+    "steps": "1. Open the blobs.gif file.\n2. In the menu bar, click Process → Find Maxima.\n3. In the dialog that opens, set Noise Tolerance to 50.\n4. Select 'Single Points' from the Output Type drop-down menu.\n5. Click the OK button to finish."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task3.json
+++ b/evaluation_examples/examples/imagej/user-guide_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task3",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Plugins → Utilities → Find Commands to search for commands matching the keyword 'threshold', show full information, and run 'Adaptive3DThreshold'.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the menu bar, click Plugins → Utilities → Find Commands.\n2. In the Command Finder window that opens, type 'threshold'.\n3. Check 'Show full information'.\n4. Select 'Adaptive3DThreshold' in the list and double-click it to run the command."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task4.json
+++ b/evaluation_examples/examples/imagej/user-guide_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task4",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Image → Adjust → Threshold to threshold the current image with the 'Default' automatic thresholding method, and set the display mode to Over/Under.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the menu bar, click Image → Adjust → Threshold.\n2. In the dialog that opens, select 'Default' from the Method drop-down menu.\n3. Make sure the Display mode is set to 'Over/Under'.\n4. Click the Apply button to finish."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task5.json
+++ b/evaluation_examples/examples/imagej/user-guide_task5.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task5",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Analyze → Tools → Curve Fitting to fit a second-degree polynomial to the data, with the maximum number of iterations set to 100.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the menu bar, click Analyze → Tools → Curve Fitting.\n2. In the dialog that opens, select '2nd Degree Polynomial' from the Function drop-down menu.\n3. Click the Fit button.\n4. In Simplex Fitting Options, set Maximum number of iterations to 100.\n5. Click the OK button to complete the fit."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task6.json
+++ b/evaluation_examples/examples/imagej/user-guide_task6.json
@@ -0,0 +1,57 @@
+{
+  "id": "user-guide_task6",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Image → Transform → Rotate 90 Degrees Right to rotate the mri-stack.tif image by 90°.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/imagej/mri-stack.tif",
+            "path": "C:\\Users\\user\\Desktop\\mri-stack.tif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "mri-stack.tif"
+    ],
+    "steps": "1. Open the mri-stack.tif file.\n2. In the menu bar, click Image → Transform → Rotate 90 Degrees Right.\n3. Confirm that the image rotated correctly, then save or inspect the result."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task7.json
+++ b/evaluation_examples/examples/imagej/user-guide_task7.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task7",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Process → Binary → Options to enable the black background option and preview the result.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the menu bar, click Process → Binary → Options.\n2. In the dialog that opens, check 'Black Background'.\n3. Click Preview to see the effect.\n4. Click the OK button to save the options."
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task8.json
+++ b/evaluation_examples/examples/imagej/user-guide_task8.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task8",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use File → Save As → PNG to save the current image in PNG format, with the transparent index set to 255.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
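+    "steps": "1. In the menu bar, click File → Save As → PNG.\n2. In the dialog that opens, set the transparent index to 255.\n3. Enter a file name and choose the save location.\n4. Click the OK button to finish saving."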
+  }
+}
--- a/evaluation_examples/examples/imagej/user-guide_task9.json
+++ b/evaluation_examples/examples/imagej/user-guide_task9.json
@@ -0,0 +1,44 @@
+{
+  "id": "user-guide_task9",
+  "snapshot": "imagej",
+  "instruction": "In ImageJ, use Analyze → Measure to measure the area and gray value of the current selection.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\ImageJ\\ImageJ.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "imagej"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Create or select an area selection.\n2. In the menu bar, click Analyze → Measure.\n3. In the Results window that opens, view the measurements, including area and gray value."
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task1.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task1.json
@@ -0,0 +1,46 @@
+{
+  "id": "MDIJade6.5使用手册_task1",
+  "snapshot": "jade",
+  "instruction": "In MDI Jade, use the File → Patterns menu to load the diffraction data file DEMO001.MDI.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "DEMO001.MDI"
+    ],
+    "steps": "1. Find the MDI Jade icon on the desktop and double-click it to open the software.\n2. Click File → Patterns in the menu.\n3. In the dialog that opens, select DEMO001.MDI and click Open."
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task10.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task10.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task10",
+  "snapshot": "jade",
+  "instruction": "Open the unit-cell lattice-parameter dialog from the Options → Cell Refinement menu and refine the lattice constants.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
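+    "steps": "1. Click Options → Cell Refinement in the menu.\n2. In the dialog that opens, check the lattice parameters.\n3. Click the Refine button to run the refinement.\n4. Check the refinement results and save them."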
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task2.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task2.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task2",
+  "snapshot": "jade",
+  "instruction": "In JADE, export the currently open diffraction pattern to ASCII format via File → Save-Primary Pattern as *.txt, saving it as DEMO001.txt.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click File → Save-Primary Pattern as *.txt in the menu.\n2. In the save dialog that opens, set the file name to DEMO001.txt.\n3. Click the Save button to save the file."
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task3.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task3",
+  "snapshot": "jade",
+  "instruction": "Use the Search/Match function to run a phase search, restricting the elements to Al, Sn, and O.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
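+    "steps": "1. Click the S/M button.\n2. In the Search/Match dialog, check Use Chemistry Filter.\n3. Enter the restricting elements Al, Sn, O and click OK.\n4. Wait for the phase search to finish, then check the results list."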
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task4.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task4",
+  "snapshot": "jade",
+  "instruction": "In JADE, use Options → Report-Peak ID Extended to calculate phase mass fractions with the RIR method, and save the results in PDF format.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
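+    "steps": "1. Open Options → Report-Peak ID Extended from the menu.\n2. Review the result data and make sure it is displayed in full.\n3. Click Save and choose PDF as the file type.\n4. Enter a file name and click Save."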
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task5.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task5.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task5",
+  "snapshot": "jade",
+  "instruction": "Use the Report → Peak Search Report menu to calculate crystallite size and microstrain, with the D value set to 1.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
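+    "steps": "1. Click Report → Peak Search Report in the menu.\n2. In the dialog that opens, select the Size/strain option.\n3. Set the deconvolution parameter D to 1.\n4. Click the Save button to save the calculation results."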
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task6.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task6.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task6",
+  "snapshot": "jade",
+  "instruction": "Use the File → Save menu to save the current instrument FWHM calibration curve to Si_hw_curve.fwhm.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click the File menu.\n2. Select Save → FWHM Curve of Peaks.\n3. In the save dialog, enter the file name Si_hw_curve.fwhm.\n4. Click the Save button to finish saving."
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task7.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task7.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task7",
+  "snapshot": "jade",
+  "instruction": "Use the Options → D-Spacing menu to calculate the diffraction pattern of a known structure, with Z = 12 in the weighted-intensity formula.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
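+    "steps": "1. Open Options → D-Spacing from the menu.\n2. In the diffraction pattern calculation dialog, set the weighted-intensity formula parameter Z to 12.\n3. Click the Calculate button to generate the results.\n4. Check the results and close the window."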
+  }
+}
--- a/evaluation_examples/examples/jade/MDIJade6.5使用手册_task8.json
+++ b/evaluation_examples/examples/jade/MDIJade6.5使用手册_task8.json
@@ -0,0 +1,44 @@
+{
+  "id": "MDIJade6.5使用手册_task8",
+  "snapshot": "jade",
+  "instruction": "Use the Options → Calculate Stress menu to calculate residual stress, fitting the curve with the Fit All function.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\JADE\\jade 6.5\\MDI Jade 6.5\\jade6.5.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "jade"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
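+    "steps": "1. Click Options → Calculate Stress in the menu.\n2. In the dialog that opens, choose Fit All to fit all the data.\n3. Check the plot of the fit results.\n4. Click the Save button to save the fit results file."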
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task1.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task1.json
@@ -0,0 +1,57 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task1",
+  "snapshot": "origin",
+  "instruction": "In Origin, use Data → Connect to File to import the local Excel file example.xlsx.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/origin/example.xlsx",
+            "path": "C:\\Users\\user\\Desktop\\example.xlsx"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "example.xlsx"
+    ],
+    "steps": "1. In Origin's main menu, select Data → Connect to File.\n2. Click the button in the Connect to File menu.\n3. Select the file example.xlsx and click Open.\n4. The data is loaded into the current worksheet."
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task11.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task11.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task11",
+  "snapshot": "origin",
+  "instruction": "In Origin, use Graph → Adding Error Bars to add error bars to an existing graph.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
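+    "steps": "1. Open an existing graph and right-click a graph element.\n2. Select Graph → Adding Error Bars.\n3. Select the error data column and click OK to apply.\n4. Check that the error bars were added to the graph correctly."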
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task12.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task12.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task12",
+  "snapshot": "origin",
+  "instruction": "In Origin, use the Tools → Pick Data Points tool to pick data points and save them to a new worksheet.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
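+    "steps": "1. Open a graph that contains data points.\n2. In the main menu, select Tools → Pick Data Points.\n3. Use the crosshair marker to select data points in the graph.\n4. Click the Done button to save the selected data points to a new worksheet."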
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task2.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task2.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task2",
+  "snapshot": "origin",
+  "instruction": "In Origin, use View → Formula Bar to open the formula bar, then enter =stdev(B1:B10) in the formula bar.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
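+    "steps": "1. In the main menu, select View → Formula Bar.\n2. In the formula bar that appears, click inside the current cell and type =stdev(B1:B10).\n3. Press Enter to apply the formula and calculate the result.\n4. Check that the result shown in the formula bar is correct."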
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task3.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task3",
+  "snapshot": "origin",
+  "instruction": "In Origin, use the Axis Dialog to change the X axis range to 20 through 180",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
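+    "steps": "1. Right-click the X axis area of the graph layer and choose Axis Dialog.\n2. Select the Scale tab on the left.\n3. Change the From value to 20 and the To value to 180.\n4. Click the Apply To button to apply the changes, then click OK to finish."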
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task4.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task4",
+  "snapshot": "origin",
+  "instruction": "In Origin, rescale the graph to show all data via Graph → Rescale to Show All",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Open a graph that contains data.\n2. From the main menu, choose Graph → Rescale to Show All.\n3. The graph is rescaled to show all data points."
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task5.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task5.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task5",
+  "snapshot": "origin",
+  "instruction": "In Origin, activate the Data Slicer via Tools → Data Slicer and set the slicing condition to X=50",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
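+    "steps": "1. From the main menu, choose Tools → Data Slicer.\n2. The Data Slicer panel is activated.\n3. In the slicer conditions, select X=50 and apply the slice.\n4. The graph shows the sliced data points."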
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task8.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task8.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task8",
+  "snapshot": "origin",
+  "instruction": "In Origin, convert the active worksheet to a matrix via Worksheet → Convert to Matrix",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
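+    "steps": "1. Open an active worksheet that contains data.\n2. From the main menu, choose Worksheet → Convert to Matrix.\n3. Choose a matrix conversion option in the dialog (for example, X Across Columns).\n4. Click OK to complete the conversion and generate the matrix data."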
+  }
+}
--- a/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task9.json
+++ b/evaluation_examples/examples/origin/Origin_User_Guide_2025b_E_task9.json
@@ -0,0 +1,44 @@
+{
+  "id": "Origin_User_Guide_2025b_E_task9",
+  "snapshot": "origin",
+  "instruction": "In Origin, align the selected graph objects using the Object Edit Toolbar",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OriginLab\\Origin2025b\\Origin64.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "origin"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
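+    "steps": "1. Use the mouse to select the objects to align.\n2. Open the Object Edit Toolbar.\n3. Click an alignment button, such as Align Left or Align Center.\n4. The selected objects are arranged with a uniform alignment."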
+  }
+}
--- a/evaluation_examples/examples/ovito/animation_task3.json
+++ b/evaluation_examples/examples/ovito/animation_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "animation_task3",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, change the animation frame rate from its default setting to 10 frames per second.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the top menu bar of OVITO, select 'Animation settings'.\n2. In the 'Animation settings' window that opens, find the 'Frames per second' input field.\n3. Set the frame rate to 10.\n4. Click 'OK' to save the change."
+  }
+}
--- a/evaluation_examples/examples/ovito/aspherical_particles_task1.json
+++ b/evaluation_examples/examples/ovito/aspherical_particles_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "aspherical_particles_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, assign a spherical shape to the particles.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
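+    "steps": "1. Open OVITO.\n2. Choose File → New, or open an existing particle data file.\n3. In the 'Pipeline' view, select the 'Particles' visual element.\n4. Go to the 'Particle types' panel and set the particle shape to Sphere.\n5. Confirm and apply the change, making sure the shape is spherical and the visualization updates."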
+  }
+}
--- a/evaluation_examples/examples/ovito/clone_pipeline_task1.json
+++ b/evaluation_examples/examples/ovito/clone_pipeline_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "clone_pipeline_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, clone the current data pipeline by choosing the 'Clone current pipeline...' option from the Pipeline drop-down menu in the main toolbar.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
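+    "steps": "1. Find the 'Pipelines' drop-down menu in the main toolbar of OVITO.\n2. Click the drop-down menu and choose 'Clone current pipeline...'.\n3. In the 'Clone pipeline' dialog that opens, choose a clone mode (such as Copy or Share).\n4. Click the 'OK' button to complete the clone operation.\n5. Confirm that the outputs of both the original pipeline and the cloned pipeline are shown in the visualization scene."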
+  }
+}
--- a/evaluation_examples/examples/ovito/code_generation_task1.json
+++ b/evaluation_examples/examples/ovito/code_generation_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "code_generation_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, open the code generator window via File → Generate Python Script.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
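+    "steps": "1. Start OVITO.\n2. Click File in the top menu bar.\n3. Choose Generate Python Script from the drop-down menu.\n4. Make sure the code generator window opens correctly (the Python code editing view is visible)."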
+  }
+}
--- a/evaluation_examples/examples/ovito/customize_init_state_task1.json
+++ b/evaluation_examples/examples/ovito/customize_init_state_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "customize_init_state_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, create a file named defaults.ovito that stores an empty initial session state.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
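+    "steps": "1. Open OVITO.\n2. In the menu bar, click File → Save Session State As.\n3. In the file save dialog that opens, name the file defaults.ovito.\n4. Make sure the session is empty (that is, it contains no datasets or pipelines).\n5. Click the Save button."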
+  }
+}
--- a/evaluation_examples/examples/ovito/data_model_task1.json
+++ b/evaluation_examples/examples/ovito/data_model_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "data_model_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, use the Data Inspector to examine the imported particle property table, including the Position and Potential Energy columns.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
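+    "steps": "1. Open OVITO.\n2. Import a simulation file that contains particle properties (for example, in .xyz format).\n3. Click the Data Inspector button in the top toolbar.\n4. In the Data Inspector panel, view the particle property table, including the Position and Potential Energy columns."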
+  }
+}
--- a/evaluation_examples/examples/ovito/export_task1.json
+++ b/evaluation_examples/examples/ovito/export_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "export_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, export the current data pipeline as a table of particles and their properties, saved as the file particle_data.csv.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
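+    "steps": "1. In the menu bar, click File → Export File.\n2. In the dialog that opens, choose 'Table of Particles' as the export format.\n3. Set the file name to particle_data.csv and choose a save location.\n4. Click the 'Save' button to complete the export."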
+  }
+}
--- a/evaluation_examples/examples/ovito/marker_particles_task2.json
+++ b/evaluation_examples/examples/ovito/marker_particles_task2.json
@@ -0,0 +1,44 @@
+{
+  "id": "marker_particles_task2",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, set the animation playback speed to 15 frames per second.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the OVITO interface, click the Animation Settings button (the small clock icon).\n2. In the Animation Settings window, find the 'Frames per second' option.\n3. Set the Frames per second value to 15.\n4. Click 'OK' to confirm the setting."
+  }
+}
--- a/evaluation_examples/examples/ovito/miscellaneous_task1.json
+++ b/evaluation_examples/examples/ovito/miscellaneous_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "miscellaneous_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, save the current session as 'session.ovitostate' via File → Save Session State.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click File in the menu bar.\n2. Choose Save Session State.\n3. In the save dialog that opens, choose a target path and enter the file name 'session.ovitostate'.\n4. Click Save to save the file."
+  }
+}
--- a/evaluation_examples/examples/ovito/python_extensions_task1.json
+++ b/evaluation_examples/examples/ovito/python_extensions_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "python_extensions_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO Pro, open the extensions directory via Edit → Python Extensions.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Open OVITO Pro.\n2. Click the Edit menu in the top menu bar.\n3. Choose Python Extensions from the drop-down menu.\n4. Check the extensions directory window to confirm that it has opened."
+  }
+}
--- a/evaluation_examples/examples/ovito/remote_file_access_task1.json
+++ b/evaluation_examples/examples/ovito/remote_file_access_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "remote_file_access_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, open the remote SSH file sftp://user@hostname/path/file via File → Load Remote File",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Open OVITO.\n2. Click File → Load Remote File in the menu.\n3. In the dialog that opens, fill in the Remote URL field, for example: sftp://user@hostname/path/file.\n4. Under File type, select Auto-detect file format.\n5. Under SSH connection method, select Integrated client (default).\n6. Click Open to connect and load the file."
+  }
+}
--- a/evaluation_examples/examples/ovito/remote_rendering_task1.json
+++ b/evaluation_examples/examples/ovito/remote_rendering_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "remote_rendering_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO Pro, set the export directory for a remote rendering job and configure the number of CPU cores.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
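+    "steps": "1. Open OVITO Pro.\n2. Click the Utilities tab in the top menu.\n3. Select the Render On Remote Computer tool.\n4. In the dialog that opens, click the 'Choose' button to set a local export directory for Bundle Export Directory.\n5. In the CPU cores per task field, enter the number of CPU cores the rendering job needs (it may be left empty, in which case all cores are used by default)."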
+  }
+}
--- a/evaluation_examples/examples/ovito/rendering_task1.json
+++ b/evaluation_examples/examples/ovito/rendering_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "rendering_task1",
+  "snapshot": "ovito",
+  "instruction": "In OVITO, use the Render Settings panel to render the active viewport as an image with a resolution of 1024x768 and a transparent background.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
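+    "steps": "1. Open OVITO.\n2. Make sure the viewport is active (yellow border).\n3. Click the Render icon in the command panel on the right.\n4. In the Render Settings panel that opens, select 'Single frame'.\n5. Set the output image size to Width: 1024 and Height: 768.\n6. Set the background to 'Transparent'.\n7. Click the 'Render active viewport' button to complete the rendering."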
+  }
+}
--- a/evaluation_examples/examples/ovito/transparent_particles_task1.json
+++ b/evaluation_examples/examples/ovito/transparent_particles_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "transparent_particles_task1",
+  "snapshot": "ovito",
+  "instruction": "In the software, set the Transparency property of all particles to 0.5 so that the particles are semi-transparent.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\OVITO Basic\\ovito.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "ovito"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
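+    "steps": "1. Open the software and load the required particle data.\n2. Insert the Compute property modifier into the data pipeline.\n3. In the Compute property modifier, find the Transparency property.\n4. Enter the transparency value 0.5 in the expression field.\n5. Apply the settings, making sure the Transparency property is assigned to all particles."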
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_1_task1.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_1_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_1_task1",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use the command line to make a simple animation that plays an NMR ensemble.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. In the PyMOL command line, enter `fetch 1nmr` to load the NMR ensemble.\n2. Enter the `mplay` command to start playing the animation.\n3. To stop the animation, enter `mstop`."
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_1_task2.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_1_task2.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_1_task2",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, make an animation in which the scene rotates 360 degrees around the Y axis.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
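+    "steps": "1. From the menu, choose the appropriate option for creating an animation of the scene rotating around the Y axis.\n2. Press 'Pressplay' to start playing the animation."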
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_1_task3.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_1_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_1_task3",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, make a scene rocking animation; a rocking angle of 30, 60, 90, 120, or 180 degrees can be chosen.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
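+    "steps": "1. From the menu, choose the option for setting up a scene rocking animation.\n2. Choose a rocking angle (for example, 30, 60, 90, 120, or 180 degrees).\n3. Press 'Pressplay' to start the animation."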
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_1_task4.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_1_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_1_task4",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, make a simple scene nutate animation.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. From the menu, choose the option for making a nutate animation.\n2. Set the nutate effect parameters.\n3. Press 'Pressplay' to start the nutate animation."
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_1_task5.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_1_task5.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_1_task5",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use Scene Loop to make an animation that zooms to an atom and back.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
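+    "steps": "1. Create a scene and set it to zoom to a specific atom.\n2. Save the scene.\n3. Use the Scene Loop feature in PyMOL to link multiple saved scenes.\n4. Play the animation to observe the zoom effect."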
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_3_task1.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_3_task1.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_3_task1",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use the Movie menu to append 2 seconds to the end of the current movie.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click Movie in the menu bar.\n2. Choose Append from the drop-down menu.\n3. In the Append submenu, choose 2 seconds."
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_3_task10.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_3_task10.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_3_task10",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, view the motion options for the ALA fragment via the ALA Motions menu.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click All in the bottom toolbar of PyMOL.\n2. Choose ALA from the drop-down menu.\n3. In the ALA menu, choose Motions.\n4. Browse the motion/position options that are shown."
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_3_task2.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_3_task2.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_3_task2",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use the Movie menu to set the movie frame rate to 15 FPS.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click Movie in the menu bar.\n2. Choose Frame Rate from the drop-down menu.\n3. In the Frame Rate submenu, choose 15 FPS."
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_3_task3.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_3_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_3_task3",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use the Scene menu to store the current scene under the name 'my_scene'.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
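+    "steps": "1. Click Scene in the menu bar.\n2. Choose Store from the drop-down menu.\n3. In the window that opens, enter 'my_scene' as the scene name.\n4. Click the confirm button to store the scene."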
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_3_task4.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_3_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_3_task4",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use the Scene menu to clear all stored scenes.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click Scene in the menu bar.\n2. Choose Clear from the drop-down menu.\n3. In the confirmation dialog, click Yes to clear all scenes."
+  }
+}
--- a/evaluation_examples/examples/pymol/MovieSchool_3_task5.json
+++ b/evaluation_examples/examples/pymol/MovieSchool_3_task5.json
@@ -0,0 +1,44 @@
+{
+  "id": "MovieSchool_3_task5",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, use the Mouse menu to set the mouse mode to 3 Button Motions.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
+    "steps": "1. Click Mouse in the menu bar.\n2. Select Edit from the drop-down menu.\n3. Select Motions in the Edit menu.\n4. Select 3 Button Motions in the submenu that appears."
+  }
+}
--- a/evaluation_examples/examples/pymol/Mutagenesis_task4.json
+++ b/evaluation_examples/examples/pymol/Mutagenesis_task4.json
@@ -0,0 +1,44 @@
+{
+  "id": "Mutagenesis_task4",
+  "snapshot": "pymol",
+  "instruction": "Interpret the color codes in the PyMOL Mutagenesis tool to understand van der Waals radius overlap.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
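+    "steps": "1. Open PyMOL and load any structure.\n2. Open the Mutagenesis tool via Wizard → Mutagenesis.\n3. Inspect the color cues shown for the specified region in the Mutagenesis tool.\n4. Confirm the color interpretation: green indicates slight overlap, red indicates significant overlap.\n5. Use the color information to choose an appropriate action to optimize the structure."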
+  }
+}
--- a/evaluation_examples/examples/pymol/Practical_Pymol_for_Beginners_task6.json
+++ b/evaluation_examples/examples/pymol/Practical_Pymol_for_Beginners_task6.json
@@ -0,0 +1,44 @@
+{
+  "id": "Practical_Pymol_for_Beginners_task6",
+  "snapshot": "pymol",
+  "instruction": "In PyMOL, save the current session as a .pse file via File → Save Session.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\PYMOL\\PyMOLWin.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "pymol"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
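+    "steps": "1. Make sure all required objects and scenes are set up.\n2. Click File → Save Session in the menu bar.\n3. In the pop-up window, name the file and save it in .pse format.\n4. Confirm that the session was saved successfully."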
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task1.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task1.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task1",
+  "snapshot": "vesta",
+  "instruction": "Launch VESTA and load the structure file example_structure.cif.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/example_structure.cif",
+            "path": "C:\\Users\\user\\Desktop\\example_structure.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "example_structure.cif"
+    ],
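+    "steps": "1. Launch VESTA.\n2. Click File → Open.\n3. Select the file example_structure.cif in the file browser window.\n4. Click the Open button to load the file.\n5. Confirm that the structure is displayed in the view window."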
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task10.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task10.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task10",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, import a Crystallographic Information File (CIF) and view its symmetry.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/sample.cif",
+            "path": "C:\\Users\\user\\Desktop\\sample.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "sample.cif"
+    ],
+    "steps": "1. Launch VESTA.\n2. Click File → Open to open the file browser.\n3. Select the file sample.cif and click Open.\n4. After the file loads, click Edit → Data.\n5. Select the Unit Cell tab.\n6. View the symmetry information shown in the Symmetry tab."
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task11.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task11.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task11",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, generate polyhedra and adjust their transparency.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/example_structure.cif",
+            "path": "C:\\Users\\user\\Desktop\\example_structure.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "example_structure.cif"
+    ],
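+    "steps": "1. Open VESTA and load the file example_structure.cif.\n2. Click Edit → Properties.\n3. Select the Polyhedra tab in the Properties dialog.\n4. Check Enable Polyhedra drawing.\n5. Adjust the Transparency slider to the desired value, for example 50%.\n6. Click the OK button to save the settings.\n7. Verify the updated Polyhedra display in the main view window."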
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task2.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task2.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task2",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, set the display mode to Ball-and-Stick for the file example_structure.cif.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/example_structure.cif",
+            "path": "C:\\Users\\user\\Desktop\\example_structure.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "example_structure.cif"
+    ],
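+    "steps": "1. Open VESTA and load the file example_structure.cif.\n2. Select View → Display Style in the top menu.\n3. Select the Ball-and-Stick mode in the dialog that appears.\n4. Click the OK button to apply the setting.\n5. Check the main view window to confirm that the display mode has changed."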
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task3.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task3.json
@@ -0,0 +1,44 @@
+{
+  "id": "VESTA_Manual_task3",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, measure the distance between two atoms in the current crystal.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [],
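+    "steps": "1. Open any structure file in VESTA.\n2. Click the Measure Distance tool in the vertical toolbar.\n3. Select the two atoms to measure in the main view window.\n4. Confirm that the distance between the two atoms is shown under the Measure Distance tool.\n5. Verify that the reported distance value is displayed correctly."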
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task4.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task4.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task4",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, define custom drawing boundaries.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/MgB2.cif",
+            "path": "C:\\Users\\user\\Desktop\\MgB2.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "MgB2.cif"
+    ],
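+    "steps": "1. Open VESTA and load the file MgB2.cif.\n2. Click the Objects → Boundary button in the left sidebar to open the Boundary dialog.\n3. In the dialog, set the ranges (x[min], x[max], y[min], y[max], z[min], z[max]) to custom values, for example 0 to 1.\n4. Click the OK or Apply button.\n5. Check that the modified drawing boundaries of the crystal are shown in the main view."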
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task5.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task5.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task5",
+  "snapshot": "vesta",
+  "instruction": "Adjust the color and radius of the crystal's bonds via the VESTA Properties dialog.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/xTiO2.cif",
+            "path": "C:\\Users\\user\\Desktop\\xTiO2.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "xTiO2.cif"
+    ],
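+    "steps": "1. Open VESTA and load the file xTiO2.cif.\n2. Click Edit → Properties.\n3. Navigate to the Bonds page in the dialog.\n4. Adjust the value in the Radius (cylinder) input box, for example to 0.3.\n5. Change the color setting to the RGB value (100, 150, 200).\n6. Click the OK button to save the changes and close the dialog.\n7. Make sure the changes are visible in the main view."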
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task6.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task6.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task6",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, switch the crystal projection to the [110] direction.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/Si.cif",
+            "path": "C:\\Users\\user\\Desktop\\Si.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "Si.cif"
+    ],
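+    "steps": "1. Open VESTA and load the file Si.cif.\n2. Select View → Lattice Planes in the top menu.\n3. Select the [110] direction as the projection in the dialog.\n4. Click the OK button to apply the change.\n5. Confirm that the main view window shows the crystal projected along the [110] direction."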
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task7.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task7.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task7",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, generate a two-dimensional (2D) projection view of the crystal.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/rutile_TiO2.cif",
+            "path": "C:\\Users\\user\\Desktop\\rutile_TiO2.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "rutile_TiO2.cif"
+    ],
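+    "steps": "1. Open VESTA and load the file rutile_TiO2.cif.\n2. Select File → Export → 2D Image in the top menu.\n3. In the dialog that appears, set the output format to PNG and choose a suitable resolution (for example 300 dpi).\n4. Set the save location to the desktop and name the file projection.png.\n5. Click Save to export the image.\n6. Verify that the PNG file was generated correctly on the desktop."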
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task8.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task8.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task8",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, view the detailed geometric parameters of the reciprocal lattice.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/YBa2Cu3O7.cif",
+            "path": "C:\\Users\\user\\Desktop\\YBa2Cu3O7.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "YBa2Cu3O7.cif"
+    ],
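+    "steps": "1. Open VESTA and load the file YBa2Cu3O7.cif.\n2. Select Edit → Data → Reciprocal Lattice Parameters in the top menu.\n3. Review the detailed reciprocal lattice data in the dialog that appears.\n4. Click OK to close the dialog.\n5. Verify that the data is displayed correctly in the Text Area."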
+  }
+}
--- a/evaluation_examples/examples/vesta/VESTA_Manual_task9.json
+++ b/evaluation_examples/examples/vesta/VESTA_Manual_task9.json
@@ -0,0 +1,57 @@
+{
+  "id": "VESTA_Manual_task9",
+  "snapshot": "vesta",
+  "instruction": "In VESTA, generate an electron density map using Fourier Synthesis.",
+  "source": "custom",
+  "config": [
+    {
+      "type": "upload_file",
+      "parameters": {
+        "files": [
+          {
+            "local_path": "evaluation_examples/data/vesta/monazite.cif",
+            "path": "C:\\Users\\user\\Desktop\\monazite.cif"
+          }
+        ]
+      }
+    },
+    {
+      "type": "launch",
+      "parameters": {
+        "command": [
+          "C:\\VESTA-win64\\VESTA.exe"
+        ]
+      }
+    },
+    {
+      "type": "sleep",
+      "parameters": {
+        "seconds": 5
+      }
+    }
+  ],
+  "trajectory": "trajectories/",
+  "related_apps": [
+    "vesta"
+  ],
+  "evaluator": {
+    "postconfig": [
+      {
+        "type": "sleep",
+        "parameters": {
+          "seconds": 3
+        }
+      }
+    ],
+    "func": "vllm_eval"
+  },
+  "proxy": false,
+  "fixed_ip": false,
+  "possibility_of_env_change": "low",
+  "metadata": {
+    "input_files": [
+      "monazite.cif"
+    ],
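+    "steps": "1. Open VESTA and load the file monazite.cif.\n2. Select Utilities → Fourier Synthesis in the top menu.\n3. Set the resolution value to 0.05 in the dialog that appears.\n4. Click the Calculate button to start generating the electron density map.\n5. Verify that the generated map appears in the main view."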
+  }
+}
--- a/evaluation_examples/extract_instructions.py
+++ b/evaluation_examples/extract_instructions.py
@@ -0,0 +1,604 @@
+import os
+import sys
+import asyncio
+import aiohttp
+import base64
+import logging
+from pathlib import Path
+from typing import List, Optional
+import tempfile
+import shutil
+from dataclasses import dataclass
+from datetime import datetime
+import json
+
+# Configuration
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_ROOT = SCRIPT_DIR.parent
+
+API_BASE_URL = os.getenv("OPENAI_BASE_URL")
+API_URL = f"{API_BASE_URL}/chat/completions" if API_BASE_URL else None
+API_KEY = os.getenv("OPENAI_API_KEY")
+MODEL_NAME = "gemini-2.5-pro"
+MAX_CONCURRENT_REQUESTS = 5
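+INPUT_FOLDER = os.getenv("INPUT_FOLDER", "/Users/cuihang/Downloads/test_files")  # env override is an added convenience; the default is the original path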
+EXAMPLES_FOLDER = PROJECT_ROOT / "evaluation_examples" / "examples"
+TEST_ALL_JSON = PROJECT_ROOT / "evaluation_examples" / "test_all.json"
+
+# Retry configuration
+MAX_RETRY_ATTEMPTS = 3
+RETRY_DELAY = 5
+RETRY_BACKOFF = 2
+
+# Image limit
+MAX_IMAGES_PER_REQUEST = 50
+
+# Supported file extensions
+SUPPORTED_EXTENSIONS = {'.docx', '.doc', '.ppt', '.pptx', '.pdf', '.mp4', '.avi', '.mov', '.mkv'}
+
+SYSTEM_PROMPT = """You are an AI assistant that generates precise, executable step-by-step instructions for desktop software operations.
+
+Your task:
+Convert the provided document information into precise operation instructions that can be executed step-by-step by an AI agent in a software GUI.
+
+Output requirements (no additional explanatory text):
+------------------------------------------------
+
+[Task Goal]
+Describe in one sentence the final task result to be achieved in the software.
+
+[Input Files]
+Specify the file names, types, and locations involved in this operation.
+- If the document provides complete paths, record them as is
+- If only file names are mentioned (e.g., data.xlsx), record the filename and note "complete path not specified in document"
+- If no input files are mentioned, write "no input files required"
+
+[Detailed Operation Steps (GUI Level)]
+Break down the task into atomic GUI operation steps.
+Each step must meet the following conditions:
+- Contains only one explicit, indivisible GUI atomic action
+- Must specify the menus, panels, buttons, or controls involved
+- Must specify parameter names and option values involved
+- Arranged in the actual operation order of the software
+- Must include software launch steps (e.g., double-click desktop icon, launch from start menu, etc.)
+
+Step format example:
+1. Double-click the [Software Name] icon on the desktop to launch the software.
+2. Click "File → Open" in the main menu bar.
+3. In the file selection dialog, navigate to the specified directory and select file [filename].
+4. Click the "Open" button to confirm.
+5. ... (and so on)
+
+------------------------------------------------
+
+[Handling Uncertain Information]
+- Strictly generate operation steps based on document content, do not add features or menus not mentioned
+- If operation steps are unclear or ambiguous, infer based on common software operation flows
+- If parameter values in the document are unclear, note "[set according to actual needs]" in the step
+
+[Output Format]
+Output in JSON format with the following fields:
+{
+    "input_files": ["file1", "file2", "..."],
+    "task_goal": "...",
+    "steps": "A string containing all operation steps, arranged in order, with numbered prefix for each step, separated by newlines"
+}
+Note: Output must be strict JSON format, with no extra text or explanations."""
+
+
+# Logging configuration
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s',
+    handlers=[
+        logging.StreamHandler(sys.stdout)
+    ]
+)
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ProcessingStats:
+    """Processing statistics tracker"""
+    total_files: int = 0
+    completed_files: int = 0
+    failed_files: int = 0
+    retried_files: int = 0
+    start_time: Optional[datetime] = None
+    failed_list: Optional[List[tuple]] = None
+
+    def __post_init__(self):
+        if self.start_time is None:
+            self.start_time = datetime.now()
+        if self.failed_list is None:
+            self.failed_list = []
+
+    def add_completed(self):
+        self.completed_files += 1
+        self._log_progress()
+
+    def add_failed(self, file_path: str, error: str):
+        self.failed_files += 1
+        self.failed_list.append((file_path, error))
+        self._log_progress()
+
+    def add_retry(self):
+        self.retried_files += 1
+
+    def _log_progress(self):
+        processed = self.completed_files + self.failed_files
+        percentage = (processed / self.total_files * 100) if self.total_files > 0 else 0
+        elapsed = (datetime.now() - self.start_time).total_seconds()
+
+        if processed > 0:
+            avg_time = elapsed / processed
+            remaining = (self.total_files - processed) * avg_time
+            eta = f"{int(remaining // 60)}m{int(remaining % 60)}s"
+        else:
+            eta = "calculating..."
+
+        logger.info(f"Progress: {processed}/{self.total_files} ({percentage:.1f}%) | "
+                   f"Success: {self.completed_files} | Failed: {self.failed_files} | "
+                   f"Retried: {self.retried_files} | ETA: {eta}")
+
+    def print_summary(self):
+        elapsed = (datetime.now() - self.start_time).total_seconds()
+        logger.info("=" * 60)
+        logger.info("Processing Complete")
+        logger.info("=" * 60)
+        logger.info(f"Total files: {self.total_files}")
+        logger.info(f"Success: {self.completed_files}")
+        logger.info(f"Failed: {self.failed_files}")
+        logger.info(f"Total retries: {self.retried_files}")
+        logger.info(f"Total time: {int(elapsed // 60)}m{int(elapsed % 60)}s")
+
+        if self.failed_list:
+            logger.info("\nFailed files:")
+            for file_path, error in self.failed_list:
+                logger.info(f"  - {file_path}")
+                logger.info(f"    Error: {error}")
+
+        self._save_report()
+
+    def _save_report(self):
+        report = {
+            "total_files": self.total_files,
+            "completed": self.completed_files,
+            "failed": self.failed_files,
+            "retries": self.retried_files,
+            "start_time": self.start_time.isoformat(),
+            "end_time": datetime.now().isoformat(),
+            "elapsed_seconds": (datetime.now() - self.start_time).total_seconds(),
+            "failed_files": [{"file": f, "error": e} for f, e in self.failed_list]
+        }
+
+        report_file = Path(EXAMPLES_FOLDER) / "processing_report.json"
+        with open(report_file, 'w', encoding='utf-8') as f:
+            json.dump(report, f, ensure_ascii=False, indent=2)
+
+        logger.info(f"\nDetailed report saved to: {report_file}")
+
+
+stats = ProcessingStats()
+software_tests = {}
+
+
+def check_dependencies():
+    """Check and prompt for missing dependencies"""
+    missing = []
+
+    try:
+        import pdf2image
+    except ImportError:
+        missing.append("pdf2image")
+
+    try:
+        import PIL
+    except ImportError:
+        missing.append("Pillow")
+
+    try:
+        import cv2
+    except ImportError:
+        missing.append("opencv-python or opencv-python-headless")
+
+    if not shutil.which("soffice") and not shutil.which("libreoffice"):
+        logger.warning("LibreOffice not detected, cannot convert .doc and .ppt files")
+        logger.info("Install: sudo apt-get install libreoffice (Linux) or download from https://www.libreoffice.org/")
+
+    if missing:
+        logger.error(f"Missing dependencies: {', '.join(missing)}")
+        logger.info(f"Install with: pip install {' '.join(missing)}")
+        logger.info("Note: pdf2image also requires poppler")
+        logger.info("  - Ubuntu/Debian: sudo apt-get install poppler-utils")
+        logger.info("  - macOS: brew install poppler")
+        logger.info("  - Windows: download from https://github.com/oschwartz10612/poppler-windows/releases/")
+        return False
+    return True
+
+
+def convert_pdf_to_images(pdf_path: str) -> List[str]:
+    """Convert PDF to base64-encoded images"""
+    try:
+        from pdf2image import convert_from_path
+        from PIL import Image
+        import io
+
+        images = convert_from_path(pdf_path, dpi=150, fmt='jpeg')
+        base64_images = []
+
+        for img in images:
+            buffer = io.BytesIO()
+            img.save(buffer, format='JPEG', quality=100)
+            img_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
+            base64_images.append(img_base64)
+
+        return base64_images
+    except Exception as e:
+        logger.error(f"PDF conversion failed for {pdf_path}: {str(e)}")
+        return []
+
+
+def convert_office_to_pdf(input_path: str) -> Optional[str]:
+    """Convert Office documents to PDF using LibreOffice"""
+    try:
+        import subprocess
+
+        temp_dir = tempfile.mkdtemp()
+        soffice_cmd = "soffice" if shutil.which("soffice") else "libreoffice"
+
+        cmd = [
+            soffice_cmd,
+            "--headless",
+            "--convert-to", "pdf",
+            "--outdir", temp_dir,
+            input_path
+        ]
+
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
+
+        if result.returncode == 0:
+            pdf_name = Path(input_path).stem + ".pdf"
+            pdf_path = os.path.join(temp_dir, pdf_name)
+
+            if os.path.exists(pdf_path):
+                return pdf_path
+
+        logger.error(f"LibreOffice conversion failed: {result.stderr}")
+        return None
+
+    except Exception as e:
+        logger.error(f"Office conversion failed for {input_path}: {str(e)}")
+        return None
+
+
+def convert_document_to_images(file_path: str) -> List[str]:
+    """Convert any supported document to base64-encoded images"""
+    file_ext = Path(file_path).suffix.lower()
+
+    if file_ext == '.pdf':
+        return convert_pdf_to_images(file_path)
+
+    elif file_ext in ['.docx', '.doc', '.ppt', '.pptx']:
+        pdf_path = convert_office_to_pdf(file_path)
+        if pdf_path:
+            images = convert_pdf_to_images(pdf_path)
+            try:
+                os.remove(pdf_path)
+                os.rmdir(os.path.dirname(pdf_path))
+            except OSError:
+                pass
+            return images
+        return []
+
+    elif file_ext in ['.mp4', '.avi', '.mov', '.mkv']:
+        return extract_video_frames(file_path)
+
+    return []
+
+
+def extract_video_frames(video_path: str, num_frames: int = 10) -> List[str]:
+    """Extract key frames from video"""
+    try:
+        import cv2
+
+        cap = cv2.VideoCapture(video_path)
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+
+        if total_frames == 0:
+            cap.release()  # release the capture handle before the early return
+            return []
+        frame_indices = [int(total_frames * i / (num_frames + 1)) for i in range(1, num_frames + 1)]
+        base64_frames = []
+
+        for idx in frame_indices:
+            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
+            ret, frame = cap.read()
+
+            if ret:
+                height, width = frame.shape[:2]
+                if width > 1280:
+                    scale = 1280 / width
+                    frame = cv2.resize(frame, (1280, int(height * scale)))
+
+                _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 85])
+                frame_base64 = base64.b64encode(buffer).decode('utf-8')
+                base64_frames.append(frame_base64)
+
+        cap.release()
+        return base64_frames
+
+    except Exception as e:
+        logger.error(f"Video frame extraction failed for {video_path}: {str(e)}")
+        return []
+
+
+async def call_api_single_batch(images_batch: List[str], file_type: str,
+                                session: aiohttp.ClientSession, batch_num: int = 0) -> tuple[str, bool, int]:
+    """
+    Call API to process a single batch of images
+    Returns: (content, success, status_code)
+    """
+    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+
+    batch_info = f" (batch {batch_num})" if batch_num > 0 else ""
+    content = [
+        {"type": "text", "text": f"Please analyze the following {file_type} pages/frames{batch_info} and extract the operation workflow:"}
+    ]
+
+    for img_b64 in images_batch:
+        content.append({
+            "type": "image_url",
+            "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
+        })
+
+    messages.append({"role": "user", "content": content})
+
+    try:
+        headers = {
+            "Authorization": f"Bearer {API_KEY}",
+            "Content-Type": "application/json"
+        }
+
+        payload = {
+            "model": MODEL_NAME,
+            "messages": messages,
+            "max_tokens": 8192
+        }
+
+        async with session.post(API_URL, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=180)) as response:
+            status_code = response.status
+            if status_code == 200:
+                result = await response.json()
+                return result['choices'][0]['message']['content'], True, status_code
+            else:
+                error_text = await response.text()
+                return f"[API call failed: {status_code}]\n{error_text}", False, status_code
+
+    except asyncio.TimeoutError:
+        return "[API call timeout]", False, 0
+    except Exception as e:
+        return f"[API call error: {str(e)}]", False, 0
+
+
+async def call_multimodal_api_with_retry(file_path: str, session: aiohttp.ClientSession) -> tuple[str, bool]:
+    """
+    Call multimodal API to analyze document images with retry mechanism
+    Returns: (content, success)
+    """
+    images_base64 = convert_document_to_images(file_path)
+
+    if not images_base64:
+        error_msg = f"[Document conversion failed: unable to convert {Path(file_path).name} to images]"
+        return error_msg, False
+
+    file_type = "video" if Path(file_path).suffix.lower() in ['.mp4', '.avi', '.mov', '.mkv'] else "document"
+    total_images = len(images_base64)
+
+    if total_images > MAX_IMAGES_PER_REQUEST:
+        images_base64 = images_base64[:MAX_IMAGES_PER_REQUEST]
+        total_images = MAX_IMAGES_PER_REQUEST
+
+    for attempt in range(1, MAX_RETRY_ATTEMPTS + 1):
+        try:
+            content, success, status_code = await call_api_single_batch(images_base64, file_type, session)
+
+            if success:
+                return content, True
+
+            if status_code == 413:
+                return "[File too large: server refused to process the file]", False
+
+            if attempt < MAX_RETRY_ATTEMPTS:
+                delay = RETRY_DELAY * (RETRY_BACKOFF ** (attempt - 1))
+                logger.info(f"\nRetry {attempt}/{MAX_RETRY_ATTEMPTS}: {Path(file_path).name} (waiting {delay}s)")
+                stats.add_retry()
+                await asyncio.sleep(delay)
+                continue
+
+            return content, False
+
+        except asyncio.TimeoutError:
+            if attempt < MAX_RETRY_ATTEMPTS:
+                delay = RETRY_DELAY * (RETRY_BACKOFF ** (attempt - 1))
+                logger.info(f"\nRetry {attempt}/{MAX_RETRY_ATTEMPTS}: {Path(file_path).name} (timeout, waiting {delay}s)")
+                stats.add_retry()
+                await asyncio.sleep(delay)
+                continue
+            return "[API call timeout]", False
+
+        except Exception as e:
+            if attempt < MAX_RETRY_ATTEMPTS:
+                delay = RETRY_DELAY * (RETRY_BACKOFF ** (attempt - 1))
+                logger.info(f"\nRetry {attempt}/{MAX_RETRY_ATTEMPTS}: {Path(file_path).name} (error, waiting {delay}s)")
+                stats.add_retry()
+                await asyncio.sleep(delay)
+                continue
+            return f"[API call error: {str(e)}]", False
+
+    return "[Max retry attempts reached]", False
+
+
+async def process_file(file_path: str, session: aiohttp.ClientSession,
+                       semaphore: asyncio.Semaphore):
+    """Process a single file"""
+    async with semaphore:
+        try:
+            content, success = await call_multimodal_api_with_retry(file_path, session)
+
+            file_path_obj = Path(file_path).resolve()
+            input_folder_obj = Path(INPUT_FOLDER).resolve()
+
+            try:
+                rel_path = file_path_obj.relative_to(input_folder_obj)
+                software_name = rel_path.parts[0] if len(rel_path.parts) > 1 else "unknown"
+            except ValueError:
+                software_name = "unknown"
+
+            file_stem = file_path_obj.stem
+            test_id = file_stem
+            output_file = Path(EXAMPLES_FOLDER) / software_name / f"{file_stem}.json"
+            output_file.parent.mkdir(parents=True, exist_ok=True)
+
+            import re
+            match = re.search(r'```json\s*([\s\S]*?)\s*```', content)
+            content = match.group(1) if match else content
+
+            if success:
+                api_result = json.loads(content)
+
+                data = {
+                    "id": test_id,
+                    "snapshot": "snapshot",
+                    "instruction": api_result.get("steps", ""),
+                    "source": "custom",
+                    "config": [],
+                    "trajectory": "trajectories/",
+                    "related_apps": [software_name],
+                    "evaluator": {
+                        "postconfig": [
+                            {
+                                "type": "sleep",
+                                "parameters": {
+                                    "seconds": 3
+                                }
+                            }
+                        ],
+                        "func": "vllm_eval"
+                    },
+                    "proxy": False,
+                    "fixed_ip": False,
+                    "possibility_of_env_change": "low",
+                    "metadata": {
+                        "input_files": api_result.get("input_files", []),
+                        "task_goal": api_result.get("task_goal", "")
+                    }
+                }
+
+                if software_name not in software_tests:
+                    software_tests[software_name] = []
+                software_tests[software_name].append(test_id)
+
+            else:
+                data = {
+                    "id": test_id,
+                    "error": content,
+                    "status": "failed"
+                }
+
+            with open(output_file, 'w', encoding='utf-8') as f:
+                json.dump(data, f, ensure_ascii=False, indent=2)
+
+            if success:
+                stats.add_completed()
+            else:
+                stats.add_failed(file_path, content)
+
+        except Exception as e:
+            error_msg = str(e)
+            stats.add_failed(file_path, error_msg)
+            logger.error(f"\nError processing {file_path}: {error_msg}")
+
+
+def find_all_files(input_folder: str) -> List[str]:
+    """Recursively find all supported files"""
+    all_files = []
+
+    for root, dirs, files in os.walk(input_folder):
+        for file in files:
+            file_path = os.path.join(root, file)
+            if Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS:
+                all_files.append(file_path)
+
+    return all_files
+
+
+def save_test_all_json():
+    """Save aggregated test_all.json"""
+    test_all_path = Path(TEST_ALL_JSON)
+    if test_all_path.exists():
+        with open(test_all_path, 'r', encoding='utf-8') as f:
+            existing_data = json.load(f)
+    else:
+        existing_data = {}
+
+    for software, test_ids in software_tests.items():
+        if software in existing_data:
+            existing_data[software] = sorted(set(existing_data[software] + test_ids))
+        else:
+            existing_data[software] = test_ids
+
+    test_all_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(test_all_path, 'w', encoding='utf-8') as f:
+        json.dump(existing_data, f, ensure_ascii=False, indent=2)
+
+    logger.info(f"\nTest index updated: {test_all_path}")
+    logger.info(f"Software included: {list(existing_data.keys())}")
+
+
+async def main():
+    """Main function"""
+    if not check_dependencies():
+        return
+
+    if not Path(INPUT_FOLDER).exists():
+        logger.error(f"Input directory does not exist: {INPUT_FOLDER}")
+        return
+
+    Path(EXAMPLES_FOLDER).mkdir(parents=True, exist_ok=True)
+
+    logger.info("Scanning files...")
+    logger.info(f"Input directory: {INPUT_FOLDER}")
+    logger.info(f"Output directory: {EXAMPLES_FOLDER}")
+    logger.info(f"Test index file: {TEST_ALL_JSON}\n")
+
+    files = find_all_files(INPUT_FOLDER)
+    stats.total_files = len(files)
+
+    logger.info(f"Found {len(files)} files")
+    logger.info(f"Configuration: max retries={MAX_RETRY_ATTEMPTS}, concurrency={MAX_CONCURRENT_REQUESTS}")
+    logger.info("=" * 60 + "\n")
+
+    if not files:
+        logger.warning("No supported files found")
+        return
+
+    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
+
+    async with aiohttp.ClientSession() as session:
+        tasks = [
+            process_file(file, session, semaphore)
+            for file in files
+        ]
+        await asyncio.gather(*tasks, return_exceptions=True)
+
+    save_test_all_json()
+    stats.print_summary()
+
+    logger.info("\nCompleted!")
+    logger.info(f"  - Test cases saved to: {EXAMPLES_FOLDER}")
+    logger.info(f"  - Test index updated: {TEST_ALL_JSON}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/evaluation_examples/extract_instructions_v2.py
+++ b/evaluation_examples/extract_instructions_v2.py
@@ -0,0 +1,565 @@
+import os
+import sys
+import asyncio
+import aiohttp
+import base64
+import logging
+from pathlib import Path
+from typing import List, Optional
+import tempfile
+import shutil
+from dataclasses import dataclass
+from datetime import datetime
+import json
+import re
+
+# Configuration
+SCRIPT_DIR = Path(__file__).parent
+PROJECT_ROOT = SCRIPT_DIR.parent
+
+API_BASE_URL = os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1")
+API_URL = f"{API_BASE_URL}/chat/completions"
+API_KEY = os.getenv("OPENAI_API_KEY")
+MODEL_NAME = os.getenv("EXTRACT_MODEL", "gpt-4o")  # Configurable via env var
+MAX_CONCURRENT_REQUESTS = 5
+
+# Input folder where PDFs/Docs are stored, organized by software name
+# e.g. evaluation_examples/inputs/vesta/tutorial.pdf
+INPUT_FOLDER = PROJECT_ROOT / "evaluation_examples" / "inputs"
+EXAMPLES_FOLDER = PROJECT_ROOT / "evaluation_examples" / "examples"
+TEST_ALL_JSON = PROJECT_ROOT / "evaluation_examples" / "test_all.json"
+
+# Retry configuration
+MAX_RETRY_ATTEMPTS = 3
+RETRY_DELAY = 5
+RETRY_BACKOFF = 2
+
+# Image limit - keep low to avoid 413 payload too large errors
+MAX_IMAGES_PER_REQUEST = 20
+
+# Supported file extensions
+SUPPORTED_EXTENSIONS = {'.docx', '.doc', '.ppt', '.pptx', '.pdf', '.mp4', '.avi', '.mov', '.mkv'}
+
+# Software-specific launch config and snapshot mapping
+# Maps software folder name -> {"snapshot": ..., "config": [...]}
+SOFTWARE_CONFIG = {
+    "avogadro": {
+        "snapshot": "avogadro",
+        "config": [
+            {"type": "launch", "parameters": {"command": ["C:\\Avogadro2\\bin\\avogadro2.exe"]}},
+            {"type": "sleep", "parameters": {"seconds": 5}}
+        ]
+    },
+    "imagej": {
+        "snapshot": "imagej",
+        "config": [
+            {"type": "launch", "parameters": {"command": ["C:\\ImageJ\\ImageJ.exe"]}},
+            {"type": "sleep", "parameters": {"seconds": 5}}
+        ]
+    },
+    "origin": {
+        "snapshot": "origin",
+        "config": [
+            {"type": "launch", "parameters": {"command": ["C:\\OriginLab\\Origin2025b\\Origin64.exe"]}},
+            {"type": "sleep", "parameters": {"seconds": 5}}
+        ]
+    },
+    "ovito": {
+        "snapshot": "ovito",
+        "config": [
+            {"type": "launch", "parameters": {"command": ["C:\\OVITO Basic\\ovito.exe"]}},
+            {"type": "sleep", "parameters": {"seconds": 5}}
+        ]
+    },
+    "pymol": {
+        "snapshot": "pymol",
+        "config": [
+            {"type": "launch", "parameters": {"command": ["C:\\PYMOL\\PyMOLWin.exe"]}},
+            {"type": "sleep", "parameters": {"seconds": 5}}
+        ]
+    },
+    "vesta": {
+        "snapshot": "vesta",
+        "config": [
+            {"type": "launch", "parameters": {"command": ["C:\\VESTA-win64\\VESTA.exe"]}},
+            {"type": "sleep", "parameters": {"seconds": 5}}
+        ]
+    },
+}
+
+# Default config for unknown software
+DEFAULT_SOFTWARE_CONFIG = {
+    "snapshot": "snapshot",
+    "config": []
+}
+
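+SYSTEM_PROMPT = """You are an expert in GUI automation testing for scientific software. Your job is to extract multiple **specific, executable, verifiable** GUI operation tasks from a tutorial document.
+
+## Core requirements
+These tasks will be used to test an AI agent's ability to operate desktop software. Each task must be specific enough that the agent knows exactly what to do and that success can be judged from a screenshot afterwards.
+
+## Task granularity (very important)
+- **Each task should be a small one that takes 3-8 GUI operations to complete**
+- **task_goal must include concrete details such as parameter values, file names, and menu paths**
+- **Never write vague instructions**
+
+### ❌ Bad examples (too vague):
+- "Perform phase identification": the agent does not know which file to use or which parameters to choose
+- "Export data": export in what format? Saved where?
+- "Calculate crystallite size": which peak? What parameters?
+
+### ✅ Good examples (specific and executable; written in Chinese, as task_goal must be):
+- "在 ImageJ 中，通过 File → Open 打开桌面上的 cell_image.tif 文件"
+- "在 ImageJ 中，使用 Image → Adjust → Threshold 对当前图像进行阈值分割，选择 Default 方法并点击 Apply"
+- "在 ImageJ 中，通过 Analyze → Measure 测量当前选区的面积和平均灰度值"
+- "在 ImageJ 中，使用 Process → Filters → Gaussian Blur 对图像施加半径为 2.0 像素的高斯模糊"
+- "在 Avogadro 2 中，通过 Build → Insert → Molecule 搜索并插入一个 benzene 分子"
+- "在 VESTA 中通过 File → Open 打开桌面上的 Si.cif 文件，然后将视角旋转到 [110] 方向"
+
+## Output format
+Return a strict JSON object:
+{
+    "tasks": [
+        {
+            "task_goal": "A one-sentence, specific description of what to do (including the software name, menu path, file name, parameter values, and other concrete details). Write it in Chinese.",
+            "input_files": ["List of the file names involved, e.g. 'sample.raw'. Use an empty list [] if no input file is needed"],
+            "steps": "Detailed GUI operation steps, numbered and separated by newlines"
+        }
+    ]
+}
+
+## Task extraction rules
+1. **Independence**: every task can be completed on its own (assume the software is already open or is started from scratch)
+2. **Specificity**: task_goal must contain the specific file names, parameter values, and menu names mentioned in the tutorial
+3. **Verifiability**: after completion, a screenshot should show whether the task succeeded (for example: the file is open, the chart is displayed, the dialog has appeared)
+4. **Faithfulness**: describe only operations that actually appear in the tutorial; do not invent features
+5. **Quantity**: extract 10-15 distinct tasks from each tutorial, covering its various chapters. Prefer the most common and most representative operations
+6. **Software name**: task_goal must begin with 「在 XXX 中，」 ("In XXX,"), explicitly naming the software
+7. **Difficulty distribution**: include simple (2-3 steps), medium (4-5 steps), and harder (6-8 steps) tasks, one third each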
+
+"""
+
+# Logging configuration
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(levelname)s - %(message)s',
+    handlers=[
+        logging.StreamHandler(sys.stdout)
+    ]
+)
+logger = logging.getLogger(__name__)
+
+stats = None # Will be initialized in main
+
+@dataclass
+class ProcessingStats:
+    """Processing statistics tracker"""
+    total_files: int = 0
+    completed_files: int = 0
+    failed_files: int = 0
+    retried_files: int = 0
+    generated_tasks: int = 0
+    start_time: Optional[datetime] = None
+    failed_list: Optional[List[tuple]] = None
+
+    def __post_init__(self):
+        if self.start_time is None:
+            self.start_time = datetime.now()
+        if self.failed_list is None:
+            self.failed_list = []
+
+    def add_completed(self, num_tasks=1):
+        self.completed_files += 1
+        self.generated_tasks += num_tasks
+        self._log_progress()
+
+    def add_failed(self, file_path: str, error: str):
+        self.failed_files += 1
+        self.failed_list.append((file_path, error))
+        self._log_progress()
+
+    def add_retry(self):
+        self.retried_files += 1
+
+    def _log_progress(self):
+        processed = self.completed_files + self.failed_files
+        percentage = (processed / self.total_files * 100) if self.total_files > 0 else 0
+        elapsed = (datetime.now() - self.start_time).total_seconds()
+
+        logger.info(f"Progress: {processed}/{self.total_files} ({percentage:.1f}%) | "
+                   f"Tasks Gen: {self.generated_tasks} | Failed: {self.failed_files} | Elapsed: {elapsed:.0f}s")
+
+# -----------------------------------------------------------------------------
+# Dependency Checks & File Conversion (Copied & Adapted from original script)
+# -----------------------------------------------------------------------------
+
+def check_dependencies():
+    """Check and prompt for missing dependencies"""
+    missing = []
+    try: import pdf2image
+    except ImportError: missing.append("pdf2image")
+    try: import PIL
+    except ImportError: missing.append("Pillow")
+    try: import cv2
+    except ImportError: missing.append("opencv-python")
+
+    if not shutil.which("soffice") and not shutil.which("libreoffice"):
+        logger.warning("LibreOffice not detected (needed for .doc/.ppt)")
+
+    if missing:
+        logger.error(f"Missing dependencies: {', '.join(missing)}")
+        return False
+    return True
+
+def convert_pdf_to_images(pdf_path: str) -> List[str]:
+    try:
+        from pdf2image import convert_from_path
+        import io
+
+        # First, get total page count at very low DPI
+        quick_check = convert_from_path(pdf_path, dpi=36, fmt='jpeg')
+        total_pages = len(quick_check)
+        del quick_check
+
+        # For large PDFs: lower DPI + sample pages evenly
+        if total_pages > MAX_IMAGES_PER_REQUEST:
+            dpi = 100  # lower DPI for large docs
+            quality = 80
+            # Sample pages evenly across the document
+            step = total_pages / MAX_IMAGES_PER_REQUEST
+            selected_pages = [int(step * i) + 1 for i in range(MAX_IMAGES_PER_REQUEST)]
+            logger.info(f"Large PDF ({total_pages} pages): sampling {len(selected_pages)} pages at {dpi} DPI")
+            base64_images = []
+            for page_num in selected_pages:
+                imgs = convert_from_path(pdf_path, dpi=dpi, fmt='jpeg',
+                                        first_page=page_num, last_page=page_num)
+                if imgs:
+                    buffer = io.BytesIO()
+                    imgs[0].save(buffer, format='JPEG', quality=quality)
+                    base64_images.append(base64.b64encode(buffer.getvalue()).decode('utf-8'))
+            return base64_images
+        else:
+            # Small PDF: convert all pages at normal quality
+            dpi = 150
+            quality = 90
+            logger.info(f"PDF ({total_pages} pages) at {dpi} DPI")
+            images = convert_from_path(pdf_path, dpi=dpi, fmt='jpeg')
+            base64_images = []
+            for img in images:
+                buffer = io.BytesIO()
+                img.save(buffer, format='JPEG', quality=quality)
+                base64_images.append(base64.b64encode(buffer.getvalue()).decode('utf-8'))
+            return base64_images
+    except Exception as e:
+        logger.error(f"PDF conversion failed: {e}")
+        return []
+
+def convert_office_to_pdf(input_path: str) -> Optional[str]:
+    try:
+        import subprocess
+        soffice_cmd = shutil.which("soffice") or shutil.which("libreoffice")
+        if not soffice_cmd: return None  # LibreOffice is not installed
+        temp_dir = tempfile.mkdtemp()  # created only once a converter is known to exist
+
+        cmd = [soffice_cmd, "--headless", "--convert-to", "pdf", "--outdir", temp_dir, input_path]
+        subprocess.run(cmd, capture_output=True, timeout=60)
+        pdf_path = os.path.join(temp_dir, Path(input_path).stem + ".pdf")
+        if os.path.exists(pdf_path): return pdf_path
+        shutil.rmtree(temp_dir, ignore_errors=True)  # no PDF was produced: discard the temp dir
+        return None
+    except Exception:
+        return None
+
+def extract_video_frames(video_path: str, num_frames=10) -> List[str]:
+    try:
+        import cv2
+        cap = cv2.VideoCapture(video_path)
+        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+        # An unreadable video reports no frames; sample nothing so cap.release() below still runs
+        indices = [int(total * i / (num_frames + 1)) for i in range(1, num_frames + 1)] if total > 0 else []
+        frames = []
+        for idx in indices:
+            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
+            ret, frame = cap.read()
+            if ret:
+                h, w = frame.shape[:2]
+                if w > 1280:
+                    scale = 1280/w
+                    frame = cv2.resize(frame, (1280, int(h*scale)))
+                _, buf = cv2.imencode('.jpg', frame)
+                frames.append(base64.b64encode(buf).decode('utf-8'))
+        cap.release()
+        return frames
+    except Exception:
+        return []
+
+def convert_document_to_images(file_path: str) -> List[str]:
+    path = Path(file_path)
+    ext = path.suffix.lower()
+    if ext == '.pdf':
+        return convert_pdf_to_images(file_path)
+    elif ext in ['.docx', '.doc', '.ppt', '.pptx']:
+        pdf = convert_office_to_pdf(file_path)
+        if pdf:
+            imgs = convert_pdf_to_images(pdf)
+            shutil.rmtree(os.path.dirname(pdf), ignore_errors=True)
+            return imgs
+    elif ext in ['.mp4', '.avi', '.mov', '.mkv']:
+        return extract_video_frames(file_path)
+    return []
+
+# -----------------------------------------------------------------------------
+# API Interaction
+# -----------------------------------------------------------------------------
+
+async def call_multimodal_api(images_b64: List[str], session: aiohttp.ClientSession) -> tuple[str, bool]:
+    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+
+    content = [{"type": "text", "text": "Analyze these tutorial pages and extract benchmark tasks as JSON."}]
+
+    # Cap images to avoid huge payloads
+    subset_images = images_b64[:MAX_IMAGES_PER_REQUEST]
+    for img in subset_images:
+        content.append({
+            "type": "image_url",
+            "image_url": {"url": f"data:image/jpeg;base64,{img}"}
+        })
+
+    messages.append({"role": "user", "content": content})
+
+    for attempt in range(1, MAX_RETRY_ATTEMPTS + 1):
+        try:
+            headers = {
+                "Authorization": f"Bearer {API_KEY}",
+                "Content-Type": "application/json"
+            }
+            # Add site specific headers if using openrouter or others if needed
+
+            payload = {
+                "model": MODEL_NAME,
+                "messages": messages,
+                "max_tokens": 4096,
+            }
+
+            async with session.post(API_URL, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=180)) as response:
+                if response.status == 200:
+                    res_json = await response.json()
+                    return res_json['choices'][0]['message']['content'], True
+                else:
+                    err = await response.text()
+                    logger.warning(f"API Error ({response.status}): {err}")
+                    if attempt < MAX_RETRY_ATTEMPTS:
+                        await asyncio.sleep(RETRY_DELAY * RETRY_BACKOFF ** (attempt - 1))
+                    else:
+                        return f"API Error: {err}", False
+        except Exception as e:
+            logger.warning(f"Exception: {e}")
+            if attempt < MAX_RETRY_ATTEMPTS:
+                await asyncio.sleep(RETRY_DELAY * RETRY_BACKOFF ** (attempt - 1))
+            else:
+                return str(e), False
+    return "Max retries", False
+
+# -----------------------------------------------------------------------------
+# Main Logic
+# -----------------------------------------------------------------------------
+
+software_tests = {}  # Global dict to track software -> [test_ids]
+FORCE_REGENERATE = False  # Set via --force flag
+
+async def process_file(file_path: str, session: aiohttp.ClientSession, semaphore: asyncio.Semaphore):
+    async with semaphore:
+        file_path_obj = Path(file_path)
+        file_stem = file_path_obj.stem
+
+        # Infer software name from folder structure
+        try:
+            rel_path = file_path_obj.relative_to(INPUT_FOLDER)
+            software_name = rel_path.parts[0] if len(rel_path.parts) > 1 else "unknown"
+        except ValueError:
+            software_name = "unknown"
+
+        # Skip if already processed (check if task1 json exists)
+        existing_task1 = EXAMPLES_FOLDER / software_name / f"{file_stem}_task1.json"
+        if existing_task1.exists() and not FORCE_REGENERATE:
+            logger.info(f"Skipping (already processed): {file_path_obj.name} → use --force to regenerate")
+            # Still register existing tasks in software_tests for test_all.json
+            import glob as g  # escape the stem so characters such as "[" in file names match literally
+            existing_tasks = g.glob(str(EXAMPLES_FOLDER / software_name / f"{g.escape(file_stem)}_task*.json"))
+            for t in existing_tasks:
+                tid = Path(t).stem
+                if software_name not in software_tests:
+                    software_tests[software_name] = []
+                software_tests[software_name].append(tid)
+            stats.add_completed(num_tasks=len(existing_tasks))
+            return
+
+        logger.info(f"Processing: {file_path_obj.name}")
+
+        # 1. Convert to images
+        images = convert_document_to_images(file_path)
+        if not images:
+            stats.add_failed(file_path, "No images extracted")
+            return
+
+        # 2. Call API
+        content, success = await call_multimodal_api(images, session)
+        if not success:
+            stats.add_failed(file_path, content)
+            return
+
+        # 3. Parse JSON
+        try:
+            # Try to find JSON block if mixed with text
+            json_match = re.search(r'\{.*\}', content, re.DOTALL)
+            if json_match:
+                json_str = json_match.group(0)
+            else:
+                json_str = content
+
+            api_result = json.loads(json_str)
+            tasks = api_result.get("tasks", [])
+            if not tasks:
+                logger.warning(f"No tasks found in JSON for {file_path}")
+                stats.add_failed(file_path, "No tasks found in JSON")  # keep progress totals consistent
+                return
+        except json.JSONDecodeError as e:
+            stats.add_failed(file_path, f"JSON Parse Error: {e}")
+            logger.error(f"Raw content: {content[:200]}...")
+            return
+
+        # 4. Generate Output Files
+        for i, task in enumerate(tasks, 1):
+            test_id = f"{file_stem}_task{i}"
+            output_file = EXAMPLES_FOLDER / software_name / f"{test_id}.json"
+            output_file.parent.mkdir(parents=True, exist_ok=True)
+
+            # Get software-specific config
+            sw_cfg = SOFTWARE_CONFIG.get(software_name, DEFAULT_SOFTWARE_CONFIG)
+
+            # Construct the OSWorld/Jade Benchmark Standard JSON
+            task_json = {
+                "id": test_id,
+                "snapshot": sw_cfg["snapshot"],
+                "instruction": task.get("task_goal", ""),
+                "source": "custom",
+                "config": sw_cfg["config"],
+                "trajectory": "trajectories/",
+                "related_apps": [software_name],
+                "evaluator": {
+                    "postconfig": [
+                        {
+                            "type": "sleep",
+                            "parameters": {
+                                "seconds": 3
+                            }
+                        }
+                    ],
+                    "func": "vllm_eval"
+                    # "result" field is NOT needed for vllm_eval
+                },
+                "proxy": False,
+                "fixed_ip": False,
+                "possibility_of_env_change": "low",
+                "metadata": {
+                    "input_files": task.get("input_files", []),
+                    "steps": task.get("steps", "")
+                }
+            }
+
+            with open(output_file, 'w', encoding='utf-8') as f:
+                json.dump(task_json, f, ensure_ascii=False, indent=2)
+
+            # Register to global index
+            if software_name not in software_tests:
+                software_tests[software_name] = []
+            software_tests[software_name].append(test_id)
+
+        stats.add_completed(num_tasks=len(tasks))
+        logger.info(f"Generated {len(tasks)} tasks for {file_path_obj.name}")
+
+def save_test_all_json():
+    """Update test_all.json with new tests"""
+    test_all_meta_path = Path(TEST_ALL_JSON)
+    existing_data = {}
+    if test_all_meta_path.exists():
+        try:
+            with open(test_all_meta_path, 'r', encoding='utf-8') as f:
+                existing_data = json.load(f)
+        except (json.JSONDecodeError, OSError): pass  # unreadable index: start from an empty one
+
+    # Merge new tests
+    for software, test_ids in software_tests.items():
+        current_list = existing_data.get(software, [])
+        # Append unique
+        updated_list = sorted(set(current_list + test_ids))
+        existing_data[software] = updated_list
+
+    with open(test_all_meta_path, 'w', encoding='utf-8') as f:
+        json.dump(existing_data, f, ensure_ascii=False, indent=2)
+
+    # Also save a 'test_custom.json' that contains ONLY the software present in our inputs folder
+    # This is useful for running ONLY the custom benchmarks, without the OSWorld defaults
+    custom_data = {}
+
+    # Scan INPUT_FOLDER to find out which software entries are "ours"
+    custom_softwares = set()
+    if INPUT_FOLDER.exists():
+        for item in os.listdir(INPUT_FOLDER):
+            if (INPUT_FOLDER / item).is_dir():
+                custom_softwares.add(item)
+
+    for software in custom_softwares:
+        if software in existing_data:
+            custom_data[software] = existing_data[software]
+
+    test_custom_path = PROJECT_ROOT / "evaluation_examples" / "test_custom.json"
+    with open(test_custom_path, 'w', encoding='utf-8') as f:
+        json.dump(custom_data, f, ensure_ascii=False, indent=2)
+    logger.info(f"Custom test index saved to: {test_custom_path}")
+
+async def main():
+    global stats, FORCE_REGENERATE
+    stats = ProcessingStats()
+
+    # Parse --force flag
+    FORCE_REGENERATE = "--force" in sys.argv
+
+    if not API_KEY:
+        logger.error("OPENAI_API_KEY environment variable not set.")
+        return
+
+    # Check/Create Input Folder
+    if not INPUT_FOLDER.exists():
+        logger.warning(f"Input folder {INPUT_FOLDER} does not exist. Creating it.")
+        INPUT_FOLDER.mkdir(parents=True, exist_ok=True)
+        logger.info(f"Please put software PDF tutorials into subfolders in: {INPUT_FOLDER}")
+        return
+
+    # Find files
+    files = []
+    for root, _, filenames in os.walk(INPUT_FOLDER):
+        for f in filenames:
+            if Path(f).suffix.lower() in SUPPORTED_EXTENSIONS:
+                files.append(os.path.join(root, f))
+
+    stats.total_files = len(files)
+    logger.info(f"Found {len(files)} files in {INPUT_FOLDER}")
+
+    if not files:
+        logger.info("No files to process.")
+        return
+
+    # Process
+    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
+    async with aiohttp.ClientSession() as session:
+        tasks = [process_file(f, session, semaphore) for f in files]
+        await asyncio.gather(*tasks)
+
+    # Save Index
+    save_test_all_json()
+
+    logger.info("Done.")
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/evaluation_examples/test_all.json
+++ b/evaluation_examples/test_all.json
@@ -387,5 +387,316 @@
    "dcbe20e8-647f-4f1d-8696-f1c5bbb570e3",
    "7c4cc09e-7a92-40dd-8338-b2286535c4ed",
    "971cbb5b-3cbf-4ff7-9e24-b5c84fcebfa6"
+  ],
+  "jade": [
+    "MDIJade6.5使用手册_task1",
+    "MDIJade6.5使用手册_task10",
+    "MDIJade6.5使用手册_task2",
+    "MDIJade6.5使用手册_task3",
+    "MDIJade6.5使用手册_task4",
+    "MDIJade6.5使用手册_task5",
+    "MDIJade6.5使用手册_task6",
+    "MDIJade6.5使用手册_task7",
+    "MDIJade6.5使用手册_task8",
+    "MDIJade6.5使用手册_task9",
+    "jade_test"
+  ],
+  "avogadro": [
+    "building-metal-complexes_task1",
+    "building-metal-complexes_task2",
+    "building-metal-complexes_task3",
+    "building-metal-complexes_task4",
+    "building-metal-complexes_task5",
+    "building-metal-complexes_task6",
+    "building-metal-complexes_task7",
+    "building-organic-molecules_task1",
+    "building-organic-molecules_task10",
+    "building-organic-molecules_task2",
+    "building-organic-molecules_task3",
+    "building-organic-molecules_task4",
+    "building-organic-molecules_task5",
+    "building-organic-molecules_task6",
+    "building-organic-molecules_task7",
+    "building-organic-molecules_task8",
+    "building-organic-molecules_task9",
+    "learning-avogadro_task1",
+    "learning-avogadro_task2",
+    "learning-avogadro_task3",
+    "learning-avogadro_task4",
+    "learning-avogadro_task5",
+    "learning-avogadro_task6",
+    "learning-avogadro_task7",
+    "learning-avogadro_task8",
+    "learning-avogadro_task9",
+    "naming-a-molecule_task1",
+    "naming-a-molecule_task2",
+    "using-qtaim-and-wfn_task1",
+    "using-qtaim-and-wfn_task2",
+    "using-qtaim-and-wfn_task3",
+    "viewing-electrostatic-potential_task1",
+    "viewing-electrostatic-potential_task2",
+    "viewing-molecular-orbitals_task1",
+    "viewing-molecular-orbitals_task2",
+    "viewing-molecular-orbitals_task3",
+    "viewing-vibrations_task1",
+    "viewing-vibrations_task2",
+    "viewing-vibrations_task3",
+    "viewing-vibrations_task4",
+    "viewing-vibrations_task5"
+  ],
+  "imagej": [
+    "user-guide_task1",
+    "user-guide_task10",
+    "user-guide_task2",
+    "user-guide_task3",
+    "user-guide_task4",
+    "user-guide_task5",
+    "user-guide_task6",
+    "user-guide_task7",
+    "user-guide_task8",
+    "user-guide_task9"
+  ],
+  "origin": [
+    "Origin_User_Guide_2025b_E_task1",
+    "Origin_User_Guide_2025b_E_task10",
+    "Origin_User_Guide_2025b_E_task11",
+    "Origin_User_Guide_2025b_E_task12",
+    "Origin_User_Guide_2025b_E_task2",
+    "Origin_User_Guide_2025b_E_task3",
+    "Origin_User_Guide_2025b_E_task4",
+    "Origin_User_Guide_2025b_E_task5",
+    "Origin_User_Guide_2025b_E_task6",
+    "Origin_User_Guide_2025b_E_task7",
+    "Origin_User_Guide_2025b_E_task8",
+    "Origin_User_Guide_2025b_E_task9"
+  ],
+  "ovito": [
+    "animation_task1",
+    "animation_task10",
+    "animation_task2",
+    "animation_task3",
+    "animation_task4",
+    "animation_task5",
+    "animation_task6",
+    "animation_task7",
+    "animation_task8",
+    "animation_task9",
+    "aspherical_particles_task1",
+    "aspherical_particles_task10",
+    "aspherical_particles_task2",
+    "aspherical_particles_task3",
+    "aspherical_particles_task4",
+    "aspherical_particles_task5",
+    "aspherical_particles_task6",
+    "aspherical_particles_task7",
+    "aspherical_particles_task8",
+    "aspherical_particles_task9",
+    "clone_pipeline_task1",
+    "clone_pipeline_task2",
+    "clone_pipeline_task3",
+    "clone_pipeline_task4",
+    "clone_pipeline_task5",
+    "clone_pipeline_task6",
+    "clone_pipeline_task7",
+    "clone_pipeline_task8",
+    "code_generation_task1",
+    "code_generation_task2",
+    "code_generation_task3",
+    "code_generation_task4",
+    "code_generation_task5",
+    "code_generation_task6",
+    "code_generation_task7",
+    "code_generation_task8",
+    "customize_init_state_task1",
+    "customize_init_state_task2",
+    "customize_init_state_task3",
+    "customize_init_state_task4",
+    "customize_init_state_task5",
+    "data_model_task1",
+    "data_model_task10",
+    "data_model_task2",
+    "data_model_task3",
+    "data_model_task4",
+    "data_model_task5",
+    "data_model_task6",
+    "data_model_task7",
+    "data_model_task8",
+    "data_model_task9",
+    "export_task1",
+    "export_task2",
+    "export_task3",
+    "export_task4",
+    "export_task5",
+    "import_task1",
+    "import_task10",
+    "import_task2",
+    "import_task3",
+    "import_task4",
+    "import_task5",
+    "import_task6",
+    "import_task7",
+    "import_task8",
+    "import_task9",
+    "marker_particles_task1",
+    "marker_particles_task2",
+    "marker_particles_task3",
+    "marker_particles_task4",
+    "marker_particles_task5",
+    "marker_particles_task6",
+    "marker_particles_task7",
+    "marker_particles_task8",
+    "marker_particles_task9",
+    "miscellaneous_task1",
+    "miscellaneous_task10",
+    "miscellaneous_task2",
+    "miscellaneous_task3",
+    "miscellaneous_task4",
+    "miscellaneous_task5",
+    "miscellaneous_task6",
+    "miscellaneous_task7",
+    "miscellaneous_task8",
+    "miscellaneous_task9",
+    "pipeline_task1",
+    "pipeline_task2",
+    "pipeline_task3",
+    "pipeline_task4",
+    "pipeline_task5",
+    "pipeline_task6",
+    "pipeline_task7",
+    "pipeline_task8",
+    "pipeline_task9",
+    "python_extensions_task1",
+    "python_extensions_task10",
+    "python_extensions_task2",
+    "python_extensions_task3",
+    "python_extensions_task4",
+    "python_extensions_task5",
+    "python_extensions_task6",
+    "python_extensions_task7",
+    "python_extensions_task8",
+    "python_extensions_task9",
+    "remote_file_access_task1",
+    "remote_file_access_task10",
+    "remote_file_access_task2",
+    "remote_file_access_task3",
+    "remote_file_access_task4",
+    "remote_file_access_task5",
+    "remote_file_access_task6",
+    "remote_file_access_task7",
+    "remote_file_access_task8",
+    "remote_file_access_task9",
+    "remote_rendering_task1",
+    "remote_rendering_task2",
+    "remote_rendering_task3",
+    "remote_rendering_task4",
+    "remote_rendering_task5",
+    "remote_rendering_task6",
+    "remote_rendering_task7",
+    "remote_rendering_task8",
+    "remote_rendering_task9",
+    "rendering_task1",
+    "rendering_task2",
+    "rendering_task3",
+    "rendering_task4",
+    "rendering_task5",
+    "rendering_task6",
+    "rendering_task7",
+    "rendering_task8",
+    "rendering_task9",
+    "transparent_particles_task1",
+    "transparent_particles_task2",
+    "turntable_animation_task1",
+    "turntable_animation_task2",
+    "turntable_animation_task3",
+    "turntable_animation_task4",
+    "turntable_animation_task5",
+    "turntable_animation_task6",
+    "viewport_layouts_task1",
+    "viewport_layouts_task10",
+    "viewport_layouts_task2",
+    "viewport_layouts_task3",
+    "viewport_layouts_task4",
+    "viewport_layouts_task5",
+    "viewport_layouts_task6",
+    "viewport_layouts_task7",
+    "viewport_layouts_task8",
+    "viewport_layouts_task9",
+    "viewports_task1",
+    "viewports_task10",
+    "viewports_task11",
+    "viewports_task2",
+    "viewports_task3",
+    "viewports_task4",
+    "viewports_task5",
+    "viewports_task6",
+    "viewports_task7",
+    "viewports_task8",
+    "viewports_task9"
+  ],
+  "pymol": [
+    "Biochemistry_student_intro_task1",
+    "Biochemistry_student_intro_task10",
+    "Biochemistry_student_intro_task2",
+    "Biochemistry_student_intro_task3",
+    "Biochemistry_student_intro_task4",
+    "Biochemistry_student_intro_task5",
+    "Biochemistry_student_intro_task6",
+    "Biochemistry_student_intro_task7",
+    "Biochemistry_student_intro_task8",
+    "Biochemistry_student_intro_task9",
+    "MovieSchool_1_task1",
+    "MovieSchool_1_task2",
+    "MovieSchool_1_task3",
+    "MovieSchool_1_task4",
+    "MovieSchool_1_task5",
+    "MovieSchool_1_task6",
+    "MovieSchool_1_task7",
+    "MovieSchool_3_task1",
+    "MovieSchool_3_task10",
+    "MovieSchool_3_task2",
+    "MovieSchool_3_task3",
+    "MovieSchool_3_task4",
+    "MovieSchool_3_task5",
+    "MovieSchool_3_task6",
+    "MovieSchool_3_task7",
+    "MovieSchool_3_task8",
+    "MovieSchool_3_task9",
+    "Mutagenesis_task1",
+    "Mutagenesis_task2",
+    "Mutagenesis_task3",
+    "Mutagenesis_task4",
+    "Mutagenesis_task5",
+    "Mutagenesis_task6",
+    "Mutagenesis_task7",
+    "Practical_Pymol_for_Beginners_task1",
+    "Practical_Pymol_for_Beginners_task10",
+    "Practical_Pymol_for_Beginners_task11",
+    "Practical_Pymol_for_Beginners_task12",
+    "Practical_Pymol_for_Beginners_task13",
+    "Practical_Pymol_for_Beginners_task2",
+    "Practical_Pymol_for_Beginners_task3",
+    "Practical_Pymol_for_Beginners_task4",
+    "Practical_Pymol_for_Beginners_task5",
+    "Practical_Pymol_for_Beginners_task6",
+    "Practical_Pymol_for_Beginners_task7",
+    "Practical_Pymol_for_Beginners_task8",
+    "Practical_Pymol_for_Beginners_task9",
+    "Visualizing_a_computed_structure_-_a_commented_example_task1",
+    "Visualizing_a_computed_structure_-_a_commented_example_task2",
+    "Visualizing_a_computed_structure_-_a_commented_example_task3",
+    "Visualizing_a_computed_structure_-_a_commented_example_task4"
+  ],
+  "vesta": [
+    "VESTA_Manual_task1",
+    "VESTA_Manual_task10",
+    "VESTA_Manual_task11",
+    "VESTA_Manual_task2",
+    "VESTA_Manual_task3",
+    "VESTA_Manual_task4",
+    "VESTA_Manual_task5",
+    "VESTA_Manual_task6",
+    "VESTA_Manual_task7",
+    "VESTA_Manual_task8",
+    "VESTA_Manual_task9"
  ]
 }
--- a/evaluation_examples/test_curated.json
+++ b/evaluation_examples/test_curated.json
@@ -0,0 +1,93 @@
+{
+  "avogadro": [
+    "building-metal-complexes_task1",
+    "building-metal-complexes_task3",
+    "building-metal-complexes_task7",
+    "building-organic-molecules_task1",
+    "building-organic-molecules_task3",
+    "building-organic-molecules_task4",
+    "building-organic-molecules_task5",
+    "building-organic-molecules_task9",
+    "naming-a-molecule_task1",
+    "viewing-electrostatic-potential_task1"
+  ],
+  "imagej": [
+    "user-guide_task1",
+    "user-guide_task10",
+    "user-guide_task2",
+    "user-guide_task3",
+    "user-guide_task4",
+    "user-guide_task5",
+    "user-guide_task6",
+    "user-guide_task7",
+    "user-guide_task8",
+    "user-guide_task9"
+  ],
+  "jade": [
+    "MDIJade6.5使用手册_task1",
+    "MDIJade6.5使用手册_task10",
+    "MDIJade6.5使用手册_task2",
+    "MDIJade6.5使用手册_task3",
+    "MDIJade6.5使用手册_task4",
+    "MDIJade6.5使用手册_task5",
+    "MDIJade6.5使用手册_task6",
+    "MDIJade6.5使用手册_task7",
+    "MDIJade6.5使用手册_task8"
+
+  ],
+  "origin": [
+    "Origin_User_Guide_2025b_E_task1",
+    "Origin_User_Guide_2025b_E_task11",
+    "Origin_User_Guide_2025b_E_task12",
+    "Origin_User_Guide_2025b_E_task2",
+    "Origin_User_Guide_2025b_E_task3",
+    "Origin_User_Guide_2025b_E_task4",
+    "Origin_User_Guide_2025b_E_task5",
+    "Origin_User_Guide_2025b_E_task8",
+    "Origin_User_Guide_2025b_E_task9"
+  ],
+  "ovito": [
+    "animation_task3",
+    "aspherical_particles_task1",
+    "clone_pipeline_task1",
+    "code_generation_task1",
+    "customize_init_state_task1",
+    "data_model_task1",
+    "export_task1",
+    "marker_particles_task2",
+    "miscellaneous_task1",
+    "python_extensions_task1",
+    "remote_file_access_task1",
+    "remote_rendering_task1",
+    "rendering_task1",
+    "transparent_particles_task1"
+  ],
+  "pymol": [
+    "MovieSchool_1_task1",
+    "MovieSchool_1_task2",
+    "MovieSchool_1_task3",
+    "MovieSchool_1_task4",
+    "MovieSchool_1_task5",
+    "MovieSchool_3_task1",
+    "MovieSchool_3_task10",
+    "MovieSchool_3_task2",
+    "MovieSchool_3_task3",
+    "MovieSchool_3_task4",
+    "MovieSchool_3_task5",
+    "Mutagenesis_task4",
+    "Practical_Pymol_for_Beginners_task6"
+  ],
+  "vesta": [
+    "VESTA_Manual_task1",
+    "VESTA_Manual_task10",
+    "VESTA_Manual_task11",
+    "VESTA_Manual_task2",
+    "VESTA_Manual_task3",
+    "VESTA_Manual_task4",
+    "VESTA_Manual_task5",
+    "VESTA_Manual_task6",
+    "VESTA_Manual_task7",
+    "VESTA_Manual_task8",
+    "VESTA_Manual_task9"
+  ]
+}
--- a/evaluation_examples/test_custom.json
+++ b/evaluation_examples/test_custom.json
@@ -0,0 +1,93 @@
+{
+  "avogadro": [
+    "building-metal-complexes_task1",
+    "building-metal-complexes_task3",
+    "building-metal-complexes_task7",
+    "building-organic-molecules_task1",
+    "building-organic-molecules_task3",
+    "building-organic-molecules_task4",
+    "building-organic-molecules_task5",
+    "building-organic-molecules_task9",
+    "naming-a-molecule_task1",
+    "viewing-electrostatic-potential_task1"
+  ],
+  "imagej": [
+    "user-guide_task1",
+    "user-guide_task10",
+    "user-guide_task2",
+    "user-guide_task3",
+    "user-guide_task4",
+    "user-guide_task5",
+    "user-guide_task6",
+    "user-guide_task7",
+    "user-guide_task8",
+    "user-guide_task9"
+  ],
+  "jade": [
+    "MDIJade6.5使用手册_task1",
+    "MDIJade6.5使用手册_task10",
+    "MDIJade6.5使用手册_task2",
+    "MDIJade6.5使用手册_task3",
+    "MDIJade6.5使用手册_task4",
+    "MDIJade6.5使用手册_task5",
+    "MDIJade6.5使用手册_task6",
+    "MDIJade6.5使用手册_task7",
+    "MDIJade6.5使用手册_task8",
+    "jade_test"
+  ],
+  "origin": [
+    "Origin_User_Guide_2025b_E_task1",
+    "Origin_User_Guide_2025b_E_task11",
+    "Origin_User_Guide_2025b_E_task12",
+    "Origin_User_Guide_2025b_E_task2",
+    "Origin_User_Guide_2025b_E_task3",
+    "Origin_User_Guide_2025b_E_task4",
+    "Origin_User_Guide_2025b_E_task5",
+    "Origin_User_Guide_2025b_E_task8",
+    "Origin_User_Guide_2025b_E_task9"
+  ],
+  "ovito": [
+    "animation_task3",
+    "aspherical_particles_task1",
+    "clone_pipeline_task1",
+    "code_generation_task1",
+    "customize_init_state_task1",
+    "data_model_task1",
+    "export_task1",
+    "marker_particles_task2",
+    "miscellaneous_task1",
+    "python_extensions_task1",
+    "remote_file_access_task1",
+    "remote_rendering_task1",
+    "rendering_task1",
+    "transparent_particles_task1"
+  ],
+  "pymol": [
+    "MovieSchool_1_task1",
+    "MovieSchool_1_task2",
+    "MovieSchool_1_task3",
+    "MovieSchool_1_task4",
+    "MovieSchool_1_task5",
+    "MovieSchool_3_task1",
+    "MovieSchool_3_task10",
+    "MovieSchool_3_task2",
+    "MovieSchool_3_task3",
+    "MovieSchool_3_task4",
+    "MovieSchool_3_task5",
+    "Mutagenesis_task4",
+    "Practical_Pymol_for_Beginners_task6"
+  ],
+  "vesta": [
+    "VESTA_Manual_task1",
+    "VESTA_Manual_task10",
+    "VESTA_Manual_task11",
+    "VESTA_Manual_task2",
+    "VESTA_Manual_task3",
+    "VESTA_Manual_task4",
+    "VESTA_Manual_task5",
+    "VESTA_Manual_task6",
+    "VESTA_Manual_task7",
+    "VESTA_Manual_task8",
+    "VESTA_Manual_task9"
+  ]
+}
--- a/evaluation_examples/test_each_domain_a11y_tree.json
+++ b/evaluation_examples/test_each_domain_a11y_tree.json
@@ -0,0 +1,9 @@
+{
+    "avogadro": ["building-organic-molecules_task1"],
+    "imagej": ["user-guide_task1"],
+    "jade": ["MDIJade6.5使用手册_task1"],
+    "origin": ["Origin_User_Guide_2025b_E_task1"],
+    "ovito": ["animation_task3"],
+    "pymol": ["MovieSchool_1_task1"],
+    "vesta": ["VESTA_Manual_task1"]
+}
--- a/evaluation_examples/test_single.json
+++ b/evaluation_examples/test_single.json
@@ -0,0 +1 @@
+{"avogadro": ["building-organic-molecules_task1"]}
--- a/lib_results_logger.py
+++ b/lib_results_logger.py
@@ -7,10 +7,19 @@ Appends task completion results to results.json in real-time.
 import json
 import os
 import time
-import fcntl
+import platform
 from pathlib import Path
 from typing import Dict, Any, Optional

+# Import fcntl only on Unix-like systems (Linux, macOS)
+# On Windows, we'll use msvcrt for file locking
+if platform.system() != "Windows":
+    import fcntl
+    HAS_FCNTL = True
+else:
+    import msvcrt
+    HAS_FCNTL = False
+

 def extract_domain_from_path(result_path: str) -> str:
    """
@@ -66,8 +75,12 @@ def append_task_result(
    # Thread-safe JSON append with file locking
    try:
        with open(results_file, 'a+') as f:
-            # Lock the file for exclusive access
-            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
+            # Lock the file for exclusive access (platform-specific)
+            if HAS_FCNTL:
+                fcntl.flock(f.fileno(), fcntl.LOCK_EX)
+            else:
+                # Windows file locking using msvcrt
+                msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
            
            try:
                # Move to beginning to read existing content
@@ -95,8 +108,12 @@ def append_task_result(
                f.write('\n')  # Add newline for readability
                
            finally:
-                # Always unlock the file
-                fcntl.flock(f.fileno(), fcntl.LOCK_UN)
+                # Always unlock the file (platform-specific)
+                if HAS_FCNTL:
+                    fcntl.flock(f.fileno(), fcntl.LOCK_UN)
+                else:
+                    # Windows unlock using msvcrt
+                    msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
                
        print(f"📝 Logged result: {domain}/{task_id} -> {result_entry['status']} (score: {score})")
        
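Review note on the locking change above: `msvcrt.locking` locks a byte range that starts at the current file position, whereas `fcntl.flock` locks the whole file. Because the function reads and writes between the lock and unlock calls, the Windows unlock may target a different offset than the lock did. The sketch below is one possible way to keep both calls on the same range; the helper names `lock_file`, `unlock_file`, and `append_line_locked` are hypothetical and not part of this PR.

```python
import os
import platform
import tempfile

IS_WINDOWS = platform.system() == "Windows"
if IS_WINDOWS:
    import msvcrt
else:
    import fcntl


def lock_file(f):
    """Take an exclusive lock on an open file object (hypothetical helper)."""
    if IS_WINDOWS:
        # msvcrt.locking works from the current position, so pin it to offset 0.
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
    else:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)


def unlock_file(f):
    """Release the lock taken by lock_file (hypothetical helper)."""
    if IS_WINDOWS:
        # Return to the same offset that was locked before unlocking.
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
    else:
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)


def append_line_locked(path, line):
    """Append one line to path while holding the lock."""
    with open(path, "a+") as f:
        lock_file(f)
        try:
            # In append mode the write goes to the end regardless of seek(0).
            f.write(line + "\n")
            f.flush()  # make the data visible before the lock is released
        finally:
            unlock_file(f)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "results.txt")
    append_line_locked(demo_path, "first")
    append_line_locked(demo_path, "second")
    with open(demo_path) as fh:
        print(fh.read())
```

Wrapping the platform branch in two small helpers also keeps `append_task_result` free of repeated `if HAS_FCNTL` checks.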
--- a/lib_run_single.py
+++ b/lib_run_single.py
@@ -9,33 +9,48 @@ from lib_results_logger import log_task_completion
 logger = logging.getLogger("desktopenv.experiment")


-def run_single_example(agent, env, example, max_steps, instruction, args, example_result_dir, scores):
+def run_single_example(agent, env, example, max_steps, instruction, args, example_result_dir, scores, metadata_steps=""):
    runtime_logger = setup_logger(example, example_result_dir)

    # Reset environment first to get fresh VM IP
    env.reset(task_config=example)
+    logger.info("=======Environment reset completed=======")

-    # Reset agent with fresh VM IP (for snapshot reverts)
-    try:
-        agent.reset(runtime_logger, vm_ip=env.vm_ip)
-    except Exception as e:
-        agent.reset(vm_ip=env.vm_ip)
-    
-    time.sleep(60) # Wait for the environment to be ready
+    # # Reset agent with fresh VM IP (for snapshot reverts)
+    # try:
+    #     agent.reset(runtime_logger, vm_ip=env.vm_ip)
+    # except Exception as e:
+    #     agent.reset(vm_ip=env.vm_ip)
+
+    time.sleep(15) # Wait for the environment to be ready (apps like Avogadro need time to fully load)
+
+    # get initial observation
+    logger.info("Getting initial observation...")
    obs = env._get_obs() # Get the initial observation
+    logger.info("Initial observation obtained.")
    done = False
    step_idx = 0
-    env.controller.start_recording()
+    if getattr(args, 'enable_recording', False):
+        env.controller.start_recording()
    while not done and step_idx < max_steps:
+        logger.info(f"Step {step_idx + 1} prediction...")
        response, actions = agent.predict(
            instruction,
-            obs
+            obs,
+            metadata_steps=metadata_steps,
        )
+        logger.info(f"Response: {response}")
+        logger.info(f"Actions: {actions}")
+
+        logger.info("Executing actions...")
        for action in actions:
            # Capture the timestamp before executing the action
            action_timestamp = datetime.datetime.now().strftime("%Y%m%d@%H%M%S%f")
            logger.info("Step %d: %s", step_idx + 1, action)
+
+            logger.info("Executing action...")
             obs, reward, done, info = env.step(action, args.sleep_after_execution)
+            logger.info("Action execution completed.")

            logger.info("Reward: %.2f", reward)
            logger.info("Done: %s", done)
@@ -60,16 +75,16 @@ def run_single_example(agent, env, example, max_steps, instruction, args, exampl
                break
        step_idx += 1
    time.sleep(20) # Wait for the environment to settle
-    result = env.evaluate()
-    logger.info("Result: %.2f", result)
+    result = env.evaluate(result_dir=example_result_dir)
    scores.append(result)
    with open(os.path.join(example_result_dir, "result.txt"), "w", encoding="utf-8") as f:
        f.write(f"{result}\n")
-    
+
    # Log task completion to results.json
    log_task_completion(example, result, example_result_dir, args)
-    
-    env.controller.end_recording(os.path.join(example_result_dir, "recording.mp4"))
+
+    if getattr(args, 'enable_recording', False):
+        env.controller.end_recording(os.path.join(example_result_dir, "recording.mp4"))


 def setup_logger(example, example_result_dir):
@@ -83,11 +98,11 @@ def run_single_example_human(env, example, max_steps, instruction, args, example
    env.reset(task_config=example)
    time.sleep(60) # Wait for the environment to be ready
    obs = env._get_obs() # Get the initial observation
-    
+
    # Save initial screenshot
    with open(os.path.join(example_result_dir, "initial_state.png"), "wb") as _f:
        _f.write(obs['screenshot'])
-    
+
    # Save trajectory information
    with open(os.path.join(example_result_dir, "traj.jsonl"), "a") as f:
        f.write(json.dumps({
@@ -95,9 +110,9 @@ def run_single_example_human(env, example, max_steps, instruction, args, example
            "initial_state": "initial_state.png"
        }))
        f.write("\n")
-    
+
    # Evaluate the result
-    result = env.evaluate()
+    result = env.evaluate(result_dir=example_result_dir)
    logger.info("Result: %.2f", result)
    scores.append(result)
    with open(os.path.join(example_result_dir, "result.txt"), "w", encoding="utf-8") as f:
@@ -240,14 +255,14 @@ def run_single_example_opencua(agent, env, example, max_steps, instruction, args

        logger.info(f"Got Action: {actions}")
         # Break if no actions
-        if not actions or len(actions)==0 or actions[0]=="" or actions[0].lower().startswith("error"): 
+        if not actions or len(actions)==0 or actions[0]=="" or actions[0].lower().startswith("error"):
            break

        for action in actions:
            # Capture the timestamp before executing the action
            action_timestamp = datetime.datetime.now().strftime("%Y%m%d@%H%M%S")
            logger.info("Step %d: %s", step_idx + 1, action)
-            
+
            obs, reward, done, info = env.step(action, args.sleep_after_execution)

            logger.info(f"Action {action} executed, reward: {reward}, done: {done}")
@@ -290,7 +305,7 @@ def run_single_example_autoglm(agent, env, example, max_steps, instruction, args
        agent.reset()

    env.reset(task_config=example)
-    
+
    time.sleep(60) # Wait for the environment to be ready
    obs = env._get_obs() # Get the initial observation
    done = False
@@ -325,20 +340,20 @@ def run_single_example_autoglm(agent, env, example, max_steps, instruction, args
                    "screenshot_file": f"step_{step_idx + 1}_{action_timestamp}.png"
                }))
                f.write("\n")
-                
+
            if done:
                logger.info("The episode is done.")
                break
-        
+
        # Invalid Action
        if not actions:
            obs = env._get_obs() # update observation
-            
+
        step_idx += 1
-    
+
    if not done: # not completed the task yet
        env.action_history.append('FAIL')
-    
+
    result = env.evaluate()
    logger.info("Result: %.2f", result)
    scores.append(result)
@@ -355,7 +370,7 @@ def run_single_example_mano(agent, env, example, max_steps, instruction, args, e
    done = False
    step_idx = 0
    env.controller.start_recording()
-    
+
    with open(os.path.join(example_result_dir, f"step_0.png"),
      "wb") as _f:
        _f.write(obs['screenshot'])
@@ -365,12 +380,12 @@ def run_single_example_mano(agent, env, example, max_steps, instruction, args, e
            obs
        )
        if len(actions) > 1:
-            if (("pyautogui.hotkey('shift')" in actions[0] or "pyautogui.hotkey('ctrl')" in actions[0]) 
+            if (("pyautogui.hotkey('shift')" in actions[0] or "pyautogui.hotkey('ctrl')" in actions[0])
                and "pyautogui.click" in actions[1]):
                hotkey_type = 'shift' if "shift" in actions[0] else 'ctrl'
                action = f"pyautogui.keyDown('{hotkey_type}')\n{actions[1]}\npyautogui.keyUp('{hotkey_type}')"
-                actions = [action]  
-                
+                actions = [action]
+
        for action in actions:
            # Capture the timestamp before executing the action
            action_timestamp = datetime.datetime.now().strftime("%Y%m%d@%H%M%S")
@@ -405,7 +420,7 @@ def run_single_example_mano(agent, env, example, max_steps, instruction, args, e
    with open(os.path.join(example_result_dir, "result.txt"), "w", encoding="utf-8") as f:
        f.write(f"{result}\n")
    env.controller.end_recording(os.path.join(example_result_dir, "recording.mp4"))
-    
+
 def run_single_example_uipath(agent, env, example, max_steps, instruction, args, example_result_dir, scores):
    runtime_logger = setup_logger(example, example_result_dir)
    try:
@@ -471,7 +486,7 @@ logger = logging.getLogger("desktopenv.experiment")

 def run_single_example_os_symphony(agent, env, example, max_steps, instruction, args, example_result_dir, scores):
    set_current_result_dir(example_result_dir)
-    
+
    agent.reset(result_dir=example_result_dir)
    env.reset(task_config=example)
    time.sleep(30) # Wait for the environment to be ready
@@ -493,14 +508,14 @@ def run_single_example_os_symphony(agent, env, example, max_steps, instruction,
                img_name = f"step_{step_idx + 1}_milestone.png"
            else:
                img_name = f"step_{step_idx + 1}.png"
-                
+
            with open(os.path.join(example_result_dir, img_name),
                      "wb") as _f:
                _f.write(obs['screenshot'])
            if "coordinates" in response and response["coordinates"]:
                draw_coordinates(
-                    image_bytes=obs['screenshot'], 
-                    coordinates=response["coordinates"], 
+                    image_bytes=obs['screenshot'],
+                    coordinates=response["coordinates"],
                    save_path=os.path.join(example_result_dir, img_name[:-4] + "_draw.png")
                )

@@ -534,7 +549,7 @@ def run_single_example_os_symphony(agent, env, example, max_steps, instruction,
                break
        step_idx += 1
    end_time = time.time()
-    result = float(env.evaluate())
+    result = float(env.evaluate(result_dir=example_result_dir))
    logger.info("Result: %.2f", result)
    scores.append(result)
    with open(os.path.join(example_result_dir, "result.txt"), "w", encoding="utf-8") as f:
@@ -549,10 +564,10 @@ def run_single_example_evocua(agent, env, example, max_steps, instruction, args,
    Unified run function for EvoCUAAgent (supporting both S1 and S2 modes).
    """
    runtime_logger = setup_logger(example, example_result_dir)
-    
+
    # Reset Environment
    env.reset(task_config=example)
-    
+
    # Reset Agent
    # Handle agent reset signature differences if any
    try:
@@ -573,7 +588,7 @@ def run_single_example_evocua(agent, env, example, max_steps, instruction, args,
        # EvoCUAAgent.predict unified signature: returns (response, actions)
        # It handles both modes internally.
        predict_res = agent.predict(instruction, obs)
-        
+
        # Check return signature logic
        if len(predict_res) == 3:
            # Compatibility with S1 original signature if agent was updated to match
@@ -583,7 +598,7 @@ def run_single_example_evocua(agent, env, example, max_steps, instruction, args,
            info_dict = {}

        logger.info(f"Step {step_idx + 1} Actions: {actions}")
-        
+
        # Break if no actions (fail-safe)
        if not actions or (len(actions) == 1 and (actions[0] == "" or "error" in actions[0].lower())):
             # Allow "FAIL" or "DONE" to process through execution loop if agent outputs them as actions
@@ -594,18 +609,18 @@ def run_single_example_evocua(agent, env, example, max_steps, instruction, args,
        for action in actions:
            action_timestamp = datetime.datetime.now().strftime("%Y%m%d@%H%M%S%f")
            logger.info("Executing action: %s", action)
-            
+
            # Execute
            obs, reward, done, info = env.step(action, args.sleep_after_execution)
-            
+
            logger.info("Reward: %.2f", reward)
            logger.info("Done: %s", done)
-            
+
            # Save screenshot
            screenshot_file = f"step_{step_idx + 1}_{action_timestamp}.png"
            with open(os.path.join(example_result_dir, screenshot_file), "wb") as _f:
                _f.write(obs['screenshot'])
-            
+
            # Log Trajectory
            log_entry = {
                "step_num": step_idx + 1,
@@ -620,25 +635,25 @@ def run_single_example_evocua(agent, env, example, max_steps, instruction, args,
            # Add natural language info if available (S1 style)
            if info_dict:
                log_entry["natural_language_action"] = info_dict.get("action")
-            
+
            with open(os.path.join(example_result_dir, "traj.jsonl"), "a", encoding="utf-8") as f:
                f.write(json.dumps(log_entry, ensure_ascii=False))
                f.write("\n")
-                
+
            if done:
                logger.info("The episode is done.")
                break
-        
+
        step_idx += 1
-        
+
    time.sleep(20) # Wait for environment to settle
-    result = env.evaluate()
+    result = env.evaluate(result_dir=example_result_dir)
    logger.info("Result: %.2f", result)
    scores.append(result)
-    
+
    with open(os.path.join(example_result_dir, "result.txt"), "w", encoding="utf-8") as f:
        f.write(f"{result}\n")
-    
+
    log_task_completion(example, result, example_result_dir, args)

    env.controller.end_recording(os.path.join(example_result_dir, "recording.mp4"))
--- a/mm_agents/accessibility_tree_wrap/heuristic_retrieve.py
+++ b/mm_agents/accessibility_tree_wrap/heuristic_retrieve.py
@@ -46,6 +46,15 @@ def judge_node(node: ET, platform="ubuntu", check_image=False) -> bool:
        raise ValueError("Invalid platform, must be 'ubuntu' or 'windows'")

    keeps: bool = node.tag.startswith("document") \
+                  or node.tag.startswith("sunawt") \
+                  or node.tag.startswith("qt5q") \
+                  or node.tag.startswith("qt6q") \
+                  or node.tag.startswith("ovito") \
+                  or node.tag.startswith("pymol") \
+                  or node.tag.startswith("contentspanel") \
+                  or node.tag.startswith("wx") \
+                  or node.tag.startswith("afx") \
+                  or node.tag.startswith("thunderrt") \
                  or node.tag.endswith("item") \
                  or node.tag.endswith("button") \
                  or node.tag.endswith("heading") \
@@ -58,6 +67,18 @@ def judge_node(node: ET, platform="ubuntu", check_image=False) -> bool:
                  or node.tag.endswith("textfield") \
                  or node.tag.endswith("textarea") \
                  or node.tag.endswith("menu") \
+                  or node.tag.endswith("menuitem") \
+                  or node.tag.endswith("menubar") \
+                  or node.tag.endswith("toolbar") \
+                  or node.tag.endswith("tabitem") \
+                  or node.tag.endswith("treeitem") \
+                  or node.tag.endswith("window") \
+                  or node.tag.endswith("edit") \
+                  or node.tag.endswith("widget") \
+                  or node.tag.endswith("box") \
+                  or node.tag.endswith("dialog") \
+                  or node.tag.endswith("view") \
+                  or node.tag.endswith("frame") \
                  or node.tag in {"alert", "canvas", "check-box"
                      , "combo-box", "entry", "icon"
                      , "image", "paragraph", "scroll-bar"
@@ -66,6 +87,16 @@ def judge_node(node: ET, platform="ubuntu", check_image=False) -> bool:
                      , "netuiribbontab", "start", "trayclockwclass"
                      , "traydummysearchcontrol", "uiimage", "uiproperty"
                      , "uiribboncommandbar"
+                      , "qt5qwindowicon", "textblock", "listview"
+                      , "chrome_widgetwin_1", "chrome_renderwidgethosthwnd"
+                      , "unknown", "pane", "tree", "tab"
+                      , "datagrid", "dataitem", "group"
+                      , "statusbar", "titlebar", "tooltip"
+                      , "toolbarwindow32", "richedit50w"
+                      , "msctls_statusbar32", "qaction"
+                      , "qsplitter", "qsplitterhandle"
+                      , "qtoolbarseparator", "qtextbrowser"
+                      , "qtabbar", "qopenglwidget"
                                  }
    keeps = keeps and (
            platform == "ubuntu"
@@ -83,6 +114,12 @@ def judge_node(node: ET, platform="ubuntu", check_image=False) -> bool:
            and (
                    node.get("name", "") != "" or node.text is not None and len(node.text) > 0 \
                    or check_image and node.get("image", "false") == "true"
+                    # Keep empty input fields (edit/textfield) - they are important interactive elements
+                    # even without name/text (e.g., search boxes, filter inputs)
+                    or node.tag.endswith("edit") or node.tag.endswith("textfield")
+                    or node.tag.endswith("textarea") or node.tag.endswith("textbox")
+                    or node.tag.endswith("searchbox") or node.tag.endswith("combobox")
+                    or node.tag in {"entry", "combo-box", "check-box", "slider"}
            )

    coordinates: Tuple[int, int] = eval(node.get("{{{:}}}screencoord".format(_component_ns), "(-1, -1)"))
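Review note on the `judge_node` changes above: the long `or` chains can be expressed more compactly, since `str.startswith` and `str.endswith` accept a tuple of candidates. The sketch below illustrates the idea with an abbreviated subset of the tags from this diff; the constant and function names are hypothetical, and the lists are deliberately incomplete.

```python
import xml.etree.ElementTree as ET

# Abbreviated subsets of the tags in the diff, for illustration only.
KEEP_PREFIXES = ("document", "sunawt", "qt5q", "qt6q", "ovito", "pymol")
KEEP_SUFFIXES = ("item", "button", "edit", "textfield", "widget", "dialog")
KEEP_EXACT = {"alert", "canvas", "entry", "pane", "tree", "tab"}

# Input-like tags that are kept even without a name or text.
NAMELESS_OK_SUFFIXES = ("edit", "textfield", "textarea", "textbox",
                        "searchbox", "combobox")
NAMELESS_OK_EXACT = {"entry", "combo-box", "check-box", "slider"}


def tag_is_kept(tag: str) -> bool:
    """Return True if the tag matches any prefix, suffix, or exact entry."""
    return (tag.startswith(KEEP_PREFIXES)
            or tag.endswith(KEEP_SUFFIXES)
            or tag in KEEP_EXACT)


def has_content_or_is_input(node: ET.Element) -> bool:
    """Return True for named or text-bearing nodes, and for empty input controls."""
    has_name = node.get("name", "") != ""
    has_text = node.text is not None and len(node.text) > 0
    is_input = (node.tag.endswith(NAMELESS_OK_SUFFIXES)
                or node.tag in NAMELESS_OK_EXACT)
    return has_name or has_text or is_input


if __name__ == "__main__":
    empty_search = ET.Element("searchbox")    # no name, no text
    unnamed_button = ET.Element("push-button")
    print(has_content_or_is_input(empty_search))
    print(has_content_or_is_input(unnamed_button))
```

Keeping the tag lists in module-level constants would also make future additions for new applications a one-line change.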
--- a/mm_agents/agent.py
+++ b/mm_agents/agent.py
@@ -49,6 +49,48 @@ def encode_image(image_content):
    return base64.b64encode(image_content).decode('utf-8')


+def compress_screenshot(image_bytes, quality=75, resize_ratio=None):
+    """
+    Compress a screenshot to JPEG to reduce its size; resolution is kept unless resize_ratio is given.
+
+    Args:
+        image_bytes: Raw image bytes (PNG format)
+        quality: JPEG quality (1-100, default 75)
+        resize_ratio: Optional resize ratio (e.g., 0.5 for 50% size). None = keep original size.
+
+    Returns:
+        Compressed image bytes in JPEG format
+    """
+    try:
+        # Open image from bytes
+        img = Image.open(BytesIO(image_bytes))
+
+        # Optionally resize if ratio is provided
+        if resize_ratio and resize_ratio != 1.0:
+            new_size = (int(img.size[0] * resize_ratio), int(img.size[1] * resize_ratio))
+            img = img.resize(new_size, Image.Resampling.LANCZOS)
+
+        # Convert to RGB if necessary (JPEG doesn't support alpha channel)
+        if img.mode in ('RGBA', 'LA', 'P'):
+            background = Image.new('RGB', img.size, (255, 255, 255))
+            if img.mode == 'P':
+                img = img.convert('RGBA')
+            background.paste(img, mask=img.split()[-1] if img.mode in ('RGBA', 'LA') else None)
+            img = background
+
+        # Save as JPEG with compression
+        output = BytesIO()
+        img.save(output, format='JPEG', quality=quality, optimize=True)
+        compressed_size = len(output.getvalue())
+
+        logger.debug(f"Screenshot compressed: original={len(image_bytes)/1024:.1f}KB, compressed={compressed_size/1024:.1f}KB, ratio={compressed_size/len(image_bytes):.2%}")
+
+        return output.getvalue()
+    except Exception as e:
+        logger.warning(f"Failed to compress screenshot: {e}, using original")
+        return image_bytes
+
+
 def encoded_img_to_pil_img(data_str):
    base64_str = data_str.replace("data:image/png;base64,", "")
    image_data = base64.b64decode(base64_str)
@@ -84,7 +126,7 @@ def linearize_accessibility_tree(accessibility_tree, platform="ubuntu"):
        raise ValueError("Invalid platform, must be 'ubuntu' or 'windows'")

    filtered_nodes = filter_nodes(ET.fromstring(accessibility_tree), platform)
-    linearized_accessibility_tree = ["tag\tname\ttext\tclass\tdescription\tposition (top-left x&y)\tsize (w&h)"]
+    linearized_accessibility_tree = ["tag\tname\ttext\tposition (center x&y)\tsize (w&h)\tstates"]

    # Linearize the accessibility tree nodes into a table format
    for node in filtered_nodes:
@@ -103,14 +145,36 @@ def linearize_accessibility_tree(accessibility_tree, platform="ubuntu"):
        else:
            text = '""'

+        # Compute center coordinates from top-left + size/2
+        coords_str = node.get('{{{:}}}screencoord'.format(_component_ns), "")
+        size_str = node.get('{{{:}}}size'.format(_component_ns), "")
+        if coords_str and size_str:
+            try:
+                cx, cy = coords_str.strip('()').split(', ')
+                sw, sh = size_str.strip('()').split(', ')
+                center_x = int(cx) + int(sw) // 2
+                center_y = int(cy) + int(sh) // 2
+                center_str = "({:d}, {:d})".format(center_x, center_y)
+            except (ValueError, IndexError):
+                center_str = coords_str
+        else:
+            center_str = coords_str
+
+        # Extract useful UI states (expanded/collapsed/checked/selected/focused/pressed)
+        state_flags = []
+        for state_name in ["expanded", "collapsed", "checked", "selected", "focused", "pressed"]:
+            val = node.get("{{{:}}}{:}".format(_state_ns, state_name), "")
+            if val == "true":
+                state_flags.append(state_name)
+        states_str = ",".join(state_flags) if state_flags else ""
+
        linearized_accessibility_tree.append(
-            "{:}\t{:}\t{:}\t{:}\t{:}\t{:}\t{:}".format(
+            "{:}\t{:}\t{:}\t{:}\t{:}\t{:}".format(
                node.tag, node.get("name", ""),
                text,
-                node.get("{{{:}}}class".format(_attributes_ns), "") if platform == "ubuntu" else node.get("{{{:}}}class".format(class_ns_windows), ""),
-                node.get("{{{:}}}description".format(_attributes_ns), ""),
-                node.get('{{{:}}}screencoord'.format(_component_ns), ""),
-                node.get('{{{:}}}size'.format(_component_ns), "")
+                center_str,
+                size_str,
+                states_str
            )
        )

@@ -236,7 +300,9 @@ class PromptAgent:
            # observation_type can be in ["screenshot", "a11y_tree", "screenshot_a11y_tree", "som"]
            max_trajectory_length=3,
            a11y_tree_max_tokens=10000,
-            client_password="password"
+            client_password="password",
+            screen_width=1920,
+            screen_height=1080
    ):
        self.platform = platform
        self.model = model
@@ -248,6 +314,8 @@ class PromptAgent:
        self.max_trajectory_length = max_trajectory_length
        self.a11y_tree_max_tokens = a11y_tree_max_tokens
        self.client_password = client_password
+        self.screen_width = screen_width
+        self.screen_height = screen_height

        self.thoughts = []
        self.actions = []
@@ -283,14 +351,16 @@ class PromptAgent:
                raise ValueError("Invalid action space: " + action_space)
        else:
            raise ValueError("Invalid experiment type: " + observation_type)
-        
-        self.system_message = self.system_message.format(CLIENT_PASSWORD=self.client_password)

-    def predict(self, instruction: str, obs: Dict) -> List:
+        self.system_message = self.system_message.format(CLIENT_PASSWORD=self.client_password, SCREEN_WIDTH=self.screen_width, SCREEN_HEIGHT=self.screen_height)
+
+    def predict(self, instruction: str, obs: Dict, metadata_steps: str = "") -> List:
        """
        Predict the next action(s) based on the current observation.
        """
        system_message = self.system_message + "\nYou are asked to complete the following task: {}".format(instruction)
+        if metadata_steps:
+            system_message += "\n\nHere are the reference steps from the software tutorial, which may help you complete the task:\n{}".format(metadata_steps)

        # Prepare the payload for the API call
        messages = []
@@ -342,8 +412,8 @@ class PromptAgent:
                        {
                            "type": "image_url",
                            "image_url": {
-                                "url": f"data:image/png;base64,{_screenshot}",
-                                "detail": "high"
+                                "url": f"data:image/jpeg;base64,{_screenshot}",
+                                "detail": "auto"
                            }
                        }
                    ]
@@ -361,8 +431,8 @@ class PromptAgent:
                        {
                            "type": "image_url",
                            "image_url": {
-                                "url": f"data:image/png;base64,{_screenshot}",
-                                "detail": "high"
+                                "url": f"data:image/jpeg;base64,{_screenshot}",
+                                "detail": "auto"
                            }
                        }
                    ]
@@ -380,8 +450,8 @@ class PromptAgent:
                        {
                            "type": "image_url",
                            "image_url": {
-                                "url": f"data:image/png;base64,{_screenshot}",
-                                "detail": "high"
+                                "url": f"data:image/jpeg;base64,{_screenshot}",
+                                "detail": "auto"
                            }
                        }
                    ]
@@ -414,7 +484,9 @@ class PromptAgent:

        # {{{1
        if self.observation_type in ["screenshot", "screenshot_a11y_tree"]:
-            base64_image = encode_image(obs["screenshot"])
+            # Compress screenshot to JPEG (keep original resolution for accurate coordinates)
+            compressed_screenshot = compress_screenshot(obs["screenshot"], quality=75)
+            base64_image = encode_image(compressed_screenshot)
            linearized_accessibility_tree = linearize_accessibility_tree(accessibility_tree=obs["accessibility_tree"],
                                                                         platform=self.platform) if self.observation_type == "screenshot_a11y_tree" else None
            logger.debug("LINEAR AT: %s", linearized_accessibility_tree)
@@ -447,16 +519,37 @@ class PromptAgent:
                    {
                        "type": "image_url",
                        "image_url": {
-                            "url": f"data:image/png;base64,{base64_image}",
-                            "detail": "high"
+                            "url": f"data:image/jpeg;base64,{base64_image}",
+                            "detail": "auto"
                        }
                    }
                ]
            })
        elif self.observation_type == "a11y_tree":
+            # Debug: log raw a11y tree XML to help diagnose missing elements
+            raw_tree = obs["accessibility_tree"]
+            if raw_tree:
+                # Log first 2000 chars of raw XML and count total nodes
+                root = ET.fromstring(raw_tree)
+                all_tags = set()
+                total_nodes = 0
+                for node in root.iter():
+                    all_tags.add(node.tag)
+                    total_nodes += 1
+                logger.info("Raw a11y tree: %d total nodes, unique tags: %s", total_nodes, all_tags)
+                logger.debug("Raw a11y tree XML (first 2000 chars): %s", raw_tree[:2000])
+                # Also log nodes whose serialized subtree mentions 'avogadro' or 'qt5'
+                for node in root.iter():
+                    node_str = ET.tostring(node, encoding="unicode")
+                    if 'avogadro' in node_str.lower() or 'qt5' in node_str.lower():
+                        logger.info("Avogadro/Qt5 node: tag=%s, name=%s, visible=%s, enabled=%s",
+                                    node.tag, node.get("name", ""),
+                                    node.get("{https://accessibility.windows.example.org/ns/state}visible", "?"),
+                                    node.get("{https://accessibility.windows.example.org/ns/state}enabled", "?"))
            linearized_accessibility_tree = linearize_accessibility_tree(accessibility_tree=obs["accessibility_tree"],
                                                                         platform=self.platform)
            logger.debug("LINEAR AT: %s", linearized_accessibility_tree)
+            logger.info("Linearized a11y tree lines: %d", len(linearized_accessibility_tree.split('\n')) if linearized_accessibility_tree else 0)

            if linearized_accessibility_tree:
                linearized_accessibility_tree = trim_accessibility_tree(linearized_accessibility_tree,
@@ -481,7 +574,9 @@ class PromptAgent:
            # Add som to the screenshot
            masks, drew_nodes, tagged_screenshot, linearized_accessibility_tree = tag_screenshot(obs["screenshot"], obs[
                "accessibility_tree"], self.platform)
-            base64_image = encode_image(tagged_screenshot)
+            # Compress tagged screenshot (keep original resolution)
+            compressed_screenshot = compress_screenshot(tagged_screenshot, quality=75)
+            base64_image = encode_image(compressed_screenshot)
            logger.debug("LINEAR AT: %s", linearized_accessibility_tree)

            if linearized_accessibility_tree:
@@ -504,8 +599,8 @@ class PromptAgent:
                    {
                        "type": "image_url",
                        "image_url": {
-                            "url": f"data:image/png;base64,{base64_image}",
-                            "detail": "high"
+                            "url": f"data:image/jpeg;base64,{base64_image}",
+                            "detail": "auto"
                        }
                    }
                ]
@@ -523,7 +618,7 @@ class PromptAgent:
                "model": self.model,
                "messages": messages,
                "max_tokens": self.max_tokens,
-                "top_p": self.top_p,
+                # "top_p": self.top_p,
                "temperature": self.temperature
            })
        except Exception as e:
@@ -620,7 +715,7 @@ class PromptAgent:
                return response.json()['choices'][0]['message']['content']
        elif self.model.startswith("gpt"):
            # Support custom OpenAI base URL via environment variable
-            base_url = os.environ.get('OPENAI_BASE_URL', 'https://api.openai.com')
+            base_url = os.environ.get('OPENAI_BASE_URL', os.environ.get('OPENAI_API_BASE', 'https://api.openai.com'))
            # Smart handling: avoid duplicate /v1 if base_url already ends with /v1
            api_url = f"{base_url}/chat/completions" if base_url.endswith('/v1') else f"{base_url}/v1/chat/completions"
            headers = {
@@ -691,8 +786,8 @@ class PromptAgent:
            logger.debug("CLAUDE MESSAGE: %s", repr(claude_messages))

            headers = {
-                "x-api-key": os.environ["ANTHROPIC_API_KEY"],
-                "anthropic-version": "2023-06-01",
+                "x-api-key": os.environ["OPENAI_API_KEY"],
+                # "anthropic-version": "2023-06-01",
                "content-type": "application/json"
            }

@@ -705,7 +800,7 @@ class PromptAgent:
            }

            response = requests.post(
-                "https://api.anthropic.com/v1/messages",
+                "https://api.apiyi.com/v1/messages",
                headers=headers,
                json=payload
            )
@@ -1103,7 +1198,7 @@ class PromptAgent:
            except Exception as e:
                print("Failed to call LLM: " + str(e))
                return ""
-        
+
        else:
            raise ValueError("Invalid model: " + self.model)

@@ -1142,4 +1237,4 @@ class PromptAgent:

        self.thoughts = []
        self.actions = []
-        self.observations = []
+        self.observations = []
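
Note on the `linearize_accessibility_tree` hunks above: the new table row replaces the top-left position with a center point and appends a `states` column. The sketch below illustrates that computation in isolation. The namespace URIs are placeholders (the real `_component_ns` and `_state_ns` values are not shown in this diff), the helper name `center_and_states` is hypothetical, and splitting on `","` rather than `", "` is a small variation that tolerates a missing space.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs for illustration only; the real values live in
# _component_ns and _state_ns, which this diff does not show.
COMPONENT_NS = "https://example.org/ns/component"
STATE_NS = "https://example.org/ns/state"
STATE_NAMES = ["expanded", "collapsed", "checked", "selected", "focused", "pressed"]


def center_and_states(node: ET.Element) -> tuple:
    """Return (center string, comma-joined true states) for one a11y node."""
    coords = node.get("{%s}screencoord" % COMPONENT_NS, "")
    size = node.get("{%s}size" % COMPONENT_NS, "")
    center = coords  # fall back to the raw top-left string
    if coords and size:
        try:
            x, y = (int(v) for v in coords.strip("()").split(","))
            w, h = (int(v) for v in size.strip("()").split(","))
            center = "({:d}, {:d})".format(x + w // 2, y + h // 2)
        except ValueError:
            pass  # malformed numbers or wrong arity: keep the fallback
    states = [s for s in STATE_NAMES
              if node.get("{%s}%s" % (STATE_NS, s), "") == "true"]
    return center, ",".join(states)


node = ET.Element("tree-item", {
    "name": "aromatics",
    "{%s}screencoord" % COMPONENT_NS: "(100, 200)",
    "{%s}size" % COMPONENT_NS: "(80, 21)",
    "{%s}expanded" % STATE_NS: "true",
    "{%s}selected" % STATE_NS: "false",
})
print(center_and_states(node))  # ('(140, 210)', 'expanded')
```

Reporting the center means a click at the listed coordinate lands inside the element, and the `expanded` flag lets the agent tell whether a tree node still needs to be opened.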
--- a/mm_agents/evocua/evocua_agent.py
+++ b/mm_agents/evocua/evocua_agent.py
@@ -317,7 +317,26 @@ Previous actions:
                    args = tool_call["arguments"]
                    action = args["action"]

-                    if action == "left_click":
+                    def _clean_keys(raw_keys):
+                        keys = raw_keys if isinstance(raw_keys, list) else [raw_keys]
+                        cleaned_keys = []
+                        for key in keys:
+                            if isinstance(key, str):
+                                if key.startswith("keys=["):
+                                    key = key[6:]
+                                if key.endswith("]"):
+                                    key = key[:-1]
+                                if key.startswith("['") or key.startswith('["'):
+                                    key = key[2:] if len(key) > 2 else key
+                                if key.endswith("']") or key.endswith('"]'):
+                                    key = key[:-2] if len(key) > 2 else key
+                                key = key.strip()
+                                cleaned_keys.append(key)
+                            else:
+                                cleaned_keys.append(key)
+                        return cleaned_keys
+
+                    if action == "left_click" or action == "click":
                        if "coordinate" in args:
                            x, y = args["coordinate"]
                            adj_x, adj_y = adjust_coordinates(x, y)
@@ -355,6 +374,16 @@ Previous actions:
                        else:
                            pyautogui_code.append("pyautogui.doubleClick()")

+                    elif action == "triple_click":
+                        if "coordinate" in args:
+                            x, y = args["coordinate"]
+                            adj_x, adj_y = adjust_coordinates(x, y)
+                            pyautogui_code.append(
+                                f"pyautogui.tripleClick({adj_x}, {adj_y})"
+                            )
+                        else:
+                            pyautogui_code.append("pyautogui.tripleClick()")
+
                    elif action == "type":
                        text = args.get("text", "")
                        
@@ -383,24 +412,7 @@ Previous actions:
                    

                    elif action == "key":
-                        keys = args.get("keys", [])
-                        if isinstance(keys, list):
-                            cleaned_keys = []
-                            for key in keys:
-                                if isinstance(key, str):
-                                    if key.startswith("keys=["):
-                                        key = key[6:]
-                                    if key.endswith("]"):
-                                        key = key[:-1]
-                                    if key.startswith("['") or key.startswith('["'):
-                                        key = key[2:] if len(key) > 2 else key
-                                    if key.endswith("']") or key.endswith('"]'):
-                                        key = key[:-2] if len(key) > 2 else key
-                                    key = key.strip()
-                                    cleaned_keys.append(key)
-                                else:
-                                    cleaned_keys.append(key)
-                            keys = cleaned_keys
+                        keys = _clean_keys(args.get("keys", []))

                        keys_str = ", ".join([f"'{key}'" for key in keys])
                        if len(keys) > 1:
@@ -408,6 +420,16 @@ Previous actions:
                        else:
                            pyautogui_code.append(f"pyautogui.press({keys_str})")

+                    elif action == "key_down":
+                        keys = _clean_keys(args.get("keys", []))
+                        for k in keys:
+                            pyautogui_code.append(f"pyautogui.keyDown('{k}')")
+
+                    elif action == "key_up":
+                        keys = _clean_keys(args.get("keys", []))
+                        for k in reversed(keys):
+                            pyautogui_code.append(f"pyautogui.keyUp('{k}')")
+
                    elif action == "scroll":
                        pixels = args.get("pixels", 0)
                        pyautogui_code.append(f"pyautogui.scroll({pixels})")
@@ -416,7 +438,15 @@ Previous actions:
                        pyautogui_code.append("WAIT")

                    elif action == "terminate":
-                        pyautogui_code.append("DONE")
+                        # Termination should respect status:
+                        # - success -> DONE
+                        # - failure -> FAIL
+                        # Backward compatible: missing status defaults to success.
+                        status = args.get("status", "success")
+                        if str(status).lower() == "failure":
+                            pyautogui_code.append("FAIL")
+                        else:
+                            pyautogui_code.append("DONE")

                    elif action == "mouse_move":
                        if "coordinate" in args:
@@ -481,7 +511,11 @@ Previous actions:
            process_tool_call("\n".join(current_tool_call))

        if not low_level_instruction and len(pyautogui_code) > 0:
-            action_type = pyautogui_code[0].split(".", 1)[1].split("(", 1)[0]
+            first_action = pyautogui_code[0]
+            if "." in first_action:
+                action_type = first_action.split(".", 1)[1].split("(", 1)[0]
+            else:
+                action_type = first_action.lower()
            low_level_instruction = f"Performing {action_type} action"

        return low_level_instruction, pyautogui_code
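
Note on the `evocua_agent.py` hunks above: the new `key_down`, `key_up`, and `terminate` branches each turn a tool call into pyautogui command strings or a control token. The sketch below shows that mapping on its own, assuming the keys have already been cleaned; the function name `translate` is hypothetical, and it only builds strings, so pyautogui itself is not imported.

```python
def translate(action: str, args: dict) -> list:
    """Map one tool-call action to pyautogui command strings or a control token."""
    if action == "key_down":
        # Hold keys in the given order, without releasing them.
        return [f"pyautogui.keyDown('{k}')" for k in args.get("keys", [])]
    if action == "key_up":
        # Release in reverse order, mirroring how modifiers are normally lifted.
        return [f"pyautogui.keyUp('{k}')" for k in reversed(args.get("keys", []))]
    if action == "terminate":
        # A missing status defaults to success, as in the diff.
        status = str(args.get("status", "success")).lower()
        return ["FAIL" if status == "failure" else "DONE"]
    raise ValueError("Unsupported action in this sketch: " + action)


print(translate("key_down", {"keys": ["ctrl", "shift"]}))
print(translate("key_up", {"keys": ["ctrl", "shift"]}))
print(translate("terminate", {"status": "failure"}))
```

Because `DONE` and `FAIL` contain no dot, the later `action_type` extraction needs the `"." in first_action` guard that the last hunk adds.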
--- a/mm_agents/evocua/prompts.py
+++ b/mm_agents/evocua/prompts.py
@@ -60,6 +60,8 @@ S1_ACTION_HISTORY_TEMPLATE = "## Action:\n{action}\n"
 # S2 Prompts
 S2_ACTION_DESCRIPTION = """
 * `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.
+* `key_down`: Press and HOLD the specified key(s) down in order (no release). Use this for stateful holds like holding Shift while clicking.
+* `key_up`: Release the specified key(s) in reverse order.
 * `type`: Type a string of text on the keyboard.
 * `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.
 * `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.
@@ -67,7 +69,7 @@ S2_ACTION_DESCRIPTION = """
 * `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.
 * `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.
 * `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.
-* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).
+* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen.
 * `scroll`: Performs a scroll of the mouse scroll wheel.
 * `hscroll`: Performs a horizontal scroll (mapped to regular scroll).
 * `wait`: Wait specified seconds for the change to happen.
@@ -76,7 +78,7 @@ S2_ACTION_DESCRIPTION = """
 """

 S2_DESCRIPTION_PROMPT_TEMPLATE = """Use a mouse and keyboard to interact with a computer, and take screenshots.
-* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.
+* This is an interface to a desktop GUI. You must click on desktop icons to start applications.
 * Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.
 {resolution_info}
 * Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.
@@ -122,7 +124,8 @@ def build_s2_tools_def(description_prompt):
                    "action": {
                        "description": S2_ACTION_DESCRIPTION,
                        "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", 
-                                 "right_click", "middle_click", "double_click", "scroll", "wait", "terminate"], 
+                                 "right_click", "middle_click", "double_click", "triple_click", "scroll", 
+                                 "wait", "terminate", "key_down", "key_up"], 
                        "type": "string"
                    },
                    "keys": {"description": "Required only by `action=key`.", "type": "array"}, 
--- a/mm_agents/prompts.py
+++ b/mm_agents/prompts.py
--- a/mm_agents/uipath/README.md
+++ b/mm_agents/uipath/README.md
@@ -1,5 +1,13 @@
 # UiPath Screen Agent

+### 23 Dec 2025
+- Updated the planner model to [Claude 4.5 Opus](https://www.anthropic.com/news/claude-opus-4-5)
+- Updated the grounder model to an internally finetuned version of [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) and allowed it to predict "refusal" (similar to OSWorld-G) for elements that do not exist
+- Added memory for storing relevant information across steps
+- Improved utilization of the UI element detector for fine-grained details (such as cell corners)
+- Refactoring and various small fixes
+
+### 18 Sep 2025
 We propose a simple, yet effective implementation of a Computer Use Agent, which achieves a performance of **53.6%** on the **OSWorld** benchmark with 50 steps, demonstrating competitive results with a relatively lightweight setup and UI only actions. 

 Our system builds upon recent approaches in agentic computer use and follows the literature in adopting a two-stage architecture that separates high-level reasoning from low-level execution. Specifically, the system is composed of:
@@ -32,7 +40,7 @@ The interaction history is structured as a conversation: the user reports the ta
 By combining the current state with this structured history, the Action Planner generates context-aware, informed predictions at every step, being able to reconstruct the sequence of actions that led him to this point, noticing eventual failures, and plan the subsequent steps.

 We support a concise set of actions for interacting with the environment, focusing specifically on UI-related activities:
- Click (left, right, double click)
+- Click (left, right, double, triple click)
 - Type
 - Scroll
 - Drag
@@ -68,4 +76,3 @@ This process gives the model multiple opportunities to predict within a relevant

 ## Conclusion
 Our method offers a clean and simple yet competitive pipeline for Computer Use tasks. It is cost effective, minimizing token usage during planning, avoiding parallel planning and reliance on numerous past images, and incorporate only **direct UI actions** with refined grounding actions to improve accuracy. With this approach, we achieve **53.6%** accuracy on OSWorld with a 50-step horizon.
-
--- a/mm_agents/uipath/action_planner.py
+++ b/mm_agents/uipath/action_planner.py
@@ -1,7 +1,9 @@
 import datetime
 import json
-from collections import OrderedDict
 import time
+from collections import OrderedDict
+from copy import deepcopy
+
 import mm_agents.uipath.llm_client as llm_client
 from mm_agents.uipath.types_utils import (
    PlanAction,
@@ -11,43 +13,54 @@ from mm_agents.uipath.types_utils import (
 )
 from mm_agents.uipath.action_planner_prompt_builder import (
    ComputerUseAgentInterface,
-    PlanerCoTSections,
-    user_command_template,
+    PlanerCoTSectionsType,
+    user_command_template_chat,
    user_task_info_template,
-    PlannerOutput,
 )
-from mm_agents.uipath.utils import ValidationException, parse_message_json
+from mm_agents.uipath.utils import ValidationException, parse_message_json, ExecutionInfo
+from mm_agents.uipath.memory import ShortTermMemoryManager
+
+    
+class PlannerOutput(object):
+    def __init__(self, plan_action: PlanAction, additional_sections: dict[str, str]):
+        self.plan_action = plan_action
+        self.thought = additional_sections["thought"]
+        self.review = additional_sections["review"]
+        self.additional_sections = {key: value for key, value in additional_sections.items() if key not in ["review", "thought"]}


 class ActionPlanner(object):
    def __init__(self):
        self.number_history_steps_with_images = 2
        self.computer_use_agent_interface = ComputerUseAgentInterface()
+        self.short_term_memory_manager = ShortTermMemoryManager()

    def build_message_output_format_info(self) -> str:
        output_dict = OrderedDict({})
-        for _, value in PlanerCoTSections.items():
+        cot_sections: dict[str, dict] = self.computer_use_agent_interface.get_planner_cot_sections()
+        for _, value in cot_sections.items():
            display = value["display"]
            description = value["description"]
            output_dict[display] = description

-        output_dict["action"] = (
-            "<The action to perform in JSON format as specified in the system message>"
-        )
+        output_dict["action"] = "<The action to perform in JSON format as specified in the system message>"

        return json.dumps(output_dict, indent=4, ensure_ascii=False)

-    def get_step_content(
-        self, step: dict, following_step: dict | None
-    ) -> tuple[str, str]:
+    def get_step_content(self, step: dict, following_step: dict | None) -> tuple[str, str]:
        content_dict = OrderedDict({})
        observation_dict = OrderedDict({})

-        observation_dict["Performed actions"] = step["actions"]
+        observation_dict["Performed actions"] = deepcopy(step["actions"])

-        if (
-            "extracted_data" in step["additional_parameters"]
-        ):  # if the step was an extraction step add the dummy extraction action
+        def remove_unused_fields(action: list[dict], keys: list[str]):
+            for act in action:
+                for key in keys:
+                    if key in act:
+                        del act[key]
+        remove_unused_fields(observation_dict["Performed actions"], ["id", "result", "execution_error_message", "detected_items", "description"])
+
+        if "extracted_data" in step["additional_parameters"]:  # if the step was an extraction step add the dummy extraction action
            extraction_action = {
                "type": PlanActionType.ExtractData,
                "description": step["description"],
@@ -56,24 +69,22 @@ class ActionPlanner(object):
            observation_dict["Performed actions"] = [extraction_action]

        if following_step:
-            observation_dict["Observation"] = following_step[
-                "additional_parameters"
-            ].get("review", None)
+            observation_dict["Observation"] = following_step["additional_parameters"].get("review", None)

-        for key, value in PlanerCoTSections.items():
-            if key != "review":
+        cot_sections = self.computer_use_agent_interface.get_planner_cot_sections()
+        for key, value in cot_sections.items():
+            if key not in [PlanerCoTSectionsType.Review, PlanerCoTSectionsType.Memory]:
                param_value = step["additional_parameters"].get(key, None)
                display_name = value["display"]
                content_dict[display_name] = param_value
-        content_dict["actions"] = json.loads(
-            step["additional_parameters"]["plan_action"]
-        )
+        content_dict["action"] = json.loads(step["additional_parameters"]["plan_action"])

        content_dict = json.dumps(content_dict, indent=4, ensure_ascii=False)
        observation_dict = json.dumps(observation_dict, indent=4, ensure_ascii=False)
        return content_dict, observation_dict

-    def build_messages_chat(self, state: State, execution_info: dict) -> list[dict]:
+    def build_messages_chat(self, state: State, execution_state: ExecutionState) -> list[dict]:
+        execution_info = execution_state.execution_info
        messages = []
        system_message = {
            "role": "system",
@@ -82,42 +93,45 @@ class ActionPlanner(object):

        messages.append(system_message)

+        start_index = max(0, len(state.previous_steps) - self.number_history_steps_with_images)
+        end_index = len(state.previous_steps)
+
+        images_dict = {index: state.previous_steps[index]["image"] for index in range(start_index, end_index)}
+
+        # On the first iteration the history is empty, so the task goes only in the final user message
+        user_messages = state.task
+        if end_index == 0:
+            user_task_with_ref_imgs = ""
+            user_messages = [{"type": "text", "text": state.task}]
+        else:
+            user_task_with_ref_imgs = state.task
+            user_messages = [{"type": "text", "text": "Recall the task again:"}, {"type": "text", "text": state.task}]
+
        user_task_info_message = {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": user_task_info_template.format(
-                        task=state.task,
+                        task=user_task_with_ref_imgs,
                        current_date=datetime.datetime.now().strftime("%Y-%m-%d"),
                    ),
                }
            ],
        }
-
        messages.append(user_task_info_message)

-        start_index = max(
-            0, len(state.previous_steps) - self.number_history_steps_with_images
-        )
-        end_index = len(state.previous_steps)
-
        for index in range(0, end_index):
            step = state.previous_steps[index]

            if index >= start_index:
-                assert step["image"] is not None and len(step["image"]) > 0, (
-                    "Step image is empty"
-                )
+                image = images_dict.get(index, None)
+
+                assert image is not None and len(image) > 0, "Step image is empty"
                user_image_message = {
                    "role": "user",
                    "content": [
-                        {
-                            "type": "image_url",
-                            "image_url": {
-                                "url": f"data:image/jpeg;base64,{step['image']}"
-                            },
-                        },
+                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}},
                    ],
                }
                messages.append(user_image_message)
@@ -148,79 +162,98 @@ class ActionPlanner(object):
            }
            messages.append(user_message_reply)

+        memory = json.loads(state.previous_steps[-1]["additional_parameters"].get("memory", "{}")) if len(state.previous_steps) > 0 else {}
+        memory_str = json.dumps(memory, indent=4, ensure_ascii=False) if len(memory) > 0 else "No memory."
+
        last_user_message = {
            "role": "user",
-            "content": [
+            "content": user_messages
+            + [
                {
                    "type": "text",
                    "text": "Current screenshot:",
                },
-                {
-                    "type": "image_url",
-                    "image_url": {
-                        "url": f"data:image/jpeg;base64,{state.image_base64}"
-                    },
-                },
+                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{state.image_base64}"}},
                {
                    "type": "text",
-                    "text": user_command_template.format(
-                        task=state.task,
-                        execution_info_message=self.build_execution_info_message(
-                            execution_info
-                        ),
+                    "text": user_command_template_chat.format(
+                        execution_info_message=self.build_execution_info_message(execution_info),
                        json_output_format=self.build_message_output_format_info(),
+                        memory=memory_str,
                    ),
                },
            ],
        }

        messages.append(last_user_message)
+
+        for raw_response in execution_info.responses:
+            if raw_response.grounding_error is not None:
+                ai_message = {
+                    "role": "assistant",
+                    "content": [
+                        {
+                            "type": "text",
+                            "text": raw_response.raw_planning_prediction,
+                        }
+                    ],
+                }
+
+                messages.append(ai_message)
+
+                user_message = {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "text",
+                            "text": f"Grounder model error detected. Could not identify the element with description: '{raw_response.grounding_error.element_description}', error: {raw_response.grounding_error.message}. Possible reasons: the description is not precise enough for the grounder, or the element is not visible on the screenshot. If providing a new description does not work, try to complete the action through another path than using that specific button (either by changing the element to be clicked or providing another action such as a hotkey if any exist).",
+                        }
+                    ],
+                }
+                messages.append(user_message)
        return messages

-    def extract_response(
-        self, response_content: str
-    ) -> tuple[PlanAction, dict[str, str]]:
-        cot_sections_lst = list(PlanerCoTSections.keys())
-
+    def extract_response(self, response_content: str) -> tuple[PlanAction, dict[str, str]]:
        additional_sections = OrderedDict({})
        response_json = parse_message_json(response_content)
+        cot_sections = self.computer_use_agent_interface.get_planner_cot_sections()
+        cot_sections_lst = list(cot_sections.keys())

        for section in cot_sections_lst:
-            section_display = PlanerCoTSections[section]["display"]
+            section_display = cot_sections[section]["display"]
            if section_display not in response_json:
-                raise ValidationException(
-                    f"Invalid response format, '{section}' key not found: {response_content}"
-                )
-            additional_sections[section] = response_json.get(
-                PlanerCoTSections[section]["display"]
-            )
+                raise ValidationException(f"Invalid response format, '{section_display}' key not found: {response_content}")
+            additional_sections[section] = response_json.get(section_display)

        if "action" not in response_json:
-            raise ValidationException(
-                f"Invalid response format, 'action' key not found: {response_content}"
-            )
+            raise ValidationException(f"Invalid response format, 'action' key not found: {response_content}")

        action_dict = response_json["action"]

-        plan_action = PlanAction.from_dict(self.correct_action_type(action_dict))
+        plan_action = PlanAction.from_dict(ActionPlanner.correct_action_type(action_dict))
+
+        if plan_action is None:
+            raise ValidationException(f"Invalid action format: {response_content}")

        if plan_action.action_type == PlanActionType.Drag:
            self.computer_use_agent_interface.validate_action(plan_action)

        return plan_action, additional_sections

-    def build_execution_info_message(self, execution_info: dict) -> str:
+    def build_execution_info_message(self, execution_info: ExecutionInfo) -> str:
        execution_info_message = ""
-        if "planner_action_review" in execution_info:
-            action_description = execution_info["planner_action_review"][
-                "action_description"
-            ]
-            error_message = execution_info["planner_action_review"]["error_message"]
-
-            execution_info_message = f"You predicted this action: '{action_description}' but it is not valid because: {error_message}. If the target element is not visible on the screenshot, scroll first to make the target element visible. If the target element is not correct, change the action description with more precise element description using nearby context."
+        if execution_info.planner_action_review is not None:
+            action_description = execution_info.planner_action_review["action_description"]
+            error_message = execution_info.planner_action_review["error_message"]
+            execution_info_message = f"You predicted this action: '{action_description}' but it is not valid because: {error_message}. If the target element is not visible/fully visible on the screenshot, scroll first to make the target element visible. If the target element is not correct, change the action description with more precise element description using nearby context."
+        elif execution_info.responses and len(execution_info.responses) > 0 and execution_info.responses[-1].grounding_error is not None:
+            grounding_error = execution_info.responses[-1].grounding_error
+            error_message = str(grounding_error)
+            execution_info_message = f"The predicted action is not valid because of this error: {error_message}. If the target element is not visible/fully visible on the screenshot, scroll first to make the target element visible. If the target element is not correct, change the action description with more precise element description using nearby context."
        return execution_info_message

-    def correct_action_type(self, response_json: dict) -> dict:
+    @staticmethod
+    def correct_action_type(response_json: dict) -> dict:
        action_type = response_json.get("type", "").lower()
        if action_type in ("press", "key_press", "press_key"):
            response_json["type"] = "key_press"
@@ -234,11 +267,13 @@ class ActionPlanner(object):
            response_json["type"] = "wait"
        return response_json

-    def predict(self, state: State, execution_state: ExecutionState) -> PlannerOutput:
-        messages = self.build_messages_chat(state, execution_state.execution_info)
+    async def predict(self, state: State, execution_state: ExecutionState) -> PlannerOutput:
+        messages = self.build_messages_chat(state, execution_state)
        llm_messages = [message for message in messages]
-        repeat_count = 2
-        plan, response_content = None, None
+        repeat_count = 3
+        response_content = ""
+        plan_action = None
+        additional_sections = {}
        while repeat_count > 0:
            try:
                payload = {
@@ -250,13 +285,14 @@ class ActionPlanner(object):
                response_content = llm_client.send_messages(payload)
                if response_content is None or len(response_content.strip()) == 0:
                    raise ValidationException("Planner response is None or empty")
-                plan_action, additional_sections = self.extract_response(
-                    str(response_content)
-                )
-                plan = PlannerOutput(plan_action, additional_sections)
+
+                plan_action, additional_sections = self.extract_response(str(response_content))
+                llm_memory_response = additional_sections.get("memory", None)
+                memory_operations = self.short_term_memory_manager.extract_memory_operations(llm_memory_response)
+
+                execution_state.execution_info.current_response.raw_planning_prediction = response_content
                break
            except ValidationException as e:
-                time.sleep(5)
                repeat_count -= 1
                ai_message = {
                    "role": "assistant",
@@ -280,9 +316,15 @@ class ActionPlanner(object):
                llm_messages = messages + [ai_message, error_message]

                if repeat_count == 0:
-                    raise ValueError(
-                        f"Invalid planner response format: {response_content}, {str(e)}"
-                    )
-        if plan is None:
+                    raise ValueError(f"Invalid planner response format: {response_content}") from e
+        if plan_action is None:
            raise ValueError("Planner response is not valid")
-        return plan
+        planner_output = PlannerOutput(
+            plan_action=plan_action,
+            additional_sections=additional_sections,
+        )
+        updated_memory = await self.short_term_memory_manager.get_updated_memory(
+            state, memory_operations, execution_state=execution_state
+        )
+        planner_output.additional_sections["memory"] = json.dumps(updated_memory, indent=4, ensure_ascii=False)
+        return planner_output
--- a/mm_agents/uipath/action_planner_prompt_builder.py
+++ b/mm_agents/uipath/action_planner_prompt_builder.py
@@ -1,8 +1,11 @@
 from collections import OrderedDict
 from dataclasses import dataclass, field
 from typing import Any, Dict, List, Optional
+
+from enum import Enum
 from mm_agents.uipath.types_utils import PlanAction, key_maps
 from mm_agents.uipath.utils import ValidationException
+from mm_agents.uipath.memory import memory_system_template

 system_template = """You are a computer use agent that perform computer-related tasks.
 You will be given a task, a current screenshot, and a list of previous actions. You need to predict the next action.
@@ -25,91 +28,144 @@ Your action response must be a valid JSON with the following format:
 {{
    "type": str  # one of the valid action types
    "description": # action description
-    "parameters": # optional, action parameters dictionary 
+    "parameters": # optional, action parameters dictionary    
 }}

 ## Action examples: example of valid actions:
 {examples}

-## Important Notes:
- Close any cookies, ads, login or registration etc pop-ups if not needed.
- Before typing, ensure the input box is focused by clicking on it.
+## Action Sequence Example:
+Here is an example of the correct sequence for typing text into an input field.
+
+Step 1: Scroll to make the 'Username' input field fully visible.
+
+{{
+  "type": "scroll",
+  "description": "Scroll page to make the 'Username' input field fully visible."
+  "parameters": {{"element_description": "the main page", "direction": "down", "distance": 3}}
+}}
+
+Step 2: Click the input field to focus it.
+
+{{
+  "type": "click",
+  "description": "Click the 'Username' input field."
+}}
+
+Step 3: Type the desired text.
+
+{{
+  "type": "type",
+  "description": "Type 'testuser' into the focused 'Username' input field.",
+  "parameters": {{
+    "text": "testuser"
+  }}
+}}
+
+## Important Rules:
+- CRITICAL: Always click to focus an input field before using the type action if it is not already focused from a previous step. The model must predict a click on the element, and then in the next step, predict the type action.
+- Close any cookies, ads, login or registration pop-ups if they are not needed for the task.
+- Before the finish action, ensure all necessary data entries or selections are committed by performing appropriate actions (e.g., pressing 'Enter'/'Tab', Ctrl+S for saving documents or clicking 'Save', changing focus, or blurring the input field).
+- **Strict Adherence**: Only perform actions the user has explicitly requested; avoid unnecessary steps. E.g. For colors, ensure that if user requested to use "green" you use the color named green, not light green or other shades.
+- CRITICAL: Make sure the modified files or settings are saved and if no file name is specified in the user task, use the default settings that appear.
+- Dismiss "Authentication required" prompts by clicking "Cancel".
+- Leave windows/applications open at task completion.
+- **Completion Criteria**: Only finish when all user requirements are met in full and all running commands have finished.
+- **Impossibility Handling**: Return failure if completion is blocked by environmental constraints.
+- You must never log out of or shut down the computer; otherwise you won't be able to interact with the environment. If an action requires this, mark it as failure.
 """

-user_command_template = """Recall Task Again: {task}
-Check if the task is finished. If not provide the next action to perform.
-Remember:
- Perform the task on provided application(s) or website(s). You are not allowed to use the browser "address bar".
- Close any cookies, ads, login or registration etc pop-ups if not needed.
- Only one action at a time (never "click and type", "click and drag", "type and press", "press shift and click", etc..). Think of how to combine them in two consecutive actions obtaining the intended result or use an available action that can obtain it.
- For any opening input combobox, dropdown menu options, you must select an option or press Enter key to select default one.
- Click on input box to ensure is focused before typing. Otherwise, the input box will not accept the text.
- Once focusing on an input box, if it has a default pre-typed value (not placeholder which is usually grayed-out), remove the existing value first by clicking on "X" icon or using "Ctrl A" + "Backspace" or "Backspace" if the value is already selected.
- For search input, if no search button or suggestions popup after typing, press 'Enter' to trigger search.
- Retry the drag action on slider control if needed to refine the slider values closer to expected values.
- Scroll / Pageup / Pagedown to explore or extract more content/data if needed (prefer 'key_press' action with key 'Pageup', 'Pagedown' for faster scrolling). Particularly when extraction data from table with hidden rows or columns.
- Scroll action must have a 'direction' parameter. Finish action must have a 'status' parameter.
- If you modify some settings remember to save/apply them. If button is not visible try to scroll for it.
+user_message_template = """Here is the current information:
+The current date is (YYYY-MM-DD): {current_date}
+Task: {task}

-Most importantly, never type or click on element not visible on screenshot. Use scroll or pageup/pagedown to make the element visible first.
-
-{execution_info_message}
-Answer in json format:
-{json_output_format}
+Previous actions:
+{history}
 """

-PlanerCoTSections = OrderedDict(
-    {
-        "review": {
-            "display": "previous_action_result",
-            "description": "Briefly describe the previous action result and UI change on the screenshot to see if is correctly performed.",
-        },
-        "thought": {
-            "display": "thought",
-            "description": "Reason briefly about the next action to perform if the task is not finished.",
-        },
-        "action_description": {
-            "display": "action_description",
-            "description": "Describe the action to perform in a single sentence. The description must be precise and not rely on specific information in the current screen.",
-        },
-    }
-)
-
-
 ### for chat conversation
 user_task_info_template = """## Task Information:
 The current date is (YYYY-MM-DD): {current_date}
 Task: {task}
 """

+user_command_template_chat = """Current Memory: {memory}
+Check if the task is finished. If not provide the next action to perform.
+Remember:
+- Perform the task on provided application(s) or website(s). You are not allowed to use the browser "address bar".
+- Close any cookie, ad, login, or registration pop-ups if not needed.
+- Only one action at a time (never "click and type", "click and drag", "type and press", etc.).
+- For any open input combobox or dropdown menu, you must select an option or press the Enter key to select the default one.
+- The caret is not always visible in an input box, even when the input box is focused.
+- CRITICAL: Scroll to make the target element fully visible on the screenshot before clicking or typing on it. Never click or type on an element not fully visible on the screenshot.
+- CRITICAL: Before typing ensure the element is focused by first clicking it. Otherwise, the input box will not accept the text.
+- Once focusing on an input box, if it has a default pre-typed value (not placeholder which is usually grayed-out), remove the existing value first by clicking on "X" icon or using "Ctrl A" + "Backspace" or "Backspace" if the value is already selected.
+- For search input, if no search button or suggestions popup after typing, press 'Enter' to trigger search.
+- Retry the drag action on slider control if needed to refine the slider values closer to expected values.
+- Scroll / Pageup / Pagedown to explore or extract more content/data if needed (prefer 'key_press' action with key 'Pageup', 'Pagedown' for faster scrolling), particularly when extracting data from a table with hidden rows or columns.
+- Scroll action must have a 'direction' parameter. Finish action must have a 'status' parameter.
+
+MOST IMPORTANTLY, never type or click on element not visible on screenshot. Use scroll or pageup/pagedown to make the element visible first.
+
+{execution_info_message}
+Answer in json format:
+{json_output_format}
+"""
+
+user_command_template = """Recall Task Again: {task}\n""" + user_command_template_chat
+
+
+class PlanerCoTSectionsType(str, Enum):
+    Review = "review"
+    Thought = "thought"
+    ActionDescription = "action_description"
+    Memory = "memory"
+
+PlanerCoTSections = OrderedDict(
+    {
+        PlanerCoTSectionsType.Review: {
+            "display": "previous_action_result",
+            "description": "Briefly describe the previous action result and UI change on the screenshot to see if it was correctly performed.",
+        },
+        PlanerCoTSectionsType.Thought: {"display": "thought", "description": "Reason briefly about the next action to perform if the task is not finished."},
+        PlanerCoTSectionsType.ActionDescription: {
+            "display": "action_description",
+            "description": "Describe the action to perform in a single sentence. The description must be precise and not rely on specific information in the current screen.",
+        },
+        PlanerCoTSectionsType.Memory: {
+            "display": "update_memory",
+            "description": "<Proceed with a memory update considering the previous actions. Emit a list of memory operations. If no memory update is needed, emit an empty list>",
+        },
+    }
+)
+

@dataclass
 class ActionDefinition:
+    """Simple action definition with description, parameters, and examples"""
+
    type: str
    description: str
    parameters: Optional[Dict[str, str]] = None
    examples: List[Dict[str, Any]] = field(default_factory=list)


-class PlannerOutput(object):
-    def __init__(self, plan_action: PlanAction, additional_sections: dict[str, str]):
-        self.plan_action = plan_action
-        self.thought = additional_sections["thought"]
-        self.review = additional_sections["review"]
-        self.additional_sections = {
-            key: value
-            for key, value in additional_sections.items()
-            if key not in ["review", "thought"]
-        }
-
-
 class ComputerUseAgentInterface:
+    """Simple computer use agent with modular action definitions"""
+
    def __init__(self):
        self.ui_actions = {}
        self.special_actions = {}
        self._setup_default_actions()

+    def get_planner_cot_sections(self) -> OrderedDict:
+        cot_sections = PlanerCoTSections.copy()
+        return cot_sections
+
    def _setup_default_actions(self):
+        """Define all available actions"""
+
+        # Click action - no parameters
        self.add_action(
            ActionDefinition(
                type="click",
@@ -120,124 +176,121 @@ class ComputerUseAgentInterface:
                        "type": "click",
                        "description": "Click the 'X' icon in the input box",
                    },
-                    {
-                        "type": "click",
-                        "description": "Click the first name input box to focus on it.",
-                    },
+                    {"type": "click", "description": "Click the first name input box to focus on it."},
                ],
            )
        )

+        # Right click action - no parameters
        self.add_action(
            ActionDefinition(
                type="right_click",
                description="Right click on a UI element",
-                examples=[
-                    {
-                        "type": "right_click",
-                        "description": "Right click on the first row from the patient table to open the context menu.",
-                    }
-                ],
+                examples=[{"type": "right_click", "description": "Right click on the first row from the patient table to open the context menu."}],
            )
        )

+        # Double click action - no parameters
        self.add_action(
            ActionDefinition(
                type="double_click",
                description="Double click on a UI element",
                examples=[
-                    {
-                        "type": "double_click",
-                        "description": "Double click word app icon to open the application.",
-                    },
+                    {"type": "double_click", "description": "Double click word app icon to open the application."},
+                ],
+            )
+        )
+
+        # Triple click action - no parameters
+        self.add_action(
+            ActionDefinition(
+                type="triple_click",
+                description="Triple click on a UI element",
+                examples=[
+                    {"type": "triple_click", "description": "Triple click the second paragraph to select it."},
                ],
            )
        )

+        # Type action - with text parameter
        self.add_action(
            ActionDefinition(
                type="type",
                description="Type text into a focused input field. Ensure the input box is focused before typing. To focus the input box, you may need to click on it first.",
                parameters={"text": "str - the text to be typed"},
                examples=[
-                    {
-                        "type": "type",
-                        "description": "Type 'John' in the first name input box.",
-                        "parameters": {"text": "John"},
-                    },
-                    {
-                        "type": "type",
-                        "description": "Type 'Doe' in the last name input box.",
-                        "parameters": {"text": "Doe"},
-                    },
-                    {
-                        "type": "type",
-                        "description": "Type 'Hello, world!' in the text area.",
-                        "parameters": {"text": "Hello, world!"},
-                    },
+                    {"type": "type", "description": "Type 'John' in the first name input box.", "parameters": {"text": "John"}},
+                    {"type": "type", "description": "Type 'Doe' in the last name input box.", "parameters": {"text": "Doe"}},
+                    {"type": "type", "description": "Type 'Hello, world!' in the text area.", "parameters": {"text": "Hello, world!"}},
                ],
            )
        )

+        # Scroll action - with direction parameter
        self.add_action(
            ActionDefinition(
                type="scroll",
                description="Scroll an UI element in a specified direction",
                parameters={
+                    "element_description": "str - description of the element to be scrolled such that the executor can locate it",
                    "direction": "str - 'up', 'down', 'left', or 'right'",
-                    "distance": "int - the number of scroll steps (wheel “clicks”) to send.",
+                    "distance": "int - number of 'clicks' to scroll, e.g. on windows, 1 click = 120 units of scroll internally",
                },
                examples=[
                     {
                         "type": "scroll",
-                        "description": "Scroll down to see more content.",
-                        "parameters": {"direction": "down"},
+                        "description": "Scroll down the user table to see more content.",
+                        "parameters": {"element_description": "Users table", "direction": "down", "distance": 6},
                    },
                    {
                        "type": "scroll",
                        "description": "Scroll up to the top of the page.",
-                        "parameters": {"direction": "up"},
+                        "parameters": {"element_description": "the main page", "direction": "up"},
                    },
                ],
            )
        )

+        # Drag action
        self.add_action(
            ActionDefinition(
                type="drag",
-                description="Drag an element or the mouse (with left click on) from one location to another. You must specify both start_description and end_description.",
-                parameters={
-                    "start_description": "description of the location to start dragging",
-                    "end_description": "description of the location to drag to",
-                },
+                description="Drag an element or the mouse (with left click on) from one location to another.",
+                parameters={"start_description": "description of the location to start dragging", "end_description": "description of the location to drag to"},
                examples=[
                    {
                        "type": "drag",
                        "description": "Drag the response.txt file to the responses folder",
-                        "start_description": "Click the response.txt file",
-                        "end_description": "Click the responses folder",
+                        "parameters": {
+                            "start_description": "the response.txt file",
+                            "end_description": "the responses folder",
+                        },
+                    },
+                    {
+                        "type": "drag",
+                        "description": "Drag the profile picture image into the upload box",
+                        "parameters": {
+                            "start_description": "the profile picture image",
+                            "end_description": "the upload box",
+                        },
                    },
                ],
            )
        )

+        # Mouse move action
        self.add_action(
            ActionDefinition(
                type="mouse_move",
                description="Move the mouse to a specific element",
                examples=[
-                    {
-                        "type": "mouse_move",
-                        "description": "Move the mouse to the 'Submit' button.",
-                    },
-                    {
-                        "type": "mouse_move",
-                        "description": "Hover over the 'Settings' icon.",
-                    },
+                    {"type": "mouse_move", "description": "Move the mouse to the 'Submit' button."},
+                    {"type": "mouse_move", "description": "Hover over the 'Settings' icon."},
                ],
            )
        )

+        # Key press action - with key parameter
        self.add_action(
            ActionDefinition(
                type="key_press",
@@ -246,50 +299,55 @@ class ComputerUseAgentInterface:
                    "key": f'str  # the key or key combination (separated by space) to be pressed. Example of key combination "Ctrl A", "Shift Tab", "Ctrl C" etc. "<Key> + Click" is not a valid combination, use two separate actions. Beside normal keys like letters, numerics, punctuations etc.. here are special key list: {key_maps.keys()}.'
                },
                examples=[
-                    {
-                        "type": "key_press",
-                        "description": "Press 'Ctrl A' to select all text.",
-                        "parameters": {"key": "Ctrl A"},
-                    },
-                    {
-                        "type": "key_press",
-                        "description": "Press Pagedown key.",
-                        "parameters": {"key": "Pagedown"},
-                    },
+                    {"type": "key_press", "description": "Press 'Ctrl A' to select all text.", "parameters": {"key": "Ctrl A"}},
+                    {"type": "key_press", "description": "Press Pagedown key.", "parameters": {"key": "Pagedown"}},
                ],
            )
        )

+        # Extract data action - with variable parameter
        self.add_special_action(
            ActionDefinition(
                type="extract_data",
                description="Use to extract some data from the screen for the task. This data will be stored in memory and used in the next actions or returned in the final result.",
-                parameters={
-                    "description": "str - short description of the data to be extracted",
-                    "data": "str|json - the data to be extracted",
-                },
+                parameters={"description": "str - short description of the data to be extracted", "data": "str|json - the data to be extracted"},
                examples=[
                    {
                        "type": "extract_data",
                        "description": "Extract the product name and price from the screen.",
-                        "parameters": {
-                            "description": "Available product name and price",
-                            "data": "Product Name: iPhone 14, Price: $999",
-                        },
+                        "parameters": {"description": "Available product name and price", "data": "Product Name: iPhone 14, Price: $999"},
                    },
                ],
            )
        )

+        # Wait action
+        self.add_special_action(
+            ActionDefinition(
+                type="wait",
+                description="Use it to wait for the completion of an event.",
+                examples=[
+                    {"type": "wait", "description": "Wait for the running command to finish."},
+                ],
+            )
+        )
+
+        # Finish action - with status parameter
        self.add_special_action(
            ActionDefinition(
                type="finish",
-                description=" Use it to finish the task with success or failure status. When you think the task was finished return success, while when you think can not be done, return failure, don't easily say failure, try your best to do the task.",
+                description=(
+                    "Use it to finish the task with success or failure. "
+                    "Before finishing, ensure all necessary data entries or selections required by the task are committed by performing appropriate actions (e.g., pressing 'Enter'/'Tab', pressing CTRL + S to save the document or clicking 'Save', changing focus, or blurring the input field). After typing a value that should be set/submitted, perform a COMMIT action (Enter, Tab, click Save/Apply or blur) before using the finish action. "
+                    "Do not use the finish action while any essential process or command (e.g., downloading data, running a script, loading results) is still in progress; wait for it (emit a wait action) to fully complete before finishing. "
+                    "Failure status is used when the task is impossible to complete or you are unable to complete it (e.g. stuck in a loop, etc)."
+                ),
                parameters={"status": "str - 'success' or 'failure'"},
                examples=[
+                    {"type": "finish", "description": "Task completed successfully.", "parameters": {"status": "success"}},
                    {
                        "type": "finish",
-                        "description": "Task completed successfully.",
+                        "description": "After typing 'John Doe' and pressing TAB to save the value, finish the task successfully.",
                        "parameters": {"status": "success"},
                    },
                ],
@@ -297,15 +355,19 @@ class ComputerUseAgentInterface:
        )

    def add_action(self, action: ActionDefinition):
+        """Add a new action to the agent"""
        self.ui_actions[action.type] = action

    def add_special_action(self, action: ActionDefinition):
+        """Add a special action that is not part of the main UI actions"""
        self.special_actions[action.type] = action

    def get_action_definition(self, action_type: str) -> Optional[ActionDefinition]:
+        """Get action definition by type"""
        return self.ui_actions.get(action_type) or self.special_actions.get(action_type)

    def validate_action(self, action: PlanAction):
+        """Validate if the action is valid and has all required parameters"""
        action_definition = self.get_action_definition(action.action_type)
        if action_definition is None:
            raise ValidationException(f"Invalid action type: {action.action_type}")
@@ -313,26 +375,25 @@ class ComputerUseAgentInterface:
        if action_definition.parameters:
            for parameter in action_definition.parameters:
                if parameter not in action.parameters:
-                    raise ValidationException(
-                        f"Missing parameter '{parameter}' in action: {action}"
-                    )
+                    raise ValidationException(f"Missing parameter '{parameter}' in action: {action}")

    def get_system_prompt(self) -> str:
+        """Generate the complete prompt for the agent"""
        indentation = "  "

        def get_action_definition(action: ActionDefinition) -> str:
+            """Format action definitions for the prompt"""
+
            action_prompt = f"- {action.type}: {action.description}"
            if action.parameters is not None and len(action.parameters) > 0:
-                params = (",\n" + 2 * indentation).join(
-                    f"{k}: {v}" for k, v in action.parameters.items()
-                )
-                parameter_def = (
-                    f"{indentation}parameters:\n{indentation}{indentation}{params}"
-                )
+                params = (",\n" + 2 * indentation).join(f"{k}: {v}" for k, v in action.parameters.items())
+                parameter_def = f"{indentation}parameters:\n{indentation}{indentation}{params}"
                action_prompt += "\n" + parameter_def
            return action_prompt

        def get_examples(actions: List[ActionDefinition]) -> list[str]:
+            """Format action examples for the prompt"""
+
            output_examples = []
            for action in actions:
                for example in action.examples:
@@ -343,48 +404,23 @@ class ComputerUseAgentInterface:
                    example_parts = [type_str, description_str]

                    if "parameters" in example:
-                        params = (",\n" + 2 * indentation).join(
-                            f'"{k}": "{v}"' for k, v in example["parameters"].items()
-                        )
-                        parameters_str = (
-                            '"parameters"'
-                            + ": {\n"
-                            + 2 * indentation
-                            + params
-                            + "\n"
-                            + indentation
-                            + "}"
-                        )
+                        params = (",\n" + 2 * indentation).join(f'"{k}": "{v}"' for k, v in example["parameters"].items())
+                        parameters_str = '"parameters"' + ": {\n" + 2 * indentation + params + "\n" + indentation + "}"
                        example_parts.append(parameters_str)
-                    example_json = (
-                        "{\n"
-                        + indentation
-                        + (",\n" + indentation).join(example_parts)
-                        + "\n}"
-                    )
+                    example_json = "{\n" + indentation + (",\n" + indentation).join(example_parts) + "\n}"
                    output_examples.append(example_json)

            return output_examples

-        available_actions = "\n\n".join(
-            get_action_definition(action) for action in self.ui_actions.values()
-        )
-        special_actions = "\n\n".join(
-            get_action_definition(action) for action in self.special_actions.values()
-        )
-        examples = "\n\n".join(
-            get_examples(
-                list(self.ui_actions.values()) + list(self.special_actions.values())
-            )
-        )
+        available_actions = "\n\n".join(get_action_definition(action) for action in self.ui_actions.values())
+        special_actions = "\n\n".join(get_action_definition(action) for action in self.special_actions.values())
+        examples = "\n\n".join(get_examples(list(self.ui_actions.values()) + list(self.special_actions.values())))

-        return system_template.format(
-            available_actions=available_actions,
-            special_actions=special_actions,
-            examples=examples,
-        )
+        out = system_template.format(available_actions=available_actions, special_actions=special_actions, examples=examples)
+        out += "\n\n" + memory_system_template.format()
+        return out


 if __name__ == "__main__":
    agent = ComputerUseAgentInterface()
-    print(agent.get_system_prompt())
+    print(agent.get_system_prompt())
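Review note: the parameter formatting inside `get_system_prompt` can be checked in isolation. The sketch below is a minimal stand-in, not the repository's code. It assumes a simplified `ActionDefinition` dataclass (hypothetical; the real class is defined elsewhere and may differ) and a free function `format_action_definition` (also a hypothetical name) that reproduces only the join logic shown in the hunk above.

```python
from dataclasses import dataclass, field
from typing import Optional

INDENT = "  "  # same two-space indentation unit as in the diff


@dataclass
class ActionDefinition:
    """Hypothetical, simplified stand-in for the real ActionDefinition."""
    type: str
    description: str
    parameters: Optional[dict] = None
    examples: list = field(default_factory=list)


def format_action_definition(action: ActionDefinition) -> str:
    """Render one action as a bullet, with parameters indented beneath it."""
    prompt = f"- {action.type}: {action.description}"
    if action.parameters:
        # Parameters are joined with ",\n" plus a double indent,
        # mirroring the expression in get_system_prompt.
        params = (",\n" + 2 * INDENT).join(
            f"{k}: {v}" for k, v in action.parameters.items()
        )
        prompt += "\n" + f"{INDENT}parameters:\n{INDENT}{INDENT}{params}"
    return prompt


finish = ActionDefinition(
    type="finish",
    description="Finish the task.",
    parameters={"status": "str - 'success' or 'failure'"},
)
wait = ActionDefinition(type="wait", description="Wait for an event.")
print(format_action_definition(finish))
print(format_action_definition(wait))
```

An action without parameters, such as the new `wait` action, renders as a single bullet line. This is only well-formed when `description` is a plain string.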
--- a/mm_agents/uipath/agent.py
+++ b/mm_agents/uipath/agent.py
@@ -19,113 +19,19 @@ class UiPathComputerUseV1(object):
        self.planner = ActionPlanner()
        self.executor = GrounderClient()

-    async def predict_request(
-        self, request_body: dict, model_name: str
-    ) -> tuple[dict, dict]:
+    async def predict_request(self, request_body: dict, model_name: str) -> tuple[dict, dict]:
+        previous_steps = request_body.get("previousSteps") or []
        state = State(
            task=request_body["userTask"],
            image_base64=request_body["image"],
-            previous_steps=request_body.get("previousSteps", []),
+            previous_steps=list(previous_steps),
        )

-        execution_state = ExecutionState(model_name=model_name, execution_info={})
-        output = await self.predict(state, execution_state)
+        execution_state = ExecutionState(model_name=model_name)
+        output = await self.predict(state, execution_state, max_retries=2)
        return output

-    def process_grounding(
-        self,
-        plan_action: PlanAction,
-        grounding_result: utils.GroundingOutput,
-        x: int,
-        y: int,
-    ):
-        match plan_action.action_type:
-            case PlanActionType.Scroll:
-                # guess the scroll direction if missing in the plan output
-                if "direction" not in plan_action.parameters:
-                    if "scroll up" in plan_action.description.lower():
-                        scroll_direction = "up"
-                    else:
-                        scroll_direction = "down"
-                else:
-                    scroll_direction = plan_action.parameters["direction"]
-
-                action = ComputerUseAction(
-                    name=SupportedActions.Scroll,
-                    description=plan_action.description,
-                    parameters={"position": [x, y], "direction": scroll_direction},
-                )
-
-                if "distance" in plan_action.parameters:
-                    match scroll_direction:
-                        case "up":
-                            action.parameters["offset"] = [
-                                0,
-                                plan_action.parameters["distance"],
-                            ]
-                        case "down":
-                            action.parameters["offset"] = [
-                                0,
-                                -plan_action.parameters["distance"],
-                            ]
-                        case "left":
-                            action.parameters["offset"] = [
-                                plan_action.parameters["distance"],
-                                0,
-                            ]
-                        case "right":
-                            action.parameters["offset"] = [
-                                -plan_action.parameters["distance"],
-                                0,
-                            ]
-            case PlanActionType.Drag:
-                assert grounding_result.end_position is not None, (
-                    "End position must be provided for drag action"
-                )
-                x_end, y_end = grounding_result.end_position
-                action = ComputerUseAction(
-                    name=SupportedActions.Drag,
-                    description=plan_action.description,
-                    parameters={
-                        "path": [
-                            {"x": x, "y": y},
-                            {"x": x_end, "y": y_end},
-                        ]
-                    },
-                )
-            case _:
-                action_name = plan_action.action_type
-                parameters = {"position": [x, y]}
-
-                if plan_action.action_type == PlanActionType.DoubleClick:
-                    action_name = SupportedActions.Click
-                    parameters["click_type"] = "double"
-                elif plan_action.action_type == PlanActionType.RightClick:
-                    action_name = SupportedActions.Click
-                    parameters["button"] = "right"
-                elif plan_action.action_type == PlanActionType.MouseMove:
-                    action_name = SupportedActions.MouseMove  # different names
-
-                assert action_name in [
-                    SupportedActions.Click,
-                    SupportedActions.MouseMove,
-                ]
-                action = ComputerUseAction(
-                    name=action_name,
-                    description=plan_action.description,
-                    parameters=parameters,
-                )
-        return action
-
-    async def predict(
-        self, state: State, execution_state: ExecutionState
-    ) -> tuple[dict, dict]:
-        planer_output: PlannerOutput = self.planner.predict(state, execution_state)
-        plan_action = planer_output.plan_action
-
-        action: ComputerUseAction | None = None
-        step: ComputerUseStep | None = None
-
+    def wrap_to_computer_use_action(self, plan_action: PlanAction, grounding_result: utils.GroundingOutput | None) -> ComputerUseAction:
        match plan_action.action_type:
            case PlanActionType.KeyPress:
                keys = plan_action.parameters["key"].split(" ")
@@ -142,6 +48,125 @@ class UiPathComputerUseV1(object):
                    description=plan_action.description,
                    parameters={},
                )
+            case PlanActionType.Click | PlanActionType.DoubleClick | PlanActionType.TripleClick | PlanActionType.MouseMove | PlanActionType.RightClick:
+                action_name = plan_action.action_type
+                x, y = grounding_result.position
+                parameters = {"position": [int(x), int(y)]}
+
+                if plan_action.action_type == PlanActionType.DoubleClick:
+                    action_name = SupportedActions.Click
+                    parameters["click_type"] = "double"
+                elif plan_action.action_type == PlanActionType.TripleClick:
+                    action_name = SupportedActions.Click
+                    parameters["click_type"] = "triple"
+                elif plan_action.action_type == PlanActionType.RightClick:
+                    action_name = SupportedActions.Click
+                    parameters["button"] = "right"
+                elif plan_action.action_type == PlanActionType.MouseMove:
+                    action_name = SupportedActions.MouseMove  # different names
+
+                assert action_name in [SupportedActions.Click, SupportedActions.MouseMove]
+                action = ComputerUseAction(
+                    name=action_name,
+                    description=plan_action.description,
+                    parameters=parameters,
+                )
+            case PlanActionType.Drag:
+                assert grounding_result.end_position is not None, "End position must be provided for drag action"
+                x, y = grounding_result.position
+                x_end, y_end = grounding_result.end_position
+                x, y = int(x), int(y)
+                x_end, y_end = int(x_end), int(y_end)
+                action = ComputerUseAction(
+                    name=SupportedActions.Drag,
+                    description=plan_action.description,
+                    parameters={"path": [{"x": x, "y": y}, {"x": x_end, "y": y_end}]},
+                )
+            case PlanActionType.Scroll:
+                x, y = grounding_result.position
+                x, y = int(x), int(y)
+                # guess the scroll direction if missing in the plan output
+                if "direction" not in plan_action.parameters:
+                    if "scroll up" in plan_action.description.lower():
+                        scroll_direction = "up"
+                    else:
+                        scroll_direction = "down"
+                else:
+                    scroll_direction = plan_action.parameters["direction"]
+
+                action = ComputerUseAction(
+                    name=SupportedActions.Scroll, description=plan_action.description, parameters={"position": [x, y], "direction": scroll_direction}
+                )
+
+                if "distance" in plan_action.parameters:
+                    match scroll_direction:
+                        case "up":
+                            action.parameters["offset"] = [0, plan_action.parameters["distance"]]
+                        case "down":
+                            action.parameters["offset"] = [0, -plan_action.parameters["distance"]]
+                        case "left":
+                            action.parameters["offset"] = [plan_action.parameters["distance"], 0]
+                        case "right":
+                            action.parameters["offset"] = [-plan_action.parameters["distance"], 0]
+            case PlanActionType.Type:
+                action = ComputerUseAction(
+                    name=SupportedActions.TypeInto,
+                    description=plan_action.description,
+                    parameters={"value": plan_action.parameters["text"]},
+                )
+
+        return action
+
+    async def predict(
+        self, state: State, execution_state: ExecutionState, max_retries: int = 0, planer_output: PlannerOutput | None = None
+    ) -> tuple[dict, dict]:
+        execute_planning = True
+        is_planning_fixed = planer_output is not None
+        execution_count = 0
+        execution_state.execution_info.responses = []
+        while execute_planning:
+            try:
+                execution_count += 1
+                if execution_state.execution_info.current_response is not None:
+                    execution_state.execution_info.responses.append(execution_state.execution_info.current_response)
+                execution_state.execution_info.current_response = utils.RawAgentResponse()
+                if not is_planning_fixed:
+                    planer_output = await self.planner.predict(state, execution_state)
+                plan_action = planer_output.plan_action
+
+                step = await self.process_plan_and_ground(planer_output, state, execution_state, retry_number=max_retries)
+                execute_planning = False
+            except utils.GroundingOutputValidationException as e:
+                execution_state.execution_info.current_response.grounding_error = e
+                if is_planning_fixed or execution_count > max_retries:
+                    raise ValueError(f"Grounding error after {execution_count} attempt(s): {e.message}, element description: {e.element_description}")
+
+        # save additional data for history
+        assert step is not None
+        assert step.additional_parameters is not None
+        step.additional_parameters["thought"] = planer_output.thought
+        step.additional_parameters["review"] = planer_output.review
+        step.additional_parameters.update(planer_output.additional_sections)
+        step.additional_parameters["plan_action"] = json.dumps(plan_action.to_dict())
+
+        history_image = state.image_base64
+        previous_steps_parameters = {
+            "max_chat_history_messages": 1000,
+            "max_chat_history_images": 1,
+            "image": history_image,
+        }
+        agent_response = {"step": step.to_response_dict(), "previous_steps_parameters": previous_steps_parameters}
+
+        return agent_response
+
+    async def process_plan_and_ground(
+        self, planer_output: PlannerOutput, state: State, execution_state: ExecutionState, retry_number: int = 0
+    ) -> ComputerUseStep:
+        plan_action = planer_output.plan_action
+        action: ComputerUseAction | None = None
+        step: ComputerUseStep | None = None
+
+        match plan_action.action_type:
            case PlanActionType.ExtractData:
                # return a step with no action, just to store the extracted data
                step = ComputerUseStep(
@@ -164,35 +189,29 @@ class UiPathComputerUseV1(object):
                | PlanActionType.Scroll
                | PlanActionType.Drag
                | PlanActionType.DoubleClick
+                | PlanActionType.TripleClick
                | PlanActionType.RightClick
            ):
                if plan_action.action_type != PlanActionType.Drag:
+                    element_description = plan_action.parameters.get("element_description", None)
                    grounding_result = await self.executor.predict(
                        state.image_base64,
                        plan_action.description,
                        action=plan_action.action_type,
+                        element_description=element_description
                    )
                else:
-                    grounding_result = await self.executor.predict(
-                        state.image_base64,
-                        plan_action.parameters["start_description"],
-                        action=plan_action.action_type,
-                    )
-                    grounding_result_end = await self.executor.predict(
-                        state.image_base64,
-                        plan_action.parameters["end_description"],
-                        action=plan_action.action_type,
-                    )
-                    grounding_result.end_position = grounding_result_end.position
-                x, y = grounding_result.position
-                action = self.process_grounding(plan_action, grounding_result, x, y)
-            case PlanActionType.Type:
-                action = ComputerUseAction(
-                    name=SupportedActions.TypeInto,
-                    description=plan_action.description,
-                    parameters={"value": plan_action.parameters["text"]},
-                )
-
+                    start_description = plan_action.parameters.get("start_description", None)
+                    end_description = plan_action.parameters.get("end_description", None)
+                    drag_entire_description = plan_action.description
+                    drag_start_description = f"Drag Start point:{start_description}. [Full Drag Description:{drag_entire_description}]"
+                    drag_end_description = f"Drag End point:{end_description}. [Full Drag Description:{drag_entire_description}]"
+                    grounding_result = await self.executor.predict(state.image_base64, drag_start_description, action=plan_action.action_type)
+                    grounding_result_end = await self.executor.predict(state.image_base64, drag_end_description, action=plan_action.action_type)
+                    grounding_result.end_position = grounding_result_end.get_point_location()
+                action = self.wrap_to_computer_use_action(plan_action, grounding_result)
+            case _:
+                action = self.wrap_to_computer_use_action(plan_action, grounding_result=None)
        if step is None:
            assert action is not None
            step = ComputerUseStep(
@@ -202,22 +221,4 @@ class UiPathComputerUseV1(object):
                thought=planer_output.thought,
            )

-        # save additional data for history
-        assert step.additional_parameters is not None
-        step.additional_parameters["thought"] = planer_output.thought
-        step.additional_parameters["review"] = planer_output.review
-        step.additional_parameters.update(planer_output.additional_sections)
-        step.additional_parameters["plan_action"] = json.dumps(plan_action.to_dict())
-
-        history_image = state.image_base64
-        previous_steps_parameters = {
-            "max_chat_history_messages": 1000,
-            "max_chat_history_images": self.planner.number_history_steps_with_images,
-            "image": history_image,
-        }
-        agent_response = {
-            "step": step.to_response_dict(),
-            "previous_steps_parameters": previous_steps_parameters,
-        }
-
-        return agent_response
+        return step
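Review note: the new `predict` loop re-plans when grounding fails, unless the caller supplied a fixed plan. The sketch below shows that control flow on its own, under stated assumptions: `GroundingError`, `predict_with_retries`, `plan`, and `ground` are hypothetical names standing in for `utils.GroundingOutputValidationException`, `predict`, the planner call, and `process_plan_and_ground`. It is not the repository's implementation.

```python
import asyncio


class GroundingError(Exception):
    """Hypothetical stand-in for utils.GroundingOutputValidationException."""


async def predict_with_retries(plan, ground, max_retries=0, fixed_plan=None):
    """Plan, then ground; re-plan on grounding failure up to max_retries times.

    A caller-supplied fixed_plan is never re-planned, so its first grounding
    failure is final.
    """
    attempts = 0
    while True:
        attempts += 1
        current_plan = fixed_plan if fixed_plan is not None else await plan()
        try:
            return await ground(current_plan), attempts
        except GroundingError as exc:
            if fixed_plan is not None or attempts > max_retries:
                raise ValueError(
                    f"Grounding failed after {attempts} attempt(s): {exc}"
                ) from exc


# Usage: a grounder that fails twice before succeeding.
calls = {"n": 0}


async def plan():
    return "click the OK button"


async def ground(plan_text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise GroundingError("element not found")
    return (10, 20)


result, attempts = asyncio.run(predict_with_retries(plan, ground, max_retries=2))
print(result, attempts)
```

With `max_retries=2` the call allows up to three attempts in total, which matches the `execution_count > max_retries` check in the diff.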
Author	SHA1	Message	Date
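lizhanyuan	b75f6bf341	feat: strengthen task-step injection and a11y state representation to improve tree-interaction stability - Wire up the metadata.steps propagation path and inject task steps into the agent's prediction context - Improve the linearized a11y tree output: use center coordinates and add a states column (expanded/collapsed/selected, etc.) - Relax the node-retention criteria to keep input controls that have no text (edit/textfield/searchbox, etc.) - Tighten output constraints: a single turn may contain only action code or WAIT/DONE/FAIL; returning an action together with DONE in the same turn is forbidden - Add avogadro example steps: expand aromatics and select benzene.cjson	2026-02-26 18:56:53 +08:00
lizhanyuan	07e66490dd	feat: improve a11y tree support for scientific software - Extend the heuristic_retrieve.py whitelist to cover the GUI frameworks used by scientific software: - Add prefix rules: sunawt (Java Swing), qt5q/qt6q (Qt), ovito, pymol, contentspanel, wx (wxWidgets), afx (MFC), thunderrt (VB6) - Add endswith rules: edit, widget, box, dialog, view, frame, menuitem, menubar, toolbar, tabitem, treeitem, window - Add exact matches for Qt controls and Win32 controls - Add debug logging of the raw a11y tree in agent.py - Fix the missing platform='windows' argument in the agent initialization in run.py - Add NO_PROXY to bypass local/VM IPs (compatible with Clash global proxy) - Increase the application-launch wait time in lib_run_single.py to 15 seconds - Add test_each_domain_a11y_tree.json (one task per domain for a11y verification)	2026-02-26 15:04:28 +08:00
lizhanyuan	9899d4a0c7	feat: add benchmark task data for scientific software - Add task JSON files for scientific software such as avogadro/imagej/jade/origin/ovito/pymol/vesta - Modify vllm_eval.py to rename image files to "step x" - Add the extra data parameters config and metadata to desktop_env.py	2026-02-25 15:19:36 +08:00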
cui0711	613f55f0da	feat(tools): add instructions extraction script for generating test cases	2026-02-09 17:47:02 +08:00
cui0711	ba03784196	fix(env): handle None result_getter for vllm_eval evaluator	2026-02-09 17:46:05 +08:00
cui0711	3890ee5fc3	fix(vllm_eval): add image compression to prevent 413 error with large max_steps	2026-02-09 14:24:59 +08:00
cui0711	9bc54c0a66	feat(vllm_eval): add structured JSON response format with step analysis	2026-02-09 13:58:14 +08:00
cui0711	1e9281a1ab	feat(cli): add eval_model argument	2026-02-05 16:56:39 +08:00
cui0711	63484c7b7b	fix(runner): pass result_dir to evaluate and re-enable environment reset	2026-02-05 16:55:49 +08:00
cui0711	ad46acc5f3	refactor(example): replace check_include_exclude with vllm_eval evaluator	2026-02-05 16:55:03 +08:00
cui0711	58d411bf86	feat(evaluator): export vllm_eval module	2026-02-05 16:54:16 +08:00
cui0711	be24e77d93	feat(env): add eval_model parameter and result_dir support for vllm evaluation	2026-02-05 16:53:12 +08:00
cui0711	dd58a1de03	feat(evaluator): add vision-language model evaluator	2026-02-05 16:52:35 +08:00
cui0711	231f7a8fbc	feat(eval): add jade test case and update test categories	2026-01-30 16:29:05 +08:00
cui0711	716d82f4d1	feat: add flexible recording control and improve execution logging	2026-01-30 16:28:13 +08:00
cui0711	47bcfc0f0b	feat(agent): add screenshot compression and dynamic resolution support	2026-01-30 16:28:02 +08:00
cui0711	7e9090e115	fix(prompts): fix template variable syntax and add dynamic resolution	2026-01-30 16:28:02 +08:00
cui0711	308282e830	feat(server): add cross-platform support and improve screenshot handling	2026-01-30 16:27:49 +08:00
cui0711	788b248dbc	fix(logger): add Windows platform support for file locking	2026-01-30 16:27:49 +08:00
alexandruilie7	5463d3bb89	uipath v2 (#413 ) * submission v2 * small updates	2026-01-09 08:47:20 +08:00
蘑菇先生	5ef8bdfa35	EvoCUA Update (2025.01.05) (#412 ) * evocua init * setup max_token * evocua update --------- Co-authored-by: xuetaofeng <xuetaofeng@meituan.com> Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>	2026-01-05 16:14:53 +08:00
Bowen Yang	439e178a2e	fix(os_symphony_evaluation) (#410 ) * fix(os_symphony) * Update desktop_env_os_symphony.py * fix(os_symphony_desktop) * fix(os_symphony_start) * Add docstring to run_multienv_os_symphony.py Added documentation header for the evaluation script.	2026-01-04 15:56:51 +08:00
Bowen Yang	951e1928c8	fix(desktop_os_symphony):support aws (#406 ) * fix(os_symphony) * Update desktop_env_os_symphony.py	2026-01-01 11:27:34 +08:00
				`@@ -0,0 +1 @@`
				`{"avogadro": ["building-organic-molecules_task1"]}`