fix(eval): fix vllm_eval screenshot ordering bug and align reeval logic

- Fix bug in _load_screenshots_from_dir where lexicographic string sorting caused step_9 to be misidentified as the final frame; sort numerically by step number instead
- Align with reeval.py's prompt logic: explicitly require the model to examine the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower evaluation temperature from 0.7 to 0.2 for better consistency
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- Add avogadro (11 tasks) and origin (8 tasks) to test_final.json
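The string-sort bug described in the first bullet is easy to reproduce in a few lines (a minimal sketch; the filenames are illustrative, not taken from a real trajectory):

```python
import re

# Illustrative screenshot names; real files also carry a timestamp suffix
names = ["step_1.png", "step_2.png", "step_9.png", "step_10.png", "step_11.png"]

# Plain string sort: "step_10" < "step_2" lexicographically, so step_9 lands last
# and would be treated as the final frame
print(sorted(names)[-1])  # step_9.png

# Sorting on the extracted step number recovers the true final frame
numeric = sorted(names, key=lambda p: int(re.search(r"step_(\d+)", p).group(1)))
print(numeric[-1])  # step_11.png
```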
2026-03-27 14:25:45 +08:00
parent 4e192cf013
commit 252d2f79ce
5 changed files with 434 additions and 94 deletions


@@ -347,8 +347,13 @@ def _load_screenshots_from_dir(result_dir: str, compress: bool = False, max_size
     filenames = []
     # Find all step screenshot files (e.g., step_1_20240101@120000.png)
+    # Sort numerically by step number to avoid lexicographic issues (step_10 < step_2 in string sort)
+    import re as _re_sort
     pattern = os.path.join(result_dir, "step_*.png")
-    screenshot_files = sorted(glob.glob(pattern))
+    screenshot_files = sorted(
+        glob.glob(pattern),
+        key=lambda p: int(_re_sort.search(r"step_(\d+)", os.path.basename(p)).group(1))
+    )
     if not screenshot_files:
         logger.warning(f"No screenshot files found in {result_dir}")
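One caveat with the new sort key: it assumes every `step_*.png` glob match contains `step_<digits>`. A stray file such as `step_final.png` would make `re.search` return None and the lambda raise AttributeError. A defensive variant (a sketch for illustration, not the committed code) could sort unmatched names first instead of crashing:

```python
import os
import re

def step_sort_key(path: str) -> int:
    """Extract the numeric step index; names without one sort to the front."""
    m = re.search(r"step_(\d+)", os.path.basename(path))
    return int(m.group(1)) if m else -1

# "step_thumbnail.png" has no numeric index, so it sorts first rather than raising
files = ["step_10_x.png", "step_2_y.png", "step_thumbnail.png"]
print(sorted(files, key=step_sort_key)[-1])  # step_10_x.png
```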
@@ -446,7 +451,7 @@ def vllm_eval(result_state, **options) -> float:
     metadata = options.get("metadata", {})
     params = {
-        "temperature": options.get("temperature", 0.7),
+        "temperature": options.get("temperature", 0.2),
         "max_tokens": options.get("max_tokens", 16384),
         "top_p": options.get("top_p", 1.0)
     }
@@ -493,51 +498,49 @@ IMPORTANT: Only reference screenshots from the list above. Do NOT reference any
     else:
         img_info = "\nNo screenshots were provided."
-    prompt = f"""You are a STRICT and RIGOROUS evaluator for desktop environment tasks. Your job is to score ONLY based on concrete, visible evidence of task completion in the screenshots.
-Task Instruction: {instruction}
+    final_name = screenshot_filenames[-1] if screenshot_filenames else "N/A"
+    prompt = f"""You are a STRICT evaluator for desktop GUI agent tasks.
+Task: {instruction}
 {preconfig_section}
 {expected_steps_section}
 {img_info}
-Analyze ONLY the FINAL screenshot ({screenshot_filenames[-1] if screenshot_filenames else 'N/A'}) to determine the end state, while using earlier screenshots for context.
+════════════════════════════════════════════════════
+STEP 1 — EXAMINE THE FINAL SCREENSHOT FIRST
+The LAST image provided is "{final_name}" — this is the FINAL STATE of the agent's session.
+Look at this image carefully NOW before anything else. Ask yourself:
+"Does this final screenshot show the task is complete?"
+Only after answering that, use earlier screenshots to understand HOW the agent got there.
+════════════════════════════════════════════════════
-CRITICAL SCORING RULES:
-1. Score ONLY based on what the AGENT actually accomplished. The pre-configured environment (application already launched, files already opened, etc.) is the STARTING STATE and worth 0 points.
-2. Score ONLY based on what is ACTUALLY VISIBLE in the screenshots. Do NOT give credit for assumed or potential progress.
-3. If the screenshots show NO meaningful action beyond the initial pre-configured state, the score MUST be 0.
-4. Do NOT give partial credit for "having the system on", "desktop being visible", "the application being open" (if it was pre-launched), or "the application being installed". These are prerequisites or pre-configured state, NOT progress.
-5. Each point must correspond to a SPECIFIC, VERIFIABLE action that was successfully completed BY THE AGENT toward the task goal.
+SCORING RULES:
+1. Base your final_completion and score PRIMARILY on "{final_name}" (the last image).
+2. Credit ONLY actions performed BY THE AGENT (not pre-configured setup).
+3. Require VISIBLE evidence in a specific screenshot for each step.
+4. If the final screenshot shows the task is done, score high even if earlier steps were messy.
-SCORING GUIDE (0-10):
-- 0: No progress beyond the pre-configured starting state. If the app was pre-launched, merely having it open is 0. If the screenshots only show the desktop or the initial app state without any agent action, score is 0.
-- 1-2: The agent performed one minor action (e.g., clicked on a menu) but did not make meaningful progress toward the task goal.
-- 3-4: Some initial steps toward the task have been taken but the task is far from complete.
-- 5-6: Significant progress - about half the required steps are completed with visible evidence.
-- 7-8: Most steps are completed but the final result is not fully achieved or has minor issues.
-- 9: The task is essentially complete with very minor cosmetic differences.
-- 10: The task is perfectly and completely finished with clear evidence in the final screenshot.
+SCORE GUIDE (0-10):
+- 0: No agent progress; only pre-configured state visible.
+- 1-3: Minor actions taken, far from goal.
+- 4-6: Meaningful progress, roughly half done.
+- 7-8: Most steps done, minor issues.
+- 9: Essentially complete, cosmetic differences only.
+- 10: Fully and perfectly complete with clear visual proof in the final screenshot.
-IMPORTANT: You must respond with ONLY a valid JSON object (no additional text before or after). Use the following exact format:
+Respond with ONLY valid JSON (no extra text):
 {{
+"final_screenshot": "{final_name}",
+"final_screenshot_description": "Describe exactly what you see in {final_name}",
+"task_complete_in_final": true/false,
 "steps_analysis": [
-    {{"step": "Step description", "status": "Success/Fail", "evidence_img": "step_X.png", "reason": "Brief explanation of VISIBLE evidence"}},
-    {{"step": "Another step", "status": "Success/Fail", "evidence_img": "step_Y.png", "reason": "Brief explanation of VISIBLE evidence"}}
+    {{"step": "...", "status": "Success/Fail", "evidence_img": "step_X", "reason": "..."}}
 ],
 "final_completion": "True/False",
 "score": 0-10
-}}
-Where:
-- "steps_analysis": Array of steps you identified from the screenshots. Each step must cite VISIBLE evidence from a specific screenshot. Do NOT include pre-configured actions as agent steps.
-- "status": Either "Success" or "Fail" for each step
-- "evidence_img": The screenshot filename that shows evidence for this step (e.g., "step_2.png")
-- "reason": Explanation of what is VISUALLY observed in the screenshot as evidence
-- "final_completion": "True" ONLY if the overall task is fully completed with clear visual proof, "False" otherwise
-- "score": Integer from 0 to 10, following the strict scoring guide above
-Remember: Return ONLY the JSON object, no additional text. Be STRICT - when in doubt, score LOWER."""
+}}"""
     try:
         result = llm.generate_with_images(
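Since the prompt insists on a bare JSON object, the caller presumably parses the model output downstream; models sometimes wrap JSON in a code fence anyway, so a tolerant parser can avoid spurious failures. A hypothetical helper (an illustration, not part of this commit; `llm.generate_with_images` and the response shape are assumptions):

```python
import json
import re

def parse_eval_response(text: str) -> dict:
    """Extract the evaluator's JSON object, tolerating a stray ```json fence."""
    # Strip a ```json ... ``` fence if the model added one despite instructions
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    raw = fenced.group(1) if fenced else text
    # Fall back to the outermost brace pair to drop any surrounding chatter
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])

resp = '```json\n{"final_completion": "True", "score": 9}\n```'
print(parse_eval_response(resp)["score"])  # 9
```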