fix(eval): fix vllm_eval screenshot ordering bug and align reeval logic

- Fix bug in _load_screenshots_from_dir where lexicographic string sorting caused step_9 to be misidentified as the final frame; sort numerically by step number instead
- Align with reeval.py's prompt logic: explicitly require the model to examine the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower evaluation temperature from 0.7 to 0.2 for better consistency
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- Add avogadro (11 tasks) and origin (8 tasks) to test_final.json
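The string-sort bug described in the first bullet is easy to reproduce in a few lines (a minimal sketch; the filenames are illustrative, not taken from a real trajectory):

```python
import re

# Illustrative screenshot names; real files also carry a timestamp suffix
names = ["step_1.png", "step_2.png", "step_9.png", "step_10.png", "step_11.png"]

# Plain string sort: "step_10" < "step_2" lexicographically, so step_9 lands last
# and would be treated as the final frame
print(sorted(names)[-1])  # step_9.png

# Sorting on the extracted step number recovers the true final frame
numeric = sorted(names, key=lambda p: int(re.search(r"step_(\d+)", p).group(1)))
print(numeric[-1])  # step_11.png
```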
2026-03-27 14:25:45 +08:00
parent 4e192cf013
commit 252d2f79ce
5 changed files with 434 additions and 94 deletions


@@ -347,8 +347,13 @@ def _load_screenshots_from_dir(result_dir: str, compress: bool = False, max_size
     filenames = []
     # Find all step screenshot files (e.g., step_1_20240101@120000.png)
+    # Sort numerically by step number to avoid lexicographic issues (step_10 < step_2 in string sort)
+    import re as _re_sort
     pattern = os.path.join(result_dir, "step_*.png")
-    screenshot_files = sorted(glob.glob(pattern))
+    screenshot_files = sorted(
+        glob.glob(pattern),
+        key=lambda p: int(_re_sort.search(r"step_(\d+)", os.path.basename(p)).group(1))
+    )
     if not screenshot_files:
         logger.warning(f"No screenshot files found in {result_dir}")
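One caveat with the new sort key: it assumes every `step_*.png` glob match contains `step_<digits>`. A stray file such as `step_final.png` would make `re.search` return None and the lambda raise AttributeError. A defensive variant (a sketch for illustration, not the committed code) could sort unmatched names first instead of crashing:

```python
import os
import re

def step_sort_key(path: str) -> int:
    """Extract the numeric step index; names without one sort to the front."""
    m = re.search(r"step_(\d+)", os.path.basename(path))
    return int(m.group(1)) if m else -1

# "step_thumbnail.png" has no numeric index, so it sorts first rather than raising
files = ["step_10_x.png", "step_2_y.png", "step_thumbnail.png"]
print(sorted(files, key=step_sort_key)[-1])  # step_10_x.png
```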
@@ -446,7 +451,7 @@ def vllm_eval(result_state, **options) -> float:
     metadata = options.get("metadata", {})
     params = {
-        "temperature": options.get("temperature", 0.7),
+        "temperature": options.get("temperature", 0.2),
         "max_tokens": options.get("max_tokens", 16384),
         "top_p": options.get("top_p", 1.0)
     }
@@ -493,51 +498,49 @@ IMPORTANT: Only reference screenshots from the list above. Do NOT reference any
     else:
         img_info = "\nNo screenshots were provided."
-    prompt = f"""You are a STRICT and RIGOROUS evaluator for desktop environment tasks. Your job is to score ONLY based on concrete, visible evidence of task completion in the screenshots.
-Task Instruction: {instruction}
+    final_name = screenshot_filenames[-1] if screenshot_filenames else "N/A"
+    prompt = f"""You are a STRICT evaluator for desktop GUI agent tasks.
+Task: {instruction}
 {preconfig_section}
 {expected_steps_section}
 {img_info}
-Analyze ONLY the FINAL screenshot ({screenshot_filenames[-1] if screenshot_filenames else 'N/A'}) to determine the end state, while using earlier screenshots for context.
+════════════════════════════════════════════════════
+STEP 1 — EXAMINE THE FINAL SCREENSHOT FIRST
+The LAST image provided is "{final_name}" — this is the FINAL STATE of the agent's session.
+Look at this image carefully NOW before anything else. Ask yourself:
+"Does this final screenshot show the task is complete?"
+Only after answering that, use earlier screenshots to understand HOW the agent got there.
+════════════════════════════════════════════════════
-CRITICAL SCORING RULES:
-1. Score ONLY based on what the AGENT actually accomplished. The pre-configured environment (application already launched, files already opened, etc.) is the STARTING STATE and worth 0 points.
-2. Score ONLY based on what is ACTUALLY VISIBLE in the screenshots. Do NOT give credit for assumed or potential progress.
-3. If the screenshots show NO meaningful action beyond the initial pre-configured state, the score MUST be 0.
-4. Do NOT give partial credit for "having the system on", "desktop being visible", "the application being open" (if it was pre-launched), or "the application being installed". These are prerequisites or pre-configured state, NOT progress.
-5. Each point must correspond to a SPECIFIC, VERIFIABLE action that was successfully completed BY THE AGENT toward the task goal.
+SCORING RULES:
+1. Base your final_completion and score PRIMARILY on "{final_name}" (the last image).
+2. Credit ONLY actions performed BY THE AGENT (not pre-configured setup).
+3. Require VISIBLE evidence in a specific screenshot for each step.
+4. If the final screenshot shows the task is done, score high even if earlier steps were messy.
-SCORING GUIDE (0-10):
-- 0: No progress beyond the pre-configured starting state. If the app was pre-launched, merely having it open is 0. If the screenshots only show the desktop or the initial app state without any agent action, score is 0.
-- 1-2: The agent performed one minor action (e.g., clicked on a menu) but did not make meaningful progress toward the task goal.
-- 3-4: Some initial steps toward the task have been taken but the task is far from complete.
-- 5-6: Significant progress - about half the required steps are completed with visible evidence.
-- 7-8: Most steps are completed but the final result is not fully achieved or has minor issues.
-- 9: The task is essentially complete with very minor cosmetic differences.
-- 10: The task is perfectly and completely finished with clear evidence in the final screenshot.
+SCORE GUIDE (0-10):
+- 0: No agent progress; only pre-configured state visible.
+- 1-3: Minor actions taken, far from goal.
+- 4-6: Meaningful progress, roughly half done.
+- 7-8: Most steps done, minor issues.
+- 9: Essentially complete, cosmetic differences only.
+- 10: Fully and perfectly complete with clear visual proof in the final screenshot.
-IMPORTANT: You must respond with ONLY a valid JSON object (no additional text before or after). Use the following exact format:
+Respond with ONLY valid JSON (no extra text):
 {{
+"final_screenshot": "{final_name}",
+"final_screenshot_description": "Describe exactly what you see in {final_name}",
+"task_complete_in_final": true/false,
 "steps_analysis": [
-    {{"step": "Step description", "status": "Success/Fail", "evidence_img": "step_X.png", "reason": "Brief explanation of VISIBLE evidence"}},
-    {{"step": "Another step", "status": "Success/Fail", "evidence_img": "step_Y.png", "reason": "Brief explanation of VISIBLE evidence"}}
+    {{"step": "...", "status": "Success/Fail", "evidence_img": "step_X", "reason": "..."}}
 ],
 "final_completion": "True/False",
 "score": 0-10
-}}
-Where:
-- "steps_analysis": Array of steps you identified from the screenshots. Each step must cite VISIBLE evidence from a specific screenshot. Do NOT include pre-configured actions as agent steps.
-- "status": Either "Success" or "Fail" for each step
-- "evidence_img": The screenshot filename that shows evidence for this step (e.g., "step_2.png")
-- "reason": Explanation of what is VISUALLY observed in the screenshot as evidence
-- "final_completion": "True" ONLY if the overall task is fully completed with clear visual proof, "False" otherwise
-- "score": Integer from 0 to 10, following the strict scoring guide above
-Remember: Return ONLY the JSON object, no additional text. Be STRICT - when in doubt, score LOWER."""
+}}"""
     try:
         result = llm.generate_with_images(
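Since the prompt insists on a bare JSON object, the caller presumably parses the model output downstream; models sometimes wrap JSON in a code fence anyway, so a tolerant parser can avoid spurious failures. A hypothetical helper (an illustration, not part of this commit; `llm.generate_with_images` and the response shape are assumptions):

```python
import json
import re

def parse_eval_response(text: str) -> dict:
    """Extract the evaluator's JSON object, tolerating a stray ```json fence."""
    # Strip a ```json ... ``` fence if the model added one despite instructions
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    raw = fenced.group(1) if fenced else text
    # Fall back to the outermost brace pair to drop any surrounding chatter
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])

resp = '```json\n{"final_completion": "True", "score": 9}\n```'
print(parse_eval_response(resp)["score"])  # 9
```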