fix(eval): fix screenshot-ordering bug in vllm_eval and align with reeval logic

- Fix the bug in _load_screenshots_from_dir where sorting screenshots as strings caused step_9 to be misidentified as the final frame; sort numerically by step number instead
- Align the prompt with reeval.py: explicitly instruct the model to examine the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower the evaluation temperature from 0.7 to 0.2 for more consistent scoring
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- Add avogadro (11 tasks) and origin (8 tasks) to test_final.json
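The lexicographic-sort pitfall described in the first bullet can be sketched as follows; the filenames here are illustrative, not taken from an actual run:

```python
# String sort places "step_10" before "step_2", so the true final frame
# can be misidentified; sorting by the parsed step number fixes it.
import re

files = ["step_10.png", "step_2.png", "step_9.png", "step_1.png"]

lexicographic = sorted(files)
numeric = sorted(files, key=lambda p: int(re.search(r"step_(\d+)", p).group(1)))

print(lexicographic[-1])  # step_9.png  -- wrongly treated as the final frame
print(numeric[-1])        # step_10.png -- the actual final frame
```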
@@ -347,8 +347,13 @@ def _load_screenshots_from_dir(result_dir: str, compress: bool = False, max_size
     filenames = []

     # Find all step screenshot files (e.g., step_1_20240101@120000.png)
+    # Sort numerically by step number to avoid lexicographic issues (step_10 < step_2 in string sort)
+    import re as _re_sort
     pattern = os.path.join(result_dir, "step_*.png")
-    screenshot_files = sorted(glob.glob(pattern))
+    screenshot_files = sorted(
+        glob.glob(pattern),
+        key=lambda p: int(_re_sort.search(r"step_(\d+)", os.path.basename(p)).group(1))
+    )

     if not screenshot_files:
         logger.warning(f"No screenshot files found in {result_dir}")
@@ -446,7 +451,7 @@ def vllm_eval(result_state, **options) -> float:
     metadata = options.get("metadata", {})

     params = {
-        "temperature": options.get("temperature", 0.7),
+        "temperature": options.get("temperature", 0.2),
         "max_tokens": options.get("max_tokens", 16384),
         "top_p": options.get("top_p", 1.0)
     }
@@ -493,51 +498,49 @@ IMPORTANT: Only reference screenshots from the list above. Do NOT reference any
     else:
         img_info = "\nNo screenshots were provided."

-    prompt = f"""You are a STRICT and RIGOROUS evaluator for desktop environment tasks. Your job is to score ONLY based on concrete, visible evidence of task completion in the screenshots.
-
-Task Instruction: {instruction}
+    final_name = screenshot_filenames[-1] if screenshot_filenames else "N/A"
+
+    prompt = f"""You are a STRICT evaluator for desktop GUI agent tasks.
+
+Task: {instruction}
 {preconfig_section}
 {expected_steps_section}
 {img_info}

-Analyze ONLY the FINAL screenshot ({screenshot_filenames[-1] if screenshot_filenames else 'N/A'}) to determine the end state, while using earlier screenshots for context.
-
-CRITICAL SCORING RULES:
-1. Score ONLY based on what the AGENT actually accomplished. The pre-configured environment (application already launched, files already opened, etc.) is the STARTING STATE and worth 0 points.
-2. Score ONLY based on what is ACTUALLY VISIBLE in the screenshots. Do NOT give credit for assumed or potential progress.
-3. If the screenshots show NO meaningful action beyond the initial pre-configured state, the score MUST be 0.
-4. Do NOT give partial credit for "having the system on", "desktop being visible", "the application being open" (if it was pre-launched), or "the application being installed". These are prerequisites or pre-configured state, NOT progress.
-5. Each point must correspond to a SPECIFIC, VERIFIABLE action that was successfully completed BY THE AGENT toward the task goal.
+════════════════════════════════════════════════════
+STEP 1 — EXAMINE THE FINAL SCREENSHOT FIRST
+The LAST image provided is "{final_name}" — this is the FINAL STATE of the agent's session.
+Look at this image carefully NOW before anything else. Ask yourself:
+"Does this final screenshot show the task is complete?"
+Only after answering that, use earlier screenshots to understand HOW the agent got there.
+════════════════════════════════════════════════════
+
+SCORING RULES:
+1. Base your final_completion and score PRIMARILY on "{final_name}" (the last image).
+2. Credit ONLY actions performed BY THE AGENT (not pre-configured setup).
+3. Require VISIBLE evidence in a specific screenshot for each step.
+4. If the final screenshot shows the task is done, score high even if earlier steps were messy.

-SCORING GUIDE (0-10):
-- 0: No progress beyond the pre-configured starting state. If the app was pre-launched, merely having it open is 0. If the screenshots only show the desktop or the initial app state without any agent action, score is 0.
-- 1-2: The agent performed one minor action (e.g., clicked on a menu) but did not make meaningful progress toward the task goal.
-- 3-4: Some initial steps toward the task have been taken but the task is far from complete.
-- 5-6: Significant progress - about half the required steps are completed with visible evidence.
-- 7-8: Most steps are completed but the final result is not fully achieved or has minor issues.
-- 9: The task is essentially complete with very minor cosmetic differences.
-- 10: The task is perfectly and completely finished with clear evidence in the final screenshot.
+SCORE GUIDE (0-10):
+- 0: No agent progress; only pre-configured state visible.
+- 1-3: Minor actions taken, far from goal.
+- 4-6: Meaningful progress, roughly half done.
+- 7-8: Most steps done, minor issues.
+- 9: Essentially complete, cosmetic differences only.
+- 10: Fully and perfectly complete with clear visual proof in the final screenshot.

-IMPORTANT: You must respond with ONLY a valid JSON object (no additional text before or after). Use the following exact format:
+Respond with ONLY valid JSON (no extra text):

 {{
+  "final_screenshot": "{final_name}",
+  "final_screenshot_description": "Describe exactly what you see in {final_name}",
+  "task_complete_in_final": true/false,
   "steps_analysis": [
-    {{"step": "Step description", "status": "Success/Fail", "evidence_img": "step_X.png", "reason": "Brief explanation of VISIBLE evidence"}},
-    {{"step": "Another step", "status": "Success/Fail", "evidence_img": "step_Y.png", "reason": "Brief explanation of VISIBLE evidence"}}
+    {{"step": "...", "status": "Success/Fail", "evidence_img": "step_X", "reason": "..."}}
   ],
   "final_completion": "True/False",
   "score": 0-10
-}}
-
-Where:
-- "steps_analysis": Array of steps you identified from the screenshots. Each step must cite VISIBLE evidence from a specific screenshot. Do NOT include pre-configured actions as agent steps.
-- "status": Either "Success" or "Fail" for each step
-- "evidence_img": The screenshot filename that shows evidence for this step (e.g., "step_2.png")
-- "reason": Explanation of what is VISUALLY observed in the screenshot as evidence
-- "final_completion": "True" ONLY if the overall task is fully completed with clear visual proof, "False" otherwise
-- "score": Integer from 0 to 10, following the strict scoring guide above
-
-Remember: Return ONLY the JSON object, no additional text. Be STRICT - when in doubt, score LOWER."""
+}}"""

     try:
         result = llm.generate_with_images(
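The new prompt demands a bare JSON object with a `score` field from 0 to 10. A minimal sketch of consuming such a reply might look like the following; `parse_eval_response` is a hypothetical helper, not part of this commit, and it assumes the model obeyed "Respond with ONLY valid JSON", falling back to 0 otherwise:

```python
# Hypothetical consumer of the evaluator's strict-JSON reply.
# Clamps the 0-10 score and normalizes it to the 0.0-1.0 range
# that vllm_eval returns.
import json

def parse_eval_response(raw: str) -> float:
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return 0.0  # model did not return pure JSON
    try:
        score = float(data.get("score", 0))
    except (TypeError, ValueError):
        return 0.0  # "score" was not numeric
    return max(0.0, min(10.0, score)) / 10.0

reply = '{"final_screenshot": "step_10.png", "task_complete_in_final": true, "final_completion": "True", "score": 9}'
print(parse_eval_response(reply))  # 0.9
```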