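fix(eval): fix screenshot sorting bug in vllm_eval and align with reeval logic

- Fix a bug in _load_screenshots_from_dir where screenshots were sorted as strings, so step_9 was mistaken for the final frame; sort numerically instead
- Align with the prompt logic in reeval.py: explicitly require the model to check the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower the evaluation temperature from 0.7 to 0.2 to improve consistency
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- Add avogadro (11 tasks) and origin (8 tasks) to test_final.json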
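
The sorting fix described above can be illustrated with a minimal sketch. This is not the code from vllm_eval; the helper names `step_index` and `sort_screenshots` and the `step_N.png` file naming are assumptions made for illustration only.

```python
import re
from pathlib import Path

# Hypothetical helper: matches the first run of digits in a file name.
_STEP_RE = re.compile(r"(\d+)")


def step_index(name: str) -> int:
    """Return the first integer in the file stem, or -1 if there is none."""
    match = _STEP_RE.search(Path(name).stem)
    return int(match.group(1)) if match else -1


def sort_screenshots(names):
    """Sort screenshot names by step number instead of by string value."""
    return sorted(names, key=step_index)


# Assumed naming scheme: step_<N>.png
names = ["step_10.png", "step_9.png", "step_2.png", "step_1.png"]

# Plain string sorting compares character by character, so "step_9" sorts
# after "step_10" and is wrongly treated as the final frame.
print(sorted(names)[-1])            # step_9.png

# Numeric sorting picks the true last step.
print(sort_screenshots(names)[-1])  # step_10.png
```

Files without any digits sort first under this key, which keeps them from being picked as the final frame.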
@@ -26,7 +26,7 @@
],
"origin": [
"Origin_User_Guide_2025b_E_task2",
"Origin_User_Guide_2025b_E_task3",
"Origin_User_Guide_2025b_E_task3",
"Origin_User_Guide_2025b_E_task4",
"Origin_User_Guide_2025b_E_task5",
"Origin_User_Guide_2025b_E_task8",
@@ -70,4 +70,4 @@
"viewports_task10",
"viewports_task11"
]
}
}