lizhanyuan
252d2f79ce
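fix(eval): fix vllm_eval screenshot sorting bug and align with reeval logic
- Fix a bug in _load_screenshots_from_dir where screenshots were sorted as strings, so step_9 was mistaken for the final frame; sort numerically instead
- Align with reeval.py's prompt logic: explicitly require the model to check the final screenshot first (STEP 1 EXAMINE FINAL SCREENSHOT FIRST)
- Lower the evaluation temperature from 0.7 to 0.2 to improve consistency
- Add batch_reeval.py: batch re-evaluation of existing trajectories based on test_final.json
- Add reeval.py: single-task re-evaluation script (final-frame-anchored evaluation)
- Add avogadro (11 tasks) and origin (8 tasks) to test_final.json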
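
The sorting bug described in the first bullet can be illustrated with a minimal sketch. It is not the actual body of _load_screenshots_from_dir; the file-name pattern `step_N.png` and the helper names `step_index` and `sort_screenshots` are assumptions made for illustration only.

```python
import re
from pathlib import Path


def step_index(path):
    """Return the first integer found in the file stem, or -1 if there is none.

    Hypothetical helper: assumes screenshots are named like step_<N>.png.
    """
    match = re.search(r"(\d+)", Path(path).stem)
    return int(match.group(1)) if match else -1


def sort_screenshots(paths):
    """Sort screenshot paths by numeric step index instead of by string."""
    return sorted(paths, key=step_index)


names = ["step_1.png", "step_10.png", "step_2.png", "step_9.png"]

# String sorting compares character by character, so step_9 ends up last
# even though step_10 is the later frame.
lexicographic = sorted(names)

# Numeric sorting places step_10 last, which is the true final frame.
numeric = sort_screenshots(names)

print("string sort, last frame: ", lexicographic[-1])
print("numeric sort, last frame:", numeric[-1])
```

With string ordering the evaluator would treat step_9.png as the final frame; keying the sort on the parsed integer restores the intended order.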
2026-03-27 14:34:32 +08:00