pip-installing directly from PyPI fails misteriously in postconfig
execution, possible owing to proxy configuration in the VM, adjusted
strategy by downloading the wheel on host and pip-installing it locally
on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77
* feat: add claude support
* feat: add script for end-to-end evaluation with logging and task distribution
* feat&fix: add tool result handling and update model default in evaluation script
* chore: remove run_test_env.py script
* feat&fix: implement action parsing for tool calls and update default action space
* fix: update text formatting in action parsing and replace logger import
* feat&fix: implement action parsing for tool calls and add screen size handling
* feat: add setup instructions for Anthropic API integration
* feat: add notice about image size limitations for Anthropic API
* Delete test_env/logger.py
* Delete test_env/utils.py
* fix: update logger usage to use global logger and improve error handling
* feat&fix: add configuration management API endpoints and update UI for configuration selection
* feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness
* feat&fix: add configuration toggle button in UI and improve task loading performance
* feat&fix: add accuracy percentage display to score and style updates for UI
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
* feat: add claude support
* feat: add script for end-to-end evaluation with logging and task distribution
* feat&fix: add tool result handling and update model default in evaluation script
* chore: remove run_test_env.py script
* feat&fix: implement action parsing for tool calls and update default action space
* fix: update text formatting in action parsing and replace logger import
* feat&fix: implement action parsing for tool calls and add screen size handling
* feat: add setup instructions for Anthropic API integration
* feat: add notice about image size limitations for Anthropic API
* Delete test_env/logger.py
* Delete test_env/utils.py
- Added logging for OCR results and text matching outcomes in compare_image_text function.
- Updated JSON examples to support multiple expected results and improved structure for evaluator functions.
- Enhanced handling of expected text rules to include multiple variations for better matching accuracy.