- Add `fixed_ip` field to all 369 JSON files in examples directory
- Set to `true` for 8 files listed in google_chrome.json multi_apps
- Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
- Set to "low" for newly added fields
- Preserve existing values (4 medium, 2 high) for 6 files that already had this field
This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
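
A minimal sketch of how this migration could be applied to a single example, assuming each example is a JSON object with an `id` key and that the IDs needing a fixed IP have already been collected into a set. The helper name `add_default_fields` and the sample IDs are hypothetical, not taken from the repository.

```python
def add_default_fields(example, fixed_ip_ids):
    """Return a copy of one example config with both fields present."""
    updated = dict(example)
    # fixed_ip is true only for examples listed as needing a fixed IP.
    updated["fixed_ip"] = updated.get("id") in fixed_ip_ids
    # setdefault leaves an existing "medium" or "high" value untouched.
    updated.setdefault("possibility_of_env_change", "low")
    return updated


if __name__ == "__main__":
    fixed = {"example-a"}  # hypothetical IDs
    print(add_default_fields({"id": "example-a"}, fixed))
    print(add_default_fields(
        {"id": "example-b", "possibility_of_env_change": "high"}, fixed))
```

Using `setdefault` is what keeps the six pre-existing values intact while filling in "low" everywhere else.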
- Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080)
- Add os_type parameter to start_emulator() for Azure and VirtualBox providers
- Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers
- Use *args, **kwargs for better extensibility and parameter consistency
- Add documentation comments explaining ignored parameters for interface consistency
- Prevent TypeError exceptions when AWS-specific parameters are passed to other providers
This ensures all providers can handle the same parameter sets while maintaining
backward compatibility and avoiding interface fragmentation.
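
The idea can be sketched as below. The class names, parameter order, and return values are illustrative assumptions; only the method names, the added parameters, and the 1920x1080 default come from the notes above.

```python
class SketchVMManager:
    """Hypothetical manager; the real classes are not shown here."""

    def get_vm_path(self, os_type="Ubuntu", region=None,
                    screen_size=(1920, 1080), *args, **kwargs):
        # region is ignored by this provider; it is accepted only so that
        # every provider can be called with the same arguments.
        width, height = screen_size
        return f"{os_type}-{width}x{height}"


class SketchProvider:
    """Hypothetical provider that tolerates parameters it does not use."""

    def __init__(self):
        self.running = set()

    def start_emulator(self, path_to_vm, headless=False, os_type=None,
                       *args, **kwargs):
        # os_type and any extra keyword arguments are ignored here.
        self.running.add(path_to_vm)

    def stop_emulator(self, path_to_vm, region=None, *args, **kwargs):
        # region is ignored; without it, a caller passing region=... would
        # hit a TypeError on this provider.
        self.running.discard(path_to_vm)
```

A caller can then pass the same arguments to every provider, for example `provider.stop_emulator(path, region="us-east-1")`, without checking which provider it holds.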
pip-installing directly from PyPI fails mysteriously during postconfig
execution, possibly owing to proxy configuration in the VM. Adjusted the
strategy in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77 to download
the wheel on the host and pip-install it locally on the VM.
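
A rough sketch of the two commands involved, assuming the wheel is either pure Python or built for a platform the VM shares with the host. The helper names are hypothetical, and copying the wheel into the VM is left out because the transfer mechanism is not described here.

```python
import sys


def host_download_cmd(package, dest_dir):
    """Command to fetch a wheel (no dependencies) on the host."""
    return [
        sys.executable, "-m", "pip", "download",
        "--no-deps",              # only the named package
        "--only-binary=:all:",    # insist on a wheel, not an sdist
        "--dest", dest_dir,
        package,
    ]


def vm_install_cmd(package, wheel_dir):
    """Command to install from a local directory without contacting PyPI."""
    return [
        "python3", "-m", "pip", "install",
        "--no-index",             # never reach out to the network
        "--find-links", wheel_dir,
        package,
    ]
```

Because `--no-index` keeps pip off the network entirely, the install step no longer depends on the VM's proxy settings.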
* feat: add claude support
* feat: add script for end-to-end evaluation with logging and task distribution
* feat&fix: add tool result handling and update model default in evaluation script
* chore: remove run_test_env.py script
* feat&fix: implement action parsing for tool calls and update default action space
* fix: update text formatting in action parsing and replace logger import
* feat&fix: implement action parsing for tool calls and add screen size handling
* feat: add setup instructions for Anthropic API integration
* feat: add notice about image size limitations for Anthropic API
* Delete test_env/logger.py
* Delete test_env/utils.py
* fix: update logger usage to use global logger and improve error handling
* feat&fix: add configuration management API endpoints and update UI for configuration selection
* feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness
* feat&fix: add configuration toggle button in UI and improve task loading performance
* feat&fix: add accuracy percentage display to score and style updates for UI
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
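
The existence and parse checks follow a pattern roughly like the sketch below. The function name `load_json_checked` is hypothetical and only JSON is covered; the actual code in general.py may differ.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def load_json_checked(path):
    """Return parsed JSON, or None if the file is missing or malformed."""
    if not path or not os.path.exists(path):
        logger.error("File does not exist: %s", path)
        return None
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (ValueError, OSError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
        logger.error("Failed to parse %s: %s", path, exc)
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.json")
        with open(good, "w", encoding="utf-8") as f:
            f.write('{"a": 1}')
        print(load_json_checked(good))
        print(load_json_checked(os.path.join(tmp, "missing.json")))
```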
- Added logging for OCR results and text matching outcomes in compare_image_text function.
- Updated JSON examples to support multiple expected results and improved structure for evaluator functions.
- Enhanced handling of expected text rules to include multiple variations for better matching accuracy.
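
Matching against several expected variations could look like the sketch below. The normalization (lowercasing, collapsing whitespace) and the substring test are assumptions made for illustration, and `matches_any_expected` is a hypothetical helper rather than part of `compare_image_text`.

```python
import logging

logger = logging.getLogger(__name__)


def _normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def matches_any_expected(ocr_text, expected_variants):
    """Return True if any expected variant occurs in the OCR output."""
    haystack = _normalize(ocr_text)
    for variant in expected_variants:
        if _normalize(variant) in haystack:
            logger.info("OCR text matched expected variant %r", variant)
            return True
    logger.info("No expected variant found in OCR text %r", ocr_text)
    return False
```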
* Enhance PPTX comparison logic in slides.py
- Improved alignment comparison to treat None and LEFT as equivalent.
- Added special handling for font bold and italic properties to consider None and False as equivalent.
- Introduced a new bullet comparison function that allows for minor differences and tolerates formatting variations.
- Updated JSON examples to support multiple file comparisons and results.
* fix all fonts json file f23ac
* fix: clean up the shape examination in an irrelevant part (top position check)
* Refactor JSON structure for PPTX comparison
- Updated the instruction formatting for clarity.
- Modified the comparison logic to support multiple expected and result files, enhancing flexibility in evaluations.
- Changed the function key to an array to accommodate multiple comparison functions.
- Introduced a conjunction key to specify logical relationships between comparisons.
* fix impress-e4ef0baf by adding all fonts gold file
* update impress bf4e9888 task instruction
* fix impress b8adbc24 font size
* Enhance PPTX comparison functionality in slides.py
- Introduced a debug logger for detailed output during PPTX comparisons.
- Added a new function to recursively retrieve all text shapes, including those within groups.
- Enabled debug logging to provide insights on slide and shape comparisons.
- Updated JSON examples to support multiple expected and result files for enhanced evaluation flexibility.
* Enable debug logging by default in PPTX comparison and enhance debug output for shape mismatches. Updated JSON examples to support multiple expected and result files for improved evaluation consistency.
* fix impress all-fonts compare file
* Refactor PPTX comparison logic and JSON examples for height modification tasks
- Added critical notes in slides.py to clarify the execution order of shape examination and height modification checks.
- Updated JSON examples to support multiple expected and result files, enhancing evaluation consistency.
- Ensured that examine_shape must be set to False for examine_modify_height to function correctly, preventing premature termination of comparisons.
* Enhance debug logging in PPTX comparison for detailed font attribute mismatches
- Added debug logging for differences in font color, bold, italic, and underline attributes during table cell comparisons.
- Improved clarity of debug output by including specific slide, shape, and cell indices for mismatches.
- Ensured that existing comparison logic remains intact while enhancing debugging capabilities.
* Enhance debug logging for font attribute mismatches in PPTX comparison
- Added detailed debug logging for font name and size mismatches during PPTX comparisons, including specific slide, shape, and paragraph indices.
- Updated JSON examples to support multiple expected and result files, improving evaluation consistency.
- Maintained existing comparison logic while enhancing debugging capabilities.
* fix impress 3161de json file
---------
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
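
The equivalence rules and the recursive shape lookup described in this series can be sketched without python-pptx as follows. Alignment is represented as a plain string here, shapes are duck-typed stand-ins, and all three helper names are hypothetical; the real slides.py works on python-pptx objects.

```python
from types import SimpleNamespace


def alignment_equal(a, b, left="LEFT"):
    """Treat an unset alignment (None) as equivalent to left alignment."""
    return (left if a is None else a) == (left if b is None else b)


def flag_equal(a, b):
    """Treat an unset bold/italic flag (None) as equivalent to False."""
    return bool(a) == bool(b)


def iter_text_shapes(shapes):
    """Yield shapes with a text frame, descending into group shapes."""
    for shape in shapes:
        children = getattr(shape, "shapes", None)
        if children is not None:
            # A group: recurse instead of yielding the group itself.
            yield from iter_text_shapes(children)
        elif getattr(shape, "has_text_frame", False):
            yield shape


if __name__ == "__main__":
    title = SimpleNamespace(name="title", has_text_frame=True)
    picture = SimpleNamespace(name="picture", has_text_frame=False)
    nested = SimpleNamespace(name="nested", has_text_frame=True)
    group = SimpleNamespace(name="group", shapes=[picture, nested])
    print([s.name for s in iter_text_shapes([title, group])])
```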
- Replaced print statements with logging for better traceability in gimp.py.
- Added handling for transparent images in structure checks and size evaluations.
- Updated JSON examples to include delays in pyautogui commands for improved execution reliability.
- Changed image URL in example to a more accessible source.
- Added `pytz` dependency to `requirements.txt` for timezone handling.
- Introduced `get_macys_product_url_parse` function to replace the old `get_url_path_parse` for better clarity while maintaining backward compatibility.
- Enhanced logging throughout the `get_active_tab_html_parse` and `get_rule_relativeTime` functions for improved debugging and traceability.
- Updated JSON examples to reflect changes in expected keys and added new fields for better evaluation context.
- Removed deprecated execution commands from JSON examples to streamline the evaluation process.
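
One common way to keep an old name working after a rename is a thin wrapper, sketched below. Only the two function names come from the notes above; the single `url` argument and the path-splitting body are placeholders, and the real getters likely take different arguments.

```python
from urllib.parse import urlparse


def get_macys_product_url_parse(url):
    """New, more descriptive name (placeholder body for illustration)."""
    return [segment for segment in urlparse(url).path.split("/") if segment]


def get_url_path_parse(url):
    """Old name kept as a thin wrapper so existing callers keep working."""
    return get_macys_product_url_parse(url)
```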