* ver Jun17th
updating annotations
* ver Jun17th
corrected annotation of 1d17
added check for cell merge
* ver Jun17th
updated several annotations
* ver Jun20th
fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08
* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.
* ver Jun21st
updating calc evals
* ver Jun22nd
fixed an impress task
* ver Jun22ndv2
adjusted several calc tasks
* Clean scaffolds
* ver Jul18th
added two try-excepts to handle possible formula parsing and calculation
failures
---------
Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
* feat&style: add task status configuration and clear cache functionality; enhance UI styles
* feat&refactor: enhance current configuration API and improve cache clearing logic
* refactor&style: simplify task status update logic and improve page refresh mechanism
* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic
* feat&refactor: add caching to default configuration retrieval and streamline task status logic
* feat&style: add collapsible section for additional model parameters and enhance styling for config items
* refactor&style: remove floating action button and clean up related styles
* fix: update video and screenshot sources to include action space, observation type, and model name parameters
- Added robust error handling for page processing, including checks for closed pages and HTTP status codes.
- Implemented retry logic for page loads and active tab checks to improve reliability.
- Enhanced logging throughout the process to capture detailed information about failures and successes.
- Preserved existing logic while ensuring better maintainability and robustness in the Chrome evaluator functions.
- Added logging to the VM wallpaper retrieval function to capture errors and warnings related to content retrieval and file creation.
- Implemented checks for None, empty, and invalid content types to ensure robustness in wallpaper handling.
- Enhanced the SSIM structure check function with size validation and improved error handling for image processing.
- Added logging for image size discrepancies and exceptions during SSIM computation to aid in debugging.
These changes improve error handling and logging, ensuring better maintainability and reliability of the evaluators.
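The None / empty / invalid-type checks described for wallpaper handling could look roughly like the sketch below. The function name `save_wallpaper_content` and its signature are hypothetical and not taken from the evaluator code; the sketch only illustrates the order of the checks and the logging.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def save_wallpaper_content(content: object, dest: str) -> Optional[str]:
    """Hypothetical helper: validate retrieved content, then write it to dest.

    Returns the destination path on success and None on any failure.
    """
    if content is None:
        logger.warning("Wallpaper content is None; nothing to save")
        return None
    if not isinstance(content, (bytes, bytearray)):
        logger.warning("Unexpected wallpaper content type: %s", type(content).__name__)
        return None
    if len(content) == 0:
        logger.warning("Wallpaper content is empty")
        return None
    try:
        with open(dest, "wb") as f:
            f.write(content)
    except OSError as exc:
        logger.error("Failed to write wallpaper to %s: %s", dest, exc)
        return None
    return dest
```

Returning None instead of raising keeps the caller's scoring path simple: a missing wallpaper can be treated as a failed check rather than a crashed evaluation.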
- Implemented retry mechanism for connecting to Chrome instances, allowing up to two attempts before failure.
- Increased timeout settings for page navigation and loading to enhance reliability.
- Added detailed logging for connection attempts, page loading status, and error handling to improve debugging and user experience.
- Ensured existing logic is preserved while enhancing error handling and operational robustness.
These changes improve the overall reliability and maintainability of the Chrome evaluator functions.
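A minimal sketch of the two-attempt retry described above, assuming the connection step can be wrapped in a zero-argument callable. `connect_with_retry` and its parameters are illustrative names, not the functions in the Chrome evaluator.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 2,
    delay_s: float = 0.0,
) -> T:
    """Call connect() up to max_attempts times, logging each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            logger.info("Connection attempt %d/%d", attempt, max_attempts)
            return connect()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            logger.warning("Attempt %d failed: %s", attempt, exc)
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise ConnectionError(f"Failed after {max_attempts} attempts") from last_error
```

In practice `connect` would wrap whatever call attaches to the remote-debugging endpoint; the wrapper itself does not depend on any browser library.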
- Added postconfig commands to multiple JSON files for Chrome evaluation examples.
- Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing.
- Updated logging messages in the AWS manager to improve clarity and user experience.
These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
- Reformatted import statements for better organization.
- Enhanced error handling in file reading functions to provide clearer logging.
- Introduced a new `_safe_read_file` function to handle multiple encoding attempts when reading files.
- Updated the `compare_csv` function to utilize the new file reading method, improving robustness.
- Ensured consistent return values across functions, replacing `0.` with `0.0` for clarity.
These changes maintain existing logic while improving code maintainability and readability.
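The multi-encoding read can be sketched as below. This is not the actual `_safe_read_file`; the name `safe_read_file_sketch`, the encoding list, and the return-None-on-failure convention are assumptions made for illustration.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def safe_read_file_sketch(
    path: str,
    encodings: Sequence[str] = ("utf-8-sig", "latin-1"),
) -> Optional[str]:
    """Try each encoding in order; return the text, or None if all fail."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            logger.debug("Decoding %s as %s failed; trying next encoding", path, enc)
        except OSError as exc:
            # Missing or unreadable file: another encoding will not help.
            logger.error("Cannot read %s: %s", path, exc)
            return None
    logger.error("Could not decode %s with any of %s", path, list(encodings))
    return None
```

`latin-1` decodes any byte sequence, so placing it last makes it a catch-all fallback; whether that is desirable depends on how strictly the comparison should treat mis-encoded files.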
- Added a series of postconfig commands to the evaluator section in the JSON file.
- Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs.
- Introduced sleep intervals to ensure proper execution timing between commands.
This update improves the automation capabilities of the evaluation examples while maintaining existing logic.
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field
This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
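A schema backfill of this kind can be sketched as follows. The function name `add_default_fields`, and the assumption that each task ID equals the JSON file's stem, are illustrative; the script actually used for this change is not shown in the log.

```python
import json
from pathlib import Path
from typing import Iterable


def add_default_fields(
    examples_dir: str,
    fixed_ip_ids: Iterable[str],
    default_env_change: str = "low",
) -> int:
    """Backfill `fixed_ip` and `possibility_of_env_change`; return files updated."""
    fixed = set(fixed_ip_ids)
    updated = 0
    for path in sorted(Path(examples_dir).rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            continue  # skip files that are not a single JSON object
        # Assumption: the file stem is the task ID.
        data["fixed_ip"] = path.stem in fixed
        # setdefault keeps any value (e.g. "medium", "high") that already exists.
        data.setdefault("possibility_of_env_change", default_env_change)
        path.write_text(
            json.dumps(data, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
        )
        updated += 1
    return updated
```

Using `setdefault` is what preserves the pre-existing values while filling in the missing ones.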
- Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080)
- Add os_type parameter to start_emulator() for Azure and VirtualBox providers
- Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers
- Use *args, **kwargs for better extensibility and parameter consistency
- Add documentation comments explaining ignored parameters for interface consistency
- Prevents TypeError exceptions when AWS-specific parameters are passed to other providers
This ensures all providers can handle the same parameter sets while maintaining
backward compatibility and avoiding interface fragmentation.
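The interface-consistency idea can be illustrated with the sketch below. The class names and the in-memory bookkeeping are hypothetical; only the parameter names `os_type` and `region` and the use of `*args, **kwargs` come from the notes above.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional, Set


class ProviderSketch(ABC):
    """Hypothetical common interface shared by all providers."""

    @abstractmethod
    def start_emulator(self, path_to_vm: str, headless: bool = True,
                       os_type: Optional[str] = None, *args: Any, **kwargs: Any) -> None: ...

    @abstractmethod
    def stop_emulator(self, path_to_vm: str, region: Optional[str] = None,
                      *args: Any, **kwargs: Any) -> None: ...


class LocalProviderSketch(ProviderSketch):
    """A provider with no notion of regions; extra parameters are accepted and ignored."""

    def __init__(self) -> None:
        self.running: Set[str] = set()

    def start_emulator(self, path_to_vm: str, headless: bool = True,
                       os_type: Optional[str] = None, *args: Any, **kwargs: Any) -> None:
        # os_type is ignored here; it exists only for interface consistency.
        self.running.add(path_to_vm)

    def stop_emulator(self, path_to_vm: str, region: Optional[str] = None,
                      *args: Any, **kwargs: Any) -> None:
        # region is ignored here; cloud-specific callers may still pass it.
        self.running.discard(path_to_vm)
```

With this shape, a caller written against a cloud provider can pass `region=...` to any provider without raising `TypeError`.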
pip-installing directly from PyPI fails mysteriously during postconfig
execution, possibly owing to proxy configuration in the VM; adjusted the
strategy to download the wheel on the host and pip-install it locally
on the VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77
* feat: add claude support
* feat: add script for end-to-end evaluation with logging and task distribution
* feat&fix: add tool result handling and update model default in evaluation script
* chore: remove run_test_env.py script
* feat&fix: implement action parsing for tool calls and update default action space
* fix: update text formatting in action parsing and replace logger import
* feat&fix: implement action parsing for tool calls and add screen size handling
* feat: add setup instructions for Anthropic API integration
* feat: add notice about image size limitations for Anthropic API
* Delete test_env/logger.py
* Delete test_env/utils.py
* fix: update logger usage to use global logger and improve error handling
* feat&fix: add configuration management API endpoints and update UI for configuration selection
* feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness
* feat&fix: add configuration toggle button in UI and improve task loading performance
* feat&fix: add accuracy percentage display to score and style updates for UI