sci-gui-agent-benchmark

Author	SHA1	Message	Date
Adam Yanxiao Zhao	aa05f6cc26	Add AutoGLM-OS agent (#309) * autoglm-os initialize * clean code * chore: use proxy for download setup * feat(autoglm-os): add parameter to toggle images * fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel * update * add client_password * update multienv * fix * fix prompt * fix prompt * fix prompt * fix sys prompt * feat: use proxy in file evaluator * fix client_password * fix note_prompt * fix autoglm agent cmd type * fix * revert: fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel reverts commit bab5473eea1de0e61b0e1d68b23ce324a5b0ee57 * feat(autoglm): setup tools * fix(autoglm): remove second time of get a11y tree * add osworld server restart * Revert "add osworld server restart" This reverts commit 7bd9d84122e246ce2a26de0e49c25494244c2b3d. * fix _launch_setup * fix autoglm agent tools & xml tree * fix desktop_env * fix bug for tool name capitalization * fix: always use proxy for setup download * add fail after exceeding max turns * fix(autoglm): avoid adding image to message when screenshot is empty * fix maximize_window * fix maximize_window * fix maximize_window * fix import browsertools module bug * fix task proxy config bug * restore setup * refactor desktop env * restore image in provider * restore file.py * refactor desktop_env * quick fix * refactor desktop_env.step * fix our env reset * add max turns constraint * clean run script * clean lib_run_single.py --------- Co-authored-by: hanyullai <hanyullai@outlook.com> Co-authored-by: JingBh <jingbohao@yeah.net>	2025-08-17 12:08:40 +08:00
Timothyxxx	7fb5860da0	feat: enhance run_coact.py and related agents with improved task handling and configuration - Updated TASK_DESCRIPTION in run_coact.py to clarify task-solving steps and requirements. - Modified configuration parameters for provider name and client password for better security and flexibility. - Enhanced OrchestratorUserProxyAgent to include user instruction in the auto-reply and improved screenshot handling. - Adjusted coding_agent.py to ensure proper verification of results before saving changes. - Improved CUA agent prompts to maintain application state and handle user instructions more effectively. - Ensured existing code logic remains unchanged while enhancing functionality and usability.	2025-08-13 09:04:09 +00:00
Timothyxxx	d2ae0f697d	feat: enhance AnthropicAgent with start_coordinate handling and modifier key support - Added support for an optional start_coordinate parameter to facilitate drag actions from a specified starting point. - Implemented validation for start_coordinate to ensure it is a tuple of two integers. - Enhanced click actions to handle modifier keys, allowing for more complex interactions. - Ensured existing code logic remains unchanged while improving functionality and usability.	2025-08-12 05:34:18 +00:00
yuanmengqi	84f407afdd	feat: enhance run_coact.py with logging and configuration options - Added logging configuration to capture runtime logs in both file and console with adjustable log levels. - Introduced new command-line arguments for provider name, region, and client password to improve flexibility and security. - Updated process_task function to accommodate new parameters, ensuring compatibility with existing logic. - Modified prompt templates in coding_agent.py and cua_agent.py to use the client password placeholder for enhanced security.	2025-07-31 05:47:58 +00:00
Yuan Mengqi	239dd37d2e	clean claude run code (#293) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude * add nogdrive json * merge claude code * clean code claude run * clean code claude run * clean code claude run	2025-07-31 12:09:08 +08:00
Linxin Song	b968155757	CoACT initialize (#292)	2025-07-31 10:35:20 +08:00
Xinyuan Wang	862d704b8c	Wxy/opencua (#290) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines * update parallel; clean code; use sleep 3s * ui-tars-0717 * update detail * add system password to system prompt * add running command	2025-07-31 08:53:49 +08:00
Xinyuan Wang	3d32556085	Uitars/dev (#291) * use aws pub ip * os task fix: set the default dim screen time to be 300s * add all the uitars agents: 1. run_multienv_uitars.py: Qwen2VL-based UITARS models 2. run_multienv_uitars15_v1.py: UITARS1.5-7B 3. run_multienv_uitars15_v2.py: SeedVL1.5 thinking/non-thinking --------- Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>	2025-07-31 08:52:27 +08:00
MillanK0817	4ae9d41da4	feat: update jedi agent with support for o3 as planner	2025-07-30 14:06:37 +08:00
yuanmengqi	0f00788c4d	feat: add run_multienv_o3.py script for multi-environment evaluation - Introduced a new script `run_multienv_o3.py` to facilitate end-to-end evaluation across multiple environments. - Implemented command-line argument parsing for various configurations including environment settings, logging levels, and AWS parameters. - Integrated signal handling for graceful shutdown of environments and processes. - Enhanced logging capabilities for better traceability during execution. - Maintained existing logic from previous scripts while introducing new functionalities for improved evaluation processes.	2025-07-27 16:47:24 +00:00
yuanmengqi	523d553e88	feat: add client password argument to multiple agents and scripts - Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility. - Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration. - Modified evaluation guidelines to reflect the new client password requirement. - Ensured existing logic remains intact while enhancing functionality for better user experience.	2025-07-27 16:11:23 +00:00
yuanmengqi	b25854edba	feat: introduce DummyAgent class for enhanced coordinate handling - Added DummyAgent class to facilitate coordinate generation and action assignment. - Updated GTA1Agent to utilize DummyAgent for improved planning and execution. - Increased max_steps and N_SEQ parameters for better performance. - Enhanced logging for planning and execution processes. - Maintained existing logic while integrating new functionality.	2025-07-26 08:26:23 +00:00
yuanmengqi	73caf53880	delete: remove img_utils.py and update imports in jedi_3b_agent.py and jedi_7b_agent.py to use qwen_vl_utils	2025-07-26 07:28:31 +00:00
yuanmengqi	f5595df71c	delete: remove gat1_agent.py file	2025-07-25 07:11:55 +00:00
张逸群	bf78b6d05e	Add OPENAI_BASE_URL support for custom OpenAI-compatible endpoints (#283) Enables GPT models to use custom API endpoints through OPENAI_BASE_URL environment variable. This addresses the limitation where only Azure OpenAI supported custom endpoints while standard GPT models were hardcoded to api.openai.com. - Add intelligent URL handling to avoid duplicate /v1 paths - Maintain backward compatibility with default OpenAI API - Update README with configuration instructions - Non-breaking change preserving existing functionality Fixes API integration issues for users with custom OpenAI-compatible services.	2025-07-24 12:31:08 +08:00
Yan98	2f3a6c48f6	Fix Typos (#275) * init * init * fix typo	2025-07-24 00:06:04 +08:00
yuanmengqi	82c3cdd590	feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management - Introduced signal handling for graceful shutdown of environments and processes. - Enhanced logging configuration to support dynamic log levels and structured output. - Updated argument parsing to include new parameters for model selection and task execution. - Refactored task distribution logic to streamline environment task management. - Improved error handling during task execution and environment cleanup. - Adjusted Qwen25VLAgent initialization to support new model and thought prefix options. - Reduced max tries for LLM calls to optimize performance.	2025-07-22 19:46:42 +00:00
Yuan Mengqi	0a37cccd53	update claude (#280) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude	2025-07-23 03:35:49 +08:00
Dunjie Lu	53fb96298a	support_qwen25vl (#276) Co-authored-by: root <ludunjie1219@github.com>	2025-07-22 16:33:03 +08:00
Xinyuan Wang	e10dd9267c	Wxy/opencua (#274) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines * update parallel; clean code; use sleep 3s * ui-tars-0717	2025-07-20 15:52:23 +08:00
Yuan Mengqi	5ca516ac7a	add uitars agent code (#265)	2025-07-17 18:17:13 +08:00
Xinyuan Wang	24fbad9015	Merge pull request #264 from yuanmengqi/main Improve the parallel logic	2025-07-17 12:28:48 +08:00
yuanmengqi	bb8b0b2582	Improve the parallel logic	2025-07-17 04:19:44 +00:00
yuanmengqi	9eeabfc52d	Improve the parallel logic	2025-07-17 04:14:20 +00:00
Xinyuan Wang	0f2655249c	Wxy/opencua (#260) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines	2025-07-16 17:53:12 +08:00
yuanmengqi	175b4b46c2	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-15 14:50:48 +00:00
Yuan Mengqi	af47ed8fb1	fix infeasible&chrome tasks (#258) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks * Merge branch 'fix_chrome' * fix insensible&chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-15 13:02:42 +08:00
yuanmengqi	08b4cf2c2f	fix infeasible&chrome tasks	2025-07-15 02:09:40 +00:00
Xinyuan Wang	db83b9cb2c	Wxy/opencua (#256) * OpenCUA Agent code base * update url * debug, modify url input	2025-07-14 20:26:39 +08:00
Zilong Zhou	74b7c189af	Feat/monitor (#254) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py * fix: update logger usage to use global logger and improve error handling * feat&fix: add configuration management API endpoints and update UI for configuration selection * feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness * feat&fix: add configuration toggle button in UI and improve task loading performance * feat&fix: add accuracy percentage display to score and style updates for UI	2025-07-14 13:43:41 +08:00
Zilong Zhou	349f2fd9fe	Feat/claude cua support (#253) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py	2025-07-13 21:10:49 +08:00
Yuan Mengqi	38a30734a6	Improve code logic for password & resolution (#252) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 21:04:07 +08:00
Yuan Mengqi	27319ce1e3	fix password&resolution (#251) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 00:25:37 +08:00
Yan98	4e3446d6fe	Fix Name (#249) * init * init	2025-07-11 00:15:46 +08:00
Yan98	0a5058342d	init (#246)	2025-07-10 00:29:42 +08:00
yuanmengqi	7315aec6e6	clean code	2025-06-10 04:06:54 +00:00
yuanmengqi	3da32fe5cf	update operator prompt	2025-06-10 02:35:53 +00:00
yuanmengqi	692486f8e7	add GDrive guideline	2025-06-09 14:59:47 +00:00
yuanmengqi	aee1207fff	fix error	2025-06-09 04:20:59 +00:00
yuanmengqi	d8872634ee	edit prompt	2025-06-08 03:59:31 +00:00
yuanmengqi	c57b1d4e7a	eval update	2025-06-07 13:19:22 +00:00
yuanmengqi	a146c1e0b7	edit prompt	2025-06-07 05:21:04 +00:00
yuanmengqi	64177045b5	Merge remote-tracking branch 'upstream/feat/aws-provider-support'	2025-06-06 10:22:56 +00:00
Timothyxxx	8373f7cff2	refactor: remove AWSVMManagerWithProxy and integrate proxy support directly into AWSVMManager for streamlined VM allocation; minor fix on openai_cua_agent	2025-06-06 02:55:50 +08:00
yuanmengqi	a6300e05c9	Merge remote-tracking branch 'upstream/feat/aws-provider-support'	2025-06-05 13:31:42 +00:00
adlsdztony	3b1540ed23	feat&fix: enhance task status handling and update logging configuration	2025-06-05 09:33:36 +00:00
yuanmengqi	b211df3385	fix timeout	2025-06-04 10:23:45 +00:00
yuanmengqi	98a810d31e	edit operator	2025-06-02 12:11:25 +00:00
yuanmengqi	228849ab03	add openai cua agent	2025-05-31 11:22:38 +00:00
uvheart	a845824f06	add azure_gpt_4o (#197)	2025-05-23 03:57:42 +08:00

161 Commits