sci-gui-agent-benchmark

Author	SHA1	Message	Date
MillanK0817	4ae9d41da4	feat: update jedi agent with support for o3 as planner	2025-07-30 14:06:37 +08:00
yuanmengqi	0f00788c4d	feat: add run_multienv_o3.py script for multi-environment evaluation - Introduced a new script `run_multienv_o3.py` to facilitate end-to-end evaluation across multiple environments. - Implemented command-line argument parsing for various configurations including environment settings, logging levels, and AWS parameters. - Integrated signal handling for graceful shutdown of environments and processes. - Enhanced logging capabilities for better traceability during execution. - Maintained existing logic from previous scripts while introducing new functionalities for improved evaluation processes.	2025-07-27 16:47:24 +00:00
yuanmengqi	523d553e88	feat: add client password argument to multiple agents and scripts - Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility. - Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration. - Modified evaluation guidelines to reflect the new client password requirement. - Ensured existing logic remains intact while enhancing functionality for better user experience.	2025-07-27 16:11:23 +00:00
yuanmengqi	b25854edba	feat: introduce DummyAgent class for enhanced coordinate handling - Added DummyAgent class to facilitate coordinate generation and action assignment. - Updated GTA1Agent to utilize DummyAgent for improved planning and execution. - Increased max_steps and N_SEQ parameters for better performance. - Enhanced logging for planning and execution processes. - Maintained existing logic while integrating new functionality.	2025-07-26 08:26:23 +00:00
yuanmengqi	73caf53880	delete: remove img_utils.py and update imports in jedi_3b_agent.py and jedi_7b_agent.py to use qwen_vl_utils	2025-07-26 07:28:31 +00:00
yuanmengqi	f5595df71c	delete: remove gat1_agent.py file	2025-07-25 07:11:55 +00:00
张逸群	bf78b6d05e	Add OPENAI_BASE_URL support for custom OpenAI-compatible endpoints (#283 ) Enables GPT models to use custom API endpoints through OPENAI_BASE_URL environment variable. This addresses the limitation where only Azure OpenAI supported custom endpoints while standard GPT models were hardcoded to api.openai.com. - Add intelligent URL handling to avoid duplicate /v1 paths - Maintain backward compatibility with default OpenAI API - Update README with configuration instructions - Non-breaking change preserving existing functionality Fixes API integration issues for users with custom OpenAI-compatible services.	2025-07-24 12:31:08 +08:00
Yan98	2f3a6c48f6	Fix Typos (#275 ) * init * init * fix typo	2025-07-24 00:06:04 +08:00
yuanmengqi	82c3cdd590	feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management - Introduced signal handling for graceful shutdown of environments and processes. - Enhanced logging configuration to support dynamic log levels and structured output. - Updated argument parsing to include new parameters for model selection and task execution. - Refactored task distribution logic to streamline environment task management. - Improved error handling during task execution and environment cleanup. - Adjusted Qwen25VLAgent initialization to support new model and thought prefix options. - Reduced max tries for LLM calls to optimize performance.	2025-07-22 19:46:42 +00:00
Yuan Mengqi	0a37cccd53	update claude (#280 ) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude	2025-07-23 03:35:49 +08:00
Dunjie Lu	53fb96298a	support_qwen25vl (#276 ) Co-authored-by: root <ludunjie1219@github.com>	2025-07-22 16:33:03 +08:00
Xinyuan Wang	e10dd9267c	Wxy/opencua (#274 ) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines * update parallel; clean code; use sleep 3s * ui-tars-0717	2025-07-20 15:52:23 +08:00
Yuan Mengqi	5ca516ac7a	add uitars agent code (#265 )	2025-07-17 18:17:13 +08:00
Xinyuan Wang	24fbad9015	Merge pull request #264 from yuanmengqi/main Improve the parallel logic	2025-07-17 12:28:48 +08:00
yuanmengqi	bb8b0b2582	Improve the parallel logic	2025-07-17 04:19:44 +00:00
yuanmengqi	9eeabfc52d	Improve the parallel logic	2025-07-17 04:14:20 +00:00
Xinyuan Wang	0f2655249c	Wxy/opencua (#260 ) * OpenCUA Agent code base * update url * debug, modify url input * debug opencua * show result * debug agent history overlap * modify opencua agent; add comment lines	2025-07-16 17:53:12 +08:00
yuanmengqi	175b4b46c2	Merge remote-tracking branch 'upstream/main' into fix_chrome	2025-07-15 14:50:48 +00:00
Yuan Mengqi	af47ed8fb1	fix infeasible&chrome tasks (#258 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks * Merge branch 'fix_chrome' * fix insensible&chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-15 13:02:42 +08:00
yuanmengqi	08b4cf2c2f	fix infeasible&chome tasks	2025-07-15 02:09:40 +00:00
Xinyuan Wang	db83b9cb2c	Wxy/opencua (#256 ) * OpenCUA Agent code base * update url * debug, modify url input	2025-07-14 20:26:39 +08:00
Zilong Zhou	74b7c189af	Feat/monitor (#254 ) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py * fix: update logger usage to use global logger and improve error handling * feat&fix: add configuration management API endpoints and update UI for configuration selection * feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness * feat&fix: add configuration toggle button in UI and improve task loading performance * feat&fix: add accuracy percentage display to score and style updates for UI	2025-07-14 13:43:41 +08:00
Zilong Zhou	349f2fd9fe	Feat/claude cua support (#253 ) * feat: add claude support * feat: add script for end-to-end evaluation with logging and task distribution * feat&fix: add tool result handling and update model default in evaluation script * chore: remove run_test_env.py script * feat&fix: implement action parsing for tool calls and update default action space * fix: update text formatting in action parsing and replace logger import * feat&fix: implement action parsing for tool calls and add screen size handling * feat: add setup instructions for Anthropic API integration * feat: add notice about image size limitations for Anthropic API * Delete test_env/logger.py * Delete test_env/utils.py	2025-07-13 21:10:49 +08:00
Yuan Mengqi	38a30734a6	Improve code logic for password & resolution (#252 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 21:04:07 +08:00
Yuan Mengqi	27319ce1e3	fix password&resolution (#251 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 00:25:37 +08:00
Yan98	4e3446d6fe	Fix Name (#249 ) * init * init	2025-07-11 00:15:46 +08:00
Yan98	0a5058342d	init (#246 )	2025-07-10 00:29:42 +08:00
yuanmengqi	7315aec6e6	clean code	2025-06-10 04:06:54 +00:00
yuanmengqi	3da32fe5cf	update operator prompt	2025-06-10 02:35:53 +00:00
yuanmengqi	692486f8e7	add GDrive guideline	2025-06-09 14:59:47 +00:00
yuanmengqi	aee1207fff	fix error	2025-06-09 04:20:59 +00:00
yuanmengqi	d8872634ee	edit prompt	2025-06-08 03:59:31 +00:00
yuanmengqi	c57b1d4e7a	eval update	2025-06-07 13:19:22 +00:00
yuanmengqi	a146c1e0b7	edit prompt	2025-06-07 05:21:04 +00:00
yuanmengqi	64177045b5	Merge remote-tracking branch 'upstream/feat/aws-provider-support'	2025-06-06 10:22:56 +00:00
Timothyxxx	8373f7cff2	refactor: remove AWSVMManagerWithProxy and integrate proxy support directly into AWSVMManager for streamlined VM allocation; minor fix on openai_cua_agent	2025-06-06 02:55:50 +08:00
yuanmengqi	a6300e05c9	Merge remote-tracking branch 'upstream/feat/aws-provider-support'	2025-06-05 13:31:42 +00:00
adlsdztony	3b1540ed23	feat&fix: enhance task status handling and update logging configuration	2025-06-05 09:33:36 +00:00
yuanmengqi	b211df3385	fix timeout	2025-06-04 10:23:45 +00:00
yuanmengqi	98a810d31e	edit operator	2025-06-02 12:11:25 +00:00
yuanmengqi	228849ab03	add openai cua agent	2025-05-31 11:22:38 +00:00
uvheart	a845824f06	add azure_gpt_4o (#197 )	2025-05-23 03:57:42 +08:00
Shihao Liang	119bef25e2	Dev/uitars 15 (#194 ) * debug uitars1.0, add uitars1.5 * update pyautogui parser * modify function name * update parser * update prompt * FIX: bug in ui tars	2025-05-19 17:15:17 +08:00
MillanK	51f5ddea04	Add Jedi agent implementation to mm_agents (#192 ) * feat: implement Jedi agent * chore: code clean	2025-05-10 19:55:33 +08:00
Thomas Kuntz	5678b510d7	fix: Invalid escape sequence in prompts (#191 ) Fixes the warning: SyntaxWarning: invalid escape sequence '\`'	2025-05-10 18:19:07 +08:00
Thomas Kuntz	7d88283f8a	feat: Support newer Gemini models (#188 )	2025-05-06 16:04:30 +08:00
Shihao Liang	b92c716df7	Dev/uitars 15 (#181 ) * debug uitars1.0, add uitars1.5 * update pyautogui parser * modify function name * update parser * update prompt	2025-04-21 13:44:08 +08:00
Shihao Liang	bd2e980666	Dev/uitars 15 (#178 ) * debug uitars1.0, add uitars1.5 * update pyautogui parser * modify function name * update parser	2025-04-17 18:49:21 +08:00
Shiqian Su	c4d818c5cf	Update aguvis_agent.py (#141 ) Fix Aguvis prompt bug	2025-02-28 16:48:41 +08:00
Shihao Liang	339a13e1d5	Dev/uitars (#132 ) * init uitars * change agent class name * FIX: return bug in agent predict	2025-02-14 11:17:37 +08:00

153 Commits