feat: enhance run_coact.py and related agents with improved task handling and configuration

- Updated TASK_DESCRIPTION in run_coact.py to clarify task-solving steps and requirements.
- Modified configuration parameters for provider name and client password for better security and flexibility.
- Enhanced OrchestratorUserProxyAgent to include user instruction in the auto-reply and improved screenshot handling.
- Adjusted coding_agent.py to ensure proper verification of results before saving changes.
- Improved CUA agent prompts to maintain application state and handle user instructions more effectively.
- Ensured existing code logic remains unchanged while enhancing functionality and usability.
2025-08-13 09:04:09 +00:00
parent d2ae0f697d
commit 7fb5860da0
4 changed files with 100 additions and 59 deletions
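
Reviewer note: one behavioural change in the run_coact.py diff is that `is_termination_msg` moves from a substring check to an exact comparison against "terminate" / "infeasible". The sketch below is a minimal, standalone illustration of that difference, not the code from this commit. The helper names (`first_text`, `is_termination_substring`, `is_termination_exact`) are hypothetical, and the `.strip()` call is an added assumption; the lambda in the diff compares the lowercased text without stripping whitespace.

```python
from typing import Any, Dict


def first_text(message: Dict[str, Any]) -> str:
    """Return the text of the first content part, or "" if there is none."""
    content = message.get("content")
    if isinstance(content, str):
        return content
    if isinstance(content, list) and content and isinstance(content[0], dict):
        return str(content[0].get("text", ""))
    return ""


def is_termination_substring(message: Dict[str, Any]) -> bool:
    """Old-style check: fires whenever the keyword appears anywhere in the text."""
    text = first_text(message).lower()
    return "terminate" in text or "infeasible" in text


def is_termination_exact(message: Dict[str, Any]) -> bool:
    """New-style check: fires only when the whole reply is the keyword.

    Assumption: surrounding whitespace is stripped here; the commit's lambda
    does not strip, so a trailing newline would not match there.
    """
    text = first_text(message).strip().lower()
    return text in {"terminate", "infeasible"}


if __name__ == "__main__":
    plan = {"content": [{"type": "text", "text": "I will terminate the app first."}]}
    done = {"content": [{"type": "text", "text": "TERMINATE"}]}
    print(is_termination_substring(plan), is_termination_exact(plan))
    print(is_termination_substring(done), is_termination_exact(done))
```

With a substring check, a planning message that merely mentions the word "terminate" would end the conversation early; the exact comparison avoids that at the cost of requiring the model to reply with the keyword alone.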
--- a/run_coact.py
+++ b/run_coact.py
@@ -6,6 +6,7 @@ import shutil
 import traceback
 from typing import Dict, List
 import json
+import time
 import os
 from mm_agents.coact.operator_agent import OrchestratorAgent, OrchestratorUserProxyAgent
 from mm_agents.coact.autogen import LLMConfig
@@ -16,30 +17,34 @@ import sys


 TASK_DESCRIPTION = """# Your role
-You are a task solver, you need to try your best to complete a computer-using task step-by-step.
- Based on the task description AND the screenshot, provide a detailed plan. The screenshot includes the current state of the computer, and it includes a lot of hints.
- Do not do anything else out of the user's instruction, like discover the file out of the user specific location or imagine the conditions. This will affect the judgement of the infeasible task.
- When you see the interactable element in the screenshot is dense (like a spreadsheet is opening), you MUST try coding agent first.
- When you let coding agent to modify an existing file, you MUST let it check the file content first.
- After coding agent is done, you MUST check the final result carefully either by looking into the screenshot or by GUI agent and make sure EVERY value is in the desired position (e.g., the required cells in the spreadsheet are filled).
- When you call GUI agent, it will have a **20-step** budget to complete your task. Each step is a one-time interaction with OS like mouse click or keyboard typing. Please take this into account when you plan the actions.
- Remember to save the file (if applicable) before completing the task.
+You are a task solver, you need to complete a computer-using task step-by-step.
+1. Describe the screenshot.
+2. Provide a detailed plan, including a list of user requirements like specific file name, file path, etc.
+3. Follow the following instructions and complete the task with your skills.
+    - If you think the task is impossible to complete (no file, wrong environment, etc.), reply with "INFEASIBLE" to end the conversation.
+    - **Do not** do (or let coding/GUI agent do) anything else out of the user's instruction like change the file name. This will make the task fail.
+    - Check every screenshot carefully and see if it fulfills the task requirement.
+    - You MUST try the Coding Agent first for file operation tasks like spreadsheet modification.
+4. Verify the result and see if it fulfills the user's requirement.

-# Your skills
+# Your helpers
 You can use the following tools to solve the task. You can only call one of gui agent or coding agent per reply:
- call_coding_agent: Let a coding agent to solve a task. Coding agent can write python or bash code to modify everything on the computer. It requires a environment description and a detailed task description.
- call_gui_agent: Let a GUI agent to solve a task. GUI agent can operate the computer by clicking and typing (not that accurate). It will have a **20-step** budget to complete your task. Require a detailed task description.

-# About the task
- Check every screenshot carefully and see if it fulfills the task requirement.
- If the task is completed, reply with "TERMINATE" to end the conversation.
- If you think the task is impossible to complete (no file, wrong environment, etc.), reply with "INFEASIBLE" to end the conversation.
- TERMINATE and INFEASIBLE are used to determine if the task is completed. Therefore, do not use it in your response unless the task is completed.
+## Programmer
+Let a programmer to solve a subtask you assigned. 
+The Programmer can write python or bash code to modify almost everything in the computer, like files, apps, system settings, etc. 
+It requires a environment description and a detailed task description. As detailed as possible.
+Can use any python package you instructed.
+Will return a summary with the output of the code.
+When letting coding agent to modify the spreadsheet, after the task completed, you MUST make sure EVERY modified value in the spreadsheet is in the desired position (e.g., filled in the expected cell) by a GUI Operator.
+After that, if anything is wrong, tell the programmer to modify it.

-# User task
-{instruction}
-Please first check carefully if my task is possible to complete. If not, reply with "INFEASIBLE".
-If possible to complete, please complete this task on my computer. I will not provide further information to you.
+## GUI Operator
+Let a GUI agent to solve a subtask you assigned. 
+GUI agent can operate the computer by clicking and typing (but not accurate). 
+Require a detailed task description.
+When you call GUI agent, it will only have a **20-step** budget to complete your task. Each step is a one-time interaction with OS like mouse click or keyboard typing. Please take this into account when you plan the actions.
+If you let GUI Operator to check the result, you MUST let it close and reopen the file because programmer's result will NOT be updated to the screen. 
 """


@@ -50,22 +55,22 @@ def config() -> argparse.Namespace:

    # environment config
    parser.add_argument("--path_to_vm", type=str, default=None)
-    parser.add_argument("--provider_name", type=str, default="docker")
+    parser.add_argument("--provider_name", type=str, default="aws")
    parser.add_argument("--screen_width", type=int, default=1920)
    parser.add_argument("--screen_height", type=int, default=1080)
    parser.add_argument("--sleep_after_execution", type=float, default=0.5)
    parser.add_argument("--region", type=str, default="us-east-1")
-    parser.add_argument("--client_password", type=str, default="")
+    parser.add_argument("--client_password", type=str, default="osworld-public-evaluation")

    # agent config
-    parser.add_argument("--oai_config_path", type=str, default="OAI_CONFIG_LIST")
+    parser.add_argument("--oai_config_path", type=str, default="/home/ubuntu/OSWorld/mm_agents/coact/OAI_CONFIG_LIST")
    parser.add_argument("--orchestrator_model", type=str, default="o3")
    parser.add_argument("--coding_model", type=str, default="o4-mini")
    parser.add_argument("--cua_model", type=str, default="computer-use-preview")
    parser.add_argument("--orchestrator_max_steps", type=int, default=15)
    parser.add_argument("--coding_max_steps", type=int, default=20)
    parser.add_argument("--cua_max_steps", type=int, default=25)
-    parser.add_argument("--cut_off_steps", type=int, default=150)
+    parser.add_argument("--cut_off_steps", type=int, default=200)

    # example config
    parser.add_argument("--domain", type=str, default="all")
@@ -77,7 +82,7 @@ def config() -> argparse.Namespace:
    )

    # logging related
-    parser.add_argument("--result_dir", type=str, default="./results")
+    parser.add_argument("--result_dir", type=str, default="./results_coact")
    parser.add_argument("--num_envs", type=int, default=1, help="Number of environments to run in parallel")
    parser.add_argument("--log_level", type=str, choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'], 
                       default='INFO', help="Set the logging level")
@@ -158,10 +163,11 @@ def process_task(task_info,
            with llm_config:
                orchestrator = OrchestratorAgent(
                    name="orchestrator",
+                    system_message=TASK_DESCRIPTION
                )
                orchestrator_proxy = OrchestratorUserProxyAgent(
                    name="orchestrator_proxy",
-                    is_termination_msg=lambda x: x.get("content", "") and ("terminate" in x.get("content", "")[0]["text"].lower() or "infeasible" in x.get("content", "")[0]["text"].lower()),
+                    is_termination_msg=lambda x: x.get("content", "") and (x.get("content", "")[0]["text"].lower() == "terminate" or x.get("content", "")[0]["text"].lower() == "infeasible"),
                    human_input_mode="NEVER",
                    provider_name=provider_name,
                    path_to_vm=path_to_vm,
@@ -175,14 +181,22 @@ def process_task(task_info,
                    cua_max_steps=cua_max_steps,
                    coding_max_steps=coding_max_steps,
                    region=region,
-                    client_password=client_password
+                    client_password=client_password,
+                    user_instruction=task_config["instruction"]
                )

-            obs = orchestrator_proxy.reset(task_config=task_config)
+            orchestrator_proxy.reset(task_config=task_config)
+            time.sleep(60)
+            screenshot = orchestrator_proxy.env.controller.get_screenshot()

+            with open(os.path.join(history_save_dir, f'initial_screenshot_orchestrator.png'), "wb") as f:
+                f.write(screenshot)
+                
            orchestrator_proxy.initiate_chat(
                recipient=orchestrator,
-                message=TASK_DESCRIPTION.format(instruction=task_config["instruction"]) + "<img data:image/png;base64," + base64.b64encode(obs["screenshot"]).decode("utf-8") + ">",
+                message=f"""{task_config["instruction"]}
+Check my computer screenshot and describe it first. If this task is possible to complete, please complete it on my computer. If not, reply with "INFEASIBLE" to end the conversation.
+I will not provide further information to you.""" + "<img data:image/png;base64," + base64.b64encode(screenshot).decode("utf-8") + ">",
                max_turns=orchestrator_max_steps
            )