From 38f4506ea344dc0d38f5459952b216d280f27d2c Mon Sep 17 00:00:00 2001
From: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
Date: Fri, 12 Apr 2024 18:25:05 +0800
Subject: [PATCH] Update README.md of agents

---
 mm_agents/README.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm_agents/README.md b/mm_agents/README.md
index ff97c2a..ccf95d4 100644
--- a/mm_agents/README.md
+++ b/mm_agents/README.md
@@ -2,7 +2,7 @@
 ## Prompt-based Agents
 
 ### Supported Models
-We currently support the following models as the foundation models for the agents:
+We currently support the following models as the foundational models for the agents:
 - `GPT-3.5` (gpt-3.5-turbo-16k, ...)
 - `GPT-4` (gpt-4-0125-preview, gpt-4-1106-preview, ...)
 - `GPT-4V` (gpt-4-vision-preview, ...)
@@ -11,13 +11,13 @@ We currently support the following models as the foundation models for the agent
 - `Claude-3, 2` (claude-3-haiku-2024030, claude-3-sonnet-2024022, ...)
 - ...
 
-And those from open-source community:
+And those from the open-source community:
 - `Mixtral 8x7B`
 - `QWEN`, `QWEN-VL`
 - `CogAgent`
 - ...
 
-And we will integrate and support more foundation models to support digital agent in the future, stay tuned.
+In the future, we will integrate and support more foundational models to enhance digital agents, so stay tuned.
 
 ### How to use
 
@@ -25,11 +25,11 @@ And we will integrate and support more foundation models to support digital agen
 from mm_agents.agent import PromptAgent
 
 agent = PromptAgent(
-    model="gpt-4-0125-preview",
+    model="gpt-4-vision-preview",
     observation_type="screenshot",
 )
 agent.reset()
-# say we have a instruction and observation
+# say we have an instruction and observation
 instruction = "Please help me to find the nearest restaurant."
 obs = {"screenshot": "path/to/observation.jpg"}
 response, actions = agent.predict(
@@ -40,16 +40,16 @@
 ### Observation Space and Action Space
 We currently support the following observation spaces:
-- `a11y_tree`: the a11y tree of the current screen
+- `a11y_tree`: the accessibility tree of the current screen
 - `screenshot`: a screenshot of the current screen
-- `screenshot_a11y_tree`: a screenshot of the current screen with a11y tree
-- `som`: the set-of-mark trick on the current screen, with a table metadata
+- `screenshot_a11y_tree`: a screenshot of the current screen with the accessibility tree overlay
+- `som`: the set-of-mark trick on the current screen, with table metadata included
 
 And the following action spaces:
-- `pyautogui`: valid python code with `pyauotgui` code valid
+- `pyautogui`: valid Python code using the `pyautogui` library
 - `computer_13`: a set of enumerated actions designed by us
 
-To use feed an observation into the agent, you have to keep the obs variable as a dict with the corresponding information:
+To feed an observation into the agent, you have to maintain the `obs` variable as a dict with the corresponding information:
 ```python
 obs = {
     "screenshot": "path/to/observation.jpg",