# Agent

## Prompt-based Agents

### Supported Models

We currently support the following models as the foundation models for the agents:

- `GPT-3.5` (gpt-3.5-turbo-16k, ...)
- `GPT-4` (gpt-4-0125-preview, gpt-4-1106-preview, ...)
- `GPT-4V` (gpt-4-vision-preview, ...)
- `Gemini-Pro`
- `Gemini-Pro-Vision`
- `Claude-3`, `Claude-2` (claude-3-haiku-20240307, claude-3-sonnet-20240229, ...)
- ...

And the following models from the open-source community:

- `Mixtral 8x7B`
- `QWEN`, `QWEN-VL`
- `CogAgent`
- ...

We will integrate and support more foundation models for digital agents in the future; stay tuned.

### How to use

```python
from mm_agents.agent import PromptAgent

agent = PromptAgent(
    model="gpt-4-0125-preview",
    observation_type="screenshot",
)
agent.reset()

# Say we have an instruction and an observation
instruction = "Please help me to find the nearest restaurant."
obs = {"screenshot": "path/to/observation.jpg"}

response, actions = agent.predict(
    instruction,
    obs
)
```

### Observation Space and Action Space

We currently support the following observation spaces:

- `a11y_tree`: the accessibility (a11y) tree of the current screen
- `screenshot`: a screenshot of the current screen
- `screenshot_a11y_tree`: a screenshot of the current screen combined with the a11y tree
- `som`: the set-of-mark annotation of the current screen, together with a table of mark metadata

And the following action spaces:

- `pyautogui`: valid Python code that calls the `pyautogui` library
- `computer_13`: a set of enumerated actions designed by us

To feed an observation into the agent, keep the `obs` variable as a dict with the corresponding information:

```python
obs = {
    "screenshot": "path/to/observation.jpg",
    "a11y_tree": ""  # [a11y_tree data]
}
response, actions = agent.predict(
    instruction,
    obs
)
```

## Efficient Agents, Q* Agents, and more

Stay tuned for more updates.
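In the meantime, the sketch below recaps the interface described above by wiring `PromptAgent` into a simple observe/predict/act loop with the `pyautogui` action space. It is illustrative only: the screenshot capture via `pyautogui.screenshot`, the step budget `MAX_STEPS`, and the use of `exec()` to run the returned actions are assumptions for this sketch, not part of the documented `PromptAgent` API.

```python
# A minimal sketch of an observe/predict/act loop (assumptions noted below).
import pyautogui

from mm_agents.agent import PromptAgent

agent = PromptAgent(
    model="gpt-4-0125-preview",
    observation_type="screenshot",
)
agent.reset()

instruction = "Please help me to find the nearest restaurant."
MAX_STEPS = 5  # assumed step budget for this sketch

for step in range(MAX_STEPS):
    # Capture the current screen and save it to disk, since the agent
    # expects a file path in the observation dict (see the examples above).
    screenshot_path = f"observation_{step}.png"
    pyautogui.screenshot(screenshot_path)

    obs = {"screenshot": screenshot_path}
    response, actions = agent.predict(instruction, obs)

    # In the `pyautogui` action space, each action is a string of Python
    # code driving pyautogui; running it with exec() is one plausible way
    # to apply it, not a documented contract.
    for action in actions:
        exec(action)
```

In a real deployment you would replace the fixed step budget with a task-completion check and sandbox the generated code before executing it.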