Agent
Prompt-based Agents
Supported Models
We currently support the following models as the foundational models for the agents:
- GPT-3.5 (gpt-3.5-turbo-16k, ...)
- GPT-4 (gpt-4-0125-preview, gpt-4-1106-preview, ...)
- GPT-4V (gpt-4-vision-preview, ...)
- Gemini-Pro
- Gemini-Pro-Vision
- Claude-3, Claude-2 (claude-3-haiku-20240307, claude-3-sonnet-20240229, ...)
- ...
And those from the open-source community:
- Mixtral 8x7B
- QWEN, QWEN-VL
- CogAgent
- Llama3
- ...
In the future, we will integrate and support more foundational models to enhance digital agents, so stay tuned.
How to use
from mm_agents.agent import PromptAgent

agent = PromptAgent(
    model="gpt-4-vision-preview",
    observation_type="screenshot",
)
agent.reset()

# Say we have an instruction and an observation
instruction = "Please help me to find the nearest restaurant."
with open("path/to/observation.jpg", "rb") as f:
    obs = {"screenshot": f.read()}

response, actions = agent.predict(
    instruction,
    obs
)
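Here, response is the model's raw text output and actions is the list of actions parsed from it. In the pyautogui action space (described below), each action is a string of executable Python. The values shown in this sketch are purely illustrative:

# Illustrative values only; real output depends on the model and the task
print(response)  # the model's free-form reasoning and action plan
print(actions)   # e.g. ["import pyautogui\npyautogui.click(x=960, y=540)"]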
Observation Space and Action Space
We currently support the following observation spaces:
- a11y_tree: the accessibility tree of the current screen
- screenshot: a screenshot of the current screen
- screenshot_a11y_tree: a screenshot of the current screen with the accessibility tree overlay
- som: the set-of-mark trick on the current screen, with table metadata included
And the following action spaces:
- pyautogui: valid Python code using the pyautogui library
- computer_13: a set of enumerated actions designed by us
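Both spaces are selected when the agent is constructed. A minimal sketch follows; it assumes the constructor exposes an action_space keyword alongside observation_type, so check the PromptAgent signature for the exact parameter names and defaults:

# A sketch: pick the action space at construction time (assumption: the
# keyword is named action_space, mirroring observation_type)
agent = PromptAgent(
    model="gpt-4-vision-preview",
    observation_type="screenshot",
    action_space="computer_13",
)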
To feed an observation into the agent, maintain the obs variable as a dict carrying the fields that your chosen observation type requires:
# Continue from the previous code snippet
with open("path/to/observation.jpg", "rb") as f:
    obs = {
        "screenshot": f.read(),
        "a11y_tree": ""  # [a11y_tree data]
    }

response, actions = agent.predict(
    instruction,
    obs
)
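Putting the pieces together, a minimal interaction loop might look like the sketch below. It assumes OSWorld's DesktopEnv with a gym-style step API and a pyautogui-space agent; example_task stands in for a real task config and is not defined in this README:

from desktop_env.desktop_env import DesktopEnv

# Assumptions: default VM provider settings, the pyautogui action space, and
# a gym-style (obs, reward, done, info) return from env.step
env = DesktopEnv(action_space="pyautogui")
obs = env.reset(task_config=example_task)  # example_task: hypothetical config dict

done = False
while not done:
    response, actions = agent.predict(instruction, obs)
    for action in actions:
        obs, reward, done, info = env.step(action)
        if done:
            break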
Efficient Agents, Q* Agents, and more
Stay tuned for more updates.