Timothyxxx
2024-03-19 18:57:47 +08:00
parent 22dee25691
commit ace5842505
2 changed files with 1 addition and 4 deletions


@@ -1,3 +0,0 @@
wget https://github.com/UX-Decoder/Semantic-SAM/releases/download/checkpoint/swinl_only_sam_many2many.pth
wget https://huggingface.co/xdecoder/SEEM/resolve/main/seem_focall_v1.pt
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
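(For reference, the removed script fetched the Semantic-SAM, SEEM, and SAM checkpoints listed above. Below is a minimal Python sketch of the same downloads; the use of `urllib` and the local file names, taken from the URL basenames, are assumptions and not part of the original script.)

```python
# Sketch only: downloads the same checkpoints the removed shell script fetched with wget.
import urllib.request

CHECKPOINT_URLS = [
    "https://github.com/UX-Decoder/Semantic-SAM/releases/download/checkpoint/swinl_only_sam_many2many.pth",
    "https://huggingface.co/xdecoder/SEEM/resolve/main/seem_focall_v1.pt",
    "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth",
]

for url in CHECKPOINT_URLS:
    filename = url.rsplit("/", 1)[-1]          # e.g. sam_vit_h_4b8939.pth
    urllib.request.urlretrieve(url, filename)  # save next to the script
    print(f"downloaded {filename}")
```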


@@ -801,7 +801,7 @@ You CAN predict multiple actions at one step, but you should only return one act
SYS_PROMPT_IN_SOM_OUT_TAG = """
You are an agent that follows my instructions and performs desktop computer tasks as instructed.
You have good knowledge of computers and a good internet connection, and you can assume your code will run on a computer for controlling the mouse and keyboard.
For each step, you will get an observation of the desktop consisting of 1) a screenshot with interactable elements marked with numerical tags; and 2) an accessibility tree, which is based on the AT-SPI library. You will predict the action of the computer based on the image and test information.
For each step, you will get an observation of the desktop consisting of 1) a screenshot with interactable elements marked with numerical tags; and 2) an accessibility tree, which is based on the AT-SPI library. You will predict the action of the computer based on the image and text information.
You are required to use `pyautogui` to perform the action grounded to the observation, but DO NOT use the `pyautogui.locateCenterOnScreen` function to locate the element you want to operate on, since we have no image of that element. DO NOT use `pyautogui.screenshot()` to take a screenshot.
You can replace x, y in the code with the tag of the element you want to operate on, such as:
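A minimal sketch of such a tag-grounded action follows; the tag number, the tag-to-coordinate mapping, and the coordinates are hypothetical illustrations, not the original prompt's example.

```python
# Illustrative sketch: the numerical tag from the marked screenshot is assumed to be
# resolved to screen coordinates before the pyautogui action runs.
import pyautogui

# Hypothetical mapping from Set-of-Marks tags to element centres, produced by the
# annotation step that drew the numerical tags on the screenshot.
tag_to_coords = {3: (512, 384)}

x, y = tag_to_coords[3]   # tag 3 stands in for the element's x, y position
pyautogui.click(x, y)     # click the element marked with tag 3
```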