Anthropic has given Claude a new capability they call "computer use". The accompanying announcement says "developers can direct Claude to use computers the way people do — by looking at a screen, moving a cursor, clicking buttons, and typing text."
It sounds wild and sci-fi, and Anthropic says it's still experimental, slow, and error-prone, but it does work!
"Computer use" in practice
I would love to be able to say I built my own implementation for "computer using" Claude, but someone beat me to it and did an amazing job. Corbt wrote a prototype in Node and Electron called agent.exe, essentially a desktop interface for the new feature, which is what I used to do some basic testing.
Agent.exe consists of a single window that contains a text box for your prompt, and a "Let's go" button. Sparse.
After sending your prompt, you get a running commentary of Claude's thought process as it figures out how to do what you want. This includes things like how it's interpreted your request, how it thinks the request should be broken down into steps, and what actions it's taking.
You also get to watch it carry out its mission in real time as it hijacks your mouse and starts clicking, launching apps, and typing its way around. The way it types gives me Rubber Ducky flashbacks: it just pastes whole sentences and queries into input fields in one go.
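To make the rest of this post easier to follow, here's a minimal sketch of the observe/reason/act loop a tool like agent.exe runs, in TypeScript since agent.exe is a Node app. To be clear, this is just my mental model: the helpers (takeScreenshot, askClaude, executeAction) are stand-ins I made up, not agent.exe's actual API.

```typescript
// Minimal sketch of the screenshot -> reason -> act loop. All helper names
// are hypothetical stand-ins, not agent.exe's real API.
type Action =
  | { type: 'screenshot' }
  | { type: 'mouse_move'; x: number; y: number }
  | { type: 'left_click' }
  | { type: 'type'; text: string }
  | { type: 'key'; text: string }
  | { type: 'finish' };

interface ModelStep {
  reasoning: string; // the running commentary shown in the window
  actions: Action[]; // the concrete steps Claude wants taken
}

declare function takeScreenshot(): Promise<Uint8Array>;
declare function askClaude(prompt: string, screen: Uint8Array): Promise<ModelStep>;
declare function executeAction(action: Action): Promise<void>;

async function runTask(prompt: string): Promise<void> {
  let screen = await takeScreenshot();
  for (;;) {
    const step = await askClaude(prompt, screen);
    console.log('REASONING', step.reasoning);
    for (const action of step.actions) {
      console.log('ACTION', action);
      if (action.type === 'finish') return;
      if (action.type === 'screenshot') {
        screen = await takeScreenshot(); // fed back in on the next model call
      } else {
        await executeAction(action); // drives the real mouse and keyboard
      }
    }
  }
}
```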
Running tests using agent.exe
I gave it three tasks:
- Find a picture of an orange cat (this post)
- Find a restaurant for dinner
- Send a Slack message (coming soon)
Finding a picture of an orange cat
The prompt I used was "Find me a picture of an orange cat."
I had tried to get it to open Firefox a couple times with no success, so I decided to try with Firefox already open and in focus on the screen.
This is how its thought process started:
```
REASONING I'll help you find a picture of an orange cat using Firefox. Let me break this down into steps: 1. First, let me take a screenshot to see the desktop:
ACTION { type: 'screenshot' }
REASONING 2. I see Firefox is already open. Let me click in the address bar and go to Google Images:
ACTION { type: 'mouse_move', x: 245, y: 245 }
ACTION { type: 'left_click' }
ACTION { type: 'type', text: 'https://images.google.com' }
ACTION { type: 'key', text: 'Return' }
```
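Each ACTION is a small JSON-style command the harness has to translate into a real input event. I haven't dug through agent.exe's source, so this is a guess rather than its actual code, but with an automation library like nut.js the dispatch could plausibly look something like this:

```typescript
// Guessed mapping from Claude's actions to OS input events via nut.js.
// Not agent.exe's actual implementation.
import { mouse, keyboard, Key, Point } from '@nut-tree/nut-js';

type InputAction =
  | { type: 'mouse_move'; x: number; y: number }
  | { type: 'left_click' }
  | { type: 'type'; text: string }
  | { type: 'key'; text: string };

async function executeAction(action: InputAction): Promise<void> {
  switch (action.type) {
    case 'mouse_move':
      await mouse.setPosition(new Point(action.x, action.y));
      break;
    case 'left_click':
      await mouse.leftClick();
      break;
    case 'type':
      // Whole strings go in at once, which is why the typing looks pasted
      // rather than typed.
      await keyboard.type(action.text);
      break;
    case 'key':
      // Only key names the harness knows how to translate will work; see
      // the F5 incident further down.
      if (action.text === 'Return') {
        await keyboard.type(Key.Enter);
      } else {
        throw new Error(`Tried to press unknown key: ${action.text}`);
      }
      break;
  }
}
```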
At this point, Claude tried to perform a Google Images search while there was a massive cookie modal on the screen. It did eventually get around it (interestingly, deciding to accept on my behalf) and proceeded, but the modal messed up the flow a little.
```
REASONING 3. Now I'll search for "orange cat":
ACTION { type: 'mouse_move', x: 618, y: 245 }
ACTION { type: 'left_click' }
REASONING I see we need to accept cookies first. I'll click "Accept all":
ACTION { type: 'mouse_move', x: 679, y: 586 }
ACTION { type: 'left_click' }
REASONING Now I'll type "orange cat" in the search box:
ACTION { type: 'type', text: 'orange cat' }
ACTION { type: 'key', text: 'Return' }
REASONING Let me take another screenshot to see the search results:
ACTION { type: 'screenshot' }
REASONING The page is still loading. Let me wait a moment and take another screenshot:
ACTION { type: 'screenshot' }
REASONING I see the page is still loading. Let me try one more time:
ACTION { type: 'screenshot' }
```
Thinking there was something wrong with the page at this point, Claude tried to remedy the issue by refreshing the page. For some reason, F5 wasn't a valid key to press, and agent.exe threw an error but luckily didn't stop execution.
```
REASONING I notice the page is not loading properly. Let me try refreshing the page:
ACTION { type: 'key', text: 'F5' }
(node:19877) UnhandledPromiseRejectionWarning: Error: Tried to press unknown key: F5
```
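That UnhandledPromiseRejectionWarning also explains the "luckily didn't stop execution" part: the key press looks like it was fired off as an unawaited promise, and Node (on older versions, or with rejections set to warn) just logs a warning for those rather than crashing. A sketch of the difference, using a hypothetical pressKey helper:

```typescript
// Hypothetical key-press helper that rejects any name it can't translate.
async function pressKey(name: string): Promise<void> {
  const known = new Set(['Return', 'Tab', 'Escape']); // note: no F5
  if (!known.has(name)) {
    throw new Error(`Tried to press unknown key: ${name}`);
  }
  // ...send the actual key event here...
}

// Fire-and-forget: nothing handles the rejection, so Node prints an
// UnhandledPromiseRejectionWarning and the run carries on regardless.
pressKey('F5');

// Awaited in a try/catch instead, the failure could be reported back to
// Claude rather than silently swallowed.
async function safePress(name: string): Promise<void> {
  try {
    await pressKey(name);
  } catch (err) {
    console.error('Key press failed:', err);
  }
}
```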
As far as Claude was concerned, nothing weird had happened and it was once again looking at a blank Google Image search field. In reality, it hadn't managed to type the query in yet, having been confused by the cookie modal, and the page hadn't been refreshed.
From this point on, everything worked well.
```
REASONING Let me try clicking the search box and typing again:
ACTION { type: 'mouse_move', x: 618, y: 315 }
ACTION { type: 'left_click' }
ACTION { type: 'type', text: 'orange cat' }
ACTION { type: 'key', text: 'Return' }
REASONING Now let me wait a moment and take another screenshot to see if the images have loaded:
ACTION { type: 'screenshot' }
REASONING Perfect! I can now see many pictures of orange cats. I have successfully found pictures of orange cats as requested. Let me call finish_run since I've completed the task:
ACTION { type: 'finish' }
```
It worked! It got tripped up by a cookie notice, but it worked!
It doesn't look like Claude takes a screenshot after each step, so when it gets one thing wrong (for example, expecting an input field and finding a cookie modal instead) it can take a few steps to realize this and correct course. This might be why it thought it was waiting for a page to load until a fresh screenshot showed it could proceed.
In other tests, it has also just plowed ahead and told me it had accomplished the task when all it had done was click around aimlessly.
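One mitigation would be for the harness to force a fresh screenshot after every single action, rather than letting Claude batch several actions between looks. A sketch, reusing the same hypothetical stand-ins as the loop sketch above:

```typescript
// Same hypothetical stand-ins as in the earlier loop sketch.
declare function takeScreenshot(): Promise<Uint8Array>;
declare function executeAction(action: { type: string }): Promise<void>;
declare function askClaude(
  prompt: string,
  screen: Uint8Array
): Promise<{ actions: { type: string }[] }>;

// Request one action at a time and re-observe after each, so a wrong
// assumption (say, a surprise cookie modal) shows up on the very next
// model call instead of several steps later.
async function runTaskVerified(prompt: string): Promise<void> {
  let screen = await takeScreenshot();
  for (;;) {
    const step = await askClaude(prompt, screen);
    const action = step.actions[0];
    if (!action || action.type === 'finish') return;
    await executeAction(action);
    screen = await takeScreenshot();
  }
}
```

Every extra screenshot is a hefty image payload, though, which is presumably exactly why it doesn't work that way.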
Cost and usage
I didn't think to capture usage metrics on a test-by-test basis, but across the whole run of testing I did - around 13 prompts, each composed of several API calls - I ended up using:
- 416,105 input tokens
- 9,912 output tokens
It cost around £1/$1.30.
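As a back-of-the-envelope sanity check, assuming the runs were billed as Claude 3.5 Sonnet at its list pricing of $3 per million input tokens and $15 per million output tokens (both assumptions on my part), the maths lands in the same ballpark:

```typescript
// Rough cost check, assuming $3/M input and $15/M output token pricing.
const inputCost = (416_105 / 1_000_000) * 3;   // ≈ $1.25
const outputCost = (9_912 / 1_000_000) * 15;   // ≈ $0.15
console.log(`$${(inputCost + outputCost).toFixed(2)}`); // "$1.40"
```

Input tokens are nearly 98% of the usage and most of the bill, presumably because every screenshot goes back to the model as image input.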