Commit 9532750

Merge pull request #6720 from aahill/sept-tools

computer use tool

2 parents 6a02e84 + 98ad2a8 commit 9532750

File tree: 4 files changed, +398 −0 lines changed

Lines changed: 222 additions & 0 deletions
---
title: 'How to use the Computer Use Tool'
titleSuffix: Azure AI Foundry
description: Find code samples and instructions for using the Computer Use model in the Azure AI Foundry Agent Service.
services: cognitive-services
manager: nitinme
ms.service: azure-ai-agent-service
ms.topic: how-to
ms.date: 08/22/2025
author: aahill
ms.author: aahi
---
# How to use the Computer Use Tool

Use this article to learn how to use the Computer Use tool with the Azure AI Projects SDK.
## Prerequisites

* The requirements in the [Computer Use Tool overview](./computer-use.md).
* Your Azure AI Foundry Project endpoint.

    [!INCLUDE [endpoint-string-portal](../../includes/endpoint-string-portal.md)]

    Save this endpoint to an environment variable named `PROJECT_ENDPOINT`.
* The deployment name of your Computer Use model. You can find it in **Models + Endpoints** in the left navigation menu.

    :::image type="content" source="../../media/tools/computer-use-model-deployment.png" alt-text="A screenshot showing the model deployment screen in the AI Foundry portal." lightbox="../../media/tools/computer-use-model-deployment.png":::

    Save your model's deployment name as an environment variable named `COMPUTER_USE_MODEL_DEPLOYMENT_NAME`.
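    For example, on Windows you could set both variables from a command prompt. This is a sketch with placeholder values; substitute your own endpoint and deployment name. Note that values set with `setx` are only picked up by newly opened terminal windows:

    ```console
    setx PROJECT_ENDPOINT "https://<your-resource>.services.ai.azure.com/api/projects/<your-project>"
    setx COMPUTER_USE_MODEL_DEPLOYMENT_NAME "<your-computer-use-deployment-name>"
    ```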
* Before using the tool, you need to set up an environment that can capture screenshots and execute the actions recommended by the agent. For safety reasons, we recommend using a sandboxed environment, such as Playwright.

The Computer Use tool requires the latest prerelease versions of the `azure-ai-projects` and `azure-ai-agents` libraries. First, we recommend creating a [virtual environment](https://docs.python.org/3/library/venv.html) to work in:

```console
python -m venv env
# after creating the virtual environment, activate it with:
.\env\Scripts\activate
```
You can install the packages with the following command:

```console
pip install --pre azure-ai-projects azure-identity azure-ai-agents
```
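If you want to confirm that prerelease versions were installed, you can check the reported versions with `pip show`:

```console
pip show azure-ai-projects azure-ai-agents
```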
## Code example

The following code sample shows a basic API request. After the initial request is sent, you perform a loop in which your application code executes the specified action and sends a screenshot with each turn, so the model can evaluate the updated state of the environment. You can see an example integration for a similar API in the [Azure OpenAI documentation](../../../openai/how-to/computer-use.md#playwright-integration).
```python
import os, time, base64
from typing import List
from azure.ai.agents.models._models import ComputerScreenshot, TypeAction
from azure.ai.projects import AIProjectClient
from azure.ai.agents.models import (
    MessageRole,
    RunStepToolCallDetails,
    RunStepComputerUseToolCall,
    ComputerUseTool,
    ComputerToolOutput,
    MessageInputContentBlock,
    MessageImageUrlParam,
    MessageInputTextBlock,
    MessageInputImageUrlBlock,
    RequiredComputerUseToolCall,
    SubmitToolOutputsAction,
)
from azure.identity import DefaultAzureCredential

def image_to_base64(image_path: str) -> str:
    """
    Convert an image file to a Base64-encoded string.

    :param image_path: The path to the image file (e.g. 'image_file.png')
    :return: A Base64-encoded string representing the image.
    :raises FileNotFoundError: If the provided file path does not exist.
    :raises OSError: If there's an error reading the file.
    """
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"File not found at: {image_path}")

    try:
        with open(image_path, "rb") as image_file:
            file_data = image_file.read()
        return base64.b64encode(file_data).decode("utf-8")
    except Exception as exc:
        raise OSError(f"Error reading file '{image_path}'") from exc

asset_file_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "../assets/cua_screenshot.jpg"))
action_result_file_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "../assets/cua_screenshot_next.jpg"))
project_client = AIProjectClient(endpoint=os.environ["PROJECT_ENDPOINT"], credential=DefaultAzureCredential())

# Initialize the Computer Use tool with a browser-sized viewport
environment = os.environ.get("COMPUTER_USE_ENVIRONMENT", "windows")
computer_use = ComputerUseTool(display_width=1026, display_height=769, environment=environment)

with project_client:

    agents_client = project_client.agents

    # Create a new agent that has the Computer Use tool attached.
    agent = agents_client.create_agent(
        model=os.environ["COMPUTER_USE_MODEL_DEPLOYMENT_NAME"],
        name="my-agent-computer-use",
        instructions="""
        You are a computer automation assistant.
        Use the computer_use_preview tool to interact with the screen when needed.
        """,
        tools=computer_use.definitions,
    )
    print(f"Created agent, ID: {agent.id}")

    # Create a thread for communication
    thread = agents_client.threads.create()
    print(f"Created thread, ID: {thread.id}")

    input_message = (
        "I can see a web browser with bing.com open and the cursor in the search box. "
        "Type 'movies near me' without pressing Enter or any other key. Only type 'movies near me'."
    )
    image_base64 = image_to_base64(asset_file_path)
    img_url = f"data:image/jpeg;base64,{image_base64}"
    url_param = MessageImageUrlParam(url=img_url, detail="high")
    content_blocks: List[MessageInputContentBlock] = [
        MessageInputTextBlock(text=input_message),
        MessageInputImageUrlBlock(image_url=url_param),
    ]
    # Create a message on the thread
    message = agents_client.messages.create(thread_id=thread.id, role=MessageRole.USER, content=content_blocks)
    print(f"Created message, ID: {message.id}")

    run = agents_client.runs.create(thread_id=thread.id, agent_id=agent.id)
    print(f"Created run, ID: {run.id}")

    # Create a fake screenshot showing the text typed in, to send back as the action result
    result_image_base64 = image_to_base64(action_result_file_path)
    result_img_url = f"data:image/jpeg;base64,{result_image_base64}"
    computer_screenshot = ComputerScreenshot(image_url=result_img_url)

    while run.status in ["queued", "in_progress", "requires_action"]:
        time.sleep(1)
        run = agents_client.runs.get(thread_id=thread.id, run_id=run.id)

        if run.status == "requires_action" and isinstance(run.required_action, SubmitToolOutputsAction):
            print("Run requires action:")
            tool_calls = run.required_action.submit_tool_outputs.tool_calls
            if not tool_calls:
                print("No tool calls provided - cancelling run")
                agents_client.runs.cancel(thread_id=thread.id, run_id=run.id)
                break

            tool_outputs = []
            for tool_call in tool_calls:
                if isinstance(tool_call, RequiredComputerUseToolCall):
                    print(tool_call)
                    try:
                        action = tool_call.computer_use_preview.action
                        print(f"Executing computer use action: {action.type}")
                        if isinstance(action, TypeAction):
                            print(f"  Text to type: {action.text}")
                            # (add hook to input text in managed environment API here)

                            tool_outputs.append(
                                ComputerToolOutput(tool_call_id=tool_call.id, output=computer_screenshot)
                            )
                        if isinstance(action, ComputerScreenshot):
                            print("  Screenshot requested")
                            # (add hook to take screenshot in managed environment API here)

                            tool_outputs.append(
                                ComputerToolOutput(tool_call_id=tool_call.id, output=computer_screenshot)
                            )
                    except Exception as e:
                        print(f"Error executing tool_call {tool_call.id}: {e}")

            print(f"Tool outputs: {tool_outputs}")
            if tool_outputs:
                agents_client.runs.submit_tool_outputs(thread_id=thread.id, run_id=run.id, tool_outputs=tool_outputs)

        print(f"Current run status: {run.status}")

    print(f"Run completed with status: {run.status}")
    if run.status == "failed":
        print(f"Run failed: {run.last_error}")

    # Fetch run steps to get the details of the agent run
    run_steps = agents_client.run_steps.list(thread_id=thread.id, run_id=run.id)
    for step in run_steps:
        print(f"Step {step.id} status: {step.status}")
        print(step)

        if isinstance(step.step_details, RunStepToolCallDetails):
            print("  Tool calls:")
            run_step_tool_calls = step.step_details.tool_calls

            for call in run_step_tool_calls:
                print(f"    Tool call ID: {call.id}")
                print(f"    Tool call type: {call.type}")

                if isinstance(call, RunStepComputerUseToolCall):
                    details = call.computer_use_preview
                    print(f"    Computer use action type: {details.action.type}")

                print()  # extra newline between tool calls

        print()  # extra newline between run steps

    # Optional: Delete the agent once the run is finished.
    agents_client.delete_agent(agent.id)
    print("Deleted agent")
```
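The sample above leaves two hooks where your own environment performs the action and captures the result. As a hedged illustration only, if you use Playwright (the sandboxed environment recommended earlier) as your managed environment, those hooks might look like the following sketch; the URL, viewport, and typed text mirror this article's example scenario and are assumptions, not part of the sample:

```python
# A minimal sketch of the two environment hooks, assuming Playwright.
import base64
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1026, "height": 769})
    page.goto("https://www.bing.com")

    # Hook for a TypeAction: type the text the model requested.
    page.keyboard.type("movies near me")

    # Hook for a screenshot request: capture the updated state to send back.
    screenshot_bytes = page.screenshot(type="jpeg")
    result_img_url = "data:image/jpeg;base64," + base64.b64encode(screenshot_bytes).decode("utf-8")

    browser.close()
```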
## Next steps

* [Python agent samples](https://github.com/azure-ai-foundry/foundry-samples/tree/main/samples/microsoft/python/getting-started-agents)
* [Azure OpenAI Computer Use example Playwright integration](../../../openai/how-to/computer-use.md#playwright-integration)
* The Azure OpenAI API has implementation differences compared to the Agent Service, and these examples may need to be adapted to work with agents.
Lines changed: 169 additions & 0 deletions
---
title: 'How to use Azure AI Foundry Agent Service Computer Use Tool'
titleSuffix: Azure AI Foundry
description: Learn how to use the Azure AI Foundry Agent Service Computer Use Tool.
services: cognitive-services
manager: nitinme
ms.service: azure-ai-agent-service
ms.topic: how-to
ms.date: 09/09/2025
author: aahill
ms.author: aahi
ms.custom: references_regions
---
# Azure AI Foundry Agent Service Computer Use Tool

> [!WARNING]
> The Computer Use tool comes with significant additional security and privacy risks, including prompt injection attacks. Learn more about intended uses, capabilities, limitations, risks, and considerations when choosing a use case in the [Azure OpenAI transparency note](../../../responsible-ai/openai/transparency-note.md).

Use this article to learn how to work with the Computer Use Tool in Azure AI Foundry Agent Service. Computer Use is an AI tool that uses a specialized model to perform tasks by interacting with computer systems and applications through their user interfaces. With Computer Use, you can create an agent that can handle complex tasks and make decisions by interpreting visual elements and taking action based on on-screen content.
## Features

* Autonomous navigation: for example, opening applications, clicking buttons, filling out forms, and navigating multi-page workflows.
* Dynamic adaptation: interpreting UI changes and adjusting actions accordingly.
* Cross-application task execution: operating across web-based and desktop applications.
* Natural language interface: users can describe a task in plain language, and the Computer Use model determines the correct UI interactions to execute.
## Request access

For access to the `computer-use-preview` model, registration is required, and access will be granted based on Microsoft's eligibility criteria. Customers who have access to other limited access models still need to request access for this model.

To request access, see the [application form](https://aka.ms/oai/cuaaccess).

Once access has been granted, you will need to create a deployment for the model.
## Differences between Browser Automation and Computer Use

The following table lists some of the differences between the Computer Use Tool and the [Browser Automation](./browser-automation.md) Tool.

| Feature | Browser Automation | Computer Use Tool |
|--------------------------------|-----------------------------|----------------------------|
| Model support | All GPT models | `computer-use-preview` model only |
| Can I visualize what's happening? | No | Yes |
| How it understands the screen | Parses the HTML or XML pages into DOM documents | Raw pixel data from screenshots |
| How it acts | A list of actions provided by the model | Virtual keyboard and mouse |
| Is it multi-step? | Yes | Yes |
| Interfaces | Browser | Computer and browser |
| Do I need to bring my own resource? | Your own Playwright resource with the keys stored as a connection. | No additional resource is required, but we highly recommend running this tool in a sandboxed environment. |
## Regional support

To use the Computer Use Tool, you need a [Computer Use model](../../../foundry-models/concepts/models-sold-directly-by-azure.md#computer-use-preview) deployment. The Computer Use model is available in the following regions:

* `eastus2`
* `swedencentral`
* `southindia`
## Understanding the Computer Use integration

When working with the Computer Use tool, you typically perform the following steps to integrate it into your application.

1. Send a request to the model that includes a call to the Computer Use tool, along with the display size and environment. You can also include a screenshot of the initial state of the environment in the first API request.
1. Receive a response from the model. If the response has action items, those items contain suggested actions to make progress toward the specified goal. For example, an action might be `screenshot`, so the model can assess the current state with an updated screenshot, or `click` with X/Y coordinates indicating where the mouse should be moved.
1. Execute the action using your application code on your computer or browser environment.
1. After executing the action, capture the updated state of the environment as a screenshot.
1. Send a new request with the updated state as a `tool_call_output`, and repeat this loop until the model stops requesting actions or you decide to stop. A sketch of this loop appears after the following note.

> [!NOTE]
> Before using the tool, you need to set up an environment that can capture screenshots and execute the actions recommended by the agent. For safety reasons, we recommend using a sandboxed environment, such as Playwright.
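The following is a minimal, hypothetical skeleton of that loop. Every helper function here is an illustrative placeholder for your own client and sandbox code, not part of the SDK:

```python
# Hypothetical skeleton of the Computer Use loop described above. Each
# helper is a stub you replace with your own client and sandbox code
# (for example, Playwright for screenshots and input).

def take_screenshot() -> bytes:
    """Capture the current state of the sandboxed environment."""
    raise NotImplementedError

def send_request(goal: str, screenshot: bytes):
    """Step 1: send the task and initial screenshot to the model."""
    raise NotImplementedError

def get_suggested_actions(response) -> list:
    """Step 2: pull suggested actions (screenshot, click, type, ...) off the response."""
    raise NotImplementedError

def execute_action(action) -> None:
    """Step 3: perform the action in your computer or browser environment."""
    raise NotImplementedError

def send_tool_call_output(screenshot: bytes):
    """Step 5: return the updated state as a tool_call_output."""
    raise NotImplementedError

def run_loop(goal: str) -> None:
    response = send_request(goal, take_screenshot())
    while actions := get_suggested_actions(response):          # loop until no more actions
        for action in actions:
            execute_action(action)                             # step 3
        response = send_tool_call_output(take_screenshot())    # steps 4-5
```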
## Handling conversation history

You can use the `tool_call_id` parameter to link the current request to the previous response. Using this parameter is recommended if you don't want to manage the conversation history yourself.

If you don't use this parameter, make sure to include all the items returned in the response output of the previous request in your inputs array, including reasoning items if present.
## Safety checks

> [!WARNING]
> Computer Use carries substantial security and privacy risks, and you are responsible for its use. Both errors in judgment by the AI and the presence of malicious or confusing instructions on web pages, desktops, or other operating environments the AI encounters may cause it to execute commands you or others do not intend, which could compromise the security of your or other users' browsers, computers, and any accounts to which the AI has access, including personal, financial, or enterprise systems.
>
> We strongly recommend using the Computer Use tool on virtual machines with no access to sensitive data or critical resources. Learn more about intended uses, capabilities, limitations, risks, and considerations when choosing a use case in the [Azure OpenAI transparency note](../../../responsible-ai/openai/transparency-note.md).

The API has safety checks to help protect against prompt injection and model mistakes. These checks include:

* **Malicious instruction detection**: The system evaluates the screenshot image and checks if it contains adversarial content that might change the model's behavior.
* **Irrelevant domain detection**: The system evaluates the `current_url` parameter (if provided) and checks if the current domain is considered relevant given the conversation history.
* **Sensitive domain detection**: The system checks the `current_url` parameter (if provided) and raises a warning when it detects the user is on a sensitive domain.

If one or more of these checks is triggered, a safety check is raised when the model returns the next `computer_call`, with the `pending_safety_checks` parameter.
```json
"output": [
    {
        "type": "reasoning",
        "id": "rs_67cb...",
        "summary": [
            {
                "type": "summary_text",
                "text": "Exploring 'File' menu option."
            }
        ]
    },
    {
        "type": "computer_call",
        "id": "cu_67cb...",
        "call_id": "call_nEJ...",
        "action": {
            "type": "click",
            "button": "left",
            "x": 135,
            "y": 193
        },
        "pending_safety_checks": [
            {
                "id": "cu_sc_67cb...",
                "code": "malicious_instructions",
                "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed."
            }
        ],
        "status": "completed"
    }
]
```
You need to pass the safety checks back as `acknowledged_safety_checks` in the next request to proceed.

```json
"input": [
    {
        "type": "computer_call_output",
        "call_id": "<call_id>",
        "acknowledged_safety_checks": [
            {
                "id": "<safety_check_id>",
                "code": "malicious_instructions",
                "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed."
            }
        ],
        "output": {
            "type": "computer_screenshot",
            "image_url": "<image_url>"
        }
    }
]
```
## Safety check handling

In all cases where `pending_safety_checks` are returned, actions should be handed over to the end user to confirm proper model behavior and accuracy.

* `malicious_instructions` and `irrelevant_domain`: end users should review model actions and confirm that the model is behaving as intended.
* `sensitive_domain`: ensure an end user is actively monitoring the model's actions on these sites. Exact implementation of this "watch mode" can vary by application, but a potential example could be collecting user impression data on the site to make sure there is active end user engagement with the application.
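As a hedged illustration only, a client might surface a pending safety check to the end user before acknowledging it. The check shape below mirrors the JSON example earlier in this article; how you read it from your client objects depends on the SDK surface you use:

```python
# Illustrative sketch: pause for explicit user confirmation before
# acknowledging a pending safety check. The check dictionaries mirror the
# JSON example above; adapt the field access to your client library.

def confirm_safety_checks(pending_safety_checks: list) -> bool:
    """Show each pending check to the user and ask whether to proceed."""
    for check in pending_safety_checks:
        print(f"Safety check [{check['code']}]: {check['message']}")
    answer = input("Acknowledge these checks and proceed? (y/n) ")
    return answer.strip().lower() == "y"

# Example usage against a parsed `computer_call` output item (sample data):
checks = [{"code": "malicious_instructions", "message": "We've detected instructions..."}]
if checks and not confirm_safety_checks(checks):
    raise SystemExit("User declined the safety checks; stopping the run.")
```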
## Next steps

* [Computer Use code samples](./computer-use-samples.md)
(Binary file added: 159 KB)
