Commit 74da850

Merge pull request #3547 from aahill/cua
Computer use article
2 parents fba94f4 + bcf148a commit 74da850

2 files changed: +363 −0

articles/ai-services/openai/how-to/computer-use.md

Lines changed: 360 additions & 0 deletions

@@ -0,0 +1,360 @@
---
title: 'Computer Use (preview) in Azure OpenAI'
titleSuffix: Azure OpenAI
description: Learn about Computer Use in Azure OpenAI, which allows AI to interact with computer applications.
manager: nitinme
ms.service: azure-ai-openai
ms.topic: how-to
ms.date: 03/14/2025
author: aahill
ms.author: aahi
---

# Computer Use (preview) in Azure OpenAI

Use this article to learn how to work with Computer Use in Azure OpenAI. Computer Use is an AI tool that uses a specialized model to perform tasks by interacting with computer systems and applications through their UIs. With Computer Use, you can create an agent that handles complex tasks and makes decisions by interpreting visual elements and taking action based on on-screen content.

Computer Use provides:

* **Autonomous navigation**: For example, the model can open applications, click buttons, fill out forms, and navigate multi-page workflows.
* **Dynamic adaptation**: The model interprets UI changes and adjusts its actions accordingly.
* **Cross-application task execution**: The model operates across web-based and desktop applications.
* **Natural language interface**: Users can describe a task in plain language, and the Computer Use model determines the correct UI interactions to execute.

## Request access

Access to Computer Use is limited. You need to fill out the [access request form](https://aka.ms/oai/cuaaccess) before you can start using the model.

## Regional support

Computer Use is available in the following regions:

* `eastus2`
* `swedencentral`
* `southindia`

## Sending an API call to the Computer Use model using the Responses API

The Computer Use tool is accessed through the Responses API. The tool operates in a continuous loop: the model sends actions such as typing text or performing a click, your code executes those actions on a computer, and you send screenshots of the outcome back to the model.

In this way, your code simulates the actions of a human using a computer interface, while the model uses the screenshots to understand the state of the environment and suggest the next actions.

The following examples show a basic API call.

> [!NOTE]
> You need an Azure OpenAI resource with a `computer-use-preview` model deployment in a [supported region](#regional-support).

## [Python](#tab/python)

To send requests, you need to install the following Python packages.

```console
pip install openai
pip install azure-identity
```

```python
import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_ad_token_provider=token_provider,
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2025-03-01-preview"
)

response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer-preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser" # other possible values: "mac", "windows", "ubuntu"
    }],
    input=[
        {
            "role": "user",
            "content": "Check the latest AI news on bing.com."
        }
    ],
    truncation="auto"
)

print(response.output)
```

### Output

```console
[
    ResponseComputerToolCall(
        id='cu_67d841873c1081908bfc88b90a8555e0',
        action=ActionScreenshot(type='screenshot'),
        call_id='call_wwEnfFDqQr1Z4Edk62Fyo7Nh',
        pending_safety_checks=[],
        status='completed',
        type='computer_call'
    )
]
```

## [REST API](#tab/rest-api)

```bash
curl ${MY_ENDPOINT}/openai/responses?api-version=2025-03-01-preview \
  -H "Content-Type: application/json" \
  -H "api-key: $MY_API_KEY" \
  -d '{
    "model": "computer-use-preview",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": "Check the latest AI news on bing.com."
      }
    ],
    "tools": [{
      "type": "computer-preview",
      "display_width": 1024,
      "display_height": 768,
      "environment": "browser"
    }],
    "truncation": "auto"
  }'
```

### Output

```json
{
  "id": "resp_xxxxxxxxxxxxxxxxxxxxxxxx",
  "object": "response",
  "created_at": 1742227653,
  "status": "completed",
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "max_output_tokens": null,
  "model": "computer-use-preview",
  "output": [
    {
      "type": "computer_call",
      "id": "cu_xxxxxxxxxxxxxxxxxxxxxxxxxx",
      "call_id": "call_xxxxxxxxxxxxxxxxxxxxxxx",
      "action": {
        "type": "screenshot"
      },
      "pending_safety_checks": [],
      "status": "completed"
    }
  ],
  "parallel_tool_calls": true,
  "previous_response_id": null,
  "reasoning": {
    "effort": "medium",
    "generate_summary": null
  },
  "store": true,
  "temperature": 1.0,
  "text": {
    "format": {
      "type": "text"
    }
  },
  "tools": [
    {
      "type": "computer_use_preview",
      "display_height": 768,
      "display_width": 1024,
      "environment": "browser"
    }
  ],
  "top_p": 1.0,
  "truncation": "auto",
  "usage": {
    "input_tokens": 519,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 7,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 526
  },
  "user": null,
  "metadata": {}
}
```

---

Once the initial API request is sent, you perform a loop where the specified action is performed in your application code, sending a screenshot with each turn so the model can evaluate the updated state of the environment.

## [Python](#tab/python)

```python
# response.output is the previous response from the model
computer_calls = [item for item in response.output if item.type == "computer_call"]
if not computer_calls:
    print("No computer call found. Output from model:")
    for item in response.output:
        print(item)

computer_call = computer_calls[0]
last_call_id = computer_call.call_id
action = computer_call.action

# Your application would now perform the action suggested by the model,
# and capture a screenshot of the updated state of the environment
# before sending another request.

response_2 = client.responses.create(
    model="computer-use-preview",
    previous_response_id=response.id,
    tools=[{
        "type": "computer-preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser" # other possible values: "mac", "windows", "ubuntu"
    }],
    input=[
        {
            "call_id": last_call_id,
            "type": "computer_call_output",
            "output": {
                "type": "input_image",
                # Image should be in base64
                "image_url": f"data:image/png;base64,{<base64_string>}"
            }
        }
    ],
    truncation="auto"
)
```
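
The exact code that executes each suggested action depends on your environment. As a minimal sketch, the following hypothetical handlers use [Playwright](https://playwright.dev/python/) to drive a browser `page`; Playwright is an assumption for illustration, and any automation layer that can click, type, and take screenshots works.

```python
# Minimal sketch of executing a suggested action and capturing a screenshot.
# Assumes Playwright's sync API (pip install playwright) and an open `page`.
import base64

def handle_action(page, action):
    """Perform the model's suggested action in the browser."""
    if action.type == "click":
        page.mouse.click(action.x, action.y)
    elif action.type == "type":
        page.keyboard.type(action.text)
    elif action.type == "screenshot":
        pass  # no UI change; a fresh screenshot is captured below either way
    # ... handle "scroll", "keypress", "wait", and other action types here

def take_screenshot(page):
    """Return the current page as a base64-encoded PNG string."""
    return base64.b64encode(page.screenshot()).decode("utf-8")
```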

## [REST API](#tab/rest-api)

```bash
curl ${MY_ENDPOINT}/openai/responses?api-version=2025-03-01-preview \
  -H "Content-Type: application/json" \
  -H "api-key: $MY_API_KEY" \
  -d '{
    "model": "computer-use-preview",
    "previous_response_id": "<previous_response_id>",
    "tools": [{
      "type": "computer-preview",
      "display_width": 1024,
      "display_height": 768,
      "environment": "browser"
    }],
    "input": [
      {
        "call_id": "<last_call_id>",
        "type": "computer_call_output",
        "output": {
          "type": "input_image",
          "image_url": "<base64_string>"
        }
      }
    ],
    "truncation": "auto"
  }'
```

---

## Understanding the Computer Use integration

When working with the Computer Use tool, you typically perform the following steps to integrate it into your application; a combined sketch follows this list.

1. Send a request to the model that includes a call to the computer use tool, and the display size and environment. You can also include a screenshot of the initial state of the environment in the first API request.
1. Receive a response from the model. If the response has `action` items, those items contain suggested actions to make progress toward the specified goal. For example, an action might be `screenshot` so the model can assess the current state with an updated screenshot, or `click` with X/Y coordinates indicating where the mouse should be moved.
1. Execute the action using your application code on your computer or browser environment.
1. After executing the action, capture the updated state of the environment as a screenshot.
1. Send a new request with the updated state as a `computer_call_output`, and repeat this loop until the model stops requesting actions or you decide to stop.
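
The following minimal sketch ties these steps together into a single loop. It reuses the `client` from the earlier examples; the `page` object and the `handle_action` and `take_screenshot` helpers are hypothetical stand-ins for your own automation code (such as the Playwright-based sketch shown earlier). Safety check handling is omitted here; see the [Safety checks](#safety-checks) section.

```python
# Minimal sketch of the Computer Use loop, under the assumptions noted above.
response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer-preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser"
    }],
    input=[{"role": "user", "content": "Check the latest AI news on bing.com."}],
    truncation="auto"
)

while True:
    computer_calls = [item for item in response.output if item.type == "computer_call"]
    if not computer_calls:
        break  # the model stopped requesting actions

    call = computer_calls[0]
    handle_action(page, call.action)        # step 3: execute the action
    screenshot_b64 = take_screenshot(page)  # step 4: capture the updated state

    response = client.responses.create(     # step 5: send the updated state
        model="computer-use-preview",
        previous_response_id=response.id,
        tools=[{
            "type": "computer-preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        }],
        input=[{
            "call_id": call.call_id,
            "type": "computer_call_output",
            "output": {
                "type": "input_image",
                "image_url": f"data:image/png;base64,{screenshot_b64}"
            }
        }],
        truncation="auto"
    )
```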

## Handling conversation history

You can use the `previous_response_id` parameter to link the current request to the previous response. Using this parameter is recommended if you don't want to manage the conversation history yourself.

If you don't use this parameter, make sure to include all the items returned in the response output of the previous request in your input array. This includes reasoning items if present.
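
For example, a sketch of managing history manually might look like the following. It reuses `client`, `response`, and `last_call_id` from the earlier examples; `screenshot_b64` is a hypothetical base64-encoded screenshot produced by your automation code.

```python
# Sketch of carrying conversation history forward manually instead of
# using previous_response_id. All previous output items (including any
# reasoning items) are passed back verbatim in the next request's input.
history = [{"role": "user", "content": "Check the latest AI news on bing.com."}]
history += response.output  # include computer_call and reasoning items
history.append({
    "call_id": last_call_id,
    "type": "computer_call_output",
    "output": {
        "type": "input_image",
        "image_url": f"data:image/png;base64,{screenshot_b64}"
    }
})

response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer-preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser"
    }],
    input=history,  # full history instead of previous_response_id
    truncation="auto"
)
```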

## Safety checks

The API has safety checks to help protect against prompt injection and model mistakes. These checks include:

* **Malicious instruction detection**: The system evaluates the screenshot image and checks whether it contains adversarial content that might change the model's behavior.
* **Irrelevant domain detection**: The system evaluates the `current_url` (if provided) and checks whether the current domain is relevant given the conversation history.
* **Sensitive domain detection**: The system checks the `current_url` (if provided) and raises a warning when it detects that the user is on a sensitive domain.
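
The two URL-based checks can only run if you supply the page URL. As a sketch, the following shows passing `current_url` alongside a `computer_call_output`; the `page.url` value is an assumption carried over from the earlier Playwright sketch.

```python
# Sketch of including current_url so the domain-based safety checks can
# evaluate the page the agent is on. The field is optional.
response = client.responses.create(
    model="computer-use-preview",
    previous_response_id=response.id,
    tools=[{
        "type": "computer-preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser"
    }],
    input=[{
        "call_id": last_call_id,
        "type": "computer_call_output",
        "current_url": page.url,  # e.g., from Playwright; an assumption
        "output": {
            "type": "input_image",
            "image_url": f"data:image/png;base64,{screenshot_b64}"
        }
    }],
    truncation="auto"
)
```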

If one or more of these checks is triggered, a safety check is raised when the model returns the next `computer_call`, with the `pending_safety_checks` parameter.

```json
"output": [
    {
        "type": "reasoning",
        "id": "rs_67cb...",
        "summary": [
            {
                "type": "summary_text",
                "text": "Exploring 'File' menu option."
            }
        ]
    },
    {
        "type": "computer_call",
        "id": "cu_67cb...",
        "call_id": "call_nEJ...",
        "action": {
            "type": "click",
            "button": "left",
            "x": 135,
            "y": 193
        },
        "pending_safety_checks": [
            {
                "id": "cu_sc_67cb...",
                "code": "malicious_instructions",
                "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed."
            }
        ],
        "status": "completed"
    }
]
```

You need to pass the safety checks back as `acknowledged_safety_checks` in the next request in order to proceed.

```json
"input": [
    {
        "type": "computer_call_output",
        "call_id": "<call_id>",
        "acknowledged_safety_checks": [
            {
                "id": "<safety_check_id>",
                "code": "malicious_instructions",
                "message": "We've detected instructions that may cause your application to perform malicious or unauthorized actions. Please acknowledge this warning if you'd like to proceed."
            }
        ],
        "output": {
            "type": "computer_screenshot",
            "image_url": "<image_url>"
        }
    }
],
```

### Safety check handling

In all cases where `pending_safety_checks` are returned, actions should be handed over to the end user to confirm proper model behavior and accuracy.

* `malicious_instructions` and `irrelevant_domain`: end users should review model actions and confirm that the model is behaving as intended.
* `sensitive_domain`: ensure an end user is actively monitoring the model actions on these sites. Exact implementation of this "watch mode" can vary by application, but a potential example could be collecting user impression data on the site to make sure there is active end user engagement with the application.
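
As a sketch, acknowledging a safety check in Python after the end user confirms might look like the following. The `confirm_with_user` helper is hypothetical; how the warning is surfaced to the end user is up to your application.

```python
# Sketch of acknowledging pending safety checks after explicit user approval.
# `computer_call` is the model's computer_call item from the previous response.
checks = computer_call.pending_safety_checks
if checks and not confirm_with_user(checks):  # hypothetical user prompt
    raise SystemExit("User declined the safety check; stopping the loop.")

response = client.responses.create(
    model="computer-use-preview",
    previous_response_id=response.id,
    tools=[{
        "type": "computer-preview",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser"
    }],
    input=[{
        "call_id": computer_call.call_id,
        "type": "computer_call_output",
        "acknowledged_safety_checks": [
            {"id": check.id, "code": check.code, "message": check.message}
            for check in checks
        ],
        "output": {
            "type": "input_image",
            "image_url": f"data:image/png;base64,{screenshot_b64}"  # hypothetical
        }
    }],
    truncation="auto"
)
```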

## See also

* [Responses API](./responses.md)

articles/ai-services/openai/toc.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -128,6 +128,9 @@ items:
   - name: GPT-35-Turbo & GPT-4
     href: ./how-to/chatgpt.md
     displayName: ChatGPT, chatgpt
+  - name: Computer Use
+    href: ./how-to/computer-use.md
+    displayName: cua, computer using model
   - name: Vision-enabled chats
     href: ./how-to/gpt-with-vision.md
   - name: DALL-E
```
