Skip to content

Commit f1436ac

Browse files
committed
Rename Examples to examples (case-sensitive fix)
1 parent ab74b56 commit f1436ac

File tree

5 files changed

+902
-0
lines changed

5 files changed

+902
-0
lines changed
260 KB
Loading
Lines changed: 251 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,251 @@
1+
---
2+
title: Computer Use Agent
3+
---
4+
5+
# Computer Use Agents
6+
7+
<div class="subtitle">
8+
Test your Computer Use agent with <code>testing</code>
9+
</div>
10+
11+
Anthropic has recently announced a [Computer Use Agent](https://docs.anthropic.com/en/docs/build-with-claude/computer-use), an AI Agent capable
12+
of interacting with a computer desktop environment. For this example, we prompt the agent to act as a QA engineer with the knowledge about the documentation of
13+
the Invariant SDK and the Invariant Explorer UI, and we ask it to perform tasks related to testing the agent.
14+
15+
## Running the example
16+
17+
You can run the example discussed in this notebook by running the following command in the root of the repository:
18+
19+
```bash
20+
poetry run invariant test sample_tests/demos/computer_use_agent.py --push --dataset_name computer_use_agent
21+
```
22+
23+
!!! note
24+
25+
If you want to run the example without sending the results to the Explorer UI, you can always run without the `--push` flag. You will still see the parts of the trace that fail
26+
as higihlighted in the terminal.
27+
28+
## Global assertions
29+
30+
There are often assertions that we always want to check for, and it should never be the case that the agent violates them, regardless of the input prompt.
31+
Each global assertion is a function that takes a `Trace` object and runs some assertions on it.
32+
33+
One such assertion is to make sure that the agent never clicks on the firefox hamburger menu on the right, which it ocassionally does as the agent thinks it may be part of the application.
34+
We can check this assertion by iterating over all the tool outputs that contain an image and checking that they do not contain both the text "New tab" and "New window" (which is high indicator that the agent clicked on the menu).
35+
36+
```python
37+
def does_not_click_on_firefox_menu(trace: Trace):
38+
"""Agent should not click on the firefox hamburger menu on the right."""
39+
for tool_out in trace.tool_outputs(data_type="image"):
40+
assert_false(tool_out["content"].ocr_contains_all("New tab", "New window"))
41+
```
42+
43+
Next, we can make sure that tool outputs do not contain `ModuleNotFoundError`, which typically indicates coding mistakes that the agent made.
44+
45+
```python
46+
def does_not_make_python_error(trace: Trace):
47+
"""Agent should not produce code that results in ModuleNotFoundError."""
48+
for tool_out in trace.messages(role="tool"):
49+
assert_false(tool_out["content"].contains("ModuleNotFoundError"))
50+
```
51+
52+
We also noticed that the agent often overwrites the existing files using the `create` command. We can add a check for that:
53+
54+
```python
55+
def does_not_make_file_edit_errors(trace: Trace):
56+
"""Given a trace, assert that the agent does not make a file edit error."""
57+
for tool_out in trace.tool_outputs():
58+
assert_false(tool_out["content"].contains("Cannot overwrite files using command `create`."))
59+
```
60+
61+
## Unit tests
62+
63+
Now we can write unit tests for specific test cases. We are going to give the agent a range of tasks - e.g. annotating a snippet, uploading a dataset using
64+
either SDK or a browser, etc.
65+
66+
### Task 1: Annotate the first comment in the snippet
67+
68+
<div class='tiles'>
69+
<a href="https://explorer.invariantlabs.ai/u/mbalunovic/computer_use_agent-1733382354/t/1" class='tile'>
70+
<span class='tile-title'>Open in Explorer →</span>
71+
<span class='tile-description'>See this example in the Invariant Explorer</span>
72+
</a>
73+
</div>
74+
75+
In the first test, we ask the agent to go to a snippet in the Explorer and annotate the first comment with the text "nice nice".
76+
We run the agent by calling `run_agent`, which runs the agent and returns a `Trace` object.
77+
78+
```python
79+
def test_annotation():
80+
trace = run_agent("""Go to this snippet https://explorer.invariantlabs.ai/trace/9d55fa77-18f5-4a3b-9f7f-deae06833c58
81+
and annotate the first comment with: "nice nice" """)
82+
83+
with trace.as_context():
84+
trace.run_assertions(global_asserts)
85+
assert_true(trace.messages(0)["content"].contains("nice nice"))
86+
87+
expect_true(max(F.frequency(
88+
F.filter(
89+
lambda x: "http" in x.value,
90+
F.map(lambda tc: tc["function"]["arguments"]["text"], trace.tool_calls({"arguments.action": "type", "name": "computer"}))
91+
)
92+
).values()) <= 1)
93+
94+
# assert that the last screenshot contains the text "annotated" and text "nice nice"
95+
last_screenshot = trace.messages(role="tool")[-1]["content"]
96+
assert_true(last_screenshot.ocr_contains_all("annotated", "nice nice"))
97+
```
98+
99+
We first use `F.map` to get the `text` argument from the `type` command and then filter only for the traces that contain the string `http` (so we know they refer to the URL)
100+
In the last part, we take the last screenshot and assert that it contains both "annotated" and "nice nice" using `ocr_contains_all` that uses Tesseract to perform OCR on the image.
101+
102+
### Task 2: Upload traces using UI
103+
104+
<div class='tiles'>
105+
<a href="https://explorer.invariantlabs.ai/u/mbalunovic/computer_use_agent-1733382354/t/2" class='tile'>
106+
<span class='tile-title'>Open in Explorer →</span>
107+
<span class='tile-description'>See this example in the Invariant Explorer</span>
108+
</a>
109+
</div>
110+
111+
In the second test, we ask the agent to upload a dataset consisting of 100 traces using a browser. Here, we only check the global assertions:
112+
113+
```python
114+
def test_firefox_menu():
115+
trace = run_agent("""upload a dataset of 100 traces using a browser""")
116+
with trace.as_context():
117+
trace.run_assertions(global_asserts)
118+
```
119+
120+
### Task 3: Empty dataset and upload traces using SDK
121+
122+
<div class='tiles'>
123+
<a href="https://explorer.invariantlabs.ai/u/mbalunovic/computer_use_agent-1733382354/t/3" class='tile'>
124+
<span class='tile-title'>Open in Explorer →</span>
125+
<span class='tile-description'>See this example in the Invariant Explorer</span>
126+
</a>
127+
</div>
128+
129+
Next test asks the agent to create an empty dataset and then upload 4 traces to it using the SDK.
130+
Here, in addition to global assertions, we also assert that the agent uses `str_replace_editor` command in which `file_text` argument
131+
contains `create_request_and_push_trace` string.
132+
133+
```python
134+
def test_food_dataset():
135+
trace = run_agent("""create an empty dataset "chats-about-food", then use sdk to push 4 different traces
136+
to it and then finally use sdk to update the metadata of the dataset to have "weather="snowy day" and "mood"="great"
137+
after that go to the UI and verify that there are 4 traces and metadata is good""")
138+
with trace.as_context():
139+
trace.run_assertions(global_asserts)
140+
assert_true(F.any(F.map(
141+
lambda x: x["function"]["arguments"]["file_text"].contains("create_request_and_push_trace"),
142+
trace.tool_calls(name="str_replace_editor"))))
143+
```
144+
145+
### Task 4: Using Anthropic SDK and creating a dataset
146+
147+
<div class='tiles'>
148+
<a href="https://explorer.invariantlabs.ai/u/mbalunovic/computer_use_agent-1733382354/t/4" class='tile'>
149+
<span class='tile-title'>Open in Explorer →</span>
150+
<span class='tile-description'>See this example in the Invariant Explorer</span>
151+
</a>
152+
</div>
153+
154+
In this test case we ask the agent to use Anthropic SDK to generate some traces and upload them to the Explorer using Invariant SDK.
155+
Here, we would like to assert that the dataset created using the SDK actually appears in the UI later on.
156+
157+
```python
158+
def test_anthropic():
159+
trace = run_agent("""use https://github.com/anthropics/anthropic-sdk-python to generate some traces and upload them
160+
to the explorer using invariant sdk. your ANTHROPIC_API_KEY is already set up with a valid key""")
161+
with trace.as_context():
162+
trace.run_assertions(global_asserts)
163+
164+
edit_tool_calls = trace.tool_calls(
165+
{"name": "str_replace_editor", "arguments.command": "create"}
166+
)
167+
file_text = edit_tool_calls[0]["function"]["arguments"]["file_text"]
168+
assert_true(file_text.contains_any("import anthropic", "from anthropic import"))
169+
170+
# Extract the dataset name from a tool output and check if it's in the last screenshot
171+
tool_outs = trace.messages(role="tool")
172+
dataset_name = F.match(r"Dataset: (\w+)", F.map(lambda x: x["content"], tool_outs), 1)[0]
173+
tool_out = trace.messages(role="tool")[-1]
174+
assert_true(tool_out["content"].ocr_contains(dataset_name))
175+
```
176+
177+
First, we have a simple assertion that checks whether the agent imports `anthropic` Python library in two different ways
178+
using `contains_any` function.
179+
180+
For this, we need two things:
181+
182+
1. Extract the dataset name from the tool output using a regex: `Dataset: (\w+)`, for instance `dataset_name` is `claude_examples`
183+
2. We can assert that the dataset name is present in the last screenshot using `ocr_contains` function.
184+
185+
### Task 5: FastAPI application
186+
187+
<div class='tiles'>
188+
<a href="https://explorer.invariantlabs.ai/u/mbalunovic/computer_use_agent-1733382354/t/5" class='tile'>
189+
<span class='tile-title'>Open in Explorer →</span>
190+
<span class='tile-description'>See this example in the Invariant Explorer</span>
191+
</a>
192+
</div>
193+
194+
In this test, we use the agent to create a FastAPI application with an endpoint that counts the number of words in a string.
195+
First, we assert that the agent does not run any bash command that results in a "Permission denied" error.
196+
Then, in the second part, we assert that the agent edits the same file in two different tool calls.
197+
198+
```python
199+
def test_code_agent_fastapi():
200+
trace = run_agent("""use fastapi to create a count_words api that receives a string and counts
201+
the number of words in it, then write a small client that tests it with a couple of different inputs""")
202+
203+
with trace.as_context():
204+
trace.run_assertions(global_asserts)
205+
206+
for tool_call, tool_out in trace.tool_pairs():
207+
assert_false(
208+
tool_call["function"]["name"] == "bash"
209+
and tool_out.get("content", "").contains("Permission denied")
210+
)
211+
212+
tool_calls = trace.tool_calls({"name": "str_replace_editor"})
213+
max_freq = max(F.frequency(F.map(lambda x: x["function"]["arguments"]["file_text"], tool_calls)).values())
214+
assert_true(max_freq <= 2, "At least 3 edits to the same file with the same text")
215+
```
216+
217+
First, we find all pairs of tool calls and tool outputs and assert that the content of the tool output corresponding to a `bash` command does not contain `Permission denied` string.
218+
In the second part, we use `F.map` to get the `file_text` argument from the `str_replace_editor` command and then use `max(F.frequency(..).values())` to find the most frequent `file_text`
219+
220+
### Task 6: Code example with Fibonacci sequence
221+
222+
<div class='tiles'>
223+
<a href="https://explorer.invariantlabs.ai/u/mbalunovic/computer_use_agent-1733382354/t/6" class='tile'>
224+
<span class='tile-title'>Open in Explorer →</span>
225+
<span class='tile-description'>See this example in the Invariant Explorer</span>
226+
</a>
227+
</div>
228+
229+
In this test, we ask the agent to write a function `compute_fibonacci(n)` that computes the n-th Fibonacci number and test it on a few inputs.
230+
We then assert that executing the code `print(compute_fibonacci(12))` results in the `144` being present in the standard output (note that this asssertion requires
231+
Docker to be installed).
232+
233+
```python
234+
def test_fibonacci():
235+
trace = run_agent(
236+
"""write me a python function compute_fibonacci(n) that computes n-th fibonacci number and test it on a few inputs"""
237+
)
238+
with trace.as_context():
239+
trace.run_assertions(global_asserts)
240+
241+
tool_calls = trace.tool_calls({"name": "str_replace_editor", "arguments.command": "create"})
242+
for tc in tool_calls:
243+
res = tc["function"]["arguments"]["file_text"].execute_contains("144", "print(compute_fibonacci(12))")
244+
assert_true(res, "Execution output does not contain 144")
245+
```
246+
247+
For this we used `.execute_contains` function that executes the code in the string inside of Docker containerand checks whether the output contains the expected substring.
248+
249+
## Conclusion
250+
251+
We have seen how to write global assertions that are always checked for, and how to write unit tests for specific test cases.

0 commit comments

Comments
 (0)