
Commit 8c44944

Merge branch 'main' into swarm-langgraph-docs

2 parents b476f13 + 19e1790

File tree

8 files changed (+319, -5 lines)


docs/testing/Examples/computer-use.md

Lines changed: 5 additions & 1 deletion
@@ -2,7 +2,11 @@
title: Computer Use Agent
---

-# Intro
+# Computer Use Agents
+
+<div class="subtitle">
+Test your Computer Use agent with <code>testing</code>
+</div>

Anthropic has recently announced a [Computer Use Agent](https://docs.anthropic.com/en/docs/build-with-claude/computer-use), an AI Agent capable
of interacting with a computer desktop environment. For this example, we prompt the agent to act as a QA engineer with the knowledge about the documentation of
Lines changed: 261 additions & 0 deletions
@@ -0,0 +1,261 @@
---
title: Function Calling Agents
---

# Function Calling Agents

<div class="subtitle">
Test an OpenAI function calling agent using <code>testing</code>
</div>

OpenAI's function calling can be used to build agents that integrate with external tools and APIs, allowing the agent to call custom functions and deliver enhanced, context-aware responses. More details can be found in the [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling).

In this chapter, we test a simple OpenAI function calling agent as [implemented in this file](https://github.com/invariantlabs-ai/testing/blob/main/sample_tests/openai/test_python_agent.py).

## Agent Overview

The agent generates and executes Python code in response to user requests and returns the computed results. It operates under a strict prompt and uses the `run_python` tool to guarantee accurate code execution and adherence to its intended functionality.

A loop runs the client until the chat completes without further tool calls. During this process, all chat interactions are stored in `messages`. A simplified implementation is shown below:
```python
while True:

    # call the client to get a response
    response = self.client.chat.completions.create(
        messages=messages,
        model="gpt-4o",
        tools=tools,
    )

    # check whether the response contains tool calls; if not, the chat is complete
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    if tool_calls:

        # append the current response message to messages
        messages.append(response_message.to_dict())

        # in this demo there is only one tool call in the response
        tool_call = tool_calls[0]
        if tool_call.function.name == "run_python":

            # get the arguments generated by the agent for the function
            function_args = json.loads(tool_call.function.arguments)

            # run the function with the "code" argument
            function_response = run_python(function_args["code"])

            # append the function response to messages for the next round of chat
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": "run_python",
                    "content": str(function_response),
                }
            )
    else:
        break
```
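
The loop above assumes a `tools` list that describes the `run_python` function in OpenAI's function-calling schema, and a `run_python` helper that executes the code; both are defined in the linked sample file. A minimal sketch of what that tool definition could look like (the description strings here are assumptions):

```python
# Minimal sketch of the assumed `tools` definition; the exact description
# and schema live in the linked sample file.
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute the given Python code and return its output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to execute.",
                    }
                },
                "required": ["code"],
            },
        },
    }
]
```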

## Run the Example

You can run the example by running the following command in the root of the repository:

```bash
poetry run invariant test sample_tests/openai/test_python_agent.py --push --dataset_name test_python_agent
```

> **Note:** If you want to run the example without sending the results to the Explorer UI, you can always run without the `--push` flag. You will still see the parts of the trace that fail highlighted in the terminal.

## Unit Tests

Here, we design three unit tests to cover different scenarios.

In these tests, we vary the `input` to reflect different situations. Within each test, we create an instance of the agent named `python_agent` and retrieve its response by calling `python_agent.get_response(input)`.

The agent's response is subsequently transformed into a `Trace` object using `TraceFactory.from_openai(response)` for further validation.
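
The test snippets below omit their imports. A sketch of what they assume, inferred from the names used in the tests and from the parameterized-tests page of these docs (the exact import paths may differ in the sample file):

```python
from unittest.mock import MagicMock

# Assumed imports, inferred from the names used in the tests below;
# the exact paths may differ in the linked sample file.
from invariant.testing import TraceFactory, assert_true, expect_equals
from invariant.testing import functional as F

# PythonAgent, the agent under test, is defined in the linked sample file.
```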

### Test 1: Valid Python Code Execution

<div class='tiles'>
<a href="https://explorer.invariantlabs.ai/u/zishan-wei/openai_python_agent-1733417505/t/1" class='tile'>
<span class='tile-title'>Open in Explorer →</span>
<span class='tile-description'>See this example in the Invariant Explorer</span>
</a>
</div>

In the first test, we ask the agent to calculate the Fibonacci series for the first 10 elements using Python.

The implementation of the first test is shown below:
```python
def test_python_question():
    input = "Calculate fibonacci series for the first 10 elements in python"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert trace
    trace = TraceFactory.from_openai(response)

    # test the agent behavior
    with trace.as_context():
        run_python_tool_call = trace.tool_calls(name="run_python")

        # assert that the agent calls "run_python" exactly once
        assert_true(F.len(run_python_tool_call) == 1)

        # assert that the argument passed to the tool call is valid Python code
        assert_true(
            run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
                "python"
            )
        )

        # assert that 34 is included in the agent's final response
        assert_true("34" in trace.messages(-1)["content"])
```

Our primary objective is to verify that the agent correctly calls the `run_python` tool and provides valid Python code as its parameter. To achieve this, we first filter the tool calls where `name = "run_python"`. Then, we assert that exactly one `tool_call` meets this condition. Next, we confirm that the argument passed to the `tool_call` is valid Python code.

Finally, we validate that the Python code executes correctly. To confirm this, we check that one of the calculated results, "34", is included in the agent's final response.
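
As a quick sanity check on that expectation: 34 appears among the first ten Fibonacci numbers regardless of whether the series is started at 0 or at 1.

```python
# Sanity check: 34 is among the first 10 Fibonacci numbers for either
# common starting convention.
def fib(first: int, n: int = 10) -> list[int]:
    series = [first, 1]
    while len(series) < n:
        series.append(series[-1] + series[-2])
    return series

print(fib(0))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(fib(1))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```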

### Test 2: Invalid Response

<div class='tiles'>
<a href="https://explorer.invariantlabs.ai/u/zishan-wei/openai_python_agent-1733417505/t/2" class='tile'>
<span class='tile-title'>Open in Explorer →</span>
<span class='tile-description'>See this example in the Invariant Explorer</span>
</a>
</div>

In this test, we use `unittest.mock.MagicMock` to simulate a scenario where the agent incorrectly responds with Java code instead of Python, ensuring such behavior is detected. The actual response from `python_agent.get_response(input)` is replaced with our custom content stored in `mock_invalid_response`.

The implementation of the second test is shown below:
```python
def test_python_question_invalid():
    input = "Calculate fibonacci series for the first 10 elements in python"
    python_agent = PythonAgent()

    # set custom response that contains Java code instead of Python code
    mock_invalid_response = [
        {
            "role": "system",
            "content": '\n You are an assistant that strictly responds with Python code only. \n The code should print the result.\n You always use tool run_python to execute the code that you write to present the results.\n If the user specifies other programming language in the question, you should respond with "I can only help with Python code."\n ',
        },
        {"role": "user", "content": "Calculate fibonacci series for 10"},
        {
            "content": "None",
            "refusal": "None",
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_GMx1WYM7sN0BGY1ISCk05zez",
                    "function": {
                        "arguments": '{"code":"public class Fibonacci { public static void main(String[] args) { for (int n = 10, a = 0, b = 1, i = 0; i < n; i++, b = a + (a = b)) System.out.print(a + '
                        '); } }"}',
                        "name": "run_python",
                    },
                    "type": "function",
                }
            ],
        },
    ]

    # the response will be replaced by our mock_invalid_response
    python_agent.get_response = MagicMock(return_value=mock_invalid_response)
    response = python_agent.get_response(input)

    # convert trace
    trace = TraceFactory.from_openai(response)

    # test the agent behavior
    with trace.as_context():
        run_python_tool_call = trace.tool_calls(name="run_python")

        assert_true(F.len(run_python_tool_call) == 1)
        assert_true(
            not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
                "python"
            )
        )
```

In this test, we still verify that the agent calls the `run_python` tool exactly once, but this time it provides invalid Python code as its parameter. So we assert that the parameter passed to this call is not valid Python code.
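
Conceptually, that is the kind of check a syntactic validity test performs. The library's `is_valid_code("python")` may be implemented differently, but even a plain `compile()` call rejects the mocked Java snippet while accepting real Python:

```python
# Illustration only: a plain syntax check rejects Java but accepts Python.
# The library's is_valid_code("python") check may be implemented differently.
def looks_like_python(code: str) -> bool:
    try:
        compile(code, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(looks_like_python("public class Fibonacci { }"))  # False
print(looks_like_python("print(sum(range(10)))"))       # True
```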

### Test 3: Non-Python Language Request

<div class='tiles'>
<a href="https://explorer.invariantlabs.ai/u/zishan-wei/openai_python_agent-1733417505/t/3" class='tile'>
<span class='tile-title'>Open in Explorer →</span>
<span class='tile-description'>See this example in the Invariant Explorer</span>
</a>
</div>

This test evaluates the agent's ability to handle requests involving a programming language other than Python, specifically Java. The agent is expected to respond appropriately by clarifying its limitation to Python code, as outlined in the prompt.

The implementation of the third test is shown below:
```python
def test_java_question():
    input = "How to calculate fibonacci series in Java?"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert trace
    trace = TraceFactory.from_openai(response)

    # set the expected response as clarified in the prompt
    expected_response = "I can only help with Python code."

    # test the agent behavior
    with trace.as_context():

        # assert that the agent does not call the `run_python` tool
        run_python_tool_call = trace.tool_calls(name="run_python")
        assert_true(F.len(run_python_tool_call) == 0)

        # assert that the actual response is close enough to the expected response
        expect_equals(
            "I can only help with Python code.", trace.messages(-1)["content"]
        )
        assert_true(trace.messages(-1)["content"].levenshtein(expected_response) < 5)
```

The first validation confirms that the agent does not call the `run_python` tool.

The agent's response should align closely with `expected_response = "I can only help with Python code."`.
We use the `expect_equals` assertion, which is less strict than `assert_equal`, to validate similarity.

To further confirm similarity, we compute the Levenshtein distance between the agent's response and the expected output using our `levenshtein()` function, and assert that it is less than 5.
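
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another, so a threshold of 5 tolerates only minor wording or punctuation differences. A minimal reference implementation of the metric (not the library's own):

```python
# Minimal reference implementation of the Levenshtein distance,
# shown only to illustrate the metric used by the assertion above.
def levenshtein(a: str, b: str) -> int:
    # dp[j] holds the edit distance between the processed prefix of `a` and b[:j]
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # delete ca
                dp[j - 1] + 1,      # insert cb
                prev + (ca != cb),  # substitute ca with cb (free if equal)
            )
    return dp[len(b)]

print(levenshtein("I can only help with Python code.",
                  "I can only help with Python code"))  # 1
```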

## Conclusion

We have seen how to build an OpenAI function calling agent and how to write unit tests with `testing` to ensure the agent behaves correctly.

To learn more, please select a topic from the tiles below.

<div class='tiles'>

<a href="https://en.wikipedia.org/wiki/Levenshtein_distance" class='tile primary'>
<span class='tile-title'>Levenshtein Distance →</span>
<span class='tile-description'>Wikipedia's introduction to Levenshtein distance</span>
</a>

<a href="https://docs.python.org/3/library/unittest.mock.html" class='tile primary'>
<span class='tile-title'>Intro to unittest.mock →</span>
<span class='tile-description'>Documentation for unittest.mock, the mock object library</span>
</a>

</div>

docs/testing/Writing_Tests/Matchers.md

Lines changed: 4 additions & 2 deletions
@@ -1,11 +1,13 @@
# Matchers

-<div class='subtitle'>Use matchers for fuzzy and LLM-based checks</div>
+<div class='subtitle'>Test with custom checkers and LLM-based evaluation</div>

-Not all agentic behavior can be specified with precise, traditional checking methods. Instead, more often than not, we expect AI models to generalize and thus respond slightly differently to different inputs.
+Not all agentic behavior can be specified with precise, traditional checking methods. Instead, more often than not, we expect AI models to generalize and thus respond slightly differently every time we invoke them.

To accommodate this, `testing` includes several different `Matcher` implementations that allow you to write tests that rely on fuzzy, similarity-based or property-based conditions.

+Beyond that, `Matcher` is also a simple base class that allows you to write your own custom matchers if the provided ones are not sufficient for your needs (e.g. custom properties).
+
## `IsSimilar`

TODO
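
As a rough sketch of the custom-matcher idea added above (the `matches` method shown here is an assumption for illustration, not the library's confirmed `Matcher` interface):

```python
import json

# Illustration only: the idea behind a custom matcher. The `matches` method
# name is an assumption, not the library's confirmed `Matcher` interface.
class ContainsValidJSON:
    """Matches strings that parse as a JSON object."""

    def matches(self, value: str) -> bool:
        try:
            return isinstance(json.loads(value), dict)
        except (TypeError, ValueError):
            return False

print(ContainsValidJSON().matches('{"status": "ok"}'))  # True
print(ContainsValidJSON().matches("not json"))          # False
```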
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# Parameterized Tests

<div class='subtitle'>Use parameterized tests to test multiple scenarios</div>

In some cases, a certain agent functionality should generalize to multiple scenarios. For example, a weather agent should be able to answer questions about the weather in different cities.

In `testing`, instead of writing a separate test for each city, you can use parameterized tests to cover multiple scenarios. This helps ensure the robustness and generalization of your agent's behavior.
```python
from invariant.testing import Trace, assert_equals, parameterized
import pytest

@pytest.mark.parametrize(
    ("city",),
    [
        ("Paris",),
        ("London",),
        ("New York",),
    ]
)
def test_check_weather_in(city: str):
    # create a Trace object from your agent trajectory
    trace = Trace(
        trace=[
            {"role": "user", "content": f"What is the weather like in {city}"},
            {"role": "agent", "content": f"The weather in {city} is 75°F and sunny."},
        ]
    )

    # make assertions about the agent's behavior
    with trace.as_context():
        # extract the locations mentioned in the agent's response
        locations = trace.messages()[-1]["content"].extract("locations")

        # assert that the agent responded about the given city
        assert_equals(
            1, len(locations), "The agent should respond about one location only"
        )

        assert_equals(city, locations[0], "The agent should respond about " + city)
```
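
Anything `pytest.mark.parametrize` supports works here as well. For instance, a variant of the test above (with made-up question phrasings) that parameterizes over both the city and the user's wording:

```python
from invariant.testing import Trace, assert_equals
import pytest

# Variant of the test above; the question phrasings are made up for illustration.
@pytest.mark.parametrize(
    ("city", "question"),
    [
        ("Paris", "What is the weather like in Paris?"),
        ("London", "Do I need an umbrella in London today?"),
    ],
)
def test_check_weather_phrasing(city: str, question: str):
    trace = Trace(
        trace=[
            {"role": "user", "content": question},
            {"role": "agent", "content": f"The weather in {city} is 75°F and sunny."},
        ]
    )

    with trace.as_context():
        # same assertions as above: exactly one location, matching the given city
        locations = trace.messages()[-1]["content"].extract("locations")
        assert_equals(1, len(locations), "The agent should respond about one location only")
        assert_equals(city, locations[0], "The agent should respond about " + city)
```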

### Visualization

When pushing the parameterized test results to Explorer (`invariant test --push`), the resulting test instances will be listed separately:

<img src="../../assets/parameterized_tests.png"/>
(binary image file added, 108 KB)

docs/testing/index.md

Lines changed: 2 additions & 2 deletions
@@ -73,7 +73,7 @@
# },
# ]
```
-The test result provides information about which assertion failed but also [localizes the assertion failure precisely](Writing_Tests/1_Traces.ipynb) in the provided list of agent messages.
+The test result provides information about which assertion failed but also [localizes the assertion failure precisely](./Writing_Tests/tests.md) in the provided list of agent messages.

**Visual Test Viewer (Explorer):**

@@ -92,7 +92,7 @@ Like the terminal output, the Explorer highlights the relevant ranges, but does
* Comprehensive [`Trace` API](Writing_Tests/1_Traces.ipynb) for easily navigating and checking agent traces.
* [Assertions library](Writing_Tests/2_Assertions.md) to check agent behavior, including fuzzy checkers such as _Levenshtein distance_, _semantic similarity_ and _LLM-as-a-judge_ pipelines.
* Full [`pytest` compatibility](Running_Tests/PyTest_Compatibility.md) for easy integration with existing test and CI/CD pipelines.
-* Parameterized tests for [testing multiple scenarios](Writing_Tests/3_Parameterized_Tests.md) with a single test function.
+* Parameterized tests for [testing multiple scenarios](Writing_Tests/parameterized-tests) with a single test function.
* [Visual test viewer](Writing_Tests/4_Visual_Test_Viewer.md) for exploring large traces and debugging test failures.

## Next Steps
