Test an OpenAI function calling agent using `testing`.
# Intro
This example demonstrates how we validate an OpenAI Python agent using Invariant `testing` to ensure the agent functions correctly.
OpenAI's function calling can be used to build agents that integrate with external tools and APIs, allowing the agent to call custom functions and deliver enhanced, context-aware responses. More details can be found here: [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)
In this chapter, we test a simple OpenAI function calling agent, as [implemented in this file](https://github.com/invariantlabs-ai/testing/blob/main/sample_tests/openai/test_python_agent.py).
## Agent Overview
The agent generates and executes Python code in response to user requests and returns the computed results. It operates under a strict prompt and uses the `run_python` tool to guarantee accurate code execution and adherence to its intended functionality.
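For orientation, a `run_python` tool definition in the OpenAI function calling format could look like the sketch below; only the tool name and its `code` argument are taken from this example, while the description strings are placeholders rather than the agent's actual schema.

```python
# illustrative sketch of a `tools` entry for the run_python function
# (the description texts are placeholders, not the agent's actual schema)
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute the given Python code and return the result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute.",
                    }
                },
                "required": ["code"],
            },
        },
    }
]
```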
A loop is implemented to run the client until the chat is completed without further tool calls. During this process, all chat interactions are stored in `messages`. A simplified implementation is shown below:
```python
while True:

    # call the client to get a response
    response = self.client.chat.completions.create(
        messages=messages,
        model="gpt-4o",
        tools=tools,
    )

    # check whether the response calls tools; if not, the chat is completed
    tool_calls = response.choices[0].message.tool_calls
    if tool_calls:
        # keep the assistant message that requested the tool calls
        messages.append(response.choices[0].message)

        for tool_call in tool_calls:
            # execute the requested tool with the provided code
            # (exact execution details are simplified here)
            arguments = json.loads(tool_call.function.arguments)
            function_response = self.run_python(arguments["code"])

            # append the response of the function to messages for the next round of chat
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": "run_python",
                    "content": str(function_response),
                }
            )
    else:
        break
```
## Run the Example
You can run the example by running the following command in the root of the repository:

```bash
poetry run invariant test sample_tests/openai/test_python_agent.py --push --dataset_name test_python_agent
```
> **Note** If you want to run the example without sending the results to the Explorer UI, you can always run without the `--push` flag. You will still see the parts of the trace that fail, as highlighted in the terminal.
The agent's response is subsequently transformed into a `Trace` object using `TraceFactory.from_openai()`.
In the first test, we ask the agent to calculate the Fibonacci series for the first 10 elements using Python.
The implementation of the first test is shown below:
```python
def test_python_question():
    input = "Calculate fibonacci series for the first 10 elements in python"

    # ... (agent call, trace construction, and tool-call checks omitted in this excerpt) ...

    # assert that 34 is included in the agent's final response
    assert_true("34" in trace.messages(-1)["content"])
```
Our primary objective is to verify that the agent correctly calls the `run_python` tool and provides valid Python code as its parameter. To achieve this, we first filter the tool_calls where `name = "run_python"`. Then, we assert that exactly one `tool_call` meets this condition. Next, we confirm that the argument passed to the `tool_call` is valid Python code.
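A rough sketch of these assertions is shown below; the `trace.tool_calls(...)` selector and the exact assertion helpers are assumptions about the `testing` API, so the real test file may differ.

```python
# sketch only: filter the trace's tool calls down to those named "run_python"
# (`trace.tool_calls(name=...)` is an assumed selector, not confirmed API)
run_python_tool_call = trace.tool_calls(name="run_python")

# exactly one such tool call is expected
assert_true(len(run_python_tool_call) == 1)

# the `code` argument passed to that tool call must be valid Python code
assert_true(
    run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code("python")
)
```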
"python"Secondly, we validate that the Python code executes correctly. To confirm this, we check if the calculated result, "34,"is included in the agent's final response.
78
-
)
79
-
)
80
-
```
81
-
82
126
Then we validate that the Python code executes correctly. To confirm this, we check whether one of the calculated results, "34", is included in the agent's final response.
### Test 2: Invalid Response:
In this test, we use `unittest.mock.MagicMock` to simulate a scenario where the agent incorrectly responds with Java code instead of Python, ensuring such behavior is detected. The actual response from `python_agent.get_response(input)` is replaced with our custom content stored in `mock_invalid_response`.
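A minimal sketch of this mocking step is shown below; how `mock_invalid_response` is built and the exact patching style used in the real test are assumptions here.

```python
from unittest.mock import MagicMock

# replace the agent's get_response with a canned response containing Java code
# (mock_invalid_response is assumed to be prepared elsewhere in the test)
python_agent.get_response = MagicMock(return_value=mock_invalid_response)
response = python_agent.get_response(input)
```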
The implementation of the second test is shown below:
```python
def test_python_question_invalid():
    input = "Calculate fibonacci series for the first 10 elements in python"
    python_agent = PythonAgent()

    # set custom response that contains Java code instead of Python code
    # ... (mocked response, trace construction, and assertions omitted in this excerpt) ...
```
In this test, we still verify that the agent correctly calls the `run_python` tool once, but it provides invalid Python code as its parameter. So we assert that the parameter passed to this call is not valid Python code:
```python
assert_true(
    not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
        "python"
    )
)
```
### Test 3: Non-Python Language Request:
This test evaluates the agent's ability to handle requests involving a programming language other than Python, specifically Java. The agent is expected to respond appropriately by clarifying its limitation to Python code as outlined in the prompt.
The implementation of the third test is shown below:
```python
def test_java_question():
    input = "How to calculate fibonacci series in Java?"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert the response into a trace
    trace = TraceFactory.from_openai(response)

    # set the expected response as clarified in the prompt
    expected_response = "I can only help with Python code."

    # test the agent behavior
    with trace.as_context():
        # assert that the agent does not call the `run_python` tool
        # ... (this and the remaining assertions are omitted in this excerpt) ...
        ...
```
The agent’s response should align closely with `expected_response = "I can only help with Python code."`.
We use the `expect_equals` assertion, which is less strict than `assert_equal`, to validate similarity:

```python
expected_response = "I can only help with Python code."
expect_equals(
    expected_response, trace.messages(-1)["content"]
)
```
To further confirm similarity, we use our `levenshtein()` function to compute the Levenshtein distance between the agent's response and the expected output, asserting that it is less than 5.
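A rough sketch of this check is shown below; how `levenshtein()` is actually exposed by the `testing` API (here assumed to be a method on trace string values) may differ from the real test.

```python
# sketch only: assert that the Levenshtein distance between the agent's final
# response and the expected response is below 5
# (the `.levenshtein(...)` call is an assumed API shape, not confirmed)
assert_true(
    trace.messages(-1)["content"].levenshtein(expected_response) < 5
)
```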