
Commit 39e6901

Zishan authored and committed

adjust openai-agent demo doc

1 parent 6a4a6d1 commit 39e6901

1 file changed: 106 additions, 50 deletions
---
title: OpenAI Python Agent
---

# OpenAI Function Calling Agent

<div class="subtitle">
Test an OpenAI function calling agent using <code>testing</code>.
</div>

# Intro

OpenAI's function calling can be used to build agents that integrate with external tools and APIs, allowing the agent to call custom functions and deliver enhanced, context-aware responses. More details can be found here: [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)

In this chapter we test a simple OpenAI function calling agent, as [implemented in this file](https://github.com/invariantlabs-ai/testing/blob/main/sample_tests/openai/test_python_agent.py).
## Agent Overview

The agent generates and executes Python code in response to user requests and returns the computed results. It operates under a strict prompt and uses the `run_python` tool to guarantee accurate code execution and adherence to its intended functionality.

A loop is implemented to run the client until the chat is completed without further tool calls. During this process, all chat interactions are stored in `messages`. A simplified implementation is shown below:
```python
while True:

    # call the client to get a response
    response = self.client.chat.completions.create(
        messages=messages,
        model="gpt-4o",
        tools=tools,
    )

    # check whether the response calls a tool; if not, the chat is completed
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    if tool_calls:

        # append the current response message to messages
        messages.append(response_message.to_dict())

        # in this demo there is only one tool call in the response
        tool_call = tool_calls[0]
        if tool_call.function.name == "run_python":

            # get the arguments generated by the agent for the function
            function_args = json.loads(tool_call.function.arguments)

            # run the function with the "code" argument
            function_response = run_python(function_args["code"])

            # append the function's response to messages for the next round of chat
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": "run_python",
                    "content": str(function_response),
                }
            )
    else:
        break
```
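
The `tools` list passed to `chat.completions.create` follows the standard OpenAI function calling schema. Below is a minimal sketch of how `run_python` might be declared and implemented; the description strings and the `exec`-based implementation are illustrative assumptions, not the agent's actual code (see the linked test file for that):

```python
import contextlib
import io

# hypothetical declaration of the run_python tool in the standard
# OpenAI function calling schema (description text is illustrative)
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute Python code and return the printed output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to execute.",
                    }
                },
                "required": ["code"],
            },
        },
    }
]


# hypothetical minimal implementation: run the code and capture stdout
def run_python(code: str) -> str:
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code)  # never exec untrusted code outside a sandbox
    return buffer.getvalue()
```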

## Run the Example

You can run the example by running the following command in the root of the repo:

```bash
poetry run invariant test sample_tests/openai/test_python_agent.py --push --dataset_name test_python_agent
```

> **Note:** If you want to run the example without sending the results to the Explorer UI, you can always run without the `--push` flag. You will still see the parts of the trace that fail, as highlighted in the terminal.

3274

The agent's response is subsequently transformed into a `Trace` object using `TraceFactory.from_openai()`.

In the first test, we ask the agent to calculate the Fibonacci series for the first 10 elements using Python.
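
For intuition, the `response` returned by `python_agent.get_response(...)` is the accumulated chat history in the OpenAI messages format (the same shape as the mocked response used in Test 2 below). A hypothetical, abbreviated example of what gets converted into a trace; all field values here are illustrative, not taken from a real run:

```python
# hypothetical, abbreviated chat history in OpenAI messages format;
# all field values are illustrative
response = [
    {"role": "user", "content": "Calculate fibonacci series for the first 10 elements in python"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {"name": "run_python", "arguments": '{"code": "..."}'},
            }
        ],
    },
    {
        "tool_call_id": "call_0",
        "role": "tool",
        "name": "run_python",
        "content": "[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]",
    },
    {"role": "assistant", "content": "The first 10 Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21 and 34."},
]

trace = TraceFactory.from_openai(response)
```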

The implementation of the first test is shown below:

```python
def test_python_question():
    input = "Calculate fibonacci series for the first 10 elements in python"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert the response into a trace
    trace = TraceFactory.from_openai(response)

    # test the agent behavior
    with trace.as_context():
        run_python_tool_call = trace.tool_calls(name="run_python")

        # assert the agent calls "run_python" exactly once
        assert_true(F.len(run_python_tool_call) == 1)

        # assert the argument passed to the tool call is valid Python code
        assert_true(
            run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
                "python"
            )
        )

        # assert that "34" is included in the agent's final response
        assert_true("34" in trace.messages(-1)["content"])
```

Our primary objective is to verify that the agent correctly calls the `run_python` tool and provides valid Python code as its parameter. To achieve this, we first filter the tool_calls where `name = "run_python"`. Then, we assert that exactly one `tool_call` meets this condition. Next, we confirm that the argument passed to the `tool_call` is valid Python code.

Then we validate that the Python code executes correctly. To confirm this, we check that one of the calculated results, "34", is included in the agent's final response.

### Test 2: Invalid Response

<div class='tiles'>

In this test, we use `unittest.mock.MagicMock` to simulate a scenario where the agent incorrectly responds with Java code instead of Python, ensuring such behavior is detected. The actual response from `python_agent.get_response(input)` is replaced with our custom content stored in `mock_invalid_response`.

The implementation of the second test is shown below:

```python
def test_python_question_invalid():
    input = "Calculate fibonacci series for the first 10 elements in python"
    python_agent = PythonAgent()

    # set a custom response that contains Java code instead of Python code
    mock_invalid_response = [
        {
            "role": "system",
            # ... (the rest of the mocked messages, including the Java code,
            # is elided here; see the linked test file for the full content)
        },
    ]

    # the agent's response will be replaced by our mock_invalid_response
    python_agent.get_response = MagicMock(return_value=mock_invalid_response)
    response = python_agent.get_response(input)

    # convert the response into a trace
    trace = TraceFactory.from_openai(response)

    # test the agent behavior
    with trace.as_context():
        run_python_tool_call = trace.tool_calls(name="run_python")

        # assert the agent still calls "run_python" exactly once,
        # but with invalid Python code as its argument
        assert_true(F.len(run_python_tool_call) == 1)
        assert_true(
            not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
                "python"
            )
        )
```

In this test we still verify that the agent correctly calls the `run_python` tool once, but it provides invalid Python code as its parameter. So we assert that the parameter passed to this call is not valid Python code.
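
The mocking itself is plain `unittest.mock`: assigning `MagicMock(return_value=...)` to the bound method means any call returns our canned response instead of invoking the real agent. A tiny self-contained illustration (the `Greeter` class is hypothetical, standing in for `PythonAgent`):

```python
from unittest.mock import MagicMock


# hypothetical class, standing in for PythonAgent
class Greeter:
    def greet(self, name: str) -> str:
        return f"Hello, {name}!"


greeter = Greeter()

# replace the method; the real implementation is never called
greeter.greet = MagicMock(return_value="Hi!")

assert greeter.greet("Alice") == "Hi!"
greeter.greet.assert_called_once_with("Alice")
```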

### Test 3: Non-Python Language Request

<div class='tiles'>

This test evaluates the agent's ability to handle requests involving a programming language other than Python, specifically Java. The agent is expected to respond appropriately by clarifying its limitation to Python code as outlined in the prompt.

The implementation of the third test is shown below:

```python
def test_java_question():
    input = "How to calculate fibonacci series in Java?"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert the response into a trace
    trace = TraceFactory.from_openai(response)

    # set the expected response, as clarified in the prompt
    expected_response = "I can only help with Python code."

    # test the agent behavior
    with trace.as_context():

        # assert that the agent does not call the `run_python` tool
        run_python_tool_call = trace.tool_calls(name="run_python")
        assert_true(F.len(run_python_tool_call) == 0)

        # assert that the actual response is close enough to the expected response
        expect_equals(
            expected_response, trace.messages(-1)["content"]
        )

        # assert that the Levenshtein distance to the expected response is small
        assert_true(trace.messages(-1)["content"].levenshtein(expected_response) < 5)
```

The first validation confirms that the agent does not call the `run_python` tool.

The agent's response should align closely with `expected_response = "I can only help with Python code."`.
We use the `expect_equals` assertion, which is less strict than `assert_equal`, to validate similarity.

To further confirm similarity, we use our `levenshtein()` function to compute the Levenshtein distance between the agent's response and the expected output, asserting that it is less than 5.
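
As a refresher, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. A minimal pure-Python sketch, independent of the `testing` library's own `levenshtein()`:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    # classic dynamic-programming formulation, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of ca
                curr[j - 1] + 1,           # insertion of cb
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


# one deletion (the trailing period) gives distance 1, which is < 5
assert levenshtein("I can only help with Python code.",
                   "I can only help with Python code") == 1
```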

## Conclusion

We have seen how to build an OpenAI Function Calling Agent and how to write unit tests to ensure the agent functions correctly by using `testing`.

To learn more, please select a topic from the tiles below.

<div class='tiles'>

<a href="https://en.wikipedia.org/wiki/Levenshtein_distance" class='tile primary'>
<span class='tile-title'>Levenshtein Distance →</span>
<span class='tile-description'>Wikipedia's introduction to Levenshtein Distance</span>
</a>

<a href="https://docs.python.org/3/library/unittest.mock.html" class='tile primary'>
<span class='tile-title'>Intro to unittest.mock →</span>
<span class='tile-description'>Docs for unittest.mock — mock object library</span>
</a>

</div>