Test an OpenAI function calling agent using `testing`.
# Intro
This example demonstrates how we validate an OpenAI Python agent using Invariant `testing` to ensure the agent functions correctly.
OpenAI's function calling can be used to build agents that integrate with external tools and APIs, allowing the agent to call custom functions and deliver enhanced, context-aware responses. More details can be found here: [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling)
In this chapter, we test a simple OpenAI function calling agent, as [implemented in this file](https://github.com/invariantlabs-ai/testing/blob/main/sample_tests/openai/test_python_agent.py).
## Agent Overview
The agent generates and executes Python code in response to user requests and returns the computed results. It operates under a strict prompt and uses the `run_python` tool to guarantee accurate code execution and adherence to its intended functionality.
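For orientation, a `run_python` tool definition in the OpenAI function calling format could look like the sketch below; only the tool name and its `code` argument are taken from this example, while the description strings are placeholders rather than the agent's actual schema.

```python
# illustrative sketch of a `tools` entry for the run_python function
# (the description texts are placeholders, not the agent's actual schema)
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": "Execute the given Python code and return the result.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute.",
                    }
                },
                "required": ["code"],
            },
        },
    }
]
```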
A loop is implemented to run the client until the chat is completed without further tool calls. During this process, all chat interactions are stored in `messages`. A simplified implementation is shown below:
```python
while True:

    # call the client to get a response
    response = self.client.chat.completions.create(
        messages=messages,
        model="gpt-4o",
        tools=tools,
    )

    # check whether the response calls tools; if not, the chat is completed
    tool_calls = response.choices[0].message.tool_calls
    if tool_calls:
        # keep the assistant message that requested the tool calls
        messages.append(response.choices[0].message)

        for tool_call in tool_calls:
            # execute the requested tool with the provided code
            # (exact execution details are simplified here)
            arguments = json.loads(tool_call.function.arguments)
            function_response = self.run_python(arguments["code"])

            # append the response of the function to messages for the next round of chat
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": "run_python",
                    "content": str(function_response),
                }
            )
    else:
        break
```
## Run the Example
You can run the example by running the following command in the root of the repository:

```bash
poetry run invariant test sample_tests/openai/test_python_agent.py --push --dataset_name test_python_agent
```
> **Note** If you want to run the example without sending the results to the Explorer UI, you can always run without the `--push` flag. You will still see the parts of the trace that fail, as highlighted in the terminal.
The agent's response is subsequently transformed into a `Trace` object using `TraceFactory.from_openai()`.
In the first test, we ask the agent to calculate the Fibonacci series for the first 10 elements using Python.
The implementation of the first test is shown below:
```python
def test_python_question():
    input = "Calculate fibonacci series for the first 10 elements in python"

    # ... (agent call, trace construction, and tool-call checks omitted in this excerpt) ...

    # assert that 34 is included in the agent's final response
    assert_true("34" in trace.messages(-1)["content"])
```
Our primary objective is to verify that the agent correctly calls the `run_python` tool and provides valid Python code as its parameter. To achieve this, we first filter the tool_calls where `name = "run_python"`. Then, we assert that exactly one `tool_call` meets this condition. Next, we confirm that the argument passed to the `tool_call` is valid Python code.
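A rough sketch of these assertions is shown below; the `trace.tool_calls(...)` selector and the exact assertion helpers are assumptions about the `testing` API, so the real test file may differ.

```python
# sketch only: filter the trace's tool calls down to those named "run_python"
# (`trace.tool_calls(name=...)` is an assumed selector, not confirmed API)
run_python_tool_call = trace.tool_calls(name="run_python")

# exactly one such tool call is expected
assert_true(len(run_python_tool_call) == 1)

# the `code` argument passed to that tool call must be valid Python code
assert_true(
    run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code("python")
)
```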
"python"Secondly, we validate that the Python code executes correctly. To confirm this, we check if the calculated result, "34,"is included in the agent's final response.
78
-
)
79
-
)
80
-
```
81
-
82
126
Then we validate that the Python code executes correctly. To confirm this, we check whether one of the calculated results, "34", is included in the agent's final response.
### Test 2: Invalid Response:
In this test, we use `unittest.mock.MagicMock` to simulate a scenario where the agent incorrectly responds with Java code instead of Python, ensuring such behavior is detected. The actual response from `python_agent.get_response(input)` is replaced with our custom content stored in `mock_invalid_response`.
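A minimal sketch of this mocking step is shown below; how `mock_invalid_response` is built and the exact patching style used in the real test are assumptions here.

```python
from unittest.mock import MagicMock

# replace the agent's get_response with a canned response containing Java code
# (mock_invalid_response is assumed to be prepared elsewhere in the test)
python_agent.get_response = MagicMock(return_value=mock_invalid_response)
response = python_agent.get_response(input)
```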
The implementation of the second test is shown below:
```python
def test_python_question_invalid():
    input = "Calculate fibonacci series for the first 10 elements in python"
    python_agent = PythonAgent()

    # set custom response that contains Java code instead of Python code
    # ... (mocked response, trace construction, and assertions omitted in this excerpt) ...
```
In this test, we still verify that the agent correctly calls the `run_python` tool once, but it provides invalid Python code as its parameter. So we assert that the parameter passed to this call is not valid Python code:
```python
assert_true(
    not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code(
        "python"
    )
)
```
### Test 3: Non-Python Language Request:
This test evaluates the agent's ability to handle requests involving a programming language other than Python, specifically Java. The agent is expected to respond appropriately by clarifying its limitation to Python code as outlined in the prompt.
The implementation of the third test is shown below:
```python
def test_java_question():
    input = "How to calculate fibonacci series in Java?"

    # run the agent
    python_agent = PythonAgent()
    response = python_agent.get_response(input)

    # convert the response into a trace
    trace = TraceFactory.from_openai(response)

    # set the expected response as clarified in the prompt
    expected_response = "I can only help with Python code."

    # test the agent behavior
    with trace.as_context():
        # assert that the agent does not call the `run_python` tool
        # ... (this and the remaining assertions are omitted in this excerpt) ...
        ...
```
The agent’s response should align closely with `expected_response = "I can only help with Python code."`.
We use the `expect_equals` assertion, which is less strict than `assert_equal`, to validate similarity:

```python
expected_response = "I can only help with Python code."
expect_equals(
    expected_response, trace.messages(-1)["content"]
)
```
To further confirm similarity, we use our `levenshtein()` function to compute the Levenshtein distance between the agent's response and the expected output, asserting that it is less than 5.
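A rough sketch of this check is shown below; how `levenshtein()` is actually exposed by the `testing` API (here assumed to be a method on trace string values) may differ from the real test.

```python
# sketch only: assert that the Levenshtein distance between the agent's final
# response and the expected response is below 5
# (the `.levenshtein(...)` call is an assumed API shape, not confirmed)
assert_true(
    trace.messages(-1)["content"].levenshtein(expected_response) < 5
)
```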