|
| 1 | +--- |
| 2 | +title: OpenAI Python Agent |
| 3 | +--- |
| 4 | + |
| 5 | + |
| 6 | +# Intro |
| 7 | + |
| 8 | +OpenAI's function calling can be used to build agents that integrate with external tools and APIs, allowing the agent to call custom functions and deliver context-aware responses. More details can be found in the [OpenAI Function Calling Guide](https://platform.openai.com/docs/guides/function-calling). |
| 9 | + |
| 10 | +This example demonstrates how to validate an OpenAI Python agent with Invariant testing to ensure the agent behaves as intended. |
| 11 | + |
| 12 | + |
| 13 | +## Agent Overview |
| 14 | + |
| 15 | +The agent generates and executes Python code in response to user requests and returns the computed results. It operates under a strict system prompt and uses the `run_python` tool to execute the code it writes. |
| 16 | + |
| 17 | +The client is called in a loop until the model replies without any further tool calls. All chat messages exchanged along the way are stored in `messages`. |
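The loop described above can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: `run_agent_loop`, `fake_model`, and `fake_run_python` are hypothetical names, and `fake_model` stands in for the real OpenAI client so the sketch runs offline.

```python
import json


def run_agent_loop(create_completion, messages, tools):
    """Call the model until it replies without requesting a tool call."""
    while True:
        # Hypothetical callable: takes the history, returns an assistant message dict.
        reply = create_completion(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return messages  # chat completed, no further tool calls
        for call in tool_calls:
            tool = tools[call["function"]["name"]]
            arguments = json.loads(call["function"]["arguments"])
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call["id"],
                    "content": str(tool(**arguments)),
                }
            )


def fake_model(messages):
    """Stand-in for the model: request one tool call, then answer."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": f"Result: {messages[-1]['content']}"}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "run_python",
                    "arguments": json.dumps({"code": "print(1 + 1)"}),
                },
            }
        ],
    }


def fake_run_python(code):
    """Stand-in tool: returns a canned result instead of executing code."""
    return "2"


history = run_agent_loop(
    fake_model,
    [{"role": "user", "content": "What is 1 + 1?"}],
    {"run_python": fake_run_python},
)
print([message["role"] for message in history])
```

Every message, including tool results, is appended to the same list, which is why the full history is available for building a trace afterwards.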
| 18 | + |
| 19 | +## Run the Example |
| 20 | + |
| 21 | +Run the example with the following command from the root of the repository: |
| 22 | + |
| 23 | +```bash |
| 24 | +poetry run invariant test sample_tests/openai/test_python_agent.py --push --dataset_name test_python_agent |
| 25 | +``` |
| 26 | + |
| 27 | +!!! note |
| 28 | + |
| 29 | +    If you want to run the example without sending the results to the Explorer UI, omit the `--push` flag. The failing parts of the trace are still |
| 30 | +    highlighted in the terminal. |
| 31 | + |
| 32 | + |
| 33 | +## Unit Tests |
| 34 | + |
| 35 | +Here, we design three unit tests to cover different scenarios. |
| 36 | + |
| 37 | +Each test uses a different `input` to reflect a different situation. Within each test, we create an instance of the agent named `python_agent` and retrieve its response by calling `python_agent.get_response(input)`. |
| 38 | + |
| 39 | +The agent's response is then converted into a `Trace` object using `TraceFactory.from_openai(response)` for further validation. |
| 40 | + |
| 41 | +### Test 1: Valid Python Code Execution: |
| 42 | + |
| 43 | +<div class='tiles'> |
| 44 | +<a href="https://explorer.invariantlabs.ai/u/zishan-wei/openai_python_agent-1733417505/t/1" class='tile'> |
| 45 | + <span class='tile-title'>Open in Explorer →</span> |
| 46 | + <span class='tile-description'>See this example in the Invariant Explorer</span> |
| 47 | +</a> |
| 48 | +</div> |
| 49 | + |
| 50 | +In the first test, we ask the agent to calculate the Fibonacci series for the first 10 elements using Python. |
| 51 | + |
| 52 | +```python |
| 53 | +def test_python_question(): |
| 54 | +    input = "Calculate fibonacci series for the first 10 elements in python" |
| 55 | +    python_agent = PythonAgent() |
| 56 | +    response = python_agent.get_response(input) |
| 57 | +    trace = TraceFactory.from_openai(response) |
| 58 | +    with trace.as_context(): |
| 59 | +        run_python_tool_call = trace.tool_calls(name="run_python") |
| 60 | +        assert_true(F.len(run_python_tool_call) == 1) |
| 61 | +        assert_true( |
| 62 | +            run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code( |
| 63 | +                "python" |
| 64 | +            ) |
| 65 | +        ) |
| 66 | +        assert_true("34" in trace.messages(-1)["content"]) |
| 67 | +``` |
| 68 | + |
| 69 | + |
| 70 | +Our primary objective is to verify that the agent calls the `run_python` tool and passes valid Python code as its argument. To achieve this, we first filter the tool calls where `name = "run_python"`. Then, we assert that exactly one tool call meets this condition. Finally, we confirm that the `code` argument passed to that tool call is valid Python code. |
| 71 | + |
| 72 | +```python |
| 73 | +run_python_tool_call = trace.tool_calls(name="run_python") |
| 74 | +assert_true(F.len(run_python_tool_call) == 1) |
| 75 | +assert_true( |
| 76 | +    run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code( |
| 77 | +        "python" |
| 78 | +    ) |
| 79 | +) |
| 80 | +``` |
| 81 | + |
| 82 | +Next, we validate that the Python code executed correctly. To confirm this, we check that one of the calculated results, "34", appears in the agent's final response. |
| 83 | + |
| 84 | +```python |
| 85 | +assert_true("34" in trace.messages(-1)["content"]) |
| 86 | +``` |
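To see why "34" is a reasonable marker, here is a small sketch that computes the first 10 Fibonacci numbers directly. The helper `first_fibonacci` is a hypothetical name used only for this illustration, and it assumes the series starts at 0; the agent may generate different code.

```python
def first_fibonacci(n: int) -> list[int]:
    """Return the first n Fibonacci numbers, starting from 0."""
    series, a, b = [], 0, 1
    for _ in range(n):
        series.append(a)
        a, b = b, a + b
    return series


first_ten = first_fibonacci(10)
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

If the generated code starts the series at 1 instead, 34 is still among the first 10 elements, so the substring check holds for both conventions.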
| 87 | + |
| 88 | +### Test 2: Invalid Response: |
| 89 | + |
| 90 | +<div class='tiles'> |
| 91 | +<a href="https://explorer.invariantlabs.ai/u/zishan-wei/openai_python_agent-1733417505/t/2" class='tile'> |
| 92 | + <span class='tile-title'>Open in Explorer →</span> |
| 93 | + <span class='tile-description'>See this example in the Invariant Explorer</span> |
| 94 | +</a> |
| 95 | +</div> |
| 96 | + |
| 97 | +In this test, we use `unittest.mock.MagicMock` to simulate a scenario where the agent incorrectly responds with Java code instead of Python, to ensure that such behavior is detected. The actual response from `python_agent.get_response(input)` is replaced with our custom content stored in `mock_invalid_response`. |
| 98 | + |
| 99 | + |
| 100 | +```python |
| 101 | + |
| 102 | +def test_python_question_invalid(): |
| 103 | +    input = "Calculate fibonacci series for the first 10 elements in python" |
| 104 | +    python_agent = PythonAgent() |
| 105 | +    mock_invalid_response = [ |
| 106 | +        { |
| 107 | +            "role": "system", |
| 108 | +            "content": '\n You are an assistant that strictly responds with Python code only. \n The code should print the result.\n You always use tool run_python to execute the code that you write to present the results.\n If the user specifies other programming language in the question, you should respond with "I can only help with Python code."\n ', |
| 109 | +        }, |
| 110 | +        {"role": "user", "content": "Calculate fibonacci series for 10"}, |
| 111 | +        { |
| 112 | +            "content": "None", |
| 113 | +            "refusal": "None", |
| 114 | +            "role": "assistant", |
| 115 | +            "tool_calls": [ |
| 116 | +                { |
| 117 | +                    "id": "call_GMx1WYM7sN0BGY1ISCk05zez", |
| 118 | +                    "function": { |
| 119 | +                        "arguments": '{"code":"public class Fibonacci { public static void main(String[] args) { for (int n = 10, a = 0, b = 1, i = 0; i < n; i++, b = a + (a = b)) System.out.print(a + ' |
| 120 | +                        '); } }"}', |
| 121 | +                        "name": "run_python", |
| 122 | +                    }, |
| 123 | +                    "type": "function", |
| 124 | +                } |
| 125 | +            ], |
| 126 | +        }, |
| 127 | +    ] |
| 128 | +    python_agent.get_response = MagicMock(return_value=mock_invalid_response) |
| 129 | +    response = python_agent.get_response(input) |
| 130 | +    trace = TraceFactory.from_openai(response) |
| 131 | +    with trace.as_context(): |
| 132 | +        run_python_tool_call = trace.tool_calls(name="run_python") |
| 133 | +        assert_true(F.len(run_python_tool_call) == 1) |
| 134 | +        assert_true( |
| 135 | +            not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code( |
| 136 | +                "python" |
| 137 | +            ) |
| 138 | +        ) |
| 139 | + |
| 140 | +``` |
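The mocking step can be illustrated in isolation. The sketch below uses a hypothetical `DummyAgent` class in place of `PythonAgent` so that it runs without calling a model; only `MagicMock` and `assert_called_once_with` come from the standard library.

```python
from unittest.mock import MagicMock


class DummyAgent:
    """Hypothetical stand-in for the real agent class."""

    def get_response(self, user_input):
        raise RuntimeError("the real method would call the model")


agent = DummyAgent()
canned_messages = [{"role": "assistant", "content": "canned reply"}]

# Replace the method on this instance; every call now returns the canned list.
agent.get_response = MagicMock(return_value=canned_messages)

response = agent.get_response("any input")
agent.get_response.assert_called_once_with("any input")
print(response)
```

Because the mock is assigned on the instance, other instances of the class keep the original method.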
| 141 | + |
| 142 | +In this test, we still verify that the agent calls the `run_python` tool exactly once, but this time the tool call carries invalid Python code as its argument. We therefore assert that the argument passed to this call is not valid Python code. |
| 143 | + |
| 144 | +```python |
| 145 | +run_python_tool_call = trace.tool_calls(name="run_python") |
| 146 | +assert_true(F.len(run_python_tool_call) == 1) |
| 147 | +assert_true( |
| 148 | +    not run_python_tool_call[0]["function"]["arguments"]["code"].is_valid_code( |
| 149 | +        "python" |
| 150 | +    ) |
| 151 | +) |
| 152 | +``` |
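As a rough intuition for what a Python validity check involves, the sketch below parses a snippet with the standard library's `ast` module. This is an assumption for illustration only: it is not how `is_valid_code("python")` is implemented, and `looks_like_valid_python` is a hypothetical helper.

```python
import ast


def looks_like_valid_python(code: str) -> bool:
    """Return True if the snippet parses as Python source (syntax check only)."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


java_snippet = "public class Fibonacci { public static void main(String[] args) { } }"
python_snippet = "print(sum(range(10)))"

print(looks_like_valid_python(java_snippet))
print(looks_like_valid_python(python_snippet))
```

A syntax check like this only tells us the code parses; it says nothing about whether the code computes the right result, which is why Test 1 also inspects the final response.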
| 153 | + |
| 154 | +### Test 3: Non-Python Language Request: |
| 155 | + |
| 156 | +<div class='tiles'> |
| 157 | +<a href="https://explorer.invariantlabs.ai/u/zishan-wei/openai_python_agent-1733417505/t/3" class='tile'> |
| 158 | + <span class='tile-title'>Open in Explorer →</span> |
| 159 | + <span class='tile-description'>See this example in the Invariant Explorer</span> |
| 160 | +</a> |
| 161 | +</div> |
| 162 | + |
| 165 | +This test evaluates the agent's ability to handle requests involving a programming language other than Python, specifically Java. The agent is expected to respond by stating that it can only help with Python code, as specified in the prompt. |
| 166 | + |
| 167 | + |
| 168 | +```python |
| 169 | + |
| 170 | +def test_java_question(): |
| 171 | +    input = "How to calculate fibonacci series in Java?" |
| 172 | +    python_agent = PythonAgent() |
| 173 | +    response = python_agent.get_response(input) |
| 174 | +    trace = TraceFactory.from_openai(response) |
| 175 | +    expected_response = "I can only help with Python code." |
| 176 | +    with trace.as_context(): |
| 177 | +        run_python_tool_call = trace.tool_calls(name="run_python") |
| 178 | +        assert_true(F.len(run_python_tool_call) == 0) |
| 179 | +        expect_equals( |
| 180 | +            "I can only help with Python code.", trace.messages(-1)["content"] |
| 181 | +        ) |
| 182 | +        assert_true(trace.messages(-1)["content"].levenshtein(expected_response) < 5) |
| 183 | + |
| 184 | +``` |
| 185 | + |
| 186 | +The first validation confirms that the agent does not call the `run_python` tool. |
| 187 | +```python |
| 188 | +run_python_tool_call = trace.tool_calls(name="run_python") |
| 189 | +assert_true(F.len(run_python_tool_call) == 0) |
| 190 | +``` |
| 191 | + |
| 192 | +The agent's final response should match `expected_response = "I can only help with Python code."`. |
| 193 | +We check this with `expect_equals`, which is a softer check than `assert_equals`: a mismatch is reported but is not treated as strictly as a failed assertion. |
| 194 | + |
| 195 | +```python |
| 196 | +expected_response = "I can only help with Python code." |
| 197 | +expect_equals( |
| 198 | +    "I can only help with Python code.", trace.messages(-1)["content"] |
| 199 | +) |
| 200 | +``` |
| 203 | +To further confirm similarity, we use the `levenshtein()` function to compute the Levenshtein distance between the agent's response and the expected response, and assert that this distance is less than 5. |
| 204 | + |
| 205 | +```python |
| 206 | +assert_true(trace.messages(-1)["content"].levenshtein(expected_response) < 5) |
| 207 | +``` |
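For readers unfamiliar with the metric, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function below is a plain-Python sketch of the standard dynamic-programming algorithm, written for illustration; it is not the implementation behind the library's `levenshtein()` method.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed one row at a time."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + (char_a != char_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


expected_response = "I can only help with Python code."
close_response = "I can only help with Python code"  # missing final period
print(levenshtein(close_response, expected_response))  # 1
```

With a threshold of 5, small deviations such as a missing period or a changed capital letter pass, while a substantially different answer does not.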