# AG-UI Integration
Ragas can evaluate agents that stream events via the [AG-UI protocol](https://docs.ag-ui.com/). This notebook shows how to build evaluation datasets, configure metrics, and score AG-UI endpoints.


## Prerequisites
- Install the optional dependencies with `pip install "ragas[ag-ui]" langchain-openai python-dotenv nest_asyncio`
- Start an AG-UI-compatible agent locally (Google ADK, PydanticAI, CrewAI, etc.)
- Create a `.env` file with your evaluator LLM credentials (e.g. `OPENAI_API_KEY`, `GOOGLE_API_KEY`)
- If you run this notebook, call `nest_asyncio.apply()` (shown below) so you can `await` coroutines in place.
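
The `.env` file might look like the following sketch (placeholder values; set only the keys your evaluator LLM actually needs, and keep the file out of version control):

```shell
# .env -- loaded by load_dotenv() at the top of the notebook
OPENAI_API_KEY=sk-...
# GOOGLE_API_KEY=...   # only needed if you grade with a Google model
```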


```python
# !pip install "ragas[ag-ui]" langchain-openai python-dotenv nest_asyncio
```

## Imports and environment setup
Load environment variables and import the classes used throughout the walkthrough.


```python
from dotenv import load_dotenv
import nest_asyncio
from IPython.display import display
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import EvaluationDataset, SingleTurnSample, MultiTurnSample
from ragas.integrations.ag_ui import (
    evaluate_ag_ui_agent,
    convert_to_ragas_messages,
    convert_messages_snapshot,
)
from ragas.messages import HumanMessage, ToolCall
from ragas.metrics import FactualCorrectness, ToolCallF1
from ragas.llms import LangchainLLMWrapper
from ag_ui.core import (
    MessagesSnapshotEvent,
    TextMessageChunkEvent,
    UserMessage,
    AssistantMessage,
)

load_dotenv()
# Patch the notebook's running event loop so we can await coroutines safely
nest_asyncio.apply()
```

## Build single-turn evaluation data
Create `SingleTurnSample` entries when you only need to grade the final answer text.


```python
scientist_questions = EvaluationDataset(
    samples=[
        SingleTurnSample(
            user_input="Who originated the theory of relativity?",
            reference="Albert Einstein originated the theory of relativity.",
        ),
        SingleTurnSample(
            user_input="Who discovered penicillin and when?",
            reference="Alexander Fleming discovered penicillin in 1928.",
        ),
    ]
)

scientist_questions
```




    EvaluationDataset(features=['user_input', 'reference'], len=2)


## Build multi-turn conversations
For tool-usage metrics, extend the dataset with `MultiTurnSample` entries and the expected tool calls.


```python
weather_queries = EvaluationDataset(
    samples=[
        MultiTurnSample(
            user_input=[HumanMessage(content="What's the weather in Paris?")],
            reference_tool_calls=[
                ToolCall(name="weatherTool", args={"location": "Paris"})
            ],
        )
    ]
)

weather_queries
```




    EvaluationDataset(features=['user_input', 'reference_tool_calls'], len=1)


## Configure metrics and the evaluator LLM
Wrap your grading model with the appropriate adapter and instantiate the metrics you plan to use. (The deprecation warning below comes from the ragas version used in this run, which recommends migrating to `llm_factory`.)


```python
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

qa_metrics = [FactualCorrectness(llm=evaluator_llm)]
tool_metrics = [ToolCallF1()]  # rule-based, no LLM required
```

    /var/folders/8k/tf3xr1rd1fl_dz35dfhfp_tc0000gn/T/ipykernel_93918/2135722072.py:1: DeprecationWarning: LangchainLLMWrapper is deprecated and will be removed in a future version. Use llm_factory instead: from openai import OpenAI; from ragas.llms import llm_factory; llm = llm_factory('gpt-4o-mini', client=OpenAI(api_key='...'))
      evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

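`ToolCallF1` compares the tool calls the agent actually made against `reference_tool_calls`. A rough sketch of that kind of set-based F1 over exact `(name, args)` matches (illustrative only, not the library's actual implementation):

```python
# Illustrative tool-call F1: exact (name, args) matching between the
# predicted and reference call sets. Not ragas' actual implementation.
def tool_call_f1(predicted, reference):
    # Freeze each call into a hashable (name, sorted-args) pair
    pred = {(name, tuple(sorted(args.items()))) for name, args in predicted}
    ref = {(name, tuple(sorted(args.items()))) for name, args in reference}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)  # calls that match exactly
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# An exact match scores 1.0; a wrong argument scores 0.0
print(tool_call_f1(
    [("weatherTool", {"location": "Paris"})],
    [("weatherTool", {"location": "Paris"})],
))  # 1.0
```

This explains the all-or-nothing scores in the tool-usage table further down: an agent that calls `weatherTool` with a different argument spelling gets 0.0, not partial credit.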
## Evaluate a live AG-UI endpoint
Set the endpoint URL exposed by your agent, then toggle the flags when you are ready to run the evaluations. In Jupyter/IPython you can `await` the helpers directly once `nest_asyncio.apply()` has been called.


```python
AG_UI_ENDPOINT = "http://localhost:8000/agentic_chat"  # Update to match your agent

RUN_FACTUAL_EVAL = False
RUN_TOOL_EVAL = False
```


```python
async def evaluate_factual():
    return await evaluate_ag_ui_agent(
        endpoint_url=AG_UI_ENDPOINT,
        dataset=scientist_questions,
        metrics=qa_metrics,
        evaluator_llm=evaluator_llm,
        metadata=True,
    )

if RUN_FACTUAL_EVAL:
    factual_result = await evaluate_factual()
    factual_df = factual_result.to_pandas()
    display(factual_df)
```


    Calling AG-UI Agent:   0%|          | 0/2 [00:00<?, ?it/s]


    Evaluating:   0%|          | 0/2 [00:00<?, ?it/s]


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>retrieved_contexts</th>
      <th>response</th>
      <th>reference</th>
      <th>factual_correctness(mode=f1)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Who originated the theory of relativity?</td>
      <td>[]</td>
      <td>The theory of relativity was originated by Alb...</td>
      <td>Albert Einstein originated the theory of relat...</td>
      <td>0.33</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Who discovered penicillin and when?</td>
      <td>[]</td>
      <td>Penicillin was discovered by Alexander Fleming...</td>
      <td>Alexander Fleming discovered penicillin in 1928.</td>
      <td>1.00</td>
    </tr>
  </tbody>
</table>
</div>

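`FactualCorrectness(mode=f1)` decomposes the response and reference into atomic claims and scores their overlap as an F1. The arithmetic behind a fractional score like the 0.33 above can be sketched as follows (illustrative numbers, not the library's actual claim decomposition):

```python
# Claim-level F1 arithmetic (illustrative):
#   precision = supported response claims / total response claims
#   recall    = covered reference claims  / total reference claims
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. one of three claims matched in each direction -> F1 = 1/3
print(round(f1(1 / 3, 1 / 3), 2))  # 0.33
```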
```python
async def evaluate_tool_usage():
    return await evaluate_ag_ui_agent(
        endpoint_url=AG_UI_ENDPOINT,
        dataset=weather_queries,
        metrics=tool_metrics,
        evaluator_llm=evaluator_llm,
    )

if RUN_TOOL_EVAL:
    tool_result = await evaluate_tool_usage()
    tool_df = tool_result.to_pandas()
    display(tool_df)
```


    Calling AG-UI Agent:   0%|          | 0/1 [00:00<?, ?it/s]


    Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>reference_tool_calls</th>
      <th>tool_call_f1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>[{'content': 'What's the weather in Paris?', '...</td>
      <td>[{'name': 'weatherTool', 'args': {'location': ...</td>
      <td>0.0</td>
    </tr>
  </tbody>
</table>
</div>

## Convert recorded AG-UI events
Use the conversion helpers when you already have an event log to grade offline.


```python
events = [
    TextMessageChunkEvent(
        message_id="assistant-1",
        role="assistant",
        delta="Hello from AG-UI!",
    )
]

messages_from_stream = convert_to_ragas_messages(events, metadata=True)

snapshot = MessagesSnapshotEvent(
    messages=[
        UserMessage(id="msg-1", content="Hello?"),
        AssistantMessage(id="msg-2", content="Hi! How can I help you today?"),
    ]
)

messages_from_snapshot = convert_messages_snapshot(snapshot)

messages_from_stream, messages_from_snapshot
```




    ([AIMessage(content='Hello from AG-UI!', metadata={'timestamp': None, 'message_id': 'assistant-1'}, type='ai', tool_calls=None)],
     [HumanMessage(content='Hello?', metadata=None, type='human'),
      AIMessage(content='Hi! How can I help you today?', metadata=None, type='ai', tool_calls=None)])

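Under the hood, streamed `TextMessageChunkEvent` deltas that share a `message_id` are merged into a single message. A minimal sketch of that accumulation (a simplified assumption about the converter's behavior; the real helper also tracks roles, timestamps, and tool calls):

```python
# Merge streamed text deltas into complete messages, keyed by message_id.
def accumulate_chunks(chunks):
    parts = {}  # message_id -> list of deltas, in arrival order
    for message_id, delta in chunks:
        parts.setdefault(message_id, []).append(delta)
    return {mid: "".join(deltas) for mid, deltas in parts.items()}

chunks = [
    ("assistant-1", "Hello "),
    ("assistant-1", "from "),
    ("assistant-1", "AG-UI!"),
]
print(accumulate_chunks(chunks))  # {'assistant-1': 'Hello from AG-UI!'}
```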