Commit 008d6c4 (parent 545124e): Added "How-to" docs and Jupyter notebook.

4 files changed, +1032 −0 lines

docs/howtos/integrations/_ag_ui.md (318 additions, 0 deletions)
# AG-UI Integration

Ragas can evaluate agents that stream events via the [AG-UI protocol](https://docs.ag-ui.com/). This notebook shows how to build evaluation datasets, configure metrics, and score AG-UI endpoints.

## Prerequisites

- Install the optional dependencies with `pip install "ragas[ag-ui]" langchain-openai python-dotenv nest_asyncio`
- Start an AG-UI-compatible agent locally (Google ADK, PydanticAI, CrewAI, etc.)
- Create a `.env` file with your evaluator LLM credentials (e.g. `OPENAI_API_KEY`, `GOOGLE_API_KEY`)
- If you run this notebook, call `nest_asyncio.apply()` (shown below) so you can `await` coroutines in place.

```python
# !pip install "ragas[ag-ui]" langchain-openai python-dotenv nest_asyncio
```
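The `.env` file from the prerequisites only needs the credentials for your evaluator LLM. For the OpenAI-backed setup used below, it might look like this (the key value is a placeholder):

```
OPENAI_API_KEY=sk-...
```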
## Imports and environment setup

Load environment variables and import the classes used throughout the walkthrough.

```python
import asyncio

from dotenv import load_dotenv
import nest_asyncio
from IPython.display import display
from langchain_openai import ChatOpenAI

from ragas.dataset_schema import EvaluationDataset, SingleTurnSample, MultiTurnSample
from ragas.integrations.ag_ui import (
    evaluate_ag_ui_agent,
    convert_to_ragas_messages,
    convert_messages_snapshot,
)
from ragas.messages import HumanMessage, ToolCall
from ragas.metrics import FactualCorrectness, ToolCallF1
from ragas.llms import LangchainLLMWrapper
from ag_ui.core import (
    MessagesSnapshotEvent,
    TextMessageChunkEvent,
    UserMessage,
    AssistantMessage,
)

load_dotenv()
# Patch the existing notebook loop so we can await coroutines safely
nest_asyncio.apply()
```
## Build single-turn evaluation data

Create `SingleTurnSample` entries when you only need to grade the final answer text.

```python
scientist_questions = EvaluationDataset(
    samples=[
        SingleTurnSample(
            user_input="Who originated the theory of relativity?",
            reference="Albert Einstein originated the theory of relativity.",
        ),
        SingleTurnSample(
            user_input="Who discovered penicillin and when?",
            reference="Alexander Fleming discovered penicillin in 1928.",
        ),
    ]
)

scientist_questions
```

    EvaluationDataset(features=['user_input', 'reference'], len=2)
## Build multi-turn conversations

For tool-usage metrics, extend the dataset with `MultiTurnSample` and expected tool calls.

```python
weather_queries = EvaluationDataset(
    samples=[
        MultiTurnSample(
            user_input=[HumanMessage(content="What's the weather in Paris?")],
            reference_tool_calls=[
                ToolCall(name="weatherTool", args={"location": "Paris"})
            ],
        )
    ]
)

weather_queries
```

    EvaluationDataset(features=['user_input', 'reference_tool_calls'], len=1)
## Configure metrics and the evaluator LLM

Wrap your grading model with the appropriate adapter and instantiate the metrics you plan to use.

```python
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

qa_metrics = [FactualCorrectness(llm=evaluator_llm)]
tool_metrics = [ToolCallF1()]  # rule-based, no LLM required
```

    DeprecationWarning: LangchainLLMWrapper is deprecated and will be removed in a future version. Use llm_factory instead: from openai import OpenAI; from ragas.llms import llm_factory; llm = llm_factory('gpt-4o-mini', client=OpenAI(api_key='...'))
## Evaluate a live AG-UI endpoint

Set the endpoint URL exposed by your agent and toggle the flags when you are ready to run the evaluations. In Jupyter/IPython you can `await` the helpers directly once `nest_asyncio.apply()` has been called.

```python
AG_UI_ENDPOINT = "http://localhost:8000/agentic_chat"  # Update to match your agent

RUN_FACTUAL_EVAL = False
RUN_TOOL_EVAL = False
```
```python
async def evaluate_factual():
    return await evaluate_ag_ui_agent(
        endpoint_url=AG_UI_ENDPOINT,
        dataset=scientist_questions,
        metrics=qa_metrics,
        evaluator_llm=evaluator_llm,
        metadata=True,
    )

if RUN_FACTUAL_EVAL:
    factual_result = await evaluate_factual()
    factual_df = factual_result.to_pandas()
    display(factual_df)
```

|   | user_input | retrieved_contexts | response | reference | factual_correctness(mode=f1) |
|---|------------|--------------------|----------|-----------|------------------------------|
| 0 | Who originated the theory of relativity? | [] | The theory of relativity was originated by Alb... | Albert Einstein originated the theory of relat... | 0.33 |
| 1 | Who discovered penicillin and when? | [] | Penicillin was discovered by Alexander Fleming... | Alexander Fleming discovered penicillin in 1928. | 1.00 |
```python
async def evaluate_tool_usage():
    return await evaluate_ag_ui_agent(
        endpoint_url=AG_UI_ENDPOINT,
        dataset=weather_queries,
        metrics=tool_metrics,
        evaluator_llm=evaluator_llm,
    )

if RUN_TOOL_EVAL:
    tool_result = await evaluate_tool_usage()
    tool_df = tool_result.to_pandas()
    display(tool_df)
```

|   | user_input | reference_tool_calls | tool_call_f1 |
|---|------------|----------------------|--------------|
| 0 | [{'content': 'What's the weather in Paris?', '... | [{'name': 'weatherTool', 'args': {'location': ... | 0.0 |
## Convert recorded AG-UI events

Use the conversion helpers when you already have an event log to grade offline.

```python
events = [
    TextMessageChunkEvent(
        message_id="assistant-1",
        role="assistant",
        delta="Hello from AG-UI!",
    )
]

messages_from_stream = convert_to_ragas_messages(events, metadata=True)

snapshot = MessagesSnapshotEvent(
    messages=[
        UserMessage(id="msg-1", content="Hello?"),
        AssistantMessage(id="msg-2", content="Hi! How can I help you today?"),
    ]
)

messages_from_snapshot = convert_messages_snapshot(snapshot)

messages_from_stream, messages_from_snapshot
```

    ([AIMessage(content='Hello from AG-UI!', metadata={'timestamp': None, 'message_id': 'assistant-1'}, type='ai', tool_calls=None)],
     [HumanMessage(content='Hello?', metadata=None, type='human'),
      AIMessage(content='Hi! How can I help you today?', metadata=None, type='ai', tool_calls=None)])
