
Commit e754b10

LlamaStack integration (#2011)
1 parent 98aafbe commit e754b10

File tree

7 files changed: +395 −26 lines

docs/howtos/integrations/index.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -9,9 +9,10 @@ happy to look into it 🙂
 ## Frameworks
 
 - [Amazon Bedrock](./amazon_bedrock.md) - Amazon Bedrock is a managed framework for building, deploying, and scaling intelligent agents and integrated AI solutions; more information can be found [here](https://aws.amazon.com/bedrock/).
+- [Haystack](./haystack.md) - Haystack is a LLM orchestration framework to build customizable, production-ready LLM applications, more information can be found [here](https://haystack.deepset.ai/).
 - [Langchain](./langchain.md) - Langchain is a framework for building LLM applications, more information can be found [here](https://www.langchain.com/).
 - [LlamaIndex](./_llamaindex.md) - LlamaIndex is a framework for building RAG applications, more information can be found [here](https://www.llamaindex.ai/).
-- [Haystack](./haystack.md) - Haystack is a LLM orchestration framework to build customizable, production-ready LLM applications, more information can be found [here](https://haystack.deepset.ai/).
+- [LlamaStack](./llama_stack.md) – A unified framework by Meta for building and deploying generative AI apps across local, cloud, and mobile; [docs](https://llama-stack.readthedocs.io/en/latest/)
 - [R2R](./r2r.md) - R2R is an all-in-one solution for AI Retrieval-Augmented Generation (RAG) with production-ready features, more information can be found [here](https://r2r-docs.sciphi.ai/introduction)
 - [Swarm](./swarm_agent_evaluation.md) - Swarm is a framework for orchestrating multiple AI agents, more information can be found [here](https://github.com/openai/swarm).
```

docs/howtos/integrations/llama_stack.md

Lines changed: 359 additions & 0 deletions (new file)
# Evaluating LlamaStack Web Search Groundedness with Llama 4

In this tutorial we will measure the groundedness of responses generated by LlamaStack's web search agent. [LlamaStack](https://llama-stack.readthedocs.io/en/latest/) is an open-source framework maintained by Meta that streamlines the development and deployment of large language model-powered applications. The evaluation will be done with Ragas metrics, using Meta Llama 4 Maverick as the judge.
## Setup and Running a LlamaStack server

This command installs all the dependencies needed for the LlamaStack server with the Together inference provider.

With conda:

```shell
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type conda
```

With venv:

```shell
!pip install ragas langchain-together uv
!uv run --with llama-stack llama stack build --template together --image-type venv
```
```python
import os
import subprocess


def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack run together --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Starting LlamaStack server with PID: {process.pid}")
    return process


def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            print(".", end="", flush=True)
            time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


# Use this helper if you need to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system(
        "ps aux | grep -v grep | grep llama_stack.distribution.server.server | awk '{print $2}' | xargs kill -9"
    )
```
## Starting the LlamaStack Server

```python
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()
```
```
Starting LlamaStack server with PID: 95508
Waiting for server to start....
Server is ready!
```
## Building a Search Agent

```python
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
)

agent = Agent(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="You are a helpful assistant. Use web search tool to answer the questions.",
    tools=["builtin::websearch"],
)

user_prompts = [
    "In which major did Demis Hassabis complete his undergraduate degree? Search the web for the answer.",
    "Ilya Sutskever is one of the key figures in AI. From which institution did he earn his PhD in machine learning? Search the web for the answer.",
    "Sam Altman, widely known for his role at OpenAI, was born in which American city? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in AgentEventLogger().log(response):
        log.print()
```
Now, let’s look deeper into the agent’s execution steps and see how well our agent performs.

```python
session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)
```
## Evaluate Agent Responses

We want to measure the groundedness of the responses generated by the LlamaStack web search agent. To do this we need an [EvaluationDataset](../../concepts/components/eval_dataset.md) and metrics that assess how well-grounded a response is; Ragas provides a wide array of off-the-shelf metrics for measuring various aspects of retrieval and generation.

To measure the groundedness of the responses we will use:

1. [Faithfulness](../../concepts/metrics/available_metrics/faithfulness.md)
2. [Response Groundedness](../../concepts/metrics/available_metrics/nvidia_metrics.md#response-groundedness)
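To build intuition for what these metrics capture, here is a toy sketch: faithfulness is roughly the fraction of claims in the response that are supported by the retrieved contexts. This is only an illustration with naive substring matching; the actual Ragas metrics use an LLM judge to extract and verify claims.

```python
# Toy sketch of the idea behind faithfulness -- NOT the Ragas implementation,
# which uses an LLM judge rather than substring matching.
def toy_faithfulness(claims: list, retrieved_contexts: list) -> float:
    """Fraction of response claims that appear verbatim in some retrieved context."""
    if not claims:
        return 0.0
    supported = sum(
        any(claim.lower() in ctx.lower() for ctx in retrieved_contexts)
        for claim in claims
    )
    return supported / len(claims)


contexts = ["Demis Hassabis holds a Bachelor's degree in Computer Science from Cambridge."]
claims = [
    "Demis Hassabis holds a Bachelor's degree in Computer Science",
    "Demis Hassabis founded DeepMind in 2010",
]
print(toy_faithfulness(claims, contexts))  # 1 of 2 claims supported -> 0.5
```

Response Groundedness asks a closely related question, scoring how fully the response as a whole is backed by the retrieved evidence rather than checking claim by claim.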
### Constructing a Ragas EvaluationDataset

To perform evaluations using Ragas we will create an `EvaluationDataset`.
```python
import json


# This function extracts the search results from the trace of each query
def extract_retrieved_contexts(turn_object):
    results = []
    for step in turn_object.steps:
        if step.step_type == "tool_execution":
            tool_responses = step.tool_responses
            for response in tool_responses:
                content = response.content
                if content:
                    try:
                        parsed_result = json.loads(content)
                        results.append(parsed_result)
                    except json.JSONDecodeError:
                        print("Warning: Unable to parse tool response content as JSON.")
                        continue

    retrieved_context = []
    for result in results:
        top_content_list = [item["content"] for item in result["top_k"]]
        retrieved_context.extend(top_content_list)
    return retrieved_context
```
```python
from ragas.dataset_schema import EvaluationDataset

samples = []

references = [
    "Demis Hassabis completed his undergraduate degree in Computer Science.",
    "Ilya Sutskever earned his PhD from the University of Toronto.",
    "Sam Altman was born in Chicago, Illinois.",
]

for i, turn in enumerate(session_response.turns):
    samples.append(
        {
            "user_input": turn.input_messages[0].content,
            "response": turn.output_message.content,
            "reference": references[i],
            "retrieved_contexts": extract_retrieved_contexts(turn),
        }
    )

ragas_eval_dataset = EvaluationDataset.from_list(samples)
```
```python
ragas_eval_dataset.to_pandas()
```
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>retrieved_contexts</th>
      <th>response</th>
      <th>reference</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>In which major did Demis Hassabis complete his...</td>
      <td>[Demis Hassabis holds a Bachelor's degree in C...</td>
      <td>Demis Hassabis completed his undergraduate deg...</td>
      <td>Demis Hassabis completed his undergraduate deg...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Ilya Sutskever is one of the key figures in AI...</td>
      <td>[Jump to content Main menu Search Donate Creat...</td>
      <td>Ilya Sutskever earned his PhD in machine learn...</td>
      <td>Ilya Sutskever earned his PhD from the Univers...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Sam Altman, widely known for his role at OpenA...</td>
      <td>[Sam Altman | Biography, OpenAI, Microsoft, &amp; ...</td>
      <td>Sam Altman was born in Chicago, Illinois, USA.</td>
      <td>Sam Altman was born in Chicago, Illinois.</td>
    </tr>
  </tbody>
</table>
</div>
### Setting the Ragas Metrics

```python
from ragas.metrics import AnswerAccuracy, Faithfulness, ResponseGroundedness
from langchain_together import ChatTogether
from ragas.llms import LangchainLLMWrapper

llm = ChatTogether(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
)
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    AnswerAccuracy(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]
```
## Evaluation

Finally, let's run the evaluation.

```python
from ragas import evaluate

results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
results.to_pandas()
```
```
Evaluating: 100%|██████████| 9/9 [00:04<00:00,  2.03it/s]
```
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>retrieved_contexts</th>
      <th>response</th>
      <th>reference</th>
      <th>nv_accuracy</th>
      <th>faithfulness</th>
      <th>nv_response_groundedness</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>In which major did Demis Hassabis complete his...</td>
      <td>[Demis Hassabis holds a Bachelor's degree in C...</td>
      <td>Demis Hassabis completed his undergraduate deg...</td>
      <td>Demis Hassabis completed his undergraduate deg...</td>
      <td>1.0</td>
      <td>1.0</td>
      <td>1.00</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Ilya Sutskever is one of the key figures in AI...</td>
      <td>[Jump to content Main menu Search Donate Creat...</td>
      <td>Ilya Sutskever earned his PhD in machine learn...</td>
      <td>Ilya Sutskever earned his PhD from the Univers...</td>
      <td>1.0</td>
      <td>0.5</td>
      <td>0.75</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Sam Altman, widely known for his role at OpenA...</td>
      <td>[Sam Altman | Biography, OpenAI, Microsoft, &amp; ...</td>
      <td>Sam Altman was born in Chicago, Illinois, USA.</td>
      <td>Sam Altman was born in Chicago, Illinois.</td>
      <td>1.0</td>
      <td>1.0</td>
      <td>1.00</td>
    </tr>
  </tbody>
</table>
</div>
```python
kill_llama_stack_server()
```

mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -116,6 +116,7 @@ nav:
   - LangGraph: howtos/integrations/_langgraph_agent_evaluation.md
   - LangSmith: howtos/integrations/langsmith.md
   - LlamaIndex: howtos/integrations/_llamaindex.md
+  - LlamaStack: howtos/integrations/llama_stack.md
   - R2R: howtos/integrations/r2r.md
   - Swarm: howtos/integrations/swarm_agent_evaluation.md
   - Migrations:
```

src/ragas/embeddings/base.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -222,7 +222,7 @@ def __post_init__(self):
         super().__init__(cache=self.cache)
         try:
             import sentence_transformers
-            from transformers import AutoConfig
+            from transformers import AutoConfig  # type: ignore
             from transformers.models.auto.modeling_auto import (
                 MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
             )
```
