# Griptape Integration

If you're familiar with Griptape's RAG Engine and want to start evaluating your RAG system's performance, you're in the right place. In this tutorial, we'll explore how to use Ragas to evaluate the responses generated by your Griptape RAG Engine.

## Griptape Setup

### Setting Up Our Environment

First, let's make sure we have all the required packages installed:

```shell
%pip install "griptape[all]" ragas -q
```

### Creating Our Dataset

We'll use a small dataset of text chunks about major LLM providers and set up a simple RAG pipeline:

```python
chunks = [
    "OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
    "Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
    "DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
    "Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
    "Meta AI with its LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
    "Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
    "Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
    "Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
    "AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
    "In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]
```

### Ingesting Data into the Vector Store

```python
import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
```

```python
from griptape.drivers.embedding.openai import OpenAiEmbeddingDriver
from griptape.drivers.vector.local import LocalVectorStoreDriver

# Set up a simple vector store with our data
vector_store = LocalVectorStoreDriver(embedding_driver=OpenAiEmbeddingDriver())
vector_store.upsert_collection({"major_llm_providers": chunks})
```
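
Optionally, we can sanity-check the ingestion by querying the vector store directly. This is a minimal sketch using the standard `query` method on Griptape vector store drivers and the namespace used above:

```python
# Sanity check (a sketch): retrieve the closest entries for a test query.
results = vector_store.query(
    "open-source LLaMA models", count=2, namespace="major_llm_providers"
)
for entry in results:
    print(entry.score)  # similarity score of each retrieved entry
```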

### Setting Up the RAG Engine

```python
from griptape.engines.rag import RagContext, RagEngine
from griptape.engines.rag.modules import (
    PromptResponseRagModule,
    VectorStoreRetrievalRagModule,
)
from griptape.engines.rag.stages import (
    ResponseRagStage,
    RetrievalRagStage,
)

# Create a basic RAG pipeline
rag_engine = RagEngine(
    # Stage for retrieving relevant chunks
    retrieval_stage=RetrievalRagStage(
        retrieval_modules=[
            VectorStoreRetrievalRagModule(
                name="VectorStore_Retriever",
                vector_store_driver=vector_store,
                query_params={"namespace": "major_llm_providers"},
            ),
        ],
    ),
    # Stage for generating a response
    response_stage=ResponseRagStage(
        response_modules=[
            PromptResponseRagModule(),
        ]
    ),
)
```

### Testing Our RAG Pipeline

Let's make sure our RAG pipeline works by testing it with a sample query:

```python
rag_context = RagContext(query="What makes Meta AI’s LLaMA models stand out?")
rag_context = rag_engine.process(rag_context)
rag_context.outputs[0].to_text()
```

Output:

```
"Meta AI's LLaMA models stand out for their open-source nature, which makes them accessible to researchers and developers. This accessibility supports innovation and experimentation, allowing for collaboration across industries. By making high-quality models available for free, Meta AI aims to democratize AI development, which has been a game-changer for researchers without access to expensive resources."
```
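
Beyond the final answer, it can help to inspect which chunks the retrieval stage actually pulled in. A small sketch, assuming the processed `RagContext` exposes the retrieved text artifacts on a `text_chunks` field:

```python
# Peek at the retrieved chunks (text_chunks is an assumption about RagContext)
for chunk in rag_context.text_chunks:
    print(chunk.to_text()[:80], "...")
```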

## Ragas Evaluation

### Creating a Ragas Evaluation Dataset

```python
questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft’s Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]

griptape_rag_contexts = []

for que in questions:
    rag_context = RagContext(query=que)
    griptape_rag_contexts.append(rag_engine.process(rag_context))
```

```python
from ragas.integrations.griptape import transform_to_ragas_dataset

ragas_eval_dataset = transform_to_ragas_dataset(
    grip_tape_rag_contexts=griptape_rag_contexts, references=references
)
```
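
Under the hood, this helper pairs each processed `RagContext` with its reference to build a Ragas `EvaluationDataset`. A roughly equivalent manual construction, as a sketch (the `text_chunks` field name is an assumption about `RagContext`, and this is not the integration's actual code):

```python
from ragas import EvaluationDataset, SingleTurnSample

# Build the same dataset by hand: one single-turn sample per query/reference pair
samples = [
    SingleTurnSample(
        user_input=ctx.query,
        retrieved_contexts=[c.to_text() for c in ctx.text_chunks],
        response=ctx.outputs[0].to_text(),
        reference=ref,
    )
    for ctx, ref in zip(griptape_rag_contexts, references)
]
manual_dataset = EvaluationDataset(samples=samples)
```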

```python
ragas_eval_dataset.to_pandas()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>retrieved_contexts</th>
      <th>response</th>
      <th>reference</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Who are the major players in the large languag...</td>
      <td>[In the rapidly advancing field of artificial ...</td>
      <td>The major players in the large language model ...</td>
      <td>The major players include OpenAI (GPT Series),...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>What is Microsoft’s Azure AI platform known for?</td>
      <td>[Microsoft’s Azure AI platform is famous for i...</td>
      <td>Microsoft’s Azure AI platform is known for int...</td>
      <td>Microsoft’s Azure AI platform is known for int...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>What kind of models does Cohere provide?</td>
      <td>[Cohere is well-known for its language models ...</td>
      <td>Cohere provides language models tailored for b...</td>
      <td>Cohere provides language models tailored for b...</td>
    </tr>
  </tbody>
</table>
</div>

### Running the Ragas Evaluation

Now, let's evaluate our RAG system using Ragas metrics:

#### Evaluating Retrieval

To evaluate retrieval performance, we can use the built-in Ragas metrics or create custom metrics tailored to our specific needs. For a comprehensive list of available metrics and customization options, see the [documentation](../../concepts/metrics/available_metrics/index.md).

We will use `ContextPrecision`, `ContextRecall`, and `ContextRelevance` to measure retrieval performance:

- [ContextPrecision](../../concepts/metrics/available_metrics/context_precision.md): Measures how well a RAG system's retriever ranks relevant chunks at the top of the retrieved context for a given query, calculated as the mean precision@k across all chunks.
- [ContextRecall](../../concepts/metrics/available_metrics/context_recall.md): Measures the proportion of relevant information successfully retrieved from the knowledge base.
- [ContextRelevance](../../concepts/metrics/available_metrics/nvidia_metrics.md#context-relevance): Measures how well the retrieved contexts address the user’s query by evaluating their pertinence through dual LLM judgments.

```python
from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas import evaluate
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
    ContextRelevance(llm=evaluator_llm),
]

retrieval_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
retrieval_results.to_pandas()
```
```
Evaluating: 100%|██████████| 9/9 [00:15<00:00, 1.77s/it]
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>retrieved_contexts</th>
      <th>response</th>
      <th>reference</th>
      <th>context_precision</th>
      <th>context_recall</th>
      <th>nv_context_relevance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Who are the major players in the large languag...</td>
      <td>[In the rapidly advancing field of artificial ...</td>
      <td>The major players in the large language model ...</td>
      <td>The major players include OpenAI (GPT Series),...</td>
      <td>1.000000</td>
      <td>1.0</td>
      <td>1.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>What is Microsoft’s Azure AI platform known for?</td>
      <td>[Microsoft’s Azure AI platform is famous for i...</td>
      <td>Microsoft’s Azure AI platform is known for int...</td>
      <td>Microsoft’s Azure AI platform is known for int...</td>
      <td>1.000000</td>
      <td>1.0</td>
      <td>1.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>What kind of models does Cohere provide?</td>
      <td>[Cohere is well-known for its language models ...</td>
      <td>Cohere provides language models tailored for b...</td>
      <td>Cohere provides language models tailored for b...</td>
      <td>0.833333</td>
      <td>1.0</td>
      <td>1.0</td>
    </tr>
  </tbody>
</table>
</div>
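
To summarize a run instead of reading it row by row, we can average each metric column of the results DataFrame (a small pandas sketch; the column names match the output table above):

```python
# Mean score per retrieval metric across all evaluated samples
ret_df = retrieval_results.to_pandas()
print(ret_df[["context_precision", "context_recall", "nv_context_relevance"]].mean())
```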

#### Evaluating Generation

To measure generation performance, we will use `FactualCorrectness`, `Faithfulness`, and `ResponseGroundedness`:

- [FactualCorrectness](../../concepts/metrics/available_metrics/factual_correctness.md): Checks if all statements in a response are supported by the reference answer.
- [Faithfulness](../../concepts/metrics/available_metrics/faithfulness.md): Measures how factually consistent a response is with the retrieved context.
- [ResponseGroundedness](../../concepts/metrics/available_metrics/nvidia_metrics.md#response-groundedness): Measures whether the response is grounded in the provided context, helping to identify hallucinations or made-up information.

```python
from ragas.metrics import FactualCorrectness, Faithfulness, ResponseGroundedness

ragas_metrics = [
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]

generation_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
generation_results.to_pandas()
```
```
Evaluating: 100%|██████████| 9/9 [00:17<00:00, 1.90s/it]
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>user_input</th>
      <th>retrieved_contexts</th>
      <th>response</th>
      <th>reference</th>
      <th>factual_correctness(mode=f1)</th>
      <th>faithfulness</th>
      <th>nv_response_groundedness</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Who are the major players in the large languag...</td>
      <td>[In the rapidly advancing field of artificial ...</td>
      <td>The major players in the large language model ...</td>
      <td>The major players include OpenAI (GPT Series),...</td>
      <td>1.00</td>
      <td>1.000000</td>
      <td>1.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>What is Microsoft’s Azure AI platform known for?</td>
      <td>[Microsoft’s Azure AI platform is famous for i...</td>
      <td>Microsoft’s Azure AI platform is known for int...</td>
      <td>Microsoft’s Azure AI platform is known for int...</td>
      <td>0.57</td>
      <td>0.833333</td>
      <td>1.0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>What kind of models does Cohere provide?</td>
      <td>[Cohere is well-known for its language models ...</td>
      <td>Cohere provides language models tailored for b...</td>
      <td>Cohere provides language models tailored for b...</td>
      <td>0.57</td>
      <td>1.000000</td>
      <td>1.0</td>
    </tr>
  </tbody>
</table>
</div>
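
As with retrieval, per-metric averages give a quick summary, and the two result frames can be combined into a single report (a pandas sketch; the metric column names are taken from the two output tables above):

```python
import pandas as pd

# Combine the mean retrieval and generation scores into one summary series
gen_df = generation_results.to_pandas()
summary = pd.concat(
    [
        ret_df[["context_precision", "context_recall", "nv_context_relevance"]].mean(),
        gen_df[
            ["factual_correctness(mode=f1)", "faithfulness", "nv_response_groundedness"]
        ].mean(),
    ]
)
print(summary)
```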

## Conclusion

Congratulations! You've successfully set up a Ragas evaluation pipeline for your Griptape RAG system. This evaluation provides valuable insights into how well your system retrieves relevant information and generates accurate responses.

Remember that RAG evaluation is an iterative process. Use these metrics to identify weaknesses in your system, make improvements, and re-evaluate until you achieve the performance level you need.

Happy RAGging! 😄
