
Commit 8efe80d

feat: improved the testset generation to_pandas and docs (#1536)

1 parent 2ca00c1 commit 8efe80d

File tree

9 files changed: +226 -63 lines

docs/getstarted/rag_evaluation.md

Lines changed: 1 addition & 0 deletions

@@ -51,3 +51,4 @@ df = results.to_pandas()
 df.head()
 ```
 
+![evaluation-result](./raga_evaluation_output.png)

docs/getstarted/rag_testset_generation.md

Lines changed: 110 additions & 1 deletion

@@ -1,7 +1,10 @@
-## Testset Generation for RAG
+# Testset Generation for RAG
 
 This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents.
 
+## Quickstart
+Let's walk through a quick example of generating a testset for a RAG pipeline. Following that, we will explore the main components of the testset generation pipeline.
+
 ### Load Sample Documents
 
 For the sake of this tutorial we will use sample documents from this [repository](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can replace this with your own documents.
@@ -47,3 +50,109 @@ You may now export and inspect the generated testset.
 ```python
 dataset.to_pandas()
 ```
+
+![testset](./testset_output.png)
+
+
+## A Deeper Look
+
+Now that we have seen how to generate a testset, let's take a closer look at the main components of the testset generation pipeline and how you can quickly customize it.
+
+At its core, two main operations are performed to generate a testset:
+
+1. **KnowledgeGraph Creation**: We first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents you provide and use various [Transformations][ragas.testset.transforms.base.BaseGraphTransformation] to enrich the knowledge graph with additional information that we can use to generate the testset. You can learn more about this in the [core concepts section](../concepts/test_data_generation/rag.md#knowledge-graph-creation).
+2. **Testset Generation**: We use the [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] to generate a set of [scenarios][ragas.testset.synthesizers.base.BaseScenario]. These scenarios are used to generate the [testset][ragas.testset.synthesizers.generate.Testset]. You can learn more about this in the [core concepts section](../concepts/test_data_generation/rag.md#scenario-generation).
+
+Now let's see an example of how these components work together to generate a testset.
+
+### KnowledgeGraph Creation
+
+Let's first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents we loaded earlier.
+
+```python
+from ragas.testset.graph import KnowledgeGraph
+
+kg = KnowledgeGraph()
+```
+```
+KnowledgeGraph(nodes: 0, relationships: 0)
+```
+
+and then add the documents to the knowledge graph.
+
+```python
+from ragas.testset.graph import Node, NodeType
+
+for doc in docs:
+    kg.nodes.append(
+        Node(
+            type=NodeType.DOCUMENT,
+            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
+        )
+    )
+```
+```
+KnowledgeGraph(nodes: 10, relationships: 0)
+```
+
+Now we will enrich the knowledge graph with additional information using [Transformations][ragas.testset.transforms.base.BaseGraphTransformation]. Here we will use [default_transforms][ragas.testset.transforms.default_transforms] to create a set of default transformations to apply with an LLM and embedding model of your choice.
+But you can mix and match transforms or build your own as needed.
+
+```python
+from ragas.testset.transforms import default_transforms, apply_transforms
+
+# choose your LLM and embedding model
+from ragas.llms import llm_factory
+from ragas.embeddings import embedding_factory
+
+transformer_llm = llm_factory("gpt-4o")
+embedding_model = embedding_factory("text-embedding-3-large")
+
+trans = default_transforms(llm=transformer_llm, embedding_model=embedding_model)
+apply_transforms(kg, trans)
+```
+
+Now we have a knowledge graph with additional information. You can also save the knowledge graph.
+
+```python
+kg.save("knowledge_graph.json")
+loaded_kg = KnowledgeGraph.load("knowledge_graph.json")
+loaded_kg
+```
+```
+KnowledgeGraph(nodes: 48, relationships: 605)
+```
+
+### Testset Generation
+
+Now we will use the `loaded_kg` to create the [TestsetGenerator][ragas.testset.synthesizers.generate.TestsetGenerator].
+
+```python
+from ragas.testset import TestsetGenerator
+
+generator = TestsetGenerator(llm=generator_llm, knowledge_graph=loaded_kg)
+```
+
+We can also define the distribution of queries we would like to generate. Here let's use the default distribution.
+
+```python
+from ragas.testset.synthesizers import default_query_distribution
+
+query_distribution = default_query_distribution(generator_llm)
+```
+```
+[
+    (AbstractQuerySynthesizer(llm=generator_llm), 0.25),
+    (ComparativeAbstractQuerySynthesizer(llm=generator_llm), 0.25),
+    (SpecificQuerySynthesizer(llm=generator_llm), 0.5),
+]
+```
+
+Now we can generate the testset.
+
+```python
+testset = generator.generate(testset_size=10, query_distribution=query_distribution)
+testset.to_pandas()
+```
+
+![testset](./testset_output.png)
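The new guide notes that you can "mix and match transforms or build your own as needed". As a minimal sketch of that idea — assuming the same `kg`, `transformer_llm`, and `embedding_model` from the walkthrough above, with constructor calls mirroring those in the new `default.py` shown later in this commit:

```python
# A sketch only: a reduced transform list instead of default_transforms.
# All names below are imported in the package's __init__.py in this commit.
from ragas.testset.transforms import (
    CosineSimilarityBuilder,
    EmbeddingExtractor,
    KeyphrasesExtractor,
    Parallel,
    SummaryExtractor,
    apply_transforms,
)

custom_transforms = [
    # extract summaries and keyphrases concurrently
    Parallel(
        SummaryExtractor(llm=transformer_llm),
        KeyphrasesExtractor(llm=transformer_llm),
    ),
    # then embed nodes and relate the similar ones
    EmbeddingExtractor(embedding_model=embedding_model),
    CosineSimilarityBuilder(threshold=0.8),
]
apply_transforms(kg, custom_transforms)
```

This simply drops the headline and summary-embedding steps from the default pipeline; the only commit-confirmed entry points are `default_transforms` and `apply_transforms`.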
docs/getstarted/raga_evaluation_output.png

Binary image added, 87.4 KB

docs/getstarted/testset_output.png

Binary image added, 376 KB

src/ragas/dataset_schema.py

Lines changed: 2 additions & 2 deletions

@@ -237,8 +237,8 @@ def to_csv(self, path: t.Union[str, Path]):
     def to_jsonl(self, path: t.Union[str, Path]):
         """Converts the dataset to a JSONL file."""
         with open(path, "w") as jsonlfile:
-            for sample in self.samples:
-                jsonlfile.write(json.dumps(sample.to_dict(), ensure_ascii=False) + "\n")
+            for sample in self.to_list():
+                jsonlfile.write(json.dumps(sample, ensure_ascii=False) + "\n")
 
     @classmethod
     def from_jsonl(cls: t.Type[T], path: t.Union[str, Path]) -> T:
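For illustration, a small sketch of the new serialization path. It assumes a `testset` produced as in the docs above, and that `Testset` inherits `to_jsonl` from this base class — which the paired `to_list` change in `testset_schema.py` below suggests:

```python
import json

# to_jsonl now writes one flat JSON object per line, taken from to_list()
testset.to_jsonl("testset.jsonl")

# each line carries the eval-sample fields plus "synthesizer_name",
# matching the new Testset.to_list() format shown later in this commit
with open("testset.jsonl") as f:
    first_row = json.loads(f.readline())
print(sorted(first_row))
```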

src/ragas/testset/synthesizers/generate.py

Lines changed: 18 additions & 1 deletion

@@ -20,11 +20,14 @@
 from langchain_core.documents import Document as LCDocument
 from langchain_core.language_models import BaseLanguageModel as LangchainLLM
 
+from ragas.embeddings.base import BaseRagasEmbeddings
+from ragas.llms.base import BaseRagasLLM
 from ragas.testset.synthesizers import QueryDistribution
 from ragas.testset.synthesizers.base import BaseScenario
 
 
 RAGAS_TESTSET_GENERATION_GROUP_NAME = "ragas testset generation"
+logger = logging.getLogger(__name__)
 
 
 @dataclass
@@ -60,6 +63,8 @@ def generate_with_langchain_docs(
         documents: t.Sequence[LCDocument],
         testset_size: int,
         transforms: t.Optional[Transforms] = None,
+        transforms_llm: t.Optional[BaseRagasLLM] = None,
+        transforms_embedding_model: t.Optional[BaseRagasEmbeddings] = None,
         query_distribution: t.Optional[QueryDistribution] = None,
         run_config: t.Optional[RunConfig] = None,
         callbacks: t.Optional[Callbacks] = None,
@@ -69,7 +74,19 @@ def generate_with_langchain_docs(
         """
         Generates an evaluation dataset based on given scenarios and parameters.
         """
-        transforms = transforms or default_transforms()
+        if transforms is None:
+            # use default transforms
+            if transforms_llm is None:
+                transforms_llm = self.llm
+                logger.info("Using TestGenerator.llm for transforms")
+            if transforms_embedding_model is None:
+                raise ValueError(
+                    "embedding_model must be provided for default_transforms. Alternatively you can provide your own transforms through the `transforms` parameter."
+                )
+            transforms = default_transforms(
+                llm=transforms_llm or self.llm,
+                embedding_model=transforms_embedding_model,
+            )
 
         # convert the documents to Ragas nodes
         nodes = []
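A hedged sketch of calling the method with the new keyword arguments; `docs`, `llm_factory`, and `embedding_factory` are borrowed from the getting-started guide above, and constructing `TestsetGenerator` without a pre-built knowledge graph is an assumption (the method builds nodes from the documents itself):

```python
from ragas.embeddings import embedding_factory
from ragas.llms import llm_factory
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=llm_factory("gpt-4o"))
testset = generator.generate_with_langchain_docs(
    documents=docs,
    testset_size=10,
    # with transforms omitted, default_transforms() is built internally:
    # transforms_llm falls back to generator.llm (logged at INFO), while
    # omitting transforms_embedding_model now raises a ValueError
    transforms_embedding_model=embedding_factory("text-embedding-3-large"),
)
```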

src/ragas/testset/synthesizers/testset_schema.py

Lines changed: 14 additions & 5 deletions

@@ -51,7 +51,12 @@ def to_list(self) -> t.List[t.Dict]:
         """
         Converts the Testset to a list of dictionaries.
         """
-        return [sample.model_dump() for sample in self.samples]
+        list_dict = []
+        for sample in self.samples:
+            sample_dict = sample.eval_sample.model_dump(exclude_none=True)
+            sample_dict["synthesizer_name"] = sample.synthesizer_name
+            list_dict.append(sample_dict)
+        return list_dict
 
     @classmethod
     def from_list(cls, data: t.List[t.Dict]) -> Testset:
@@ -61,19 +66,23 @@ def from_list(cls, data: t.List[t.Dict]) -> Testset:
         # first create the samples
         samples = []
         for sample in data:
-            eval_sample = sample["eval_sample"]
+            synthesizer_name = sample["synthesizer_name"]
+            # remove the synthesizer name from the sample
+            sample.pop("synthesizer_name")
+            # the remaining sample is the eval_sample
+            eval_sample = sample
 
             # if user_input is a list it is MultiTurnSample
             if "user_input" in eval_sample and not isinstance(
                 eval_sample.get("user_input"), list
             ):
-                eval_sample = SingleTurnSample(**sample["eval_sample"])
+                eval_sample = SingleTurnSample(**eval_sample)
             else:
-                eval_sample = MultiTurnSample(**sample["eval_sample"])
+                eval_sample = MultiTurnSample(**eval_sample)
 
             samples.append(
                 TestsetSample(
-                    eval_sample=eval_sample, synthesizer_name=sample["synthesizer_name"]
+                    eval_sample=eval_sample, synthesizer_name=synthesizer_name
                 )
             )
         # then create the testset
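A round-trip sketch of the new flat format. The import path is assumed from this file's location; `user_input` and `reference` are standard `SingleTurnSample` fields, and the synthesizer name is a hypothetical value:

```python
from ragas.testset.synthesizers.testset_schema import Testset

row = {
    "user_input": "What is ragas?",  # a plain string, so SingleTurnSample is chosen
    "reference": "An evaluation toolkit for RAG pipelines.",
    "synthesizer_name": "some_synthesizer",  # hypothetical value
}

# from_list pops "synthesizer_name" from each dict (mutating its input),
# so pass a copy; the remaining keys become the eval sample
ts = Testset.from_list([dict(row)])
assert ts.to_list()[0]["synthesizer_name"] == "some_synthesizer"
```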

src/ragas/testset/transforms/__init__.py

Lines changed: 1 addition & 54 deletions

@@ -1,4 +1,5 @@
 from .base import BaseGraphTransformation, Extractor, RelationshipBuilder, Splitter
+from .default import default_transforms
 from .engine import Parallel, Transforms, apply_transforms, rollback_transforms
 from .extractors import (
     EmbeddingExtractor,
@@ -13,60 +14,6 @@
 )
 from .splitters import HeadlineSplitter
 
-
-def default_transforms() -> Transforms:
-    """
-    Creates and returns a default set of transforms for processing a knowledge graph.
-
-    This function defines a series of transformation steps to be applied to a
-    knowledge graph, including extracting summaries, keyphrases, titles,
-    headlines, and embeddings, as well as building similarity relationships
-    between nodes.
-
-    The transforms are applied in the following order:
-    1. Parallel extraction of summaries and headlines
-    2. Embedding of summaries for document nodes
-    3. Splitting of headlines
-    4. Parallel extraction of embeddings, keyphrases, and titles
-    5. Building cosine similarity relationships between nodes
-    6. Building cosine similarity relationships between summaries
-
-    Returns
-    -------
-    Transforms
-        A list of transformation steps to be applied to the knowledge graph.
-
-    """
-    from ragas.testset.graph import NodeType
-
-    # define the transforms
-    summary_extractor = SummaryExtractor()
-    keyphrase_extractor = KeyphrasesExtractor()
-    title_extractor = TitleExtractor()
-    headline_extractor = HeadlinesExtractor()
-    embedding_extractor = EmbeddingExtractor()
-    headline_splitter = HeadlineSplitter()
-    cosine_sim_builder = CosineSimilarityBuilder(threshold=0.8)
-    summary_embedder = EmbeddingExtractor(
-        name="summary_embedder",
-        property_name="summary_embedding",
-        embed_property_name="summary",
-        filter_nodes=lambda node: True if node.type == NodeType.DOCUMENT else False,
-    )
-    summary_cosine_sim_builder = SummaryCosineSimilarityBuilder(threshold=0.6)
-
-    # specify the transforms and their order to be applied
-    transforms = [
-        Parallel(summary_extractor, headline_extractor),
-        summary_embedder,
-        headline_splitter,
-        Parallel(embedding_extractor, keyphrase_extractor, title_extractor),
-        cosine_sim_builder,
-        summary_cosine_sim_builder,
-    ]
-    return transforms
-
-
 __all__ = [
     # base
     "BaseGraphTransformation",
src/ragas/testset/transforms/default.py

Lines changed: 80 additions & 0 deletions

@@ -0,0 +1,80 @@
+from __future__ import annotations
+
+import typing as t
+
+from .engine import Parallel
+from .extractors import (
+    EmbeddingExtractor,
+    HeadlinesExtractor,
+    KeyphrasesExtractor,
+    SummaryExtractor,
+    TitleExtractor,
+)
+from .relationship_builders.cosine import (
+    CosineSimilarityBuilder,
+    SummaryCosineSimilarityBuilder,
+)
+from .splitters import HeadlineSplitter
+
+if t.TYPE_CHECKING:
+    from ragas.embeddings.base import BaseRagasEmbeddings
+    from ragas.llms.base import BaseRagasLLM
+
+    from .engine import Transforms
+
+
+def default_transforms(
+    llm: BaseRagasLLM,
+    embedding_model: BaseRagasEmbeddings,
+) -> Transforms:
+    """
+    Creates and returns a default set of transforms for processing a knowledge graph.
+
+    This function defines a series of transformation steps to be applied to a
+    knowledge graph, including extracting summaries, keyphrases, titles,
+    headlines, and embeddings, as well as building similarity relationships
+    between nodes.
+
+    The transforms are applied in the following order:
+    1. Parallel extraction of summaries and headlines
+    2. Embedding of summaries for document nodes
+    3. Splitting of headlines
+    4. Parallel extraction of embeddings, keyphrases, and titles
+    5. Building cosine similarity relationships between nodes
+    6. Building cosine similarity relationships between summaries
+
+    Returns
+    -------
+    Transforms
+        A list of transformation steps to be applied to the knowledge graph.
+
+    """
+    from ragas.testset.graph import NodeType
+
+    # define the transforms
+    summary_extractor = SummaryExtractor(llm=llm)
+    keyphrase_extractor = KeyphrasesExtractor(llm=llm)
+    title_extractor = TitleExtractor(llm=llm)
+    headline_extractor = HeadlinesExtractor(llm=llm)
+    embedding_extractor = EmbeddingExtractor(embedding_model=embedding_model)
+    headline_splitter = HeadlineSplitter()
+    cosine_sim_builder = CosineSimilarityBuilder(threshold=0.8)
+    summary_embedder = EmbeddingExtractor(
+        name="summary_embedder",
+        property_name="summary_embedding",
+        embed_property_name="summary",
+        filter_nodes=lambda node: True if node.type == NodeType.DOCUMENT else False,
+        embedding_model=embedding_model,
+    )
+    summary_cosine_sim_builder = SummaryCosineSimilarityBuilder(threshold=0.6)
+
+    # specify the transforms and their order to be applied
+    transforms = [
+        Parallel(summary_extractor, headline_extractor),
+        summary_embedder,
+        headline_splitter,
+        Parallel(embedding_extractor, keyphrase_extractor, title_extractor),
+        cosine_sim_builder,
+        summary_cosine_sim_builder,
+    ]
+    return transforms
