|
1 | | -## Testset Generation for RAG |
| 1 | +# Testset Generation for RAG |
2 | 2 |
|
3 | 3 | This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents. |
4 | 4 |
|
| 5 | +## Quickstart |
| 6 | +Let's walk through a quick example of generating a testset for a RAG pipeline. After that, we will explore the main components of the testset generation pipeline. |
| 7 | + |
5 | 8 | ### Load Sample Documents |
6 | 9 |
|
7 | 10 | For the sake of this tutorial we will use sample documents from this [repository](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can replace this with your own documents. |
@@ -47,3 +50,109 @@ You may now export and inspect the generated testset. |
47 | 50 | ```python |
48 | 51 | dataset.to_pandas() |
49 | 52 | ``` |
| 53 | + |
| 54 | + |
| 55 | + |
| 56 | + |
| 57 | +## A Deeper Look |
| 58 | + |
| 59 | +Now that we have seen how to generate a testset, let's take a closer look at the main components of the testset generation pipeline and how you can quickly customize it. |
| 60 | + |
| 61 | +At its core, testset generation consists of two main operations. |
| 62 | + |
| 63 | +1. **KnowledgeGraph Creation**: We first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents you provide and use various [Transformations][ragas.testset.transforms.base.BaseGraphTransformation] to enrich the knowledge graph with additional information that we can use to generate the testset. You can learn more about this from the [core concepts section](../concepts/test_data_generation/rag.md#knowledge-graph-creation). |
| 64 | +2. **Testset Generation**: We use the [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] to generate a set of [scenarios][ragas.testset.synthesizers.base.BaseScenario]. These scenarios are used to generate the [testset][ragas.testset.synthesizers.generate.Testset]. You can learn more about this from the [core concepts section](../concepts/test_data_generation/rag.md#scenario-generation). |
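To make the two operations concrete, here is a toy sketch in plain Python. It does not use ragas: `ToyNode`, `ToyGraph`, `add_keywords`, `link_by_overlap`, and `make_scenarios` are hypothetical names invented for this illustration, and the real transformations and scenario generation are LLM-driven and considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a knowledge graph node."""
    properties: dict = field(default_factory=dict)


@dataclass
class ToyGraph:
    """Hypothetical stand-in for a knowledge graph."""
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)


def add_keywords(graph: ToyGraph) -> None:
    """Operation 1 (enrichment): attach a 'keywords' property to every node."""
    for node in graph.nodes:
        words = node.properties["page_content"].lower().split()
        node.properties["keywords"] = {w.strip(".,") for w in words if len(w) > 4}


def link_by_overlap(graph: ToyGraph) -> None:
    """Operation 1 (enrichment): relate nodes that share at least one keyword."""
    for i, a in enumerate(graph.nodes):
        for b in graph.nodes[i + 1:]:
            shared = a.properties["keywords"] & b.properties["keywords"]
            if shared:
                graph.relationships.append((a, b, shared))


def make_scenarios(graph: ToyGraph) -> list:
    """Operation 2 (generation): turn each relationship into one scenario."""
    return [
        {"topic": sorted(shared)[0], "sources": [a, b]}
        for a, b, shared in graph.relationships
    ]


graph = ToyGraph(nodes=[
    ToyNode({"page_content": "Retrieval quality depends on chunking."}),
    ToyNode({"page_content": "Chunking strategy affects retrieval latency."}),
    ToyNode({"page_content": "Cats sleep a lot."}),
])
add_keywords(graph)
link_by_overlap(graph)
scenarios = make_scenarios(graph)
```

The point of the sketch is the ordering: the graph is enriched first, and scenarios are derived only from the enriched graph, so richer transformations give the generation step more to work with.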
| 65 | + |
| 66 | +Now let's see an example of how these components work together to generate a testset. |
| 67 | + |
| 68 | +### KnowledgeGraph Creation |
| 69 | + |
| 70 | +Let's first create a [KnowledgeGraph][ragas.testset.graph.KnowledgeGraph] using the documents we loaded earlier. |
| 71 | + |
| 72 | +```python |
| 73 | +from ragas.testset.graph import KnowledgeGraph |
| 74 | + |
| 75 | +kg = KnowledgeGraph() |
| 76 | +``` |
| 77 | +``` |
| 78 | +KnowledgeGraph(nodes: 0, relationships: 0) |
| 79 | +``` |
| 80 | + |
| 81 | +Then add the documents to the knowledge graph. |
| 82 | + |
| 83 | +```python |
| 84 | +from ragas.testset.graph import Node, NodeType |
| 85 | + |
| 86 | +for doc in docs: |
| 87 | +    kg.nodes.append( |
| 88 | +        Node( |
| 89 | +            type=NodeType.DOCUMENT, |
| 90 | +            properties={"page_content": doc.page_content, "document_metadata": doc.metadata} |
| 91 | +        ) |
| 92 | +    ) |
| 93 | +``` |
| 94 | +``` |
| 95 | +KnowledgeGraph(nodes: 10, relationships: 0) |
| 96 | +``` |
| 97 | + |
| 98 | +Now we will enrich the knowledge graph with additional information using [Transformations][ragas.testset.transforms.base.BaseGraphTransformation]. Here we use [default_transforms][ragas.testset.transforms.default_transforms] to build a default set of transformations that run with an LLM and embedding model of your choice. |
| 99 | +You can also mix and match transforms or build your own as needed. |
| 100 | + |
| 101 | +```python |
| 102 | +from ragas.testset.transforms import default_transforms, apply_transforms |
| 103 | + |
| 104 | +# choose your LLM and Embedding Model |
| 105 | +from ragas.llms import llm_factory |
| 106 | +from ragas.embeddings import embedding_factory |
| 107 | + |
| 108 | +transformer_llm = llm_factory("gpt-4o") |
| 109 | +embedding_model = embedding_factory("text-embedding-3-large") |
| 110 | + |
| 111 | +trans = default_transforms(llm=transformer_llm, embedding_model=embedding_model) |
| 112 | +apply_transforms(kg, trans) |
| 113 | +``` |
| 114 | + |
| 115 | +The knowledge graph now contains the additional information. You can also save it to disk and load it again later. |
| 116 | + |
| 117 | +```python |
| 118 | +kg.save("knowledge_graph.json") |
| 119 | +loaded_kg = KnowledgeGraph.load("knowledge_graph.json") |
| 120 | +loaded_kg |
| 121 | +``` |
| 122 | +``` |
| 123 | +KnowledgeGraph(nodes: 48, relationships: 605) |
| 124 | +``` |
| 125 | + |
| 126 | +### Testset Generation |
| 127 | + |
| 128 | +Now we will use the `loaded_kg` to create the [TestsetGenerator][ragas.testset.synthesizers.generate.TestsetGenerator]. |
| 129 | + |
| 130 | +```python |
| 131 | +from ragas.testset import TestsetGenerator |
| 132 | + |
| 133 | +generator = TestsetGenerator(llm=generator_llm, knowledge_graph=loaded_kg) |
| 134 | +``` |
| 135 | + |
| 136 | +We can also define the distribution of queries we would like to generate. Here let's use the default distribution. |
| 137 | + |
| 138 | +```python |
| 139 | +from ragas.testset.synthesizers import default_query_distribution |
| 140 | + |
| 141 | +query_distribution = default_query_distribution(generator_llm) |
| 142 | +``` |
| 143 | +``` |
| 144 | +[ |
| 145 | + (AbstractQuerySynthesizer(llm=generator_llm), 0.25), |
| 146 | + (ComparativeAbstractQuerySynthesizer(llm=generator_llm), 0.25), |
| 147 | + (SpecificQuerySynthesizer(llm=generator_llm), 0.5), |
| 148 | +] |
| 149 | +``` |
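To build intuition for what the weights mean, the sketch below splits a testset size across query types in proportion to their weights, using the largest-remainder method. This is an illustration only: `allocate_counts` and the string labels are hypothetical, and ragas may allocate samples differently internally.

```python
import math


def allocate_counts(distribution, testset_size):
    """Split testset_size across named query types by weight (largest remainder)."""
    total = sum(weight for _, weight in distribution)
    exact = [(name, testset_size * weight / total) for name, weight in distribution]
    counts = {name: math.floor(share) for name, share in exact}
    leftover = testset_size - sum(counts.values())
    # Hand the remaining samples to the largest fractional parts;
    # ties keep their original order because sorted() is stable.
    by_remainder = sorted(
        exact, key=lambda item: item[1] - math.floor(item[1]), reverse=True
    )
    for name, _ in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical labels mirroring the weights of the default distribution above.
distribution = [
    ("abstract", 0.25),
    ("comparative_abstract", 0.25),
    ("specific", 0.5),
]
counts = allocate_counts(distribution, testset_size=10)
```

With a testset size of 10, the two 0.25 weights each correspond to 2.5 samples, so one of them has to round up. Fractional shares like this are why the observed mix in a small testset may not match the weights exactly.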
| 150 | + |
| 151 | +Now we can generate the testset. |
| 152 | + |
| 153 | +```python |
| 154 | +testset = generator.generate(testset_size=10, query_distribution=query_distribution) |
| 155 | +testset.to_pandas() |
| 156 | +``` |
| 157 | + |
| 158 | + |