|
39 | 39 |
|
40 | 40 | - **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics. |
41 | 41 |
|
42 | | -- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees. |
43 | | - |
44 | | -- **Synthetic Dataset Generation**: Generate large-scale synthetic datasets to test your pipeline.
| 42 | +- **Probabilistic Evaluation**: Evaluate your pipeline with probabilistic metrics.
45 | 43 |
|
46 | 44 | ## Getting Started |
47 | 45 |
|
@@ -84,147 +82,188 @@ metric = PrecisionRecallF1() |
84 | 82 | print(metric(**datum)) |
85 | 83 | ``` |
86 | 84 |
|
87 | | -### Available Metrics |
88 | | - |
89 | | -<table border="0"> |
90 | | - <tr> |
91 | | - <th>Module</th> |
92 | | - <th>Category</th> |
93 | | - <th>Metrics</th> |
94 | | - </tr> |
95 | | - <tr> |
96 | | - <td rowspan="2">Retrieval</td> |
97 | | - <td>Deterministic</td> |
98 | | - <td>PrecisionRecallF1, RankedRetrievalMetrics, TokenCount</td> |
99 | | - </tr> |
100 | | - <tr> |
101 | | - <td>LLM-based</td> |
102 | | - <td>LLMBasedContextPrecision, LLMBasedContextCoverage</td> |
103 | | - </tr> |
104 | | - <tr> |
105 | | - <td rowspan="3">Text Generation</td> |
106 | | - <td>Deterministic</td> |
107 | | - <td>DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability</td> |
108 | | - </tr> |
109 | | - <tr> |
110 | | - <td>Semantic</td> |
111 | | - <td>DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity</td> |
112 | | - </tr> |
113 | | - <tr> |
114 | | - <td>LLM-based</td> |
115 | | - <td>LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency</td> |
116 | | - </tr> |
117 | | - <tr> |
118 | | - <td rowspan="1">Classification</td> |
119 | | - <td>Deterministic</td> |
120 | | - <td>ClassificationAccuracy</td> |
121 | | - </tr> |
122 | | - <tr> |
123 | | - <td rowspan="2">Code Generation</td> |
124 | | - <td>Deterministic</td> |
125 | | - <td>CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity</td> |
126 | | - </tr> |
127 | | - <tr> |
128 | | - <td>LLM-based</td> |
129 | | - <td>LLMBasedCodeGeneration</td> |
130 | | - </tr> |
131 | | - <tr> |
132 | | - <td>Agent Tools</td> |
133 | | - <td>Deterministic</td> |
134 | | - <td>ToolSelectionAccuracy</td> |
135 | | - </tr> |
136 | | - <tr> |
137 | | - <td>Custom</td> |
138 | | - <td></td> |
139 | | - <td>Define your own metrics</td> |
140 | | - </tr> |
141 | | -</table> |
142 | | - |
143 | | -To define your own metrics, you only need to extend the [Metric](continuous_eval/metrics/base.py#L23C7-L23C13) class and implement the `__call__` method.
144 | | -Optional methods are `batch` (if it is possible to implement optimizations for batch processing) and `aggregate` (to aggregate metric results over multiple samples).
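A minimal sketch of this approach, assuming the `Metric` base class in `continuous_eval/metrics/base.py` can be subclassed as described above; the exact import path, call signature, and return convention may differ between versions, and the metric itself is a hypothetical example:

```python
from continuous_eval.metrics.base import Metric


class AnswerLengthRatio(Metric):
    """Hypothetical custom metric: ratio of answer length to the average reference length."""

    def __call__(self, answer: str, ground_truth_answers: list, **kwargs):
        # Average character length of the reference answers (guard against empty input)
        avg_ref = sum(len(gt) for gt in ground_truth_answers) / max(len(ground_truth_answers), 1)
        # Return named results as a dictionary
        return {"answer_length_ratio": len(answer) / max(avg_ref, 1)}


# Example usage with the same keyword arguments used elsewhere in this README
print(AnswerLengthRatio()(answer="Paris", ground_truth_answers=["Paris", "The capital is Paris"]))
```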
145 | | - |
146 | | -## Run evaluation on a pipeline |
147 | | - |
148 | | -Define modules in your pipeline and select corresponding metrics. |
| 85 | +## Run an evaluation |
149 | 86 |
|
150 | | -```python |
151 | | -from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset, EvaluationRunner |
152 | | -from continuous_eval.eval.logger import PipelineLogger |
153 | | -from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics |
154 | | -from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness |
155 | | -from typing import List, Dict |
156 | | - |
157 | | -dataset = Dataset("dataset_folder") |
158 | | - |
159 | | -# Simple 3-step RAG pipeline with Retriever->Reranker->Generation |
160 | | -retriever = Module( |
161 | | -    name="Retriever",
162 | | -    input=dataset.question,
163 | | -    output=List[str],
164 | | -    eval=[
165 | | -        PrecisionRecallF1().use(
166 | | -            retrieved_context=ModuleOutput(),
167 | | -            ground_truth_context=dataset.ground_truth_context,
168 | | -        ),
169 | | -    ],
170 | | -) |
| 87 | +If you want to run an evaluation on a dataset, you can use the `EvaluationRunner` class. |
171 | 88 |
|
172 | | -reranker = Module( |
173 | | -    name="reranker",
174 | | -    input=retriever,
175 | | -    output=List[Dict[str, str]],
176 | | -    eval=[
177 | | -        RankedRetrievalMetrics().use(
178 | | -            retrieved_context=ModuleOutput(),
179 | | -            ground_truth_context=dataset.ground_truth_context,
180 | | -        ),
181 | | -    ],
| 89 | +```python |
| 90 | +from time import perf_counter |
| 91 | + |
| 92 | +from continuous_eval.data_downloader import example_data_downloader |
| 93 | +from continuous_eval.eval import EvaluationRunner, SingleModulePipeline |
| 94 | +from continuous_eval.eval.tests import GreaterOrEqualThan |
| 95 | +from continuous_eval.metrics.retrieval import ( |
| 96 | +    PrecisionRecallF1,
| 97 | +    RankedRetrievalMetrics,
182 | 98 | ) |
183 | 99 |
|
184 | | -llm = Module( |
185 | | -    name="answer_generator",
186 | | -    input=reranker,
187 | | -    output=str,
188 | | -    eval=[
189 | | -        FleschKincaidReadability().use(answer=ModuleOutput()),
190 | | -        DeterministicAnswerCorrectness().use(
191 | | -            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
192 | | -        ),
193 | | -    ],
194 | | -) |
195 | 100 |
|
196 | | -pipeline = Pipeline([retriever, reranker, llm], dataset=dataset) |
197 | | -print(pipeline.graph_repr()) # optional: visualize the pipeline |
| 101 | +def main(): |
| 102 | +    # Let's download the retrieval dataset example
| 103 | +    dataset = example_data_downloader("retrieval")
| 104 | +
| 105 | +    # Setup evaluation pipeline (i.e., dataset, metrics and tests)
| 106 | +    pipeline = SingleModulePipeline(
| 107 | +        dataset=dataset,
| 108 | +        eval=[
| 109 | +            PrecisionRecallF1().use(
| 110 | +                retrieved_context=dataset.retrieved_contexts,
| 111 | +                ground_truth_context=dataset.ground_truth_contexts,
| 112 | +            ),
| 113 | +            RankedRetrievalMetrics().use(
| 114 | +                retrieved_context=dataset.retrieved_contexts,
| 115 | +                ground_truth_context=dataset.ground_truth_contexts,
| 116 | +            ),
| 117 | +        ],
| 118 | +        tests=[
| 119 | +            GreaterOrEqualThan(
| 120 | +                test_name="Recall", metric_name="context_recall", min_value=0.8
| 121 | +            ),
| 122 | +        ],
| 123 | +    )
| 124 | +
| 125 | +    # Start the evaluation manager and run the metrics (and tests)
| 126 | +    tic = perf_counter()
| 127 | +    runner = EvaluationRunner(pipeline)
| 128 | +    eval_results = runner.evaluate()
| 129 | +    toc = perf_counter()
| 130 | +    print("Evaluation results:")
| 131 | +    print(eval_results.aggregate())
| 132 | +    print(f"Elapsed time: {toc - tic:.2f} seconds\n")
| 133 | +
| 134 | +    print("Running tests...")
| 135 | +    test_results = runner.test(eval_results)
| 136 | +    print(test_results)
| 137 | +
| 138 | +
| 139 | +if __name__ == "__main__":
| 140 | +    # It is important to run this script in a new process to avoid
| 141 | +    # multiprocessing issues
| 142 | +    main()
198 | 143 | ``` |
199 | 144 |
|
200 | | -Now you can run the evaluation on your pipeline |
| 145 | +## Run evaluation on a pipeline (modular evaluation) |
| 146 | + |
| 147 | +Sometimes the system is composed of multiple modules, each with its own metrics and tests. |
| 148 | +Continuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics. |
201 | 149 |
|
202 | 150 | ```python |
203 | | -pipelog = PipelineLogger(pipeline=pipeline) |
| 151 | +from typing import Any, Dict, List |
| 152 | + |
| 153 | +from continuous_eval.data_downloader import example_data_downloader |
| 154 | +from continuous_eval.eval import ( |
| 155 | +    Dataset,
| 156 | +    EvaluationRunner,
| 157 | +    Module,
| 158 | +    ModuleOutput,
| 159 | +    Pipeline,
| 160 | +) |
| 161 | +from continuous_eval.eval.result_types import PipelineResults |
| 162 | +from continuous_eval.metrics.generation.text import AnswerCorrectness |
| 163 | +from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics |
204 | 164 |
|
205 | | -# now run your LLM application pipeline, and for each module, log the results: |
206 | | -pipelog.log(uid=sample_uid, module="module_name", value=data) |
207 | 165 |
|
208 | | -# Once you finish logging the data, you can use the EvaluationRunner to evaluate the logs |
209 | | -evalrunner = EvaluationRunner(pipeline) |
210 | | -metrics = evalrunner.evaluate(pipelog) |
211 | | -metrics.results() # returns a dictionary with the results |
| 166 | +def page_content(docs: List[Dict[str, Any]]) -> List[str]: |
| 167 | +    # Extract the content of the retrieved documents from the pipeline results
| 168 | +    return [doc["page_content"] for doc in docs]
| 169 | +
| 170 | +
| 171 | +def main():
| 172 | +    dataset: Dataset = example_data_downloader("graham_essays/small/dataset")
| 173 | +    results: Dict = example_data_downloader("graham_essays/small/results")
| 174 | +
| 175 | +    # Simple 3-step RAG pipeline with Retriever->Reranker->Generation
| 176 | +    retriever = Module(
| 177 | +        name="retriever",
| 178 | +        input=dataset.question,
| 179 | +        output=List[str],
| 180 | +        eval=[
| 181 | +            PrecisionRecallF1().use(
| 182 | +                retrieved_context=ModuleOutput(page_content),  # specify how to extract what we need (i.e., page_content)
| 183 | +                ground_truth_context=dataset.ground_truth_context,
| 184 | +            ),
| 185 | +        ],
| 186 | +    )
| 187 | +
| 188 | +    reranker = Module(
| 189 | +        name="reranker",
| 190 | +        input=retriever,
| 191 | +        output=List[Dict[str, str]],
| 192 | +        eval=[
| 193 | +            RankedRetrievalMetrics().use(
| 194 | +                retrieved_context=ModuleOutput(page_content),
| 195 | +                ground_truth_context=dataset.ground_truth_context,
| 196 | +            ),
| 197 | +        ],
| 198 | +    )
| 199 | +
| 200 | +    llm = Module(
| 201 | +        name="llm",
| 202 | +        input=reranker,
| 203 | +        output=str,
| 204 | +        eval=[
| 205 | +            AnswerCorrectness().use(
| 206 | +                question=dataset.question,
| 207 | +                answer=ModuleOutput(),
| 208 | +                ground_truth_answers=dataset.ground_truth_answers,
| 209 | +            ),
| 210 | +        ],
| 211 | +    )
| 212 | +
| 213 | +    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
| 214 | +    print(pipeline.graph_repr())  # visualize the pipeline in Mermaid format
| 215 | +
| 216 | +    runner = EvaluationRunner(pipeline)
| 217 | +    eval_results = runner.evaluate(PipelineResults.from_dict(results))
| 218 | +    print(eval_results.aggregate())
| 219 | +
| 220 | +
| 221 | +if __name__ == "__main__":
| 222 | +    main()
212 | 223 | ``` |
213 | 224 |
|
214 | | -To run evaluation over an existing dataset (BYODataset), you can run the following: |
| 225 | +> Note: it is important to wrap your code in a main function (with the `if __name__ == "__main__":` guard) to make sure the parallelization works properly. |
215 | 226 |
|
216 | | -```python |
217 | | -dataset = Dataset(...) |
218 | | -evalrunner = EvaluationRunner(pipeline) |
219 | | -metrics = evalrunner.evaluate(dataset) |
220 | | -``` |
| 227 | +## Custom Metrics |
| 228 | + |
| 229 | +There are several ways to create custom metrics; see the [Custom Metrics](https://continuous-eval.docs.relari.ai/v0.3/metrics/overview) section of the docs.
221 | 230 |
|
222 | | -## Synthetic Data Generation |
| 231 | +The simplest way is to leverage the `CustomMetric` class to create an LLM-as-a-Judge metric.
223 | 232 |
|
224 | | -Ground truth data, or reference data, is important for evaluation as it can offer a comprehensive and consistent measurement of system performance. However, it is often costly and time-consuming to manually curate such a golden dataset.
225 | | -We have created a synthetic data pipeline that can generate custom user interaction data for a variety of use cases such as RAG, agents, and copilots. These datasets can serve as a starting point for a golden dataset for evaluation or for other training purposes.
| 233 | +```python |
| 234 | +from continuous_eval.metrics.base.metric import Arg, Field |
| 235 | +from continuous_eval.metrics.custom import CustomMetric |
| 236 | +from typing import List |
| 237 | + |
| 238 | +criteria = "Check that the generated answer does not contain PII or other sensitive information." |
| 239 | +rubric = """Use the following rubric to assign a score to the answer based on whether it contains PII or other sensitive information:
| 240 | +- Yes: The answer contains PII or other sensitive information. |
| 241 | +- No: The answer does not contain PII or other sensitive information. |
| 242 | +""" |
| 243 | + |
| 244 | +metric = CustomMetric( |
| 245 | +    name="PIICheck",
| 246 | +    criteria=criteria,
| 247 | +    rubric=rubric,
| 248 | +    arguments={"answer": Arg(type=str, description="The answer to evaluate.")},
| 249 | +    response_format={
| 250 | +        "reasoning": Field(
| 251 | +            type=str,
| 252 | +            description="The reasoning for the score given to the answer",
| 253 | +        ),
| 254 | +        "score": Field(
| 255 | +            type=str, description="The score of the answer: Yes or No"
| 256 | +        ),
| 257 | +        "identifies": Field(
| 258 | +            type=List[str],
| 259 | +            description="The PII or other sensitive information identified in the answer",
| 260 | +        ),
| 261 | +    },
| 262 | +) |
226 | 263 |
|
227 | | -To generate custom synthetic data, please visit [Relari](https://www.relari.ai/) to create a free account and you can then generate custom synthetic golden datasets through the Relari Cloud. |
| 264 | +# Let's calculate the metric on an example answer
| 265 | +print(metric(answer="John Doe resides at 123 Main Street, Springfield.")) |
| 266 | +``` |
228 | 267 |
|
229 | 268 | ## 💡 Contributing |
230 | 269 |
|