
Commit 31e305e

Probabilistic metrics

- Add probabilistic metrics
- Refactor all metrics to use Jinja templates for prompts
- Improve custom-metric (e.g., LLM-as-a-Judge) creation
- Refactor LLMFactory
- Improve parallelism for CPU- and IO-bound processing
- Add support for model names in the TokenCount metric
- Fix a memory leak when using any matching strategy in deterministic metrics
- Fix issues with DeBERTa models, the runner, and NLTK
- Update docs (incl. Starlight version update)
1 parent eb8384e commit 31e305e
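One of the listed changes, moving all metric prompts to Jinja templates, is easiest to picture with a small example. The sketch below is a generic illustration of Jinja-based prompt templating using the `jinja2` package; the template text and variable names are invented for illustration and are not continuous-eval's internal templates or API.

```python
# Generic Jinja templating sketch (illustration only, not continuous-eval's
# internal code): render an LLM judge prompt from a template with variables.
from jinja2 import Template

JUDGE_PROMPT = Template(
    "You are an evaluator.\n"
    "Criteria: {{ criteria }}\n"
    "Question: {{ question }}\n"
    "Answer: {{ answer }}\n"
    "Respond with Yes or No and a short reasoning."
)

print(
    JUDGE_PROMPT.render(
        criteria="The answer must not contain PII.",
        question="Where does the user live?",
        answer="The user lives in Springfield.",
    )
)
```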

File tree

144 files changed (+12624 / −16119 lines)


.pre-commit-config.yaml

Lines changed: 9 additions & 17 deletions

```diff
@@ -31,21 +31,13 @@ repos:
       - id: pyupgrade
         args: ['--keep-percent-format', '--keep-runtime-typing']
         stages: [commit]
-  - repo: https://github.com/pre-commit/mirrors-isort # Sort imports, similar to PyCharm organize imports
-    rev: v5.10.1
-    hooks:
-      - id: isort
-        args: ['--line-length=120', '--profile=black']
-        stages: [commit]
-      - id: isort
-        args: ['--line-length=120', '--profile=black', '--check']
-        stages: [manual]
-  - repo: https://github.com/psf/black
-    rev: 22.3.0
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    # Ruff version.
+    rev: v0.8.3
     hooks:
-      - id: black
-        args: ['--line-length=120', '--skip-string-normalization']
-        stages: [commit]
-      - id: black
-        args: ['--line-length=120', '--skip-string-normalization', '--check']
-        stages: [manual]
+      # Run the linter.
+      - id: ruff
+        args: [ --fix ]
+      # Run the formatter.
+      - id: ruff-format
```

README.md

Lines changed: 166 additions & 127 deletions

````diff
@@ -39,9 +39,7 @@
 
 - **Comprehensive Metric Library**: Covers Retrieval-Augmented Generation (RAG), Code Generation, Agent Tool Use, Classification and a variety of other LLM use cases. Mix and match Deterministic, Semantic and LLM-based metrics.
 
-- **Leverage User Feedback in Evaluation**: Easily build a close-to-human ensemble evaluation pipeline with mathematical guarantees.
-
-- **Synthetic Dataset Generation**: Generate large-scale synthetic dataset to test your pipeline.
+- **Probabilistic Evaluation**: Evaluate your pipeline with probabilistic metrics
 
 ## Getting Started
 
@@ -84,147 +82,188 @@ metric = PrecisionRecallF1()
 print(metric(**datum))
 ```
 
-### Available Metrics
-
-<table border="0">
-<tr>
-<th>Module</th>
-<th>Category</th>
-<th>Metrics</th>
-</tr>
-<tr>
-<td rowspan="2">Retrieval</td>
-<td>Deterministic</td>
-<td>PrecisionRecallF1, RankedRetrievalMetrics, TokenCount</td>
-</tr>
-<tr>
-<td>LLM-based</td>
-<td>LLMBasedContextPrecision, LLMBasedContextCoverage</td>
-</tr>
-<tr>
-<td rowspan="3">Text Generation</td>
-<td>Deterministic</td>
-<td>DeterministicAnswerCorrectness, DeterministicFaithfulness, FleschKincaidReadability</td>
-</tr>
-<tr>
-<td>Semantic</td>
-<td>DebertaAnswerScores, BertAnswerRelevance, BertAnswerSimilarity</td>
-</tr>
-<tr>
-<td>LLM-based</td>
-<td>LLMBasedFaithfulness, LLMBasedAnswerCorrectness, LLMBasedAnswerRelevance, LLMBasedStyleConsistency</td>
-</tr>
-<tr>
-<td rowspan="1">Classification</td>
-<td>Deterministic</td>
-<td>ClassificationAccuracy</td>
-</tr>
-<tr>
-<td rowspan="2">Code Generation</td>
-<td>Deterministic</td>
-<td>CodeStringMatch, PythonASTSimilarity, SQLSyntaxMatch, SQLASTSimilarity</td>
-</tr>
-<tr>
-<td>LLM-based</td>
-<td>LLMBasedCodeGeneration</td>
-</tr>
-<tr>
-<td>Agent Tools</td>
-<td>Deterministic</td>
-<td>ToolSelectionAccuracy</td>
-</tr>
-<tr>
-<td>Custom</td>
-<td></td>
-<td>Define your own metrics</td>
-</tr>
-</table>
-
-To define your own metrics, you only need to extend the [Metric](continuous_eval/metrics/base.py#L23C7-L23C13) class implementing the `__call__` method.
-Optional methods are `batch` (if it is possible to implement optimizations for batch processing) and `aggregate` (to aggregate metrics results over multiple samples_).
-
-## Run evaluation on a pipeline
-
-Define modules in your pipeline and select corresponding metrics.
+## Run an evaluation
 
-```python
-from continuous_eval.eval import Module, ModuleOutput, Pipeline, Dataset, EvaluationRunner
-from continuous_eval.eval.logger import PipelineLogger
-from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
-from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness
-from typing import List, Dict
-
-dataset = Dataset("dataset_folder")
-
-# Simple 3-step RAG pipeline with Retriever->Reranker->Generation
-retriever = Module(
-    name="Retriever",
-    input=dataset.question,
-    output=List[str],
-    eval=[
-        PrecisionRecallF1().use(
-            retrieved_context=ModuleOutput(),
-            ground_truth_context=dataset.ground_truth_context,
-        ),
-    ],
-)
+If you want to run an evaluation on a dataset, you can use the `EvaluationRunner` class.
 
-reranker = Module(
-    name="reranker",
-    input=retriever,
-    output=List[Dict[str, str]],
-    eval=[
-        RankedRetrievalMetrics().use(
-            retrieved_context=ModuleOutput(),
-            ground_truth_context=dataset.ground_truth_context,
-        ),
-    ],
+```python
+from time import perf_counter
+
+from continuous_eval.data_downloader import example_data_downloader
+from continuous_eval.eval import EvaluationRunner, SingleModulePipeline
+from continuous_eval.eval.tests import GreaterOrEqualThan
+from continuous_eval.metrics.retrieval import (
+    PrecisionRecallF1,
+    RankedRetrievalMetrics,
 )
 
-llm = Module(
-    name="answer_generator",
-    input=reranker,
-    output=str,
-    eval=[
-        FleschKincaidReadability().use(answer=ModuleOutput()),
-        DeterministicAnswerCorrectness().use(
-            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
-        ),
-    ],
-)
 
-pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
-print(pipeline.graph_repr()) # optional: visualize the pipeline
+def main():
+    # Let's download the retrieval dataset example
+    dataset = example_data_downloader("retrieval")
+
+    # Setup evaluation pipeline (i.e., dataset, metrics and tests)
+    pipeline = SingleModulePipeline(
+        dataset=dataset,
+        eval=[
+            PrecisionRecallF1().use(
+                retrieved_context=dataset.retrieved_contexts,
+                ground_truth_context=dataset.ground_truth_contexts,
+            ),
+            RankedRetrievalMetrics().use(
+                retrieved_context=dataset.retrieved_contexts,
+                ground_truth_context=dataset.ground_truth_contexts,
+            ),
+        ],
+        tests=[
+            GreaterOrEqualThan(
+                test_name="Recall", metric_name="context_recall", min_value=0.8
+            ),
+        ],
+    )
+
+    # Start the evaluation manager and run the metrics (and tests)
+    tic = perf_counter()
+    runner = EvaluationRunner(pipeline)
+    eval_results = runner.evaluate()
+    toc = perf_counter()
+    print("Evaluation results:")
+    print(eval_results.aggregate())
+    print(f"Elapsed time: {toc - tic:.2f} seconds\n")
+
+    print("Running tests...")
+    test_results = runner.test(eval_results)
+    print(test_results)
+
+
+if __name__ == "__main__":
+    # It is important to run this script in a new process to avoid
+    # multiprocessing issues
+    main()
 ```
 
-Now you can run the evaluation on your pipeline
+## Run evaluation on a pipeline (modular evaluation)
+
+Sometimes the system is composed of multiple modules, each with its own metrics and tests.
+Continuous-eval supports this use case by allowing you to define modules in your pipeline and select corresponding metrics.
 
 ```python
-pipelog = PipelineLogger(pipeline=pipeline)
+from typing import Any, Dict, List
+
+from continuous_eval.data_downloader import example_data_downloader
+from continuous_eval.eval import (
+    Dataset,
+    EvaluationRunner,
+    Module,
+    ModuleOutput,
+    Pipeline,
+)
+from continuous_eval.eval.result_types import PipelineResults
+from continuous_eval.metrics.generation.text import AnswerCorrectness
+from continuous_eval.metrics.retrieval import PrecisionRecallF1, RankedRetrievalMetrics
 
-# now run your LLM application pipeline, and for each module, log the results:
-pipelog.log(uid=sample_uid, module="module_name", value=data)
 
-# Once you finish logging the data, you can use the EvaluationRunner to evaluate the logs
-evalrunner = EvaluationRunner(pipeline)
-metrics = evalrunner.evaluate(pipelog)
-metrics.results() # returns a dictionary with the results
+def page_content(docs: List[Dict[str, Any]]) -> List[str]:
+    # Extract the content of the retrieved documents from the pipeline results
+    return [doc["page_content"] for doc in docs]
+
+
+def main():
+    dataset: Dataset = example_data_downloader("graham_essays/small/dataset")
+    results: Dict = example_data_downloader("graham_essays/small/results")
+
+    # Simple 3-step RAG pipeline with Retriever->Reranker->Generation
+    retriever = Module(
+        name="retriever",
+        input=dataset.question,
+        output=List[str],
+        eval=[
+            PrecisionRecallF1().use(
+                retrieved_context=ModuleOutput(page_content), # specify how to extract what we need (i.e., page_content)
+                ground_truth_context=dataset.ground_truth_context,
+            ),
+        ],
+    )
+
+    reranker = Module(
+        name="reranker",
+        input=retriever,
+        output=List[Dict[str, str]],
+        eval=[
+            RankedRetrievalMetrics().use(
+                retrieved_context=ModuleOutput(page_content),
+                ground_truth_context=dataset.ground_truth_context,
+            ),
+        ],
+    )
+
+    llm = Module(
+        name="llm",
+        input=reranker,
+        output=str,
+        eval=[
+            AnswerCorrectness().use(
+                question=dataset.question,
+                answer=ModuleOutput(),
+                ground_truth_answers=dataset.ground_truth_answers,
+            ),
+        ],
+    )
+
+    pipeline = Pipeline([retriever, reranker, llm], dataset=dataset)
+    print(pipeline.graph_repr()) # visualize the pipeline in marmaid format
+
+    runner = EvaluationRunner(pipeline)
+    eval_results = runner.evaluate(PipelineResults.from_dict(results))
+    print(eval_results.aggregate())
+
+
+if __name__ == "__main__":
+    main()
 ```
 
-To run evaluation over an existing dataset (BYODataset), you can run the following:
+> Note: it is important to wrap your code in a main function (with the `if __name__ == "__main__":` guard) to make sure the parallelization works properly.
 
-```python
-dataset = Dataset(...)
-evalrunner = EvaluationRunner(pipeline)
-metrics = evalrunner.evaluate(dataset)
-```
+## Custom Metrics
+
+There are several ways to create custom metrics, see the [Custom Metrics](https://continuous-eval.docs.relari.ai/v0.3/metrics/overview) section in the docs.
 
-## Synthetic Data Generation
+The simplest way is to leverage the `CustomMetric` class to create a LLM-as-a-Judge.
 
-Ground truth data, or reference data, is important for evaluation as it can offer a comprehensive and consistent measurement of system performance. However, it is often costly and time-consuming to manually curate such a golden dataset.
-We have created a synthetic data pipeline that can custom generate user interaction data for a variety of use cases such as RAG, agents, copilots. They can serve a starting point for a golden dataset for evaluation or for other training purposes.
+```python
+from continuous_eval.metrics.base.metric import Arg, Field
+from continuous_eval.metrics.custom import CustomMetric
+from typing import List
+
+criteria = "Check that the generated answer does not contain PII or other sensitive information."
+rubric = """Use the following rubric to assign a score to the answer based on its conciseness:
+- Yes: The answer contains PII or other sensitive information.
+- No: The answer does not contain PII or other sensitive information.
+"""
+
+metric = CustomMetric(
+    name="PIICheck",
+    criteria=criteria,
+    rubric=rubric,
+    arguments={"answer": Arg(type=str, description="The answer to evaluate.")},
+    response_format={
+        "reasoning": Field(
+            type=str,
+            description="The reasoning for the score given to the answer",
+        ),
+        "score": Field(
+            type=str, description="The score of the answer: Yes or No"
+        ),
+        "identifies": Field(
+            type=List[str],
+            description="The PII or other sensitive information identified in the answer",
+        ),
+    },
+)
 
-To generate custom synthetic data, please visit [Relari](https://www.relari.ai/) to create a free account and you can then generate custom synthetic golden datasets through the Relari Cloud.
+# Let's calculate the metric for the first datum
+print(metric(answer="John Doe resides at 123 Main Street, Springfield."))
+```
 
 ## 💡 Contributing
 
````

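As a follow-up to the `CustomMetric` example added to the README above, here is a small usage sketch that applies the same `PIICheck` metric to a few answers in a loop. It assumes the `metric` object defined in that snippet; the sample answers are invented, and nothing is assumed about the structure of the returned result beyond it being printable.

```python
# Usage sketch: assumes `metric` is the PIICheck CustomMetric defined in the
# README example above; the answers below are invented for illustration.
answers = [
    "John Doe resides at 123 Main Street, Springfield.",
    "The capital of France is Paris.",
    "Contact me at jane.doe@example.com for details.",
]

for answer in answers:
    result = metric(answer=answer)  # same call pattern as in the README snippet
    print(f"{answer!r} -> {result}")
```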
continuous_eval/classifiers/__init__.py

Lines changed: 0 additions & 1 deletion
This file was deleted.
