
Commit a9823db

Add TokenCount metric (#74)

* Add TokenCount metric
* token count docs
* update readme + bump version

Co-authored-by: yisz

1 parent: 96de984

7 files changed: +99 -21 lines


README.md

Lines changed: 8 additions & 12 deletions
@@ -18,14 +18,14 @@
 </div>

 <h2 align="center">
-<p>Production-Grade Evaluation for LLM-Powered Applications</p>
+<p>Data-Driven Evaluation for LLM-Powered Applications</p>
 </h2>



 ## Overview

-`continuous-eval` is an open-source package created for granular and rigorous evaluation of LLM-powered application.
+`continuous-eval` is an open-source package created for data-driven evaluation of LLM-powered applications.

 <h1 align="center">
 <img
@@ -63,7 +63,7 @@ To run LLM-based metrics, the code requires at least one of the LLM API keys in
 ## Run a single metric

 Here's how you run a single metric on a datum.
-Check all available metrics here: [link](https://docs.relari.ai/)
+Check all available metrics here: [link](https://continuous-eval.docs.relari.ai/)

 ```python
 from continuous_eval.metrics.retrieval import PrecisionRecallF1
@@ -95,7 +95,7 @@ print(metric(**datum))
 <tr>
 <td rowspan="2">Retrieval</td>
 <td>Deterministic</td>
-<td>PrecisionRecallF1, RankedRetrievalMetrics</td>
+<td>PrecisionRecallF1, RankedRetrievalMetrics, TokenCount</td>
 </tr>
 <tr>
 <td>LLM-based</td>
@@ -222,21 +222,17 @@ metrics = evalrunner.evaluate(dataset)
 ## Synthetic Data Generation

 Ground truth data, or reference data, is important for evaluation as it can offer a comprehensive and consistent measurement of system performance. However, it is often costly and time-consuming to manually curate such a golden dataset.
-We have created a synthetic data pipeline that can custom generate user interaction data for a variety of use cases such as RAG, agents, copilots. They can serve a starting point for a golden dataset for evaluation or for other training purposes. Below is an example for Coding Agents.
+We have created a synthetic data pipeline that can generate custom user interaction data for a variety of use cases such as RAG, agents, and copilots. These datasets can serve as a starting point for a golden evaluation dataset or for other training purposes.

-<h1 align="center">
-<img
-src="docs/public/synthetic-data-demo.png"
->
-</h1>
+To generate custom synthetic data, please visit [Relari](https://www.relari.ai/) to create a free account; you can then generate custom synthetic golden datasets through the Relari Cloud.

 ## 💡 Contributing

 Interested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for more details.

 ## Resources

-- **Docs:** [link](https://docs.relari.ai/)
+- **Docs:** [link](https://continuous-eval.docs.relari.ai/)
 - **Examples Repo**: [end-to-end example repo](https://github.com/relari-ai/examples)
 - **Blog Posts:**
   - Practical Guide to RAG Pipeline Evaluation: [Part 1: Retrieval](https://medium.com/relari/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893), [Part 2: Generation](https://medium.com/relari/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d)
@@ -246,7 +242,7 @@ Interested in contributing? See our [Contribution Guide](CONTRIBUTING.md) for mo
   - How to Make the Most Out of LLM Production Data: Simulated User Feedback [(link)](https://medium.com/towards-data-science/how-to-make-the-most-out-of-llm-production-data-simulated-user-feedback-843c444febc7)
   - Generate Synthetic Data to Test LLM Applications [(link)](https://medium.com/relari/generate-synthetic-data-to-test-llm-applications-4bffeb51b80e)
 - **Discord:** Join our community of LLM developers [Discord](https://discord.gg/GJnM8SRsHr)
-- **Reach out to founders:** [Email](mailto:[email protected]) or [Schedule a chat](https://cal.com/pasquale/continuous-eval)
+- **Reach out to founders:** [Email](mailto:[email protected]) or [Schedule a chat](https://cal.com/relari/demo)

 ## License
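The "Run a single metric" hunk above ends right after the import in the README code block. For orientation, here is a minimal sketch of how such a metric is typically invoked; the matching strategy (`RougeChunkMatch`) and the exact datum fields are assumptions based on the classes this package exports, not a verbatim copy of the README example.

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1, RougeChunkMatch

datum = {
    "retrieved_context": [
        "Paris is the capital of France and also the largest city in the country.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

# Assumed usage: the metric is parameterized by a chunk-matching strategy
# and returns a dict of precision, recall, and F1 scores.
metric = PrecisionRecallF1(RougeChunkMatch())
print(metric(**datum))
```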
continuous_eval/metrics/retrieval/__init__.py

Lines changed: 7 additions & 6 deletions
@@ -1,12 +1,13 @@
-from continuous_eval.metrics.retrieval.precision_recall_f1 import PrecisionRecallF1
-from continuous_eval.metrics.retrieval.ranked import RankedRetrievalMetrics
+from continuous_eval.metrics.retrieval.llm_based import (
+    LLMBasedContextCoverage,
+    LLMBasedContextPrecision,
+)
 from continuous_eval.metrics.retrieval.matching_strategy import (
     ExactChunkMatch,
     ExactSentenceMatch,
     RougeChunkMatch,
     RougeSentenceMatch,
 )
-from continuous_eval.metrics.retrieval.llm_based import (
-    LLMBasedContextCoverage,
-    LLMBasedContextPrecision,
-)
+from continuous_eval.metrics.retrieval.precision_recall_f1 import PrecisionRecallF1
+from continuous_eval.metrics.retrieval.ranked import RankedRetrievalMetrics
+from continuous_eval.metrics.retrieval.tokens import TokenCount
continuous_eval/metrics/retrieval/tokens.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+import tiktoken
+
+from continuous_eval.metrics.base import Metric
+
+_CHARACTERS_PER_TOKEN = 4.0
+
+
+class TokenCount(Metric):
+    def __init__(self, encoder_name: str) -> None:
+        super().__init__()
+        if encoder_name == "approx":
+            self._encoder = None
+        else:
+            try:
+                self._encoder = tiktoken.get_encoding(encoder_name)
+            except ValueError:
+                raise ValueError(f"Invalid encoder name: {encoder_name}")
+
+    def __call__(self, retrieved_context, **kwargs):
+        ctx = "\n".join(retrieved_context)
+        if self._encoder is None:
+            num_tokens = int(len(ctx) / _CHARACTERS_PER_TOKEN)
+        else:
+            num_tokens = len(self._encoder.encode(ctx))
+        return {"num_tokens": num_tokens}
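For reference, a minimal usage sketch of the class added above, assuming the package and `tiktoken` are installed and the `cl100k_base` encoding can be loaded; the example sentence is illustrative only.

```python
from continuous_eval.metrics.retrieval import TokenCount

context = ["The quick brown fox jumps over the lazy dog."]  # 44 characters

# Exact count via a tiktoken encoding (any valid encoding name works).
exact = TokenCount(encoder_name="cl100k_base")
print(exact(retrieved_context=context))   # {'num_tokens': ...}, encoder-dependent

# Approximate count: 1 token per 4 characters, no tokenizer needed.
approx = TokenCount(encoder_name="approx")
print(approx(retrieved_context=context))  # {'num_tokens': 11}, i.e. int(44 / 4)

# An unrecognized encoder name raises ValueError:
# TokenCount(encoder_name="not-a-real-encoding")
```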
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+---
+title: Token Count
+---
+
+### Definitions
+
+Token Count calculates the number of tokens used in the retrieved context.
+
+A required input for the metric is the `encoder_name` for tiktoken.
+
+For example, for the most recent OpenAI models you would use `cl100k_base` as the encoder. For other models, look up the specific tokenizer used; alternatively, you can use `approx` to get an approximate token count that assumes 1 token per 4 characters.
+
+:::tip
+**Tokens in `retrieved_context` often account for the majority of LLM token usage in a RAG application.**
+Token count is useful to track if you are concerned about LLM cost, context window limits, or performance issues caused by low context precision (such as "needle-in-a-haystack" problems).
+:::
+
+Required data items: `retrieved_context`
+
+```python
+from continuous_eval.metrics.retrieval import TokenCount
+
+datum = {
+    "retrieved_context": [
+        "Lyon is a major city in France.",
+        "Paris is the capital of France and also the largest city in the country.",
+    ],
+    "ground_truth_context": ["Paris is the capital of France."],
+}
+
+metric = TokenCount(encoder_name="cl100k_base")
+print(metric(**datum))
+```
+
+### Example Output
+
+```JSON
+{
+    'num_tokens': 24,
+}
+```
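As a complement to the example above, the `approx` mode can be run on the same datum; under the 1-token-per-4-characters heuristic the joined context (104 characters) yields 26 tokens. A brief sketch:

```python
from continuous_eval.metrics.retrieval import TokenCount

datum = {
    "retrieved_context": [
        "Lyon is a major city in France.",
        "Paris is the capital of France and also the largest city in the country.",
    ],
}

# Joined context is 104 characters; int(104 / 4) -> 26 tokens under the heuristic.
metric = TokenCount(encoder_name="approx")
print(metric(**datum))  # {'num_tokens': 26}
```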

docs/src/content/docs/metrics/overview.md

Lines changed: 5 additions & 1 deletion
@@ -35,7 +35,7 @@ Below is the list of metrics available:
 <tr>
 <td rowspan="2">Retrieval</td>
 <td>Deterministic</td>
-<td>PrecisionRecallF1, RankedRetrievalMetrics</td>
+<td>PrecisionRecallF1, RankedRetrievalMetrics, TokenCount</td>
 </tr>
 <tr>
 <td>LLM-based</td>
@@ -93,6 +93,10 @@ Below is the list of metrics available:
 - **Definition:** Rank-aware metrics including Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and NDCG (Normalized Discounted Cumulative Gain) of retrieved contexts
 - **Inputs:** `retrieved_context`, `ground_truth_context`

+**`TokenCount`**
+- **Definition:** Counts the number of tokens in the retrieved context.
+- **Inputs:** `retrieved_context`
+
 ##### LLM-based

 **`LLMBasedContextPrecision`**

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "continuous-eval"
-version = "0.3.12"
+version = "0.3.13"
 description = "Open-Source Evaluation for GenAI Application Pipelines."
 authors = ["Yi Zhang <[email protected]>", "Pasquale Antonante <[email protected]>"]
 readme = "README.md"

tests/retrieval_metrics_test.py

Lines changed: 12 additions & 1 deletion
@@ -9,9 +9,10 @@
     RankedRetrievalMetrics,
     RougeChunkMatch,
     RougeSentenceMatch,
+    TokenCount,
 )
 from tests.helpers import example_datum
-from tests.helpers.utils import all_close, in_zero_one, list_of_dicts_to_dict_of_lists
+from tests.helpers.utils import all_close, in_zero_one


 def test_precision_recall_exact_chunk_match():
@@ -75,3 +76,13 @@ def test_llm_based_context_coverage_openai():

     metric = LLMBasedContextCoverage(model=LLMFactory("gpt-3.5-turbo-1106"))
     assert all(in_zero_one(metric(**datum)["LLM_based_context_coverage"]) for datum in data)
+
+
+def test_token_count():
+    data = [example_datum.CAPITAL_OF_FRANCE, example_datum.ROMEO_AND_JULIET]
+    metric = TokenCount("o200k_base")
+    expected = [17, 16]
+    assert (result := [metric(**datum)["num_tokens"] for datum in data]) == expected, result
+    expected = [17, 18]
+    metric = TokenCount("approx")
+    assert (result := [metric(**datum)["num_tokens"] for datum in data]) == expected, result
