
Commit 4f8cbad

committed
Update README to clarify the purpose and usage of the script,
1 parent c03d990 commit 4f8cbad

File tree

1 file changed: +58, -65 lines


evaluation/README.md

Lines changed: 58 additions & 65 deletions
@@ -1,63 +1,27 @@
 # Evaluations

-## LLM Output Evaluator
+## `evals`: LLM evaluations to test and improve model outputs

-The `evals` script evaluates the outputs of Large Language Models (LLMs) and estimates the associated token usage and cost.
+LLM evals test a prompt against a set of test data by scoring each item in the data set.

-This script helps teams compare LLM outputs using extractiveness metrics, token usage, and cost. It is especially useful for evaluating multiple models over a batch of queries and reference answers.
+To test Balancer's structured text extraction of medication rules, `evals` computes:

-It supports batch evaluation via a configuration CSV and produces a detailed metrics report in CSV format.
+[Extractiveness](https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks):

-### Usage
+* Extractiveness Coverage:
+  - Percentage of words in the summary that are part of an extractive fragment with the article
+* Extractiveness Density:
+  - Average length of the extractive fragment to which each word in the summary belongs
+* Extractiveness Compression:
+  - Word ratio between the article and the summary

-Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:
-
-```sh
-uv run evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
-```
-
-Execute without using `uv run` by ensuring it is executable:
-
-```sh
-./evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
-```
-
-The arguments to the script are:
-
-- Path to the config CSV file: must include the columns "Model Name" and "Query"
-- Path to the reference CSV file: must include the columns "Context" and "Reference"
-- Path where the evaluation results will be saved
-
-### Configuration File
-
-Generate the config CSV file:
-
-```
-import pandas as pd
-
-# Define the data
-data = [
-
-    {
-        "Model Name": "<MODEL_NAME_1>",
-        "Query": """<YOUR_QUERY_1>"""
-    },
-
-    {
-        "Model Name": "<MODEL_NAME_2>",
-        "Query": """<YOUR_QUERY_2>"""
-    },
-]
-
-# Create DataFrame from records
-df = pd.DataFrame.from_records(data)
-
-# Write to CSV
-df.to_csv("<CONFIG_CSV_PATH>", index=False)
-```
+API usage:

+* Token usage (input/output)
+* Estimated cost in USD
+* Duration (in seconds)

-### Reference File
+### Test Data

 Generate the reference file by connecting to a database of references.

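For reference, the three extractiveness scores listed above follow the fragment-based definitions from Grusky et al. (the https://aclanthology.org/N18-1065.pdf methodology the earlier README cited). Below is a minimal sketch of how they can be computed, assuming whitespace tokenization and a non-empty summary; the function names are illustrative and not the actual `evals.py` implementation:

```
# Sketch only: greedy shared-fragment matching (Grusky et al., 2018).

def longest_match_at(article_tokens, summary_tokens, i):
    """Length of the longest article fragment that starts summary_tokens[i:]."""
    best = 0
    for j in range(len(article_tokens)):
        k = 0
        while (i + k < len(summary_tokens)
               and j + k < len(article_tokens)
               and summary_tokens[i + k] == article_tokens[j + k]):
            k += 1
        best = max(best, k)
    return best


def extractiveness(article: str, summary: str) -> dict:
    """Coverage, density, and compression for one (article, summary) pair."""
    a, s = article.split(), summary.split()
    fragments = []
    i = 0
    while i < len(s):
        k = longest_match_at(a, s, i)  # greedy: longest fragment starting here
        if k:
            fragments.append(k)
        i += max(k, 1)                 # step past the fragment, or one unmatched word
    return {
        # share of summary words that fall inside a fragment copied from the article
        "coverage": sum(fragments) / len(s),
        # average fragment length, weighted by the words belonging to it
        "density": sum(f * f for f in fragments) / len(s),
        # word ratio between the article and the summary
        "compression": len(a) / len(s),
    }
```

Density equal to coverage means every copied fragment is a single word; a much larger density indicates long verbatim spans.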

@@ -77,7 +41,10 @@ echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH

 createdb <DB_NAME>
 pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
+```

+```
+from sqlalchemy import create_engine
 engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")
 ```

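With the backup restored and the engine above connected, the reference rows can be pulled straight into a DataFrame. A sketch with a placeholder query, since the actual table and column names in the restored database are not shown here:

```
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")

# Placeholder SQL: substitute the table that holds the reference text.
df = pd.read_sql("SELECT * FROM <REFERENCE_TABLE> LIMIT 10", engine)
print(df.columns.tolist())
```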

@@ -99,26 +66,54 @@ df_grouped = df_grouped.rename(columns={'formatted_chunk': 'concatenated_chunks'
 df_grouped.to_csv('<REFERENCE_CSV_PATH>', index=False)
 ```

-### Output File

-The script outputs a CSV with the following columns:
+### Running an Evaluation

-Extractiveness Metrics based on the methodology from: https://aclanthology.org/N18-1065.pdf
+#### Test Input: Bulk model and prompt experimentation

-* Evaluates LLM outputs for:
+Compare the results of many different prompts and models at once.

-* Extractiveness Coverage: Percentage of words in the summary that are part of an extractive fragment with the article
-* Extractiveness Density: Average length of the extractive fragment to which each word in the summary belongs
-* Extractiveness Compression: Word ratio between the article and the summary
+```
+import pandas as pd

-* Computes:
+# Define the data
+data = [

-* Token usage (input/output)
-* Estimated cost in USD
-* Duration (in seconds)
+    {
+        "Model Name": "<MODEL_NAME_1>",
+        "Query": """<YOUR_QUERY_1>"""
+    },

+    {
+        "Model Name": "<MODEL_NAME_2>",
+        "Query": """<YOUR_QUERY_2>"""
+    },
+]

-Exploratory data analysis:
+# Create DataFrame from records
+df = pd.DataFrame.from_records(data)
+
+# Write to CSV
+df.to_csv("<CONFIG_CSV_PATH>", index=False)
+```
+
+
+#### Execute on the command line
+
+
+Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:
+
+```sh
+uv run evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
+```
+
+Execute without using `uv run` by ensuring it is executable:
+
+```sh
+./evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
+```
+
+### Analyzing Test Results

 ```
 import pandas as pd
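Before running either command above, it can help to check that the input files match what the script expects; the earlier version of the README listed the required columns (config: "Model Name" and "Query"; reference: "Context" and "Reference"). A small sanity-check sketch, with placeholder paths:

```
import pandas as pd

# Required columns as described in the earlier README text.
required = {
    "<CONFIG_CSV_PATH>": {"Model Name", "Query"},
    "<REFERENCE_CSV_PATH>": {"Context", "Reference"},
}

for path, columns in required.items():
    missing = columns - set(pd.read_csv(path).columns)
    if missing:
        raise ValueError(f"{path} is missing columns: {missing}")
```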
@@ -158,6 +153,4 @@ for i, metric in enumerate(all_metrics):
 plt.tight_layout()
 plt.show()

-#TODO: Calculate efficiency metrics: Total Token Usage, Cost per Token, Tokens per Second, Cost per Second
-
 ```
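The removed TODO above describes efficiency metrics (total token usage, cost per token, tokens per second, cost per second). A sketch of deriving them from the evaluation output, assuming the output CSV exposes token, cost, and duration columns under these illustrative names:

```
import pandas as pd

df = pd.read_csv("<OUTPUT_CSV_PATH>")

# Illustrative column names; align them with the actual evals.py output.
df["total_tokens"] = df["input_tokens"] + df["output_tokens"]
df["cost_per_token"] = df["cost_usd"] / df["total_tokens"]
df["tokens_per_second"] = df["total_tokens"] / df["duration_seconds"]
df["cost_per_second"] = df["cost_usd"] / df["duration_seconds"]

print(df[["total_tokens", "cost_per_token", "tokens_per_second", "cost_per_second"]].describe())
```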
