# Evaluations

## `evals`: LLM evaluations to test and improve model outputs

LLM evals test a prompt with a set of test data by scoring each item in the data set.

To test Balancer's structured text extraction of medication rules, `evals` computes:

[Extractiveness](https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks):

* Extractiveness Coverage: percentage of words in the summary that are part of an extractive fragment shared with the article
* Extractiveness Density: average length of the extractive fragment to which each word in the summary belongs
* Extractiveness Compression: word ratio between the article and the summary
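These metrics follow the extractive-fragment methodology of Grusky et al. (https://aclanthology.org/N18-1065.pdf). As an illustrative sketch only, using a simplified greedy longest-match over whitespace tokens rather than the exact implementation `evals` uses:

```python
def extractive_fragments(article, summary):
    """Greedily collect maximal token fragments of `summary` found in `article`."""
    a, s = article.split(), summary.split()
    fragments, i = [], 0
    while i < len(s):
        best = 0
        for j in range(len(a)):
            k = 0
            while i + k < len(s) and j + k < len(a) and s[i + k] == a[j + k]:
                k += 1
            best = max(best, k)
        if best:
            fragments.append(s[i:i + best])
            i += best
        else:
            i += 1  # summary word not found in the article
    return fragments

def coverage(article, summary):
    # Fraction of summary words that belong to some extractive fragment.
    s_len = len(summary.split())
    return sum(len(f) for f in extractive_fragments(article, summary)) / s_len

def density(article, summary):
    # Average fragment length weighting each summary word by its fragment's length.
    s_len = len(summary.split())
    return sum(len(f) ** 2 for f in extractive_fragments(article, summary)) / s_len

def compression(article, summary):
    # Word ratio between the article and the summary.
    return len(article.split()) / len(summary.split())
```

A fully extractive summary scores coverage 1.0; higher density means the summary copies longer contiguous fragments.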

API usage:

* Token usage (input/output)
* Estimated cost in USD
* Duration (in seconds)
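Cost follows directly from the token counts and per-token prices. A minimal sketch of that arithmetic, using made-up per-million-token prices keyed by the placeholder model name (real prices vary by provider and change over time):

```python
# Hypothetical prices in USD per 1M tokens; check your provider's current price sheet.
PRICES = {
    "<MODEL_NAME_1>": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Estimate the USD cost of one completion from its token counts."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```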

### Test Data

Generate the reference file by connecting to a database of references.

Restore a local Postgres backup and connect to it:

```sh
echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH
createdb <DB_NAME>
pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
```

```
from sqlalchemy import create_engine

engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")
```
Query the references, concatenate each reference's chunks, and write the result to the reference CSV:

```
df_grouped = df_grouped.rename(columns={'formatted_chunk': 'concatenated_chunks'})
df_grouped.to_csv('<REFERENCE_CSV_PATH>', index=False)
```
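The query that builds `df_grouped` is elided above; the groupby-and-concatenate step it feeds can be sketched self-contained, with hypothetical table and column names and an in-memory SQLite database standing in for Postgres so the example runs anywhere:

```python
import sqlite3

import pandas as pd

# Hypothetical table and column names; the real schema lives in the restored database.
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "reference_id": [1, 1, 2],
    "formatted_chunk": ["first chunk", "second chunk", "other chunk"],
}).to_sql("chunks", con, index=False)

# Load the chunks, then concatenate each reference's chunks into one context string.
df = pd.read_sql("SELECT reference_id, formatted_chunk FROM chunks", con)
df_grouped = (
    df.groupby("reference_id")["formatted_chunk"]
    .apply(" ".join)
    .reset_index()
    .rename(columns={"formatted_chunk": "concatenated_chunks"})
)
```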

### Running an Evaluation

#### Test Input: Bulk model and prompt experimentation

Compare the results of many different prompts and models at once.

```
import pandas as pd

# Define the data
data = [
    {
        "Model Name": "<MODEL_NAME_1>",
        "Query": """<YOUR_QUERY_1>"""
    },
    {
        "Model Name": "<MODEL_NAME_2>",
        "Query": """<YOUR_QUERY_2>"""
    },
]

# Create DataFrame from records
df = pd.DataFrame.from_records(data)

# Write to CSV
df.to_csv("<CONFIG_CSV_PATH>", index=False)
```
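The config CSV is expected to provide the "Model Name" and "Query" columns, and the reference CSV the "Context" and "Reference" columns. A small pre-flight check (the helper name is hypothetical, not part of `evals`):

```python
import pandas as pd

def check_columns(path, required):
    """Raise if the CSV at `path` lacks any of the `required` columns."""
    cols = set(pd.read_csv(path).columns)
    missing = set(required) - cols
    if missing:
        raise ValueError(f"{path} is missing columns: {sorted(missing)}")

# e.g. check_columns("<CONFIG_CSV_PATH>", {"Model Name", "Query"})
#      check_columns("<REFERENCE_CSV_PATH>", {"Context", "Reference"})
```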

#### Execute on the command line

Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:

```sh
uv run evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
```

Execute without `uv run` by making the script executable (`chmod +x evals.py`) and invoking it directly:

```sh
./evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
```

### Analyzing Test Results

```
import pandas as pd
import matplotlib.pyplot as plt

# … (result loading and metric computation elided) …

for i, metric in enumerate(all_metrics):
    ...  # per-metric plotting elided

plt.tight_layout()
plt.show()
```
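The analysis code above is largely elided; the kind of per-model aggregation such plots build on can be sketched with hypothetical result columns (pandas only, no plotting):

```python
import pandas as pd

# Hypothetical rows resembling the evals output CSV.
results = pd.DataFrame({
    "Model Name": ["model-a", "model-a", "model-b", "model-b"],
    "Extractiveness Coverage": [0.90, 0.80, 0.60, 0.70],
    "Cost (USD)": [0.002, 0.003, 0.001, 0.001],
})

# Mean of each metric per model, for side-by-side comparison.
summary = results.groupby("Model Name").mean(numeric_only=True)
```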