# Evaluations

## LLM Output Evaluator

The `evals` script evaluates the outputs of Large Language Models (LLMs) and estimates the associated token usage and cost.

This script helps teams compare LLM outputs using extractiveness metrics, token usage, and cost. It is especially useful for evaluating multiple models over a batch of queries and reference answers.

It supports batch evaluation via a configuration CSV and produces a detailed metrics report in CSV format.

### Usage

Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:

```sh
uv run evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
```

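`uv run` can resolve the script's dependencies automatically when `evals.py` declares them as inline script metadata at the top of the file. The Python version and packages below are assumptions for illustration; `uv add --script evals.py <package>` adds entries to this block:

```
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas",
#     "openai",
# ]
# ///
```
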
Execute without `uv run` by ensuring the script is executable:

```sh
./evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
```

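Making the script executable is a one-time step. For `./evals.py` to also resolve its own environment, the first line of `evals.py` would need a `uv` shebang; the shebang shown here is an assumption about how the script is set up:

```sh
# Assumed first line of evals.py: #!/usr/bin/env -S uv run --script
chmod +x evals.py
```
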
The arguments to the script are:

- Path to the config CSV file: Must include the columns "Model Name" and "Query"
- Path to the reference CSV file: Must include the columns "Context" and "Reference"
- Path where the evaluation results will be saved

### Configuration File

Generate the config CSV file:

```
import pandas as pd

data = [

    {
        "Model Name": "<MODEL_NAME_1>",
        "Query": """<YOUR_QUERY_1>"""
    },

    {
        "Model Name": "<MODEL_NAME_2>",
        "Query": """<YOUR_QUERY_2>"""
    },
]

# Create DataFrame from records
df = pd.DataFrame.from_records(data)

# Write to CSV
df.to_csv("<CONFIG_CSV_PATH>", index=False)
```

### Reference File

Generate the reference file by connecting to a database of references.

Connect to the Postgres database of your local Balancer instance:

```
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://balancer:balancer@localhost:5433/balancer_dev")
```

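Optionally, a quick sanity check that the connection works and the embeddings table is reachable (an illustrative query, not part of the script):

```
import pandas as pd

print(pd.read_sql("SELECT COUNT(*) FROM api_embeddings;", engine))
```
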
Connect to a copy of the production Balancer database by restoring a SQL backup file locally:

```sh
# Install Postgres.app and add its binaries to the PATH
echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"' >> ~/.zshrc

createdb <DB_NAME>
pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
```

Then point the engine at the restored database:

```
engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")
```

Generate the reference CSV file:

```
import pandas as pd

query = "SELECT * FROM api_embeddings;"
df = pd.read_sql(query, engine)

df['formatted_chunk'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)

# Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
df_grouped = df.groupby(['name', 'upload_file_id'])['formatted_chunk'].apply(lambda chunks: "\n".join(chunks)).reset_index()

df_grouped = df_grouped.rename(columns={'formatted_chunk': 'concatenated_chunks'})
df_grouped.to_csv('<REFERENCE_CSV_PATH>', index=False)
```

### Output File

The script outputs a CSV with the following columns:

Extractiveness metrics, based on the methodology from https://aclanthology.org/N18-1065.pdf (a minimal sketch of these definitions follows the list):

* Evaluates LLM outputs for:

  * Extractiveness Coverage: Percentage of words in the summary that are part of an extractive fragment with the article
  * Extractiveness Density: Average length of the extractive fragment to which each word in the summary belongs
  * Extractiveness Compression: Word ratio between the article and the summary

* Computes:

  * Token usage (input/output)
  * Estimated cost in USD
  * Duration (in seconds)

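A minimal sketch of how the three extractiveness metrics are defined in the cited paper (Grusky et al., 2018), using simple whitespace tokenization. This illustrates the definitions under those assumptions; it is not necessarily how `evals.py` computes them:

```
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match shared fragments between article and summary (Grusky et al., 2018)."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = []  # longest shared fragment starting at summary position i
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                i_end, j_end = i, j
                while (i_end < len(summary_tokens) and j_end < len(article_tokens)
                       and summary_tokens[i_end] == article_tokens[j_end]):
                    i_end += 1
                    j_end += 1
                if i_end - i > len(best):
                    best = summary_tokens[i:i_end]
                j = j_end
            else:
                j += 1
        if best:
            fragments.append(best)
        i += max(len(best), 1)
    return fragments


def extractiveness(article, summary):
    """Coverage, density, and compression for a (Context, model output) pair."""
    a, s = article.split(), summary.split()
    frags = extractive_fragments(a, s)
    coverage = sum(len(f) for f in frags) / len(s)
    density = sum(len(f) ** 2 for f in frags) / len(s)
    compression = len(a) / len(s)
    return coverage, density, compression
```

Coverage near 1.0 means the output mostly copies wording from the context, higher density means longer copied spans, and compression is simply the context-to-output length ratio.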

Exploratory data analysis:

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("<OUTPUT_CSV_PATH>")

# Define the metrics of interest
extractiveness_cols = ['Extractiveness Coverage', 'Extractiveness Density', 'Extractiveness Compression']

# ... (omitted: definition of all_metrics and a per-metric plotting loop "for i, metric in enumerate(all_metrics):") ...

plt.tight_layout()
plt.show()

#TODO: Calculate efficiency metrics: Total Token Usage, Cost per Token, Tokens per Second, Cost per Second

```
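
A possible starting point for the efficiency-metrics TODO above, assuming the output CSV carries a "Model Name" column plus token, cost, and duration columns with the names shown below (all column names are assumptions; adjust them to the actual report):

```
import pandas as pd

df = pd.read_csv("<OUTPUT_CSV_PATH>")

# Column names below are assumptions; rename to match the actual report
df["Total Tokens"] = df["Input Tokens"] + df["Output Tokens"]
df["Cost per Token"] = df["Cost (USD)"] / df["Total Tokens"]
df["Tokens per Second"] = df["Total Tokens"] / df["Duration (s)"]
df["Cost per Second"] = df["Cost (USD)"] / df["Duration (s)"]

print(df.groupby("Model Name")[["Total Tokens", "Cost per Token", "Tokens per Second", "Cost per Second"]].mean())
```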