
Commit c03d990

Update README with detailed usage instructions and enhance evals.py to include environment setup and dependencies
1 parent: c483e69

File tree: 3 files changed, +80 -135 lines

evaluation/README.md

Lines changed: 61 additions & 132 deletions
@@ -1,31 +1,36 @@
-
 # Evaluations
 
-#TODO: Open AI evals documentaiton: https://platform.openai.com/docs/guides/evals
-
 ## LLM Output Evaluator
 
 The `evals` script evaluates the outputs of Large Language Models (LLMs) and estimates the associated token usage and cost.
 
-It supports batch evalaution via a configuration CSV and produces a detailed metrics report in CSV format.
+This script helps teams compare LLM outputs using extractiveness metrics, token usage, and cost. It is especially useful for evaluating multiple models over a batch of queries and reference answers.
 
-### Usage
+It supports batch evaluation via a configuration CSV and produces a detailed metrics report in CSV format.
 
-This script evaluates LLM outputs using the `lighteval` library:
-https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks
+### Usage
 
-##TODO: Use uv to execute scripts without manually manging enviornments https://docs.astral.sh/uv/guides/scripts/
+Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:
 
-Ensure you have the `lighteval` library and any model SDKs (e.g., OpenAI) configured properly.
+```sh
+uv run evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
+```
 
+Alternatively, make the script executable and run it directly:
 
-```bash
-python evals.py --config path/to/config.csv --reference path/to/reference.csv --output path/to/results.csv
+```sh
+./evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
 ```
 
 The arguments to the script are:
 
 - Path to the config CSV file: Must include the columns "Model Name" and "Query"
+- Path to the reference CSV file: Must include the columns "Context" and "Reference"
+- Path where the evaluation results will be saved
+
+### Configuration File
+
+Generate the config CSV file:
 
 ```
 import pandas as pd
@@ -34,153 +39,93 @@ import pandas as pd
 data = [
 
 {
-"Model Name": "GPT_4O_MINI",
-"Query": """
-You're analyzing medical text from multiple sources. Each chunk is labeled [chunk-X].
-
-Act as a seasoned physician or medical professional who treats patients with bipolar disorder.
-
-Identify rules for medication inclusion or exclusion based on medical history or concerns.
-
-For each rule you find, return a JSON object using the following format:
-
-{
-"rule": "<condition or concern>",
-"type": "INCLUDE" or "EXCLUDE",
-"reason": "<short explanation for why this rule applies>",
-"medications": ["<medication 1>", "<medication 2>", ...],
-"source": "<chunk-X>"
-}
-
-Only include rules that are explicitly stated or strongly implied in the chunk.
-
-Only use the chunks provided. If no rule is found in a chunk, skip it.
-
-Return the entire output as a JSON array.
-"""
+"Model Name": "<MODEL_NAME_1>",
+"Query": """<YOUR_QUERY_1>"""
 },
 
 {
-"Model Name": "GPT_41_NANO",
-"Query": """
-
-# Role and Objective
-
-- You are a seasoned physician or medical professional who is developing a bipolar disorder treatment algorithim
-
-- You are extracting bipolar medication decision points from a research paper that is chunked into multiple parts each labeled with an ID
-
-# Instructions
-
-- Identify decision points for bipolar medications
-
-- For each decision point you find, return a JSON object using the following format:
-
-{
-"criterion": "<condition or concern>",
-"decision": "INCLUDE" or "EXCLUDE",
-"medications": ["<medication 1>", "<medication 2>", ...],
-"reason": "<short explanation for why this criterion applies>",
-"sources": ["<ID-X>"]
-}
-
-
-- Only extract bipolar medication decision points that are explicitly stated or strongly implied in the context and never rely on your own knowledge
-
-# Output Format
-
-- Return the extracted bipolar medication decision points as a JSON array and if no decision points are found in the context return an empty array
-
-# Example
-
-[
-{
-"criterion": "History of suicide attempts",
-"decision": "INCLUDE",
-"medications": ["Lithium"],
-"reason": "Lithium is the only medication on the market that has been proven to reduce suicidality in patients with bipolar disorder",
-"sources": ["ID-0"]
-},
-{
-"criterion": "Weight gain concerns",
-"decision": "EXCLUDE",
-"medications": ["Quetiapine", "Aripiprazole", "Olanzapine", "Risperidone"],
-"reason": "Seroquel, Risperdal, Abilify, and Zyprexa are known for causing weight gain",
-"sources": ["ID-0", "ID-1", "ID-2"]
-}
-]
-
-"""
-
+"Model Name": "<MODEL_NAME_2>",
+"Query": """<YOUR_QUERY_2>"""
 },
 
 ]
 
 # Create DataFrame from records
 df = pd.DataFrame.from_records(data)
 
 # Write to CSV
-df.to_csv("~/Desktop/evals_config.csv", index=False)
+df.to_csv("<CONFIG_CSV_PATH>", index=False)
 ```
 
 
-- Path to the reference CSV file: Must include the columns "Context" and "Reference"
+### Reference File
+
+Generate the reference file by connecting to a database of references.
+
+Connect to the Postgres database of your local Balancer instance:
 
 ```
 from sqlalchemy import create_engine
-import pandas as pd
 
 engine = create_engine("postgresql+psycopg2://balancer:balancer@localhost:5433/balancer_dev")
-# Filter out papers that shouldn't be used from local database
-query = "SELECT * FROM api_embeddings WHERE date_of_upload > '2025-03-14';"
-df = pd.read_sql(query, engine)
-
-df['formatted_chunk'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)
-# Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
-df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
-df_grouped = df.groupby(['name', 'upload_file_id'])['formatted_chunk'].apply(lambda chunks: "\n".join(chunks)).reset_index()
-df_grouped = df_grouped.rename(columns={'formatted_chunk': 'concatenated_chunks'})
-df_grouped.to_csv('~/Desktop/formatted_chunks.csv', index=False)
 ```
 
+Connect to the Postgres database of the production Balancer instance by restoring a SQL backup file:
+
 ```
+# Install Postgres.app and add its binaries to the PATH
 echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"' >> ~/.zshrc
-source ~/.zshrc
 
-createdb backupDBBalancer07012025
-pg_restore -v -d backupDBBalancer07012025 ~/Downloads/backupDBBalancer07012025.sql
+createdb <DB_NAME>
+pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
+```
 
-pip install psycopg2-binary
+Then create an engine pointing at the restored database:
 
-from sqlalchemy import create_engine
-import pandas as pd
+```
+from sqlalchemy import create_engine
+
+engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")
+```
 
-# Alternative: Standard psycopg2 connection (if you get psycopg2 working)
-# engine = create_engine("postgresql://sahildshah@localhost:5432/backupDBBalancer07012025")
-# Fixed the variable name (was "database query", now "query")
+Generate the reference CSV file:
 
+```
+import pandas as pd
+
 query = "SELECT * FROM api_embeddings;"
-
-# Execute the query and load into DataFrame
 df = pd.read_sql(query, engine)
 
 df['formatted_chunk'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)
+
 # Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
 df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
 df_grouped = df.groupby(['name', 'upload_file_id'])['formatted_chunk'].apply(lambda chunks: "\n".join(chunks)).reset_index()
+
 df_grouped = df_grouped.rename(columns={'formatted_chunk': 'concatenated_chunks'})
-df_grouped.to_csv('~/Desktop/formatted_chunks.csv', index=False)
+df_grouped.to_csv('<REFERENCE_CSV_PATH>', index=False)
 ```
 
+### Output File
+
+The script outputs a CSV with the following columns:
+
+Extractiveness metrics, based on the methodology from https://aclanthology.org/N18-1065.pdf
+
+* Evaluates LLM outputs for:
+
+  * Extractiveness Coverage: Percentage of words in the summary that are part of an extractive fragment with the article
+  * Extractiveness Density: Average length of the extractive fragment to which each word in the summary belongs
+  * Extractiveness Compression: Word ratio between the article and the summary
 
+* Computes:
+
+  * Token usage (input/output)
+  * Estimated cost in USD
+  * Duration (in seconds)
 
-- Path where the evaluation resuls will be saved
 
+Exploratory data analysis:
+
+```
 import pandas as pd
 import matplotlib.pyplot as plt
 import numpy as np
 
-
-df = pd.read_csv("~/Desktop/evals_out-20250702.csv")
+df = pd.read_csv("<OUTPUT_CSV_PATH>")
 
 # Define the metrics of interest
 extractiveness_cols = ['Extractiveness Coverage', 'Extractiveness Density', 'Extractiveness Compression']
@@ -213,22 +158,6 @@ for i, metric in enumerate(all_metrics):
 plt.tight_layout()
 plt.show()
 
-#TODO: Compute count, min, quantiles and max by model
 #TODO: Calculate efficiency metrics: Total Token Usage, Cost per Token, Tokens per Second, Cost per Second
 
-
-The script outputs a CSV with the following columns:
-
-#TODO: Summarize https://aclanthology.org/N18-1065.pdf
-
-* Evaluates LLM outputs for:
-
-* Extractiveness Coverage: Percentage of words in the summary that are part of an extractive fragment with the article
-* Extractiveness Density: Average length of the extractive fragement to which each word in the summary belongs
-* Extractiveness Compression: Word ratio between the article and the summary
-
-* Computes:
-
-* Token usage (input/output)
-* Estimated cost in USD
-* Duration (in seconds)
+```
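The extractiveness metrics named in the README follow the fragment-based definitions of Grusky et al. (2018), https://aclanthology.org/N18-1065.pdf. For intuition only, here is a minimal sketch of that computation on whitespace tokens; the script itself relies on `lighteval`'s implementation, and the function names below are illustrative, not from the codebase:

```
def extractive_fragments(article_tokens, summary_tokens):
    # Greedily take the longest shared fragment starting at each summary position
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1
    return fragments

def extractiveness(article, summary):
    a, s = article.split(), summary.split()
    frags = extractive_fragments(a, s)
    coverage = sum(frags) / len(s)                 # share of summary words inside shared fragments
    density = sum(f * f for f in frags) / len(s)   # mean length of the fragment each summary word belongs to
    compression = len(a) / len(s)                  # article-to-summary word ratio
    return coverage, density, compression

print(extractiveness("the cat sat on the mat today", "the cat sat quietly"))
```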

evaluation/evals.py

File mode changed: 100644 → 100755
Lines changed: 16 additions & 2 deletions
@@ -1,10 +1,24 @@
+#!/usr/bin/env -S uv run --script
+# /// script
+# requires-python = "==3.11.11"
+# dependencies = [
+#     "pandas==2.2.3",
+#     "lighteval==0.10.0",
+#     "openai==1.83.0"
+# ]
+# ///
+
 """
 Evaluate LLM outputs using multiple metrics and compute associated costs
 """
 
-#TODO: Run this script with uv to manage dependencies
+# This script evaluates LLM outputs using the `lighteval` library:
+# https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks
+
+# This script pins Python 3.11, for which prebuilt wheels for `sentencepiece` exist
+
 
 # TODO: Add tests on a small dummy dataset to confirm it handles errors gracefully and produces expected outputs
 
 import sys
 import os
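The block added above is inline script metadata (PEP 723): the `uv run --script` shebang lets uv resolve the pinned dependencies on first run, and the 100644 → 100755 mode change is what makes the script directly invocable. A sketch of the equivalent local setup, assuming a checkout at the repository root:

```sh
chmod +x evaluation/evals.py   # usually already set by the 100755 mode recorded in git
./evaluation/evals.py --config path/to/<CONFIG_CSV> --reference path/to/<REFERENCE_CSV> --output path/to/<OUTPUT_CSV>
```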

server/api/services/llm_services.py

Lines changed: 3 additions & 1 deletion
@@ -17,11 +17,13 @@ def handle_request(
         pass
 
 # LLM Pricing Calculator: https://www.llm-prices.com/
+# TODO: Add support for more models and their pricing
 
 # Anthropic Model Pricing: https://docs.anthropic.com/en/docs/about-claude/pricing#model-pricing
 
 class GPT4OMiniHandler(BaseModelHandler):
     MODEL = "gpt-4o-mini"
+    # TODO: Get the latest model pricing from OpenAI's API or documentation
     # Model Pricing: https://platform.openai.com/docs/pricing
     PRICING_DOLLARS_PER_MILLION_TOKENS = {"input": 0.15, "output": 0.60}

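The `PRICING_DOLLARS_PER_MILLION_TOKENS` table drives the cost estimate that `evals.py` reports. A minimal sketch of the arithmetic, assuming the handler simply scales token counts by the per-million-token rates (the helper name is illustrative, not from the codebase):

```
PRICING_DOLLARS_PER_MILLION_TOKENS = {"input": 0.15, "output": 0.60}

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Rates are quoted per million tokens, so divide the weighted sum by 1e6
    return (input_tokens * PRICING_DOLLARS_PER_MILLION_TOKENS["input"]
            + output_tokens * PRICING_DOLLARS_PER_MILLION_TOKENS["output"]) / 1_000_000

# Example: 12,000 input tokens and 2,000 output tokens on gpt-4o-mini
# (12000 * 0.15 + 2000 * 0.60) / 1e6 = 0.003 USD
print(f"${estimate_cost_usd(12_000, 2_000):.4f}")
```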