## `evals`: LLM evaluations to test and improve model outputs

### Evaluation Metrics

Natural Language Generation Performance:

[Extractiveness](https://huggingface.co/docs/lighteval/en/metric-list#automatic-metrics-for-generative-tasks):

* Extractiveness Coverage:
  - Percentage of words in the summary that are part of an extractive fragment with the article
* Extractiveness Density:
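A minimal sketch of how these extractiveness scores can be computed, based on the greedy shared-fragment matching idea behind the metrics (the function names, tokenization, and scoring here are illustrative assumptions, not the lighteval API):

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find maximal token spans the summary shares with the article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1
    return fragments

def coverage(article, summary):
    # Fraction of summary words that lie inside a fragment shared with the article
    a, s = article.split(), summary.split()
    return sum(extractive_fragments(a, s)) / len(s)

def density(article, summary):
    # Average squared fragment length; higher values mean longer copied spans
    a, s = article.split(), summary.split()
    return sum(f * f for f in extractive_fragments(a, s)) / len(s)

print(coverage("the cat sat on the mat", "the cat sat quietly"))  # 0.75
```

Here "the cat sat" is one shared fragment of three words, so coverage is 3/4 and density is 9/4.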
API Performance:

### Test Data

Generate the dataset file by connecting to a database of research papers:

Connect to the Postgres database of your local Balancer instance:

```
engine = create_engine("postgresql+psycopg2://balancer:balancer@localhost:5433/b
```

Connect to the Postgres database of the production Balancer instance using a SQL file:

```
# Add Postgres.app binaries to the PATH
echo 'export PATH="/Applications/Postgres.app/Contents/Versions/latest/bin:$PATH"' >> ~/.zshrc

createdb <DB_NAME>
pg_restore -v -d <DB_NAME> <PATH_TO_BACKUP>.sql
```

Generate the dataset CSV file:

```
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("postgresql://<USER>@localhost:5432/<DB_NAME>")

query = "SELECT * FROM api_embeddings;"
df = pd.read_sql(query, engine)

df['INPUT'] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)

# Ensure the chunks are joined in order of chunk_number by sorting the DataFrame before grouping and joining
df = df.sort_values(by=['name', 'upload_file_id', 'chunk_number'])
df_grouped = df.groupby(['name', 'upload_file_id'])['INPUT'].apply(lambda chunks: "\n".join(chunks)).reset_index()

df_grouped.to_csv('<DATASET_CSV_PATH>', index=False)
```
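As a quick sanity check of the grouping step above, here is a toy run on fabricated rows (the column names mirror the query; the data itself is made up). Chunks arrive out of order but are sorted before being joined into one `INPUT` value per document:

```python
import pandas as pd

# Fabricated rows standing in for api_embeddings records (illustrative only)
df = pd.DataFrame([
    {"name": "paper_a", "upload_file_id": 1, "chunk_number": 2, "text": "second chunk"},
    {"name": "paper_a", "upload_file_id": 1, "chunk_number": 1, "text": "first chunk"},
])

df["INPUT"] = df.apply(lambda row: f"ID: {row['chunk_number']} | CONTENT: {row['text']}", axis=1)
df = df.sort_values(by=["name", "upload_file_id", "chunk_number"])
grouped = df.groupby(["name", "upload_file_id"])["INPUT"].apply("\n".join).reset_index()

print(grouped.loc[0, "INPUT"])
# ID: 1 | CONTENT: first chunk
# ID: 2 | CONTENT: second chunk
```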

### Running an Evaluation

#### Bulk Model and Prompt Experimentation

Compare the results of many different prompts and models at once:

```
import pandas as pd

data = [
    {
        "MODEL": "<MODEL_NAME_1>",
        "INSTRUCTIONS": """<YOUR_QUERY_1>"""
    },
    {
        "MODEL": "<MODEL_NAME_2>",
        "INSTRUCTIONS": """<YOUR_QUERY_2>"""
    },
]

df = pd.DataFrame.from_records(data)

df.to_csv("<EXPERIMENTS_CSV_PATH>", index=False)
```
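The experiments CSV and the dataset CSV together define an evaluation grid: each experiment row runs against each dataset row. The evaluation script's actual loop isn't shown here; the following is a hypothetical sketch of that pairing, with made-up model names and inputs:

```python
import pandas as pd

# Assumed shapes: experiments have MODEL/INSTRUCTIONS, dataset rows have INPUT
experiments = pd.DataFrame([
    {"MODEL": "model-a", "INSTRUCTIONS": "Summarize:"},
    {"MODEL": "model-b", "INSTRUCTIONS": "Extract key points:"},
])
dataset = pd.DataFrame([{"INPUT": "ID: 1 | CONTENT: example chunk"}])

# One run per (experiment, dataset row) pair
runs = [
    (exp["MODEL"], exp["INSTRUCTIONS"], row["INPUT"])
    for _, exp in experiments.iterrows()
    for _, row in dataset.iterrows()
]
print(len(runs))  # 2 experiments x 1 dataset row = 2 runs
```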

#### Execute on the Command Line

Execute [using `uv` to manage dependencies](https://docs.astral.sh/uv/guides/scripts/) without manually managing environments:

```sh
uv run evals.py --experiments path/to/<EXPERIMENTS_CSV> --dataset path/to/<DATASET_CSV> --results path/to/<RESULTS_CSV>
```
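For `uv run` to resolve dependencies automatically, the script can declare them with inline script metadata (PEP 723) at the top of `evals.py`. The dependency list below is an assumption for illustration, not the script's actual requirements:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "pandas",
#     "sqlalchemy",
# ]
# ///
```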

### Contributing

You're welcome to add LLM models to test in `server/api/services/llm_services`.