This repository contains my personal implementation of the coding part from Chapter 7 of the LLM Engineer's Handbook, focusing on evaluating LLMs (Large Language Models).
This project was developed as part of my participation in the AI from Scratch study group. Rather than following the book's code exactly, I've rewritten it from scratch to deepen my understanding of the concepts.
We want to evaluate these three models:
- Reference model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Finetuned model: mlabonne/TwinLlama-3.1-8B-GGUF
- DPO model: mlabonne/TwinLlama-3.1-8B-DPO-GGUF
Using the instructions from this dataset:
- Instructions dataset: mlabonne/llmtwin
To achieve this, we'll use OpenAI's gpt-4o-mini as a judge that follows the evaluate-answer prompt.
To set up this project on your local machine:
```bash
git clone https://github.com/elcapo/llm-engineers-handbook-evaluating-llms.git
cd llm-engineers-handbook-evaluating-llms
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp env.example .env
```

Then edit the .env file to add your Hugging Face and OpenAI API tokens.
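As a quick reference, here's a minimal sketch of how the tokens could be loaded from .env with python-dotenv. The variable names are assumptions; env.example in the repository is authoritative.

```python
# Minimal sketch, assuming .env contains lines such as:
#   HF_TOKEN=hf_xxx
#   OPENAI_API_KEY=sk-xxx
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

hf_token = os.environ["HF_TOKEN"]              # assumed name; check env.example
openai_api_key = os.environ["OPENAI_API_KEY"]  # standard name read by the openai SDK
```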
The project includes several CLI scripts to help you evaluate LLMs:
View the first few instructions from the dataset:
```bash
./preview_instructions.py
```

An example of the output can be found in instructions.txt.
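Under the hood, previewing the instructions amounts to loading the dataset from the Hugging Face Hub. A minimal sketch, assuming a train split and an instruction column (preview_instructions.py is authoritative):

```python
from datasets import load_dataset

# Assumed split and column names; adjust to match the actual dataset schema.
dataset = load_dataset("mlabonne/llmtwin", split="train")
for row in dataset.select(range(5)):
    print(row["instruction"])
```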
View the first few prompts generated from the instructions:
```bash
./preview_prompts.py
```

An example of the output can be found in prompts.txt.
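For context, prompts are typically built by wrapping each instruction in the model's chat template. A minimal sketch with transformers, not necessarily the exact formatting preview_prompts.py uses:

```python
from transformers import AutoTokenizer

# Requires access to the gated Llama repository (a Hugging Face token with the right permissions).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction into the model's chat format."""
    messages = [{"role": "user", "content": instruction}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```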
Preview answers generated by a specific model:
```bash
./preview_answers.py <model_id> [endpoint_url]
```

> **Note:** To preview and generate answers with the TwinLlama models, you'll need to set up your own inference endpoint. When you create one, HuggingFace will assign it a unique URL that you'll need to pass to the commands below.
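For reference, a minimal sketch of querying such an endpoint with huggingface_hub; the URL is a placeholder and the generation parameters are assumptions:

```python
from huggingface_hub import InferenceClient

# Placeholder URL: use the one HuggingFace assigned to your endpoint.
client = InferenceClient(
    model="https://endpoint-url.location.provider.endpoints.huggingface.cloud",
    token="hf_xxx",  # your Hugging Face token
)

answer = client.text_generation(
    "Write a paragraph about evaluating LLMs.",
    max_new_tokens=512,  # assumed limit
)
print(answer)
```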
Examples:
```bash
./preview_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./preview_answers.py mlabonne/TwinLlama-3.1-8B-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
./preview_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
```

An example of the output for each model can be found in the repository.
Generate and save all answers for a specific model:
```bash
./generate_answers.py <model_id> [endpoint_url]
```

Examples:

```bash
./generate_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./generate_answers.py mlabonne/TwinLlama-3.1-8B-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
./generate_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
```

The generated datasets of answers in JSONL format can be found in the repository.
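Conceptually, generating the answers means iterating over the instructions, querying the model, and appending one JSON record per answer. A minimal sketch; the field names are assumptions and the script's output files are authoritative:

```python
import json

def save_answers(instructions, generate, path="answers.jsonl"):
    """Call `generate` for each instruction and append the results as JSON lines."""
    with open(path, "w", encoding="utf-8") as f:
        for instruction in instructions:
            record = {
                "instruction": instruction,       # assumed field name
                "answer": generate(instruction),  # any callable returning a string
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```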
Below is a comparison of inference performance across different models and setups:
| Model | Inference Server | Generated Answers | Batch Size | Time Elapsed | Answers / Second |
|---|---|---|---|---|---|
| meta-llama-3.1-8b-instruct | Local (8GB VRAM) | 248 | 1 | 9h 25m 52s | 0.007 |
| twinllama-3.1-8b | HuggingFace Endpoint (16GB VRAM) | 334 | 4 | 18m 48s | 0.30 |
| twinllama-3.1-8b-dpo | HuggingFace Endpoint (16GB VRAM) | 334 | 4 | 29m 30s | 0.19 |
- The HuggingFace inference endpoints were priced at $0.50 per hour.
- The entire process had a total cost of $1.34.
Evaluate all answers for a specific model:
```bash
./evaluate_answers.py <model_id>
```

Examples:

```bash
./evaluate_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./evaluate_answers.py mlabonne/TwinLlama-3.1-8B-GGUF
./evaluate_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF
```

The generated datasets of evaluations in JSONL format can be found in the repository.
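For reference, a minimal sketch of what a single judge call could look like with the openai SDK. The prompt wording and the 1-3 scale are illustrative assumptions, not the repository's actual evaluate-answer prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(instruction: str, answer: str) -> str:
    """Ask gpt-4o-mini to rate an answer; the prompt below is illustrative only."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "Rate the answer's accuracy and style on a 1-3 scale and "
                    'reply as JSON: {"accuracy": ..., "style": ...}.'
                ),
            },
            {
                "role": "user",
                "content": f"Instruction:\n{instruction}\n\nAnswer:\n{answer}",
            },
        ],
    )
    return response.choices[0].message.content
```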
Below is a comparison of the evaluation performance for each of the models; note that OpenAI's gpt-4o-mini acted as the LLM judge in all cases.
| Evaluated Model | Evaluated Answers | Batch Size | Time Elapsed | Answers / Second |
|---|---|---|---|---|
| meta-llama-3.1-8b-instruct | 334 | 4 | 5m 25s | 1.03 |
| twinllama-3.1-8b | 334 | 4 | 5m 11s | 1.07 |
| twinllama-3.1-8b-dpo | 334 | 4 | 4m 23s | 1.27 |
- The entire process had a cost of $0.17.
Compute the evaluations and print a table with the results:

```bash
./compute_evaluations.py
```

Although it's a bit surprising, according to our evaluation we didn't gain much from running Direct Preference Optimization: the winning model for both accuracy and style is the finetuned one. The differences are small, though, and may not be significant.

It's also worth noting that the results presented in the book show a lower accuracy for the finetuned model, which was interpreted as a necessary tradeoff to obtain the desired writing style. According to our evaluation, that loss didn't happen.
```
╔═══════════╤════════════════╤════════════════╤═════════════╤═════════════╗
║ Model     │ Accuracy (avg) │ Accuracy (std) │ Style (avg) │ Style (std) ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ Reference │ 2.77           │ 0.49           │ 2.04        │ 0.35        ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ Finetuned │ 2.87           │ 0.33           │ 2.38        │ 0.48        ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ DPO       │ 2.84           │ 0.37           │ 2.37        │ 0.49        ║
╚═══════════╧════════════════╧════════════════╧═════════════╧═════════════╝
```
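The averages and standard deviations above can be reproduced from the evaluation files with a few lines of Python. A minimal sketch, assuming one JSONL file per model with numeric accuracy and style fields:

```python
import json
from statistics import mean, stdev

def summarize(path: str) -> None:
    """Print average and standard deviation of the accuracy and style scores."""
    accuracy, style = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            accuracy.append(record["accuracy"])  # assumed field name
            style.append(record["style"])        # assumed field name
    print(f"Accuracy: {mean(accuracy):.2f} ± {stdev(accuracy):.2f}")
    print(f"Style:    {mean(style):.2f} ± {stdev(style):.2f}")
```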
Show the slides:

```bash
marimo run ./show_slides.py
```

The local inference class has been removed from the main codebase to keep it as simple as possible. For reference, the original code is still available at assets/code/answers_local_generator.py.


