This repository contains my personal implementation of the coding part from Chapter 7 of the LLM Engineer's Handbook, focusing on evaluating LLMs (Large Language Models).
This project was developed as part of my participation in the AI from Scratch study group. Rather than following the book's code exactly, I've rewritten it from scratch to deepen my understanding of the concepts.
We want to evaluate these three models:
- Reference model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Finetuned model: mlabonne/TwinLlama-3.1-8B-GGUF
- DPO model: mlabonne/TwinLlama-3.1-8B-DPO-GGUF
Using the instructions from this dataset:
- Instructions dataset: mlabonne/llmtwin
To achieve this, we'll use OpenAI's gpt-4o-mini as a judge that follows the evaluate-answer prompt.
To set up this project on your local machine:
```bash
git clone https://github.com/elcapo/llm-engineers-handbook-evaluating-llms.git
cd llm-engineers-handbook-evaluating-llms
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp env.example .env
```

Then edit the .env file to add your Hugging Face and OpenAI API tokens.
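As a quick reference, here's a minimal sketch of how the tokens could be loaded from .env with python-dotenv. The variable names are assumptions; env.example in the repository is authoritative.

```python
# Minimal sketch, assuming .env contains lines such as:
#   HF_TOKEN=hf_xxx
#   OPENAI_API_KEY=sk-xxx
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

hf_token = os.environ["HF_TOKEN"]              # assumed name; check env.example
openai_api_key = os.environ["OPENAI_API_KEY"]  # standard name read by the openai SDK
```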
The project includes several CLI scripts to help you evaluate LLMs:
View the first few instructions from the dataset:
```bash
./preview_instructions.py
```

An example of the output can be found in instructions.txt.
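Under the hood, previewing the instructions amounts to loading the dataset from the Hugging Face Hub. A minimal sketch, assuming a train split and an instruction column (preview_instructions.py is authoritative):

```python
from datasets import load_dataset

# Assumed split and column names; adjust to match the actual dataset schema.
dataset = load_dataset("mlabonne/llmtwin", split="train")
for row in dataset.select(range(5)):
    print(row["instruction"])
```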
View the first few prompts generated from the instructions:
```bash
./preview_prompts.py
```

An example of the output can be found in prompts.txt.
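For context, prompts are typically built by wrapping each instruction in the model's chat template. A minimal sketch with transformers, not necessarily the exact formatting preview_prompts.py uses:

```python
from transformers import AutoTokenizer

# Requires access to the gated Llama repository (a Hugging Face token with the right permissions).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction into the model's chat format."""
    messages = [{"role": "user", "content": instruction}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```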
Preview answers generated by a specific model:
```bash
./preview_answers.py <model_id> [endpoint_url]
```

> **Note:** To preview and generate answers with the TwinLlama models, you'll need to set up your own inference endpoint. When you create one, HuggingFace will assign it a unique URL that you'll need to pass to the commands below.
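For reference, a minimal sketch of querying such an endpoint with huggingface_hub; the URL is a placeholder and the generation parameters are assumptions:

```python
from huggingface_hub import InferenceClient

# Placeholder URL: use the one HuggingFace assigned to your endpoint.
client = InferenceClient(
    model="https://endpoint-url.location.provider.endpoints.huggingface.cloud",
    token="hf_xxx",  # your Hugging Face token
)

answer = client.text_generation(
    "Write a paragraph about evaluating LLMs.",
    max_new_tokens=512,  # assumed limit
)
print(answer)
```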
Examples:
```bash
./preview_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./preview_answers.py mlabonne/TwinLlama-3.1-8B-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
./preview_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
```

An example of the output for each model can be found in the repository.
Generate and save all answers for a specific model:
```bash
./generate_answers.py <model_id> [endpoint_url]
```

Examples:

```bash
./generate_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./generate_answers.py mlabonne/TwinLlama-3.1-8B-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
./generate_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
```

The generated datasets of answers in JSONL format can be found in the repository.
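Conceptually, generating the answers means iterating over the instructions, querying the model, and appending one JSON record per answer. A minimal sketch; the field names are assumptions and the script's output files are authoritative:

```python
import json

def save_answers(instructions, generate, path="answers.jsonl"):
    """Call `generate` for each instruction and append the results as JSON lines."""
    with open(path, "w", encoding="utf-8") as f:
        for instruction in instructions:
            record = {
                "instruction": instruction,       # assumed field name
                "answer": generate(instruction),  # any callable returning a string
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```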
Below is a comparison of inference performance across different models and setups:
| Model | Inference Server | Generated Answers | Batch Size | Time Elapsed | Answers / Second |
|---|---|---|---|---|---|
| meta-llama-3.1-8b-instruct | Local (8GB VRAM) | 248 | 1 | 9h 25m 52s | 0.007 |
| twinllama-3.1-8b | HuggingFace Endpoint (16GB VRAM) | 334 | 4 | 18m 48s | 0.30 |
| twinllama-3.1-8b-dpo | HuggingFace Endpoint (16GB VRAM) | 334 | 4 | 29m 30s | 0.19 |
- The HuggingFace inference endpoints were priced at $0.50 per hour.
- The entire process had a total cost of $1.34.
Evaluate all answers for a specific model:
```bash
./evaluate_answers.py <model_id>
```

Examples:

```bash
./evaluate_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./evaluate_answers.py mlabonne/TwinLlama-3.1-8B-GGUF
./evaluate_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF
```

The generated datasets of evaluations in JSONL format can be found in the repository.
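For reference, a minimal sketch of what a single judge call could look like with the openai SDK. The prompt wording and the 1-3 scale are illustrative assumptions, not the repository's actual evaluate-answer prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(instruction: str, answer: str) -> str:
    """Ask gpt-4o-mini to rate an answer; the prompt below is illustrative only."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "Rate the answer's accuracy and style on a 1-3 scale and "
                    'reply as JSON: {"accuracy": ..., "style": ...}.'
                ),
            },
            {
                "role": "user",
                "content": f"Instruction:\n{instruction}\n\nAnswer:\n{answer}",
            },
        ],
    )
    return response.choices[0].message.content
```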
Below is a comparison of the evaluation performance for each of the models; note that OpenAI's gpt-4o-mini acted as the LLM judge in all cases.
| Evaluated Model | Evaluated Answers | Batch Size | Time Elapsed | Answers / Second |
|---|---|---|---|---|
| meta-llama-3.1-8b-instruct | 334 | 4 | 5m 25s | 1.03 |
| twinllama-3.1-8b | 334 | 4 | 5m 11s | 1.07 |
| twinllama-3.1-8b-dpo | 334 | 4 | 4m 23s | 1.27 |
- The entire process had a cost of $0.17.
Compute the evaluations and print a table with the results:

```bash
./compute_evaluations.py
```

Although it's a bit surprising, according to our evaluation we didn't gain much from running Direct Preference Optimization: the winning model for both accuracy and style is the finetuned one. The differences are small, though, and may not be significant.

It's also worth noting that the results presented in the book show a lower accuracy for the finetuned model, which was interpreted as a necessary tradeoff to obtain the desired writing style. According to our evaluation, that loss didn't happen.
```
╔═══════════╤════════════════╤════════════════╤═════════════╤═════════════╗
║ Model     │ Accuracy (avg) │ Accuracy (std) │ Style (avg) │ Style (std) ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ Reference │ 2.77           │ 0.49           │ 2.04        │ 0.35        ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ Finetuned │ 2.87           │ 0.33           │ 2.38        │ 0.48        ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ DPO       │ 2.84           │ 0.37           │ 2.37        │ 0.49        ║
╚═══════════╧════════════════╧════════════════╧═════════════╧═════════════╝
```
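The averages and standard deviations above can be reproduced from the evaluation files with a few lines of Python. A minimal sketch, assuming one JSONL file per model with numeric accuracy and style fields:

```python
import json
from statistics import mean, stdev

def summarize(path: str) -> None:
    """Print average and standard deviation of the accuracy and style scores."""
    accuracy, style = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            accuracy.append(record["accuracy"])  # assumed field name
            style.append(record["style"])        # assumed field name
    print(f"Accuracy: {mean(accuracy):.2f} ± {stdev(accuracy):.2f}")
    print(f"Style:    {mean(style):.2f} ± {stdev(style):.2f}")
```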
Show the slides:

```bash
marimo run ./show_slides.py
```

The local inference class has been removed from the main codebase to keep it as simple as possible. For reference, the original code is still available at assets/code/answers_local_generator.py.


