
LLM Engineer's Handbook · Evaluating LLMs

This repository contains my personal implementation of the coding part from Chapter 7 of the LLM Engineer's Handbook, focusing on evaluating LLMs (Large Language Models).

About the Project

This project was developed as part of my participation in the AI from Scratch study group. Rather than following the book's code exactly, I've rewritten it from scratch to deepen my understanding of the concepts.

Goal

We want to evaluate these three models:

  • meta-llama/Meta-Llama-3.1-8B-Instruct
  • mlabonne/TwinLlama-3.1-8B-GGUF
  • mlabonne/TwinLlama-3.1-8B-DPO-GGUF

Using the instructions from this dataset:

Approach

To achieve this, we'll use OpenAI's gpt-4o-mini as a judge that follows the evaluate-answer prompt.
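As a rough, non-authoritative sketch of what such a judge call can look like with the openai client (the prompt wording, the 1-3 score scale, and the JSON field names below are assumptions, not the repository's actual evaluate-answer prompt):

# LLM-as-a-judge sketch: prompt text, score scale, and field names are assumptions.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Given an instruction and an answer,
rate the answer's accuracy and style on a scale from 1 to 3.
Reply with a JSON object: {{"accuracy": <1-3>, "style": <1-3>}}.

Instruction:
{instruction}

Answer:
{answer}
"""

def judge(instruction: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(instruction=instruction, answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)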

Installation

To set up this project on your local machine:

Clone the repository

git clone https://github.com/elcapo/llm-engineers-handbook-evaluating-llms.git
cd llm-engineers-handbook-evaluating-llms

Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate

Install the required dependencies

pip install -r requirements.txt

Set up your environment variables

cp env.example .env

Then edit the .env file to add your Hugging Face and OpenAI API tokens.
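As an illustration, the resulting file might look like this; the variable names are assumptions, so check env.example for the exact names the scripts expect.

# Illustrative .env contents (variable names may differ from env.example)
HUGGINGFACE_ACCESS_TOKEN=hf_your_token_here
OPENAI_API_KEY=sk-your_key_here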

Usage

The project includes several CLI scripts to help you evaluate LLMs:

Preview Instructions

View the first few instructions from the dataset:

./preview_instructions.py

An example of the output can be found in instructions.txt.
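Conceptually, the preview boils down to loading the instruction dataset and printing a handful of rows, roughly as in this sketch; the dataset identifier and column name are placeholders, not the ones used by this project.

# Sketch of previewing instructions (dataset id and column name are placeholders).
from datasets import load_dataset

dataset = load_dataset("username/instruction-dataset", split="train")
for row in dataset.select(range(5)):
    print(row["instruction"])
    print("-" * 40)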

Preview Prompts

View the first few prompts generated from the instructions:

./preview_prompts.py

An example of the output can be found in prompts.txt.

Preview Answers

Preview answers generated by a specific model:

./preview_answers.py <model_id> [endpoint_url]

Note

In order to preview and generate answers with the TwinLlama models, you'll need to set up your own inference endpoint. When you create one, HuggingFace will assign it a unique URL, which you'll need to pass to the following commands.

Examples:

./preview_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./preview_answers.py mlabonne/TwinLlama-3.1-8B-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
./preview_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
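For reference, querying a dedicated endpoint can be done with the huggingface_hub library, roughly as sketched below; this is not necessarily how this repository's scripts do it, and the endpoint URL, token variable name, and generation parameters are placeholders.

# Sketch of querying a HuggingFace Inference Endpoint (URL and variable names are placeholders).
import os

from huggingface_hub import InferenceClient

client = InferenceClient(
    model="https://endpoint-url.location.provider.endpoints.huggingface.cloud",
    token=os.environ["HUGGINGFACE_ACCESS_TOKEN"],
)

answer = client.text_generation(
    "Write a short paragraph about retrieval augmented generation.",
    max_new_tokens=256,
)
print(answer)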

An example of the output for each model can be found in:

Generate Answers

Generate and save all answers for a specific model:

./generate_answers.py <model_id> [endpoint_url]

Examples:

./generate_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./generate_answers.py mlabonne/TwinLlama-3.1-8B-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
./generate_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF https://endpoint-url.location.provider.endpoints.huggingface.cloud
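Conceptually, each generated answer ends up as one JSON line pairing an instruction with the model's output, roughly as in this sketch; the field names and output path are assumptions, not necessarily those used by the scripts.

# Sketch of appending generated answers to a JSONL file (field names and path are assumptions).
import json

def save_answer(path: str, instruction: str, answer: str) -> None:
    with open(path, "a", encoding="utf-8") as file:
        file.write(json.dumps({"instruction": instruction, "answer": answer}) + "\n")

save_answer("answers/twinllama-3.1-8b.jsonl", "Explain what an LLM is.", "An LLM is ...")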

The generated datasets of answers in JSONL format can be found in:

Inference Performance

Below is a comparison of inference performance across different models and setups:

Model                      | Inference Server                 | Generated Answers | Batch Size | Time Elapsed | Answers / Second
meta-llama-3.1-8b-instruct | Local (8GB VRAM)                 | 248               | 1          | 9h 25m 52s   | 0.007
twinllama-3.1-8b           | HuggingFace Endpoint (16GB VRAM) | 334               | 4          | 18m 48s      | 0.30
twinllama-3.1-8b-dpo       | HuggingFace Endpoint (16GB VRAM) | 334               | 4          | 29m 30s      | 0.19

Instance Analytics

Cost Summary
  • The HuggingFace inference endpoints were priced at $0.50 per hour.
  • The entire process had a total cost of $1.34.

HuggingFace Inference Cost

Evaluate Answers

Evaluate all answers for a specific model:

./evaluate_answers.py <model_id>

Examples:

./evaluate_answers.py meta-llama/Meta-Llama-3.1-8B-Instruct
./evaluate_answers.py mlabonne/TwinLlama-3.1-8B-GGUF
./evaluate_answers.py mlabonne/TwinLlama-3.1-8B-DPO-GGUF

The generated datasets of evaluations in JSONL format can be found in:

Evaluation Performance

Below is a comparison of the evaluation performance for each of the models; OpenAI's gpt-4o-mini acted as the LLM judge in all cases.

Evaluated Model            | Evaluated Answers | Batch Size | Time Elapsed | Answers / Second
meta-llama-3.1-8b-instruct | 334               | 4          | 5m 25s       | 1.03
twinllama-3.1-8b           | 334               | 4          | 5m 11s       | 1.07
twinllama-3.1-8b-dpo       | 334               | 4          | 4m 23s       | 1.27

Cost Summary
  • The entire process had a cost of $0.17.

OpenAI Inference Cost

Compute Evaluations

Compute the evaluations and print a table with the results.

./compute_evaluations.py
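The aggregation itself boils down to reading each evaluation file and computing the mean and standard deviation of the scores, roughly as in this sketch; the file path and field names are assumptions.

# Sketch of aggregating evaluation scores (file path and field names are assumptions).
import json
from statistics import mean, stdev

def summarize(path: str) -> dict:
    accuracy, style = [], []
    with open(path, encoding="utf-8") as file:
        for line in file:
            record = json.loads(line)
            accuracy.append(record["accuracy"])
            style.append(record["style"])
    return {
        "accuracy_avg": mean(accuracy), "accuracy_std": stdev(accuracy),
        "style_avg": mean(style), "style_std": stdev(style),
    }

print(summarize("evaluations/twinllama-3.1-8b.jsonl"))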

Results Summary

Somewhat surprisingly, our evaluation suggests that we didn't gain much from running Direct Preference Optimization: the winning model for both accuracy and style is the finetuned one. The differences are small, though, so they may not be significant.

It's also worth noting that the results presented in the book showed lower accuracy for the finetuned model, which was interpreted as a tradeoff needed to obtain the desired writing style. According to our evaluation, that loss didn't occur.

╔═══════════╤════════════════╤════════════════╤═════════════╤═════════════╗
║ Model     │ Accuracy (avg) │ Accuracy (std) │ Style (avg) │ Style (std) ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ Reference │           2.77 │           0.49 │        2.04 │        0.35 ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ Finetuned │           2.87 │           0.33 │        2.38 │        0.48 ║
╟───────────┼────────────────┼────────────────┼─────────────┼─────────────╢
║ DPO       │           2.84 │           0.37 │        2.37 │        0.49 ║
╚═══════════╧════════════════╧════════════════╧═════════════╧═════════════╝

Show Slides

Show the slides.

marimo run ./show_slides.py

Notes

The local inference class has been removed from the main codebase to keep it as simple as possible. For reference, the original code is still available at assets/code/answers_local_generator.py.
