LLM Benchmark Tool

A comprehensive client-side benchmarking tool for LLM inference servers with OpenAI-compatible APIs.

Installation

Linux or WSL

git clone https://github.com/cezbloch/llm_benchmark.git
cd llm_benchmark
pip install -e .[dev]

Docker Dev Container

To run dev containers you need:

  • Docker
  • VS Code + Dev Containers extension

Then:

  • git clone https://github.com/cezbloch/llm_benchmark.git
  • Open the llm_benchmark folder in VS Code.
  • F1 -> Dev Containers -> Reopen in Container
    • VS Code will build the container and install the necessary pip packages

Run

# Local server without API key
llm-benchmark \
  --base-url http://localhost:8000 \
  --model meta/llama-3.2-1b-instruct \
  --prompts "Tell me a good joke."

# Public server from HF
llm-benchmark \
  --base-url https://router.huggingface.co \
  --model mistralai/Mistral-7B-Instruct-v0.2:featherless-ai \
  --api-key <your_key> \
  --prompts "Tell me a good joke."

# Fully configured run
llm-benchmark \
  --base-url https://router.huggingface.co \
  --model mistralai/Mistral-7B-Instruct-v0.2:featherless-ai \
  --api-key <your_key> \
  --prompts data/sample_prompts.txt \
  --requests 50 \
  --concurrency 5 \
  --output results/benchmark_results.csv

Features

The fundamental concepts of LLM benchmarking are described here.

Core Metrics

This benchmark measures the following metrics:

  • Time to First Token (TTFT): Time from sending the request until the first meaningful non-empty token arrives
  • Inter-Token Latency (ITL): Time between consecutive tokens
  • Tokens Per Second (TPS): Token generation rate
  • End-to-End Latency: Time from sending the request until the last meaningful non-empty token arrives
  • Requests Per Second (RPS): Number of successfully completed requests / total benchmark duration

Advanced Features

  • Streaming Support: Real-time token-by-token measurement
  • Concurrency Control: Configurable concurrent requests
  • Multiple Prompts: Round-robin through prompt sets
  • Statistical Analysis: P50/P95 percentiles and averages (see the sketch after this list)
  • CSV Export: Detailed per-request metrics
  • Rich Console Output: Beautiful formatted results
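
The summary statistics are conceptually simple; the sketch below shows how P50/P95 and the average could be computed with Python's standard library. This is illustrative only and is not the tool's actual implementation.

# Sketch: averaging and P50/P95 percentiles over per-request latencies (illustrative only).
import statistics

def summarize(latencies: list[float]) -> dict:
    # quantiles(n=20) returns 19 cut points; index 9 is P50 and index 18 is P95
    cuts = statistics.quantiles(latencies, n=20)
    return {
        "avg": statistics.fmean(latencies),
        "p50": cuts[9],
        "p95": cuts[18],
    }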

Metric Formulas

Below are detailed formulas for calculating each metric, together with explanations.

This benchmark aims to calculate metrics the same way Nvidia's GenAI-Perf does.

Per Request

FT - timestamp of the First Token - collected when the first meaningful token arrives; initial responses that have no content, or whose content is an empty string (no token present), are disregarded

RS - timestamp of the Request being Sent

$$TTFT = FT - RS$$

LT - timestamp of the Last Token presented to the user. The final DONE message is not included.

E2E - End-to-End Latency

$$E2E = LT - RS$$

ITL - Inter-Token Latency

TOK - Total Output Tokens - number of tokens received for the request.

$$ITL = \frac{LT - FT}{TOK - 1}$$

TPS_user - output Tokens Per Second as observed by a single user (per request)

$$TPS_{user} = \frac{TOK}{E2E}$$
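
A minimal Python sketch of these per-request formulas (illustrative only; the function and variable names are assumptions, not the tool's actual code):

# Illustrative sketch of the per-request formulas above; not the tool's actual implementation.
def per_request_metrics(request_sent: float, token_timestamps: list[float]) -> dict:
    """Compute TTFT, E2E, ITL and TPS_user from the request-sent timestamp (RS)
    and the arrival timestamps of meaningful, non-empty tokens."""
    ft = token_timestamps[0]     # FT: first meaningful token
    lt = token_timestamps[-1]    # LT: last token (the final DONE message is excluded)
    tok = len(token_timestamps)  # TOK: total output tokens

    ttft = ft - request_sent                         # TTFT = FT - RS
    e2e = lt - request_sent                          # E2E = LT - RS
    itl = (lt - ft) / (tok - 1) if tok > 1 else 0.0  # ITL = (LT - FT) / (TOK - 1)
    tps_user = tok / e2e                             # TPS_user = TOK / E2E
    return {"ttft": ttft, "e2e": e2e, "itl": itl, "tps_user": tps_user}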

Per Benchmark

TOK - Total Output Tokens - sum of output tokens from all requests

T_x - timestamp of sending the first request

T_y - timestamp of receiving the last meaningful token from any request

$$TPS = \frac{TOK}{T_y - T_x}$$

RPS - Requests Per Second - is the average number of requests that can be successfully completed by the system in a 1-second period.

TCR - Total Completed Requests - number of successfully completed requests.

$$RPS = \frac{TCR}{T_y - T_x}$$
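
And a matching sketch for the per-benchmark formulas (again illustrative, with assumed names):

# Illustrative sketch of the per-benchmark formulas above; not the tool's actual implementation.
def per_benchmark_metrics(first_request_sent: float,
                          last_token_received: float,
                          output_tokens_per_request: list[int],
                          completed_requests: int) -> dict:
    """Compute TPS and RPS across a whole benchmark run."""
    duration = last_token_received - first_request_sent  # T_y - T_x
    tok = sum(output_tokens_per_request)                  # TOK: total output tokens
    return {
        "tps": tok / duration,                 # TPS = TOK / (T_y - T_x)
        "rps": completed_requests / duration,  # RPS = TCR / (T_y - T_x)
    }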

Tests

System Tests

System tests are slow tests that exercise a real-life use case against an actual server. The Hugging Face inference provider was chosen. Access to this server requires a token.

  • create an access token at HF - Tokens
  • select Make calls to Inference Providers and Read access to contents of all public gated repos you can access
  • in the root of this repo, create a .env file and paste your token into it as the line HF_TOKEN=<your_token> (see the sketch after this list)
  • the token is already set on GitHub, so system tests will run there
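
A system test then needs the token at runtime. A minimal sketch of reading it, assuming the python-dotenv package is used to load the .env file (the actual tests may load it differently):

# Sketch only: load HF_TOKEN from the .env file; assumes the python-dotenv package.
import os
from dotenv import load_dotenv

load_dotenv()                      # loads variables from .env into the environment
hf_token = os.environ["HF_TOKEN"]  # raises KeyError if the token is missing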

Setting up your own Inference Server (Optional)

Below are instructions on how to run your own Nvidia NIM server. We will use the Llama 3.2 1B Instruct model, which can run on consumer GPUs that are a few years old.

Follow the instructions at Nvidia - Llama 3.2 1B instruct to run the model.

  • you need to create an NGC API key in order to download the Docker image that exposes the model's service.

Run the Llama NIM:

export NGC_API_KEY=<your_nvidia_api_key>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_MAX_MODEL_LEN=4096 \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.2-1b-instruct:latest

After the server has started and downloaded the model (which can take from several minutes to hours depending on your internet connection), test it with the following command:

curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
    "model": "meta/llama-3.2-1b-instruct",
    "messages": [{"role":"user", "content":"What are the best companies to work for?"}],
    "max_tokens": 1024,
    "stream": true
}'

Change the IP address if needed.
