A comprehensive client-side benchmarking tool for LLM inference servers with OpenAI-compatible APIs.
git clone https://github.com/cezbloch/llm_benchmark.git
cd llm_benchmark
pip install -e .[dev]

To run dev containers you need:
- Docker
- VS Code + Dev Containers extension
git clone https://github.com/cezbloch/llm_benchmark.git

- Open the `llm_benchmark` folder in VS Code.
- F1 -> Dev Containers -> Reopen in Container
- VS Code will build the container and install necessary pip packages
# Local server without API key
llm-benchmark \
--base-url http://localhost:8000 \
--model meta/llama-3.2-1b-instruct \
--prompts "Tell me a good joke."# Public server from HF
llm-benchmark \
--base-url https://router.huggingface.co \
--model mistralai/Mistral-7B-Instruct-v0.2:featherless-ai \
--api-key <your_key> \
--prompts "Tell me a good joke."# Fully configured run
llm-benchmark \
--base-url https://router.huggingface.co \
--model mistralai/Mistral-7B-Instruct-v0.2:featherless-ai \
--api-key <your_key> \
--prompts data/sample_prompts.txt \
--requests 50 \
--concurrency 5 \
--output results/benchmark_results.csv

Fundamental concepts of LLM benchmarking are described here.
This benchmark measures the following metrics:
- Time to First Token (TTFT): Time from request send → first meaningful non-empty token arrives
- Inter-Token Latency (ITL): Time between consecutive tokens
- Tokens Per Second (TPS): Token generation rate
- End-to-End Latency: Time from request send → last meaningful non-empty token arrives
- Requests Per Second (RPS): Number of successfully completed requests per second
- Streaming Support: Real-time token-by-token measurement
- Concurrency Control: Configurable concurrent requests
- Multiple Prompts: Round-robin through prompt sets
- Statistical Analysis: P50/P95 percentiles and averages (see the sketch after this list)
- CSV Export: Detailed per-request metrics
- Rich Console Output: Beautiful formatted results
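
The concurrency control, round-robin prompting, and percentile reporting listed above could be wired together roughly as in the following sketch. This is illustrative only, not the tool's actual implementation; `send_request` is a hypothetical stand-in for a real API call.

```python
# Minimal sketch (not the actual implementation): semaphore-bounded concurrency,
# round-robin prompt selection, and P50/P95 aggregation of per-request latencies.
import asyncio
import random
import statistics


async def send_request(prompt: str) -> float:
    """Hypothetical stand-in for a real API call; returns the request latency in seconds."""
    latency = random.uniform(0.2, 1.5)
    await asyncio.sleep(latency)
    return latency


async def run_benchmark(prompts: list[str], requests: int, concurrency: int) -> None:
    semaphore = asyncio.Semaphore(concurrency)  # cap the number of in-flight requests

    async def bounded(i: int) -> float:
        async with semaphore:
            # Round-robin through the prompt set.
            return await send_request(prompts[i % len(prompts)])

    latencies = await asyncio.gather(*(bounded(i) for i in range(requests)))
    ordered = sorted(latencies)
    p50 = ordered[int(0.50 * (len(ordered) - 1))]
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    print(f"avg={statistics.mean(latencies):.3f}s p50={p50:.3f}s p95={p95:.3f}s")


asyncio.run(run_benchmark(["Tell me a good joke.", "Summarize this text."], requests=50, concurrency=5))
```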
Below are detailed formulas for calculating each metric, together with explanations.
This benchmark aims to calculate metrics the same way Nvidia's GenAI-Perf does; a worked sketch follows the symbol definitions below.
FT - timestamp of the First Token - collected when the first meaningful token arrives; disregards initial responses that have no content or whose content is an empty string (no token present)
RS - timestamp Request Sent
LT - timestamp of Last Token presented to the user. Last DONE token is not included.
E2E - End-to-End Latency
ITL - Inter-Token Latency
TOK - Total Output Tokens - number of tokens received for the request.
TPS_user - Tokens Per Second as observed by a single user (per request)
TOK - Total Output Tokens - sum of output tokens from all requests (system-wide)
T_x - timestamp of sending the first request
T_y - timestamp of receiving last meaningful token from any request
RPS - Requests Per Second - is the average number of requests that can be successfully completed by the system in a 1-second period.
TCR - Total Completed Requests - number of successfully completed requests.
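
One plausible reading of these symbols, in line with GenAI-Perf-style definitions, is sketched below; the repository's source code is authoritative for the exact formulas (in particular, some tools define per-user TPS as (TOK - 1) / (LT - FT) rather than TOK / E2E).

```python
# A plausible reading of the symbols above (GenAI-Perf-style); the repository's
# code is authoritative for the exact formulas used.

def ttft(ft: float, rs: float) -> float:
    """Time to First Token: first meaningful token timestamp minus request-sent timestamp."""
    return ft - rs

def e2e_latency(lt: float, rs: float) -> float:
    """End-to-End Latency: last meaningful token timestamp minus request-sent timestamp."""
    return lt - rs

def itl(lt: float, ft: float, tok: int) -> float:
    """Inter-Token Latency: average gap between consecutive tokens after the first one."""
    return (lt - ft) / (tok - 1)

def tps_user(tok: int, lt: float, rs: float) -> float:
    """Per-user Tokens Per Second: output tokens over the request's end-to-end latency.
    (An alternative convention uses (tok - 1) / (lt - ft) instead.)"""
    return tok / (lt - rs)

def tps_system(total_tok: int, t_x: float, t_y: float) -> float:
    """System Tokens Per Second: all output tokens over the whole benchmark window T_x..T_y."""
    return total_tok / (t_y - t_x)

def rps(tcr: int, t_x: float, t_y: float) -> float:
    """Requests Per Second: completed requests over the whole benchmark window."""
    return tcr / (t_y - t_x)
```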
System tests are slow tests that exercise a real-life use case against an actual server. The Hugging Face inference provider was chosen; access to this server requires a token.
- create an access token at HF - Tokens
- select `Make calls to Inference Providers` and `Read access to contents of all public gated repos you can access`
- in the root of this repo create a `.env` file and paste your token there as a line `HF_TOKEN=<your_token>` (see the sketch below)
- the token is already set on GitHub, so system tests will run there
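
A system test could pick the token up roughly like the sketch below. This is illustrative and assumes `python-dotenv` and `pytest`; it is not the repository's actual test code.

```python
# Illustrative sketch only (not the repo's actual test code): load HF_TOKEN from
# the .env file and skip the slow system test when it is absent.
import os

import pytest
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the repo root into the environment


def test_hf_inference_provider_smoke():
    token = os.getenv("HF_TOKEN")
    if not token:
        pytest.skip("HF_TOKEN not set; skipping slow system test")
    # ... run the benchmark against https://router.huggingface.co using `token` ...
```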
Below are instructions on how to run your own Nvidia NIM server. We will use the
Llama 3.2 1B Instruct model, which can be run on consumer GPUs that are a few years old.
Follow the instructions at Nvidia - Llama 3.2 1B instruct to run the model.
- you need to create an NGC API key in order to download the Docker image that exposes the model's service.
Run the Llama NIM:
export NGC_API_KEY=<your_nvidia_api_key>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-e NIM_MAX_MODEL_LEN=4096 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.2-1b-instruct:latest

After the server has started and downloaded the model (which can take from several minutes to hours depending on your internet connection), test it with the command:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-1b-instruct",
"messages": [{"role":"user", "content":"What are the best companies to work for?"}],
"max_tokens": 1024,
"stream": true
}'

Change the IP address if needed.
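
For reference, the same streaming request can be issued from Python with the OpenAI-compatible client, which is also how time to first token can be measured on the client side. This is a sketch assuming the `openai` package is installed; the placeholder API key only satisfies the client, since the local server in this example does not require one.

```python
# Sketch: stream a chat completion from the local NIM server via the OpenAI-compatible
# Python client and time the arrival of the first non-empty token (TTFT reference point).
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="meta/llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "What are the best companies to work for?"}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. usage-only) carry no choices
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()  # first meaningful non-empty token
    if delta:
        print(delta, end="", flush=True)

print(f"\nTTFT: {first_token_at - start:.3f}s")
```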