A comprehensive client-side benchmarking tool for LLM inference servers with OpenAI-compatible APIs.
git clone https://github.com/cezbloch/llm_benchmark.git
cd llm_benchmark
pip install -e .[dev]

To run dev containers you need:
- Docker
- VS Code + Dev Containers extension
git clone https://github.com/cezbloch/llm_benchmark.git

- Open the `llm_benchmark` folder in VS Code.
- F1 -> Dev Containers -> Reopen in Container
- VS Code will build the container and install necessary pip packages
# Local server without API key
llm-benchmark \
--base-url http://localhost:8000 \
--model meta/llama-3.2-1b-instruct \
--prompts "Tell me a good joke."# Public server from HF
llm-benchmark \
--base-url https://router.huggingface.co \
--model mistralai/Mistral-7B-Instruct-v0.2:featherless-ai \
--api-key <your_key> \
--prompts "Tell me a good joke."# Fully configured run
llm-benchmark \
--base-url https://router.huggingface.co \
--model mistralai/Mistral-7B-Instruct-v0.2:featherless-ai \
--api-key <your_key> \
--prompts data/sample_prompts.txt \
--requests 50 \
--concurrency 5 \
--output results/benchmark_results.csv

Fundamental concepts of LLM benchmarking are described here.
This benchmark measures the following metrics:
- Time to First Token (TTFT): Time from request send → first meaningful non-empty token arrives
- Inter-Token Latency (ITL): Time between consecutive tokens
- Tokens Per Second (TPS): Token generation rate
- End-to-End Latency: Time from request send → last meaningful non-empty token arrives
- Requests Per Second (RPS): Number of successfully completed requests per second
- Streaming Support: Real-time token-by-token measurement
- Concurrency Control: Configurable concurrent requests
- Multiple Prompts: Round-robin through prompt sets
- Statistical Analysis: P50/P95 percentiles and averages (see the sketch after this list)
- CSV Export: Detailed per-request metrics
- Rich Console Output: Beautiful formatted results
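
The concurrency control, round-robin prompting, and percentile reporting listed above could be wired together roughly as in the following sketch. This is illustrative only, not the tool's actual implementation; `send_request` is a hypothetical stand-in for a real API call.

```python
# Minimal sketch (not the actual implementation): semaphore-bounded concurrency,
# round-robin prompt selection, and P50/P95 aggregation of per-request latencies.
import asyncio
import random
import statistics


async def send_request(prompt: str) -> float:
    """Hypothetical stand-in for a real API call; returns the request latency in seconds."""
    latency = random.uniform(0.2, 1.5)
    await asyncio.sleep(latency)
    return latency


async def run_benchmark(prompts: list[str], requests: int, concurrency: int) -> None:
    semaphore = asyncio.Semaphore(concurrency)  # cap the number of in-flight requests

    async def bounded(i: int) -> float:
        async with semaphore:
            # Round-robin through the prompt set.
            return await send_request(prompts[i % len(prompts)])

    latencies = await asyncio.gather(*(bounded(i) for i in range(requests)))
    ordered = sorted(latencies)
    p50 = ordered[int(0.50 * (len(ordered) - 1))]
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    print(f"avg={statistics.mean(latencies):.3f}s p50={p50:.3f}s p95={p95:.3f}s")


asyncio.run(run_benchmark(["Tell me a good joke.", "Summarize this text."], requests=50, concurrency=5))
```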
Below are detailed formulas for calculating each metric, together with explanations.
This benchmark aims to calculate metrics the same way Nvidia's GenAI-Perf does; a worked sketch follows the symbol definitions below.
FT - timestamp of the First Token - collected when the first meaningful token arrives; disregards initial responses that have no content or whose content is an empty string (no token present)
RS - timestamp Request Sent
LT - timestamp of Last Token presented to the user. Last DONE token is not included.
E2E - End-to-End Latency
ITL - Inter-Token Latency
TOK - Total Output Tokens - number of tokens received for the request.
TPS_user - Tokens Per Second as observed by a single user (per request)
TOK - Total Output Tokens - sum of output tokens from all requests (system-wide)
T_x - timestamp of sending the first request
T_y - timestamp of receiving last meaningful token from any request
RPS - Requests Per Second - is the average number of requests that can be successfully completed by the system in a 1-second period.
TCR - Total Completed Requests - number of successfully completed requests.
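
One plausible reading of these symbols, in line with GenAI-Perf-style definitions, is sketched below; the repository's source code is authoritative for the exact formulas (in particular, some tools define per-user TPS as (TOK - 1) / (LT - FT) rather than TOK / E2E).

```python
# A plausible reading of the symbols above (GenAI-Perf-style); the repository's
# code is authoritative for the exact formulas used.

def ttft(ft: float, rs: float) -> float:
    """Time to First Token: first meaningful token timestamp minus request-sent timestamp."""
    return ft - rs

def e2e_latency(lt: float, rs: float) -> float:
    """End-to-End Latency: last meaningful token timestamp minus request-sent timestamp."""
    return lt - rs

def itl(lt: float, ft: float, tok: int) -> float:
    """Inter-Token Latency: average gap between consecutive tokens after the first one."""
    return (lt - ft) / (tok - 1)

def tps_user(tok: int, lt: float, rs: float) -> float:
    """Per-user Tokens Per Second: output tokens over the request's end-to-end latency.
    (An alternative convention uses (tok - 1) / (lt - ft) instead.)"""
    return tok / (lt - rs)

def tps_system(total_tok: int, t_x: float, t_y: float) -> float:
    """System Tokens Per Second: all output tokens over the whole benchmark window T_x..T_y."""
    return total_tok / (t_y - t_x)

def rps(tcr: int, t_x: float, t_y: float) -> float:
    """Requests Per Second: completed requests over the whole benchmark window."""
    return tcr / (t_y - t_x)
```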
System tests are slow tests that exercise a real-life use case against an actual server. The Hugging Face inference provider was chosen; access to this server requires a token.
- create an access token at HF - Tokens
- select `Make calls to Inference Providers` and `Read access to contents of all public gated repos you can access`
- in the root of this repo create a `.env` file and paste your token there as a line `HF_TOKEN=<your_token>` (see the sketch below)
- the token is already set on GitHub, so system tests will run there
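
A system test could pick the token up roughly like the sketch below. This is illustrative and assumes `python-dotenv` and `pytest`; it is not the repository's actual test code.

```python
# Illustrative sketch only (not the repo's actual test code): load HF_TOKEN from
# the .env file and skip the slow system test when it is absent.
import os

import pytest
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the repo root into the environment


def test_hf_inference_provider_smoke():
    token = os.getenv("HF_TOKEN")
    if not token:
        pytest.skip("HF_TOKEN not set; skipping slow system test")
    # ... run the benchmark against https://router.huggingface.co using `token` ...
```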
Below are instructions on how to run your own Nvidia NIM server. We will use the
Llama 3.2 1B Instruct model, which can be run on consumer GPUs that are a few years old.
Follow the instructions at Nvidia - Llama 3.2 1B instruct to run the model.
- you need to create an NGC API key in order to download the Docker image that exposes the model's service.
Run the Llama NIM:
export NGC_API_KEY=<your_nvidia_api_key>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-e NIM_MAX_MODEL_LEN=4096 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.2-1b-instruct:latest

After the server has started and downloaded the model (which can take from several minutes to hours depending on your internet connection), test it with the command:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.2-1b-instruct",
"messages": [{"role":"user", "content":"What are the best companies to work for?"}],
"max_tokens": 1024,
"stream": true
}'

Change the IP address if needed.
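
For reference, the same streaming request can be issued from Python with the OpenAI-compatible client, which is also how time to first token can be measured on the client side. This is a sketch assuming the `openai` package is installed; the placeholder API key only satisfies the client, since the local server in this example does not require one.

```python
# Sketch: stream a chat completion from the local NIM server via the OpenAI-compatible
# Python client and time the arrival of the first non-empty token (TTFT reference point).
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="meta/llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "What are the best companies to work for?"}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. usage-only) carry no choices
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()  # first meaningful non-empty token
    if delta:
        print(delta, end="", flush=True)

print(f"\nTTFT: {first_token_at - start:.3f}s")
```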