# Warmup Example

This example demonstrates the **warmup phase** feature using
[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct),
a small 0.5B-parameter model that is easy to run locally.

The warmup phase issues randomly generated requests to the endpoint before the timed
performance window begins.
| 9 | + |
## What warmup does

Before the benchmark clock starts, the warmup phase sends a configurable number of
requests using randomly generated token sequences. This primes the endpoint by:

- Establishing and reusing TCP connections
- Filling KV caches to steady-state occupancy
- Triggering JIT compilation / CUDA graph capture in the inference runtime

Warmup samples are **excluded from all reported metrics**: they complete before
`TEST_STARTED` is recorded, so they do not affect throughput, latency, time to first
token (TTFT), or time per output token (TPOT).
| 21 | + |
## Warmup configuration

Add a `warmup` block to any YAML config:

```yaml
warmup:
  num_samples: 64        # number of warmup requests to issue
  input_seq_length: 256  # ISL: target input token count
  output_seq_length: 64  # OSL: max_new_tokens for warmup requests
  range_ratio: 0.9       # ISL variance: generates ISL in [256*0.9, 256]
  random_seed: 42        # seed for reproducible warmup sequences
```
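The keys above can be modeled as a small config object. This is a hypothetical sketch, not the tool's actual schema — the class name, defaults, and validation rules are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class WarmupConfig:
    """Hypothetical mirror of the `warmup` YAML block; not the tool's real schema."""
    num_samples: int = 64
    input_seq_length: int = 256
    output_seq_length: int = 64
    range_ratio: float = 1.0
    random_seed: int = 0

    def __post_init__(self):
        # Basic sanity checks on the values the YAML block accepts
        if self.num_samples < 1:
            raise ValueError("num_samples must be >= 1")
        if not 0.0 < self.range_ratio <= 1.0:
            raise ValueError("range_ratio must be in (0.0, 1.0]")

    @property
    def min_input_seq_length(self) -> int:
        # Lower bound of the sampled ISL range: [isl * range_ratio, isl]
        return int(self.input_seq_length * self.range_ratio)


cfg = WarmupConfig(num_samples=64, input_seq_length=256,
                   output_seq_length=64, range_ratio=0.9, random_seed=42)
```

With the example values, the sampled ISL range bottoms out at `int(256 * 0.9) == 230` tokens.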
No real dataset is needed for warmup: sequences are generated at runtime from random
token IDs using the model's own tokenizer.
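The runtime generation step can be sketched roughly as below. This is illustrative, not the project's implementation: the function name is invented, and the default `vocab_size` assumes the Qwen2.5 tokenizer's vocabulary rather than querying the tokenizer itself.

```python
import random


def make_warmup_prompts(num_samples, input_seq_length, range_ratio, random_seed,
                        vocab_size=151_936):
    """Generate random token-ID sequences for warmup (illustrative sketch).

    vocab_size=151_936 is an assumption (the Qwen2.5 vocabulary size);
    the real tool derives it from the model's own tokenizer.
    """
    rng = random.Random(random_seed)
    prompts = []
    for _ in range(num_samples):
        # ISL drawn uniformly from [input_seq_length * range_ratio, input_seq_length]
        isl = rng.randint(int(input_seq_length * range_ratio), input_seq_length)
        prompts.append([rng.randrange(vocab_size) for _ in range(isl)])
    return prompts


prompts = make_warmup_prompts(num_samples=64, input_seq_length=256,
                              range_ratio=0.9, random_seed=42)
```

Because the seed is fixed, the same `random_seed` always yields the same warmup sequences, which keeps warmup behavior reproducible across runs.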
## Quick test with the echo server

The built-in echo server lets you verify the warmup flow locally without a GPU.
```bash
# Terminal 1 — start the echo server
python -m inference_endpoint.testing.echo_server --port 8000

# Terminal 2 — run offline benchmark with warmup
inference-endpoint benchmark from-config -c examples/09_Warmup_Example/warmup_offline.yaml
```
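To see what such an echo flow amounts to, here is a minimal stand-in written from scratch — it is **not** the project's echo server, just a rough sketch that assumes an OpenAI-style chat-completions request shape and echoes the last user message back:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Echoes the last user message back as the assistant completion."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        # Assumed request shape: {"messages": [{"role": ..., "content": ...}, ...]}
        content = (body.get("messages") or [{}])[-1].get("content", "")
        payload = json.dumps(
            {"choices": [{"message": {"role": "assistant", "content": content}}]}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging noise


# To run standalone, e.g.:
# HTTPServer(("127.0.0.1", 8000), EchoHandler).serve_forever()
```

Since the body is echoed verbatim, response "token" counts track request sizes, which is enough to exercise the warmup and drain logic end to end without a model.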
| 49 | + |
The log output will show the warmup phase completing before the performance run starts:

```
INFO Warmup dataset ready: 64 samples (ISL=256, OSL=64)
INFO Warmup: issuing samples...
INFO Warmup samples issued, waiting for responses to drain...
INFO Warmup complete
INFO Running...
```
| 59 | + |
## Running against a real endpoint

### Prerequisites

```bash
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your HuggingFace cache, e.g. ~/.cache/huggingface>
```

Download the model before launching so vLLM can reuse the local cache:

```bash
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
```
| 74 | + |
### Launch a vLLM server

The `--trust-request-chat-template` flag is required because the CNN DailyMail dataset
sends requests with a custom chat template.

```bash
docker run --runtime nvidia --gpus all \
  -v ${HF_HOME}:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --trust-request-chat-template
```
| 89 | + |
### Offline benchmark with warmup

```bash
inference-endpoint benchmark from-config -c examples/09_Warmup_Example/warmup_offline.yaml
```

### Online benchmark with warmup

```bash
inference-endpoint benchmark from-config -c examples/09_Warmup_Example/warmup_online.yaml
```
| 101 | + |
## Tuning warmup parameters

| Parameter           | Guidance                                                                |
| ------------------- | ----------------------------------------------------------------------- |
| `num_samples`       | Use enough to saturate the KV cache; 32–128 is typical for small models |
| `input_seq_length`  | Match the ISL distribution of your real workload                        |
| `output_seq_length` | Match the OSL distribution; lower values make warmup finish faster      |
| `range_ratio`       | `1.0` = fixed ISL; `0.8`–`0.9` adds light variance for broader coverage |
| `random_seed`       | Change to vary which token sequences are generated                      |
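When tuning these values, a quick back-of-the-envelope token budget helps judge how long warmup will take relative to the benchmark itself. The helper below is illustrative arithmetic, not part of the tool:

```python
def warmup_token_budget(num_samples, input_seq_length, output_seq_length,
                        range_ratio=1.0):
    """Estimate total tokens processed during warmup (illustrative helper).

    Assumes ISL is drawn uniformly from [isl * range_ratio, isl], so the
    expected ISL is the midpoint of that interval.
    """
    mean_isl = input_seq_length * (1 + range_ratio) / 2
    prefill = num_samples * mean_isl           # expected prompt tokens
    decode = num_samples * output_seq_length   # worst-case generated tokens
    return {"prefill_tokens": prefill,
            "decode_tokens": decode,
            "total_tokens": prefill + decode}


budget = warmup_token_budget(64, 256, 64, range_ratio=0.9)
```

With the example config, warmup processes roughly 15.6k prompt tokens and up to 4,096 generated tokens, so lowering `output_seq_length` is the quickest way to shorten the phase without changing prefill behavior.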