
Commit 4565bb8

Initial warmup implementation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
1 parent 03df2ca commit 4565bb8

File tree

11 files changed: +624 −16 lines
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Warmup Example

This example demonstrates the **warmup phase** feature using
[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct),
a small 0.5B parameter model that is easy to run locally.

The warmup phase issues randomly generated requests to the endpoint before the timed
performance window begins.

## What warmup does

Before the benchmark clock starts, the warmup phase sends a configurable number of
requests using randomly generated token sequences. This primes the endpoint by:

- Establishing and reusing TCP connections
- Filling KV caches to steady-state occupancy
- Triggering JIT compilation / CUDA graph capture in the inference runtime

Warmup samples are **excluded from all reported metrics**: they complete before
`TEST_STARTED` is recorded, so they do not affect throughput, latency, TTFT, or TPOT.

## Warmup configuration

Add a `warmup` block to any YAML config:

```yaml
warmup:
  num_samples: 64         # number of warmup requests to issue
  input_seq_length: 256   # ISL: target input token count
  output_seq_length: 64   # OSL: max_new_tokens for warmup requests
  range_ratio: 0.9        # ISL variance: generates ISL in [256*0.9, 256]
  random_seed: 42
```

No real dataset is needed for warmup: sequences are generated at runtime from random
token IDs using the model's own tokenizer.

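The generation scheme can be sketched in a few lines of plain Python. This is a hypothetical standalone illustration, not the tool's actual `RandomDataset` code, and it stops at token IDs (decoding them back to text with the model's tokenizer is omitted); the `vocab_size` value is illustrative:

```python
import random


def generate_warmup_lengths(num_samples: int, input_seq_length: int,
                            range_ratio: float, random_seed: int) -> list[int]:
    """Draw one input length per warmup request, uniformly from
    [input_seq_length * range_ratio, input_seq_length]."""
    rng = random.Random(random_seed)
    lo = int(input_seq_length * range_ratio)
    return [rng.randint(lo, input_seq_length) for _ in range(num_samples)]


def generate_warmup_token_ids(lengths: list[int], vocab_size: int,
                              random_seed: int) -> list[list[int]]:
    """Fill each request with random token IDs drawn from the vocabulary."""
    rng = random.Random(random_seed)
    return [[rng.randrange(vocab_size) for _ in range(n)] for n in lengths]


# Values from the YAML block above; vocab_size is an illustrative placeholder.
lengths = generate_warmup_lengths(num_samples=64, input_seq_length=256,
                                  range_ratio=0.9, random_seed=42)
prompts = generate_warmup_token_ids(lengths, vocab_size=32_000, random_seed=42)
```

Because the RNG is seeded, the same config produces the same warmup sequences on every run, which keeps warmup behavior reproducible across benchmark invocations.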
## Quick test with the echo server

The built-in echo server lets you verify the warmup flow locally without a GPU.

```bash
# Terminal 1: start the echo server
python -m inference_endpoint.testing.echo_server --port 8000

# Terminal 2: run the offline benchmark with warmup
inference-endpoint benchmark from-config -c examples/09_Warmup_Example/warmup_offline.yaml
```

The log output will show the warmup phase completing before the performance run starts:

```
INFO Warmup dataset ready: 64 samples (ISL=256, OSL=64)
INFO Warmup: issuing samples...
INFO Warmup samples issued, waiting for responses to drain...
INFO Warmup complete
INFO Running...
```

## Running against a real endpoint

### Prerequisites

```bash
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your Hugging Face cache, e.g. ~/.cache/huggingface>
```

Download the model before launching so vLLM can reuse the local cache:

```bash
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct
```

### Launch a vLLM server

The `--trust-request-chat-template` flag is required because the CNN DailyMail dataset
sends requests with a custom chat template.

```bash
docker run --runtime nvidia --gpus all \
  -v ${HF_HOME}:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --trust-request-chat-template
```

### Offline benchmark with warmup

```bash
inference-endpoint benchmark from-config -c examples/09_Warmup_Example/warmup_offline.yaml
```

### Online benchmark with warmup

```bash
inference-endpoint benchmark from-config -c examples/09_Warmup_Example/warmup_online.yaml
```

## Tuning warmup parameters

| Parameter           | Guidance                                                                  |
| ------------------- | ------------------------------------------------------------------------- |
| `num_samples`       | Use enough to saturate the KV cache; 32–128 is typical for small models    |
| `input_seq_length`  | Match the ISL distribution of your real workload                           |
| `output_seq_length` | Match the OSL distribution; lower values make warmup finish faster         |
| `range_ratio`       | `1.0` = fixed ISL; `0.8`–`0.9` adds light variance for broader coverage    |
| `random_seed`       | Change to vary which token sequences are generated                         |
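As a quick check of the `range_ratio` row, the ISL bounds implied by a given setting can be computed directly (a standalone illustration, not part of the tool):

```python
def isl_bounds(input_seq_length: int, range_ratio: float) -> tuple[int, int]:
    """Return the (min, max) input sequence length for a given range_ratio."""
    return int(input_seq_length * range_ratio), input_seq_length


print(isl_bounds(256, 1.0))  # (256, 256): fixed ISL
print(isl_bounds(256, 0.9))  # (230, 256): light variance
print(isl_bounds(256, 0.8))  # (204, 256): broader coverage
```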
Lines changed: 62 additions & 0 deletions

@@ -0,0 +1,62 @@
# Offline Throughput Benchmark with Warmup Phase
#
# The warmup phase issues randomly generated requests before the timed
# performance window starts. This primes the endpoint by:
# - Establishing and reusing TCP connections
# - Filling KV caches to steady-state
# - Triggering JIT compilation in the inference runtime
#
# Warmup samples are excluded from all reported metrics.
name: "warmup-offline-qwen2.5-0.5b"
version: "1.0"
type: "offline"

# Warmup configuration: runs before the timed performance test.
# Uses randomly generated token sequences; no real dataset required.
warmup:
  num_samples: 64         # Number of warmup requests to issue
  input_seq_length: 256   # ISL: target input sequence length in tokens
  output_seq_length: 64   # OSL: max_new_tokens for warmup requests
  range_ratio: 0.9        # ISL variance: generates ISL in [256*0.9, 256]
  random_seed: 42

model_params:
  name: "Qwen/Qwen2.5-0.5B-Instruct"
  temperature: 0.0
  top_p: 1.0
  max_new_tokens: 128

datasets:
  - name: cnn_dailymail::llama3_8b
    type: "performance"
    samples: 18
    parser:
      input: prompt

settings:
  runtime:
    min_duration_ms: 60000    # 1 minute
    max_duration_ms: 360000   # 6 minutes
    scheduler_random_seed: 137
    dataloader_random_seed: 111
    n_samples_to_issue: 4

load_pattern:
  type: "max_throughput"

client:
  workers: 4

metrics:
  collect:
    - "throughput"
    - "latency"
    - "ttft"
    - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_key: null

report_dir: logs/warmup_offline_fixed
Lines changed: 63 additions & 0 deletions

@@ -0,0 +1,63 @@
# Online (Sustained QPS) Benchmark with Warmup Phase
#
# The warmup phase issues randomly generated requests before the timed
# performance window starts. This primes the endpoint by:
# - Establishing and reusing TCP connections
# - Filling KV caches to steady-state
# - Triggering JIT compilation in the inference runtime
#
# Warmup samples are excluded from all reported metrics.
name: "warmup-online-qwen2.5-0.5b"
version: "1.0"
type: "online"

# Warmup configuration: runs before the timed performance test.
# Uses randomly generated token sequences; no real dataset required.
warmup:
  num_samples: 32         # Number of warmup requests to issue
  input_seq_length: 128   # ISL: target input sequence length in tokens
  output_seq_length: 32   # OSL: max_new_tokens for warmup requests
  range_ratio: 0.8        # ISL variance: generates ISL in [128*0.8, 128]
  random_seed: 42

model_params:
  name: "Qwen/Qwen2.5-0.5B-Instruct"
  temperature: 0.0
  top_p: 1.0
  max_new_tokens: 128
  streaming: "on"

datasets:
  - name: cnn_dailymail::llama3_8b
    type: "performance"
    samples: 13368
    parser:
      input: prompt

settings:
  runtime:
    min_duration_ms: 60000    # 1 minute
    max_duration_ms: 360000   # 6 minutes
    scheduler_random_seed: 137
    dataloader_random_seed: 111

load_pattern:
  type: "poisson"
  target_qps: 10.0

client:
  workers: 4

metrics:
  collect:
    - "throughput"
    - "latency"
    - "ttft"
    - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_key: null

report_dir: logs/warmup_online

src/inference_endpoint/commands/benchmark.py

Lines changed: 34 additions & 0 deletions

@@ -451,6 +451,39 @@ def _run_benchmark(
         logger.info("Streaming: disabled (auto, offline mode)")
         config.model_params.streaming = StreamingMode.OFF

+    # Build warmup dataset if configured
+    warmup_dataset = None
+    if config.warmup is not None:
+        if tokenizer is None:
+            raise InputValidationError(
+                "A tokenizer is required to generate the warmup dataset. Ensure model_params.name is set."
+            )
+        from inference_endpoint.dataset_manager.predefined.random import RandomDataset
+
+        warmup_cfg = config.warmup
+        warmup_df = RandomDataset.generate(
+            datasets_dir=None,
+            force=False,
+            num_sequences=warmup_cfg.num_samples,
+            input_seq_length=warmup_cfg.input_seq_length,
+            range_ratio=warmup_cfg.range_ratio,
+            random_seed=warmup_cfg.random_seed,
+            tokenizer=tokenizer,
+        )
+        warmup_dataset = RandomDataset(warmup_df)
+        warmup_model_params = ModelParams(
+            name=config.model_params.name,
+            max_new_tokens=warmup_cfg.output_seq_length,
+        )
+        warmup_dataset.load(
+            api_type=config.endpoint_config.api_type,
+            model_params=warmup_model_params,
+        )
+        logger.info(
+            f"Warmup dataset ready: {warmup_dataset.num_samples()} samples "
+            f"(ISL={warmup_cfg.input_seq_length}, OSL={warmup_cfg.output_seq_length})"
+        )

     # Get dataset - from CLI or from config
     # TODO: Dataset Logic is not yet fully implemented

@@ -609,6 +642,7 @@ def _run_benchmark(
         dataloader,
         sample_issuer,
         scheduler,
+        warmup_dataset=warmup_dataset,
         name=f"cli_benchmark_{uuid.uuid4().hex[0:8]}",
         report_dir=report_dir,
         tokenizer_override=tokenizer,

src/inference_endpoint/config/schema.py

Lines changed: 24 additions & 0 deletions

@@ -234,6 +234,29 @@ class AccuracyConfig(BaseModel):
     num_repeats: int = 1


+class WarmupConfig(BaseModel):
+    """Configuration for the warmup phase using randomly generated data.
+
+    The warmup phase runs before the timed performance test to prime the
+    endpoint (warm TCP connections, fill KV caches, trigger JIT compilation).
+    Uses randomly generated token sequences with configurable ISL and OSL.
+
+    Fields:
+        num_samples: Number of warmup queries to issue.
+        input_seq_length: Target input sequence length in tokens (ISL).
+        output_seq_length: Max output tokens for warmup requests (OSL).
+        range_ratio: ISL variance factor in [0.0, 1.0]. Generates ISL in
+            the range [input_seq_length * range_ratio, input_seq_length].
+        random_seed: Seed for reproducible warmup data generation.
+    """
+
+    num_samples: int = 100
+    input_seq_length: int = 512
+    output_seq_length: int = 128
+    range_ratio: float = 1.0
+    random_seed: int = 42
+
+
 class RuntimeConfig(BaseModel):
     """Runtime configuration from YAML (user-facing).

@@ -392,6 +415,7 @@ class BenchmarkConfig(BaseModel):
     # - True = auto (compute optimal NUMA-aware plan)
     # - False = disabled (no CPU pinning)
     enable_cpu_affinity: bool = True
+    warmup: WarmupConfig | None = None

     @classmethod
     def from_yaml_file(cls, path: Path) -> BenchmarkConfig:
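Since every `WarmupConfig` field has a default, a bare `warmup:` block in the YAML yields a usable config. A minimal standalone mimic of that behavior, using a plain dataclass instead of the real pydantic `BaseModel` and adding a bounds check on `range_ratio` matching the docstring's stated [0.0, 1.0] range (hypothetical, for illustration only):

```python
from dataclasses import dataclass


@dataclass
class WarmupConfigSketch:
    # Defaults mirror the WarmupConfig fields in the diff above.
    num_samples: int = 100
    input_seq_length: int = 512
    output_seq_length: int = 128
    range_ratio: float = 1.0
    random_seed: int = 42

    def __post_init__(self) -> None:
        # The docstring describes range_ratio as a variance factor in [0.0, 1.0].
        if not 0.0 <= self.range_ratio <= 1.0:
            raise ValueError(
                f"range_ratio must be in [0.0, 1.0], got {self.range_ratio}"
            )


cfg = WarmupConfigSketch()                                   # all defaults
custom = WarmupConfigSketch(num_samples=64, range_ratio=0.9)  # partial override
```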

src/inference_endpoint/load_generator/__init__.py

Lines changed: 2 additions & 0 deletions

@@ -29,6 +29,7 @@
     PoissonDistributionScheduler,
     SampleOrder,
     Scheduler,
+    SequentialSampleOrder,
     WithoutReplacementSampleOrder,
     WithReplacementSampleOrder,
 )

@@ -46,6 +47,7 @@
     "MaxThroughputScheduler",
     "PoissonDistributionScheduler",
     "SampleOrder",
+    "SequentialSampleOrder",
     "WithReplacementSampleOrder",
     "WithoutReplacementSampleOrder",
     "LoadGenerator",

src/inference_endpoint/load_generator/scheduler.py

Lines changed: 18 additions & 0 deletions

@@ -168,6 +168,24 @@ def next_sample_index(self) -> int:
         return self.rng.randint(0, self.n_samples_in_dataset - 1)


+class SequentialSampleOrder(SampleOrder):
+    """Sample ordering without randomness.
+
+    Issues dataset rows in their natural order and wraps around if more samples
+    are requested than the dataset contains.
+    """
+
+    def next_sample_index(self) -> int:
+        """Get next sample index in dataset order.
+
+        Returns:
+            Sample index from dataset.
+        """
+        if self.n_samples_in_dataset <= 0:
+            raise IndexError("Cannot issue samples from an empty dataset")
+        return self._issued_samples % self.n_samples_in_dataset
+
+
 def uniform_delay_fn(
     max_delay_ns: int = 0, rng: random.Random | None = None
 ) -> Callable[[], float]:
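The wrap-around behavior of the new `SequentialSampleOrder` can be sketched standalone. This mimic assumes the real `SampleOrder` base class advances `_issued_samples` after each call, so the counter is incremented inline here instead:

```python
class SequentialOrderSketch:
    """Standalone mimic of SequentialSampleOrder's index cycling."""

    def __init__(self, n_samples_in_dataset: int) -> None:
        self.n_samples_in_dataset = n_samples_in_dataset
        self._issued_samples = 0

    def next_sample_index(self) -> int:
        if self.n_samples_in_dataset <= 0:
            raise IndexError("Cannot issue samples from an empty dataset")
        index = self._issued_samples % self.n_samples_in_dataset
        self._issued_samples += 1  # done by the base class in the real code
        return index


order = SequentialOrderSketch(3)
indices = [order.next_sample_index() for _ in range(5)]  # wraps: [0, 1, 2, 0, 1]
```

Compared with the random orders, the modulo makes issuance deterministic: every dataset row is visited once per cycle, which is exactly what a warmup pass wants.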
