| sidebar-title |
|---|
| Fixed Schedule Benchmarking |
Fixed schedule benchmarking provides precise timing control by executing requests at specific timestamps. This mode is ideal for simulating exact traffic patterns, testing temporal performance characteristics, and reproducing time-sensitive scenarios.
Fixed schedule mode enables:
- **Precise Timing**: Execute requests at exact millisecond intervals
- **Traffic Simulation**: Replicate real-world traffic patterns
- **Performance Analysis**: Identify how response times vary with request timing
- **Load Testing**: Test system behavior under controlled temporal stress patterns
Fixed schedule files use JSONL format with timestamp-based entries:

```json
{"timestamp": 0, "input_length": 100, "output_length": 200, "hash_ids": [1001]}
{"timestamp": 500, "input_length": 200, "output_length": 400, "hash_ids": [1002]}
{"timestamp": 1000, "input_length": 550, "output_length": 500, "hash_ids": [1003, 1005]}
```

**Field Descriptions:**

- `timestamp`: Milliseconds from schedule start when the request should be sent
- `input_length`: Number of tokens in the input prompt
- `input_text`: Exact text to send in the request (provided instead of `input_length`)
- `output_length`: Maximum number of tokens in the response (optional)
- `hash_ids`: Hash block identifiers to simulate text reuse with 512-token blocks (optional)
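Hand-writing long JSONL schedules is tedious, so they are often generated. The snippet below is an illustrative sketch (the file name and field values are arbitrary, not prescribed by AIPerf) that emits a 10-entry schedule with one request every 250 ms:

```shell
# Illustrative: generate a fixed schedule with evenly spaced timestamps.
# Values (250 ms spacing, 128/64 token lengths, hash_ids 4000+) are arbitrary.
for i in $(seq 0 9); do
  printf '{"timestamp": %d, "input_length": 128, "output_length": 64, "hash_ids": [%d]}\n' \
    "$((i * 250))" "$((4000 + i))"
done > generated_schedule.jsonl
```

Any spacing pattern can be produced the same way, e.g. bursts followed by idle gaps, by computing the timestamp per entry instead of using a fixed stride.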
```shell
# Start vLLM server for fixed schedule testing
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Qwen/Qwen3-0.6B \
  --host 0.0.0.0 --port 8000 &

# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }
```

{/* aiperf-run-vllm-default-openai-endpoint-server */}
```shell
# Create a fixed schedule with precise timing
cat > precise_schedule.jsonl << 'EOF'
{"timestamp": 0, "input_length": 100, "hash_ids": [3001]}
{"timestamp": 500, "input_length": 200, "hash_ids": [3002]}
{"timestamp": 750, "input_length": 150, "hash_ids": [3003]}
{"timestamp": 1000, "input_length": 300, "hash_ids": [3004]}
{"timestamp": 1250, "input_length": 180, "hash_ids": [3005]}
{"timestamp": 2000, "input_length": 400, "hash_ids": [3006]}
{"timestamp": 2500, "input_length": 250, "hash_ids": [3007]}
{"timestamp": 3000, "input_length": 350, "hash_ids": [3008]}
{"timestamp": 4000, "input_length": 500, "hash_ids": [3009]}
{"timestamp": 5000, "input_length": 600, "hash_ids": [3010, 3050]}
EOF
```
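Before running a benchmark, it can help to sanity-check a schedule's timing. The snippet below is an illustrative check (not part of AIPerf; the demo file and its values are made up for the example) that prints the gap between consecutive requests and the total span:

```shell
# Illustrative sanity check: print inter-request gaps and total span of a
# schedule file. Uses a small demo file so the snippet is self-contained.
cat > demo_schedule.jsonl << 'EOF'
{"timestamp": 0, "input_length": 100, "hash_ids": [1]}
{"timestamp": 500, "input_length": 200, "hash_ids": [2]}
{"timestamp": 1250, "input_length": 150, "hash_ids": [3]}
EOF

# Split each line on the "timestamp": key and read the numeric value.
awk -F'"timestamp": ' '{
  split($2, a, ","); ts = a[1] + 0
  if (NR > 1) print "gap:", ts - prev, "ms"
  prev = ts
} END { print "total span:", ts, "ms" }' demo_schedule.jsonl
```

For the demo file above this prints gaps of 500 ms and 750 ms and a total span of 1250 ms, which is a quick way to confirm the schedule matches the traffic pattern you intended.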
```shell
# Run basic fixed schedule benchmarking
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --url localhost:8000 \
  --input-file precise_schedule.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --fixed-schedule-auto-offset
```

{/* /aiperf-run-vllm-default-openai-endpoint-server */}
**Sample Output (Successful Run):**

```
INFO Starting AIPerf System
INFO Using Fixed Schedule mode with auto-offset
INFO Loaded 10 entries from precise_schedule.jsonl
INFO Schedule duration: 5.0 seconds
INFO AIPerf System is PROFILING
Profiling: 10/10 |████████████████████████| 100% [00:05<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-fixed-schedule/

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric                     ┃ avg    ┃ min    ┃ max    ┃ p99    ┃ p50    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms)       │ 345.67 │ 234.56 │ 498.12 │ 476.34 │ 338.90 │
│ Time to First Token (ms)   │  78.45 │  52.34 │ 112.67 │ 108.23 │  76.12 │
│ Inter Token Latency (ms)   │  15.23 │  11.45 │  22.34 │  21.12 │  14.89 │
│ Request Throughput (req/s) │   2.89 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-fixed-schedule/profile_export_aiperf.json
```
**Key Parameters:**

- `--fixed-schedule-auto-offset`: Automatically shifts all timestamps so the schedule starts from 0
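Conceptually, auto-offset subtracts the earliest timestamp from every entry. The snippet below is a sketch of that idea only, not AIPerf's implementation; the file names and values are illustrative:

```shell
# Illustrative sketch of what auto-offset does conceptually: shift every
# timestamp so the earliest entry starts at 0. Not AIPerf's actual code.
cat > raw_schedule.jsonl << 'EOF'
{"timestamp": 1000, "input_length": 100, "hash_ids": [1]}
{"timestamp": 1500, "input_length": 200, "hash_ids": [2]}
EOF

awk -F'"timestamp": ' '
  NR == 1 { split($2, a, ","); first = a[1] + 0 }   # earliest timestamp
  { split($2, a, ",")
    sub("\"timestamp\": " a[1], "\"timestamp\": " (a[1] - first))
    print }' raw_schedule.jsonl > offset_schedule.jsonl
```

This is useful when a schedule was captured from real traffic and its timestamps are absolute epoch times rather than offsets from zero.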
Execute only a portion of the schedule using start and end offsets:
{/* aiperf-run-vllm-default-openai-endpoint-server */}
```shell
# Execute schedule from 2s to 4s window
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --url localhost:8000 \
  --input-file precise_schedule.jsonl \
  --custom-dataset-type mooncake_trace \
  --fixed-schedule \
  --fixed-schedule-start-offset 2000 \
  --fixed-schedule-end-offset 4000
```

{/* /aiperf-run-vllm-default-openai-endpoint-server */}
**Sample Output (Successful Run):**

```
INFO Starting AIPerf System
INFO Using Fixed Schedule mode with time window [2000ms - 4000ms]
INFO Loaded 10 entries from precise_schedule.jsonl
INFO Filtered to 2 entries within time window
INFO Schedule duration: 2.0 seconds
INFO AIPerf System is PROFILING
Profiling: 2/2 |████████████████████████| 100% [00:02<00:00]
INFO Benchmark completed successfully
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-fixed-schedule/

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric                     ┃ avg    ┃ min    ┃ max    ┃ p99    ┃ p50    ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms)       │ 389.45 │ 312.67 │ 466.23 │ 466.23 │ 389.45 │
│ Time to First Token (ms)   │  89.12 │  71.34 │ 106.90 │ 106.90 │  89.12 │
│ Inter Token Latency (ms)   │  16.78 │  14.23 │  19.34 │  19.34 │  16.78 │
│ Request Throughput (req/s) │   1.45 │      - │      - │      - │      - │
└────────────────────────────┴────────┴────────┴────────┴────────┴────────┘

JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-fixed-schedule/profile_export_aiperf.json
```
**Windowing Parameters:**

- `--fixed-schedule-start-offset 2000`: Start execution at the 2000ms timestamp
- `--fixed-schedule-end-offset 4000`: End execution at the 4000ms timestamp
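The windowing behavior amounts to keeping only the entries whose timestamps fall inside the offset range. The snippet below sketches that filter for illustration; it is not AIPerf's implementation, the demo file is made up, and inclusive boundaries are an assumption here:

```shell
# Illustrative sketch of schedule windowing: keep entries whose timestamp
# falls inside [start, end]. Boundary inclusivity is assumed, not confirmed.
cat > full_schedule.jsonl << 'EOF'
{"timestamp": 0, "input_length": 100, "hash_ids": [1]}
{"timestamp": 1000, "input_length": 150, "hash_ids": [2]}
{"timestamp": 2500, "input_length": 200, "hash_ids": [3]}
{"timestamp": 3500, "input_length": 250, "hash_ids": [4]}
{"timestamp": 5000, "input_length": 300, "hash_ids": [5]}
EOF

start=2000
end=4000
awk -F'"timestamp": ' -v s="$start" -v e="$end" \
  '{ split($2, a, ","); if (a[1] + 0 >= s && a[1] + 0 <= e) print }' \
  full_schedule.jsonl > windowed_schedule.jsonl
```

With the demo file above, only the 2500ms and 3500ms entries survive the 2000-4000ms window, which is handy for previewing what a windowed run will actually execute.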
- Custom Prompt Benchmarking - For sending custom prompts without timing control
- Time-based Benchmarking - For duration-based testing
- Request Cancellation - For timeout testing