| sidebar-title |
|---|
| Request Cancellation Testing |
AIPerf supports request timeout and cancellation scenarios, which are useful for measuring the impact of user cancellations on server performance.
Request cancellation tests how inference servers handle client disconnections: each request is sent in full, and for a configurable percentage of them the client then disconnects before receiving the complete response.
T0: Request scheduled
│
│← Worker processing, connection acquired from pool
▼
T1: Start writing request to socket
│
│← HTTP headers + body transmitted
▼
T2: Request fully sent (cancellation timer starts here)
│
│← --request-cancellation-delay
▼
T3: Request cancelled if still waiting for response
The cancellation timer starts at T2 ("request fully sent") for two reasons:
- Realistic simulation: The server always receives the complete request before cancellation, just as it does when a real user closes their browser tab.
- Reproducibility: The delay is measured from a fixed point (request fully sent) rather than being affected by variable queue times or connection setup. This means running the same benchmark twice with `--request-cancellation-delay 0.5` will cancel requests at the same point in their lifecycle, regardless of system load.
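AIPerf's own client code is not reproduced in this guide, but the timing contract can be made concrete with a minimal asyncio sketch (function and variable names here are illustrative, not AIPerf's API). The key point is that the timer only starts after `drain()` confirms the request bytes have been flushed to the socket:

```python
import asyncio

async def send_then_cancel(request_bytes: bytes, delay: float,
                           host: str = "localhost", port: int = 8000) -> None:
    reader, writer = await asyncio.open_connection(host, port)  # T0 -> T1: connection acquired
    writer.write(request_bytes)   # T1: start writing the request to the socket
    await writer.drain()          # T2: request fully sent -- the cancellation timer starts HERE
    try:
        # Wait up to `delay` seconds for the response before giving up.
        await asyncio.wait_for(reader.read(), timeout=delay)
    except asyncio.TimeoutError:
        pass                      # T3: still waiting for the response -> cancel
    finally:
        writer.close()            # disconnect
        await writer.wait_closed()
```

Anchoring the timeout to `drain()` rather than to connection setup is what makes the cancellation point reproducible across runs.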
| Delay | Behavior |
|---|---|
| 0 | Disconnect immediately after the request is fully sent |
| 0.5 | Wait 0.5 seconds after sending, then disconnect |
| 5 | Wait 5 seconds after sending, then disconnect |
The delay parameter can be used to target different inference phases:
| Delay | Likely Cancelled During | Tests |
|---|---|---|
| 0 or very small | Prefill phase | Prefill worker cancellation, KV cache allocation cleanup |
| Longer delays | Generation phase | Decode worker cancellation, partial KV cache cleanup |
This is useful for testing how disaggregated architectures (separate prefill and decode workers) handle cancellations at different stages of request processing.
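One practical pattern is to sweep the delay across both sides of the typical Time to First Token, so cancellations land in prefill on one run and in generation on another. The sketch below uses only the flags documented on this page; the specific delay values are illustrative and should be chosen from a baseline run's TTFT:

```python
import subprocess

# Pick delays on both sides of the baseline TTFT (~0.23s in the sample
# output below) so cancellations land in prefill vs. generation.
for delay in ("0.0", "0.5", "2.0"):
    subprocess.run(
        [
            "aiperf", "profile",
            "--model", "Qwen/Qwen3-0.6B",
            "--endpoint-type", "chat",
            "--endpoint", "/v1/chat/completions",
            "--streaming",
            "--url", "localhost:8000",
            "--request-cancellation-rate", "30",
            "--request-cancellation-delay", delay,
            "--request-count", "40",
        ],
        check=True,  # stop the sweep if any run fails
    )
```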
# Start vLLM server
docker pull vllm/vllm-openai:latest
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 &

# Wait for server to be ready
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }

Test with a small percentage of cancelled requests:
{/* aiperf-run-vllm-default-openai-endpoint-server */}
# Profile with 10% request cancellation
aiperf profile \
--model Qwen/Qwen3-0.6B \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8000 \
--request-cancellation-rate 10 \
--request-cancellation-delay 0.5 \
--synthetic-input-tokens-mean 800 \
--synthetic-input-tokens-stddev 80 \
--output-tokens-mean 400 \
--output-tokens-stddev 40 \
--concurrency 8 \
--request-count 50 \
--warmup-request-count 5
{/* /aiperf-run-vllm-default-openai-endpoint-server */}
Sample Output (Successful Run):
INFO Starting AIPerf System
INFO Request cancellation enabled: 10.0% rate, 0.5s delay
INFO AIPerf System is WARMING UP
Warming Up: 5/5 |████████████████████████| 100% [00:04<00:00]
INFO Warmup completed, starting profiling phase
INFO AIPerf System is PROFILING
Profiling: 50/50 |████████████████████████| 100% [01:15<00:00]
INFO Benchmark completed successfully
INFO Cancelled requests: 5 (10.0%)
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency8/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 1234.56 │ 987.34 │ 1678.90 │ 1598.23 │ 1198.45 │
│ Time to First Token (ms) │ 234.56 │ 187.90 │ 298.34 │ 289.67 │ 228.12 │
│ Inter Token Latency (ms) │ 14.23 │ 11.45 │ 19.67 │ 18.90 │ 13.89 │
│ Output Token Count (tokens) │ 400.00 │ 360.00 │ 440.00 │ 438.00 │ 398.00 │
│ Request Throughput (req/s) │ 12.34 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency8/profile_export_aiperf.json
Parameters Explained:
- `--request-cancellation-rate 10`: Cancel 10% of requests (value between 0.0 and 100.0)
- `--request-cancellation-delay 0.5`: Wait 0.5 seconds after each selected request is fully sent, then cancel it
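The per-request details behind the `Cancelled requests` log line live in the JSON export shown above. The snippet below is a sketch only: the real schema of `profile_export_aiperf.json` may differ, so treat `records` and `error` as placeholder key names to adapt after inspecting your own export:

```python
import json

path = "artifacts/Qwen_Qwen3-0.6B-chat-concurrency8/profile_export_aiperf.json"
with open(path) as f:
    export = json.load(f)

# PLACEHOLDER field names -- check your export's actual schema.
records = export.get("records", [])
cancelled = [r for r in records if r.get("error")]
print(f"cancelled: {len(cancelled)} / {len(records)}")
```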
Test service resilience under frequent cancellations:
{/* aiperf-run-vllm-default-openai-endpoint-server */}
# Profile with 50% request cancellation
aiperf profile \
--model Qwen/Qwen3-0.6B \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8000 \
--request-cancellation-rate 50 \
--request-cancellation-delay 1.0 \
--synthetic-input-tokens-mean 1200 \
--output-tokens-mean 600 \
--concurrency 10 \
--request-count 40
{/* /aiperf-run-vllm-default-openai-endpoint-server */}
Sample Output (Successful Run):
INFO Starting AIPerf System
INFO Request cancellation enabled: 50.0% rate, 1.0s delay
INFO AIPerf System is PROFILING
Profiling: 40/40 |████████████████████████| 100% [01:30<00:00]
INFO Benchmark completed successfully
INFO Cancelled requests: 20 (50.0%)
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency10/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ Request Latency (ms) │ 1890.45 │ 1456.78 │ 2456.89 │ 2398.12 │ 1867.34 │
│ Time to First Token (ms) │ 345.67 │ 278.90 │ 456.23 │ 445.67 │ 338.45 │
│ Inter Token Latency (ms) │ 16.78 │ 13.45 │ 22.34 │ 21.56 │ 16.45 │
│ Output Token Count (tokens) │ 600.00 │ 540.00 │ 660.00 │ 658.00 │ 598.00 │
│ Request Throughput (req/s) │ 8.90 │ - │ - │ - │ - │
└─────────────────────────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency10/profile_export_aiperf.json
Test immediate disconnection where the client closes the connection right after sending the request:
{/* aiperf-run-vllm-default-openai-endpoint-server */}
# Profile with immediate cancellation (0 delay)
aiperf profile \
--model Qwen/Qwen3-0.6B \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8000 \
--request-cancellation-rate 30 \
--request-cancellation-delay 0.0 \
--synthetic-input-tokens-mean 500 \
--output-tokens-mean 100 \
--concurrency 15 \
--request-count 60
{/* /aiperf-run-vllm-default-openai-endpoint-server */}
Sample Output (Successful Run):
INFO Starting AIPerf System
INFO Request cancellation enabled: 30.0% rate, 0.0s delay (immediate)
INFO AIPerf System is PROFILING
Profiling: 60/60 |████████████████████████| 100% [00:45<00:00]
INFO Benchmark completed successfully
INFO Cancelled requests: 18 (30.0%)
INFO Results saved to: artifacts/Qwen_Qwen3-0.6B-chat-concurrency15/
NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric ┃ avg ┃ min ┃ max ┃ p99 ┃ p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Request Latency (ms) │ 678.90 │ 534.56 │ 898.12 │ 876.34 │ 667.89 │
│ Time to First Token (ms) │ 156.78 │ 123.45 │ 198.90 │ 192.34 │ 154.23 │
│ Inter Token Latency (ms) │ 12.45 │ 9.89 │ 16.78 │ 16.12 │ 12.23 │
│ Output Token Count (tokens) │ 100.00 │ 90.00 │ 110.00 │ 109.00 │ 99.00 │
│ Request Throughput (req/s) │ 23.45 │ - │ - │ - │ - │
└─────────────────────────────┴────────┴────────┴────────┴────────┴────────┘
JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency15/profile_export_aiperf.json
What happens with delay=0:
- The full request (headers + body) is sent to the server
- The client immediately disconnects after sending
- The server receives the complete request but the client won't read any response
- Tests how the server handles abandoned requests and cleans up resources
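As a concrete illustration of what the server sees, this raw-socket sketch reproduces the delay=0 pattern by hand. The endpoint and payload mirror the commands above, but this is an illustration, not AIPerf's implementation:

```python
import json
import socket

# Build one complete chat completion request by hand.
body = json.dumps({
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 100,
    "stream": True,
}).encode()

request = (
    b"POST /v1/chat/completions HTTP/1.1\r\n"
    b"Host: localhost:8000\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"Connection: close\r\n\r\n"
) + body

with socket.create_connection(("localhost", 8000)) as sock:
    sock.sendall(request)  # request fully sent (T2)
    # No recv(): leaving the `with` block closes the socket immediately,
    # so T3 coincides with T2 and the response is abandoned unread.
```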