@@ -3579,30 +3579,40 @@ def update_sizes_for_sequence_parallelism(self,

    def _set_cudagraph_sizes(self):
        """
-        cudagraph batchsize padding logic:
+        vLLM defines the default candidate list of batch sizes for CUDA graph
+        capture as:

-        `[1, 2, 4] + [8 * i for i in range(1, 1025)]` is a list of all possible
-        batch sizes that cudagraph will capture.
-
-        Depending on the engine's configuration of `max_num_seqs`, the
-        candidate batch sizes to capture cudagraph will shrink to the subset
-        which just cover the range of `[1, max_num_seqs]`. In the common case,
-        `max_num_seqs` is 256, and the cudagraph batch sizes will be
-        `[1, 2, 4, 8, 16, 24, 32, 40, ..., 256]`.
-
-        However, if users specify the cudagraph capture sizes through
-        compilation config, we will use the specified sizes instead.
+        ```python
+        max_graph_size = min(max_num_seqs * 2, 512)
+        # 1, 2, 4, then multiples of 8 up to max_graph_size
+        cuda_graph_sizes = [1, 2, 4, 8, 16, 24, 32, 40, ..., max_graph_size]
+        ```

        In the end, `vllm_config.compilation_config.cudagraph_capture_sizes`
        will be the final sizes to capture cudagraph (in descending order).

-        During runtime, if batchsize is larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        no cudagraph will be used.
-        If the batch size is no larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        we can quickly find the padded graph size for a given batch size by
-        looking up `vllm_config.compilation_config.bs_to_padded_graph_size`.
+        These sizes are used to capture and reuse CUDA graphs for
+        performance-critical paths (e.g., decoding). Capturing enables
+        significantly faster kernel dispatch by avoiding Python overhead. The
+        list is then filtered based on `max_num_batched_tokens` (e.g., 8192 on
+        most GPUs), which controls the total allowed number of tokens in a
+        batch. Since each sequence may have a variable number of tokens, the
+        maximum usable batch size will depend on actual sequence lengths.
+
+        Example:
+            With `max_num_batched_tokens = 8192`, and typical sequences
+            averaging ~32 tokens, most practical batch sizes fall below 256.
+            However, the system will still allow capture sizes up to 512 if
+            shape and memory permit.
+
+        Note:
+            If users explicitly specify cudagraph capture sizes in the
+            compilation config, those will override this default logic.
+            At runtime:
+
+            - If batch size <= one of the `cudagraph_capture_sizes`, the closest
+              padded CUDA graph will be used.
+            - If batch size > largest `cudagraph_capture_sizes`, cudagraph will
+              not be used.
        """

        # calculate the default `batch_size_capture_list`
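The default construction the new docstring describes can be written out concretely. This is a minimal sketch, assuming the `min(max_num_seqs * 2, 512)` cap shown in the added docstring; `candidate_capture_sizes` is a hypothetical helper name for illustration, not a function from this file:

```python
def candidate_capture_sizes(max_num_seqs: int) -> list[int]:
    # Cap the largest captured graph at twice the max number of
    # concurrent sequences, but never above 512 (per the docstring).
    max_graph_size = min(max_num_seqs * 2, 512)
    # 1, 2, 4, then every multiple of 8, trimmed to the cap.
    candidates = [1, 2, 4] + [8 * i for i in range(1, 65)]
    return [size for size in candidates if size <= max_graph_size]

print(candidate_capture_sizes(256))  # [1, 2, 4, 8, 16, 24, ..., 512]
```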
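The runtime padding behavior (kept in the new Note, and previously attributed to `bs_to_padded_graph_size` in the removed text) can likewise be sketched. `build_bs_to_padded_graph_size` is an illustrative helper, not vLLM's actual implementation; it only mirrors the behavior the docstring describes: pad up to the nearest captured size, fall back to eager mode above the largest one.

```python
def build_bs_to_padded_graph_size(capture_sizes: list[int]) -> dict[int, int]:
    # Precompute, for every batch size up to the largest captured size,
    # the smallest captured size that can hold it.
    ascending = sorted(capture_sizes)
    return {
        bs: next(size for size in ascending if size >= bs)
        for bs in range(1, ascending[-1] + 1)
    }

table = build_bs_to_padded_graph_size([1, 2, 4, 8, 16])
assert table[3] == 4    # a batch of 3 runs under the size-4 graph
assert 17 not in table  # beyond the largest capture: no cudagraph
```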