
Commit 7a30fa8

[Doc] Clarify cudagraph capture size logic and default behavior in scheduler (#18698)

Authored by Zazzle516 and hmellor

Signed-off-by: Zazzle516 <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>

1 parent f82f7a8 commit 7a30fa8

1 file changed: +29 -19 lines changed

vllm/config/__init__.py

Lines changed: 29 additions & 19 deletions
@@ -3579,30 +3579,40 @@ def update_sizes_for_sequence_parallelism(self,
 
     def _set_cudagraph_sizes(self):
         """
-        cudagraph batchsize padding logic:
+        vLLM defines the default candidate list of batch sizes for CUDA graph
+        capture as:
 
-        `[1, 2, 4] + [8 * i for i in range(1, 1025)]` is a list of all possible
-        batch sizes that cudagraph will capture.
-
-        Depending on the engine's configuration of `max_num_seqs`, the
-        candidate batch sizes to capture cudagraph will shrink to the subset
-        which just cover the range of `[1, max_num_seqs]`. In the common case,
-        `max_num_seqs` is 256, and the cudagraph batch sizes will be
-        `[1, 2, 4, 8, 16, 24, 32, 40, ..., 256]`.
-
-        However, if users specify the cudagraph capture sizes through
-        compilation config, we will use the specified sizes instead.
+        ```python
+        max_graph_size = min(max_num_seqs * 2, 512)
+        # 1, 2, 4, then multiples of 8 up to max_graph_size
+        cuda_graph_sizes = [1, 2, 4, 8, 16, 24, 32, 40, ..., max_graph_size]
+        ```
 
         In the end, `vllm_config.compilation_config.cudagraph_capture_sizes`
         will be the final sizes to capture cudagraph (in descending order).
 
-        During runtime, if batchsize is larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        no cudagraph will be used.
-        If the batch size is no larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        we can quickly find the padded graph size for a given batch size by
-        looking up `vllm_config.compilation_config.bs_to_padded_graph_size`.
+        These sizes are used to capture and reuse CUDA graphs for
+        performance-critical paths (e.g., decoding). Capturing enables
+        significantly faster kernel dispatch by avoiding Python overhead. The
+        list is then filtered based on `max_num_batched_tokens` (e.g., 8192 on
+        most GPUs), which controls the total allowed number of tokens in a
+        batch. Since each sequence may have a variable number of tokens, the
+        maximum usable batch size will depend on actual sequence lengths.
+
+        Example:
+            With `max_num_batched_tokens = 8192`, and typical sequences
+            averaging ~32 tokens, most practical batch sizes fall below 256.
+            However, the system will still allow capture sizes up to 512 if
+            shape and memory permit.
+
+        Note:
+            If users explicitly specify cudagraph capture sizes in the
+            compilation config, those will override this default logic.
+            At runtime:
+
+            - If batch size <= one of the `cudagraph_capture_sizes`, the closest
+              padded CUDA graph will be used.
+            - If batch size > largest `cudagraph_capture_sizes`, cudagraph will
+              not be used.
         """
 
         # calculate the default `batch_size_capture_list`
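
To make the logic in the new docstring concrete, here is a minimal, self-contained sketch of the default capture-size selection and the runtime padding lookup. It only mirrors what the docstring above states: `default_capture_sizes` and `padded_graph_size` are hypothetical helper names invented for illustration, and the `bisect` lookup stands in for the `bs_to_padded_graph_size` table that vLLM builds internally (the sketch keeps the list ascending so `bisect` applies directly, whereas the final `cudagraph_capture_sizes` is stored in descending order).

```python
# Illustrative sketch of the capture-size logic described in the docstring.
# Helper names are hypothetical; the bisect lookup approximates vLLM's
# internal `bs_to_padded_graph_size` table.
import bisect


def default_capture_sizes(max_num_seqs: int) -> list[int]:
    """Candidate batch sizes: 1, 2, 4, then multiples of 8 up to the cap."""
    max_graph_size = min(max_num_seqs * 2, 512)
    sizes = [s for s in (1, 2, 4) if s <= max_graph_size]
    sizes += list(range(8, max_graph_size + 1, 8))
    # Per the docstring, the real list may be filtered further (e.g. by
    # `max_num_batched_tokens`) or replaced by user-specified sizes.
    return sizes


def padded_graph_size(batch_size: int, capture_sizes: list[int]) -> int | None:
    """Smallest captured size >= batch_size, or None (no cudagraph; eager run)."""
    idx = bisect.bisect_left(capture_sizes, batch_size)
    return capture_sizes[idx] if idx < len(capture_sizes) else None


sizes = default_capture_sizes(max_num_seqs=256)
print(sizes[:8], "...", sizes[-1])    # [1, 2, 4, 8, 16, 24, 32, 40] ... 512
print(padded_graph_size(33, sizes))   # 40   -> runs under the size-40 graph
print(padded_graph_size(513, sizes))  # None -> larger than every captured size
```

With `max_num_seqs = 256` this reproduces the `[1, 2, 4, 8, 16, 24, 32, 40, ..., 512]` list from the docstring, and the two lookups illustrate the runtime rules in the Note: a batch no larger than some captured size is padded up to the closest one, while a batch larger than the largest captured size falls back to running without a CUDA graph.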
