use the same stream for cuda graph capture and replay for NCCL #29207
Conversation
Signed-off-by: Amir Samani <[email protected]>
Code Review
This pull request makes the CUDA stream explicit when capturing CUDA graphs, which is a good practice for clarity and correctness. I've found one area for improvement in vllm/compilation/cuda_graph.py regarding the use of torch.cuda.current_stream(), where a more performant, cached version from vLLM's utilities should be used instead.
```python
with torch.cuda.graph(
    cudagraph,
    pool=self.graph_pool,
    stream=torch.cuda.current_stream(),
):
```
For performance reasons, it's better to use current_stream from vllm.utils.torch_utils instead of torch.cuda.current_stream(). The vLLM version is a cached version that avoids the overhead of creating a new stream object on each call, as documented in vllm/utils/torch_utils.py.
You'll need to update the import on line 20:
```python
from vllm.utils.torch_utils import current_stream, weak_ref_tensors
```

Suggested change:

```diff
 with torch.cuda.graph(
     cudagraph,
     pool=self.graph_pool,
-    stream=torch.cuda.current_stream(),
+    stream=current_stream(),
 ):
```
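For illustration only, here is a minimal sketch of what such a cached helper could look like. This is not the actual vllm.utils.torch_utils implementation (which also has to deal with stream switches and multiple devices); it only shows the caching idea:

```python
from typing import Optional

import torch

_cached_stream: Optional[torch.cuda.Stream] = None

def current_stream() -> torch.cuda.Stream:
    # Cache the Stream wrapper so repeated calls avoid torch.cuda.current_stream()
    # constructing a new Python Stream object every time.
    global _cached_stream
    if _cached_stream is None:
        _cached_stream = torch.cuda.current_stream()
    return _cached_stream
```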
Signed-off-by: Amir Samani <[email protected]>
Signed-off-by: Amir Samani <[email protected]>
Signed-off-by: Amir Samani <[email protected]>
vllm/compilation/cuda_graph.py
Outdated
```python
with torch.cuda.graph(
    cudagraph,
    pool=self.graph_pool,
    stream=torch.cuda.current_stream(),
```
iiuc, graph capture should happen on a side stream instead of current main stream?
We can add self.stream = torch.cuda.Stream() in the __init__ of CUDAGraphWrapper, and use this stream for both warmup and graph capture.
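As a rough sketch of that suggestion (the class and method names below are simplified placeholders, not the actual vLLM CUDAGraphWrapper):

```python
import torch

class GraphWrapperSketch:
    def __init__(self, runnable, graph_pool):
        self.runnable = runnable
        self.graph_pool = graph_pool
        # One dedicated side stream, reused for both warm-up and capture.
        self.stream = torch.cuda.Stream()

    def warmup(self, *args):
        # Warm-up runs on the same side stream that capture will use.
        with torch.cuda.stream(self.stream):
            self.runnable(*args)

    def capture(self, *args) -> torch.cuda.CUDAGraph:
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph, pool=self.graph_pool, stream=self.stream):
            self.runnable(*args)
        return graph
```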
isn't the current_stream here already a non-default stream shared with warm-up iterations?
do you have a code pointer to a non-default stream shared with warm-up iterations? Looks like in cuda_graph.py, there is no explicit cuda stream.
I will check the flow again to see where a context manager sets a new stream.
```diff
 set_graph_pool_id(current_platform.graph_pool_handle())
 # mind-exploding: carefully manage the reference and memory.
-with torch.cuda.graph(cudagraph, pool=self.graph_pool):
+with torch.cuda.graph(
```
@Amir-19 what's the context behind this PR?
@zou3519 this came out as the result of investigating #28901. With NCCL 2.28+, window registration during cuda graph capture crashes with NCCL complaining about the registration. Since ncclMemAlloc is tied to a stream, running warm up and cuda graph capture on separate streams causes new memory allocations and thus window registrations. In this PR, we explicitly set the stream for cuda graph capture, forcing it to be on the same stream as the warm up iterations. Before this PR, cuda graph didn't have an explicit stream, so it would create a new side stream; is that intentional?
> before this PR, cuda graph didn't have an explicit stream so it would create a new side stream, is that intentional?

Yes. torch.cuda.graph(...) automatically creates a side stream, unless it is called with an explicit stream, i.e. torch.cuda.graph(..., stream=explicit_stream).

> With NCCL 2.28+, window registration during cuda graph capture crashes with NCCL complaining about the registration.

Could you elaborate on what window registration is?

> we explicitly set the stream for cuda graph capture and forcing it to be on the same stream as warm up iterations

In general, warm up on one stream and graph capture on another stream is fine, aside from some extra memory consumption. So using the same stream for warmup and capture is an optimization.
However, using different streams should not lead to an error. Could you elaborate a bit on why it errors? e.g., is window registration a cudagraph-unsafe op?
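To make the two behaviours concrete, here is a small standalone example (assumes a CUDA device; not taken from the vLLM code base):

```python
import torch

x = torch.zeros(8, device="cuda")
g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()

# 1) No stream argument: torch.cuda.graph() internally switches to a side stream
#    of its own for the duration of the capture.
with torch.cuda.graph(g1):
    x += 1

# 2) Explicit stream: capture runs on the stream we choose, so warm-up and
#    capture can share the same (non-default) stream.
side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    x += 1  # warm-up work on the side stream
with torch.cuda.graph(g2, stream=side_stream):
    x += 1  # capture on the same side stream
```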
@BoyuanFeng ncclCommWindowRegister is used to register local buffers into an NCCL window, which enables us to use symmetric kernels. This window registration also requires the memory to come from a VMM-based allocator like ncclMemAlloc. Since memory allocated using ncclMemAlloc is tied to a stream, when you use the mempool associated with ncclMemAlloc and ncclCommWindowRegister on different streams and there are no available segments, you need new allocations and thus new registrations.
Intuitively, graph capture means "do this every time the graph replays", so even if registration were allowed during cuda graph capture, it would lead to creating new window handles on every replay, which would have to be destroyed later. This is neither efficient nor useful. The proper pattern is to register the window once before capture, then reuse it.
Starting with NCCL 2.28, there is a restriction that ncclCommWindowRegister must not be called during graph capture, which caused the failure reported in #28901.
To fix this, we need to make sure that the warm up and cuda graph capture are on the same side stream.
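As a schematic of that ordering (the two helpers below are hypothetical placeholders, not real NCCL or vLLM Python APIs): allocate and register once on the side stream before capture, and only reuse the buffer inside the captured region:

```python
import torch

def alloc_symmetric(nbytes: int) -> torch.Tensor:
    # Placeholder standing in for an ncclMemAlloc-backed allocation (hypothetical).
    return torch.empty(nbytes, dtype=torch.uint8, device="cuda")

def register_nccl_window(buf: torch.Tensor) -> object:
    # Placeholder standing in for ncclCommWindowRegister (hypothetical).
    return object()

stream = torch.cuda.Stream()

# 1) Allocate, register, and warm up on the side stream *before* capture, so no
#    new allocations (and thus no registrations) are needed during capture.
with torch.cuda.stream(stream):
    buf = alloc_symmetric(1 << 20)
    window = register_nccl_window(buf)
    buf.zero_()  # warm-up touches the buffer

# 2) Capture on the same stream: the captured work only reuses buf/window.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph, stream=stream):
    buf.add_(1)

graph.replay()  # replays reuse the already-registered window
```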
```diff
 graph_pool = torch.cuda.graph_pool_handle()
 set_graph_pool_id(graph_pool)
-with torch.cuda.graph(graph, pool=graph_pool):
+with torch.cuda.graph(graph, pool=graph_pool, stream=stream):
```
what would be the issue without this change?
this benchmark would fail with NCCL 2.28 as cuda graph would create a new side stream for capture.
```python
batch_descriptor=batch_descriptor,
),
patch("torch.cuda.graph", wraps=torch.cuda.graph) as mock_cuda_graph,
torch.cuda.stream(stream),
```
setting context via torch.cuda.stream(stream) does not pass the stream to cudagraph capture. this is a no-op.
Would there be an issue w/o the change?
Without this, stream=torch.cuda.current_stream() in cuda_graph.py would return the default stream, and capturing on the default stream is not allowed.
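A quick way to see the behaviour being discussed (assumes a CUDA device is available):

```python
import torch

s = torch.cuda.Stream()

# Outside any stream context, current_stream() is the default stream, which
# torch.cuda.graph() cannot capture on.
assert torch.cuda.current_stream() == torch.cuda.default_stream()

with torch.cuda.stream(s):
    # Inside the context, current_stream() is the side stream, so passing
    # stream=torch.cuda.current_stream() to torch.cuda.graph() would capture
    # on s instead of the default stream.
    assert torch.cuda.current_stream() == s
```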
it should not be necessary after 73094c7
youkaichao
left a comment
Sorry for the long delay.
Taking a closer look at the error stack at #28901, the error happens at https://github.com/NVIDIA/nccl/blob/v2.28.3-1/src/dev_runtime.cc#L591, and the stream nccl uses to synchronize is created by nccl itself at https://github.com/NVIDIA/nccl/blob/v2.28.3-1/src/dev_runtime.cc#L584, which is a local variable defined at https://github.com/NVIDIA/nccl/blob/v2.28.3-1/src/dev_runtime.cc#L553. It should not interfere with anything outside nccl.
My hypothesis is that there might be some caching inside the driver, so that when we create many streams, some of them end up being the same stream, causing the trouble here. The solution, then, is to create fewer streams to reduce stream collisions.
NOTE: I think pytorch has some stream-caching:
```python
import torch

assert torch.cuda.is_available(), "CUDA is required for CUDA streams"

streams = [torch.cuda.Stream() for _ in range(1000)]

# Collect stream pointers
ptrs = []
for i, s in enumerate(streams):
    # cuda_stream is an integer pointer (uintptr_t)
    ptr = s.cuda_stream
    ptrs.append(ptr)

# Check for duplicates
unique_ptrs = set(ptrs)

print("\nSummary:")
print(f"Total streams created: {len(ptrs)}")
print(f"Unique stream pointers: {len(unique_ptrs)}")

if len(unique_ptrs) != len(ptrs):
    print("⚠️ Duplicate streams detected!")
else:
    print("✅ No duplicate streams detected.")
```

It only prints:

```
Summary:
Total streams created: 1000
Unique stream pointers: 32
⚠️ Duplicate streams detected!
```
Which means I can create at most 32 unique streams from pytorch.
If I don't use pytorch, I can create 10k unique streams from cuda directly. It still seems to be a bug somewhere.
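For reference, one way to repeat the "create streams directly from CUDA" check from Python is via ctypes against the CUDA runtime. This is only a sketch; the library name is an assumption and may differ per install:

```python
import ctypes

# Assumption: the CUDA runtime shared library is on the loader path
# (it may be named e.g. libcudart.so.12 on some systems).
cudart = ctypes.CDLL("libcudart.so")

def create_stream() -> int:
    stream = ctypes.c_void_p()
    err = cudart.cudaStreamCreate(ctypes.byref(stream))
    assert err == 0, f"cudaStreamCreate failed with error {err}"
    return stream.value

handles = {create_stream() for _ in range(10_000)}
print(f"Unique stream handles: {len(handles)}")  # expected: 10000
```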
I'll try to dig further with the driver / nccl team, but for now the PR looks good to me. Reducing the number of streams created by pytorch can reduce the chance of stream collision (since pytorch can only create up to 32 streams).
Diving deeper, after turning on
The limitation imposed by this PR ("forcing the cudagraph capture stream to be the same stream as the warm up iterations") seems to come from either improper use of nccl or a driver bug. Nevertheless, I will stop here and let the nccl/driver team continue the investigation. The fix in this PR is simple, and we can go ahead with this workaround.
there's an error in ci:
@Amir-19 I think we need to update vllm/utils/torch_utils.py (line 451 at 42826bb).
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
the only failing test comes from main. merging.
…project#29207) Signed-off-by: Amir Samani <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> Signed-off-by: tianwenjing <[email protected]>
…project#29207) Signed-off-by: Amir Samani <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]>
…project#29207) Signed-off-by: Amir Samani <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]>
@youkaichao Pouya from NCCL team here.
I am not sure whether there is a bug in NCCL vs. a usage issue; let me explain. Symmetric memory window registration is tied to a CUDA stream. Warm up is one thing, but capturing that and putting it in a graph means "do this every time the graph is executed". Re-registering the same memory might be OK, since we cache that at the physical layer, but we don't guarantee that the window handle written out will be the same if the memory is already registered. So it literally means writing over the window handle with a new window (that needs destroying later) every time. Then who will free up that handle?
…project#29207) Signed-off-by: Amir Samani <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> Signed-off-by: dsuhinin <[email protected]>
…project#29207) Signed-off-by: Amir Samani <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]> Signed-off-by: daje0601 <[email protected]>
Purpose
With NCCL 2.28+, window registration during cuda graph capture crashes with NCCL complaining about the registration. Since ncclMemAlloc is tied to a stream, warm up and cuda graph capture on separate streams cause new memory allocations and thus window registrations (when you use the mempool associated with ncclMemAlloc and ncclCommWindowRegister on different streams and there are no available segments, you need new allocations and thus registrations). In this PR, we explicitly set the stream for cuda graph capture, forcing it to be on the same stream as the warm up iterations.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.