This commit fixes GitHub issue pytorch/pytorch#157363 where custom CUDA
kernels were not properly synchronized with PyTorch's CUDA stream when
used with torch.compile in reduce-overhead mode.
Changes:
- Add #include <ATen/cuda/CUDAContext.h> for getCurrentCUDAStream()
- Use at::cuda::getCurrentCUDAStream() to get PyTorch's current CUDA stream
- Launch all kernels with the correct stream parameter (see the sketch below)
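A minimal sketch of the fixed launch pattern, assuming a placeholder kernel; the kernel name, signature, and launch configuration here are illustrative, not the actual kernels touched by this commit:

```cuda
#include <ATen/cuda/CUDAContext.h>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the custom kernels in this commit.
__global__ void my_kernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;
}

void launch_my_kernel(const float* in, float* out, int n) {
  // Query the stream PyTorch is currently issuing work on.
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  // Passing the stream as the fourth launch parameter orders this kernel
  // after any PyTorch work already enqueued on the same stream.
  my_kernel<<<blocks, threads, 0, stream>>>(in, out, n);
}
```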
The issue occurred because the custom kernels were launched on the default CUDA stream
while PyTorch operations (such as nn.Linear) ran on PyTorch's managed stream.
This created race conditions in which a custom kernel could execute before
the preceding PyTorch operations had completed, producing incorrect output values.
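For contrast, using the same placeholder kernel as above, the two launch forms differ only in the stream argument; the first is not ordered relative to PyTorch's stream, the second is:

```cuda
// Before: no stream argument, so the kernel runs on the default stream and
// may start before preceding PyTorch ops on their own stream have finished.
my_kernel<<<blocks, threads>>>(in, out, n);

// After: launched on PyTorch's current stream, so it runs only after the
// framework's preceding work on that stream.
my_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream().stream()>>>(in, out, n);
```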
With this fix, all custom kernels are launched on PyTorch's current CUDA stream,
ensuring correct execution order and preventing race conditions under
torch.compile.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>