Replies: 2 comments
-
I think I actually fixed my own issue. I called …
-
Yep, Triton executes kernels in the current stream, so synchronizing it should be enough.
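For readers landing here later, a minimal sketch of that point (the `add_kernel` and `add` helper below are illustrative, not the kernel from this thread): Triton enqueues the kernel on PyTorch's current CUDA stream, so synchronizing that stream (or the whole device) blocks the host until the kernel has finished writing its output.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    # The kernel was enqueued on the current stream; either call below blocks
    # the host until it has finished writing `out`.
    torch.cuda.current_stream().synchronize()  # or torch.cuda.synchronize()
    return out
```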
-
Hello!
I am working on very large 5D ([B, C, X, Y, Z]) tensors for medical image segmentation, and I am trying to use Triton to fuse some operations. I am running into trouble ensuring that all Triton kernel programs finish execution while training in a distributed environment. I initialize the output tensor as empty, usually at a size of [2, 3, 300, 300, 30], and launch the Triton kernels; however, occasionally there are NaNs in the output. I think this is because that bit of memory has not yet been populated by a kernel in the grid. Is there a good way to ensure all Triton kernel programs finish execution? I tried torch.cuda.synchronize, but that does not always work for whatever reason...
I've been struggling with this for a while and would appreciate any help! Thank you so much :)
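One debugging idea, hedged since the actual kernel isn't shown in the thread: if the output comes from `torch.empty`, any element that no program in the grid ever stores to keeps whatever garbage (possibly NaN) was already in that memory, so the NaNs may be a grid/mask coverage problem rather than a synchronization problem. A way to check is to fill the output with a sentinel the kernel can never produce, launch, synchronize, and count surviving sentinels. `my_fused_kernel` and `SENTINEL` below are hypothetical placeholders.

```python
import torch

B, C, X, Y, Z = 2, 3, 300, 300, 30           # shape from the post
SENTINEL = -1.0e30                           # value the kernel should never write
out = torch.full((B, C, X, Y, Z), SENTINEL, device="cuda")

# my_fused_kernel[grid](..., out, ...)       # launch the real kernel here (placeholder)
torch.cuda.synchronize()                     # block until every program in the grid is done

unwritten = (out == SENTINEL).sum().item()
print(f"{unwritten} elements were never written by the kernel")
```

If that count is nonzero, the fix is in the grid size or the store masks rather than in synchronization; if it is zero and NaNs still appear, the kernel's arithmetic is the more likely culprit.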