Description
Previously we landed on mapping `stream=None` to the legacy default stream (the safer choice). As DLPack becomes more widely used, one of the most canonical use cases is exchanging tensors between a library and PyTorch. Most libraries have not been updated to pass a stream, and many expect no-sync behavior, which works better for cases like CUDAGraph:
import torch
# `mylib` stands for any DLPack-consuming library; it is a placeholder here.

s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        mylib_tensor = mylib.from_dlpack(x)
        mylib_kernel(mylib_tensor)
In the above code example, if `stream=None` maps to no sync (currently `stream=-1`), then the CUDA graph capture will work out of the box. Otherwise, the CUDA graph capture no longer works because of the sync. This is only a choice of default behavior, as `mylib` can always pick a specific stream to pass in.
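To make the two candidate defaults concrete, here is a minimal pure-Python sketch of how a CUDA producer might interpret the `stream` argument of `__dlpack__`. The function name `describe_producer_sync` and the `none_means_no_sync` flag are hypothetical and exist only for illustration; the integer conventions (`1` for the legacy default stream, `2` for the per-thread default stream, `-1` for no sync, `0` disallowed) follow my reading of the current guideline and should be checked against the spec.

```python
from typing import Optional

LEGACY_DEFAULT_STREAM = 1      # CUDA legacy default stream
PER_THREAD_DEFAULT_STREAM = 2  # CUDA per-thread default stream
NO_SYNC = -1                   # producer performs no synchronization


def describe_producer_sync(stream: Optional[int],
                           none_means_no_sync: bool = False) -> str:
    """Describe what a CUDA producer would do for a given `stream` value.

    `none_means_no_sync` is a hypothetical switch between the current
    default (None -> legacy default stream) and the proposed default
    (None -> no sync).
    """
    if stream is None:
        stream = NO_SYNC if none_means_no_sync else LEGACY_DEFAULT_STREAM
    if stream == NO_SYNC:
        return "no sync"
    if stream == 0:
        # 0 is ambiguous between the legacy and per-thread default streams.
        raise ValueError("stream=0 is not allowed for CUDA")
    if stream == LEGACY_DEFAULT_STREAM:
        return "sync with legacy default stream"
    if stream == PER_THREAD_DEFAULT_STREAM:
        return "sync with per-thread default stream"
    if stream > 2:
        return f"sync with stream handle {stream}"
    raise ValueError(f"invalid stream value: {stream}")
```

Under the current default, `describe_producer_sync(None)` reports a sync with the legacy default stream, which is the case that interferes with capture; with `none_means_no_sync=True` it reports no sync, matching an explicit `stream=-1`.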
So the discussion focuses only on the guideline for the default behavior. The original rationale for the default was that the legacy stream was a "safe choice". However, as DLPack-based exchange becomes more popular and CUDAGraph integration becomes critical, it could make sense for the default to optimize for the common use cases (`stream=None` defaults to no sync where applicable).
It is worth pointing out that no sync was also the implicit original behavior before the stream proposal, and before frameworks were updated to follow it (many only recently, as in the case of torch), so many libraries may have implicitly relied on this behavior.
Regardless of the choice made here, I think we should definitely update the guideline to encourage users to pass in a stream explicitly, and document the rationale for the no-sync behavior, its relation to CUDAGraph, etc., to help libraries pick.
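As a rough illustration of what "explicitly pass in stream" could look like on the consumer side, the sketch below forwards the caller's choice to the producer's `__dlpack__` instead of relying on the default. `RecordingProducer`, `from_dlpack`, and `from_dlpack_no_sync` are hypothetical names; the producer is a stand-in that only records the value it receives, so the example runs without a GPU.

```python
class RecordingProducer:
    """Hypothetical stand-in for a producer tensor.

    It records the `stream` value it was asked to synchronize with.
    """

    def __init__(self):
        self.seen_stream = "not called"

    def __dlpack__(self, *, stream=None):
        self.seen_stream = stream
        return object()  # placeholder for a real DLPack capsule


def from_dlpack(x, stream=None):
    """Hypothetical consumer entry point: forward `stream` unchanged."""
    return x.__dlpack__(stream=stream)


def from_dlpack_no_sync(x):
    """Request no synchronization explicitly, independent of the default."""
    return from_dlpack(x, stream=-1)
```

A library that wants capture-friendly behavior today could call the equivalent of `from_dlpack_no_sync`, so its behavior does not depend on how the producer treats `stream=None`.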