Our old communication functions launched many small kernels, so we used cudaGraph to reduce the kernel launch overhead. Later, we found that manually fusing the small kernels into larger ones was faster than cudaGraph for our cases. As a result, the communication code no longer uses cudaGraph unless you force it by setting a cudaGraph region.
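To make the two approaches concrete, here is a minimal sketch, not the library's actual communication code. The kernels (`scaleKernel`, `offsetKernel`, `squareKernel`, `fusedKernel`) and the buffer names are hypothetical stand-ins. Option A captures three small kernels into a CUDA graph and replays them with a single graph launch. Option B does the same arithmetic in one hand-fused kernel. It assumes CUDA 11.4 or newer for `cudaGraphInstantiateWithFlags`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Report the failing call and bail out of main() on any CUDA error.
#define CHECK(call)                                                     \
  do {                                                                  \
    cudaError_t err_ = (call);                                          \
    if (err_ != cudaSuccess) {                                          \
      std::fprintf(stderr, "%s failed: %s\n", #call,                    \
                   cudaGetErrorString(err_));                           \
      return 1;                                                         \
    }                                                                   \
  } while (0)

// Hypothetical small kernels: each does very little work per launch.
__global__ void scaleKernel(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}
__global__ void offsetKernel(float* x, int n, float b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += b;
}
__global__ void squareKernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= x[i];
}

// Hand-fused version: one launch, one read and one write per element.
__global__ void fusedKernel(float* x, int n, float a, float b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = x[i] * a + b;
    x[i] = v * v;
  }
}

int main() {
  const int n = 1 << 10, threads = 256, blocks = (n + threads - 1) / threads;
  const size_t bytes = n * sizeof(float);
  std::vector<float> host(n, 1.0f), outGraph(n), outFused(n);

  float *dGraph = nullptr, *dFused = nullptr;
  CHECK(cudaMalloc(&dGraph, bytes));
  CHECK(cudaMalloc(&dFused, bytes));
  CHECK(cudaMemcpy(dGraph, host.data(), bytes, cudaMemcpyHostToDevice));
  CHECK(cudaMemcpy(dFused, host.data(), bytes, cudaMemcpyHostToDevice));

  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  // Option A: capture the three small launches into a graph, then replay
  // them with a single cudaGraphLaunch call.
  cudaGraph_t graph;
  cudaGraphExec_t graphExec;
  CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
  scaleKernel<<<blocks, threads, 0, stream>>>(dGraph, n, 2.0f);
  offsetKernel<<<blocks, threads, 0, stream>>>(dGraph, n, 1.0f);
  squareKernel<<<blocks, threads, 0, stream>>>(dGraph, n);
  CHECK(cudaStreamEndCapture(stream, &graph));
  CHECK(cudaGraphInstantiateWithFlags(&graphExec, graph, 0));
  CHECK(cudaGraphLaunch(graphExec, stream));

  // Option B: the same arithmetic in one manually fused kernel.
  fusedKernel<<<blocks, threads, 0, stream>>>(dFused, n, 2.0f, 1.0f);
  CHECK(cudaGetLastError());
  CHECK(cudaStreamSynchronize(stream));

  CHECK(cudaMemcpy(outGraph.data(), dGraph, bytes, cudaMemcpyDeviceToHost));
  CHECK(cudaMemcpy(outFused.data(), dFused, bytes, cudaMemcpyDeviceToHost));

  // Both paths should agree element by element for this input.
  int mismatches = 0;
  for (int i = 0; i < n; ++i) mismatches += (outGraph[i] != outFused[i]);
  std::printf("mismatches: %d\n", mismatches);

  CHECK(cudaGraphExecDestroy(graphExec));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaStreamDestroy(stream));
  CHECK(cudaFree(dGraph));
  CHECK(cudaFree(dFused));
  return mismatches == 0 ? 0 : 1;
}
```

A graph cuts the CPU-side cost of issuing many launches, but each captured kernel still runs as its own node and makes its own pass over memory. The fused kernel removes those intermediate passes as well, which is one plausible reason fusion can win when the kernels are tiny. Which option is faster depends on the workload, so it is worth timing both.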

Answer selected by indra098124