You need to look into what the warptile parameters mean; they are not independent of each other. I'll try to summarize what I remember:

The 11 parameters are: BLOCK_SIZE, BM, BN, BK, WM, WN, WMITER, TM, TN, TK and WARP.
They originate from this CUDA article; see the section on Kernel 10: https://siboehm.com/articles/22/CUDA-MMM
The diagram there is especially helpful.

For your problem: You need to make sure that the number of warps in the workgroup (BLOCK_SIZE / WARP) is identical to the number of warptiles. For example, in the Nvidia case (warps of size 32) we have a workgroup of size BLOCK_SIZE=128, meaning 4 warps. BM=64, BN=64 and WM=32, WN=32 means we have (64/32) × (64/32) = 4 warptiles, one per warp. This is why it works.
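
The constraint above can be written as a quick sanity check. This is only a minimal sketch: the function names are hypothetical and are not part of any existing codebase, and it assumes the number of warptiles is (BM / WM) × (BN / WN), as in the Nvidia example.

```python
# Hypothetical helpers for checking a warptile configuration.
# Assumption: warptile count = (BM / WM) * (BN / WN).

def warps_in_workgroup(block_size: int, warp: int) -> int:
    """Number of warps (subgroups) in a workgroup of block_size threads."""
    return block_size // warp


def warptiles_in_block(bm: int, bn: int, wm: int, wn: int) -> int:
    """Number of WM x WN warptiles needed to cover one BM x BN block tile."""
    return (bm // wm) * (bn // wn)


def is_consistent(block_size: int, warp: int,
                  bm: int, bn: int, wm: int, wn: int) -> bool:
    """True if every warp in the workgroup gets exactly one warptile."""
    # The sizes must divide evenly, otherwise the tiling itself is invalid.
    if block_size % warp or bm % wm or bn % wn:
        return False
    return warps_in_workgroup(block_size, warp) == warptiles_in_block(bm, bn, wm, wn)


# Nvidia example from above: 128 / 32 = 4 warps, (64/32) * (64/32) = 4 warptiles.
nvidia_ok = is_consistent(block_size=128, warp=32, bm=64, bn=64, wm=32, wn=32)

# Same tiles with a warp size of 64: only 2 warps for 4 warptiles.
wave64_ok = is_consistent(block_size=128, warp=64, bm=64, bn=64, wm=32, wn=32)

print(nvidia_ok, wave64_ok)
```

The second call shows how changing only WARP breaks a configuration that was otherwise valid: BLOCK_SIZE or the tile sizes have to change with it.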

In your WARP=1…

rmatif
May 15, 2025
Collaborator Author

rmatif May 15, 2025
Collaborator Author

0cc4m May 16, 2025
Collaborator

rmatif May 16, 2025
Collaborator Author

0cc4m May 16, 2025
Collaborator

rmatif
May 17, 2025
Collaborator Author

0cc4m May 17, 2025
Collaborator

Answer selected by rmatif
rmatif May 22, 2025
Collaborator Author

0cc4m May 23, 2025
Collaborator
