Commit f668533

Add a note suggesting users prefer PTX MMA over WMMA (#2816)
[only docs]
1 parent c8c2142 commit f668533

1 file changed: +7 -0 lines changed

docs/src/development/kernel.md

Lines changed: 7 additions & 0 deletions
@@ -698,3 +698,10 @@ double each element in a fragment, you can simply use:
 ```julia
 frag = 2.0f0 .* frag
 ```
+
+!!! note
+    The WMMA instructions don't take advantage of [memory swizzling](https://leimao.github.io/blog/CUDA-Shared-Memory-Swizzling/).
+    The custom load/store operations for WMMA don't allow the programmer to control *how* data is loaded,
+    so register bank conflicts can only be reduced, but not entirely eliminated. In general, using the PTX
+    instructions [`mma.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma)
+    and friends are preferred, as they give the programmer finer control over the memory access pattern.
