Add a note suggesting users prefer PTX MMA over WMMA (#2816)

kshyatt · web-flow · commit f66853351a7a · 2025-08-11T17:32:04.000+02:00
[only docs]
diff --git a/docs/src/development/kernel.md b/docs/src/development/kernel.md
@@ -698,3 +698,10 @@ double each element in a fragment, you can simply use:
 ```julia
 frag = 2.0f0 .* frag
 ```
+
+!!! note
+    The WMMA instructions don't take advantage of [memory swizzling](https://leimao.github.io/blog/CUDA-Shared-Memory-Swizzling/).
+    The custom load/store operations for WMMA don't allow the programmer to control *how* data is loaded,
+    so shared-memory bank conflicts can only be reduced, but not entirely eliminated. In general, using the PTX
+    instructions [`mma.sync`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma)
+    and friends is preferred, as they give the programmer finer control over the memory access pattern.