@@ -147,3 +147,33 @@ be specified. NVIDIA's documentation provides information about
using the PTX instruction set, helping programmers compile code based on
the corresponding syntax rules, as shown in
Code `lst:ptx`.

**lst:ptx**
```cpp
// Per-thread operand fragments. half_t is a 16-bit floating-point type
// (for example, cutlass::half_t); all four pointers must reference valid data.
half_t *a, *b;
float *C, *D;
// View each pair of FP16 elements as one 32-bit register operand.
unsigned const* A = reinterpret_cast<unsigned const*>(a);
unsigned const* B = reinterpret_cast<unsigned const*>(b);

// D = A * B + C: outputs %0-%7 (D), inputs %8-%9 (A), %10-%11 (B), %12-%19 (C).
asm volatile(
    "mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 "
    "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
    "{%12,%13,%14,%15,%16,%17,%18,%19};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3]), "=f"(D[4]),
      "=f"(D[5]), "=f"(D[6]), "=f"(D[7])
    : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]), "f"(C[0]),
      "f"(C[1]), "f"(C[2]), "f"(C[3]), "f"(C[4]), "f"(C[5]),
      "f"(C[6]), "f"(C[7]));
```

The data elements are passed directly as register operands; the `unsigned`
type holds the FP16 data elements, two per 32-bit register. NVIDIA also
provides the `ldmatrix` instruction for loading data from shared memory
into fragments.
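
To make the role of the `unsigned` operands concrete, the following host-side sketch packs two FP16 bit patterns into one 32-bit value. The helper names `pack_half2` and `unpack_half` are hypothetical and exist only for illustration. The sketch assumes a little-endian layout, where element 0 of a pair occupies the low 16 bits, which is what the `reinterpret_cast` in the listing relies on.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of CUDA or PTX.

// Pack two 16-bit FP16 bit patterns into one 32-bit register image.
// Element 0 goes into the low half (little-endian layout assumed).
constexpr std::uint32_t pack_half2(std::uint16_t elem0, std::uint16_t elem1) {
    return static_cast<std::uint32_t>(elem0) |
           (static_cast<std::uint32_t>(elem1) << 16);
}

// Recover element i (0 or 1) from a packed register image.
constexpr std::uint16_t unpack_half(std::uint32_t reg, int i) {
    return static_cast<std::uint16_t>(reg >> (16 * i));
}

// FP16 bit patterns: 1.0 is 0x3C00 and 2.0 is 0x4000.
static_assert(pack_half2(0x3C00, 0x4000) == 0x40003C00u, "low half holds element 0");
static_assert(unpack_half(0x40003C00u, 0) == 0x3C00, "element 0 round-trips");
static_assert(unpack_half(0x40003C00u, 1) == 0x4000, "element 1 round-trips");
```

In the listing, `A[0]` and `A[1]` are two such packed values, so each thread supplies four FP16 elements of the A operand.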

The finer-grained `mma` instruction can be used to build warp-level,
WMMA-style APIs with more diverse shapes, and it gives control over the
mapping between threads and data in the warp. PTX instructions therefore
offer greater flexibility than using the CUDA C++ API alone.
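
One way to reason about the thread-to-data mapping is to count how many 32-bit registers each thread contributes per operand. The sketch below is a simplified model, not NVIDIA's definition: `MmaShape`, `RegCount`, and `regs_per_thread` are hypothetical names, elements are assumed to be spread evenly over the cooperating threads, FP16 inputs are assumed to be packed two per register, and FP32 accumulators take one register each. The thread counts (8 per `m8n8k4` tile, 32 per `m16n8k16` tile) are assumptions to check against the PTX ISA documentation.

```cpp
// Hypothetical helper types for illustration; not part of any NVIDIA API.
struct MmaShape {
    int m, n, k;   // tile: D (m x n) = A (m x k) * B (k x n) + C (m x n)
    int threads;   // threads cooperating on one tile (assumed value)
};

struct RegCount {
    int a, b, acc; // 32-bit registers per thread for A, B, and C (or D)
};

// Even distribution assumed: FP16 inputs pack two elements per register,
// FP32 accumulators use one register per element.
constexpr RegCount regs_per_thread(MmaShape s) {
    return RegCount{(s.m * s.k) / s.threads / 2,
                    (s.k * s.n) / s.threads / 2,
                    (s.m * s.n) / s.threads};
}

constexpr MmaShape kM8N8K4{8, 8, 4, 8};       // assumed: 8 threads per tile
constexpr MmaShape kM16N8K16{16, 8, 16, 32};  // assumed: full warp per tile

// Matches the operand lists in the listing: 2 for A, 2 for B, 8 for C.
static_assert(regs_per_thread(kM8N8K4).a == 2, "A operands");
static_assert(regs_per_thread(kM8N8K4).b == 2, "B operands");
static_assert(regs_per_thread(kM8N8K4).acc == 8, "accumulators");
```

Under the same assumptions, `regs_per_thread(kM16N8K16)` gives 4, 2, and 4 registers, which shows how a different shape changes the per-thread operand lists that an `mma` wrapper has to expose.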

[^1]: available at <https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>