Commit edbe169

debug

1 parent 0f32074

1 file changed

+30
-0
lines changed

chapter_accelerator/Programming_Methods.md

Lines changed: 30 additions & 0 deletions
@@ -147,3 +147,33 @@ be specified. NVIDIA's documentation provides information about
using the PTX instruction set, helping programmers compile code based on
the corresponding syntax rules, as shown in
Code `lst:ptx`.

**lst:ptx**
```cpp
// half_t: a 16-bit floating-point type, e.g. cutlass::half_t.
// a, b, C, and D are assumed to point to this thread's fragments.
half_t *a, *b;
float *C, *D;
// View the FP16 operands as 32-bit registers (two FP16 values each).
unsigned const* A = reinterpret_cast<unsigned const*>(a);
unsigned const* B = reinterpret_cast<unsigned const*>(b);

// D = A * B + C on an 8x8x4 tile: FP16 inputs, FP32 accumulation.
asm volatile(
    "mma.sync.aligned.m8n8k4.row.row.f32.f16.f16.f32 "
    "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, "
    "{%12,%13,%14,%15,%16,%17,%18,%19};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3]), "=f"(D[4]),
      "=f"(D[5]), "=f"(D[6]), "=f"(D[7])
    : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]), "f"(C[0]),
      "f"(C[1]), "f"(C[2]), "f"(C[3]), "f"(C[4]), "f"(C[5]),
      "f"(C[6]), "f"(C[7]));
```

The data elements are passed directly as operands: each `unsigned`
(32-bit) register holds two packed FP16 data elements. NVIDIA also
provides the `ldmatrix` instruction to load data from shared memory into
fragments.
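
To make the packing of FP16 data into `unsigned` registers concrete, the
following host-side sketch builds the 32-bit words that the `"r"` operands
would carry. It is only an illustration: the helper names `pack_half2` and
`load_as_registers` are hypothetical, FP16 values are represented by their raw
16-bit patterns, and the low-half-first order is assumed to match what the
`reinterpret_cast` in Code `lst:ptx` yields on a little-endian GPU.

```cpp
#include <cassert>
#include <cstdint>

// Pack two FP16 bit patterns into one 32-bit word (hypothetical helper).
// The first element goes into the low 16 bits, the second into the high 16 bits.
inline std::uint32_t pack_half2(std::uint16_t lo, std::uint16_t hi) {
    return static_cast<std::uint32_t>(lo) |
           (static_cast<std::uint32_t>(hi) << 16);
}

// Turn four consecutive FP16 values into the two 32-bit registers that a
// single thread would pass as A[0], A[1] (or B[0], B[1]) in the listing.
inline void load_as_registers(const std::uint16_t* halves,
                              std::uint32_t regs[2]) {
    assert(halves != nullptr);
    regs[0] = pack_half2(halves[0], halves[1]);
    regs[1] = pack_half2(halves[2], halves[3]);
}
```

For example, the FP16 encodings of 1.0 (`0x3C00`) and 2.0 (`0x4000`) pack into
the single word `0x40003C00`.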

The finer-grained `mma` instruction can be composed into warp-level,
WMMA-style APIs with more diverse shapes, and it lets the programmer
control the mapping between threads and data within the warp. PTX
instructions therefore offer greater flexibility than CUDA C++ code
alone.
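
As a reading aid for the `m8n8k4` shape, the sketch below spells out the
arithmetic of one such tile, `D = A * B + C`, in plain C++. It is a scalar
reference under stated assumptions, not how the hardware executes the
instruction: both inputs are stored row-major (mirroring `.row.row`), FP16
inputs are approximated with `float`, and the distribution of fragments across
the threads of a warp is ignored. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Tile shape of the instruction: M x N output, K-deep reduction.
constexpr std::size_t kM = 8, kN = 8, kK = 4;

using TileA = std::array<float, kM * kK>;  // M x K, row-major
using TileB = std::array<float, kK * kN>;  // K x N, row-major
using TileC = std::array<float, kM * kN>;  // M x N, row-major

// Scalar reference for a single 8x8x4 multiply-accumulate: D = A * B + C.
inline TileC mma_m8n8k4_reference(const TileA& A, const TileB& B,
                                  const TileC& C) {
    TileC D{};
    for (std::size_t i = 0; i < kM; ++i) {
        for (std::size_t j = 0; j < kN; ++j) {
            float acc = C[i * kN + j];  // start from the accumulator input
            for (std::size_t k = 0; k < kK; ++k) {
                acc += A[i * kK + k] * B[k * kN + j];
            }
            D[i * kN + j] = acc;
        }
    }
    return D;
}

// Usage example: with A = 1, B = 2, C = 0.5 every output is 4 * 2 + 0.5.
inline void mma_m8n8k4_reference_check() {
    TileA A; A.fill(1.0f);
    TileB B; B.fill(2.0f);
    TileC C; C.fill(0.5f);
    const TileC D = mma_m8n8k4_reference(A, B, C);
    for (float d : D) assert(d == 8.5f);
}
```

A reference of this kind is useful mainly for checking the output of a
hand-written PTX kernel element by element once the fragments have been
written back to memory.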

[^1]: Available at <https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>
