You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PR llvm#96083 added intrinsics for async copy of 'tensor' data
using TMA. Following a similar design, this PR adds intrinsics
for async copy of bulk data (non-tensor variants) through TMA.
* These intrinsics optionally support multicast and cache_hints,
as indicated by the boolean arguments at the end of the intrinsics.
* The backend looks through these flag arguments and lowers to the
appropriate PTX instructions.
* Lit tests are added for all combinations of these intrinsics in
cp-async-bulk.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.
PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk
Signed-off-by: Durgadoss R <[email protected]>
0 commit comments