@@ -962,6 +962,109 @@ The ``griddepcontrol`` intrinsics allows the dependent grids and prerequisite gr
962962For more information, refer
963963`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol >`__.
964964
965+ TCGEN05 family of Intrinsics
966+ ----------------------------
967+
968+ The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions
969+ exposed by PTX. These intrinsics use 'Tensor Memory' (henceforth ``tmem ``).
970+ NVPTX represents this memory using ``addrspace(6) `` and is always 32-bits.
971+
972+ For more information, refer to the PTX ISA
973+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory >`_.
974+
975+ The tensor-memory pointers may only be used with the tcgen05 intrinsics.
976+ There are specialized load/store instructions provided (tcgen05.ld/st) to
977+ work with tensor-memory.
978+
979+ See the PTX ISA for more information on tensor-memory load/store instructions
980+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions >`_.
981+
982+ All tcgen05 intrinsics use a ``null `` pointer in tmem address
983+ space as their last operand. This helps to preserve ordering among the tcgen05
984+ operations especially when the intrinsic lacks any tmem operands. This
985+ last operand is dropped during Codegen.
986+
987+ '``llvm.nvvm.tcgen05.alloc ``'
988+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
989+
990+ Syntax:
991+ """""""
992+
993+ .. code-block :: llvm
994+
995+ declare void @llvm.nvvm.tcgen05.alloc.cg1(ptr %dst, i32 %ncols, ptr addrspace(6) null)
996+ declare void @llvm.nvvm.tcgen05.alloc.cg2(ptr %dst, i32 %ncols, ptr addrspace(6) null)
997+ declare void @llvm.nvvm.tcgen05.alloc.shared.cg1(ptr addrspace(3) %dst, i32 %ncols, ptr addrspace(6) null)
998+ declare void @llvm.nvvm.tcgen05.alloc.shared.cg2(ptr addrspace(3) %dst, i32 %ncols, ptr addrspace(6) null)
999+
1000+ Overview:
1001+ """""""""
1002+
1003+ The '``@llvm.nvvm.tcgen05.alloc.* ``' intrinsics correspond to the
1004+ ``tcgen05.alloc.cta_group*.sync.aligned.b32 `` family of PTX instructions.
1005+ The ``tcgen05.alloc `` is a potentially blocking instruction which dynamically
1006+ allocates the specified number of columns in the Tensor Memory and writes
1007+ the address of the allocated Tensor Memory into shared memory at the
1008+ location specified by ``%dst ``. The 32-bit operand ``%ncols `` specifies
1009+ the number of columns to be allocated and it must be a power-of-two.
1010+ The ``.shared `` variant explicitly uses shared memory address space for
1011+ the ``%dst `` operand. The ``.cg1 `` and ``.cg2 `` variants generate
1012+ ``cta_group::1 `` and ``cta_group::2 `` variants of the instruction respectively.
1013+
1014+ For more information, refer to the PTX ISA
1015+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1016+
1017+ '``llvm.nvvm.tcgen05.dealloc ``'
1018+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1019+
1020+ Syntax:
1021+ """""""
1022+
1023+ .. code-block :: llvm
1024+
1025+ declare void @llvm.nvvm.tcgen05.dealloc.cg1(ptr addrspace(6) %tmem_addr, i32 %ncols, ptr addrspace(6) null)
1026+ declare void @llvm.nvvm.tcgen05.dealloc.cg2(ptr addrspace(6) %tmem_addr, i32 %ncols, ptr addrspace(6) null)
1027+
1028+ Overview:
1029+ """""""""
1030+
1031+ The '``@llvm.nvvm.tcgen05.dealloc.* ``' intrinsics correspond to the
1032+ ``tcgen05.dealloc.* `` set of PTX instructions. The ``tcgen05.dealloc ``
1033+ instructions deallocates the Tensor Memory specified by the Tensor Memory
1034+ address ``%tmem_addr ``. The operand ``%tmem_addr `` must point to a previous
1035+ Tensor Memory allocation. The 32-bit operand ``%ncols `` specifies the number
1036+ of columns to be de-allocated. The ``.cg1 `` and ``.cg2 `` variants generate
1037+ ``cta_group::1 `` and ``cta_group::2 `` variants of the instruction respectively.
1038+
1039+ For more information, refer to the PTX ISA
1040+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1041+
1042+ '``llvm.nvvm.tcgen05.relinq.alloc.permit ``'
1043+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1044+
1045+ Syntax:
1046+ """""""
1047+
1048+ .. code-block :: llvm
1049+
1050+ declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg1(ptr addrspace(6) null)
1051+ declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg2(ptr addrspace(6) null)
1052+
1053+ Overview:
1054+ """""""""
1055+
1056+ The '``@llvm.nvvm.tcgen05.relinq.alloc.permit.* ``' intrinsics correspond
1057+ to the ``tcgen05.relinquish_alloc_permit.* `` set of PTX instructions.
1058+ This instruction specifies that the CTA of the executing thread is
1059+ relinquishing the right to allocate Tensor Memory. So, it is illegal
1060+ for a CTA to perform ``tcgen05.alloc `` after any of its constituent
1061+ threads execute ``tcgen05.relinquish_alloc_permit ``. The ``.cg1 ``
1062+ and ``.cg2 `` variants generate ``cta_group::1 `` and ``cta_group::2 ``
1063+ flavors of the instruction respectively.
1064+
1065+ For more information, refer to the PTX ISA
1066+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1067+
9651068Other Intrinsics
9661069----------------
9671070
0 commit comments