@@ -962,6 +962,104 @@ The ``griddepcontrol`` intrinsics allows the dependent grids and prerequisite gr
962962For more information, refer
963963`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol >`__.
964964
965+ TCGEN05 family of Intrinsics
966+ ----------------------------
967+
968+ The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions
969+ exposed by PTX. These intrinsics use 'Tensor Memory' (henceforth ``tmem ``).
970+ NVPTX represents this memory using ``addrspace(6) `` and is always 32-bits.
971+
972+ For more information, refer to the PTX ISA
973+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory >`_.
974+
975+ The tensor-memory pointers may only be used with the tcgen05 intrinsics.
976+ There are specialized load/store instructions provided (tcgen05.ld/st) to
977+ work with tensor-memory.
978+
979+ See the PTX ISA for more information on tensor-memory load/store instructions
980+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions >`_.
981+
982+ '``llvm.nvvm.tcgen05.alloc ``'
983+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
984+
985+ Syntax:
986+ """""""
987+
988+ .. code-block :: llvm
989+
990+ declare void @llvm.nvvm.tcgen05.alloc.cg1(ptr %dst, i32 %ncols)
991+ declare void @llvm.nvvm.tcgen05.alloc.cg2(ptr %dst, i32 %ncols)
992+ declare void @llvm.nvvm.tcgen05.alloc.shared.cg1(ptr addrspace(3) %dst, i32 %ncols)
993+ declare void @llvm.nvvm.tcgen05.alloc.shared.cg2(ptr addrspace(3) %dst, i32 %ncols)
994+
995+ Overview:
996+ """""""""
997+
998+ The '``@llvm.nvvm.tcgen05.alloc.* ``' intrinsics correspond to the
999+ ``tcgen05.alloc.cta_group*.sync.aligned.b32 `` family of PTX instructions.
1000+ The ``tcgen05.alloc `` is a potentially blocking instruction which dynamically
1001+ allocates the specified number of columns in the Tensor Memory and writes
1002+ the address of the allocated Tensor Memory into shared memory at the
1003+ location specified by ``%dst ``. The 32-bit operand ``%ncols `` specifies
1004+ the number of columns to be allocated and it must be a power-of-two.
1005+ The ``.shared `` variant explicitly uses shared memory address space for
1006+ the ``%dst `` operand. The ``.cg1 `` and ``.cg2 `` variants generate
1007+ ``cta_group::1 `` and ``cta_group::2 `` variants of the instruction respectively.
1008+
1009+ For more information, refer to the PTX ISA
1010+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1011+
1012+ '``llvm.nvvm.tcgen05.dealloc ``'
1013+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1014+
1015+ Syntax:
1016+ """""""
1017+
1018+ .. code-block :: llvm
1019+
1020+ declare void @llvm.nvvm.tcgen05.dealloc.cg1(ptr addrspace(6) %tmem_addr, i32 %ncols)
1021+ declare void @llvm.nvvm.tcgen05.dealloc.cg2(ptr addrspace(6) %tmem_addr, i32 %ncols)
1022+
1023+ Overview:
1024+ """""""""
1025+
1026+ The '``@llvm.nvvm.tcgen05.dealloc.* ``' intrinsics correspond to the
1027+ ``tcgen05.dealloc.* `` set of PTX instructions. The ``tcgen05.dealloc ``
1028+ instructions deallocates the Tensor Memory specified by the Tensor Memory
1029+ address ``%tmem_addr ``. The operand ``%tmem_addr `` must point to a previous
1030+ Tensor Memory allocation. The 32-bit operand ``%ncols `` specifies the number
1031+ of columns to be de-allocated. The ``.cg1 `` and ``.cg2 `` variants generate
1032+ ``cta_group::1 `` and ``cta_group::2 `` variants of the instruction respectively.
1033+
1034+ For more information, refer to the PTX ISA
1035+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1036+
1037+ '``llvm.nvvm.tcgen05.relinq.alloc.permit ``'
1038+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1039+
1040+ Syntax:
1041+ """""""
1042+
1043+ .. code-block :: llvm
1044+
1045+ declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg1()
1046+ declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg2()
1047+
1048+ Overview:
1049+ """""""""
1050+
1051+ The '``@llvm.nvvm.tcgen05.relinq.alloc.permit.* ``' intrinsics correspond
1052+ to the ``tcgen05.relinquish_alloc_permit.* `` set of PTX instructions.
1053+ This instruction specifies that the CTA of the executing thread is
1054+ relinquishing the right to allocate Tensor Memory. So, it is illegal
1055+ for a CTA to perform ``tcgen05.alloc `` after any of its constituent
1056+ threads execute ``tcgen05.relinquish_alloc_permit ``. The ``.cg1 ``
1057+ and ``.cg2 `` variants generate ``cta_group::1 `` and ``cta_group::2 ``
1058+ flavors of the instruction respectively.
1059+
1060+ For more information, refer to the PTX ISA
1061+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1062+
9651063Other Intrinsics
9661064----------------
9671065
0 commit comments