@@ -733,6 +733,96 @@ these intrinsics is a boolean flag, with the same functionality as described in
733733For more information, refer PTX ISA
734734`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor >`_.
735735
736+ Warp Group Intrinsics
737+ ---------------------
738+
739+ '``llvm.nvvm.wgmma.fence.sync.aligned ``'
740+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
741+
742+ Syntax:
743+ """""""
744+
745+ .. code-block :: llvm
746+
747+ declare void @llvm.nvvm.wgmma.fence.sync.aligned()
748+
749+ Overview:
750+ """""""""
751+
752+ The '``@llvm.nvvm.wgmma.fence.sync.aligned ``' intrinsic generates the
753+ ``wgmma.fence.sync.aligned `` PTX instruction, which establishes an ordering
754+ between prior accesses to any warpgroup registers and subsequent accesses to
755+ the same registers by a ``wgmma.mma_async `` instruction.
756+
757+ The ``wgmma.fence `` instruction must be issued by all warps of the warpgroup in
758+ the following locations:
759+
760+ * Before the first ``wgmma.mma_async `` operation in a warpgroup.
761+ * Between a register access by a thread in the warpgroup and any
762+ ``wgmma.mma_async `` instruction that accesses the same registers, except when
763+ these are accumulator register accesses across multiple ``wgmma.mma_async ``
764+ instructions of the same shape in which case an ordering guarantee is
765+ provided by default.
766+
767+ For more information, refer PTX ISA
768+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence >`_.
769+
770+ '``llvm.nvvm.wgmma.commit_group.sync.aligned ``'
771+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
772+
773+ Syntax:
774+ """""""
775+
776+ .. code-block :: llvm
777+
778+ declare void @llvm.nvvm.wgmma.commit_group.sync.aligned()
779+
780+ Overview:
781+ """""""""
782+
783+ The '``@llvm.nvvm.wgmma.commit_group.sync.aligned ``' intrinsic generates the
784+ ``wgmma.commit_group.sync.aligned `` PTX instruction, which creates a new
785+ wgmma-group per warpgroup and batches all prior ``wgmma.mma_async ``
786+ instructions initiated by the executing warp but not committed to any
787+ wgmma-group into the new wgmma-group. If there are no uncommitted ``wgmma
788+ mma_async `` instructions then, ``wgmma.commit_group `` results in an empty
789+ wgmma-group.
790+
791+ An executing thread can wait for the completion of all ``wgmma.mma_async ``
792+ operations in a wgmma-group by using ``wgmma.wait_group ``.
793+
794+ For more information, refer PTX ISA
795+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group >`_.
796+
797+ '``llvm.nvvm.wgmma.wait_group.sync.aligned ``'
798+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
799+
800+ Syntax:
801+ """""""
802+
803+ .. code-block :: llvm
804+
805+ declare void @llvm.nvvm.wgmma.wait_group.sync.aligned(i64 immarg N)
806+
807+ Overview:
808+ """""""""
809+
810+ The '``@llvm.nvvm.wgmma.wait_group.sync.aligned ``' intrinsic generates the
811+ ``wgmma.commit_group.sync.aligned N `` PTX instruction, which will cause the
812+ executing thread to wait until only ``N `` or fewer of the most recent
813+ wgmma-groups are pending and all the prior wgmma-groups committed by the
814+ executing threads are complete. For example, when ``N `` is 0, the executing
815+ thread waits on all the prior wgmma-groups to complete. Operand ``N `` is an
816+ integer constant.
817+
818+ Accessing the accumulator register or the input register containing the
819+ fragments of matrix A of a ``wgmma.mma_async `` instruction without first
820+ performing a ``wgmma.wait_group `` instruction that waits on a wgmma-group
821+ including that ``wgmma.mma_async `` instruction is undefined behavior.
822+
823+ For more information, refer PTX ISA
824+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group >`_.
825+
736826Other Intrinsics
737827----------------
738828
0 commit comments