@@ -633,7 +633,7 @@ specified by the ``i32 %d0 ... i32 %d4`` arguments.
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
 
-'``llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[1-5]d``'
+'``llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -648,7 +648,7 @@ Syntax:
 Overview:
 """""""""
 
-The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[1-5]d``' intrinsics
+The '``@llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.[3-5]d``' intrinsics
 correspond to the ``cp.async.bulk.prefetch.tensor.[1-5]d.L2.global*`` set
 of PTX instructions. These instructions initiate an asynchronous prefetch
 of tensor data from global memory to the L2 cache. In im2col mode, some
@@ -663,6 +663,76 @@ the same functionality as described in the ``tile`` mode intrinsics above.
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
 
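+As an illustration, a 3-d im2col prefetch might be invoked as in the
+sketch below. The value names are placeholders, and the argument list
+assumes the 3-d im2col ``Syntax`` shown earlier in this section (one
+``i16`` im2col offset for a 3-d tensor):
+
+.. code-block:: llvm
+
+  ; Sketch: prefetch a 3-d im2col-mode tensor into the L2 cache.
+  ; The false flag selects the variant without .L2::cache_hint,
+  ; so the %ch operand is ignored.
+  call void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.3d(
+      ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 0, i1 false)
+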
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.min.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.max.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.inc.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.dec.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.and.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.or.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.xor.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.2d(..., i32 %d0, i32 %d1, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.tile.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with the tensor data in shared{::cta} memory, using ``tile`` mode.
+The dimension of the tensor data ranges from 1d to 5d with the coordinates
+specified by the ``i32 %d0 ... i32 %d4`` arguments. The supported reduction
+operations are {add, min, max, inc, dec, and, or, xor}, as shown in the
+``tile.1d`` intrinsics above.
+
+* The last argument to these intrinsics is a boolean flag
+  indicating support for cache_hint. This flag argument must
+  be a compile-time constant. When set, it indicates a valid
+  cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
+  variant of the PTX instruction.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
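+For example, a 2-d ``add`` reduction with an L2 cache hint could be
+invoked as in the following sketch (the SSA value names are
+placeholders, not part of the intrinsic contract):
+
+.. code-block:: llvm
+
+  ; Sketch: reduce a 2-d tile of data from shared::cta memory into
+  ; the global tensor memory described by %tmap. The true flag marks
+  ; %ch as a valid cache hint, selecting the .L2::cache_hint variant.
+  call void @llvm.nvvm.cp.async.bulk.tensor.reduce.add.tile.2d(
+      ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 true)
+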
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.[red_op].im2col.[3-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.<red_op>.im2col.[3-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[3-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with the tensor data in shared{::cta} memory, using ``im2col``
+mode. In this mode, the tensor has to be at least three-dimensional. The
+supported reduction operations are the same as the ones in the ``tile`` mode.
+The last argument to these intrinsics is a boolean flag, with the same
+functionality as described in the ``tile`` mode intrinsics above.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
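+For example, a 3-d ``max`` reduction in im2col mode, without a cache
+hint, could look like the following sketch with placeholder value
+names:
+
+.. code-block:: llvm
+
+  ; Sketch: reduce a 3-d im2col-mode tensor. The false flag selects
+  ; the variant without .L2::cache_hint, so the %ch operand (here 0)
+  ; is ignored.
+  call void @llvm.nvvm.cp.async.bulk.tensor.reduce.max.im2col.3d(
+      ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 0, i1 false)
+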
 Other Intrinsics
 ----------------
 