[MLIR][NVVM] Update mbarrier.arrive.* Op (#168758)
This patch updates the mbarrier.arrive.* family of Ops to include
all features added up to Blackwell.
* Update the `mbarrier.arrive` Op to include shared_cluster
memory space, cta/cluster scope and an option to lower using
relaxed semantics.
* An `arrive_drop` variant is added for both the `arrive` and
`arrive.nocomplete` operations.
* Updates for expect_tx and complete_tx operations.
* Verifier checks are added wherever appropriate.
* lit tests are added to verify the lowering to the intrinsics.
TODO:
* Updates for the remaining mbarrier family will be done in
subsequent PRs. (mainly, arrive.expect-tx, test_wait and try_waits)
Signed-off-by: Durgadoss R <[email protected]>
The `nvvm.mbarrier.expect_tx` operation increases the transaction count
+ of the mbarrier located at `addr` by `txcount` amount. The `scope`
+ specifies the set of threads that can directly observe the memory
+ synchronizing effect of the `mbarrier.expect_tx` operation. `CTA`
+ and `CLUSTER` are the only allowed values for `scope`.
+
+ [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-expect-tx)
The `nvvm.mbarrier.complete_tx` operation decrements the transaction
+ count of the *mbarrier object* at `addr` by `txcount`. It also signals
+ the completion of asynchronous transactions that were tracked by the
+ current phase. The `scope` specifies the set of threads that can directly
+ observe the memory synchronizing effect of the `mbarrier.complete_tx`
+ operation. `CTA` and `CLUSTER` are the only allowed values for `scope`.
+
+ [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-complete-tx)
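As a rough illustration of how `expect_tx` and `complete_tx` pair up, the following Python sketch models only the transaction count from the two descriptions. The `TxCounter` class and its method names are hypothetical and are not part of MLIR, NVVM, or PTX. Scope, memory space, and arrival counts are deliberately left out.

```python
class TxCounter:
    """Toy model of an mbarrier's transaction count (hypothetical, not an NVVM API)."""

    def __init__(self) -> None:
        # Transactions still outstanding in the current phase.
        self.tx_count = 0

    def expect_tx(self, txcount: int) -> None:
        # Mirrors the nvvm.mbarrier.expect_tx description: increase by txcount.
        self.tx_count += txcount

    def complete_tx(self, txcount: int) -> None:
        # Mirrors the nvvm.mbarrier.complete_tx description: decrease by txcount.
        self.tx_count -= txcount

    def all_transactions_done(self) -> bool:
        return self.tx_count == 0


bar = TxCounter()
bar.expect_tx(256)                        # announce 256 units of pending async work
bar.complete_tx(128)                      # first half completes
half_done = bar.all_transactions_done()
bar.complete_tx(128)                      # second half completes
fully_done = bar.all_transactions_done()
```

The model treats the count as a plain integer. On real hardware the operations also carry the memory synchronizing effects governed by `scope`, which this sketch does not attempt to represent.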
This operation causes the executing thread to signal its arrival at the barrier.
- The operation returns an opaque value that captures the phase of the
- *mbarrier object* prior to the arrive-on operation. The contents of this state
- value are implementation-specific.

- The operation takes the following operand:
+ - `res`: When the `space` is not shared_cluster, this operation returns an
+ opaque 64-bit value capturing the phase of the *mbarrier object* prior to
+ the arrive-on operation. The contents of this return value are
+ implementation-specific. An *mbarrier object* located in the shared_cluster
+ space cannot return a value.
+
+ The operation takes the following operands:
  - `addr`: A pointer to the memory location of the *mbarrier object*. The `addr`
- must be a pointer to generic or shared::cta memory. When it is generic, the
- underlying address must be within the shared::cta memory space; otherwise
- the behavior is undefined.
+ must be a pointer to generic or shared_cta or shared_cluster memory. When it
+ is generic, the underlying address must be within the shared_cta memory space;
+ otherwise the behavior is undefined.
+ - `count`: This specifies the amount by which the pending arrival count is
+ decremented. If the `count` argument is not specified, the pending arrival
+ count is decremented by 1.
+ - `scope`: This specifies the set of threads that directly observe the memory
+ synchronizing effect of the `mbarrier.arrive` operation.
+ - `space`: This indicates the memory space where the mbarrier object resides.
+ - `relaxed`: When set to true, the `arrive` operation has relaxed memory semantics
+ and does not provide any ordering or visibility guarantees.

  [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive)
  }];
- let assemblyFormat = "$addr attr-dict `:` type($addr) `->` type($res)";
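The `count` default and the "no result for shared_cluster" rule in the updated description can be sketched with a small Python model. Everything here is an assumption made for illustration: `ToyMBarrier` is a hypothetical class, the integer phase index stands in for the opaque 64-bit state, and the phase-completion rule (pending count reaches zero, then resets to the expected count) follows the PTX ISA description of mbarrier phases with the transaction count ignored.

```python
from typing import Optional


class ToyMBarrier:
    """Toy model of arrive-on behaviour (hypothetical, not an NVVM API)."""

    def __init__(self, expected: int) -> None:
        self.expected = expected  # arrivals required per phase
        self.pending = expected   # arrivals still missing in the current phase
        self.phase = 0            # stand-in for the opaque state value

    def arrive(self, count: int = 1, space: str = "shared_cta") -> Optional[int]:
        prior_phase = self.phase
        self.pending -= count     # `count` defaults to 1 when omitted
        if self.pending == 0:     # phase completes and the next one begins
            self.phase += 1
            self.pending = self.expected
        # A barrier in the shared_cluster space cannot return a value.
        return None if space == "shared_cluster" else prior_phase


bar = ToyMBarrier(expected=3)
t0 = bar.arrive()                        # pending: 3 -> 2
t1 = bar.arrive(count=2)                 # pending: 2 -> 0, phase completes
t2 = bar.arrive(space="shared_cluster")  # no result in this space
```

Both `t0` and `t1` capture the phase as it was before the respective arrive, which is why the second call still reports the old phase even though it is the call that completes it.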
The `nvvm.mbarrier.arrive_drop` operation decrements the expected arrival
+ count of the *mbarrier object* by `count` and then performs an arrive-on
+ operation. When `count` is not specified, it defaults to 1. The decrement
+ of the expected arrival count applies to all the subsequent phases of the
+ *mbarrier object*. The remaining semantics are identical to those of the
+ `nvvm.mbarrier.arrive` operation.
+
+ [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive-drop)
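To make the "applies to all the subsequent phases" point concrete, here is a minimal Python sketch. `DropModel` and its methods are hypothetical names, not an NVVM or PTX interface, and the model assumes a phase completes when the pending count reaches zero (transaction counts are ignored).

```python
class DropModel:
    """Toy model contrasting arrive with arrive_drop (hypothetical, not an NVVM API)."""

    def __init__(self, expected: int) -> None:
        self.expected = expected  # arrivals required per phase
        self.pending = expected   # arrivals still missing in the current phase
        self.phase = 0

    def _arrive_on(self, count: int) -> None:
        self.pending -= count
        if self.pending == 0:
            self.phase += 1
            self.pending = self.expected  # next phase uses the current expected count

    def arrive(self, count: int = 1) -> None:
        self._arrive_on(count)

    def arrive_drop(self, count: int = 1) -> None:
        self.expected -= count    # lowers the requirement for later phases
        self._arrive_on(count)    # then behaves like a normal arrive


bar = DropModel(expected=4)
bar.arrive_drop()                 # one participant leaves: expected 4 -> 3
for _ in range(3):
    bar.arrive()                  # remaining three participants arrive
```

After the loop the first phase has completed, and the next phase starts with a pending count of 3 rather than 4, since the dropped participant is no longer expected.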
let summary = "MBarrier Arrive-Drop No-Complete Operation";
+ let description = [{
+ The `nvvm.mbarrier.arrive_drop.nocomplete` operation decrements the expected
+ arrival count of the *mbarrier object* by the amount `count` and then performs
+ an arrive-on operation on the *mbarrier object* with the guarantee that it
+ will not cause the barrier to complete its current phase.
+
+ [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-arrive-drop)
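The no-complete guarantee can be read as a precondition on the caller. The sketch below expresses that reading in Python with a hypothetical helper, `arrive_drop_nocomplete`, that works on plain integers. Raising an error when the precondition is violated is purely a modeling choice; the sketch makes no claim about what hardware does in that case.

```python
from typing import Tuple


def arrive_drop_nocomplete(expected: int, pending: int, count: int) -> Tuple[int, int]:
    """Return the new (expected, pending) counts; hypothetical helper, not an NVVM API."""
    new_pending = pending - count
    if new_pending <= 0:
        # The operation must not complete the current phase; the model rejects such calls.
        raise ValueError("arrive would complete the current phase")
    # The expected arrival count drops by `count` as well.
    return expected - count, new_pending


counts = arrive_drop_nocomplete(expected=4, pending=4, count=1)

try:
    arrive_drop_nocomplete(expected=2, pending=1, count=1)
    rejected = False
except ValueError:
    rejected = True
```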