[mlir] Added Convergent trait that matches LLVM's semantics #152358

Open
wants to merge 3 commits into
base: main
Changes from 2 commits
44 changes: 27 additions & 17 deletions mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -14,12 +14,13 @@
#define NVVMIR_OPS

include "mlir/IR/EnumAttr.td"
include "mlir/Interfaces/ControlFlowInterfaces.td"
include "mlir/Interfaces/InferIntRangeInterface.td"
include "mlir/Interfaces/SideEffectInterfaces.td"
include "mlir/Dialect/GPU/IR/CompilationAttrInterfaces.td"
include "mlir/Dialect/LLVMIR/LLVMOpBase.td"
include "mlir/Dialect/LLVMIR/NVVMRequiresSMTraits.td"
include "mlir/Interfaces/SideEffectInterfaces.td"
include "mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td"
include "mlir/Interfaces/InferIntRangeInterface.td"
include "mlir/Dialect/LLVMIR/LLVMTypes.td"

def LLVM_PointerGeneric : LLVM_PointerInAddressSpace<0>;
@@ -105,9 +106,10 @@ class NVVM_Op<string mnemonic, list<Trait> traits = []> :
}

/// Base class that defines BasicPtxBuilderOpInterface.
class NVVM_PTXBuilder_Op<string mnemonic,
list<Trait> traits = [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]> :
LLVM_OpBase<NVVM_Dialect, mnemonic, traits> {
class NVVM_PTXBuilder_Op<string mnemonic, list<Trait> traits = []> :
LLVM_OpBase<NVVM_Dialect, mnemonic,
!listconcat(traits,
[DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>])> {
}

//===----------------------------------------------------------------------===//
@@ -561,7 +563,7 @@ def NVVM_MBarrierTestWaitSharedOp : NVVM_Op<"mbarrier.test.wait.shared">,
// NVVM synchronization op definitions
//===----------------------------------------------------------------------===//

def NVVM_Barrier0Op : NVVM_Op<"barrier0"> {
def NVVM_Barrier0Op : NVVM_Op<"barrier0", [Convergent]> {
let assemblyFormat = "attr-dict";
string llvmBuilder = [{
createIntrinsicCall(
@@ -570,8 +572,9 @@ def NVVM_Barrier0Op : NVVM_Op<"barrier0"> {
}];
}

def NVVM_BarrierOp : NVVM_Op<"barrier", [AttrSizedOperandSegments]> {
let arguments = (ins
def NVVM_BarrierOp : NVVM_Op<"barrier",
[Convergent, AttrSizedOperandSegments]> {
Member

This instruction can synchronize different thread counts. Could you clarify what convergent refers to in this context? I may be missing the nuance—does it mean a convergent warp or a convergent CTA?

nvvm.barrier 32
nvvm.barrier 128

Contributor Author

That is based on LLVM's assumption that the intrinsics are convergent.
See my other reply about keeping the convergent properties consistent.

let arguments = (ins
Optional<I32>:$barrierId,
Optional<I32>:$numberOfThreads);
string llvmBuilder = [{
@@ -598,7 +601,7 @@ def NVVM_BarrierOp : NVVM_Op<"barrier", [AttrSizedOperandSegments]> {
];
}

def NVVM_BarrierArriveOp : NVVM_PTXBuilder_Op<"barrier.arrive">
def NVVM_BarrierArriveOp : NVVM_PTXBuilder_Op<"barrier.arrive", [Convergent]>
{
let arguments = (ins Optional<I32>:$barrierId, I32:$numberOfThreads);

@@ -624,7 +627,7 @@ def NVVM_BarrierArriveOp : NVVM_PTXBuilder_Op<"barrier.arrive">
}];
}

def NVVM_ClusterArriveOp : NVVM_Op<"cluster.arrive"> {
def NVVM_ClusterArriveOp : NVVM_Op<"cluster.arrive", [Convergent]> {
let arguments = (ins OptionalAttr<UnitAttr>:$aligned);

let summary = "Cluster Barrier Arrive Op";
@@ -647,7 +650,8 @@ def NVVM_ClusterArriveOp : NVVM_Op<"cluster.arrive"> {
let assemblyFormat = "attr-dict";
}

def NVVM_ClusterArriveRelaxedOp : NVVM_Op<"cluster.arrive.relaxed", [NVVMRequiresSM<90>]> {
def NVVM_ClusterArriveRelaxedOp : NVVM_Op<"cluster.arrive.relaxed",
[Convergent, NVVMRequiresSM<90>]> {
let arguments = (ins OptionalAttr<UnitAttr>:$aligned);

let summary = "Cluster Barrier Relaxed Arrive Op";
@@ -673,7 +677,8 @@ def NVVM_ClusterArriveRelaxedOp : NVVM_Op<"cluster.arrive.relaxed", [NVVMRequire
let assemblyFormat = "attr-dict";
}

def NVVM_ClusterWaitOp : NVVM_Op<"cluster.wait", [NVVMRequiresSM<90>]> {
def NVVM_ClusterWaitOp : NVVM_Op<"cluster.wait",
[Convergent, NVVMRequiresSM<90>]> {
let arguments = (ins OptionalAttr<UnitAttr>:$aligned);

let summary = "Cluster Barrier Wait Op";
@@ -1054,7 +1059,8 @@ def NVVM_CpAsyncWaitGroupOp : NVVM_Op<"cp.async.wait.group">,
let assemblyFormat = "$n attr-dict";
}

def NVVM_CpAsyncMBarrierArriveOp : NVVM_Op<"cp.async.mbarrier.arrive"> {
def NVVM_CpAsyncMBarrierArriveOp : NVVM_Op<"cp.async.mbarrier.arrive",
[Convergent]> {
Collaborator

I'm not sure that these are actually convergent.
Actually, it's not clear to me that any of the post-Volta intrinsics need to be.

See this comment: https://discourse.llvm.org/t/llvm-convergence-semantics/77642/12

Contributor Author

Not sure I follow your comment. Are you referring to cp.async.mbarrier.arrive? If so, I don't see that it has been discussed in that comment/thread.
As for cp.async.mbarrier.arrive, it's marked in LLVM with IntrConvergent.

Collaborator

> If so, I don't see that it has been discussed in that comment/thread.

The comment is a general one, so it potentially applies to all the intrinsics :)

> As for cp.async.mbarrier.arrive, it's marked in LLVM with IntrConvergent.

I have low confidence in LLVM's current annotations as a reliable source of documentation right now.

Contributor Author

> I have low confidence in LLVM's current annotations as a reliable source of documentation right now.

That's a valid point. At the same time, having different traits set on an LLVM intrinsic and on the MLIR operation that is lowered to that intrinsic looks not only confusing, but stinky.
Ideally, for low-level dialects such as NVVM, it would be great to simply query that property from the intrinsic the op is lowered to, but that requires having an LLVMContext.

Specifically about convergent: my understanding is that if an intrinsic is marked with it but in reality is not convergent, that is only bad for performance; stability-wise the result is still correct.
But yeah, it would be great to go through all LLVM intrinsics and make sure the proper traits are used.

Contributor Author

@AlexMaclean @Prince781 it would be great to revisit NVVM's intrinsic properties. I know the PTX doc definitely describes nvvm_barrier_cta_sync_aligned_all as convergent, but I couldn't find such a statement for the others.

Collaborator (@joker-eph, Aug 7, 2025)

Yes, the ".aligned" specifier in PTX is what brings the convergence requirement.

But I'm not sure why cp.async.mbarrier.arrive for example would need to care about convergence?

Collaborator

> That's a valid point. At the same time, having different traits set on an LLVM intrinsic and on the MLIR operation that is lowered to that intrinsic looks not only confusing, but stinky.

Absolutely: we should fix LLVM ;)
The problem is that NVVM non-aligned intrinsics should be modeled with some sort of write effect to a private memory, and that could pessimize transformations (because special-casing this everywhere isn't something LLVM has been optimizing for).

let summary = "NVVM Dialect Op for cp.async.mbarrier.arrive";
let description = [{
The `cp.async.mbarrier.arrive` Op makes the mbarrier object track
@@ -1079,7 +1085,8 @@ def NVVM_CpAsyncMBarrierArriveOp : NVVM_Op<"cp.async.mbarrier.arrive"> {
}];
}

def NVVM_CpAsyncMBarrierArriveSharedOp : NVVM_Op<"cp.async.mbarrier.arrive.shared"> {
def NVVM_CpAsyncMBarrierArriveSharedOp : NVVM_Op<"cp.async.mbarrier.arrive.shared",
[Convergent]> {
let summary = "NVVM Dialect Op for cp.async.mbarrier.arrive.shared";
let description = [{
The `cp.async.mbarrier.arrive.shared` Op makes the mbarrier object
@@ -2806,7 +2813,8 @@ def NVVM_CpAsyncBulkSharedCTAToGlobalOp :
// NVVM Wgmma Ops
//===----------------------------------------------------------------------===//

def NVVM_WgmmaFenceAlignedOp : NVVM_Op<"wgmma.fence.aligned", [NVVMRequiresSMa<[90]>]> {
def NVVM_WgmmaFenceAlignedOp : NVVM_Op<"wgmma.fence.aligned",
[Convergent, NVVMRequiresSMa<[90]>]> {
let arguments = (ins);
let description = [{
Enforce an ordering of register accesses between warpgroup level matrix
@@ -2820,7 +2828,8 @@ def NVVM_WgmmaFenceAlignedOp : NVVM_Op<"wgmma.fence.aligned", [NVVMRequiresSMa<[
}];
}

def NVVM_WgmmaGroupSyncAlignedOp : NVVM_Op<"wgmma.commit.group.sync.aligned", [NVVMRequiresSMa<[90]>]> {
def NVVM_WgmmaGroupSyncAlignedOp : NVVM_Op<"wgmma.commit.group.sync.aligned",
[Convergent, NVVMRequiresSMa<[90]>]> {
let assemblyFormat = "attr-dict";
let description = [{
Commits all prior uncommitted warpgroup level matrix multiplication operations.
@@ -2832,7 +2841,8 @@ def NVVM_WgmmaGroupSyncAlignedOp : NVVM_Op<"wgmma.commit.group.sync.aligned", [N
}];
}

def NVVM_WgmmaWaitGroupSyncOp : NVVM_Op<"wgmma.wait.group.sync.aligned", [NVVMRequiresSMa<[90]>]> {
def NVVM_WgmmaWaitGroupSyncOp : NVVM_Op<"wgmma.wait.group.sync.aligned",
[Convergent, NVVMRequiresSMa<[90]>]> {
let arguments = (ins I64Attr:$group);
let assemblyFormat = "attr-dict $group";
let description = [{
6 changes: 5 additions & 1 deletion mlir/include/mlir/Interfaces/ControlFlowInterfaces.h
@@ -337,8 +337,12 @@ struct ReturnLike : public TraitBase<ConcreteType, ReturnLike> {
return success();
}
};
} // namespace OpTrait

// The Operation may not be made control-dependent on any additional values.
// See https://llvm.org/docs/ConvergentOperations.html for more details.
template <typename ConcreteType>
struct Convergent : public TraitBase<ConcreteType, Convergent> {};
Collaborator

Does MLIR's use of structured control flow have any impact on the support for convergence? In particular, aren't tokens inserted during lowering to a CFG to preserve loop structures, for example?

Contributor Author

The absence of this has no direct impact in MLIR itself, but this trait is a prerequisite for properly annotating functions that invoke such operations. Otherwise LLVM will apply optimizations incorrectly.
Specifically, the use case I saw was:

fn callee() {
  nvvm.barrier0
}

fn caller() {
  if (condition) {
    call callee()
  }
}

callee has to be marked with the convergent attribute; otherwise, if jump threading is applied, the code becomes invalid.
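
To make the callee/caller case concrete, here is a minimal standalone sketch of how a per-op convergent flag could drive function-level annotation. It is a toy model, not the MLIR or LLVM API: `ToyOp`, `ToyModule`, `inferConvergentFunctions`, and `makeBarrierExample` are hypothetical names, and propagating the attribute transitively to callers is an assumption of the sketch rather than something the PR implements.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model, NOT the MLIR or LLVM API. An op carries a flag that stands in
// for the Convergent trait; a call names its callee.
struct ToyOp {
  std::string name;   // e.g. "nvvm.barrier0" or "call"
  bool convergent;    // models the Convergent trait on the op
  std::string callee; // non-empty only for calls
};

// A module maps function names to their (flattened) list of ops.
using ToyModule = std::map<std::string, std::vector<ToyOp>>;

// Returns the functions that would need the convergent attribute: those that
// contain a convergent op directly, or that call such a function.
std::set<std::string> inferConvergentFunctions(const ToyModule &module) {
  std::set<std::string> result;
  bool changed = true;
  while (changed) { // iterate to a fixed point over the call graph
    changed = false;
    for (const auto &entry : module) {
      if (result.count(entry.first) > 0)
        continue;
      for (const ToyOp &op : entry.second) {
        bool viaCall = !op.callee.empty() && result.count(op.callee) > 0;
        if (op.convergent || viaCall) {
          result.insert(entry.first);
          changed = true;
          break;
        }
      }
    }
  }
  return result;
}

// The callee/caller example from the comment above, plus an unrelated
// function that should stay unannotated.
ToyModule makeBarrierExample() {
  ToyModule m;
  m["callee"] = {ToyOp{"nvvm.barrier0", true, ""}};
  m["caller"] = {ToyOp{"call", false, "callee"}};
  m["unrelated"] = {ToyOp{"arith.addi", false, ""}};
  return m;
}
```

In this model `callee` is flagged because it contains the barrier, and `caller` is flagged because it calls `callee`. A real implementation would query the trait on the operation instead of a boolean field.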

Member

> Otherwise LLVM will apply optimizations incorrectly.

For each Op that MLIR generates, LLVM sets convergency. So LLVM won't do incorrect optimization for MLIR.

Collaborator

> For each Op that MLIR generates, LLVM sets convergency.

Are you sure we would do it inter-procedurally and annotate the actual call here?
Or are we relying on some attribute propagation later?

@npanchen : when mapping a structured control-flow to a CFG, just annotating things as "convergent" isn't enough I believe. See this example: https://youtu.be/_Z5DuiVCFAw?t=434 ; basically structured control-flow needs to say "something" about the reconvergence property and the management of anchors (the video is amazing for explaining all this I think!)

Contributor Author

The convergent property is only set on the intrinsic, not on an arbitrary function that calls that intrinsic. Since LLVM does not infer that property automatically, that can result in undesired optimizations.
Here is an example showing that jump threading (JT) behaves differently for the case I described above: https://godbolt.org/z/9T4d7MM58. Note that the IR there looks OK; the problem I observed is similar, but involves more complex control flow.

Also, the MLIR trait seems generally useful to prevent, say, loop multiversioning by trip count if the loop contains such an operation.
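
As a rough illustration of the loop multiversioning point, the sketch below shows the kind of legality check a transformation could perform before duplicating a loop. It is a toy model with hypothetical names (`ToyNode`, `containsConvergentOp`, `mayMultiversionLoop`), not MLIR code; in MLIR the boolean field would presumably be replaced by a query of the trait on each operation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy IR node, NOT an MLIR class: an op with a flag standing in for the
// Convergent trait and the ops nested inside its regions.
struct ToyNode {
  std::string name;
  bool convergent;
  std::vector<ToyNode> nested;
};

// Recursively checks whether the op or anything nested in it is convergent.
bool containsConvergentOp(const ToyNode &node) {
  if (node.convergent)
    return true;
  for (const ToyNode &child : node.nested)
    if (containsConvergentOp(child))
      return true;
  return false;
}

// Conservative legality check: refuse to multiversion a loop whose body
// contains a convergent op, since the copies would be guarded by a new
// condition.
bool mayMultiversionLoop(const ToyNode &loop) {
  return !containsConvergentOp(loop);
}

// A loop with a barrier nested under a conditional.
ToyNode makeLoopWithBarrier() {
  ToyNode barrier{"nvvm.barrier0", true, {}};
  ToyNode branch{"scf.if", false, {barrier}};
  return ToyNode{"scf.for", false, {branch}};
}

// A loop with no convergent ops.
ToyNode makePlainLoop() {
  ToyNode add{"arith.addi", false, {}};
  return ToyNode{"scf.for", false, {add}};
}
```

The check is deliberately conservative: it rejects any loop that contains a convergent op at any nesting depth, which may block transformations that would in fact be safe.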

Contributor Author

> @npanchen : when mapping a structured control-flow to a CFG, just annotating things as "convergent" isn't enough I believe. See this example: https://youtu.be/_Z5DuiVCFAw?t=434 ; basically structured control-flow needs to say "something" about the reconvergence property and the management of anchors (the video is amazing for explaining all this I think!)

Thanks, that's quite an interesting video. I was actually unable to find where convergence.loop/convergence.entry/convergence.anchor are generated. Are they even still in use?
If they are, then yes, that might be more interesting from a design point of view. As of now I can only think of adding a RecursivelyConvergent trait, but I'm not sure that's a good idea for functions.

A complete side note: there's also the noduplicate property, which doesn't seem to be used much, but it's not supported by MLIR and it directly impacts the inliner.

Collaborator

They are emitted by clang, search for shouldEmitConvergenceTokens() there, you'll see the kind of things I have in mind for our structured-to-cfg needs (and actually: a semantics definition of convergence guarantee on our structured control-flow)

Contributor Author

I see. I was not searching right.

} // namespace OpTrait
} // namespace mlir

//===----------------------------------------------------------------------===//
3 changes: 3 additions & 0 deletions mlir/include/mlir/Interfaces/ControlFlowInterfaces.td
@@ -511,4 +511,7 @@ def ReturnLike : TraitList<[
>
]>;

// Op is "convergent".
def Convergent : NativeOpTrait<"Convergent">;

#endif // MLIR_INTERFACES_CONTROLFLOWINTERFACES
5 changes: 5 additions & 0 deletions mlir/test/lib/Dialect/Test/TestOps.td
@@ -2113,6 +2113,11 @@ def TestTypeChangerOp : TEST_Op<"type_changer">,
def TestValidOp : TEST_Op<"valid", [Terminator]>,
Arguments<(ins Variadic<AnyType>)>;

def TestConvergentOp : TEST_Op<"convergent", [Convergent]> {
let arguments = (ins AnyType);
let results = (outs AnyType);
}
Collaborator

What is this doing right now? Without an actual test exercising this op, adding an op isn't useful.


def TestMergeBlocksOp : TEST_Op<"merge_blocks"> {
let summary = "merge_blocks operation";
let description = [{