Skip to content

[Blackwell] Fix thrown away load due to wrong wait placement#9636

Open
LunNova wants to merge 1 commit intotriton-lang:mainfrom
LunNova:lunnova/blackwell-wait-placement
Open

[Blackwell] Fix thrown away load due to wrong wait placement#9636
LunNova wants to merge 1 commit intotriton-lang:mainfrom
LunNova:lunnova/blackwell-wait-placement

Conversation

@LunNova
Copy link
Contributor

@LunNova LunNova commented Mar 4, 2026

A faulty check line with no -COUNT- was masking bad codegen. I think.

Provisionally, fixes #9635. I am not very familiar with this area but trying my best! Feel free to close, edit with maintainer access or whatever.

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • This PR does not need a test because Changing existing broken test & codegen.

  • I have not added any lit tests.

Copy link
Collaborator

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

Comment on lines +536 to +560
// Combine partial reductions into one value per thread.
// This must happen after tcgen05.wait since tcgen05.ld.red is async and
// the redval registers are not valid until the wait completes.
if (redvalVals.size() > 1) {
auto isMin = *redOp == TMEMLoadReduceModifier::MIN;
auto applyMinMax = [&](Value lhs, Value rhs) {
return useNaN
? (isMin ? LLVM::MinimumOp::create(rewriter, loc, lhs, rhs)
: LLVM::MaximumOp::create(rewriter, loc, lhs, rhs))
->getResult(0)
: (isMin ? LLVM::MinNumOp::create(rewriter, loc, lhs, rhs)
: LLVM::MaxNumOp::create(rewriter, loc, lhs, rhs))
->getResult(0);
};
// Use tree reduction: pair up elements at each level
while (redvalVals.size() > 1) {
SmallVector<Value> reduced;
assert(redvalVals.size() % 2 == 0 &&
"redvalVals must be a multiple of 2");
for (size_t i = 0; i < redvalVals.size(); i += 2) {
reduced.push_back(applyMinMax(redvalVals[i], redvalVals[i + 1]));
}
redvalVals = std::move(reduced);
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you move that into a separate function?

@masahi masahi requested a review from 3gx March 4, 2026 08:28
Copy link
Contributor

@lezcano lezcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this makes sense.

See Thomas' review, but otherwise LGTM

A faulty check line with no -COUNT- was masking bad codegen.
@LunNova LunNova force-pushed the lunnova/blackwell-wait-placement branch from b0a7b19 to d4fa9a9 Compare March 5, 2026 15:24
@LunNova LunNova marked this pull request as ready for review March 5, 2026 15:25
@LunNova LunNova requested a review from ptillet as a code owner March 5, 2026 15:25
Copy link
Collaborator

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Blackwell] tritongpu_to_llvm_blackwell tmem_load tests have bogus CHECK, may have codegen bug

3 participants