[RISCV][TTI] Enable masked interleave access for scalable vector #149981

Merged
merged 2 commits into from
Jul 25, 2025
10 changes: 6 additions & 4 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -979,10 +979,12 @@ InstructionCost RISCVTTIImpl::getInterleavedMemoryOpCost(
Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
bool UseMaskForCond, bool UseMaskForGaps) const {

- // The interleaved memory access pass will lower interleaved memory ops (i.e
- // a load and store followed by a specific shuffle) to vlseg/vsseg
- // intrinsics.
- if (!UseMaskForCond && !UseMaskForGaps &&
+ // The interleaved memory access pass will lower (de)interleave ops combined
+ // with an adjacent appropriate memory to vlseg/vsseg intrinsics. vlseg/vsseg
+ // only support masking per-iteration (i.e. condition), not per-segment (i.e.
+ // gap).
+ // TODO: Support masked interleaved access for fixed length vector.
+ if ((isa<ScalableVectorType>(VecTy) || !UseMaskForCond) && !UseMaskForGaps &&
Collaborator

FYI, I've landed the last change for the fixed vector path. You can delete UseMaskForCond entirely if you want; if not, I'll do it in a follow-up review.

Factor <= TLI->getMaxSupportedInterleaveFactor()) {
auto *VTy = cast<VectorType>(VecTy);
std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(VTy);
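
As a rough illustration of the guard change in this hunk, the old and new conditions can be modeled as two plain predicates. This is only a sketch: `InterleaveQuery`, `takesSegmentPathOld`, and `takesSegmentPathNew` are hypothetical names, `IsScalableVector` stands in for `isa<ScalableVectorType>(VecTy)`, and `MaxFactor` stands in for `TLI->getMaxSupportedInterleaveFactor()`.

```cpp
#include <cassert>

// Hypothetical stand-in for the inputs of the guard; not the LLVM API.
struct InterleaveQuery {
  bool IsScalableVector; // models isa<ScalableVectorType>(VecTy)
  bool UseMaskForCond;   // per-iteration (condition) mask requested
  bool UseMaskForGaps;   // per-segment (gap) mask requested
  unsigned Factor;       // interleave factor
};

// Guard before the patch: any kind of masking rejects the vlseg/vsseg path.
bool takesSegmentPathOld(const InterleaveQuery &Q, unsigned MaxFactor) {
  assert(Q.Factor >= 2 && "an interleave group has at least two members");
  return !Q.UseMaskForCond && !Q.UseMaskForGaps && Q.Factor <= MaxFactor;
}

// Guard after the patch: a condition mask is accepted for scalable vectors,
// while a gap mask is still rejected for every vector type.
bool takesSegmentPathNew(const InterleaveQuery &Q, unsigned MaxFactor) {
  assert(Q.Factor >= 2 && "an interleave group has at least two members");
  return (Q.IsScalableVector || !Q.UseMaskForCond) && !Q.UseMaskForGaps &&
         Q.Factor <= MaxFactor;
}
```

Under this model, a scalable-vector query with a condition mask and no gap mask is rejected by the old predicate and accepted by the new one; fixed-length queries behave the same in both.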
4 changes: 4 additions & 0 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -398,6 +398,10 @@ class RISCVTTIImpl final : public BasicTTIImplBase<RISCVTTIImpl> {

bool enableInterleavedAccessVectorization() const override { return true; }

+ bool enableMaskedInterleavedAccessVectorization() const override {
+   return ST->hasVInstructions();
+ }

unsigned getMinTripCountTailFoldingThreshold() const override;

enum RISCVRegisterClass { GPRRC, FPRRC, VRRC };
2 changes: 2 additions & 0 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1359,7 +1359,9 @@ class LoopVectorizationCostModel {
return;
// Override EVL styles if needed.
// FIXME: Investigate opportunity for fixed vector factor.
+ // FIXME: Support interleave accesses.
bool EVLIsLegal = UserIC <= 1 && IsScalableVF &&
+ !InterleaveInfo.hasGroups() &&
Contributor

The timing of these patches is a bit unfortunate, but I just LGTM'd #150074, since I think we don't actually need the EVLIsLegal check. It should be OK to keep the mask around for now, and we can optimise it away to a VP intrinsic later.

I guess it looks like we also need to account for the fact that we don't handle masked.{load,store} for shufflevector [de]interleaves?

In any case we still need to do the actual VP intrinsic transform, so I will create an issue on GitHub for that.

Contributor Author

I’ve completed the simplest initial implementation of VPInterleaveEVLRecipe locally, but I found it couldn’t be converted into an EVL recipe. That’s when I realized that masked interleave access hasn’t been enabled upstream yet.

However, I believe enabling both tail folding by mask and EVL at the same time is unsafe.
Consider this scenario: the VF is 4, and the trip count is 5. With EVL, the vectorized loop gets two iterations, fetching 3 and 2 elements (EVL) respectively. The VPWidenInductionRecipe with EVL would then produce [0, 1, 2, 3(X)] and [3, 4, 5(X), 6(X)], where the lanes marked (X) are the lanes that should not be used.

If there is an interleaved store with factor 2 using the values produced by the VPWidenInductionRecipe, and we apply the interleaved masks [T, T, T, T, T, T, T, T] and [T, T, F, F, F, F, F, F], we might end up storing incorrect values like [0, 0, 1, 1, 2, 2, 3(X), 3(X)] and [3, 3, X, X, X, X, X, X].

Wouldn’t that lead to storing incorrect values in memory?

Contributor

IIUC the mask is interleaved with itself before being fed into the widen load, i.e. see these lines: https://github.com/llvm/llvm-project/pull/150074/files#diff-8b722352f6c665b720f4318006f7ae773cfaddd722fd211fdcf8ef18d8d115dcR42-R43

So the widen load's mask is actually interleave([T, T, T, F], [T, T, T, F]) = [T, T, T, T, T, T, F, F] on the first iteration, and interleave([T, T, F, F], [T, T, F, F]) = [T, T, T, T, F, F, F, F] on the second iteration

The interleaved access pass then checks to make sure that the mask is actually interleaved before deinterleaving it again for the vlseg/vsseg intrinsic:

static Value *getMask(Value *WideMask, unsigned Factor,
                      ElementCount LeafValueEC) {
  if (auto *IMI = dyn_cast<IntrinsicInst>(WideMask)) {
    if (unsigned F = getInterleaveIntrinsicFactor(IMI->getIntrinsicID());
        F && F == Factor && llvm::all_equal(IMI->args())) {
      return IMI->getArgOperand(0);
    }
  }

So it should emit a vlseg with mask [T, T, T, F] on the first iteration, and [T, T, F, F] on the second iteration
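
The round trip described above can be sketched at the level of lane values. This is only a model under stated assumptions: `Mask`, `interleaveWithSelf`, and `recoverLeafMask` are hypothetical helpers, and the real `getMask` checks structurally that the wide mask is an interleave intrinsic whose operands are all the same value, rather than comparing lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<bool>;

// Interleave Factor copies of Leaf: lane i of Leaf fills lanes
// [i * Factor, (i + 1) * Factor) of the wide mask.
Mask interleaveWithSelf(const Mask &Leaf, unsigned Factor) {
  assert(Factor >= 2 && "an interleave group has at least two members");
  Mask Wide;
  Wide.reserve(Leaf.size() * Factor);
  for (bool Lane : Leaf)
    for (unsigned F = 0; F < Factor; ++F)
      Wide.push_back(Lane);
  return Wide;
}

// Recover the per-segment mask, but only if all Factor lanes of every
// segment agree. Returns false (and leaves Leaf unspecified) otherwise,
// which models a per-segment "gap" mask that vlseg/vsseg cannot express.
bool recoverLeafMask(const Mask &Wide, unsigned Factor, Mask &Leaf) {
  if (Factor == 0 || Wide.size() % Factor != 0)
    return false;
  Leaf.clear();
  for (std::size_t I = 0; I < Wide.size(); I += Factor) {
    for (unsigned F = 1; F < Factor; ++F)
      if (Wide[I + F] != Wide[I])
        return false;
    Leaf.push_back(Wide[I]);
  }
  return true;
}
```

For example, `interleaveWithSelf({T, T, T, F}, 2)` yields `[T, T, T, T, T, T, F, F]`, and `recoverLeafMask` on that result with factor 2 returns the original `[T, T, T, F]`.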

Contributor Author

Hmm, although I previously said the mask might be incorrect, it’s actually different from what you described.

; IF-EVL-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[EVL_BASED_IV]], i64 0
; IF-EVL-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
; IF-EVL-NEXT:    [[TMP10:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
; IF-EVL-NEXT:    [[TMP12:%.*]] = add <vscale x 4 x i64> zeroinitializer, [[TMP10]]
; IF-EVL-NEXT:    [[VEC_IV:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT2]], [[TMP12]]
; IF-EVL-NEXT:    [[TMP13:%.*]] = icmp ule <vscale x 4 x i64> [[VEC_IV]], [[BROADCAST_SPLAT]]

In reality, the two masks should be: [T, T, T, T, T, T, T, T] and [T, T, T, T, F, F, F, F].

For the first iteration:
([0, 0, 0, 0] + [0, 1, 2, 3]) <= [4, 4, 4, 4], where 4 is TripCount - 1.

For the second iteration:
([3, 3, 3, 3] + [0, 1, 2, 3]) <= [4, 4, 4, 4].

Is that correct?
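
A small lane-wise model of the compare in the IR above may make the arithmetic easier to check. It is a sketch only: `headerMask` is a hypothetical helper, and it assumes vscale = 1 so that the vector has exactly 4 lanes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lane-wise model of
//   icmp ule (splat(EvlBasedIV) + stepvector), splat(BackedgeTC)
// where BackedgeTC is TripCount - 1.
std::vector<bool> headerMask(std::uint64_t EvlBasedIV,
                             std::uint64_t BackedgeTC, unsigned VF) {
  assert(VF > 0 && "need at least one lane");
  std::vector<bool> M(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    M[Lane] = EvlBasedIV + Lane <= BackedgeTC;
  return M;
}
```

With a trip count of 5, `headerMask(0, 4, 4)` gives `[T, T, T, T]` and `headerMask(3, 4, 4)` gives `[T, T, F, F]`. Interleaving each with itself for factor 2 gives the two 8-lane masks quoted above.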

Contributor

I think you're right. This looks like an underlying issue with mixing the header masks and EVL based IV.

I think we need to convert the header masks from icmp ule wide-canonical-iv, backedge-tc to icmp ult step-vector, EVL.

Because on the second-to-last iteration the original header mask isn't going to take into account the possible truncation.

So on the first iteration, it should really be:

[0, 1, 2, 3] < 3 = [T, T, T, F]

And on the second iteration

[0, 1, 2, 3] < 2 = [T, T, F, F].
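
The proposed form of the header mask, `icmp ult step-vector, EVL`, can be modeled the same way. Again this is only a sketch with a hypothetical helper name (`evlMask`) and a fixed lane count of 4.

```cpp
#include <cassert>
#include <vector>

// Lane-wise model of: icmp ult stepvector, splat(EVL).
// Only the first EVL lanes are active, independent of the induction variable.
std::vector<bool> evlMask(unsigned EVL, unsigned VF) {
  assert(EVL <= VF && "EVL cannot exceed the number of lanes");
  std::vector<bool> M(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    M[Lane] = Lane < EVL;
  return M;
}
```

Here `evlMask(3, 4)` gives `[T, T, T, F]` and `evlMask(2, 4)` gives `[T, T, F, F]`, matching the two iterations above.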

I'll create an issue for this, this is a good find.

Contributor Author

; IF-EVL-NEXT: [[TMP9:%.*]] = getelementptr inbounds [2 x i32], ptr [[B:%.*]], i64 [[EVL_BASED_IV]], i32 0

But thinking about it more carefully: since the GEP has already been adjusted based on the EVL-based PHI, the final store result in the example should still be correct, even though the header mask is not in sync with the EVL. It's just that addresses which were originally written to only once may now be written to multiple times. The value written the first time might be incorrect, but it will eventually be overwritten with the correct value. However, I'm not sure if this could cause other issues. In any case, I still recommend avoiding mixing tail folding modes; it doesn’t seem entirely safe.
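
To make the "written multiple times" point concrete, here is a minimal simulation that counts how often each i32 slot of the `[N x 2]` array is stored to. It is a sketch under assumptions: `countWrites` is a hypothetical helper, the interleave factor is fixed at 2, the per-iteration EVLs are passed in explicitly (3 and 2 in the scenario above), and each active lane is assumed to store to the segment addressed by `IV + Lane`.

```cpp
#include <cassert>
#include <vector>

// Count the stores to each slot of a [TripCount x 2] array for a factor-2
// interleaved store. UseEVLMask = false models the stale header mask
// (IV + Lane <= TripCount - 1); true models the mask (Lane < EVL).
std::vector<unsigned> countWrites(const std::vector<unsigned> &EVLs,
                                  unsigned VF, unsigned TripCount,
                                  bool UseEVLMask) {
  assert(TripCount > 0 && "the loop must execute at least once");
  const unsigned Factor = 2;
  std::vector<unsigned> Writes(TripCount * Factor, 0);
  unsigned IV = 0; // EVL-based induction variable
  for (unsigned EVL : EVLs) {
    assert(EVL <= VF && IV + EVL <= TripCount);
    for (unsigned Lane = 0; Lane < VF; ++Lane) {
      bool Active = UseEVLMask ? Lane < EVL : IV + Lane <= TripCount - 1;
      if (!Active)
        continue;
      // The base address follows the EVL-based IV, so lane L hits segment
      // IV + L, and each segment covers Factor consecutive slots.
      for (unsigned F = 0; F < Factor; ++F)
        ++Writes[(IV + Lane) * Factor + F];
    }
    IV += EVL; // advance by the number of elements actually processed
  }
  return Writes;
}
```

With `countWrites({3, 2}, 4, 5, false)` the two slots of segment 3 are each written twice, once per vector iteration, while `countWrites({3, 2}, 4, 5, true)` writes every slot exactly once.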

Collaborator

> It's just that addresses which were originally written to only once may now be written to multiple times. The value written the first time might be incorrect, but it will eventually be overwritten with the correct value.

Just FYI, this reasoning is suspect. Introducing a duplicate write to the same location is possibly a violation of the memory model. I'd have to think that through carefully to see if it is, but it's definitely undesirable if we can reasonably avoid it.

(Topic switch)

If I'm following this thread correctly, the issue being theorized here is a problem for any interleave group with EVL, right? Not just the masked variants?

Collaborator

@preames preames Jul 23, 2025

Replying to myself after reading more context - yeah, this is a generic EVL bug.

@Mel-Chen Would you mind separating your EVLIsLegal change into a separate review with a test case which shows the bug? I'd like to get EVL disabled for this case while Luke's correctness change works its way through review.

Edit: Once we do that, I can enable the masked interleave support (for what ends up being masking and non-predicated only), then we can remove the limitation once the root issue is properly fixed.

Collaborator

I was thinking about this later today, and realized I'd missed the obvious here. While the bug is theoretically reachable through other paths, the combination of predication and interleave groups currently only happens with this patch to enable the predication. EVL without this patch can't hit this code path; I'd missed that when thinking about this earlier.

@Mel-Chen I'm tempted to just take your patch as the primary path forward here towards enabling masked interleave. Any concerns with that? On reflection, I don't know that the splitting I'd suggested earlier is actually worthwhile.

Contributor Author

Oh of course, no problem! I pulled in some changes I thought were really good from #150074, rewrote a new description based on your commit message, and listed you as a co-author.
4688b4c

TTI.hasActiveVectorLength() && !EnableVPlanNativePath;
if (EVLIsLegal)
return;