Conversation

@andcarminati (Collaborator) commented Nov 6, 2025:

In this case, we can load the scalar value directly instead of building a full vector just to extract one element from it (the legalizer will scalarize this load anyway).

This combiner was motivated by the following real case (there are more pathological cases, related to <32 x s6> for example):

name:            test
alignment:       16
legalized:       false
tracksRegLiveness: true
body:             |
  bb.1:
  liveins: $p0

    %0:_(p0) = COPY $p0
    %14:_(<8 x s16>) = G_LOAD %0(p0) :: (dereferenceable load (<8 x s16>), align 2)
    %15:_(<16 x s8>) = G_BITCAST %14(<8 x s16>)
    %17:_(s32) = G_CONSTANT i32 4
    %16:_(s8) = G_EXTRACT_VECTOR_ELT %15(<16 x s8>), %17(s32)
    %18:_(s32) = G_ZEXT %16(s8)
    %158:_(<64 x s8>) = G_AIE_PAD_VECTOR_UNDEF %15(<16 x s8>)
    %23:_(s32) = G_CONSTANT i32 2
    %156:_(s32) = G_AIE_ZEXT_EXTRACT_VECTOR_ELT %158(<64 x s8>), %23(s32)
    PseudoRET implicit $lr, implicit %18, implicit %156

...

Peano's current final code for this case is:

test:                                   // @test
// %bb.0:
	lda.s16	 r0, [p0], #2
	lda.s16	 r1, [p0], #2
	lda.s16	 r2, [p0], #2
	lda.s16	 r3, [p0], #2
	lda.s16	 r4, [p0], #2
	lda.s16	 r5, [p0], #2
	lda.s16	 r6, [p0, #0]
	lda.s16	 r7, [p0, #2];		vpush.hi.16	 x0, x0, r0
	vpush.hi.16	 x0, x0, r1
	vpush.hi.16	 x0, x0, r2
	vpush.hi.16	 x0, x0, r3
	vpush.hi.16	 x0, x0, r4
	vpush.hi.16	 x0, x0, r5
	ret	lr;		vpush.hi.16	 x0, x0, r6
	mova	r0, #48;		vpush.hi.16	 x0, x0, r7 //  Delay Slot 5
	vshift	x0, x0, x0, r0                  //  Delay Slot 4
	vextract.8	 r0, x0, #4, vaddsign0  //  Delay Slot 3
	vextract.8	 r1, x0, #2, vaddsign0  //  Delay Slot 2
	nop	                                //  Delay Slot 1

With this PR:

test:                                   // @test
// %bb.0:
	lda.u8	 r0, [p0, #4];		nopb	;		nopxm	;		nops	
	lda.u8	 r1, [p0, #2]
	ret	lr
	nop	                                //  Delay Slot 5
	nop	                                //  Delay Slot 4
	nop	                                //  Delay Slot 3
	nop	                                //  Delay Slot 2
	nop	                                //  Delay Slot 1
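
At the MIR level, the combiner rewrites the extract into roughly the following (a sketch assembled from the pattern comment quoted further down in this review; register names are invented, the s20 offset type matches the S20 scalar used in the patch, and the alignment follows the generic pattern comment):

  %0:_(p0) = COPY $p0
  %off:_(s20) = G_CONSTANT i20 4
  %new_ptr:_(p0) = G_PTR_ADD %0(p0), %off(s20)
  %elt:_(s8) = G_LOAD %new_ptr(p0) :: (dereferenceable load (s8), align 1)
  %res:_(s32) = G_ZEXT %elt(s8)

The selector can then turn this into the single lda.u8 seen above.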

@andcarminati (Collaborator, Author) commented:
QoR results:
Core_Insn_Count
Core_PMSize_absolute
Core_StackSize_absolute

Regressions: HardswishAsHardsigmoid_aie2_0 and Hardswish_aie2_0. These come from a regalloc side effect: one spill is killing the SWP (software pipelining).

const bool IsSExtExtract = (Opcode == SExtExtractOpcode);
const bool IsPlainExtract = (Opcode == TargetOpcode::G_EXTRACT_VECTOR_ELT);

if (!IsZExtExtract && !IsSExtExtract && !IsPlainExtract)

Collaborator: It looks as if the pattern's opcode check precludes this case?


// Get the index operand
const Register IdxReg = MI.getOperand(2).getReg();
auto IdxCst = getIConstantVRegValWithLookThrough(IdxReg, MRI);

Collaborator: const?

if (MMO->getAlign().value() >= LoadVecSizeInBytes)
  return false;

const unsigned ElemSizeInBytes = ElemSize / 8;

Collaborator: Declare closer to its first use.

const Register PtrReg = LoadMI->getOperand(1).getReg();
const LLT S20 = LLT::scalar(20);

// Calculate byte offset: Index * ElemSizeInBytes

Collaborator: I think this comment is redundant.

/// %new_ptr:_(p0) = G_PTR_ADD %ptr, %offset
/// %elt:_(sX) = G_LOAD %new_ptr :: (align 1)
/// %result:_(s32) = G_[Z/S]EXT %elt
bool llvm::matchUnalignedExtractLoad(MachineInstr &MI, MachineRegisterInfo &MRI,

Collaborator: Can we have a more descriptive name for MI? ExtractMI?


// Set insertion point right after the original vector load
if (LoadMI->getNextNode())
  B.setInstr(*LoadMI->getNextNode());

Collaborator: Wouldn't std::next(iterator(LoadMI)) always be valid for setInsertPt()?
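
Presumably something like this (a sketch of the suggestion, assuming LoadMI is the MachineInstr pointer from the surrounding code; MachineIRBuilder::setInsertPt takes a block and an iterator):

  // std::next of the load's iterator is always valid (end() at worst),
  // so the null check on getNextNode() becomes unnecessary.
  B.setInsertPt(*LoadMI->getParent(),
                std::next(MachineBasicBlock::iterator(LoadMI)));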

// alignment using GCD to find the maximum provable alignment
const unsigned OrigAlign = MMO->getAlign().value();
const unsigned ScalarAlign =
    ByteOffset == 0 ? OrigAlign : std::gcd(OrigAlign, (unsigned)ByteOffset);

Collaborator: I think this could be written as std::gcd(OrigAlign, OrigAlign + ByteOffset);
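
Both spellings agree, since gcd(a, a + b) = gcd(a, b), and because std::gcd(a, 0) == a the ByteOffset == 0 special case folds away in either form. An illustrative compile-time check (not part of the patch):

  #include <numeric>

  static_assert(std::gcd(16u, 16u + 6u) == std::gcd(16u, 6u)); // both are 2
  static_assert(std::gcd(16u, 0u) == 16u); // offset 0 keeps the original align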

// Handle the result based on the original opcode
if (IsZExtExtract || IsSExtExtract) {
  // Need to extend to s32
  const Register DstReg = MI.getOperand(0).getReg();

Collaborator: This can be hoisted; it's necessary for all cases. Then we can do

  if (IsZExtExtract) {
  } else if (IsSExtExtract) {
  } else {
  }
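
Spelled out, the suggestion might look like this (a sketch; NarrowLoad is an assumed name for the new scalar load's result register):

  const Register DstReg = MI.getOperand(0).getReg(); // hoisted: needed by all cases
  if (IsZExtExtract) {
    B.buildZExt(DstReg, NarrowLoad);
  } else if (IsSExtExtract) {
    B.buildSExt(DstReg, NarrowLoad);
  } else {
    B.buildCopy(DstReg, NarrowLoad);
  }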

@andcarminati (Author): The second set of extracts can actually be removed.

@andcarminati force-pushed the andreu.unaligned.load.extract branch from 4838126 to e684035 on November 13, 2025 10:51
@andcarminati changed the title from "[AIEX] Add a combiner to handle extract from unaligned vector load" to "[AIEX] Add combiners to optimize unaligned vector loads" on Nov 13, 2025
@andcarminati (Author) commented:

Another combiner was included to improve scalarization of unaligned vector loads.

unsigned NewElemSize = 0;
if (Alignment >= 8 && ElemSize < 64) {
  NewElemSize = 64;
} else if (Alignment >= 4 && ElemSize < 32) {

Collaborator: We will only arrive here with ElemSize 8 or 16, both smaller than 32 and 64.

@andcarminati (Author): Sure, this will be simplified.
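
The simplification could look like this (a sketch under the reviewer's observation that ElemSize is only ever 8 or 16 here, which makes the ElemSize guards vacuous; per the thread below, the 64-bit case was later dropped entirely):

  unsigned NewElemSize = 0;
  if (Alignment >= 8)
    NewElemSize = 64;
  else if (Alignment >= 4)
    NewElemSize = 32;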

@andcarminati force-pushed the andreu.unaligned.load.extract branch from e684035 to 20e7808 on November 13, 2025 12:37

// Check if the vector size is compatible with the new element size
const unsigned VecSizeInBits = DstTy.getSizeInBits();
if (VecSizeInBits % NewElemSize != 0)

Collaborator: We may be discarding 64 while 32 or 16 might still work.

@andcarminati (Author): We are not handling 64 anymore.
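
For illustration, the fallback the reviewer describes could be sketched as follows (assumed code, not the PR's; with 64 dropped, only 32 and 16 remain as candidates):

  // Try wider element sizes first; fall back to a narrower one that still
  // divides the vector size evenly and is covered by the alignment.
  for (unsigned Cand : {32u, 16u}) {
    if (Cand > ElemSize && Alignment >= Cand / 8 && VecSizeInBits % Cand == 0) {
      NewElemSize = Cand;
      break;
    }
  }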

const unsigned ElemSize = ElemTy.getSizeInBits();

// Skip if the load is only used for extracts - let matchUnalignedExtractLoad
// handle it This prevents the two combiners from competing for the same

@andcarminati (Author): nit: missing "."

return false;

// Skip if the load has a single user that is a G_STORE with the same
// alignment This case can be perfectly scalarized during legalization

@andcarminati (Author): nit: missing "."

return false;

// Calculate new number of elements
const unsigned NewNumElems = VecSizeInBits / NewElemSize;

@martien-de-jong (Collaborator) commented Nov 13, 2025: I think it's confusing that both of these variables are in bits, but not both are called InBits.

const unsigned NewNumElems = VecSizeInBits / NewElemSize;

// Capture the pointer register before creating the lambda
const Register PtrReg = LoadMI.getOperand(1).getReg();

Collaborator: You can write PtrReg = LoadMI.getOperand(1).getReg() in the capture list, giving it local scope.
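
That is, roughly (a sketch of the init-capture suggestion; the lambda shape is assumed from the usual combiner build-function pattern):

  // Init-capture scopes PtrReg to the lambda; other captures elided.
  auto BuildFn = [PtrReg = LoadMI.getOperand(1).getReg()](MachineIRBuilder &B) {
    // ... use PtrReg here; it lives only inside the lambda.
  };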

MachineFunction &MF = B.getMF();

// Create the new vector type with better-aligned elements
const LLT NewVecTy = LLT::fixed_vector(NewNumElems, NewElemSize);

Collaborator: Capture NewVecTy, dropping the two constructor parameters?

@andcarminati (Author): I moved NewNumElems to the lambda.

…gnment

In this case, we can improve the legalized code.
@andcarminati force-pushed the andreu.unaligned.load.extract branch from 20e7808 to ec55fce on November 13, 2025 14:35