Merged
7 changes: 4 additions & 3 deletions llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -690,9 +690,6 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,

LPM2.addPass(LoopDeletionPass());

if (PTO.LoopInterchange)
  LPM2.addPass(LoopInterchangePass());
Comment on lines -693 to -694
Contributor Author

It might be reasonable to keep this, at least for now?


// Do not enable unrolling in PreLinkThinLTO phase during sample PGO
// because it changes IR to makes profile annotation in back compile
// inaccurate. The normal unroller doesn't pay attention to forced full unroll
@@ -1547,6 +1544,10 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
// this may need to be revisited once we run GVN before loop deletion
// in the simplification pipeline.
LPM.addPass(LoopDeletionPass());

if (PTO.LoopInterchange)
  LPM.addPass(LoopInterchangePass());
Comment on lines +1548 to +1549
Contributor Author

I'm not sure where LoopInterchange should be placed within the optimization pipeline...

Contributor

Hmm, it would be interesting to get more data on the impact of different positions, but the current placement seems reasonable to me, at least.

Contributor Author

Yes, collecting diverse data would be ideal, but since the current LoopInterchange is rarely triggered in practice, I haven’t been able to gather enough data to have a meaningful discussion about the optimal placement...

TSVC s235 might be an interesting example. If LoopDistribute were extended to support non-innermost loops and it ran before LoopInterchange, the following could become possible:

Original:

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
  for (int i = 0; i < LEN2; i++) {
    a[i] += b[i] * c[i];
    for (int j = 1; j < LEN2; j++) {
      aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];
    }
  }

Distributed:

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
  for (int i = 0; i < LEN2; i++)
    a[i] += b[i] * c[i];

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
  for (int i = 0; i < LEN2; i++)
    for (int j = 1; j < LEN2; j++)
      aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];

Interchanged:

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
  for (int i = 0; i < LEN2; i++)
    a[i] += b[i] * c[i];

// The i-loop and j-loop are interchanged.
for (int nl = 0; nl < 200*(ntimes/LEN2); nl++)
  for (int j = 1; j < LEN2; j++)
    for (int i = 0; i < LEN2; i++)
      aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];

IIRC, when I ran the interchanged version some time ago, it was about 8x faster than the original. But achieving this automatically would require a lot of work on LoopDistribute...
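
As a sanity check on the transformation described above, here is a minimal C++ sketch (not part of the PR) that runs the original s235 loop nest and the distributed-then-interchanged form on the same inputs and compares the results. The names `State`, `make_state`, `s235_original`, `s235_interchanged`, and `kernels_agree`, the size `N`, and the `reps` parameter (standing in for the `nl` trip count) are all assumptions made for illustration. The inputs are small integers so that every result is exact in `double`.

```cpp
#include <array>
#include <cassert>

// Hypothetical small size for illustration; TSVC uses a much larger LEN2.
constexpr int N = 4;
using Vec = std::array<double, N>;
using Mat = std::array<Vec, N>;

struct State {
  Vec a, b, c;
  Mat aa, bb;
};

// Small integer-valued inputs so that all arithmetic is exact in double.
State make_state() {
  State s{};
  for (int i = 0; i < N; ++i) {
    s.a[i] = i;
    s.b[i] = 1;
    s.c[i] = 2;
    for (int j = 0; j < N; ++j) {
      s.aa[j][i] = i + j;
      s.bb[j][i] = j + 1;
    }
  }
  return s;
}

// Same shape as the "Original" listing: the j-loop is innermost, so
// aa[j][i] is walked down a column (stride of one row per iteration).
void s235_original(State &s, int reps) {
  for (int nl = 0; nl < reps; ++nl)
    for (int i = 0; i < N; ++i) {
      s.a[i] += s.b[i] * s.c[i];
      for (int j = 1; j < N; ++j)
        s.aa[j][i] = s.aa[j - 1][i] + s.bb[j][i] * s.a[i];
    }
}

// Same shape as the "Interchanged" listing: the a[i] update is split off,
// and the i-loop becomes innermost, giving unit-stride access to aa and bb.
void s235_interchanged(State &s, int reps) {
  for (int nl = 0; nl < reps; ++nl)
    for (int i = 0; i < N; ++i)
      s.a[i] += s.b[i] * s.c[i];

  for (int nl = 0; nl < reps; ++nl)
    for (int j = 1; j < N; ++j)
      for (int i = 0; i < N; ++i)
        s.aa[j][i] = s.aa[j - 1][i] + s.bb[j][i] * s.a[i];
}

// Runs both variants from identical inputs and compares the final arrays.
bool kernels_agree(int reps) {
  State x = make_state();
  State y = make_state();
  s235_original(x, reps);
  s235_interchanged(y, reps);
  return x.a == y.a && x.aa == y.aa;
}
```

Calling `kernels_agree(3)` should return `true`. The final states match even though the `nl` loop is split: row `aa[0]` is never written, so each repetition recomputes rows `1..N-1` from scratch, and the last repetition of the original sees the same final `a[i]` that the second nest of the interchanged form uses. This sketch only checks equivalence on one input; it does not reproduce the timing mentioned above.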


OptimizePM.addPass(createFunctionToLoopPassAdaptor(
std::move(LPM), /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));

48 changes: 48 additions & 0 deletions llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
; RUN: opt -passes='default<O3>' -enable-loopinterchange -disable-output \
; RUN: -disable-verify -verify-analysis-invalidation=0 \
; RUN: -debug-pass-manager=quiet %s 2>&1 | FileCheck %s

; Test the position of LoopInterchange in the pass pipeline.

; CHECK-NOT: Running pass: LoopInterchangePass
; CHECK: Running pass: ControlHeightReductionPass
; CHECK-NEXT: Running pass: LoopSimplifyPass
; CHECK-NEXT: Running pass: LCSSAPass
; CHECK-NEXT: Running pass: LoopRotatePass
; CHECK-NEXT: Running pass: LoopDeletionPass
; CHECK-NEXT: Running pass: LoopRotatePass
; CHECK-NEXT: Running pass: LoopDeletionPass
; CHECK-NEXT: Running pass: LoopInterchangePass
; CHECK-NEXT: Running pass: LoopDistributePass
; CHECK-NEXT: Running pass: InjectTLIMappings
; CHECK-NEXT: Running pass: LoopVectorizePass
Comment on lines +1 to +18
Contributor Author

@kasuga-fj, Jun 26, 2025

Please let me know if there is a better way...
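
For readers less familiar with FileCheck, the CHECK lines in this test can be read roughly as the ordering test sketched below in C++. This is a simplified model, not FileCheck itself: it uses plain substring matching, and the function names `interchange_position_ok` and `sample_log`, along with the made-up `InstCombinePass` line in the sample log, are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified model of the CHECK-NOT / CHECK / CHECK-NEXT sequence:
//  - no LoopInterchangePass line may appear before the first
//    ControlHeightReductionPass line (CHECK-NOT followed by CHECK), and
//  - the remaining expected lines must follow on consecutive lines
//    (CHECK-NEXT).
bool interchange_position_ok(const std::vector<std::string> &log) {
  const std::string forbidden = "Running pass: LoopInterchangePass";
  const std::vector<std::string> expected = {
      "Running pass: ControlHeightReductionPass",
      "Running pass: LoopSimplifyPass",
      "Running pass: LCSSAPass",
      "Running pass: LoopRotatePass",
      "Running pass: LoopDeletionPass",
      "Running pass: LoopRotatePass",
      "Running pass: LoopDeletionPass",
      "Running pass: LoopInterchangePass",
      "Running pass: LoopDistributePass",
      "Running pass: InjectTLIMappings",
      "Running pass: LoopVectorizePass",
  };

  // CHECK-NOT + CHECK: scan forward to the anchor line, rejecting the
  // log if the forbidden pass shows up first.
  std::size_t start = 0;
  while (start < log.size() &&
         log[start].find(expected[0]) == std::string::npos) {
    if (log[start].find(forbidden) != std::string::npos)
      return false;
    ++start;
  }
  if (start == log.size() || start + expected.size() > log.size())
    return false;

  // CHECK-NEXT: every later expectation must match the very next line.
  for (std::size_t k = 1; k < expected.size(); ++k)
    if (log[start + k].find(expected[k]) == std::string::npos)
      return false;
  return true;
}

// Builds a made-up pass log; with early_interchange set, LoopInterchangePass
// also runs before ControlHeightReductionPass, which the test should reject.
std::vector<std::string> sample_log(bool early_interchange) {
  std::vector<std::string> log = {"Running pass: InstCombinePass"};
  if (early_interchange)
    log.push_back("Running pass: LoopInterchangePass");
  for (const char *name :
       {"ControlHeightReductionPass", "LoopSimplifyPass", "LCSSAPass",
        "LoopRotatePass", "LoopDeletionPass", "LoopRotatePass",
        "LoopDeletionPass", "LoopInterchangePass", "LoopDistributePass",
        "InjectTLIMappings", "LoopVectorizePass"})
    log.push_back(std::string("Running pass: ") + name);
  return log;
}
```

Under this model, `interchange_position_ok(sample_log(false))` holds, while `sample_log(true)` is rejected because the pass also runs in the earlier position that this PR removes.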



define void @foo(ptr %a, i32 %n) {
entry:
  br label %for.i.header

for.i.header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %for.i.latch ]
  br label %for.j

for.j:
  %j = phi i32 [ 0, %for.i.header ], [ %j.next, %for.j ]
  %tmp = mul i32 %i, %n
  %offset = add i32 %tmp, %j
  %idx = getelementptr inbounds i32, ptr %a, i32 %offset
  %load = load i32, ptr %idx, align 4
  %inc = add i32 %load, 1
  store i32 %inc, ptr %idx, align 4
  %j.next = add i32 %j, 1
  %j.exit = icmp eq i32 %j.next, %n
  br i1 %j.exit, label %for.i.latch, label %for.j

for.i.latch:
  %i.next = add i32 %i, 1
  %i.exit = icmp eq i32 %i.next, %n
  br i1 %i.exit, label %for.i.header, label %exit

exit:
  ret void
}