Changes from 3 commits
18 changes: 18 additions & 0 deletions llvm/docs/LangRef.rst
@@ -7593,6 +7593,24 @@
Note that setting ``llvm.loop.interleave.count`` to 1 disables interleaving
multiple iterations of the loop. If ``llvm.loop.interleave.count`` is set to 0
then the interleave count will be determined automatically.

'``llvm.loop.vectorize.reassociate_fpreductions.enable``' Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This metadata selectively allows or disallows reassociating floating-point
reductions, which may otherwise be unsafe to reassociate, during loop
vectorization. For example, a floating-point ``ADD`` reduction without the
``reassoc`` fast-math flag may be vectorized provided that this metadata
allows it. The first operand is the string
``llvm.loop.vectorize.reassociate_fpreductions.enable``
and the second operand is a bit. If the bit operand value is 1, unsafe
reduction reassociations are enabled. A value of 0 disables unsafe
reduction reassociations.

.. code-block:: llvm

!0 = !{!"llvm.loop.vectorize.reassociate_fpreductions.enable", i1 0}
!1 = !{!"llvm.loop.vectorize.reassociate_fpreductions.enable", i1 1}

'``llvm.loop.vectorize.enable``' Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Contributor:
Just a quick thought - what would you expect to happen for nested loops where the reduction variable is used at all levels of the loop? For example,

  float v = 0;
  for (...) {
    for (...) {
      for (...) {
        v += ...;
      }
      v += ...;
    }
    v += ...;
  }

Suppose the metadata is only added to the outer loop, but not the inner loops. It's possible that the inner loops get fully unrolled such that only the outer loop remains by the time we run the loop vectoriser. Is it valid to still reassociate? If so, that implies all inner loops must inherit the property from the outer loop. Would you consider it a bug to add it to the outer loop, but not the inner loops? Alternatively, if the inner loops do not get unrolled, would it be legal for the vectoriser to walk up to the outermost loop and use the metadata on the outermost loop to reassociate reductions on the innermost, etc?

Contributor Author:

Thank you for the example!

Since the user allowed reassociation for the reduction computation in the outer loop (e.g. via an option), we may think of the reduction computations as "inaccurate" already. So we are free to reassociate even the code in the inner loops (if they are unrolled) or leave it alone (if they are not unrolled).

By the same logic, it should be legal for the vectorizer to walk up to the outermost loop and use the metadata to reassociate reductions in the inner loops.

So far I am planning to inject the metadata based on the command-line option, so a module will have the metadata consistently attached to all loops. The situation you described may occur due to LTO, and I think it is hard to provide fine-grained controls such as "compute this part of the reduction without reassociation, and this part with reassociation". So, basically, the outer loop "wins", and the only way to prevent this is to use noinline (effectively disabling all optimizations in the outer loop).

Contributor:

That seems reasonable and thanks for explaining. I think it's worth explicitly stating this in the LangRef because once the metadata exists in LLVM it could be used by other frontends. For example, I can imagine in future someone may add a C level pragma that maps to this.

Contributor:

However, I think the reverse is also true. Suppose in your outer loop you set llvm.loop.vectorize.reassociate_fpreductions.enable to 0, that should override any inner loop that sets it to 1 for consistency.

Contributor Author:

> However, I think the reverse is also true. Suppose in your outer loop you set llvm.loop.vectorize.reassociate_fpreductions.enable to 0, that should override any inner loop that sets it to 1 for consistency.

Hmm, that does not sound right to me. If the inner loop computes a different reduction than the outer loop, then the metadata should probably not apply to the inner loop, e.g.:

double s1 = 0.0;
for (...) {
  double s2 = 0.0;
  for (...) {
    s2 += ...;
  }
  s1 += ...;
}

Do you think it will be more consistent to propagate the metadata's "enable" effect to the whole loop-nest regardless of which loop it is set on?

P.S. I am on vacation for 1.5 weeks, and I won't be able to reply to the comments during my absence. Sorry for the inconvenience.

Contributor:

Well, I guess that depends upon how you want this metadata to behave and what you want to achieve. My point really was that it should be consistent in my opinion - it would seem odd to permit llvm.loop.vectorize.reassociate_fpreductions.enable=1 to override inner loops, but not permit llvm.loop.vectorize.reassociate_fpreductions.enable=0 given something has gone to the effort of explicitly adding it. Of course if the metadata is completely missing from the outer loop (surely the common case?), then it cannot override any metadata on inner loops anyway. I think whatever behaviour we decide upon should be documented explicitly in the LangRef to avoid confusion.

Contributor Author:

Sorry for the long delay. I finally found time to get back to this. I promised to show how the NVHPC compiler works, and I have some details now.

Nvfortran has an option -Mvect=assoc/noassoc that allows/disallows vectorizing FP reductions. Nvfortran may not be the best example of how a mix of different options interacts with cross-module inlining, because it looks like it simply relies on whatever options are in effect for the compilation that runs after the cross-module function inlining.

I tried the following example:

callee.f90:

subroutine inner(y,s)
  real :: y(*), s
  do j=1,100
     s=s+y(j)
  end do
end subroutine inner

caller.f90:

subroutine test(x,y,s)
  interface
     subroutine inner(y,s)
       real :: y(*), s
     end subroutine inner
  end interface
  real :: x(*), y(*), s
  do i=1,100
     call inner(y,s)
     s=s+x(i)
  end do
end subroutine test

The first step is to create an inlining "library" for the callee.f90: nvfortran -cpp -O3 callee.f90 -Minfo=all -Mvect=assoc/noassoc -c -Mextract=lib:reductions

The second step is to use the inlining "library" during the compilation of the caller.f90: nvfortran -cpp -O3 caller.f90 -Minfo=all -Mvect=assoc/noassoc -Minline=lib:reductions -c

Regardless of the -Mvect=assoc/noassoc option used during the first step, the vectorization decision is based on the option value used during the second step. I.e. -Mvect=assoc results in the inner loop being vectorized, and -Mvect=noassoc disables vectorization.

Besides the reordering of the reduction computations, nvfortran does not apply any other FP math reassociations.

The most common use-case I anticipate for NVHPC users is that most of the code is compiled with FP reduction reassociation allowed, but some accuracy-critical loops with reductions need to be compiled without it. One way to do this is to extract such loops into separate functions/modules and compile them without reduction reassociation. Then, after the cross-module inlining, the reduction computations within these loops are not supposed to be reassociated (even if they are loops with constant trip counts that may be completely unrolled and end up inside outer loops of a caller compiled with the more relaxed reduction behavior).

In this usage model, it is expected that the metadata is set to either 1 or 0 for all the loops, but how can we define the metadata merging rules?

For correctness, it sounds like the inner loops should maintain their 0 value even when completely unrolled, so 0 (or the absence of metadata) should propagate outwards and override any 1 on the outer loops. Conversely, a 1 cannot propagate outwards to override an outer 0 (or absence of metadata).

I am not sure where such metadata propagation can be done reliably, given that different passes may perform function inlining. It does not seem feasible to require the metadata propagation to run after every pass that may change the loop nesting. Can this be done in the vectorizer itself, by querying the whole loop nest containing the loop being vectorized?

You brought up a great point, and I do not know how to address it properly.

I am wondering now if the approach suggested during the vectorizer meeting is more viable: someone (sorry, I do not remember the name) suggested a FastMathFlag attached to FP operations that would allow their reassociation only when it is required for vectorizing reductions. It sounds more consistent, but maybe someone can find drawbacks in it as well.

I think I need to collect more performance and correctness data before pushing this forward, and the LTO aspect is not something I am concerned about right now. Would it be acceptable to add an engineering option that allows reduction reassociation, so that I can experiment with multiple benchmarks and bring back some factual data? (This was one of the suggestions during the vectorizer meeting as well.)

@@ -64,7 +64,8 @@ class LoopVectorizeHints {
HK_FORCE,
HK_ISVECTORIZED,
HK_PREDICATE,
HK_SCALABLE
HK_SCALABLE,
HK_REASSOCIATE_FP_REDUCTIONS,
};

/// Hint - associates name and validation with the hint value.
@@ -97,6 +98,10 @@ class LoopVectorizeHints {
/// Says whether we should use fixed width or scalable vectorization.
Hint Scalable;

/// Says whether unsafe reassociation of reductions is allowed
/// during loop vectorization.
Hint ReassociateFPReductions;

/// Return the loop metadata prefix.
static StringRef Prefix() { return "llvm.loop."; }

@@ -162,6 +167,13 @@ class LoopVectorizeHints {
return (ScalableForceKind)Scalable.Value == SK_FixedWidthOnly;
}

enum ForceKind getReassociateFPReductions() const {
if ((ForceKind)ReassociateFPReductions.Value == FK_Undefined &&
hasDisableAllTransformsHint(TheLoop))
return FK_Disabled;
return (ForceKind)ReassociateFPReductions.Value;
}

/// If hints are provided that force vectorization, use the AlwaysPrint
/// pass name to force the frontend to print the diagnostic.
const char *vectorizeAnalysisPassName() const;
@@ -173,6 +185,10 @@ class LoopVectorizeHints {
/// error accumulates in the loop.
bool allowReordering() const;

/// Returns true iff the loop hints allow reassociating floating-point
/// reductions for the purpose of vectorization.
bool allowFPReductionReassociation() const;

bool isPotentiallyUnsafe() const {
// Avoid FP vectorization if the target is unsure about proper support.
// This may be related to the SIMD unit in the target not handling
42 changes: 29 additions & 13 deletions llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -97,6 +97,7 @@ bool LoopVectorizeHints::Hint::validate(unsigned Val) {
case HK_ISVECTORIZED:
case HK_PREDICATE:
case HK_SCALABLE:
case HK_REASSOCIATE_FP_REDUCTIONS:
return (Val == 0 || Val == 1);
}
return false;
@@ -112,6 +113,8 @@ LoopVectorizeHints::LoopVectorizeHints(const Loop *L,
IsVectorized("isvectorized", 0, HK_ISVECTORIZED),
Predicate("vectorize.predicate.enable", FK_Undefined, HK_PREDICATE),
Scalable("vectorize.scalable.enable", SK_Unspecified, HK_SCALABLE),
ReassociateFPReductions("vectorize.reassociate_fpreductions.enable",
FK_Undefined, HK_REASSOCIATE_FP_REDUCTIONS),
TheLoop(L), ORE(ORE) {
// Populate values with existing loop metadata.
getHintsFromMetadata();
@@ -254,6 +257,11 @@ bool LoopVectorizeHints::allowReordering() const {
EC.getKnownMinValue() > 1);
}

bool LoopVectorizeHints::allowFPReductionReassociation() const {
return HintsAllowReordering &&
getReassociateFPReductions() == LoopVectorizeHints::FK_Enabled;
}

void LoopVectorizeHints::getHintsFromMetadata() {
MDNode *LoopID = TheLoop->getLoopID();
if (!LoopID)
@@ -300,8 +308,13 @@ void LoopVectorizeHints::setHint(StringRef Name, Metadata *Arg) {
return;
unsigned Val = C->getZExtValue();

Hint *Hints[] = {&Width, &Interleave, &Force,
&IsVectorized, &Predicate, &Scalable};
Hint *Hints[] = {&Width,
&Interleave,
&Force,
&IsVectorized,
&Predicate,
&Scalable,
&ReassociateFPReductions};
for (auto *H : Hints) {
if (Name == H->Name) {
if (H->validate(Val))
@@ -1311,22 +1324,25 @@ bool LoopVectorizationLegality::canVectorizeFPMath(
return true;

// If the above is false, we have ExactFPMath & do not allow reordering.
// If the EnableStrictReductions flag is set, first check if we have any
// Exact FP induction vars, which we cannot vectorize.
if (!EnableStrictReductions ||
any_of(getInductionVars(), [&](auto &Induction) -> bool {
// First check if we have any Exact FP induction vars, which we cannot
// vectorize.
if (any_of(getInductionVars(), [&](auto &Induction) -> bool {
InductionDescriptor IndDesc = Induction.second;
return IndDesc.getExactFPMathInst();
}))
return false;

// We can now only vectorize if all reductions with Exact FP math also
// have the isOrdered flag set, which indicates that we can move the
// reduction operations in-loop.
return (all_of(getReductionVars(), [&](auto &Reduction) -> bool {
const RecurrenceDescriptor &RdxDesc = Reduction.second;
return !RdxDesc.hasExactFPMath() || RdxDesc.isOrdered();
}));
// We can now only vectorize if EnableStrictReductions flag is set and
// all reductions with Exact FP math also have the isOrdered flag set,
// which indicates that we can move the reduction operations in-loop.
// If the hints allow reassociating FP reductions, then skip
// all the checks.
return (Hints->allowFPReductionReassociation() ||
all_of(getReductionVars(), [&](auto &Reduction) -> bool {
const RecurrenceDescriptor &RdxDesc = Reduction.second;
return !RdxDesc.hasExactFPMath() ||
(EnableStrictReductions && RdxDesc.isOrdered());
}));
}

bool LoopVectorizationLegality::isInvariantStoreOfReduction(StoreInst *SI) {
5 changes: 3 additions & 2 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1000,9 +1000,10 @@ class LoopVectorizationCostModel {
/// Returns true if we should use strict in-order reductions for the given
/// RdxDesc. This is true if the -enable-strict-reductions flag is passed,
/// the IsOrdered flag of RdxDesc is set and we do not allow reordering
/// of FP operations.
/// of FP operations or FP reductions.
bool useOrderedReductions(const RecurrenceDescriptor &RdxDesc) const {
return !Hints->allowReordering() && RdxDesc.isOrdered();
return !Hints->allowReordering() &&
!Hints->allowFPReductionReassociation() && RdxDesc.isOrdered();
}

/// \returns The smallest bitwidth each instruction can be represented with.
47 changes: 47 additions & 0 deletions llvm/test/Transforms/LoopVectorize/reduction-reassociate.ll
@@ -0,0 +1,47 @@
; Check that the loop with a floating-point reduction is vectorized
; due to llvm.loop.vectorize.reassociate_fpreductions.enable metadata.
; RUN: opt -passes=loop-vectorize -S < %s 2>&1 | FileCheck %s

source_filename = "FIRModule"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: nofree norecurse nosync nounwind memory(argmem: readwrite)
define void @test_(ptr captures(none) %0, ptr readonly captures(none) %1) local_unnamed_addr #0 {
; CHECK-LABEL: define void @test_(
; CHECK: fadd contract <4 x float> {{.*}}
; CHECK: call contract float @llvm.vector.reduce.fadd.v4f32(float -0.000000e+00, <4 x float> {{.*}})
;
%invariant.gep = getelementptr i8, ptr %1, i64 -4
%.promoted = load float, ptr %0, align 4
br label %3

3: ; preds = %2, %3
%indvars.iv = phi i64 [ 1, %2 ], [ %indvars.iv.next, %3 ]
%4 = phi float [ %.promoted, %2 ], [ %6, %3 ]
%gep = getelementptr float, ptr %invariant.gep, i64 %indvars.iv
%5 = load float, ptr %gep, align 4
%6 = fadd contract float %4, %5
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, 1001
br i1 %exitcond.not, label %7, label %3, !llvm.loop !2

7: ; preds = %3
%.lcssa = phi float [ %6, %3 ]
store float %.lcssa, ptr %0, align 4
ret void
}

attributes #0 = { nofree norecurse nosync nounwind memory(argmem: readwrite) "target-cpu"="x86-64" }

!llvm.ident = !{!0}
!llvm.module.flags = !{!1}

!0 = !{!"flang version 21.0.0"}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = distinct !{!2, !3}
!3 = !{!"llvm.loop.vectorize.reassociate_fpreductions.enable", i1 true}

; CHECK-NOT: llvm.loop.vectorize.reassociate_fpreductions.enable
; CHECK: !{!"llvm.loop.isvectorized", i32 1}
; CHECK: !{!"llvm.loop.unroll.runtime.disable"}