
Conversation

@felipealmeida
Contributor

Change the intersect used by the anticipation algorithm to ignore Unknown when anticipating. This effectively allows speculative VXRM writes: a VXRM write can be emitted even when there are branches on which VXRM is not needed.

This change matters because VXRM writes cause pipeline flushes on some micro-architectures, so it makes sense to allow more aggressive hoisting even if it causes some degradation on the slow path.
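To illustrate the change at a merge point, here is a minimal stand-alone model of the two meets. It is a sketch only: it collapses VXRMInfo's Invalid and Unknown states into one and is not the code of the patch (the real implementation is in the diff below). With the plain intersect, a successor that never uses VXRM makes the merge Unknown and blocks hoisting; the anticipated variant keeps the known rounding mode, so the write can still be hoisted.

```
#include <cassert>
#include <optional>

// Simplified model: nullopt stands for "no anticipated VXRM value" (Unknown),
// an int stands for a concrete rounding mode (0 == rnu).
using Mode = std::optional<int>;

// Classic meet: any Unknown input makes the result Unknown.
Mode intersect(Mode A, Mode B) {
  return (A && B && *A == *B) ? A : std::nullopt;
}

// Anticipated meet introduced by this patch: Unknown inputs are ignored, so a
// single successor that needs VXRM is enough to anticipate (and later hoist)
// the write.
Mode intersectAnticipated(Mode A, Mode B) {
  if (!A)
    return B;
  if (!B)
    return A;
  return (*A == *B) ? A : std::nullopt;
}

int main() {
  Mode NeedsRNU = 0;             // successor using vaaddu, wants VXRM = rnu
  Mode NoVXRMUse = std::nullopt; // successor with no fixed-point instructions
  assert(!intersect(NeedsRNU, NoVXRMUse));           // before: Unknown, no hoisting
  assert(intersectAnticipated(NeedsRNU, NoVXRMUse)); // after: rnu, write can be hoisted
  return 0;
}
```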

An example is this code:

typedef unsigned char uint8_t;
__attribute__ ((noipa))
void foo (uint8_t *dst,  int i_dst_stride,
           uint8_t *src1, int i_src1_stride,
           uint8_t *src2, int i_src2_stride,
           int i_width, int i_height )
{
   for( int y = 0; y < i_height; y++ )
     {
       for( int x = 0; x < i_width; x++ )
         dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
       dst  += i_dst_stride;
       src1 += i_src1_stride;
       src2 += i_src2_stride;
     }
}

With this patch, the VXRM write for the code above is hoisted out of the outer loop.

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment, using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" it, i.e. adding a comment that says "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Sep 25, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Felipe Magno de Almeida (felipealmeida)

Changes

Change the intersect used by the anticipation algorithm to ignore Unknown when anticipating. This effectively allows speculative VXRM writes: a VXRM write can be emitted even when there are branches on which VXRM is not needed.

This change matters because VXRM writes cause pipeline flushes on some micro-architectures, so it makes sense to allow more aggressive hoisting even if it causes some degradation on the slow path.

An example is this code:

typedef unsigned char uint8_t;
__attribute__ ((noipa))
void foo (uint8_t *dst,  int i_dst_stride,
           uint8_t *src1, int i_src1_stride,
           uint8_t *src2, int i_src2_stride,
           int i_width, int i_height )
{
   for( int y = 0; y < i_height; y++ )
     {
       for( int x = 0; x < i_width; x++ )
         dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
       dst  += i_dst_stride;
       src1 += i_src1_stride;
       src2 += i_src2_stride;
     }
}

With this patch, the VXRM write for the code above is hoisted out of the outer loop.


Full diff: https://github.com/llvm/llvm-project/pull/110044.diff

3 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVInsertWriteVXRM.cpp (+30-1)
  • (added) llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll (+386)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vxrm-insert.ll (+4-5)
diff --git a/llvm/lib/Target/RISCV/RISCVInsertWriteVXRM.cpp b/llvm/lib/Target/RISCV/RISCVInsertWriteVXRM.cpp
index f72ba2d5c667b8..2d3d30722d4880 100644
--- a/llvm/lib/Target/RISCV/RISCVInsertWriteVXRM.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInsertWriteVXRM.cpp
@@ -109,6 +109,35 @@ class VXRMInfo {
     return VXRMInfo::getUnknown();
   }
 
+  // Calculate the VXRMInfo visible to a block assuming this and Other
+  // are both predecessors. To allow speculatively running WriteVXRM
+  // we will ignore Unknowns if one of this and Other have valid
+  // WriteVXRM. Rationale: WriteVXRM causes a pipeline flush in some
+  // uarchs and moving it outside loops is very important for some
+  // workloads.
+  VXRMInfo intersectAnticipated(const VXRMInfo &Other) const {
+    // If the new value isn't valid, ignore it.
+    if (!Other.isValid())
+      return *this;
+
+    // If this value isn't valid, this must be the first predecessor, use it.
+    if (!isValid())
+      return Other;
+
+    // If either is unknown, the result is the other one.
+    if (isUnknown())
+      return Other;
+    else if (Other.isUnknown())
+      return *this;
+
+    // If we have an exact match, return this.
+    if (*this == Other)
+      return *this;
+
+    // Otherwise the result is unknown.
+    return VXRMInfo::getUnknown();
+  }
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Support for debugging, callable in GDB: V->dump()
   LLVM_DUMP_METHOD void dump() const {
@@ -290,7 +319,7 @@ void RISCVInsertWriteVXRM::computeAnticipated(const MachineBasicBlock &MBB) {
   } else {
     for (const MachineBasicBlock *S : MBB.successors())
       Anticipated =
-          Anticipated.intersect(BlockInfo[S->getNumber()].AnticipatedIn);
+          Anticipated.intersectAnticipated(BlockInfo[S->getNumber()].AnticipatedIn);
   }
 
   // If we don't have any valid anticipated info, wait until we do.
diff --git a/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll b/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll
new file mode 100644
index 00000000000000..62c7adeddeb146
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/rvv/vxrm-insert-out-of-loop.ll
@@ -0,0 +1,386 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
+
+; RUN: sed 's/iXLen/i32/g' %s | llc -mtriple=riscv32 -mattr=+v,+m \
+; RUN:   -verify-machineinstrs | FileCheck %s --check-prefixes=RV32
+; RUN: sed 's/iXLen/i64/g' %s | llc -mtriple=riscv64 -mattr=+v,+m \
+; RUN:   -verify-machineinstrs | FileCheck %s --check-prefixes=RV64
+
+
+; test1
+define dso_local void @test1(ptr nocapture noundef writeonly %dst, i32 noundef signext %i_dst_stride, ptr nocapture noundef readonly %src1, i32 noundef signext %i_src1_stride, ptr nocapture noundef readonly %src2, i32 noundef signext %i_src2_stride, i32 noundef signext %i_width, i32 noundef signext %i_height) local_unnamed_addr #0 {
+; RV32-LABEL: test1:
+; RV32:       # %bb.0: # %entry
+; RV32-NEXT:    csrwi vxrm, 0
+; RV32-NEXT:    blez a7, .LBB0_17
+; RV32-NEXT:  # %bb.1: # %for.cond1.preheader.lr.ph
+; RV32-NEXT:    blez a6, .LBB0_17
+; RV32-NEXT:  # %bb.2: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    addi t0, a7, -1
+; RV32-NEXT:    mul t3, a1, t0
+; RV32-NEXT:    mul t4, a3, t0
+; RV32-NEXT:    mul t5, a5, t0
+; RV32-NEXT:    csrr t2, vlenb
+; RV32-NEXT:    slli t1, t2, 1
+; RV32-NEXT:    li t6, 32
+; RV32-NEXT:    mv t0, t1
+; RV32-NEXT:    bnez zero, .LBB0_4
+; RV32-NEXT:  # %bb.3: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    li t0, 32
+; RV32-NEXT:  .LBB0_4: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    addi sp, sp, -16
+; RV32-NEXT:    .cfi_def_cfa_offset 16
+; RV32-NEXT:    sw s0, 12(sp) # 4-byte Folded Spill
+; RV32-NEXT:    sw s1, 8(sp) # 4-byte Folded Spill
+; RV32-NEXT:    sw s2, 4(sp) # 4-byte Folded Spill
+; RV32-NEXT:    .cfi_offset s0, -4
+; RV32-NEXT:    .cfi_offset s1, -8
+; RV32-NEXT:    .cfi_offset s2, -12
+; RV32-NEXT:    add t3, a0, t3
+; RV32-NEXT:    add t4, a2, t4
+; RV32-NEXT:    add s0, a4, t5
+; RV32-NEXT:    bltu t6, t1, .LBB0_6
+; RV32-NEXT:  # %bb.5: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    li t1, 32
+; RV32-NEXT:  .LBB0_6: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    add t3, t3, a6
+; RV32-NEXT:    add t5, t4, a6
+; RV32-NEXT:    add t4, s0, a6
+; RV32-NEXT:    beqz zero, .LBB0_8
+; RV32-NEXT:  # %bb.7: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    mv t1, t0
+; RV32-NEXT:  .LBB0_8: # %for.cond1.preheader.us.preheader
+; RV32-NEXT:    li t0, 0
+; RV32-NEXT:    sltu t5, a0, t5
+; RV32-NEXT:    sltu t6, a2, t3
+; RV32-NEXT:    and t5, t5, t6
+; RV32-NEXT:    or t6, a1, a3
+; RV32-NEXT:    slti t6, t6, 0
+; RV32-NEXT:    or t5, t5, t6
+; RV32-NEXT:    sltu t4, a0, t4
+; RV32-NEXT:    sltu t3, a4, t3
+; RV32-NEXT:    and t3, t4, t3
+; RV32-NEXT:    or t4, a1, a5
+; RV32-NEXT:    slti t4, t4, 0
+; RV32-NEXT:    or t3, t3, t4
+; RV32-NEXT:    or t3, t5, t3
+; RV32-NEXT:    sltu t1, a6, t1
+; RV32-NEXT:    or t1, t1, t3
+; RV32-NEXT:    andi t1, t1, 1
+; RV32-NEXT:    slli t2, t2, 1
+; RV32-NEXT:    j .LBB0_10
+; RV32-NEXT:  .LBB0_9: # %for.cond1.for.cond.cleanup3_crit_edge.us
+; RV32-NEXT:    # in Loop: Header=BB0_10 Depth=1
+; RV32-NEXT:    add a0, a0, a1
+; RV32-NEXT:    add a2, a2, a3
+; RV32-NEXT:    addi t0, t0, 1
+; RV32-NEXT:    add a4, a4, a5
+; RV32-NEXT:    beq t0, a7, .LBB0_16
+; RV32-NEXT:  .LBB0_10: # %for.cond1.preheader.us
+; RV32-NEXT:    # =>This Loop Header: Depth=1
+; RV32-NEXT:    # Child Loop BB0_13 Depth 2
+; RV32-NEXT:    # Child Loop BB0_15 Depth 2
+; RV32-NEXT:    beqz t1, .LBB0_12
+; RV32-NEXT:  # %bb.11: # in Loop: Header=BB0_10 Depth=1
+; RV32-NEXT:    li t4, 0
+; RV32-NEXT:    li t3, 0
+; RV32-NEXT:    j .LBB0_15
+; RV32-NEXT:  .LBB0_12: # %vector.ph
+; RV32-NEXT:    # in Loop: Header=BB0_10 Depth=1
+; RV32-NEXT:    li t3, 0
+; RV32-NEXT:    neg t4, t2
+; RV32-NEXT:    and t4, t4, a6
+; RV32-NEXT:    li t6, 0
+; RV32-NEXT:    li t5, 0
+; RV32-NEXT:    vsetvli s0, zero, e8, m2, ta, ma
+; RV32-NEXT:  .LBB0_13: # %vector.body
+; RV32-NEXT:    # Parent Loop BB0_10 Depth=1
+; RV32-NEXT:    # => This Inner Loop Header: Depth=2
+; RV32-NEXT:    add s0, a2, t6
+; RV32-NEXT:    vl2r.v v8, (s0)
+; RV32-NEXT:    add s0, a4, t6
+; RV32-NEXT:    vl2r.v v10, (s0)
+; RV32-NEXT:    vaaddu.vv v8, v8, v10
+; RV32-NEXT:    add s0, a0, t6
+; RV32-NEXT:    add s1, t6, t2
+; RV32-NEXT:    sltu t6, s1, t6
+; RV32-NEXT:    add t5, t5, t6
+; RV32-NEXT:    xor t6, s1, t4
+; RV32-NEXT:    or s2, t6, t5
+; RV32-NEXT:    vs2r.v v8, (s0)
+; RV32-NEXT:    mv t6, s1
+; RV32-NEXT:    bnez s2, .LBB0_13
+; RV32-NEXT:  # %bb.14: # %middle.block
+; RV32-NEXT:    # in Loop: Header=BB0_10 Depth=1
+; RV32-NEXT:    beq t4, a6, .LBB0_9
+; RV32-NEXT:  .LBB0_15: # %for.body4.us
+; RV32-NEXT:    # Parent Loop BB0_10 Depth=1
+; RV32-NEXT:    # => This Inner Loop Header: Depth=2
+; RV32-NEXT:    add t5, a2, t4
+; RV32-NEXT:    lbu t5, 0(t5)
+; RV32-NEXT:    add t6, a4, t4
+; RV32-NEXT:    lbu t6, 0(t6)
+; RV32-NEXT:    add t5, t5, t6
+; RV32-NEXT:    addi t5, t5, 1
+; RV32-NEXT:    srli t5, t5, 1
+; RV32-NEXT:    add t6, a0, t4
+; RV32-NEXT:    addi t4, t4, 1
+; RV32-NEXT:    seqz s0, t4
+; RV32-NEXT:    add t3, t3, s0
+; RV32-NEXT:    xor s0, t4, a6
+; RV32-NEXT:    or s0, s0, t3
+; RV32-NEXT:    sb t5, 0(t6)
+; RV32-NEXT:    bnez s0, .LBB0_15
+; RV32-NEXT:    j .LBB0_9
+; RV32-NEXT:  .LBB0_16:
+; RV32-NEXT:    lw s0, 12(sp) # 4-byte Folded Reload
+; RV32-NEXT:    lw s1, 8(sp) # 4-byte Folded Reload
+; RV32-NEXT:    lw s2, 4(sp) # 4-byte Folded Reload
+; RV32-NEXT:    addi sp, sp, 16
+; RV32-NEXT:  .LBB0_17: # %for.cond.cleanup
+; RV32-NEXT:    ret
+;
+; RV64-LABEL: test1:
+; RV64:       # %bb.0: # %entry
+; RV64-NEXT:    csrwi vxrm, 0
+; RV64-NEXT:    blez a7, .LBB0_14
+; RV64-NEXT:  # %bb.1: # %for.cond1.preheader.lr.ph
+; RV64-NEXT:    blez a6, .LBB0_14
+; RV64-NEXT:  # %bb.2: # %for.cond1.preheader.us.preheader
+; RV64-NEXT:    addi sp, sp, -48
+; RV64-NEXT:    .cfi_def_cfa_offset 48
+; RV64-NEXT:    sd s0, 40(sp) # 8-byte Folded Spill
+; RV64-NEXT:    sd s1, 32(sp) # 8-byte Folded Spill
+; RV64-NEXT:    sd s2, 24(sp) # 8-byte Folded Spill
+; RV64-NEXT:    sd s3, 16(sp) # 8-byte Folded Spill
+; RV64-NEXT:    sd s4, 8(sp) # 8-byte Folded Spill
+; RV64-NEXT:    .cfi_offset s0, -8
+; RV64-NEXT:    .cfi_offset s1, -16
+; RV64-NEXT:    .cfi_offset s2, -24
+; RV64-NEXT:    .cfi_offset s3, -32
+; RV64-NEXT:    .cfi_offset s4, -40
+; RV64-NEXT:    addi t0, a7, -1
+; RV64-NEXT:    slli t0, t0, 32
+; RV64-NEXT:    srli t0, t0, 32
+; RV64-NEXT:    mul t4, a1, t0
+; RV64-NEXT:    add t1, a0, a6
+; RV64-NEXT:    add t4, t1, t4
+; RV64-NEXT:    mul t6, a3, t0
+; RV64-NEXT:    add t1, a2, a6
+; RV64-NEXT:    add t6, t1, t6
+; RV64-NEXT:    mul s0, a5, t0
+; RV64-NEXT:    add t2, a4, a6
+; RV64-NEXT:    csrr t0, vlenb
+; RV64-NEXT:    slli t1, t0, 1
+; RV64-NEXT:    li t3, 32
+; RV64-NEXT:    add s0, t2, s0
+; RV64-NEXT:    mv t5, t1
+; RV64-NEXT:    bltu t3, t1, .LBB0_4
+; RV64-NEXT:  # %bb.3: # %for.cond1.preheader.us.preheader
+; RV64-NEXT:    li t5, 32
+; RV64-NEXT:  .LBB0_4: # %for.cond1.preheader.us.preheader
+; RV64-NEXT:    li t2, 0
+; RV64-NEXT:    li t3, 0
+; RV64-NEXT:    sltu t6, a0, t6
+; RV64-NEXT:    sltu s1, a2, t4
+; RV64-NEXT:    and t6, t6, s1
+; RV64-NEXT:    or s1, a1, a3
+; RV64-NEXT:    slti s1, s1, 0
+; RV64-NEXT:    or t6, t6, s1
+; RV64-NEXT:    sltu s0, a0, s0
+; RV64-NEXT:    sltu t4, a4, t4
+; RV64-NEXT:    and t4, s0, t4
+; RV64-NEXT:    or s0, a1, a5
+; RV64-NEXT:    slti s0, s0, 0
+; RV64-NEXT:    or t4, t4, s0
+; RV64-NEXT:    or t4, t6, t4
+; RV64-NEXT:    sltu t5, a6, t5
+; RV64-NEXT:    or t4, t5, t4
+; RV64-NEXT:    andi t4, t4, 1
+; RV64-NEXT:    mv t5, a0
+; RV64-NEXT:    j .LBB0_6
+; RV64-NEXT:  .LBB0_5: # %for.cond1.for.cond.cleanup3_crit_edge.us
+; RV64-NEXT:    # in Loop: Header=BB0_6 Depth=1
+; RV64-NEXT:    add t5, t5, a1
+; RV64-NEXT:    add a2, a2, a3
+; RV64-NEXT:    add a4, a4, a5
+; RV64-NEXT:    addiw t3, t3, 1
+; RV64-NEXT:    addi t2, t2, 1
+; RV64-NEXT:    beq t3, a7, .LBB0_13
+; RV64-NEXT:  .LBB0_6: # %for.cond1.preheader.us
+; RV64-NEXT:    # =>This Loop Header: Depth=1
+; RV64-NEXT:    # Child Loop BB0_9 Depth 2
+; RV64-NEXT:    # Child Loop BB0_12 Depth 2
+; RV64-NEXT:    beqz t4, .LBB0_8
+; RV64-NEXT:  # %bb.7: # in Loop: Header=BB0_6 Depth=1
+; RV64-NEXT:    li t6, 0
+; RV64-NEXT:    j .LBB0_11
+; RV64-NEXT:  .LBB0_8: # %vector.ph
+; RV64-NEXT:    # in Loop: Header=BB0_6 Depth=1
+; RV64-NEXT:    slli t6, t0, 28
+; RV64-NEXT:    sub t6, t6, t1
+; RV64-NEXT:    and t6, t6, a6
+; RV64-NEXT:    mv s0, a2
+; RV64-NEXT:    mv s1, a4
+; RV64-NEXT:    mv s2, t5
+; RV64-NEXT:    mv s3, t6
+; RV64-NEXT:    vsetvli s4, zero, e8, m2, ta, ma
+; RV64-NEXT:  .LBB0_9: # %vector.body
+; RV64-NEXT:    # Parent Loop BB0_6 Depth=1
+; RV64-NEXT:    # => This Inner Loop Header: Depth=2
+; RV64-NEXT:    vl2r.v v8, (s0)
+; RV64-NEXT:    vl2r.v v10, (s1)
+; RV64-NEXT:    vaaddu.vv v8, v8, v10
+; RV64-NEXT:    vs2r.v v8, (s2)
+; RV64-NEXT:    sub s3, s3, t1
+; RV64-NEXT:    add s2, s2, t1
+; RV64-NEXT:    add s1, s1, t1
+; RV64-NEXT:    add s0, s0, t1
+; RV64-NEXT:    bnez s3, .LBB0_9
+; RV64-NEXT:  # %bb.10: # %middle.block
+; RV64-NEXT:    # in Loop: Header=BB0_6 Depth=1
+; RV64-NEXT:    beq t6, a6, .LBB0_5
+; RV64-NEXT:  .LBB0_11: # %for.body4.us.preheader
+; RV64-NEXT:    # in Loop: Header=BB0_6 Depth=1
+; RV64-NEXT:    mul s0, a1, t2
+; RV64-NEXT:    add s1, a0, a6
+; RV64-NEXT:    add s0, s1, s0
+; RV64-NEXT:    add s1, t5, t6
+; RV64-NEXT:    add s2, a4, t6
+; RV64-NEXT:    add t6, a2, t6
+; RV64-NEXT:  .LBB0_12: # %for.body4.us
+; RV64-NEXT:    # Parent Loop BB0_6 Depth=1
+; RV64-NEXT:    # => This Inner Loop Header: Depth=2
+; RV64-NEXT:    lbu s3, 0(t6)
+; RV64-NEXT:    lbu s4, 0(s2)
+; RV64-NEXT:    add s3, s3, s4
+; RV64-NEXT:    addi s3, s3, 1
+; RV64-NEXT:    srli s3, s3, 1
+; RV64-NEXT:    sb s3, 0(s1)
+; RV64-NEXT:    addi s1, s1, 1
+; RV64-NEXT:    addi s2, s2, 1
+; RV64-NEXT:    addi t6, t6, 1
+; RV64-NEXT:    bne s1, s0, .LBB0_12
+; RV64-NEXT:    j .LBB0_5
+; RV64-NEXT:  .LBB0_13:
+; RV64-NEXT:    ld s0, 40(sp) # 8-byte Folded Reload
+; RV64-NEXT:    ld s1, 32(sp) # 8-byte Folded Reload
+; RV64-NEXT:    ld s2, 24(sp) # 8-byte Folded Reload
+; RV64-NEXT:    ld s3, 16(sp) # 8-byte Folded Reload
+; RV64-NEXT:    ld s4, 8(sp) # 8-byte Folded Reload
+; RV64-NEXT:    addi sp, sp, 48
+; RV64-NEXT:  .LBB0_14: # %for.cond.cleanup
+; RV64-NEXT:    ret
+entry:
+  %cmp29 = icmp sgt i32 %i_height, 0
+  br i1 %cmp29, label %for.cond1.preheader.lr.ph, label %for.cond.cleanup
+
+for.cond1.preheader.lr.ph:                        ; preds = %entry
+  %cmp227 = icmp sgt i32 %i_width, 0
+  %idx.ext = sext i32 %i_dst_stride to i64
+  %idx.ext12 = sext i32 %i_src1_stride to i64
+  %idx.ext14 = sext i32 %i_src2_stride to i64
+  br i1 %cmp227, label %for.cond1.preheader.us.preheader, label %for.cond.cleanup
+
+for.cond1.preheader.us.preheader:                 ; preds = %for.cond1.preheader.lr.ph
+  %wide.trip.count = zext nneg i32 %i_width to i64
+  %0 = add nsw i32 %i_height, -1
+  %1 = zext i32 %0 to i64
+  %2 = mul nsw i64 %idx.ext, %1
+  %3 = getelementptr i8, ptr %dst, i64 %2
+  %scevgep = getelementptr i8, ptr %3, i64 %wide.trip.count
+  %4 = mul nsw i64 %idx.ext12, %1
+  %5 = getelementptr i8, ptr %src1, i64 %4
+  %scevgep36 = getelementptr i8, ptr %5, i64 %wide.trip.count
+  %6 = mul nsw i64 %idx.ext14, %1
+  %7 = getelementptr i8, ptr %src2, i64 %6
+  %scevgep37 = getelementptr i8, ptr %7, i64 %wide.trip.count
+  %8 = tail call i64 @llvm.vscale.i64()
+  %9 = shl nuw nsw i64 %8, 4
+  %10 = tail call i64 @llvm.umax.i64(i64 %9, i64 32)
+  %min.iters.check = icmp ugt i64 %10, %wide.trip.count
+  %bound0 = icmp ult ptr %dst, %scevgep36
+  %bound1 = icmp ult ptr %src1, %scevgep
+  %found.conflict = and i1 %bound0, %bound1
+  %11 = or i32 %i_dst_stride, %i_src1_stride
+  %12 = icmp slt i32 %11, 0
+  %13 = or i1 %found.conflict, %12
+  %bound039 = icmp ult ptr %dst, %scevgep37
+  %bound140 = icmp ult ptr %src2, %scevgep
+  %found.conflict41 = and i1 %bound039, %bound140
+  %14 = or i32 %i_dst_stride, %i_src2_stride
+  %15 = icmp slt i32 %14, 0
+  %16 = or i1 %found.conflict41, %15
+  %conflict.rdx = or i1 %13, %16
+  br label %for.cond1.preheader.us
+
+for.cond1.preheader.us:                           ; preds = %for.cond1.preheader.us.preheader, %for.cond1.for.cond.cleanup3_crit_edge.us
+  %y.033.us = phi i32 [ %inc17.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %for.cond1.preheader.us.preheader ]
+  %dst.addr.032.us = phi ptr [ %add.ptr.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %dst, %for.cond1.preheader.us.preheader ]
+  %src1.addr.031.us = phi ptr [ %add.ptr13.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %src1, %for.cond1.preheader.us.preheader ]
+  %src2.addr.030.us = phi ptr [ %add.ptr15.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ %src2, %for.cond1.preheader.us.preheader ]
+  %brmerge = select i1 %min.iters.check, i1 true, i1 %conflict.rdx
+  br i1 %brmerge, label %for.body4.us.preheader, label %vector.ph
+
+vector.ph:                                        ; preds = %for.cond1.preheader.us
+  %17 = tail call i64 @llvm.vscale.i64()
+  %.neg = mul nuw nsw i64 %17, 2147483632
+  %n.vec = and i64 %.neg, %wide.trip.count
+  %18 = tail call i64 @llvm.vscale.i64()
+  %19 = shl nuw nsw i64 %18, 4
+  br label %vector.body
+
+vector.body:                                      ; preds = %vector.body, %vector.ph
+  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
+  %20 = getelementptr inbounds i8, ptr %src1.addr.031.us, i64 %index
+  %wide.load = load <vscale x 16 x i8>, ptr %20, align 1
+  %21 = zext <vscale x 16 x i8> %wide.load to <vscale x 16 x i16>
+  %22 = getelementptr inbounds i8, ptr %src2.addr.030.us, i64 %index
+  %wide.load44 = load <vscale x 16 x i8>, ptr %22, align 1
+  %23 = zext <vscale x 16 x i8> %wide.load44 to <vscale x 16 x i16>
+  %24 = add nuw nsw <vscale x 16 x i16> %21, shufflevector (<vscale x 16 x i16> insertelement (<vscale x 16 x i16> poison, i16 1, i64 0), <vscale x 16 x i16> poison, <vscale x 16 x i32> zeroinitializer)
+  %25 = add nuw nsw <vscale x 16 x i16> %24, %23
+  %26 = lshr <vscale x 16 x i16> %25, shufflevector (<vscale x 16 x i16> insertelement (<vscale x 16 x i16> poison, i16 1, i64 0), <vscale x 16 x i16> poison, <vscale x 16 x i32> zeroinitializer)
+  %27 = trunc <vscale x 16 x i16> %26 to <vscale x 16 x i8>
+  %28 = getelementptr inbounds i8, ptr %dst.addr.032.us, i64 %index
+  store <vscale x 16 x i8> %27, ptr %28, align 1
+  %index.next = add nuw i64 %index, %19
+  %29 = icmp eq i64 %index.next, %n.vec
+  br i1 %29, label %middle.block, label %vector.body
+
+middle.block:                                     ; preds = %vector.body
+  %cmp.n = icmp eq i64 %n.vec, %wide.trip.count
+  br i1 %cmp.n, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us.preheader
+
+for.body4.us.preheader:                           ; preds = %for.cond1.preheader.us, %middle.block
+  %indvars.iv.ph = phi i64 [ 0, %for.cond1.preheader.us ], [ %n.vec, %middle.block ]
+  br label %for.body4.us
+
+for.body4.us:                                     ; preds = %for.body4.us.preheader, %for.body4.us
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body4.us ], [ %indvars.iv.ph, %for.body4.us.preheader ]
+  %arrayidx.us = getelementptr inbounds i8, ptr %src1.addr.031.us, i64 %indvars.iv
+  %30 = load i8, ptr %arrayidx.us, align 1
+  %conv.us = zext i8 %30 to i16
+  %arrayidx6.us = getelementptr inbounds i8, ptr %src2.addr.030.us, i64 %indvars.iv
+  %31 = load i8, ptr %arrayidx6.us, align 1
+  %conv7.us = zext i8 %31 to i16
+  %add.us = add nuw nsw i16 %conv.us, 1
+  %add8.us = add nuw nsw i16 %add.us, %conv7.us
+  %shr.us = lshr i16 %add8.us, 1
+  %conv9.us = trunc nuw i16 %shr.us to i8
+  %arrayidx11.us = getelementptr inbounds i8, ptr %dst.addr.032.us, i64 %indvars.iv
+  store i8 %conv9.us, ptr %arrayidx11.us, align 1
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:         ; preds = %for.body4.us, %middle.block
+  %add.ptr.us = getelementptr inbounds i8, ptr %dst.addr.032.us, i64 %idx.ext
+  %add.ptr13.us = getelementptr inbounds i8, ptr %src1.addr.031.us, i64 %idx.ext12
+  %add.ptr15.us = getelementptr inbounds i8, ptr %src2.addr.030.us, i64 %idx.ext14
+  %inc17.us = add nuw nsw i32 %y.033.us, 1
+  %exitcond35.not = icmp eq i32 %inc17.us, %i_height
+  br i1 %exitcond35.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:                                 ; preds = %for.cond1.for.cond.cleanup3_crit_edge.us, %for.cond1.preheader.lr.ph, %entry
+  ret void
+}
diff --git a/llvm/test/CodeGen/RISCV/rvv/vxrm-insert.ll b/llvm/test/CodeGen/RISCV/rvv/vxrm-insert.ll
index a869b433a4952e..6f20b48f5662a9 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vxrm-insert.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vxrm-insert.ll
@@ -388,10 +388,9 @@ mergeblock:
 define void @test10(ptr nocapture %ptr_dest, ptr nocapture readonly %ptr_op1, ptr nocapture readonly %ptr_op2, iXLen %n) {
 ; CHECK-LABEL: test10:
 ; CHECK:       # %bb.0: # %entry
-; CHECK-NEXT:    beqz a3, .LBB9_3
-; CHECK-NEXT:  # %bb.1: # %for.body.preheader
 ; CHECK-NEXT:    csrwi vxrm, 2
-; CHECK-NEXT:  .LBB9_2: # %for.body
+; CHECK-NEXT:    beqz a3, .LBB9_2
+; CHECK-NEXT:  .LBB9_1: # %for.body
 ; CHECK-NEXT:    # =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:    vsetvli a4, a3, e8, mf8, ta, ma
 ; CHECK-NEXT:    vle8.v v8, (a1)
@@ -399,8 +398,8 @@ define void @test10(ptr nocapture %ptr_dest, ptr nocapture readonly %ptr_op1, pt
 ; CHECK-NEXT:    vaadd.vv v8, v8, v9
 ; CHECK-NEXT:    sub a3, a3, a4
 ; CHECK-NEXT:    vse8.v v8, (a0)
-; CHECK-NEXT:    bnez a3, .LBB9_2
-; CHECK-NEXT:  .LBB9_3: # %for.end
+; CHECK-NEXT:    bnez a3, .LBB9_1
+; CHECK-NEXT:  .LBB9_2: # %for.end
 ; CHECK-NEXT:    ret
 entry:
   %tobool.not9 = icmp eq iXLen %n, 0

@JeffreyALaw

So just to chime in since I suggested Felipe tackle this problem.

As Felipe noted some designs flush their pipelines on a VXRM write which can be a significant performance issue. I measured a 2-3% improvement with a patch tackling this problem for GCC running spec2017's x264 workloads on the BPI-F3 board. That all comes from speculative hoisting of VXRM assignments needed to use vaaddu to implement the ceiling average found in pixel_avg.

@topperc
Collaborator

topperc commented Sep 25, 2024

> So just to chime in since I suggested Felipe tackle this problem.
>
> As Felipe noted some designs flush their pipelines on a VXRM write which can be a significant performance issue. I measured a 2-3% improvement with a patch tackling this problem for GCC running spec2017's x264 workloads on the BPI-F3 board. That all comes from speculative hoisting of VXRM assignments needed to use vaaddu to implement the ceiling average found in pixel_avg.

Do you have data for not using vaaddu in that loop and using a vwadd.vv, a vadd.vi, and a vnsrl instead? So that there's no VXRM write needed?

SiFive's p470 and p670 don't optimize VXRM writes well either. Even with the writes hoisted out of the loop nest, I couldn't get the vaaddu to be profitable.
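As a per-element scalar model of the tradeoff being weighed here (a sketch for illustration, not output from either lowering; the instruction names in the comments only indicate the intended mapping, with vwaddu.vv standing in for the widening add on unsigned data): vaaddu.vv under VXRM=rnu computes the rounding average in one instruction but depends on CSR state, while the widen/add/narrow sequence computes the same value with no VXRM write, at the cost of extra instructions and double-width intermediates.

```
#include <cassert>
#include <cstdint>

// One element of vaaddu.vv with VXRM = rnu (round-to-nearest-up): the rounding
// increment is the bit shifted out, which makes the result the "ceiling
// average" (a + b + 1) >> 1. A single op, but VXRM must have been written.
uint8_t avg_vaaddu_rnu(uint8_t a, uint8_t b) {
  unsigned sum = unsigned(a) + unsigned(b); // internal width is wider, no overflow
  unsigned r = sum & 1;                     // rnu rounding increment
  return uint8_t((sum >> 1) + r);
}

// One element of the VXRM-free sequence: widening add, add 1 at the wider
// element width, narrowing shift right. No CSR write needed.
uint8_t avg_widen_shift(uint8_t a, uint8_t b) {
  uint16_t wide = uint16_t(uint16_t(a) + uint16_t(b)); // vwaddu.vv
  wide = uint16_t(wide + 1);                           // vadd.vi
  return uint8_t(wide >> 1);                           // vnsrl.wi
}

int main() {
  // Both lowerings agree for every pair of bytes; the difference is purely
  // cost: a VXRM write (and possible pipeline flush) vs. extra work per loop.
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b)
      assert(avg_vaaddu_rnu(uint8_t(a), uint8_t(b)) ==
             avg_widen_shift(uint8_t(a), uint8_t(b)));
  return 0;
}
```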

@JeffreyALaw

> So just to chime in since I suggested Felipe tackle this problem.
>
> As Felipe noted some designs flush their pipelines on a VXRM write which can be a significant performance issue. I measured a 2-3% improvement with a patch tackling this problem for GCC running spec2017's x264 workloads on the BPI-F3 board. That all comes from speculative hoisting of VXRM assignments needed to use vaaddu to implement the ceiling average found in pixel_avg.
>
> Do you have data for not using vaaddu in that loop and using a vwadd.vv, a vadd.vi, and a vnsrl instead? So that there's no VXRM write needed?
>
> SiFive's p470 and p670 don't optimize VXRM writes well either. Even with the writes hoisted out of the loop nest, I couldn't get the vaaddu to be profitable.

No. We largely set this aside as we adjusted the Ventana design to make VXRM assignments cheap. But it seemed a shame to leave things performing so badly on the BPI given we had bits to make it "not so bad".

@JeffreyALaw

So pondering this overnight, obviously it's up to y'all what to do for LLVM around this specific patch.

However, it does seem clear to me that we're going to have designs with wildly different performance characteristics for VXRM assignments, and thus selecting vaaddu and friends really needs to be a micro-arch tuning decision. I'll have to get the ball rolling on that for GCC.

@topperc topperc self-requested a review September 27, 2024 17:44
@felipealmeida felipealmeida force-pushed the felipe_vxrm_hoisting_pr branch from 6bef118 to f825201 Compare September 30, 2024 01:22
@felipealmeida
Contributor Author

Hello,

I added a TuneVXRMPipelineFlush subfeature and conditioned the changes to VXRM on this. I added this subtarget feature to P470 and P670 and changed the tests accordingly.

I didn't find the SpacemiT K1 processor definition for the BPI. And probably other CPUs could benefit from this as well. Let me know if I should add that to any other processors.
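The gating described above can be modeled roughly like this. It is a hypothetical sketch: only the feature name TuneVXRMPipelineFlush comes from the comment above, and the rest reuses the simplified Unknown-as-nullopt model rather than the pass's real data structures.

```
#include <cassert>
#include <optional>

// nullopt models "Unknown", an int models a concrete VXRM value.
using Mode = std::optional<int>;

// Hypothetical gating of the aggressive anticipated merge on the tuning flag.
Mode merge(Mode A, Mode B, bool TuneVXRMPipelineFlush) {
  if (TuneVXRMPipelineFlush) {
    // Aggressive anticipation: ignore Unknown so the VXRM write can be hoisted.
    if (!A)
      return B;
    if (!B)
      return A;
  }
  // Conservative behavior, as before the patch: Unknown dominates.
  return (A && B && *A == *B) ? A : std::nullopt;
}

int main() {
  Mode RNU = 0, Unknown = std::nullopt;
  assert(merge(RNU, Unknown, /*TuneVXRMPipelineFlush=*/true));   // hoisting allowed
  assert(!merge(RNU, Unknown, /*TuneVXRMPipelineFlush=*/false)); // conservative
  return 0;
}
```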

@topperc
Collaborator

topperc commented Oct 3, 2024

> Hello,
>
> I added a TuneVXRMPipelineFlush subfeature and conditioned the changes to VXRM on this. I added this subtarget feature to P470 and P670 and changed the tests accordingly.
>
> I didn't find the SpacemiT K1 processor definition for the BPI. And probably other CPUs could benefit from this as well. Let me know if I should add that to any other processors.

def SPACEMIT_X60 : RISCVProcessorModel<"spacemit-x60" is the BPI. I think K1 is the chip with 8 cores. X60 is the individual core.

@topperc
Collaborator

topperc commented Oct 30, 2024

reverse ping

@felipealmeida felipealmeida force-pushed the felipe_vxrm_hoisting_pr branch from f825201 to 2c68f46 Compare October 30, 2024 17:18
@felipealmeida
Contributor Author

Added X60

Collaborator

@topperc topperc left a comment


LGTM

@zqb-all
Contributor

zqb-all commented Oct 31, 2024

LGTM

Change the intersect used by the anticipation algorithm to ignore
Unknown when anticipating. This effectively allows speculative VXRM
writes: a VXRM write can be emitted even when there are branches on
which VXRM is not needed.

This change matters because VXRM writes cause pipeline flushes on
some micro-architectures, so it makes sense to allow more aggressive
hoisting even if it causes some degradation on the slow path.

An example is this code:
```
typedef unsigned char uint8_t;
__attribute__ ((noipa))
void foo (uint8_t *dst,  int i_dst_stride,
           uint8_t *src1, int i_src1_stride,
           uint8_t *src2, int i_src2_stride,
           int i_width, int i_height )
{
   for( int y = 0; y < i_height; y++ )
     {
       for( int x = 0; x < i_width; x++ )
         dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
       dst  += i_dst_stride;
       src1 += i_src1_stride;
       src2 += i_src2_stride;
     }
}
```
With this patch, the VXRM write for the code above is hoisted out of the outer loop.
@felipealmeida felipealmeida force-pushed the felipe_vxrm_hoisting_pr branch from 2c68f46 to 8e1ea6e Compare November 27, 2024 13:44
@felipealmeida
Contributor Author

Rebased and fixed test from changes to X60 tuning

@topperc
Collaborator

topperc commented Nov 27, 2024

@felipealmeida do you need me to merge this?

@felipealmeida
Contributor Author

> @felipealmeida do you need me to merge this?

Yes, please. I'm not a committer.

@mshockwave mshockwave merged commit e3fdc3a into llvm:main Nov 27, 2024
7 checks passed
@github-actions

@felipealmeida Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!
