[flang] 1.5x performance regression on gas_dyn2 #121599

This reproduces on zen4 after https://github.com/llvm/llvm-project/pull/121544

After `InlineHLFIRAssign` we now have this in the `chozdt_` routine:
```
  %23 = hlfir.minloc %19#0 {fastmath = #arith.fastmath<fast>} : (!fir.box<!fir.array<?xf32>>) -> !hlfir.expr<1xi32>
  fir.do_loop %arg6 = %c1 to %c1 step %c1 unordered {
    %55 = hlfir.apply %23, %arg6 : (!hlfir.expr<1xi32>, index) -> i32
    %56 = hlfir.designate %7#0 (%arg6)  : (!fir.ref<!fir.array<1xi32>>, index) -> !fir.ref<i32>
    hlfir.assign %55 to %56 : i32, !fir.ref<i32>
  }
```

The do-loop is the result of inlining:
```
  hlfir.assign %23 to %7#0 : !hlfir.expr<1xi32>, !fir.ref<!fir.array<1xi32>>
```

After LLVM inlining and other optimizations, we have the following minloc loop:
```
.lr.ph.i:                                         ; preds = %.lr.ph.i.preheader, %.lr.ph.i
  %42 = phi i32 [ %51, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
  %43 = phi float [ %52, %.lr.ph.i ], [ %41, %.lr.ph.i.preheader ]
  %44 = phi i64 [ %53, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
  %45 = shl nsw i64 %44, 2
  %46 = getelementptr i8, ptr %10, i64 %45
  %47 = load float, ptr %46, align 4, !tbaa !60
  %48 = fcmp fast uge float %47, %43
  %49 = trunc i64 %44 to i32
  %50 = add nuw i32 %49, 1
  %51 = select i1 %48, i32 %42, i32 %50
  %52 = select i1 %48, float %43, float %47
  %53 = add nuw nsw i64 %44, 1
  %exitcond.not.i = icmp eq i64 %53, %9
  br i1 %exitcond.not.i, label %_FortranAMinlocReal4x1_i32_fast_simplified.exit, label %.lr.ph.i, !llvm.loop !64
```

The two select operations form a cyclic (loop-carried) dependency, and they are later lowered to cmov/minss instructions. Before my change, the SimplifyCFG pass could not produce the selects, presumably because the integer result was stored into the temporary `<1 x i32>` array.

The cyclic dependency introduced by the integer cmov seems to limit the performance of the loop. The loop would be better off with a compare and jump.

Performance is restored with `-disable-select-optimize=false`. Unfortunately, the select optimization is disabled by default on X86.
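To make the two loop shapes concrete, here is a minimal C sketch of the same minloc reduction written both ways. It is only a source-level illustration: the function names are hypothetical, NaN handling is ignored (as under fast-math), `n` is assumed to be at least 1, and a compiler is free to turn either form into the other.

```c
#include <stddef.h>

/* Select form: the running minimum and its 1-based index are both
   recomputed on every iteration, so each iteration depends on the
   previous one through the value chain and the index chain. */
int minloc_select(const float *x, size_t n) {
    float best = x[0];
    int loc = 1;
    for (size_t i = 1; i < n; ++i) {
        int keep = x[i] >= best;           /* plays the role of the fcmp */
        loc  = keep ? loc  : (int)i + 1;   /* integer select (cmov candidate) */
        best = keep ? best : x[i];         /* float select (minss candidate) */
    }
    return loc;
}

/* Compare-and-jump form: the index is only written when a new
   minimum is found, so a well-predicted branch keeps the index
   update off the common path. */
int minloc_branch(const float *x, size_t n) {
    float best = x[0];
    int loc = 1;
    for (size_t i = 1; i < n; ++i) {
        if (x[i] < best) {
            best = x[i];
            loc = (int)i + 1;
        }
    }
    return loc;
}
```

Both functions return the 1-based position of the first minimum; they differ only in how the update is expressed.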

Possible solutions:
* Try to enable the select optimization for X86, but I guess it is disabled for a reason.
* We can try to trick SimplifyCFG into not converting the compare-and-jump into selects by attaching the following branch weights to the jump instruction:
```
  %35 = load float, ptr %34, align 4, !tbaa !1609
  %36 = fcmp fast olt float %35, %23
  %37 = fcmp fast une float %23, %23
  %38 = fcmp fast oeq float %35, %35
  %39 = and i1 %37, %38
  %40 = or i1 %36, %39
  %41 = trunc i32 %27 to i1
  %42 = xor i1 %41, true
  %43 = or i1 %40, %42
  br i1 %43, label %44, label %47, !prof !2000

44:                                               ; preds = %26
  store i32 1, ptr %5, align 4, !tbaa !1609
  %45 = trunc i64 %22 to i32
  %46 = add i32 %45, 1
  store i32 %46, ptr %14, align 4, !tbaa !1609
  br label %47

47:                                               ; preds = %44, %26
...

!2000 = !{!"branch_weights", !"expected",  i32 1, i32 99}
```
This is just a trick, though. The right solution would be to let the select optimization apply its heuristics.
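As a rough source-level analogy of the branch-weight trick, the C sketch below marks the "new minimum found" branch as unlikely with `__builtin_expect`, a GCC/Clang extension. The function name is hypothetical, and whether a given compiler keeps the branch instead of if-converting it is not guaranteed. This only illustrates the kind of hint involved; it is not the proposed flang change.

```c
#include <stddef.h>

/* Minloc with a hint that finding a new minimum is rare.
   Assumes n >= 1 and no NaNs in x. */
int minloc_hinted(const float *x, size_t n) {
    float best = x[0];
    int loc = 1;                           /* 1-based, as in Fortran */
    for (size_t i = 1; i < n; ++i) {
        /* Second argument 0: the condition is expected to be false. */
        if (__builtin_expect(x[i] < best, 0)) {
            best = x[i];
            loc = (int)i + 1;
        }
    }
    return loc;
}
```

The hint does not change the result, only the branch probability information the optimizer sees.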
