Description
This reproduces on zen4 after #121544
After InlineHLFIRAssign we now have this in the chozdt_ routine:
%23 = hlfir.minloc %19#0 {fastmath = #arith.fastmath<fast>} : (!fir.box<!fir.array<?xf32>>) -> !hlfir.expr<1xi32>
fir.do_loop %arg6 = %c1 to %c1 step %c1 unordered {
%55 = hlfir.apply %23, %arg6 : (!hlfir.expr<1xi32>, index) -> i32
%56 = hlfir.designate %7#0 (%arg6) : (!fir.ref<!fir.array<1xi32>>, index) -> !fir.ref<i32>
hlfir.assign %55 to %56 : i32, !fir.ref<i32>
}
The do-loop is the result of inlining:
hlfir.assign %23 to %7#0 : !hlfir.expr<1xi32>, !fir.ref<!fir.array<1xi32>>
After LLVM inlining and other optimizations, we have the following minloc loop:
.lr.ph.i: ; preds = %.lr.ph.i.preheader, %.lr.ph.i
%42 = phi i32 [ %51, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
%43 = phi float [ %52, %.lr.ph.i ], [ %41, %.lr.ph.i.preheader ]
%44 = phi i64 [ %53, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
%45 = shl nsw i64 %44, 2
%46 = getelementptr i8, ptr %10, i64 %45
%47 = load float, ptr %46, align 4, !tbaa !60
%48 = fcmp fast uge float %47, %43
%49 = trunc i64 %44 to i32
%50 = add nuw i32 %49, 1
%51 = select i1 %48, i32 %42, i32 %50
%52 = select i1 %48, float %43, float %47
%53 = add nuw nsw i64 %44, 1
%exitcond.not.i = icmp eq i64 %53, %9
br i1 %exitcond.not.i, label %_FortranAMinlocReal4x1_i32_fast_simplified.exit, label %.lr.ph.i, !llvm.loop !64
The select operations form a cyclic (loop-carried) dependency, and they are later lowered to cmov/minss instructions. Before my change, the SimplifyCFG pass could not produce the selects, presumably because the result of the integer select was stored into the temporary <1 x i32> array.
The cyclic dependency introduced by the integer cmov seems to limit the performance of the loop. The loop would be better off with a compare and jump.
Performance is restored with -disable-select-optimize=false; unfortunately, the select optimization is disabled by default on X86.
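To make the two loop shapes concrete, here is a minimal C sketch of a 1-based minloc over a float array. It is only an illustration and not the code Flang generates: the function names `minloc_select` and `minloc_branch` are hypothetical, and an optimizing compiler is free to turn either form into the other.

```c
#include <assert.h>
#include <stddef.h>

/* Select form: the running minimum and its position are both updated
   through conditional expressions, so every iteration depends on the
   previous one through data (this mirrors the two select instructions). */
int minloc_select(const float *a, size_t n) {
    assert(n > 0);                    /* sketch assumes a non-empty array */
    float min = a[0];
    int loc = 1;                      /* Fortran-style 1-based position */
    for (size_t i = 1; i < n; ++i) {
        float x = a[i];
        int keep = !(x < min);        /* same sense as the "uge" compare */
        loc = keep ? loc : (int)i + 1;
        min = keep ? min : x;
    }
    return loc;
}

/* Branch form: the update happens under a compare and jump, so a
   well-predicted branch lets the loop run ahead when no new minimum
   is found. */
int minloc_branch(const float *a, size_t n) {
    assert(n > 0);
    float min = a[0];
    int loc = 1;
    for (size_t i = 1; i < n; ++i) {
        if (a[i] < min) {
            min = a[i];
            loc = (int)i + 1;
        }
    }
    return loc;
}
```

Both functions return the position of the first minimum; they differ only in whether the update is expressed as data flow or as control flow.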
Possible solutions:
- Try to enable the select optimization for X86, but I guess it is disabled for a reason.
- We can try to trick SimplifyCFG into not converting the compare-and-jump into selects by attaching the following branch probability to the jump instruction:
%35 = load float, ptr %34, align 4, !tbaa !1609
%36 = fcmp fast olt float %35, %23
%37 = fcmp fast une float %23, %23
%38 = fcmp fast oeq float %35, %35
%39 = and i1 %37, %38
%40 = or i1 %36, %39
%41 = trunc i32 %27 to i1
%42 = xor i1 %41, true
%43 = or i1 %40, %42
br i1 %43, label %44, label %47, !prof !2000
44: ; preds = %26
store i32 1, ptr %5, align 4, !tbaa !1609
%45 = trunc i64 %22 to i32
%46 = add i32 %45, 1
store i32 %46, ptr %14, align 4, !tbaa !1609
br label %47
47: ; preds = %44, %26
...
!2000 = !{!"branch_weights", !"expected", i32 1, i32 99}
This is just a trick, though; the right solution would be to allow the select optimization to use its heuristics.
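For readers more familiar with C, the branch-weight trick has a rough source-level analogue in the GCC/Clang builtin `__builtin_expect`. The sketch below is an assumption-laden illustration, not what Flang does: the function name `minloc_hinted` is hypothetical, and whether a given compiler version keeps the branch or still emits a select is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Branch form with an "unlikely" hint on the update path.
   __builtin_expect(cond, 0) tells GCC/Clang that cond is expected to be
   false, which the compiler may record as branch probability information. */
int minloc_hinted(const float *a, size_t n) {
    assert(n > 0);                    /* sketch assumes a non-empty array */
    float min = a[0];
    int loc = 1;                      /* Fortran-style 1-based position */
    for (size_t i = 1; i < n; ++i) {
        if (__builtin_expect(a[i] < min, 0)) {
            min = a[i];
            loc = (int)i + 1;
        }
    }
    return loc;
}
```

The hint does not change the result, only the information the optimizer has about how often the update path is taken.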