Skip to content

Commit 392c7a9

Browse files
authored
Cranelift/x64 backend: do not use one-way branches. (bytecodealliance#10086)
* Cranelift/x64 backend: do not use one-way branches. In bytecodealliance#9980, we saw that code copmiled with the single-pass register allocator has incorrect behavior. We eventually narrowed this down to the fact that the single-pass allocator is inserting code meant to be at the end of a block, just before its terminator, *between* two branches that form the terminator sequence. The allocator is correct; the bug is with Cranelift's x64 backend. When we produce instructions into a VCode container, we maintain basic blocks, and we have the invariant (usual for basic block-based IR) that only the last -- terminator -- instruction is a branch that can leave the block. Even the conditional branches maintain this invariant: though VCode is meant to be "almost machine code", we emit *two-target conditionals* that are semantically like "jcond; jmp". We then are able to optimize this inline during binary emission in the `MachBuffer`: the buffer knows about unconditional and conditional branches and will "chomp" branches off the tail of the buffer whenever they target the fallthrough block. (We designed the system this way because it is simpler to think about BBs that are order-invariant, i.e., not bake the "fallthrough" concept into the IR.) Thus we have a simpler abstraction but produce optimal terminator sequences. Unfortunately, when adding a branch-on-floating-point-compare lowering, we had the need to branch to a target if either of *two* conditions were true, and rather than add a new kind of terminator instruction, we added a "one-armed branch": conditionally branch to label or fall through. We emitted this in sequence right before the actual terminator, so semantically it was almost equivalent. I write "almost" because the register allocator *is* allowed to insert spills/reloads/moves between any two instructions. Here the distinct pieces of the terminator sequence matter: the allocator might insert something just before the last instruction, assuming the basic-block "single in, single out" invariant means this will always run with the block. With one-armed branches this is no longer true. The backtracking allocator (our original RA2 algorithm, and still the default today) will never insert code at the end of a block when it has multiple terminators, because it associates such block-start/end insertions with *edges*; so in such conditions it inserts instructions into the tops of successor blocks instead. But the single-pass allocator needs to perform work at the end of every block, so it will trigger this bug. This PR removes `JmpIf` and converts the br-of-fcmp lowering to use `JmpCondOr` instead, which is a pseudoinstruction that does `jcc1; jcc2; jmp`. This maintains the BB invariant and fixes the bug. Note that Winch still uses `JmpIf`, so we cannot remove it entirely: this PR renames it to `WinchJmpIf` instead, and adds a mechanism to assert failure if it is ever added to `VCode` (rather than emitted directly, as Winch's macro-assembler does). We could instead write Winch's `jmp_if` assembler function in terms of `JmpCond` with a fallthrough label that is immediately bound, and let the MachBuffer always chomp the jmp; I opted not to regress Winch compiler performance by doing this. If one day we abstract out the assembler further, we can remove `WinchJmpIf`. This is one of two instances of a "one-armed branch"; the other is s390x's `OneWayCondBr`, used in `br_table` lowerings, which we will address separately. Once we do, that will address bytecodealliance#9980 entirely. * Add test for cascading branch-chomping behavior. * keep the paperclip happy
1 parent 1e58f1b commit 392c7a9

File tree

10 files changed

+201
-59
lines changed

10 files changed

+201
-59
lines changed

cranelift/codegen/src/isa/x64/inst.isle

Lines changed: 29 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -578,17 +578,16 @@
578578
;; Jump to a known target: jmp simm32.
579579
(JmpKnown (dst MachLabel))
580580

581-
;; One-way conditional branch: jcond cond target.
581+
;; Low-level one-way conditional branch: jcond cond target.
582582
;;
583-
;; This instruction is useful when we have conditional jumps depending on
584-
;; more than two conditions, see for instance the lowering of Brif
585-
;; with Fcmp inputs.
586-
;;
587-
;; A note of caution: in contexts where the branch target is another
588-
;; block, this has to be the same successor as the one specified in the
589-
;; terminator branch of the current block. Otherwise, this might confuse
590-
;; register allocation by creating new invisible edges.
591-
(JmpIf (cc CC)
583+
;; This instruction is useful only for lower-level code
584+
;; generators that use the Cranelift instruction backend as an
585+
;; assembler library. The instruction is thus named after its
586+
;; primary user, Winch. This instruction *should not* be used
587+
;; by Cranelift proper and placed into VCode: it does not
588+
;; adhere to the basic-block invariant, namely that branches
589+
;; always end a block (with no fallthrough).
590+
(WinchJmpIf (cc CC)
592591
(taken MachLabel))
593592

594593
;; Two-way conditional branch: jcond cond target target.
@@ -599,6 +598,17 @@
599598
(taken MachLabel)
600599
(not_taken MachLabel))
601600

601+
;; Two-way conditional branch with a combination of conditions:
602+
;;
603+
;; j(cc1 or cc2) target1 target2
604+
;;
605+
;; Emitted as a compound sequence of three branches -- `jcc1
606+
;; target1`, `jcc2 target1`, `jmp target2`.
607+
(JmpCondOr (cc1 CC)
608+
(cc2 CC)
609+
(taken MachLabel)
610+
(not_taken MachLabel))
611+
602612
;; Jump-table sequence, as one compound instruction (see note in lower.rs
603613
;; for rationale).
604614
;;
@@ -5030,15 +5040,16 @@
50305040
(rule (jmp_known target)
50315041
(SideEffectNoResult.Inst (MInst.JmpKnown target)))
50325042

5033-
(decl jmp_if (CC MachLabel) ConsumesFlags)
5034-
(rule (jmp_if cc taken)
5035-
(ConsumesFlags.ConsumesFlagsSideEffect (MInst.JmpIf cc taken)))
5036-
50375043
;; Conditional jump based on the condition code.
50385044
(decl jmp_cond (CC MachLabel MachLabel) ConsumesFlags)
50395045
(rule (jmp_cond cc taken not_taken)
50405046
(ConsumesFlags.ConsumesFlagsSideEffect (MInst.JmpCond cc taken not_taken)))
50415047

5048+
;; Conditional jump based on the OR of two condition codes.
5049+
(decl jmp_cond_or (CC CC MachLabel MachLabel) ConsumesFlags)
5050+
(rule (jmp_cond_or cc1 cc2 taken not_taken)
5051+
(ConsumesFlags.ConsumesFlagsSideEffect (MInst.JmpCondOr cc1 cc2 taken not_taken)))
5052+
50425053
;; Conditional jump based on the result of an icmp.
50435054
(decl jmp_cond_icmp (IcmpCondResult MachLabel MachLabel) SideEffectNoResult)
50445055
(rule (jmp_cond_icmp (IcmpCondResult.Condition producer cc) taken not_taken)
@@ -5050,14 +5061,12 @@
50505061
(with_flags_side_effect producer (jmp_cond cc taken not_taken)))
50515062
(rule (jmp_cond_fcmp (FcmpCondResult.AndCondition producer cc1 cc2) taken not_taken)
50525063
(with_flags_side_effect producer
5053-
(consumes_flags_concat
5054-
(jmp_if (cc_invert cc1) not_taken)
5055-
(jmp_cond (cc_invert cc2) not_taken taken))))
5064+
;; DeMorgan's rule: to get cc1 AND cc2, we do NOT (NOT cc1 OR NOT cc2).
5065+
;; The outer NOT comes from flipping `not_taken` and `taken`.
5066+
(jmp_cond_or (cc_invert cc1) (cc_invert cc2) not_taken taken)))
50565067
(rule (jmp_cond_fcmp (FcmpCondResult.OrCondition producer cc1 cc2) taken not_taken)
50575068
(with_flags_side_effect producer
5058-
(consumes_flags_concat
5059-
(jmp_if cc1 taken)
5060-
(jmp_cond cc2 taken not_taken))))
5069+
(jmp_cond_or cc1 cc2 taken not_taken)))
50615070

50625071
;; Emit the compound instruction that does:
50635072
;;

cranelift/codegen/src/isa/x64/inst/emit.rs

Lines changed: 73 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1622,7 +1622,8 @@ pub(crate) fn emit(
16221622
inst.emit(sink, info, state);
16231623

16241624
// jne .loop_start
1625-
// TODO: Encoding the JmpIf as a short jump saves us 4 bytes here.
1625+
// TODO: Encoding the conditional jump as a short jump
1626+
// could save us us 4 bytes here.
16261627
one_way_jmp(sink, CC::NZ, loop_start);
16271628

16281629
// The regular prologue code is going to emit a `sub` after this, so we need to
@@ -1891,7 +1892,7 @@ pub(crate) fn emit(
18911892
sink.put4(0x0);
18921893
}
18931894

1894-
Inst::JmpIf { cc, taken } => {
1895+
Inst::WinchJmpIf { cc, taken } => {
18951896
let cond_start = sink.cur_offset();
18961897
let cond_disp_off = cond_start + 2;
18971898

@@ -1936,6 +1937,76 @@ pub(crate) fn emit(
19361937
sink.put4(0x0);
19371938
}
19381939

1940+
Inst::JmpCondOr {
1941+
cc1,
1942+
cc2,
1943+
taken,
1944+
not_taken,
1945+
} => {
1946+
// Emit:
1947+
// jcc1 taken
1948+
// jcc2 taken
1949+
// jmp not_taken
1950+
//
1951+
// Note that we enroll both conditionals in the
1952+
// branch-chomping mechanism because MachBuffer
1953+
// simplification can continue upward as long as it keeps
1954+
// chomping branches. In the best case, if taken ==
1955+
// not_taken and that one block is the fallthrough block,
1956+
// all three branches can disappear.
1957+
1958+
// jcc1 taken
1959+
let cond_1_start = sink.cur_offset();
1960+
let cond_1_disp_off = cond_1_start + 2;
1961+
let cond_1_end = cond_1_start + 6;
1962+
1963+
sink.use_label_at_offset(cond_1_disp_off, *taken, LabelUse::JmpRel32);
1964+
let inverted: [u8; 6] = [
1965+
0x0F,
1966+
0x80 + (cc1.invert().get_enc()),
1967+
0x00,
1968+
0x00,
1969+
0x00,
1970+
0x00,
1971+
];
1972+
sink.add_cond_branch(cond_1_start, cond_1_end, *taken, &inverted[..]);
1973+
1974+
sink.put1(0x0F);
1975+
sink.put1(0x80 + cc1.get_enc());
1976+
sink.put4(0x0);
1977+
1978+
// jcc2 taken
1979+
let cond_2_start = sink.cur_offset();
1980+
let cond_2_disp_off = cond_2_start + 2;
1981+
let cond_2_end = cond_2_start + 6;
1982+
1983+
sink.use_label_at_offset(cond_2_disp_off, *taken, LabelUse::JmpRel32);
1984+
let inverted: [u8; 6] = [
1985+
0x0F,
1986+
0x80 + (cc2.invert().get_enc()),
1987+
0x00,
1988+
0x00,
1989+
0x00,
1990+
0x00,
1991+
];
1992+
sink.add_cond_branch(cond_2_start, cond_2_end, *taken, &inverted[..]);
1993+
1994+
sink.put1(0x0F);
1995+
sink.put1(0x80 + cc2.get_enc());
1996+
sink.put4(0x0);
1997+
1998+
// jmp not_taken
1999+
let uncond_start = sink.cur_offset();
2000+
let uncond_disp_off = uncond_start + 1;
2001+
let uncond_end = uncond_start + 5;
2002+
2003+
sink.use_label_at_offset(uncond_disp_off, *not_taken, LabelUse::JmpRel32);
2004+
sink.add_uncond_branch(uncond_start, uncond_end, *not_taken);
2005+
2006+
sink.put1(0xE9);
2007+
sink.put4(0x0);
2008+
}
2009+
19392010
Inst::JmpUnknown { target } => {
19402011
let target = target.clone();
19412012

cranelift/codegen/src/isa/x64/inst/mod.rs

Lines changed: 25 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -93,7 +93,8 @@ impl Inst {
9393
| Inst::Hlt
9494
| Inst::Imm { .. }
9595
| Inst::JmpCond { .. }
96-
| Inst::JmpIf { .. }
96+
| Inst::JmpCondOr { .. }
97+
| Inst::WinchJmpIf { .. }
9798
| Inst::JmpKnown { .. }
9899
| Inst::JmpTableSeq { .. }
99100
| Inst::JmpUnknown { .. }
@@ -1738,12 +1739,24 @@ impl PrettyPrint for Inst {
17381739
format!("{op} {dst}")
17391740
}
17401741

1741-
Inst::JmpIf { cc, taken } => {
1742+
Inst::WinchJmpIf { cc, taken } => {
17421743
let taken = taken.to_string();
17431744
let op = ljustify2("j".to_string(), cc.to_string());
17441745
format!("{op} {taken}")
17451746
}
17461747

1748+
Inst::JmpCondOr {
1749+
cc1,
1750+
cc2,
1751+
taken,
1752+
not_taken,
1753+
} => {
1754+
let taken = taken.to_string();
1755+
let not_taken = not_taken.to_string();
1756+
let op = ljustify(format!("j{cc1},{cc2}"));
1757+
format!("{op} {taken}; j {not_taken}")
1758+
}
1759+
17471760
Inst::JmpCond {
17481761
cc,
17491762
taken,
@@ -2657,8 +2670,9 @@ fn x64_get_operands(inst: &mut Inst, collector: &mut impl OperandVisitor) {
26572670
}
26582671

26592672
Inst::JmpKnown { .. }
2660-
| Inst::JmpIf { .. }
2673+
| Inst::WinchJmpIf { .. }
26612674
| Inst::JmpCond { .. }
2675+
| Inst::JmpCondOr { .. }
26622676
| Inst::Ret { .. }
26632677
| Inst::Nop { .. }
26642678
| Inst::TrapIf { .. }
@@ -2775,12 +2789,20 @@ impl MachInst for Inst {
27752789
}
27762790
&Self::JmpKnown { .. } => MachTerminator::Uncond,
27772791
&Self::JmpCond { .. } => MachTerminator::Cond,
2792+
&Self::JmpCondOr { .. } => MachTerminator::Cond,
27782793
&Self::JmpTableSeq { .. } => MachTerminator::Indirect,
27792794
// All other cases are boring.
27802795
_ => MachTerminator::None,
27812796
}
27822797
}
27832798

2799+
fn is_low_level_branch(&self) -> bool {
2800+
match self {
2801+
&Self::WinchJmpIf { .. } => true,
2802+
_ => false,
2803+
}
2804+
}
2805+
27842806
fn is_mem_access(&self) -> bool {
27852807
panic!("TODO FILL ME OUT")
27862808
}

cranelift/codegen/src/isa/x64/pcc.rs

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -840,8 +840,9 @@ pub(crate) fn check(
840840
| Inst::ReturnCallKnown { .. }
841841
| Inst::JmpKnown { .. }
842842
| Inst::Ret { .. }
843-
| Inst::JmpIf { .. }
843+
| Inst::WinchJmpIf { .. }
844844
| Inst::JmpCond { .. }
845+
| Inst::JmpCondOr { .. }
845846
| Inst::TrapIf { .. }
846847
| Inst::TrapIfAnd { .. }
847848
| Inst::TrapIfOr { .. }

cranelift/codegen/src/machinst/mod.rs

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -201,6 +201,14 @@ pub trait MachInst: Clone + Debug {
201201
/// architecture.
202202
fn function_alignment() -> FunctionAlignment;
203203

204+
/// Is this a low-level, one-way branch, not meant for use in a
205+
/// VCode body? These instructions are meant to be used only when
206+
/// directly emitted, i.e. when `MachInst` is used as an assembler
207+
/// library.
208+
fn is_low_level_branch(&self) -> bool {
209+
false
210+
}
211+
204212
/// A label-use kind: a type that describes the types of label references that
205213
/// can occur in an instruction.
206214
type LabelUse: MachInstLabelUse;

cranelift/codegen/src/machinst/vcode.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -357,6 +357,7 @@ impl<I: VCodeInst> VCodeBuilder<I> {
357357
/// Push an instruction for the current BB and current IR inst
358358
/// within the BB.
359359
pub fn push(&mut self, insn: I, loc: RelSourceLoc) {
360+
assert!(!insn.is_low_level_branch()); // These are not meant to be in VCode.
360361
self.vcode.insts.push(insn);
361362
self.vcode.srclocs.push(loc);
362363
}

cranelift/filetests/filetests/isa/x64/branches.clif

Lines changed: 52 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -167,8 +167,7 @@ block2:
167167
; movq %rsp, %rbp
168168
; block0:
169169
; ucomiss %xmm1, %xmm0
170-
; jp label1
171-
; jnz label1; j label2
170+
; jp,nz label1; j label2
172171
; block1:
173172
; movl $2, %eax
174173
; movq %rbp, %rsp
@@ -216,8 +215,7 @@ block2:
216215
; movq %rsp, %rbp
217216
; block0:
218217
; ucomiss %xmm1, %xmm0
219-
; jp label1
220-
; jnz label1; j label2
218+
; jp,nz label1; j label2
221219
; block1:
222220
; movl $1, %eax
223221
; movq %rbp, %rsp
@@ -265,8 +263,7 @@ block2:
265263
; movq %rsp, %rbp
266264
; block0:
267265
; ucomiss %xmm1, %xmm0
268-
; jp label2
269-
; jnz label2; j label1
266+
; jp,nz label2; j label1
270267
; block1:
271268
; movl $1, %eax
272269
; movq %rbp, %rsp
@@ -631,8 +628,7 @@ block202:
631628
; movq %rsp, %rbp
632629
; block0:
633630
; ucomiss const(1), %xmm0
634-
; jp label2
635-
; jnz label2; j label1
631+
; jp,nz label2; j label1
636632
; block1:
637633
; jmp label5
638634
; block2:
@@ -753,8 +749,7 @@ block2:
753749
; movq %rsp, %rbp
754750
; block0:
755751
; ucomiss %xmm1, %xmm0
756-
; jp label1
757-
; jnz label1; j label2
752+
; jp,nz label1; j label2
758753
; block1:
759754
; movl $2, %eax
760755
; movq %rbp, %rsp
@@ -855,8 +850,7 @@ block2:
855850
; movq %rsp, %rbp
856851
; block0:
857852
; ucomiss %xmm1, %xmm0
858-
; jp label1
859-
; jnz label1; j label2
853+
; jp,nz label1; j label2
860854
; block1:
861855
; movl $2, %eax
862856
; movq %rbp, %rsp
@@ -887,6 +881,52 @@ block2:
887881
; popq %rbp
888882
; retq
889883

884+
function %brif_i8_fcmp_same_target(f32, f32) -> i32 {
885+
block0(v0: f32, v1: f32):
886+
v2 = fcmp eq v0, v1
887+
v3 = uextend.i32 v2
888+
;; This test should demonstrate branch-chomping work on the combo
889+
;; two-condition branch lowered from `fcmp`; in fact this case is
890+
;; even more interesting because critical-edge splitting will create
891+
;; edge blocks (block1 and block2 in lowered VCode below), since
892+
;; otherwise we have multiple outs from first block and multiple ins
893+
;; to second block; and then branch-chomping elides five (!)
894+
;; cascading branches in a row.
895+
brif v3, block1, block1
896+
897+
block1:
898+
v4 = iconst.i32 1
899+
return v4
900+
}
901+
902+
; VCode:
903+
; pushq %rbp
904+
; movq %rsp, %rbp
905+
; block0:
906+
; ucomiss %xmm1, %xmm0
907+
; jp,nz label2; j label1
908+
; block1:
909+
; jmp label3
910+
; block2:
911+
; jmp label3
912+
; block3:
913+
; movl $1, %eax
914+
; movq %rbp, %rsp
915+
; popq %rbp
916+
; ret
917+
;
918+
; Disassembled:
919+
; block0: ; offset 0x0
920+
; pushq %rbp
921+
; movq %rsp, %rbp
922+
; block1: ; offset 0x4
923+
; ucomiss %xmm1, %xmm0
924+
; block2: ; offset 0x7
925+
; movl $1, %eax
926+
; movq %rbp, %rsp
927+
; popq %rbp
928+
; retq
929+
890930
function %br_table_i32(i32) -> i32 {
891931
block0(v0: i32):
892932
br_table v0, block4, [block1, block2, block2, block3]

0 commit comments

Comments
 (0)