@@ -49,9 +49,8 @@ successors.
49
49
50
50
Instructions are opaque to the allocator except for a few important
51
51
bits: (1) `is_ret` (is a return instruction); (2) `is_branch` (is a
52
- branch instruction); (3) `is_move` (is a move between registers), and
53
- (4) a vector of Operands, covered below. Every block must end in a
54
- return or branch.
52
+ branch instruction); and (3) a vector of Operands, covered below.
53
+ Every block must end in a return or branch.
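
The block-termination rule above can be sketched as a small check. This is a minimal illustration only: `InstInfo`, `BlockRange`, and `first_unterminated_block` are hypothetical names invented here, not part of the regalloc2 API, and blocks are assumed to be half-open ranges of instruction indices.

```rust
/// Hypothetical, simplified stand-in for the per-instruction facts the
/// allocator needs; this is not the real regalloc2 interface.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct InstInfo {
    pub is_ret: bool,
    pub is_branch: bool,
}

/// A block modeled as a contiguous, half-open range of instruction indices.
#[derive(Clone, Copy, Debug)]
pub struct BlockRange {
    pub start: usize,
    pub end: usize,
}

/// Returns the index of the first block that is empty, out of bounds, or
/// does not end in a return or branch; `None` if every block is well-formed.
pub fn first_unterminated_block(insts: &[InstInfo], blocks: &[BlockRange]) -> Option<usize> {
    blocks.iter().position(|b| {
        if b.start >= b.end || b.end > insts.len() {
            return true;
        }
        let last = insts[b.end - 1];
        !(last.is_ret || last.is_branch)
    })
}
```

A client could run a check like this before handing a function to the allocator, so that a malformed block is reported early rather than surfacing as a confusing failure later.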
55
54
56
55
Both instructions and blocks are named by indices in contiguous index
57
56
spaces. A block's instructions must be a contiguous range of
@@ -103,9 +102,6 @@ consists of the following fields:
103
102
available in a way that does not conflict with outputs, the use
104
103
should be placed at the "after" position.
105
104
106
- This operand-specification design allows for SSA and non-SSA code (see
107
- section below for details).
108
-
109
105
VRegs, or virtual registers, are specified by an index and a register
110
106
class (Float or Int). The classes are not given separately; they are
111
107
encoded on every mention of the vreg. (In a sense, the class is an
@@ -133,9 +129,7 @@ thus imperative that we support this pattern well in the register
133
129
allocator.
134
130
135
131
This instruction-set design is somewhat at odds with an SSA
136
- representation, where a value cannot be redefined. Even in non-SSA
137
- code, it is awkward to overwrite a vreg that may need to be used again
138
- later.
132
+ representation, where a value cannot be redefined.
139
133
140
134
Thus, the allocator supports a useful fiction of sorts: the
141
135
instruction can be described as if it has three register mentions --
@@ -151,31 +145,12 @@ We will see below how the allocator makes this work by doing some
151
145
preprocessing so that the core allocation algorithms do not need to
152
146
worry about this constraint.
153
147
154
- Note that some non-SSA clients, such as Cranelift using the
155
- regalloc.rs-to-regalloc2 compatibility shim, will instead generate
156
- their own copies (copying to the output vreg first) and then use "mod"
157
- operand kinds, which allow the output vreg to be both read and
158
- written. regalloc2 works hard to make this as efficient as the
159
- reused-input scheme by treating moves specially (see below).
160
-
161
148
## SSA
162
149
163
- regalloc2 was originally designed to take an SSA IR as input, where
164
- the usual definitions apply: every vreg is defined exactly once, and
165
- every vreg use is dominated by its one def. (Useing blockparams means
166
- that we do not need additional conditions for phi-nodes.)
167
-
168
- The allocator then evolved to support non-SSA inputs as well. As a
169
- result, the input is maximally flexible right now: it does not check
170
- for and enforce, nor try to take advantage of, the single-def
171
- rule. However, blockparams are still available.
172
-
173
- In the future, we hope to change this, however, once compilation of
174
- non-SSA inputs is no longer needed. Specifically, if we can migrate
175
- Cranelift to the native regalloc2 API rather than the regalloc.rs
176
- compatibility shim, we will be able to remove "mod" operand kinds,
177
- assume (and verify) single defs, and take advantage of this when
178
- reasoning about various algorithms in the allocator.
150
+ regalloc2 takes an SSA IR as input, where the usual definitions apply:
151
+ every vreg is defined exactly once, and every vreg use is dominated by
152
+ its one def. (Using blockparams means that we do not need additional
153
+ conditions for phi-nodes.)
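
As a rough illustration of the single-def half of the SSA condition, the sketch below looks for a vreg that is defined twice. `VReg` here is a hypothetical index wrapper, and the flat `defs` slice (instruction defs plus blockparam defs) is an assumed input shape; checking that every use is dominated by its def would additionally need the CFG and is not attempted.

```rust
use std::collections::HashSet;

/// Hypothetical vreg handle for illustration: just an index.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct VReg(pub u32);

/// Returns the first vreg that is defined more than once, if any.
/// `defs` is assumed to list every def in the function, covering both
/// ordinary instruction defs and blockparam defs.
pub fn first_redefined_vreg(defs: &[VReg]) -> Option<VReg> {
    let mut seen = HashSet::new();
    // `insert` returns false when the value was already present.
    defs.iter().copied().find(|v| !seen.insert(*v))
}
```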
179
154
180
155
## Block Parameters
181
156
@@ -198,52 +173,6 @@ def. The tradeoff is that a vreg's def now has two possibilities --
198
173
ordinary instruction def or blockparam def -- but this is fairly
199
174
reasonable to handle.
200
175
201
- ## Non-SSA
202
-
203
- As mentioned, regalloc2 supports non-SSA inputs as well. No special
204
- flag is needed to place the allocator in this mode or disable SSA
205
- verification. However, we hope to eventually remove this functionality
206
- when it is no longer needed.
207
-
208
- ## Program Moves
209
-
210
- As an especially useful feature for non-SSA IR, regalloc2 supports
211
- special handling of "move" instructions: it will try to merge the
212
- input and output allocations to elide the move altogether.
213
-
214
- It turns out that moves are used frequently in the non-SSA input that
215
- we observe from Cranelift via the regalloc.rs compatibility shim. They
216
- are used in three different ways:
217
-
218
- - Moves to or from physical registers, used to implement ABI details
219
- or place values in particular registers required by certain
220
- instructions.
221
- - Moves between vregs on program edges, as lowered from phi/blockparam
222
- dataflow in the higher-level SSA IR (CLIF).
223
- - Moves just prior to two-address-form instructions that modify an
224
- input to form an output: the input is moved to the output vreg to
225
- avoid clobbering the input.
226
-
227
- Note that, strictly speaking, special handling of program moves is
228
- redundant because each of these kinds of uses has an equivalent in the
229
- "native" regalloc2 API:
230
-
231
- - Moves to/from physical registers can become operand constraints,
232
- either on a particular instruction that requires/produces the values
233
- in certain registers (e.g., a call or ret with args/results in regs,
234
- or a special instruction with fixed register args), or on a ghost
235
- instruction at the top of function that defs vregs for all in-reg
236
- args.
237
-
238
- - Moves between vregs as a lowering of blockparams/phi nodes can be
239
- replaced with use of regalloc2's native blockparam support.
240
-
241
- - Moves prior to two-address-form instructions can be replaced with
242
- the reused-input mechanism.
243
-
244
- Thus, eventually, special handling of program moves should be
245
- removed. However, it is very important for performance at the moment.
246
-
247
176
## Output
248
177
249
178
The allocator produces two main data structures as output: an array of
@@ -336,25 +265,6 @@ branch that ends a block. There is exactly one "out" tuple for every
336
265
"in" tuple. As mentioned above, we will later scan over both to
337
266
generate moves.
338
267
339
- ### Program-Move Vectors: Source-Side and Dest-Side
340
-
341
- Similar to blockparams, we handle moves specially. In fact, we ingest
342
- all moves in the input program into a set of vectors -- "move sources"
343
- and "move dests", analogous to the "ins" and "outs" blockparam vectors
344
- described above -- and then completely ignore the moves in the program
345
- thereafter. The semantics of the API are such that all program moves
346
- will be recreated with regalloc-inserted edits, and should not still
347
- be emitted after regalloc. This may seem inefficient, but in fact it
348
- allows for better code because it integrates program-moves with the
349
- move resolution that handles other forms of vreg movement. We
350
- previously took the simpler approach of handling program-moves as
351
- opaque instructions with a source and dest, and we found that there
352
- were many redundant move-chains (A->B, B->C) that are eliminated when
353
- everything is handled centrally.
354
-
355
- We also construct a `prog_move_merges` vector of live-range index pairs
356
- to attempt to merge when we reach that stage of allocation.
357
-
358
268
## Core Allocation State: Ranges, Uses, Bundles, VRegs, PRegs
359
269
360
270
We now come to the core data structures: live-ranges, bundles, virtual
@@ -571,31 +481,14 @@ For each instruction, we process its effects on the scan state:
571
481
instruction), add a single-program-point liverange to each clobbered
572
482
preg.
573
483
574
- - If not a move:
575
-   - for each program point [after, before], for each operand at
576
-     this point(\*):
577
-     - if a def or mod:
578
-       - if not currently live, this is a dead def; create an empty
579
-         LR.
580
-       - if a def:
581
-         - set the start of the LR for this vreg to this point.
582
-         - set as dead.
583
-     - if a use:
584
-       - create LR if not live, with start at beginning of block.
585
-
586
- - Else, if a move:
587
-   - simple case (no pinned vregs):
588
-     - add to `prog_move` data structures, and update LRs as above.
589
-     - effective point for the use is *after* the move, and for the mod
590
-       is *before* the *next* instruction. Why not more conventional
591
-       use-before, def-after? Because this allows the move to happen in
592
-       parallel with other moves that the move-resolution inserts
593
-       (between split fragments of a vreg); these moves always happen
594
-       at the gaps between instructions. We place it after, not before,
595
-       because before may land at a block-start and interfere with edge
596
-       moves, while after is always a "normal" gap (a move cannot end a
597
-       block).
598
-   - otherwise: see below (pinned vregs).
484
+ - For each program point [after, before], for each operand at
485
+   this point(\*):
486
+   - if a def:
487
+     - if not currently live, this is a dead def; create an empty LR.
488
+     - set the start of the LR for this vreg to this point.
489
+     - set as dead.
490
+   - if a use:
491
+     - create LR if not live, with start at beginning of block.
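
The per-operand steps above can be sketched as a backward scan over a single block. This is a simplified model under stated assumptions, not the allocator's actual code: `Operand`, `LiveRange`, and `scan_block` are hypothetical names, program points are numbered as `2 * inst` for "before" and `2 * inst + 1` for "after", ranges are half-open, and clobbers, fixed registers, and the operand-position adjustments from the footnote are all ignored.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind { Def, Use }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Pos { Before, After }

/// Hypothetical operand: a vreg index, a kind, and a position.
#[derive(Clone, Copy, Debug)]
pub struct Operand { pub vreg: u32, pub kind: Kind, pub pos: Pos }

/// Half-open range [from, to) of program points within one block.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LiveRange { pub vreg: u32, pub from: u32, pub to: u32 }

/// Scans one block's instructions in reverse and builds live ranges.
pub fn scan_block(insts: &[Vec<Operand>]) -> Vec<LiveRange> {
    let mut ranges: Vec<LiveRange> = Vec::new();
    // vreg -> index into `ranges` of its currently open (live) range.
    let mut live: HashMap<u32, usize> = HashMap::new();
    for (i, ops) in insts.iter().enumerate().rev() {
        for pos in [Pos::After, Pos::Before] {
            let point = 2 * i as u32 + if pos == Pos::After { 1 } else { 0 };
            for op in ops.iter().filter(|op| op.pos == pos) {
                match op.kind {
                    Kind::Def => {
                        // Removing from `live` marks the vreg as dead above this point.
                        let idx = match live.remove(&op.vreg) {
                            Some(idx) => idx,
                            None => {
                                // Dead def: create an empty range at this point.
                                ranges.push(LiveRange { vreg: op.vreg, from: point, to: point });
                                ranges.len() - 1
                            }
                        };
                        ranges[idx].from = point;
                    }
                    Kind::Use => {
                        // If not live yet, open a range starting at the block start;
                        // a def found later in the scan will trim its start.
                        live.entry(op.vreg).or_insert_with(|| {
                            ranges.push(LiveRange { vreg: op.vreg, from: 0, to: point + 1 });
                            ranges.len() - 1
                        });
                    }
                }
            }
        }
    }
    ranges
}
```

Any vreg still in `live` when the scan reaches the top of the block is live-in, which is why a newly opened range provisionally starts at the block start.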
599
492
600
493
601
494
(\*) an instruction operand's effective point is adjusted in a few
@@ -609,62 +502,6 @@ the block), and create the "ins" tuples. (The uses for the other side
609
502
of the edge are already handled as normal uses on a branch
610
503
instruction.)
611
504
612
- ### Optimization: Pinned VRegs and Moves
613
-
614
- In order to efficiently handle the translation from the regalloc.rs
615
- API, which uses named RealRegs that are distinct from VirtualRegs
616
- rather than operand constraints, we need to implement a few
617
- optimizations. The translation layer translates RealRegs as particular
618
- vregs at the regalloc2 layer, because we need to track their liveness
619
- properly. Handling these as "normal" vregs, with massive bundles of
620
- many liveranges throughout the function, turns out to be a very
621
- inefficient solution. So we mark them as "pinned" with a hook in the
622
- RA2 API. Semantically, this means they are always assigned to a
623
- particular preg whenever mentioned in an operand (but *NOT* between
624
- those points; it is possible for a pinned vreg to move all about
625
- registers and stackslots as long as it eventually makes it back to its
626
- home preg in time for its next use).
627
-
628
- This has a few implications during liverange construction. First, when
629
- we see an operand that mentions a pinned vreg, we translate this to an
630
- operand constraint that names a fixed preg. Later, when we build
631
- bundles, we will not create a bundle for the pinned vreg; instead we
632
- will transfer its liveranges directly as unmoveable reservations in
633
- pregs' allocation maps. Finally, we need to handle moves specially.
634
-
635
- With the caveat that "this is a massive hack and I am very very
636
- sorry", here is how it works. A move between two pinned vregs is easy:
637
- we add that to the inserted-moves vector right away because we know the
638
- Allocation on both sides. A move from a pinned vreg to a normal vreg
639
- is the first interesting case. In this case, we (i) create a ghost def
640
- with a fixed-register policy on the normal vreg, doing the other
641
- liverange-maintenance bits as above, and (ii) adjust the liveranges on
642
- the pinned vreg (so the preg) in a particular way. If the preg is live
643
- flowing downward, then this move implies a copy, because the normal
644
- vreg and the pinned vreg are both used in the future and cannot
645
- overlap. But we cannot keep the preg continuously live, because at
646
- exactly one program point, the normal vreg is pinned to it. So we cut
647
- the downward-flowing liverange just *after* the normal vreg's
648
- fixed-reg ghost def. Then, whether it is live downward or not, we
649
- create an upward-flowing liverange on the pinned vreg that ends just
650
- *before* the ghost def.
651
-
652
- The move-from-normal-to-pinned case is similar. First, we create a
653
- ghost use on the normal vreg that pins its value at this program point
654
- to the fixed preg. Then, if the preg is live flowing downward, we trim
655
- its downward liverange to start just after the fixed use.
656
-
657
- There are also some tricky metadata-maintenance records that we emit
658
- so that the checker can keep this all straight.
659
-
660
- The outcome of this hack, together with the operand-constraint
661
- translation on normal uses/defs/mods on pinned vregs, is that we
662
- essentially are translating regalloc.rs's means of referring to real
663
- registers to regalloc2's preferred abstractions by doing a bit of
664
- reverse-engineering. It is not perfect, but it works. Still, we hope
665
- to rip it all out once we get rid of the need for the compatibility
666
- shim.
667
-
668
505
### Handling Reused Inputs
669
506
670
507
Reused inputs are also handled a bit specially. We have already
@@ -710,9 +547,8 @@ splitting later), and then try out assignments, backtrack via
710
547
eviction, and split continuously to chip away at the problem until we
711
548
have a working set of allocation assignments.
712
549
713
- We attempt to merge three kinds of bundle pairs: reused-input to
714
- corresponding output; across program moves; and across blockparam
715
- assignments.
550
+ We attempt to merge two kinds of bundle pairs: reused-input to
551
+ corresponding output; and across blockparam assignments.
716
552
717
553
To merge two bundles, we traverse over both their sorted liverange
718
554
vectors at once, checking for overlaps. Note that we can do this without
@@ -1016,18 +852,6 @@ stack spill, and if this is the case, it is better to do the store
1016
852
just after the last use and reload just before the first use of the
1017
853
respective bundles.
1018
854
1019
- Unfortunately, this heuristic choice does interact somewhat poorly
1020
- with program moves: moves between two normal (non-pinned) vregs do not
1021
- create ghost uses or defs, and so these points of the ranges can be
1022
- spilled, turning a normal register move into a move from or to the
1023
- stack. However, empirically, we have found that adding such ghost
1024
- uses/defs actually regresses some cases as well, because it pulls
1025
- values back into registers when we could have had a stack-to-stack
1026
- move (that might even be a no-op if the same spillset); overall, it
1027
- seems better to trim. It also improves allocation performance by
1028
- reducing contention in the registers during the core loop (before
1029
- second-chance allocation).
1030
-
1031
855
## Second-Chance Allocation: Spilled Bundles
1032
856
1033
857
Once the main allocation loop terminates, when all bundles have either
@@ -1111,11 +935,8 @@ There are two sources of moves that we must generate. The first are
1111
935
moves between different ranges of the same vreg, as the split pieces
1112
936
of that vreg's original bundle may have been assigned to different
1113
937
locations. The second are moves that result from move semantics in the
1114
- input program: either assignments from blockparam args on branches to
1115
- the target block's params, or program move instructions. (Recall that
1116
- we reify program moves in a unified way with all other moves, so the
1117
- client should not generate any machine code for their original moves
1118
- in the pre-allocation program.)
938
+ input program: assignments from blockparam args on branches to the
939
+ target block's params.
1119
940
1120
941
Moves are tricky to handle efficiently because they join two
1121
942
potentially very different locations in the program (in the case of
@@ -1157,24 +978,12 @@ This completes the "edge-moves". We sort the half-move array and then
1157
978
have all of the alloc-to-alloc pairs on a given (from-block, to-block)
1158
979
edge.
1159
980
1160
- There are also two kinds of moves that happen within blocks. First,
1161
- when a live-range ends and another begins for the same vreg in the
1162
- same block (i.e., a split in the middle of a block), we know both
981
+ Next, when a live-range ends and another begins for the same vreg in
982
+ the same block (i.e., a split in the middle of a block), we know both
1163
983
sides of the move immediately (because it is the same vreg and we can
1164
984
look up the adjacent allocation easily), and we can generate that
1165
985
move.
1166
986
1167
- Second, program moves occur within blocks. Here we need to do a
1168
- similar thing as for block-edge half-moves, but keyed on program point
1169
- instead. This is why the `prog_move_srcs` and `prog_move_dsts` arrays
1170
- are initially sorted by their (vreg, inst) keys: we can directly fill
1171
- in their allocation slots during our main scan. Note that when sorted
1172
- this way, the source and dest for a given move instruction will be at
1173
- different indices. After the main scan, we *re-sort* the arrays by
1174
- just the instruction, so the two sides of a move line up at the same
1175
- index; we can then traverse both arrays, zipped together, and generate
1176
- moves.
1177
-
1178
987
Finally, we generate moves to fix up multi-fixed-reg-constraint
1179
988
situations, and make reused inputs work, as described earlier.
1180
989
@@ -1206,9 +1015,7 @@ priorities:
1206
1015
1207
1016
- In-edge moves, to place edge-moves before the first instruction in a
1208
1017
block.
1209
- - Block-param metadata, used for the checker only.
1210
1018
- Regular, used for vreg movement between allocations.
1211
- - Post-regular, used for checker metadata related to pinned-vreg moves.
1212
1019
- Multi-fixed-reg, used for moves that handle the
1213
1020
single-vreg-in-multiple-fixed-pregs constraint case.
1214
1021
- Reused-input, used for implementing outputs with reused-input policies.
@@ -1354,54 +1161,18 @@ approach of doing only an intra-block analysis. This turns out to be
1354
1161
sufficient to remove most redundant moves, especially in the common
1355
1162
case of a single use of an otherwise-spilled value.
1356
1163
1357
- Note that we could do better *if* we accepted only SSA code, because
1358
- we would know that a value could not be redefined once written. We
1359
- should consider this again once we clean up and remove the non-SSA
1360
- support.
1164
+ Note that there is an opportunity to do better: because we accept only
1165
+ SSA code, we know that a value cannot be redefined once written.
1361
1166
1362
1167
# Future Plans
1363
1168
1364
- ## SSA-Only Cleanup
1365
-
1366
- When the major user (Cranelift via the regalloc.rs shim) migrates to
1367
- generate SSA code and native regalloc2 operands, there are many bits
1368
- of complexity we can remove, as noted throughout this
1369
- writeup. Briefly, we could (i) remove special handling of program
1370
- moves, (ii) remove the pinned-vreg hack, (iii) simplify redundant-move
1371
- elimination, (iv) remove special handling of "mod" operands, and (v)
1372
- probably simplify plenty of code given the invariant that a def always
1373
- starts a range.
1374
-
1375
- More importantly, we expect this change to result in potentially much
1376
- better allocation performance. The use of special pinned vregs and
1377
- moves to/from them instead of fixed-reg constraints, explicit moves
1378
- for every reused-input constraint, and already-sequentialized series
1379
- of move instructions on edges for phi nodes, are all expensive ways of
1380
- encoding regalloc2's native input primitives that have to be
1381
- reverse-engineered. Removing that translation layer would be
1382
- ideal. Also, allowing regalloc2 to handle phi-node (blockparam)
1383
- lowering in a way that is integrated with other moves will likely
1384
- generate better code than the way that program-move handling interacts
1385
- with Cranelift's manually lowered phi-moves at the moment.
1386
-
1387
1169
## Better Split Heuristics
1388
1170
1389
1171
We have spent quite some effort trying to improve splitting behavior,
1390
1172
and it is now generally decent, but more work could be done here,
1391
1173
especially with regard to the interaction between splits and the loop
1392
1174
nest.
1393
1175
1394
- ## Native Debuginfo Output
1395
-
1396
- Cranelift currently computes value locations (in registers and
1397
- stack-slots) for detailed debuginfo with an expensive post-pass, after
1398
- regalloc is complete. This is because the existing register allocator
1399
- does not support returning this information directly. However,
1400
- providing such information by generating it while we scan over
1401
- liveranges in each vreg would be relatively simple, and has the
1402
- potential to be much faster and more reliable for Cranelift. We should
1403
- investigate adding an interface for this to regalloc2 and using it.
1404
-
1405
1176
# Appendix: Comparison to IonMonkey Allocator
1406
1177
1407
1178
There are a number of differences between the [IonMonkey