Commit f1bf70a

Cleanup DESIGN.md (#124)
Update DESIGN.md to reflect the current state of RA2.
1 parent f0e9cde commit f1bf70a

1 file changed

+23
-252
lines changed

doc/DESIGN.md

Lines changed: 23 additions & 252 deletions
@@ -49,9 +49,8 @@ successors.
4949

5050
Instructions are opaque to the allocator except for a few important
5151
bits: (1) `is_ret` (is a return instruction); (2) `is_branch` (is a
52-
branch instruction); (3) `is_move` (is a move between registers), and
53-
(4) a vector of Operands, covered below. Every block must end in a
54-
return or branch.
52+
branch instruction); and (3) a vector of Operands, covered below.
53+
Every block must end in a return or branch.
5554

5655
Both instructions and blocks are named by indices in contiguous index
5756
spaces. A block's instructions must be a contiguous range of
@@ -103,9 +102,6 @@ consists of the following fields:
103102
available in a way that does not conflict with outputs, the use
104103
should be placed at the "after" position.
105104

106-
This operand-specification design allows for SSA and non-SSA code (see
107-
section below for details).
108-
109105
VRegs, or virtual registers, are specified by an index and a register
110106
class (Float or Int). The classes are not given separately; they are
111107
encoded on every mention of the vreg. (In a sense, the class is an
@@ -133,9 +129,7 @@ thus imperative that we support this pattern well in the register
133129
allocator.
134130

135131
This instruction-set design is somewhat at odds with an SSA
136-
representation, where a value cannot be redefined. Even in non-SSA
137-
code, it is awkward to overwrite a vreg that may need to be used again
138-
later.
132+
representation, where a value cannot be redefined.
139133

140134
Thus, the allocator supports a useful fiction of sorts: the
141135
instruction can be described as if it has three register mentions --
@@ -151,31 +145,12 @@ We will see below how the allocator makes this work by doing some
151145
preprocessing so that the core allocation algorithms do not need to
152146
worry about this constraint.
153147

154-
Note that some non-SSA clients, such as Cranelift using the
155-
regalloc.rs-to-regalloc2 compatibility shim, will instead generate
156-
their own copies (copying to the output vreg first) and then use "mod"
157-
operand kinds, which allow the output vreg to be both read and
158-
written. regalloc2 works hard to make this as efficient as the
159-
reused-input scheme by treating moves specially (see below).
160-
161148
## SSA
162149

163-
regalloc2 was originally designed to take an SSA IR as input, where
164-
the usual definitions apply: every vreg is defined exactly once, and
165-
every vreg use is dominated by its one def. (Using blockparams means
166-
that we do not need additional conditions for phi-nodes.)
167-
168-
The allocator then evolved to support non-SSA inputs as well. As a
169-
result, the input is maximally flexible right now: it does not check
170-
for and enforce, nor try to take advantage of, the single-def
171-
rule. However, blockparams are still available.
172-
173-
In the future, we hope to change this, however, once compilation of
174-
non-SSA inputs is no longer needed. Specifically, if we can migrate
175-
Cranelift to the native regalloc2 API rather than the regalloc.rs
176-
compatibility shim, we will be able to remove "mod" operand kinds,
177-
assume (and verify) single defs, and take advantage of this when
178-
reasoning about various algorithms in the allocator.
150+
regalloc2 takes an SSA IR as input, where the usual definitions apply:
151+
every vreg is defined exactly once, and every vreg use is dominated by
152+
its one def. (Using blockparams means that we do not need additional
153+
conditions for phi-nodes.)
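
As a rough illustration of the single-def half of the SSA rule above, the following sketch reports the first vreg that is defined more than once. The `VReg` alias, the `Inst` struct, and `find_redefinition` are hypothetical stand-ins for this example and are not regalloc2's API; the dominance condition is not checked here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a virtual register index (not regalloc2's `VReg`).
type VReg = u32;

/// Hypothetical instruction that records only the vregs it defines.
struct Inst {
    defs: Vec<VReg>,
}

/// Returns the first vreg that is defined more than once, if any.
/// Blockparam defs are passed separately, because a vreg's single def is
/// either an ordinary instruction def or a blockparam def.
fn find_redefinition(blockparams: &[VReg], insts: &[Inst]) -> Option<VReg> {
    let mut def_count: HashMap<VReg, usize> = HashMap::new();
    let all_defs = blockparams
        .iter()
        .chain(insts.iter().flat_map(|i| i.defs.iter()));
    for &v in all_defs {
        let n = def_count.entry(v).or_insert(0);
        *n += 1;
        if *n > 1 {
            return Some(v);
        }
    }
    None
}
```

A vreg that appears both as a blockparam and as an instruction def counts as redefined in this sketch.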
179154

180155
## Block Parameters
181156

@@ -198,52 +173,6 @@ def. The tradeoff is that a vreg's def now has two possibilities --
198173
ordinary instruction def or blockparam def -- but this is fairly
199174
reasonable to handle.
200175

201-
## Non-SSA
202-
203-
As mentioned, regalloc2 supports non-SSA inputs as well. No special
204-
flag is needed to place the allocator in this mode or disable SSA
205-
verification. However, we hope to eventually remove this functionality
206-
when it is no longer needed.
207-
208-
## Program Moves
209-
210-
As an especially useful feature for non-SSA IR, regalloc2 supports
211-
special handling of "move" instructions: it will try to merge the
212-
input and output allocations to elide the move altogether.
213-
214-
It turns out that moves are used frequently in the non-SSA input that
215-
we observe from Cranelift via the regalloc.rs compatibility shim. They
216-
are used in three different ways:
217-
218-
- Moves to or from physical registers, used to implement ABI details
219-
or place values in particular registers required by certain
220-
instructions.
221-
- Moves between vregs on program edges, as lowered from phi/blockparam
222-
dataflow in the higher-level SSA IR (CLIF).
223-
- Moves just prior to two-address-form instructions that modify an
224-
input to form an output: the input is moved to the output vreg to
225-
avoid clobbering the input.
226-
227-
Note that, strictly speaking, special handling of program moves is
228-
redundant because each of these kinds of uses has an equivalent in the
229-
"native" regalloc2 API:
230-
231-
- Moves to/from physical registers can become operand constraints,
232-
either on a particular instruction that requires/produces the values
233-
in certain registers (e.g., a call or ret with args/results in regs,
234-
or a special instruction with fixed register args), or on a ghost
235-
instruction at the top of function that defs vregs for all in-reg
236-
args.
237-
238-
- Moves between vregs as a lowering of blockparams/phi nodes can be
239-
replaced with use of regalloc2's native blockparam support.
240-
241-
- Moves prior to two-address-form instructions can be replaced with
242-
the reused-input mechanism.
243-
244-
Thus, eventually, special handling of program moves should be
245-
removed. However, it is very important for performance at the moment.
246-
247176
## Output
248177

249178
The allocator produces two main data structures as output: an array of
@@ -336,25 +265,6 @@ branch that ends a block. There is exactly one "out" tuple for every
336265
"in" tuple. As mentioned above, we will later scan over both to
337266
generate moves.
338267

339-
### Program-Move Vectors: Source-Side and Dest-Side
340-
341-
Similar to blockparams, we handle moves specially. In fact, we ingest
342-
all moves in the input program into a set of vectors -- "move sources"
343-
and "move dests", analogous to the "ins" and "outs" blockparam vectors
344-
described above -- and then completely ignore the moves in the program
345-
thereafter. The semantics of the API are such that all program moves
346-
will be recreated with regalloc-inserted edits, and should not still
347-
be emitted after regalloc. This may seem inefficient, but in fact it
348-
allows for better code because it integrates program-moves with the
349-
move resolution that handles other forms of vreg movement. We
350-
previously took the simpler approach of handling program-moves as
351-
opaque instructions with a source and dest, and we found that there
352-
were many redundant move-chains (A->B, B->C) that are eliminated when
353-
everything is handled centrally.
354-
355-
We also construct a `prog_move_merges` vector of live-range index pairs
356-
to attempt to merge when we reach that stage of allocation.
357-
358268
## Core Allocation State: Ranges, Uses, Bundles, VRegs, PRegs
359269

360270
We now come to the core data structures: live-ranges, bundles, virtual
@@ -571,31 +481,14 @@ For each instruction, we process its effects on the scan state:
571481
instruction), add a single-program-point liverange to each clobbered
572482
preg.
573483

574-
- If not a move:
575-
- for each program point [after, before], for each operand at
576-
this point(\*):
577-
- if a def or mod:
578-
- if not currently live, this is a dead def; create an empty
579-
LR.
580-
- if a def:
581-
- set the start of the LR for this vreg to this point.
582-
- set as dead.
583-
- if a use:
584-
- create LR if not live, with start at beginning of block.
585-
586-
- Else, if a move:
587-
- simple case (no pinned vregs):
588-
- add to `prog_move` data structures, and update LRs as above.
589-
- effective point for the use is *after* the move, and for the mod
590-
is *before* the *next* instruction. Why not more conventional
591-
use-before, def-after? Because this allows the move to happen in
592-
parallel with other moves that the move-resolution inserts
593-
(between split fragments of a vreg); these moves always happen
594-
at the gaps between instructions. We place it after, not before,
595-
because before may land at a block-start and interfere with edge
596-
moves, while after is always a "normal" gap (a move cannot end a
597-
block).
598-
- otherwise: see below (pinned vregs).
484+
- For each program point [after, before], for each operand at
485+
this point(\*):
486+
- if a def:
487+
- if not currently live, this is a dead def; create an empty LR.
488+
- set the start of the LR for this vreg to this point.
489+
- set as dead.
490+
- if a use:
491+
- create LR if not live, with start at beginning of block.
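
The def/use steps above can be sketched for a single block as a backward scan. This is a simplified model under stated assumptions, not regalloc2's code: `Operand`, `LiveRange`, and `build_block_ranges` are hypothetical names, program points are encoded as `inst * 2 + pos`, ranges are half-open, and a dead def is represented as a zero-length range.

```rust
use std::collections::HashMap;

type VReg = u32; // hypothetical vreg index

#[derive(Clone, Copy, PartialEq)]
enum Pos { Before = 0, After = 1 }

#[derive(Clone, Copy)]
enum Kind { Def, Use }

struct Operand { vreg: VReg, kind: Kind, pos: Pos }

/// Half-open range [from, to) over program points (inst * 2 + pos).
#[derive(Debug, PartialEq)]
struct LiveRange { vreg: VReg, from: u32, to: u32 }

fn build_block_ranges(insts: &[Vec<Operand>], live_out: &[VReg]) -> Vec<LiveRange> {
    let block_start = 0u32;
    let block_end = insts.len() as u32 * 2;
    // Ranges whose start is not yet known (the vreg is currently live).
    let mut open: HashMap<VReg, LiveRange> = HashMap::new();
    let mut done = Vec::new();
    for &v in live_out {
        open.insert(v, LiveRange { vreg: v, from: block_start, to: block_end });
    }
    // Scan instructions backward; within each, visit After then Before.
    for (i, ops) in insts.iter().enumerate().rev() {
        for pos in [Pos::After, Pos::Before] {
            let point = i as u32 * 2 + pos as u32;
            for op in ops.iter().filter(|o| o.pos == pos) {
                match op.kind {
                    Kind::Def => {
                        // Not live: dead def, start from an empty range.
                        let empty = LiveRange { vreg: op.vreg, from: point, to: point };
                        let mut lr = open.remove(&op.vreg).unwrap_or(empty);
                        lr.from = point; // the def starts the range
                        done.push(lr); // vreg is now dead above this point
                    }
                    Kind::Use => {
                        // Not live: create a range from the block start through this use.
                        open.entry(op.vreg).or_insert(LiveRange {
                            vreg: op.vreg, from: block_start, to: point + 1,
                        });
                    }
                }
            }
        }
    }
    // Ranges still open at the top are live-in to the block.
    done.extend(open.into_values());
    done.sort_by_key(|lr| (lr.from, lr.vreg));
    done
}
```

Ranges left open when the scan reaches the top of the block keep their block-start position, which corresponds to a value that is live-in to the block.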
599492

600493

601494
(\*) an instruction operand's effective point is adjusted in a few
@@ -609,62 +502,6 @@ the block), and create the "ins" tuples. (The uses for the other side
609502
of the edge are already handled as normal uses on a branch
610503
instruction.)
611504

612-
### Optimization: Pinned VRegs and Moves
613-
614-
In order to efficiently handle the translation from the regalloc.rs
615-
API, which uses named RealRegs that are distinct from VirtualRegs
616-
rather than operand constraints, we need to implement a few
617-
optimizations. The translation layer translates RealRegs as particular
618-
vregs at the regalloc2 layer, because we need to track their liveness
619-
properly. Handling these as "normal" vregs, with massive bundles of
620-
many liveranges throughout the function, turns out to be a very
621-
inefficient solution. So we mark them as "pinned" with a hook in the
622-
RA2 API. Semantically, this means they are always assigned to a
623-
particular preg whenever mentioned in an operand (but *NOT* between
624-
those points; it is possible for a pinned vreg to move all about
625-
registers and stackslots as long as it eventually makes it back to its
626-
home preg in time for its next use).
627-
628-
This has a few implications during liverange construction. First, when
629-
we see an operand that mentions a pinned vreg, we translate this to an
630-
operand constraint that names a fixed preg. Later, when we build
631-
bundles, we will not create a bundle for the pinned vreg; instead we
632-
will transfer its liveranges directly as unmoveable reservations in
633-
pregs' allocation maps. Finally, we need to handle moves specially.
634-
635-
With the caveat that "this is a massive hack and I am very very
636-
sorry", here is how it works. A move between two pinned vregs is easy:
637-
we add that to the inserted-moves vector right away because we know the
638-
Allocation on both sides. A move from a pinned vreg to a normal vreg
639-
is the first interesting case. In this case, we (i) create a ghost def
640-
with a fixed-register policy on the normal vreg, doing the other
641-
liverange-maintenance bits as above, and (ii) adjust the liveranges on
642-
the pinned vreg (so the preg) in a particular way. If the preg is live
643-
flowing downward, then this move implies a copy, because the normal
644-
vreg and the pinned vreg are both used in the future and cannot
645-
overlap. But we cannot keep the preg continuously live, because at
646-
exactly one program point, the normal vreg is pinned to it. So we cut
647-
the downward-flowing liverange just *after* the normal vreg's
648-
fixed-reg ghost def. Then, whether it is live downward or not, we
649-
create an upward-flowing liverange on the pinned vreg that ends just
650-
*before* the ghost def.
651-
652-
The move-from-normal-to-pinned case is similar. First, we create a
653-
ghost use on the normal vreg that pins its value at this program point
654-
to the fixed preg. Then, if the preg is live flowing downward, we trim
655-
its downward liverange to start just after the fixed use.
656-
657-
There are also some tricky metadata-maintenance records that we emit
658-
so that the checker can keep this all straight.
659-
660-
The outcome of this hack, together with the operand-constraint
661-
translation on normal uses/defs/mods on pinned vregs, is that we
662-
essentially are translating regalloc.rs's means of referring to real
663-
registers to regalloc2's preferred abstractions by doing a bit of
664-
reverse-engineering. It is not perfect, but it works. Still, we hope
665-
to rip it all out once we get rid of the need for the compatibility
666-
shim.
667-
668505
### Handling Reused Inputs
669506

670507
Reused inputs are also handled a bit specially. We have already
@@ -710,9 +547,8 @@ splitting later), and then try out assignments, backtrack via
710547
eviction, and split continuously to chip away at the problem until we
711548
have a working set of allocation assignments.
712549

713-
We attempt to merge three kinds of bundle pairs: reused-input to
714-
corresponding output; across program moves; and across blockparam
715-
assignments.
550+
We attempt to merge two kinds of bundle pairs: reused-input to
551+
corresponding output; and across blockparam assignments.
716552

717553
To merge two bundles, we traverse over both their sorted liverange
718554
vectors at once, checking for overlaps. Note that we can do this without
@@ -1016,18 +852,6 @@ stack spill, and if this is the case, it is better to do the store
1016852
just after the last use and reload just before the first use of the
1017853
respective bundles.
1018854

1019-
Unfortunately, this heuristic choice does interact somewhat poorly
1020-
with program moves: moves between two normal (non-pinned) vregs do not
1021-
create ghost uses or defs, and so these points of the ranges can be
1022-
spilled, turning a normal register move into a move from or to the
1023-
stack. However, empirically, we have found that adding such ghost
1024-
uses/defs actually regresses some cases as well, because it pulls
1025-
values back into registers when we could have had a stack-to-stack
1026-
move (that might even be a no-op if the same spillset); overall, it
1027-
seems better to trim. It also improves allocation performance by
1028-
reducing contention in the registers during the core loop (before
1029-
second-chance allocation).
1030-
1031855
## Second-Chance Allocation: Spilled Bundles
1032856

1033857
Once the main allocation loop terminates, when all bundles have either
@@ -1111,11 +935,8 @@ There are two sources of moves that we must generate. The first are
1111935
moves between different ranges of the same vreg, as the split pieces
1112936
of that vreg's original bundle may have been assigned to different
1113937
locations. The second are moves that result from move semantics in the
1114-
input program: either assignments from blockparam args on branches to
1115-
the target block's params, or program move instructions. (Recall that
1116-
we reify program moves in a unified way with all other moves, so the
1117-
client should not generate any machine code for their original moves
1118-
in the pre-allocation program.)
938+
input program: assignments from blockparam args on branches to the
939+
target block's params.
1119940

1120941
Moves are tricky to handle efficiently because they join two
1121942
potentially very different locations in the program (in the case of
@@ -1157,24 +978,12 @@ This completes the "edge-moves". We sort the half-move array and then
1157978
have all of the alloc-to-alloc pairs on a given (from-block, to-block)
1158979
edge.
1159980

1160-
There are also two kinds of moves that happen within blocks. First,
1161-
when a live-range ends and another begins for the same vreg in the
1162-
same block (i.e., a split in the middle of a block), we know both
981+
Next, when a live-range ends and another begins for the same vreg in
982+
the same block (i.e., a split in the middle of a block), we know both
1163983
sides of the move immediately (because it is the same vreg and we can
1164984
look up the adjacent allocation easily), and we can generate that
1165985
move.
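
A minimal sketch of this intra-block case, assuming a vreg's split pieces are available as a list sorted by start point. The names `Alloc`, `Piece`, `Move`, and `intra_block_moves` are hypothetical and not regalloc2's types. The sketch also assumes that boundaries falling on a block start are left to the edge-move logic and that no move is needed when both pieces share an allocation.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Alloc {
    Reg(u8),
    Stack(u32),
}

/// One split piece of a vreg: half-open range [from, to) plus its allocation.
struct Piece {
    from: u32,
    to: u32,
    alloc: Alloc,
}

#[derive(Debug, PartialEq)]
struct Move {
    at: u32,
    src: Alloc,
    dst: Alloc,
}

/// Emits a move wherever one piece ends exactly where the next begins,
/// the boundary is in the middle of a block, and the allocations differ.
fn intra_block_moves(pieces: &[Piece], block_starts: &[u32]) -> Vec<Move> {
    pieces
        .windows(2)
        .filter_map(|w| {
            let (prev, next) = (&w[0], &w[1]);
            let abutting = prev.to == next.from;
            let mid_block = !block_starts.contains(&next.from);
            if abutting && mid_block && prev.alloc != next.alloc {
                Some(Move { at: next.from, src: prev.alloc, dst: next.alloc })
            } else {
                None
            }
        })
        .collect()
}
```

Both sides of each move come from adjacent entries of the same vreg's list, which is why no sorting or matching step is needed here, unlike the edge half-moves.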
1166986

1167-
Second, program moves occur within blocks. Here we need to do a
1168-
similar thing as for block-edge half-moves, but keyed on program point
1169-
instead. This is why the `prog_move_srcs` and `prog_move_dsts` arrays
1170-
are initially sorted by their (vreg, inst) keys: we can directly fill
1171-
in their allocation slots during our main scan. Note that when sorted
1172-
this way, the source and dest for a given move instruction will be at
1173-
different indices. After the main scan, we *re-sort* the arrays by
1174-
just the instruction, so the two sides of a move line up at the same
1175-
index; we can then traverse both arrays, zipped together, and generate
1176-
moves.
1177-
1178987
Finally, we generate moves to fix up multi-fixed-reg-constraint
1179988
situations, and make reused inputs work, as described earlier.
1180989

@@ -1206,9 +1015,7 @@ priorities:
12061015

12071016
- In-edge moves, to place edge-moves before the first instruction in a
12081017
block.
1209-
- Block-param metadata, used for the checker only.
12101018
- Regular, used for vreg movement between allocations.
1211-
- Post-regular, used for checker metadata related to pinned-vreg moves.
12121019
- Multi-fixed-reg, used for moves that handle the
12131020
single-vreg-in-multiple-fixed-pregs constraint case.
12141021
- Reused-input, used for implementing outputs with reused-input policies.
@@ -1354,54 +1161,18 @@ approach of doing only an intra-block analysis. This turns out to be
13541161
sufficient to remove most redundant moves, especially in the common
13551162
case of a single use of an otherwise-spilled value.
13561163

1357-
Note that we could do better *if* we accepted only SSA code, because
1358-
we would know that a value could not be redefined once written. We
1359-
should consider this again once we clean up and remove the non-SSA
1360-
support.
1164+
Note that there is an opportunity to do better: because we only accept
1165+
SSA code, we know that a value cannot be redefined once written.
13611166

13621167
# Future Plans
13631168

1364-
## SSA-Only Cleanup
1365-
1366-
When the major user (Cranelift via the regalloc.rs shim) migrates to
1367-
generate SSA code and native regalloc2 operands, there are many bits
1368-
of complexity we can remove, as noted throughout this
1369-
writeup. Briefly, we could (i) remove special handling of program
1370-
moves, (ii) remove the pinned-vreg hack, (iii) simplify redundant-move
1371-
elimination, (iv) remove special handling of "mod" operands, and (v)
1372-
probably simplify plenty of code given the invariant that a def always
1373-
starts a range.
1374-
1375-
More importantly, we expect this change to result in potentially much
1376-
better allocation performance. The use of special pinned vregs and
1377-
moves to/from them instead of fixed-reg constraints, explicit moves
1378-
for every reused-input constraint, and already-sequentialized series
1379-
of move instructions on edges for phi nodes, are all expensive ways of
1380-
encoding regalloc2's native input primitives that have to be
1381-
reverse-engineered. Removing that translation layer would be
1382-
ideal. Also, allowing regalloc2 to handle phi-node (blockparam)
1383-
lowering in a way that is integrated with other moves will likely
1384-
generate better code than the way that program-move handling interacts
1385-
with Cranelift's manually lowered phi-moves at the moment.
1386-
13871169
## Better Split Heuristics
13881170

13891171
We have spent quite some effort trying to improve splitting behavior,
13901172
and it is now generally decent, but more work could be done here,
13911173
especially with regard to the interaction between splits and the loop
13921174
nest.
13931175

1394-
## Native Debuginfo Output
1395-
1396-
Cranelift currently computes value locations (in registers and
1397-
stack-slots) for detailed debuginfo with an expensive post-pass, after
1398-
regalloc is complete. This is because the existing register allocator
1399-
does not support returning this information directly. However,
1400-
providing such information by generating it while we scan over
1401-
liveranges in each vreg would be relatively simple, and has the
1402-
potential to be much faster and more reliable for Cranelift. We should
1403-
investigate adding an interface for this to regalloc2 and using it.
1404-
14051176
# Appendix: Comparison to IonMonkey Allocator
14061177

14071178
There are a number of differences between the [IonMonkey
