Commit da42568

Complete the RegisterMachine.md document
1 parent 07b0efc commit da42568

1 file changed

+150
-10
lines changed

Documentation/RegisterMachine.md

Lines changed: 150 additions & 10 deletions
@@ -91,26 +91,166 @@ The register-based interpreter design is expected to reduce the number of instru
9191

9292
## Design Overview
9393

94-
TBD
94+
This section describes the high-level design of the register-based interpreter.
95+
96+
### Instruction set
97+
98+
Most VM instructions correspond to a single WebAssembly instruction, but they encode their operand and result registers in the instruction itself. For example, `Instruction.i32Add(lhs: Reg, rhs: Reg, result: Reg)` corresponds to the `i32.add` WebAssembly instruction; it takes two registers as input and produces one register as output.
99+
The exceptions are "provider" instructions, such as `local.get`, `{i32,i64,f32,f64}.const`, etc., which are no-ops at runtime. They are encoded as registers in instruction operands, so there is no corresponding VM instruction for them.
100+
101+
A *register* in this context is a 64-bit slot in the stack frame that can uniformly hold any of the WebAssembly value types (i32, i64, f32, f64, ref). The register is identified by a 16-bit index.
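As a rough illustration of the register definition above, the following C sketch models a frame as an array of untyped 64-bit slots addressed by a 16-bit index. The names `Slot`, `Reg`, and the accessor functions are hypothetical and are not taken from WasmKit; only the slot width and index width come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Slot; /* one untyped 64-bit register slot (hypothetical name) */
typedef uint16_t Reg;  /* 16-bit register index into the stack frame */

/* Every value type must fit in a single slot. */
static_assert(sizeof(double) == sizeof(Slot), "f64 must fit in one slot");

/* i32 values occupy the low 32 bits of the slot. */
static void set_i32(Slot *frame, Reg r, uint32_t v) { frame[r] = v; }
static uint32_t get_i32(const Slot *frame, Reg r) { return (uint32_t)frame[r]; }

/* f64 values are stored bit-for-bit; memcpy avoids aliasing problems. */
static void set_f64(Slot *frame, Reg r, double v) { memcpy(&frame[r], &v, sizeof v); }
static double get_f64(const Slot *frame, Reg r) {
    double v;
    memcpy(&v, &frame[r], sizeof v);
    return v;
}
```

Because every slot has the same width, an instruction only needs to carry register indices; the value type is implied by the instruction itself.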
102+
103+
### Translation
104+
105+
The translation pass converts WebAssembly instructions into a sequence of VM instructions. The translation is done in a single pass over the instructions, and it abstractly interprets the WebAssembly instructions to track the source of each stack value (constants, locals, or results of other instructions).
106+
107+
For example, the following WebAssembly code:
108+
109+
```wat
local.get 0
local.get 1
i32.add
i32.const 1
i32.add
local.set 0
end
```
118+
119+
is translated into the following VM instructions:
120+
121+
```
;; [reg:0] Local 0
;; [reg:1] Local 1
;; [reg:2] Const 0 = i32:1
;; [reg:6] Dynamic 0
reg:6 = i32.add reg:0, reg:1
reg:0 = i32.add reg:6, reg:2
return
```
130+
131+
Note that the last `local.set 0` instruction is fused into the preceding `i32.add` instruction, which writes its result directly to `reg:0`. The `i32.const 1` instruction is also embedded into that `i32.add` instruction as an operand, which references the constant slot (`reg:2`) in the stack frame.
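The following C sketch shows one way the abstract interpretation described above could work for this example. It is not WasmKit's translator: the type and function names are hypothetical, the constant pool size of 4 is an assumption read off the register numbers in the listing, and `local.set` handles only the fusable case where the value was produced by the most recently emitted instruction.

```c
#include <assert.h>
#include <stdint.h>

enum { CONST_POOL = 4, MAX_CODE = 16, MAX_STACK = 16 }; /* assumed sizes */
typedef enum { W_LOCAL_GET, W_I32_CONST, W_I32_ADD, W_LOCAL_SET } WasmOp;
typedef struct { WasmOp op; int32_t imm; } WasmInst;
typedef struct { uint16_t result, lhs, rhs; } VmAdd; /* only i32.add is emitted */

typedef struct {
    uint16_t nlocals;
    uint16_t stack[MAX_STACK]; int height;  /* abstract stack of source registers */
    int32_t consts[CONST_POOL]; int nconsts;
    VmAdd code[MAX_CODE]; int ncode;
} Translator;

/* Register of the dynamic stack slot at the given height. */
static uint16_t dynamic_reg(const Translator *t, int height) {
    return (uint16_t)(t->nlocals + CONST_POOL + height);
}

static void translate(Translator *t, const WasmInst *in, int n) {
    for (int i = 0; i < n; i++) {
        switch (in[i].op) {
        case W_LOCAL_GET: /* provider: emit nothing, remember the source */
            t->stack[t->height++] = (uint16_t)in[i].imm;
            break;
        case W_I32_CONST: { /* provider: place the value in the constant pool */
            assert(t->nconsts < CONST_POOL);
            uint16_t reg = (uint16_t)(t->nlocals + t->nconsts);
            t->consts[t->nconsts++] = in[i].imm;
            t->stack[t->height++] = reg;
            break;
        }
        case W_I32_ADD: {
            uint16_t rhs = t->stack[--t->height];
            uint16_t lhs = t->stack[--t->height];
            uint16_t result = dynamic_reg(t, t->height);
            assert(t->ncode < MAX_CODE);
            t->code[t->ncode++] = (VmAdd){ result, lhs, rhs };
            t->stack[t->height++] = result;
            break;
        }
        case W_LOCAL_SET: {
            uint16_t src = t->stack[--t->height];
            /* Fusion: retarget the previous instruction's result register.
               This sketch does not handle the general (copy) case. */
            assert(t->ncode > 0 && t->code[t->ncode - 1].result == src);
            t->code[t->ncode - 1].result = (uint16_t)in[i].imm;
            break;
        }
        }
    }
}
```

Feeding the six WebAssembly instructions of the example (with two locals) into `translate` yields two `VmAdd` entries with the same register numbers as the listing above. A real translator also has to consider cases this sketch ignores, such as a local that is overwritten while an earlier `local.get` of it is still on the abstract stack.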
132+
133+
Most of the translation process is straightforward, but structured control-flow instructions are a bit more complex. Structured control-flow instructions (block, loop, if) are translated into a flattened, branch-based instruction sequence, as in the second generation interpreter. For example, the following WebAssembly code:
134+
135+
```wat
local.get 0
if i32
  i32.const 1
else
  i32.const 2
end
local.set 0
```
144+
145+
is translated into the following VM instructions:
146+
147+
```
;; [reg:0] Local 0
;; [reg:1] Const 0 = i32:1
;; [reg:2] Const 1 = i32:2
;; [reg:5] Dynamic 0
0x00: br_if_not reg:0, +4 ; 0x6
0x02: reg:5 = copy reg:1
0x04: br +2 ; 0x8
0x06: reg:5 = copy reg:2
0x08: reg:0 = copy reg:5
```
158+
159+
See [`Translator.swift`](../Sources/WasmKit/Translator.swift) for the translation pass implementation.
160+
161+
You can see translated VM instructions by running the `wasmkit-cli explore` command.
95162

96163
### Stack frame layout
97164

98165
See the doc comments on the `StackLayout` type. The stack frame layout design is heavily inspired by the stitch WebAssembly interpreter[^4].
99166

100-
### TODO
167+
The stack frame consists of four parts: frame header, locals, constant pool, and dynamic stack.
168+
169+
1. The frame header contains the saved stack pointer, return address, current instance, and value slots for parameters and return values.
170+
2. The locals part contains the local variables of the current function.
171+
3. The constant pool part contains the constant values.
172+
- The size of the constant pool is determined by a heuristic based on the Wasm-level code size. The translation pass determines the size at the beginning of the translation process so that value slot indices are known statically and do not need to be fixed up at the end of the translation process.
173+
4. The dynamic stack part contains the dynamic stack values, which are the intermediate values produced by the WebAssembly instructions.
174+
- The size of the dynamic stack is the maximum height of the stack determined by the translation pass and is fixed at the end of the translation process.
175+
176+
Value slots in the frame header, locals, dynamic stack, and constant pool are all accessible by the register index.
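The register numbering implied by this layout can be sketched in C as follows. This is a simplified model, not the actual `StackLayout`: the `FrameLayout` type and its helpers are hypothetical, the frame header slots are left out, and the ordering (locals, then constant pool, then dynamic stack) is inferred from the register numbers in the translation examples.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t nlocals;          /* number of locals of the function */
    uint16_t const_pool_size;  /* chosen by a heuristic before translation */
    uint16_t max_stack_height; /* known only once translation has finished */
} FrameLayout;

/* Locals come first in this simplified model (frame header omitted). */
static uint16_t local_reg(const FrameLayout *f, uint16_t i) {
    assert(i < f->nlocals);
    return i;
}

/* Constant pool slots follow the locals. */
static uint16_t const_reg(const FrameLayout *f, uint16_t i) {
    assert(i < f->const_pool_size);
    return (uint16_t)(f->nlocals + i);
}

/* Dynamic stack slots follow the constant pool. */
static uint16_t dynamic_slot_reg(const FrameLayout *f, uint16_t i) {
    assert(i < f->max_stack_height);
    return (uint16_t)(f->nlocals + f->const_pool_size + i);
}

/* Number of slots after the header. */
static uint32_t body_slots(const FrameLayout *f) {
    return (uint32_t)f->nlocals + f->const_pool_size + f->max_stack_height;
}
```

With two locals and a four-slot constant pool, `const_reg` gives `reg:2` for the first constant and `dynamic_slot_reg` gives `reg:6` for the first dynamic slot, as in the first translation example. Because the dynamic stack base depends on the constant pool size, that size has to be fixed before the first dynamic register index is emitted, which is why it is chosen up front.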
177+
178+
### Instruction encoding
179+
180+
The VM instructions are encoded as a variable-length sequence of 64-bit slots. The first 64-bit slot of each instruction, the head slot, encodes the instruction opcode kind. The remaining slots encode immediate operands.
181+
182+
The head slot value depends on the threading model, as described in the next section. For direct-threaded, the head slot value is a pointer to the instruction handler function. For token-threaded, the head slot value is an opcode id.
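A minimal C sketch of this encoding for a binary instruction such as `i32.add` is shown below. The opcode id, the helper names, and in particular the packing of three 16-bit register indices into one operand slot are assumptions made for illustration; the text above only fixes the 64-bit slot size and the role of the head slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t CodeSlot;
enum { OPCODE_I32_ADD = 7 }; /* hypothetical opcode id for token threading */

/* Pack three 16-bit register indices into one operand slot (assumed layout). */
static CodeSlot pack_regs(uint16_t result, uint16_t lhs, uint16_t rhs) {
    return (CodeSlot)result | ((CodeSlot)lhs << 16) | ((CodeSlot)rhs << 32);
}

/* Extract the i-th 16-bit register index from an operand slot. */
static uint16_t reg_at(CodeSlot s, unsigned i) {
    return (uint16_t)(s >> (16 * i));
}

/* Emit head slot + one operand slot; returns the next write position.
   For token threading `head` is an opcode id; for direct threading it
   would instead hold a handler function pointer converted to an integer. */
static size_t emit_binary(CodeSlot *code, size_t at, size_t cap, CodeSlot head,
                          uint16_t result, uint16_t lhs, uint16_t rhs) {
    assert(at + 2 <= cap);
    code[at] = head;
    code[at + 1] = pack_regs(result, lhs, rhs);
    return at + 2;
}
```

An instruction without immediates would occupy only its head slot, which is what makes the sequence variable-length.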
183+
184+
### Threading model
185+
186+
We use the threaded code technique for instruction dispatch. Note that "threaded code" here is not related to the "thread" in the multi-threading context. It is a technique to implement a virtual machine interpreter efficiently[^5].
187+
188+
The interpreter supports two threading models: direct-threaded and token-threaded. The direct-threaded model is the default threading model on most platforms, and the token-threaded model is a fallback option for platforms that do not support guaranteed tail call.
189+
190+
There is nothing special here; we just use a traditional interpreter technique to minimize the overhead of instruction dispatch.
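For comparison with the direct-threaded model, here is a small C sketch of token-style dispatch, where the head slot holds an opcode id and a loop selects the handler by that id (a plain `switch` in this sketch). The opcode ids, the operand packing, and the function names are hypothetical and do not reflect WasmKit's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Slot;
enum { OP_I32_ADD, OP_RETURN }; /* hypothetical opcode ids */

/* Pack three 16-bit register indices into one operand slot (assumed layout). */
static Slot regs3(uint16_t result, uint16_t lhs, uint16_t rhs) {
    return (Slot)result | ((Slot)lhs << 16) | ((Slot)rhs << 32);
}

static void run_token_threaded(const Slot *pc, Slot *sp) {
    for (;;) {
        Slot opcode = *pc++; /* head slot: opcode id */
        switch (opcode) {
        case OP_I32_ADD: {
            Slot im = *pc++; /* operand slot */
            uint16_t result = (uint16_t)im;
            uint16_t lhs = (uint16_t)(im >> 16);
            uint16_t rhs = (uint16_t)(im >> 32);
            /* 32-bit wrapping add; the result is stored zero-extended. */
            sp[result] = (uint32_t)(sp[lhs] + sp[rhs]);
            break;
        }
        case OP_RETURN:
            return;
        default:
            assert(0 && "unknown opcode");
            return;
        }
    }
}
```

Every instruction returns to the top of the loop before the next one is selected; direct threading avoids that round trip by jumping straight to the next handler.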
191+
192+
Typically, there are two ways to implement the direct-threaded model in C: using the [Labels as Values](https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html) extension or using guaranteed tail calls (`musttail` in LLVM).
193+
194+
Swift does not support either of them, so we ask C-interop for help.
195+
A little-known gem of the Swift compiler is C-interop, which uses Clang as a library and can mix Swift code and code in C headers as a single translation unit, and optimize them together.
196+
197+
We tried both the Labels as Values and guaranteed tail call approaches, and concluded that the guaranteed tail call approach is a better fit for us.
198+
199+
In the Labels as Values approach, a single large function contains many labels and all the instruction implementations. Theoretically, the compiler has all the information necessary to give us the "optimal" code. However, in practice, the compiler relies on several heuristics and does not always generate optimal code for a function of this scale. For example, register pressure is always high in the interpreter function, and the compiler often spills important variables like `sp` and `pc`, which significantly degrades performance. We tried to tame the compiler by annotating hot and cold paths, but doing so is tricky and time-consuming.
200+
201+
On the other hand, the guaranteed tail call approach is more straightforward and easier to tune. The instruction handler functions are all separated, and the compiler can optimize them individually. It will not mix hot/cold paths if we separate them at the translation stage. Generated machine code is more predictable and easier to read. Therefore, we chose the guaranteed tail call approach for the direct-threaded model implementation.
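To make the tail call approach concrete, the following plain-C sketch dispatches by tail-calling the handler stored in the next head slot. It is only an illustration under several assumptions: the handler signature, the operand packing, and all names are hypothetical, and the `musttail` attribute is applied only when the compiler reports support for it (otherwise the sketch relies on the optimizer and the tail call is not guaranteed).

```c
#include <assert.h>
#include <stdint.h>

#if defined(__has_attribute)
#if __has_attribute(musttail)
#define TAIL __attribute__((musttail))
#endif
#endif
#ifndef TAIL
#define TAIL /* no guarantee without musttail */
#endif

typedef uint64_t Slot;
typedef int Handler(const Slot *pc, Slot *sp); /* hypothetical handler type */

/* Direct threading: the head slot stores the handler's address. */
static Slot head(Handler *h) { return (Slot)(uintptr_t)h; }

/* Pack three 16-bit register indices into one operand slot (assumed layout). */
static Slot regs3(uint16_t result, uint16_t lhs, uint16_t rhs) {
    return (Slot)result | ((Slot)lhs << 16) | ((Slot)rhs << 32);
}

static int op_i32_add(const Slot *pc, Slot *sp) {
    Slot im = *pc++; /* operand slot */
    sp[(uint16_t)im] =
        (uint32_t)(sp[(uint16_t)(im >> 16)] + sp[(uint16_t)(im >> 32)]);
    Handler *next = (Handler *)(uintptr_t)*pc++; /* next head slot */
    TAIL return next(pc, sp); /* jump to the next handler */
}

static int op_return(const Slot *pc, Slot *sp) {
    (void)pc;
    (void)sp;
    return 0; /* ends the chain of tail calls */
}

static int run_direct_threaded(const Slot *code, Slot *sp) {
    assert(code && sp);
    Handler *first = (Handler *)(uintptr_t)code[0];
    return first(code + 1, sp);
}
```

Each handler is a small, separate function, so the compiler allocates registers for it independently, and `pc` and `sp` are passed as arguments rather than kept alive across one huge function body.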
202+
203+
The instruction handler functions are defined in C headers. Those C functions call Swift functions implementing the actual instruction semantics, and then they tail-call the next instruction handler function.
204+
We use the [`swiftasync`](https://clang.llvm.org/docs/AttributeReference.html#swiftasynccall) calling convention for the C instruction handler functions to guarantee tail calls and to keep the `self` context in a dedicated register.
205+
206+
In this way, we can implement instruction semantics in Swift and can dispatch instructions efficiently.
207+
208+
Here is an example of the instruction handler function in C header and the corresponding Swift implementation:
209+
210+
```c
// In C header
typedef SWIFT_CC(swiftasync) void (* _Nonnull wasmkit_tc_exec)(
    uint64_t *_Nonnull sp, Pc, Md, Ms, SWIFT_CONTEXT void *_Nullable state);

SWIFT_CC(swiftasync) static inline void wasmkit_tc_i32Add(Sp sp, Pc pc, Md md, Ms ms, SWIFT_CONTEXT void *state) {
    SWIFT_CC(swift) uint64_t wasmkit_execute_i32Add(Sp *sp, Pc *pc, Md *md, Ms *ms, SWIFT_CONTEXT void *state, SWIFT_ERROR_RESULT void **error);
    void * _Nullable error = NULL; uint64_t next;
    INLINE_CALL next = wasmkit_execute_i32Add(&sp, &pc, &md, &ms, state, &error);
    return ((wasmkit_tc_exec)next)(sp, pc, md, ms, state);
}

// In Swift
import CWasmKit.InlineCode // Import C header
extension Execution {
    @_silgen_name("wasmkit_execute_i32Add") @inline(__always)
    mutating func execute_i32Add(sp: UnsafeMutablePointer<Sp>, pc: UnsafeMutablePointer<Pc>, md: UnsafeMutablePointer<Md>, ms: UnsafeMutablePointer<Ms>) -> CodeSlot {
        let immediate = Instruction.BinaryOperand.load(from: &pc.pointee)
        sp.pointee[i32: immediate.result] = sp.pointee[i32: immediate.lhs].add(sp.pointee[i32: immediate.rhs])
        let next = pc.pointee.pointee
        pc.pointee = pc.pointee.advanced(by: 1)
        return next
    }
}
```
235+
236+
This boilerplate code is generated by the [`Utilities/Sources/VMGen.swift`](../Utilities/Sources/VMGen.swift) script.
237+
238+
## Performance evaluation
239+
240+
We have not done a comprehensive performance evaluation yet, but we have run the CoreMark benchmark to compare the performance of the register-based interpreter with the second generation interpreter. The benchmark was run on a 2020 Mac mini (M1, 16GB RAM) with `swift-DEVELOPMENT-SNAPSHOT-2024-09-17-a` toolchain and compiled with `swift build -c release`.
241+
242+
The figure below shows that the CoreMark score is 7.4x higher than that of the second generation interpreter.
243+
244+
![CoreMark score (higher is better)](https://github.com/user-attachments/assets/2c400efe-fe17-452d-b86e-747c2aba5ae8)
245+
246+
Additionally, we have compared our new interpreter with other top-tier WebAssembly interpreters: [wasm3](https://github.com/wasm3/wasm3), [stitch](https://github.com/makepad/stitch), and [wasmi](https://github.com/wasmi-labs/wasmi). The results show that our interpreter is competitive with them.
101247
102-
- Variadic width instructions
103-
- Constant pool space on stack
104-
- Trade space and time on prologue for time by fewer instructions
105-
- |locals|dynamic stack| -> |locals|dynamic stack|constant|
106-
- Stack caching
107-
- Instruction fusion
108-
- Const embedding
109-
- Conditional embedding (eg. br_if_lt)
248+
![CoreMark score in interpreter class (higher is better)](https://github.com/user-attachments/assets/f43c129c-0745-4e52-8e92-17dadc0c7fdd)
110249
111250
## References
112251
113252
[^1]: https://github.com/swiftwasm/WasmKit/pull/70
114253
[^2]: Jun Xu, Liang He, Xin Wang, Wenyong Huang, Ning Wang. “A Fast WebAssembly Interpreter Design in WASM-Micro-Runtime.” Intel, 7 Oct. 2021, https://www.intel.com/content/www/us/en/developer/articles/technical/webassembly-interpreter-design-wasm-micro-runtime.html
115254
[^3]: [Baseline Compilation in Wasmtime](https://github.com/bytecodealliance/rfcs/blob/de8616ba2fe01f3e94467a0f6ef3e4195c274334/accepted/wasmtime-baseline-compilation.md)
116255
[^4]: stitch WebAssembly interpreter by @ejpbruel2 https://github.com/makepad/stitch
256+
[^5]: https://en.wikipedia.org/wiki/Threaded_code
