@@ -91,26 +91,166 @@ The register-based interpreter design is expected to reduce the number of instru
91
91
92
92
## Design Overview
93
93
94
-
TBD
94
+
This section describes the high-level design of the register-based interpreter.
95
+
96
+
### Instruction set
97
+
98
+
Most VM instructions correspond to a single WebAssembly instruction, but they encode their operand and result registers in the instruction itself. For example, `Instruction.i32Add(lhs: Reg, rhs: Reg, result: Reg)` corresponds to the `i32.add` WebAssembly instruction; it takes two registers as input and writes its output to one register.
99
+
The exceptions are "provider" instructions, such as `local.get` and `{i32,i64,f32,f64}.const`, which are no-ops at runtime. They are encoded as registers in the operands of other instructions, so there is no corresponding VM instruction for them.
100
+
101
+
A *register* in this context is a 64-bit slot in the stack frame that can uniformly hold any of the WebAssembly value types (i32, i64, f32, f64, ref). The register is identified by a 16-bit index.
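As a rough illustration of the register model described above, the following C sketch treats a frame as an array of 64-bit slots indexed by 16-bit register numbers and shows an `i32.add`-like operation that reads two registers and writes one. The names `Reg`, `Slot`, `I32AddOperands`, and `exec_i32_add` are hypothetical and are not taken from WasmKit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not WasmKit's actual types. */
typedef uint16_t Reg;   /* 16-bit register index */
typedef uint64_t Slot;  /* 64-bit slot that can hold any Wasm value type */

/* Operands of an i32.add-like VM instruction: two inputs, one output. */
typedef struct {
    Reg lhs;
    Reg rhs;
    Reg result;
} I32AddOperands;

/* Read the low 32 bits of both input registers, add them with 32-bit
   wrap-around, and store the zero-extended sum in the result register. */
static void exec_i32_add(Slot *frame, I32AddOperands op) {
    assert(frame != NULL);
    uint32_t lhs = (uint32_t)frame[op.lhs];
    uint32_t rhs = (uint32_t)frame[op.rhs];
    frame[op.result] = (Slot)(uint32_t)(lhs + rhs);
}
```

Because the operands are plain register indices, a local or a constant can be named directly as an input without a separate instruction that loads it first.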
102
+
103
+
### Translation
104
+
105
+
The translation pass converts WebAssembly instructions into a sequence of VM instructions. It runs in a single pass over the instructions and abstractly interprets them to track the source of each stack value (a constant, a local, or the result of another instruction).
106
+
107
+
For example, the following WebAssembly code:
108
+
109
+
```wat
110
+
local.get 0
111
+
local.get 1
112
+
i32.add
113
+
i32.const 1
114
+
i32.add
115
+
local.set 0
116
+
end
117
+
```
118
+
119
+
is translated into the following VM instructions:
120
+
121
+
```
122
+
;; [reg:0] Local 0
123
+
;; [reg:1] Local 1
124
+
;; [reg:2] Const 0 = i32:1
125
+
;; [reg:6] Dynamic 0
126
+
reg:6 = i32.add reg:0, reg:1
127
+
reg:0 = i32.add reg:6, reg:2
128
+
return
129
+
```
130
+
131
+
Note that the final `local.set 0` is fused into the second `i32.add`, which writes its result directly to `reg:0`, and that `i32.const 1` emits no instruction of its own: the `i32.add` reads the constant from its slot in the stack frame (`reg:2`).
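The translation of the example above can be sketched in C as follows. This is a minimal model under stated assumptions, not the logic in `Translator.swift`: it supports only `local.get`, `i32.const`, `i32.add`, and `local.set`, it assumes locals start at register 0 and are followed by a four-slot constant pool (which matches the register numbers in the example), and every name in it (`Translator`, `visit_*`, `CONST_POOL_SIZE`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t Reg;

/* One emitted VM instruction: result = i32.add lhs, rhs. */
typedef struct { Reg lhs, rhs, result; } AddInstr;

enum { MAX_STACK = 16, MAX_CODE = 16, CONST_POOL_SIZE = 4 };

typedef struct {
    Reg      stack[MAX_STACK];        /* register holding each abstract stack value */
    int      height;                  /* current abstract stack height */
    AddInstr code[MAX_CODE];          /* emitted VM instructions */
    int      code_len;
    int32_t  consts[CONST_POOL_SIZE]; /* values placed in the constant pool */
    int      const_count;
    Reg      const_base;              /* first constant pool register */
    Reg      dynamic_base;            /* first dynamic stack register */
} Translator;

static void translator_init(Translator *t, Reg num_locals) {
    t->height = 0;
    t->code_len = 0;
    t->const_count = 0;
    t->const_base = num_locals;  /* locals occupy registers 0..num_locals-1 */
    t->dynamic_base = (Reg)(num_locals + CONST_POOL_SIZE);
}

static void stack_push(Translator *t, Reg r) {
    assert(t->height < MAX_STACK);
    t->stack[t->height++] = r;
}

static Reg stack_pop(Translator *t) {
    assert(t->height > 0);
    return t->stack[--t->height];
}

/* local.get emits nothing: the local's register becomes the value source. */
static void visit_local_get(Translator *t, Reg local) { stack_push(t, local); }

/* i32.const emits nothing: the value is placed in a constant pool slot. */
static void visit_i32_const(Translator *t, int32_t value) {
    assert(t->const_count < CONST_POOL_SIZE);
    t->consts[t->const_count] = value;
    stack_push(t, (Reg)(t->const_base + t->const_count));
    t->const_count++;
}

/* i32.add pops two sources and writes to the dynamic slot at the new height. */
static void visit_i32_add(Translator *t) {
    Reg rhs = stack_pop(t);
    Reg lhs = stack_pop(t);
    Reg result = (Reg)(t->dynamic_base + t->height);
    assert(t->code_len < MAX_CODE);
    t->code[t->code_len++] = (AddInstr){lhs, rhs, result};
    stack_push(t, result);
}

/* local.set: if the popped value was produced by the last emitted
   instruction, retarget that instruction's result to the local (fusion).
   Returns 1 if fused, 0 if a separate copy would be needed (not modeled). */
static int visit_local_set(Translator *t, Reg local) {
    Reg src = stack_pop(t);
    if (t->code_len > 0 && src >= t->dynamic_base &&
        t->code[t->code_len - 1].result == src) {
        t->code[t->code_len - 1].result = local;
        return 1;
    }
    return 0;
}
```

Calling `translator_init` with two locals and then visiting the six instructions of the example in order leaves two `AddInstr` entries, `{0, 1, 6}` and `{6, 2, 0}`, which correspond to the two `i32.add` lines in the listing. A real translator also has to handle cases this sketch ignores, such as a `local.set` whose target local is still referenced by a value on the abstract stack.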
132
+
133
+
Most of the translation process is straightforward, but structured control-flow instructions are a bit more complex. Structured control-flow instructions (block, loop, if) are translated into a flattened, branch-based instruction sequence, as in the second generation interpreter. For example, the following WebAssembly code:
134
+
135
+
```wat
136
+
local.get 0
137
+
if (result i32)
138
+
i32.const 1
139
+
else
140
+
i32.const 2
141
+
end
142
+
local.set 0
143
+
```
144
+
145
+
is translated into the following VM instructions:
146
+
147
+
```
148
+
;; [reg:0] Local 0
149
+
;; [reg:1] Const 0 = i32:1
150
+
;; [reg:2] Const 1 = i32:2
151
+
;; [reg:5] Dynamic 0
152
+
0x00: br_if_not reg:0, +4 ; 0x6
153
+
0x02: reg:5 = copy reg:1
154
+
0x04: br +2 ; 0x8
155
+
0x06: reg:5 = copy reg:2
156
+
0x08: reg:0 = copy reg:5
157
+
```
158
+
159
+
See [`Translator.swift`](../Sources/WasmKit/Translator.swift) for the translation pass implementation.
160
+
161
+
You can see translated VM instructions by running the `wasmkit-cli explore` command.
95
162
96
163
### Stack frame layout
97
164
98
165
See the doc comments on the `StackLayout` type. The stack frame layout design is heavily inspired by the stitch WebAssembly interpreter[^4].
99
166
100
-
### TODO
167
+
The stack frame consists of four parts: frame header, locals, constant pool, and dynamic stack.
168
+
169
+
1. The frame header contains the saved stack pointer, return address, current instance, and value slots for parameters and return values.
170
+
2. The locals part contains the local variables of the current function.
171
+
3. The constant pool part contains the constant values.
172
+
- The size of the constant pool is determined by a heuristic based on the Wasm-level code size. The translation pass determines the size at the beginning of the translation process so that value slot indices are known statically and do not need to be fixed up at the end of the translation process.
173
+
4. The dynamic stack part contains the dynamic stack values, which are the intermediate values produced by the WebAssembly instructions.
174
+
- The size of the dynamic stack is the maximum height of the stack determined by the translation pass and is fixed at the end of the translation process.
175
+
176
+
Value slots in the frame header, locals, dynamic stack, and constant pool are all accessible by the register index.
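The following C sketch shows one way register indices could be derived from such a layout. It is an assumption-laden model rather than WasmKit's `StackLayout`: the frame header is left out, locals are assumed to start at register 0 (as in the translation examples), and the names `FrameLayout`, `local_reg`, `const_reg`, `dynamic_reg`, and `frame_slot_count` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t Reg;

/* Hypothetical description of the register-addressable regions of a frame.
   The frame header is omitted for simplicity. */
typedef struct {
    Reg num_locals;
    Reg const_pool_size;   /* chosen before translation by a heuristic */
    Reg max_stack_height;  /* known only after translation finishes */
} FrameLayout;

/* Locals come first. */
static Reg local_reg(const FrameLayout *l, Reg index) {
    assert(index < l->num_locals);
    return index;
}

/* The constant pool follows the locals. */
static Reg const_reg(const FrameLayout *l, Reg index) {
    assert(index < l->const_pool_size);
    return (Reg)(l->num_locals + index);
}

/* The dynamic stack follows the constant pool. */
static Reg dynamic_reg(const FrameLayout *l, Reg height) {
    assert(height < l->max_stack_height);
    return (Reg)(l->num_locals + l->const_pool_size + height);
}

/* Total number of value slots covered by the three regions. */
static uint32_t frame_slot_count(const FrameLayout *l) {
    return (uint32_t)l->num_locals + l->const_pool_size + l->max_stack_height;
}
```

With two locals and a four-slot constant pool, `const_reg` maps constant 0 to register 2 and `dynamic_reg` maps stack height 0 to register 6, matching the first translation example. The base of the dynamic stack depends only on sizes that are known before translation starts, which is why fixing the constant pool size up front avoids a fix-up pass; only the total frame size has to wait for the maximum stack height.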
177
+
178
+
### Instruction encoding
179
+
180
+
The VM instructions are encoded as a variable-length sequence of 64-bit slots. The first slot, the head slot, encodes the instruction opcode. The remaining slots encode immediate operands.
181
+
182
+
The value of the head slot depends on the threading model described in the next section. In the direct-threaded model, the head slot is a pointer to the instruction handler function. In the token-threaded model, it is an opcode id.
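To make the slot-based encoding more concrete, here is a small C sketch of a token-threaded encoding. The opcode ids, the packing of three 16-bit register indices into one operand slot, and the function names are all assumptions made for illustration; the actual operand layout used by WasmKit may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Reg;
typedef uint64_t Slot;

/* Hypothetical opcode ids for a token-threaded head slot. */
enum { OP_I32_ADD = 1, OP_RETURN = 2 };

/* Pack three 16-bit register indices into one 64-bit operand slot. */
static Slot pack_regs(Reg a, Reg b, Reg c) {
    return (Slot)a | ((Slot)b << 16) | ((Slot)c << 32);
}

/* Extract the n-th (0..3) 16-bit field from an operand slot. */
static Reg unpack_reg(Slot s, unsigned n) {
    assert(n < 4);
    return (Reg)(s >> (16 * n));
}

/* Write `result = i32.add lhs, rhs` as a head slot plus one operand slot.
   Returns the number of slots written. */
static size_t encode_i32_add(Slot *out, Reg lhs, Reg rhs, Reg result) {
    out[0] = OP_I32_ADD;
    out[1] = pack_regs(lhs, rhs, result);
    return 2;
}

/* Write `return` as a head slot with no operand slots.
   Returns the number of slots written. */
static size_t encode_return(Slot *out) {
    out[0] = OP_RETURN;
    return 1;
}
```

Encoding the two `i32.add` instructions and the `return` from the earlier example this way takes five slots in total, which illustrates the variable-length property: instructions without operands occupy only their head slot.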
183
+
184
+
### Threading model
185
+
186
+
We use the threaded code technique for instruction dispatch. Note that "threaded code" here is not related to the "thread" in the multi-threading context. It is a technique to implement a virtual machine interpreter efficiently[^5].
187
+
188
+
The interpreter supports two threading models: direct-threaded and token-threaded. The direct-threaded model is the default on most platforms, and the token-threaded model is a fallback for platforms that do not support guaranteed tail calls.
189
+
190
+
There is nothing special here; we just use a traditional interpreter technique to minimize the overhead of instruction dispatch.
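A token-threaded dispatch loop can be sketched in C as a `switch` over the opcode id in each head slot. The sketch below is illustrative only: the opcode ids and operand packing are hypothetical, every instruction except `OP_END` is assumed to be a head slot followed by one operand slot, and branch offsets are assumed to be relative to the instruction that follows the branch (which is consistent with the offsets shown in the `if`/`else` listing in the Translation section).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Slot;

/* Hypothetical opcode ids stored in each head slot. */
enum { OP_COPY, OP_BR, OP_BR_IF_NOT, OP_END };

/* Execute instructions starting at `code` until OP_END is reached. */
static void run_token_threaded(Slot *regs, const Slot *code) {
    const Slot *pc = code;
    for (;;) {
        switch (pc[0]) {
        case OP_COPY: {       /* operand: dest | src << 16 */
            uint16_t dest = (uint16_t)pc[1];
            uint16_t src = (uint16_t)(pc[1] >> 16);
            regs[dest] = regs[src];
            pc += 2;
            break;
        }
        case OP_BR:           /* operand: signed offset in slots */
            pc += 2 + (int64_t)pc[1];
            break;
        case OP_BR_IF_NOT: {  /* operand: cond | offset << 16 */
            uint16_t cond = (uint16_t)pc[1];
            int32_t offset = (int32_t)(pc[1] >> 16);
            pc += 2;
            if ((uint32_t)regs[cond] == 0) pc += offset;
            break;
        }
        case OP_END:
            return;
        default:
            assert(0 && "unknown opcode");
            return;
        }
    }
}
```

Each iteration returns to the top of the loop and goes through the same `switch`, which is the dispatch overhead that direct threading aims to reduce.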
191
+
192
+
Typically, there are two ways to implement the direct-threaded model in C: using the [Labels as Values](https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html) extension or using guaranteed tail calls (`musttail` in LLVM).
193
+
194
+
Swift does not support either of them, so we ask C-interop for help.
195
+
A little-known gem of the Swift compiler is C-interop, which uses Clang as a library and can mix Swift code and code in C headers as a single translation unit, and optimize them together.
196
+
197
+
We tried both the Labels as Values and the guaranteed tail call approaches, and concluded that the guaranteed tail call approach is a better fit for us.
198
+
199
+
In the Labels as Values approach, a single large function contains many labels and all the instruction implementations. In theory, the compiler has all the information it needs to generate "optimal" code. In practice, however, the compiler relies on several heuristics and does not always generate optimal code for a function of this size. For example, register pressure is always high in the interpreter function, and the compiler often spills important variables like `sp` and `pc`, which significantly degrades performance. We tried to tame the compiler by annotating hot and cold paths, but this is tricky and time-consuming.
200
+
201
+
On the other hand, the guaranteed tail call approach is more straightforward and easier to tune. The instruction handler functions are all separated, and the compiler can optimize them individually. It will not mix hot/cold paths if we separate them at the translation stage. Generated machine code is more predictable and easier to read. Therefore, we chose the guaranteed tail call approach for the direct-threaded model implementation.
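The tail-call style of direct threading can be sketched in plain C as follows. This is not WasmKit's handler code: it uses the `musttail` statement attribute where the compiler provides it and silently falls back to an ordinary call otherwise, it assumes a function pointer fits in a 64-bit slot, and the names `Handler`, `head_slot`, `DISPATCH`, `op_i32_add`, `op_return`, and `run_direct_threaded` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Use a guaranteed tail call where the compiler offers one; otherwise fall
   back to an ordinary call that the optimizer may or may not turn into a jump. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

typedef uint64_t Slot;
typedef int (*Handler)(Slot *sp, const Slot *pc);

/* Store a handler's address in a head slot (assumes pointers fit in 64 bits). */
static Slot head_slot(Handler h) { return (Slot)(uintptr_t)h; }

/* Tail-call the handler whose address is stored at pc[0]. */
#define DISPATCH(sp, pc) \
    MUSTTAIL return ((Handler)(uintptr_t)(pc)[0])((sp), (pc))

/* result = i32.add lhs, rhs; operand slot: lhs | rhs << 16 | result << 32 */
static int op_i32_add(Slot *sp, const Slot *pc) {
    uint16_t lhs = (uint16_t)pc[1];
    uint16_t rhs = (uint16_t)(pc[1] >> 16);
    uint16_t result = (uint16_t)(pc[1] >> 32);
    sp[result] = (uint32_t)((uint32_t)sp[lhs] + (uint32_t)sp[rhs]);
    pc += 2;  /* skip the head slot and the operand slot */
    DISPATCH(sp, pc);
}

/* return: stop dispatching by not tail-calling another handler. */
static int op_return(Slot *sp, const Slot *pc) {
    (void)sp;
    (void)pc;
    return 0;
}

/* Enter the program at its first head slot. */
static int run_direct_threaded(Slot *sp, const Slot *code) {
    assert(sp != 0 && code != 0);
    return ((Handler)(uintptr_t)code[0])(sp, code);
}
```

Each handler is a small, separate function that keeps `sp` and `pc` in its arguments, so the compiler allocates registers per handler rather than for one large interpreter function. There is no central loop: the jump to the next handler is the dispatch.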
202
+
203
+
The instruction handler functions are defined in C headers. Those C functions call Swift functions implementing the actual instruction semantics, and then they tail-call the next instruction handler function.
204
+
We use the [`swiftasync`](https://clang.llvm.org/docs/AttributeReference.html#swiftasynccall) calling convention for the C instruction handler functions to guarantee tail calls and keep the `self` context in a dedicated register.
205
+
206
+
In this way, we can implement instruction semantics in Swift and can dispatch instructions efficiently.
207
+
208
+
Here is an example of the instruction handler function in a C header and the corresponding Swift implementation:
This boilerplate code is generated by the [`Utilities/Sources/VMGen.swift`](../Utilities/Sources/VMGen.swift) script.
237
+
238
+
## Performance evaluation
239
+
240
+
We have not done a comprehensive performance evaluation yet, but we have run the CoreMark benchmark to compare the performance of the register-based interpreter with the second generation interpreter. The benchmark was run on a 2020 Mac mini (M1, 16GB RAM) with the `swift-DEVELOPMENT-SNAPSHOT-2024-09-17-a` toolchain and compiled with `swift build -c release`.
241
+
242
+
The figure below shows that the score is 7.4x higher than that of the second generation interpreter.
243
+
244
+

245
+
246
+
Additionally, we have compared our new interpreter with other top-tier WebAssembly interpreters: [wasm3](https://github.com/wasm3/wasm3), [stitch](https://github.com/makepad/stitch), and [wasmi](https://github.com/wasmi-labs/wasmi). The results show that our interpreter is competitive with them.
101
247
102
-
- Variadic width instructions
103
-
- Constant pool space on stack
104
-
- Trade space and time on prologue for time by fewer instructions
[^2]: Jun Xu, Liang He, Xin Wang, Wenyong Huang, Ning Wang. “A Fast WebAssembly Interpreter Design in WASM-Micro-Runtime.” Intel, 7 Oct. 2021, https://www.intel.com/content/www/us/en/developer/articles/technical/webassembly-interpreter-design-wasm-micro-runtime.html
115
254
[^3]: [Baseline Compilation in Wasmtime](https://github.com/bytecodealliance/rfcs/blob/de8616ba2fe01f3e94467a0f6ef3e4195c274334/accepted/wasmtime-baseline-compilation.md)
116
255
[^4]: stitch WebAssembly interpreter by @ejpbruel2 https://github.com/makepad/stitch