| 1 | +This crate is similar to the [`origin-start` example], except that it doesn't |
| 2 | +print any output, and enables optimizations for small code size. To produce a |
| 3 | +small binary, compile with `--release`. |
| 4 | + |
| 5 | +To produce an even smaller binary, use `objcopy` to remove the `.eh_frame` |
| 6 | +and `.comment` sections: |
| 7 | + |
| 8 | +``` |
| 9 | +objcopy -R .eh_frame -R .comment target/release/origin-start-tiny even-smaller |
| 10 | +``` |
| 11 | + |
| 12 | +For details on the specific optimizations performed, see the options under |
| 13 | +`[profile.release]` and the use of `default-features = false`, in Cargo.toml, |
| 14 | +and the additional link flags passed in build.rs. |
| 15 | + |
| 16 | +## The optimizations |
| 17 | + |
| 18 | +First, `origin` makes much of its functionality optional, so we add |
| 19 | +`default-features = false` to disable things like `.init_array`/`.fini_array` |
| 20 | +support, thread support, and other things. We only enable the features needed |
| 21 | +for our minimal test program: |
| 22 | + |
| 23 | +```toml |
| 24 | +origin = { path = "../..", default-features = false, features = ["origin-program", "origin-start"] } |
| 25 | +``` |
| 26 | + |
| 27 | +Then, we enable several optimizations in the `[profile.release]` section of |
| 28 | +Cargo.toml: |
| 29 | + |
| 30 | +```toml |
| 31 | +# Give the optimizer more latitude to optimize and delete unneeded code. |
| 32 | +lto = true |
| 33 | +# "abort" is smaller than "unwind". |
| 34 | +panic = "abort" |
| 35 | +# Tell the optimizer to optimize for size. |
| 36 | +opt-level = "z" |
| 37 | +# Delete the symbol table from the executable. |
| 38 | +strip = true |
| 39 | +``` |
| 40 | + |
| 41 | +In detail: |
| 42 | + |
| 43 | +> `lto = true` |
| 44 | + |
| 45 | +LTO is Link-Time Optimization, which gives the optimizer the ability to see the |
| 46 | +whole program at once, and be more aggressive about deleting unneeded code. |
| 47 | + |
| 48 | +For example, in our test program, it enables inlining of the `main` function |
| 49 | +into the code in `origin` that calls it. Since it's just returning the constant |
| 50 | +`42`, inlining reduces code size. |
| 51 | + |
| 52 | +> `panic = "abort"` |
| 53 | + |
| 54 | +Rust's default panic mechanism performs a stack unwind, but that mechanism |
| 55 | +takes some code. |
| 56 | + |
| 57 | +This doesn't actually help our example here, since it's a minimal program that |
| 58 | +doesn't contain any `panic` calls, but it's a useful optimization feature in |
| 59 | +general. |
| 60 | + |
| 61 | +> `opt-level = "z"` |
| 62 | + |
| 63 | +The "z" optimization level instructs the compiler to prioritize code size above |
| 64 | +all other considerations. |
| 65 | + |
| 66 | +For example, on x86-64, in our test program, it uses this code sequence to load |
| 67 | +the value `0x2a`, which is our return value of `42`, into the `%rdi` register |
| 68 | +to pass to the exit system call: |
| 69 | + |
| 70 | +```asm |
| 71 | + 4000bc: 6a 2a push $0x2a |
| 72 | + 4000be: 5f pop %rdi |
| 73 | +``` |
| 74 | + |
| 75 | +Compare that with the sequence it emits without "z": |
| 76 | + |
| 77 | +```asm |
| 78 | + 4000c1: bf 2a 00 00 00 mov $0x2a,%edi |
| 79 | +``` |
| 80 | + |
| 81 | +The "z" form is two instructions rather than one. It also does a store and a |
| 82 | +load, as well as a stack pointer subtract and add. Modern x86-64 processors use |
| 83 | +store-to-load forwarding so that the load doesn't wait on memory, and they have |
| 84 | +a stack engine for `push`/`pop` sequences, so they are very good at optimizing |
| 85 | +these kinds of instruction sequences; see Agner Fog's |
| 86 | +[The microarchitecture of Intel, AMD, and VIA CPUs] for more information. |
| 87 | +However, even with these features, the sequence is probably still not |
| 88 | +completely free. |
| 89 | + |
| 90 | +But it is 3 bytes instead of 5, so `opt-level = "z"` goes with it. |
| 91 | + |
| 92 | +Amusingly, it doesn't do this same trick for the immediately following |
| 93 | +instruction, which looks similar: |
| 94 | +```asm |
| 95 | + 4000bd: b8 e7 00 00 00 mov $0xe7,%eax |
| 96 | +``` |
| 97 | + |
| 98 | +Here, the value being loaded is 0xe7, which has the eighth bit set. The x86 |
| 99 | +`push` instruction's immediate field is signed, so `push $0xe7` would need a |
| 100 | +4-byte immediate field to hold it as a positive value. Consequently, using the |
| 101 | +`push`/`pop` trick in this case would be longer. |
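
To make the sign-extension argument concrete, here is a small Rust sketch that models what a one-byte `push` immediate would place on the stack. It is not part of this crate, and the function names `push_imm8_value` and `fits_push_imm8` are made up for illustration. It assumes the usual x86-64 behavior, in which `push imm8` sign-extends its immediate to 64 bits.

```rust
/// Models the 64-bit value that `push imm8` would put on the stack:
/// the one-byte immediate is sign-extended to 64 bits.
fn push_imm8_value(byte: u8) -> u64 {
    byte as i8 as i64 as u64
}

/// Returns true if the short `push imm8` / `pop reg` sequence would load
/// exactly `value`, i.e. if sign-extending its low byte reproduces it.
fn fits_push_imm8(value: u32) -> bool {
    push_imm8_value(value as u8) == u64::from(value)
}

fn main() {
    // 0x2a is the exit status; 0xe7 is the value loaded into %eax above.
    for value in [0x2a_u32, 0xe7] {
        println!(
            "{:#x}: push imm8 would load {:#x}, fits = {}",
            value,
            push_imm8_value(value as u8),
            fits_push_imm8(value)
        );
    }
}
```

In this model, `0x2a` survives the round trip, while `0xe7` comes back with all of its upper bits set, which is why the short form can't be used for it.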
| 102 | + |
| 103 | +Next, we enable several link arguments in build.rs: |
| 104 | + |
| 105 | +```rust |
| 106 | + // Tell the linker to exclude the .eh_frame_hdr section. |
| 107 | + println!("cargo:rustc-link-arg=-Wl,--no-eh-frame-hdr"); |
| 108 | + // Tell the linker not to page-align sections. |
| 109 | + println!("cargo:rustc-link-arg=-Wl,-n"); |
| 110 | + // Tell the linker to make the text and data readable and writeable. This |
| 111 | + // allows them to occupy the same page. |
| 112 | + println!("cargo:rustc-link-arg=-Wl,-N"); |
| 113 | + // Tell the linker to exclude the `.note.gnu.build-id` section. |
| 114 | + println!("cargo:rustc-link-arg=-Wl,--build-id=none"); |
| 115 | + // Disable PIE, which adds some code size. |
| 116 | + println!("cargo:rustc-link-arg=-Wl,--no-pie"); |
| 117 | + // Disable the `GNU-stack` segment, if we're using lld. |
| 118 | + println!("cargo:rustc-link-arg=-Wl,-z,nognustack"); |
| 119 | +``` |
| 120 | + |
| 121 | +In detail: |
| 122 | + |
| 123 | +> `--no-eh-frame-hdr` |
| 124 | + |
| 125 | +This disables the creation of a `.eh_frame_hdr` section, which we don't need |
| 126 | +since we won't be doing any unwinding. |
| 127 | + |
| 128 | +> `-n` |
| 129 | + |
| 130 | +This turns off page alignment of sections, so that we don't waste any space on |
| 131 | +padding bytes. |
| 132 | + |
| 133 | +> `-N` |
| 134 | + |
| 135 | +This sets code sections to be writable, so that they can be loaded into memory |
| 136 | +together with data. Ordinarily, having read-only code is a very good thing, |
| 137 | +but making them writable can save a few bytes. |
| 138 | + |
| 139 | +The `-n` and `-N` flags don't actually help our example here, but they can save |
| 140 | +some code size in larger programs. |
| 141 | + |
| 142 | +> `--build-id=none` |
| 143 | + |
| 144 | +This disables the creation of a `.note.gnu.build-id` section, which is used by |
| 145 | +some build tools. We're not using any extra tools in our simple example here, |
| 146 | +so we can disable this. |
| 147 | + |
| 148 | +> `--no-pie` |
| 149 | + |
| 150 | +Position-Independent Executables (PIE) are executables that can be loaded into |
| 151 | +a random address in memory, to make some kinds of security exploits harder, |
| 152 | +though it takes some extra code and relocation metadata to fix up addresses |
| 153 | +once the actual runtime address has been determined. Our simple example here |
| 154 | +isn't concerned with security, so we can disable this feature and save the |
| 155 | +space. |
| 156 | + |
| 157 | +> `-z nognustack` |
| 158 | + |
| 159 | +This option is only recognized by ld.lld, so if you happen to be using that, |
| 160 | +this disables the use of the `GNU-stack` feature which allows the OS to mark |
| 161 | +the stack as non-executable. A non-executable stack is a very good thing, but |
| 162 | +omitting this request does save a few bytes. |
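
The `build.rs` excerpt above shows only the `println!` lines. As a rough sketch of how they could sit in a complete build script, the following collects the same six link arguments in one place and prints a `cargo:rustc-link-arg` directive for each. The names `LINK_ARGS` and `link_arg_directives` are hypothetical, and the crate's actual `build.rs` may be organized differently.

```rust
// build.rs (sketch): emit the link arguments discussed above.

/// The linker arguments described above, in the same order.
/// (`LINK_ARGS` is an illustrative name, not taken from the crate.)
const LINK_ARGS: [&str; 6] = [
    "-Wl,--no-eh-frame-hdr", // no .eh_frame_hdr section
    "-Wl,-n",                // no page alignment of sections
    "-Wl,-N",                // text and data readable and writable
    "-Wl,--build-id=none",   // no .note.gnu.build-id section
    "-Wl,--no-pie",          // no position-independent executable
    "-Wl,-z,nognustack",     // no GNU-stack segment (ld.lld only)
];

/// Builds one `cargo:rustc-link-arg=...` directive per linker argument.
fn link_arg_directives() -> Vec<String> {
    LINK_ARGS
        .iter()
        .map(|arg| format!("cargo:rustc-link-arg={arg}"))
        .collect()
}

fn main() {
    // Cargo reads these directives from the build script's stdout.
    for directive in link_arg_directives() {
        println!("{directive}");
    }
}
```

Keeping the arguments in a single table makes it easy to comment out one flag at a time and measure its effect on the binary size.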
| 163 | + |
| 164 | +Finally, we add a RUSTFLAGS flag in .cargo/config.toml: |
| 165 | + |
| 166 | +```toml |
| 167 | +rustflags = ["-Z", "trap-unreachable=no"] |
| 168 | +``` |
| 169 | + |
| 170 | +This disables the use of trap instructions, such as `ud2` on x86-64, at places |
| 171 | +the compiler thinks should be unreachable, such as after the `jmp` in `_start` |
| 172 | +or after the `syscall` that calls `exit_group`, which is unreachable because |
| 173 | +rustix uses the `noreturn` [inline asm option]. Normally these traps are a very |
| 174 | +good thing, but they do take a few extra bytes. |
| 175 | + |
| 176 | +[inline asm option]: https://doc.rust-lang.org/reference/inline-assembly.html#options |
| 177 | + |
| 178 | +## Generated code |
| 179 | + |
| 180 | +With all these optimizations, the generated code looks like this: |
| 181 | + |
| 182 | +```asm |
| 183 | +00000000004000b0 <.text>: |
| 184 | + 4000b0: 48 89 e7 mov %rsp,%rdi |
| 185 | + 4000b3: 55 push %rbp |
| 186 | + 4000b4: e9 00 00 00 00 jmp 0x4000b9 |
| 187 | + 4000b9: 50 push %rax |
| 188 | + 4000ba: 6a 2a push $0x2a |
| 189 | + 4000bc: 5f pop %rdi |
| 190 | + 4000bd: b8 e7 00 00 00 mov $0xe7,%eax |
| 191 | + 4000c2: 0f 05 syscall |
| 192 | +``` |
| 193 | + |
| 194 | +Those first 3 instructions are origin's `_start` function. The next 5 |
| 195 | +instructions are `origin::program::entry` with everything else, including the |
| 196 | +user `main` function and the `exit_group` syscall, inlined into it. |
| 197 | + |
| 198 | +In theory this code could be made even smaller. |
| 199 | + |
| 200 | +That first `mov %rsp,%rdi` is moving the incoming stack pointer we got from the |
| 201 | +OS into the first argument register to pass to `origin::program::entry` so that |
| 202 | +it can use it to pick up the command-line arguments, environment variables, and |
| 203 | +AUX records. However, we don't use any of those, so we don't need that argument. |
| 204 | +In theory origin could put that behind a cargo flag, but I didn't feel like |
| 205 | +adding separate versions of the `_start` sequence just for that optimization. |
| 206 | + |
| 207 | +Also, in theory, `origin::program::entry` could use the |
| 208 | +`llvm.frameaddress` intrinsic to read the incoming stack pointer value, instead |
| 209 | +of needing an explicit argument. But having it be an explicit argument makes it |
| 210 | +behave more like normal Rust code, which shouldn't be peeking at its caller's |
| 211 | +stack memory. |
| 212 | + |
| 213 | +And lastly, we could eliminate the `push %rbp`, `jmp`, and `push %rax`, which |
| 214 | +are just zeroing out the return address so that nothing ever unwinds back into |
| 215 | +the `_start` code, jumping to the immediately following code, and aligning the |
| 216 | +stack pointer, all to make a "call" from the `#[naked]` function `_start` written |
| 217 | +in asm to the Rust `origin::program::entry` function. This is the transition |
| 218 | +from assembly code to the first Rust code in the program. There are sneaky ways |
| 219 | +to arrange for this code to be able to fall through from `_start` into |
| 220 | +`origin::program::entry`, but as above, I'm aiming to have this code behave |
| 221 | +like normal Rust code, which shouldn't be using control flow paths that the |
| 222 | +compiler doesn't know about. |
| 223 | + |
| 224 | +## Sources |
| 225 | + |
| 226 | +Many of these optimizations came from the following websites: |
| 227 | + |
| 228 | + - [Minimizing Rust Binary Size](https://github.com/johnthagen/min-sized-rust), |
| 229 | + a great general-purpose resource. |
| 230 | + - [A very small Rust binary indeed](https://darkcoding.net/software/a-very-small-rust-binary-indeed/), |
| 231 | + a great resource for more extreme code-size optimizations. |
| 232 | + |
| 233 | +[`origin-start` example]: https://github.com/sunfishcode/origin/blob/main/example-crates/origin-start/README.md |
| 234 | +[The microarchitecture of Intel, AMD, and VIA CPUs]: https://www.agner.org/optimize/microarchitecture.pdf |