Commit 8c4adc6

Add an origin-start-tiny example demonstrating small code size.
1 parent b539c7b commit 8c4adc6

7 files changed: +352 -7 lines changed

README.md

Lines changed: 3 additions & 0 deletions
@@ -57,6 +57,9 @@ Origin can also be used on its own, in several different configurations:
 
 - The [origin-start-lto example] is like origin-start, but builds with LTO.
 
+- The [origin-start-tiny example] is like origin-start, but builds with
+  optimization flags and disables features to build a very small binary.
+
 ## Fully static linking
 
 The resulting executables in the origin-start, origin-start-no-alloc, and
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[build]
# Disable traps on unreachable code.
rustflags = ["-Z", "trap-unreachable=no"]
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
[package]
name = "origin-start-tiny"
version = "0.0.0"
edition = "2021"
publish = false

[dependencies]
# Origin can be depended on just like any other crate. For no_std, disable
# the default features, and enable the desired features.
origin = { path = "../..", default-features = false, features = ["origin-program", "origin-start"] }

# Crates to help writing no_std code.
compiler_builtins = { version = "0.1.101", features = ["mem"] }

# This is just an example crate, and not part of the origin workspace.
[workspace]

# Let's optimize for small size!
[profile.release]
# Give the optimizer more latitude to optimize and delete unneeded code.
lto = true
# "abort" is smaller than "unwind".
panic = "abort"
# Tell the optimizer to optimize for size.
opt-level = "z"
# Delete the symbol table from the executable.
strip = true
Lines changed: 234 additions & 0 deletions
@@ -0,0 +1,234 @@
This crate is similar to the [origin-start example], except that it doesn't
print any output, and enables optimizations for small code size. To produce a
small binary, compile with `--release`.

To produce an even smaller binary, use `objcopy` to remove the `.eh_frame`
and `.comment` sections:

```
objcopy -R .eh_frame -R .comment target/release/origin-start-tiny even-smaller
```
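
To see how much a step like the `objcopy` command saves, the two files' sizes can be compared. The following is a minimal sketch, not part of the example crate; the helper name `bytes_saved` is hypothetical, and the default paths simply mirror the command above and may not match your layout.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns how many bytes smaller `after` is than `before`
/// (negative if `after` is larger). Hypothetical helper for illustration.
fn bytes_saved(before: &Path, after: &Path) -> io::Result<i64> {
    let before_len = fs::metadata(before)?.len() as i64;
    let after_len = fs::metadata(after)?.len() as i64;
    Ok(before_len - after_len)
}

fn main() {
    // Paths come from the command line; the defaults are assumptions that
    // mirror the `objcopy` command above.
    let mut args = std::env::args().skip(1);
    let before = args
        .next()
        .unwrap_or_else(|| "target/release/origin-start-tiny".to_string());
    let after = args.next().unwrap_or_else(|| "even-smaller".to_string());

    match bytes_saved(Path::new(&before), Path::new(&after)) {
        Ok(saved) => println!("{after} is {saved} bytes smaller than {before}"),
        Err(err) => eprintln!("could not compare sizes: {err}"),
    }
}
```

The same helper can be used to measure the effect of each flag described below, by building with and without it.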

For details on the specific optimizations performed, see the options under
`[profile.release]` and the use of `default-features = false`, in Cargo.toml,
and the additional link flags passed in build.rs.

## The optimizations

First, `origin` makes much of its functionality optional, so we add
`default-features = false` to disable things like `.init_array`/`.fini_array`
support, thread support, and other things. We only enable the features needed
for our minimal test program:

```toml
origin = { path = "../..", default-features = false, features = ["origin-program", "origin-start"] }
```

Then, we enable several optimizations in the `[profile.release]` section of
Cargo.toml:

```toml
# Give the optimizer more latitude to optimize and delete unneeded code.
lto = true
# "abort" is smaller than "unwind".
panic = "abort"
# Tell the optimizer to optimize for size.
opt-level = "z"
# Delete the symbol table from the executable.
strip = true
```

In detail:

> `lto = true`

LTO is Link-Time Optimization, which gives the optimizer the ability to see the
whole program at once, and be more aggressive about deleting unneeded code.

For example, in our test program, it enables inlining of the `main` function
into the code in `origin` that calls it. Since it's just returning the constant
`42`, inlining reduces code size.

> `panic = "abort"`

Rust's default panic mechanism is to perform a stack unwind, but that
mechanism takes some code.

This doesn't actually help our example here, since it's a minimal program that
doesn't contain any `panic` calls, but it's a useful optimization feature in
general.

> `opt-level = "z"`

The "z" optimization level instructs the compiler to prioritize code size above
all other considerations.

For example, on x86-64, in our test program, it uses this code sequence to load
the value `0x2a`, which is our return value of `42`, into the `%rdi` register
to pass to the exit system call:

```asm
  4000bc:   6a 2a                   push   $0x2a
  4000be:   5f                      pop    %rdi
```

Compare that with the sequence it emits without "z":

```asm
  4000c1:   bf 2a 00 00 00          mov    $0x2a,%edi
```

The "z" form is two instructions rather than one. It also does a store and a
load, as well as a stack pointer subtract and add. Modern x86-64 processors do
store-to-load forwarding so that the load doesn't have to wait on memory, have
a stack engine for `push`/`pop` sequences, and are very good at optimizing
these kinds of instruction sequences; see Agner's
[The microarchitecture of Intel, AMD, and VIA CPUs] for more information.
However, even with these fancy features, it's probably still not completely
free.

But it is 3 bytes instead of 5, so `opt-level = "z"` goes with it.

Amusingly, it doesn't do this same trick for the immediately following
instruction, which looks similar:

```asm
  4000bd:   b8 e7 00 00 00          mov    $0xe7,%eax
```

Here, the value being loaded is 0xe7, which has the eighth bit set. The x86
`push` instruction's 8-bit immediate field is sign-extended, so `push $0xe7`
would need the 4-byte immediate form to load the value zero-extended.
Consequently, using the `push`/`pop` trick in this case would be longer
(6 bytes instead of 5).
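
The size comparison above can be worked through in a few lines of Rust. This is only a sketch of the arithmetic, not how the compiler chooses instructions; the function names are hypothetical, and it assumes the destination is one of the eight legacy registers (`%rax` through `%rdi`), whose `pop` encodes in a single byte.

```rust
/// Encoded length of `push $imm; pop %reg`, or `None` if `push` cannot
/// produce `value` at all. Hypothetical helper for illustration.
fn push_pop_len(value: i64) -> Option<usize> {
    const POP_LEN: usize = 1; // e.g. `5f` for `pop %rdi`
    if (-0x80..=0x7f).contains(&value) {
        // `6a ib`: opcode plus a sign-extended 8-bit immediate.
        Some(2 + POP_LEN)
    } else if (i32::MIN as i64..=i32::MAX as i64).contains(&value) {
        // `68 id`: opcode plus a sign-extended 32-bit immediate.
        Some(5 + POP_LEN)
    } else {
        None
    }
}

/// Encoded length of `mov $imm32, %r32`, which zero-extends into the
/// full 64-bit register, or `None` if `value` doesn't fit.
fn mov_len(value: i64) -> Option<usize> {
    if (0..=u32::MAX as i64).contains(&value) {
        Some(5) // `b8+r id`: opcode plus a 32-bit immediate.
    } else {
        None
    }
}

fn main() {
    for value in [0x2a_i64, 0xe7] {
        println!(
            "{value:#x}: push/pop = {:?} bytes, mov = {:?} bytes",
            push_pop_len(value),
            mov_len(value)
        );
    }
}
```

For `0x2a` the `push`/`pop` pair is 3 bytes against 5 for `mov`; for `0xe7` it is 6 bytes against 5, which matches the choices in the listing.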

Next, we enable several link arguments in build.rs:

```rust
// Tell the linker to exclude the .eh_frame_hdr section.
println!("cargo:rustc-link-arg=-Wl,--no-eh-frame-hdr");
// Tell the linker not to page-align sections.
println!("cargo:rustc-link-arg=-Wl,-n");
// Tell the linker to make the text and data readable and writeable. This
// allows them to occupy the same page.
println!("cargo:rustc-link-arg=-Wl,-N");
// Tell the linker to exclude the `.note.gnu.build-id` section.
println!("cargo:rustc-link-arg=-Wl,--build-id=none");
// Disable PIE, which adds some code size.
println!("cargo:rustc-link-arg=-Wl,--no-pie");
// Disable the `GNU-stack` segment, if we're using lld.
println!("cargo:rustc-link-arg=-Wl,-z,nognustack");
```

In detail:

> `--no-eh-frame-hdr`

This disables the creation of a `.eh_frame_hdr` section, which we don't need
since we won't be doing any unwinding.

> `-n`

This turns off page alignment of sections, so that we don't waste any space on
padding bytes.

> `-N`

This sets code sections to be writable, so that they can be loaded into memory
together with data. Ordinarily, having read-only code is a very good thing,
but making them writable can save a few bytes.

The `-n` and `-N` flags don't actually help our example here, but they can save
some code size in larger programs.

> `--build-id=none`

This disables the creation of a `.note.gnu.build-id` section, which is used by
some build tools. We're not using any extra tools in our simple example here,
so we can disable this.

> `--no-pie`

Position-Independent Executables (PIE) are executables that can be loaded into
a random address in memory, to make some kinds of security exploits harder,
though it takes some extra code and relocation metadata to fix up addresses
once the actual runtime address has been determined. Our simple example here
isn't concerned with security, so we can disable this feature and save the
space.

> `-z nognustack`

This option is only recognized by ld.lld, so if you happen to be using that,
this disables the use of the `GNU-stack` feature which allows the OS to mark
the stack as non-executable. A non-executable stack is a very good thing, but
omitting this request does save a few bytes.
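
Every one of these flags reaches the linker the same way: build.rs prints a `cargo:rustc-link-arg=` directive, and the `-Wl,` prefix asks the compiler driver to forward the flag to the linker. As a sketch of that pattern only (the example's build.rs writes each `println!` out by hand, and the name `link_arg_directives` is hypothetical), the directives could be generated from a list:

```rust
/// Formats one `cargo:rustc-link-arg=` directive per linker flag, wrapping
/// each in `-Wl,` so the compiler driver forwards it to the linker.
/// Hypothetical helper for illustration.
fn link_arg_directives(flags: &[&str]) -> Vec<String> {
    flags
        .iter()
        .map(|flag| format!("cargo:rustc-link-arg=-Wl,{flag}"))
        .collect()
}

fn main() {
    // The size-related linker flags described above, in the same order.
    const SIZE_FLAGS: &[&str] = &[
        "--no-eh-frame-hdr",
        "-n",
        "-N",
        "--build-id=none",
        "--no-pie",
        "-z,nognustack",
    ];

    // In a build script, Cargo reads these directives from stdout.
    for directive in link_arg_directives(SIZE_FLAGS) {
        println!("{directive}");
    }
}
```

Keeping the flags in one list makes it easy to drop one and rebuild to see what it contributes to the final size.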

Finally, we add a RUSTFLAGS flag with .cargo/config.toml:

```toml
rustflags = ["-Z", "trap-unreachable=no"]
```

This disables the use of trap instructions, such as `ud2` on x86-64, at places
the compiler thinks should be unreachable, such as after the `jmp` in `_start`
or after the `syscall` that calls `exit_group`, because rustix uses the
`noreturn` [inline asm option]. Normally this is a very good thing, but it
does take a few extra bytes.

[inline asm option]: https://doc.rust-lang.org/reference/inline-assembly.html#options

## Generated code

With all these optimizations, the generated code looks like this:

```asm
00000000004000b0 <.text>:
  4000b0:   48 89 e7                mov    %rsp,%rdi
  4000b3:   55                      push   %rbp
  4000b4:   e9 00 00 00 00          jmp    0x4000b9
  4000b9:   50                      push   %rax
  4000ba:   6a 2a                   push   $0x2a
  4000bc:   5f                      pop    %rdi
  4000bd:   b8 e7 00 00 00          mov    $0xe7,%eax
  4000c2:   0f 05                   syscall
```

Those first 3 instructions are origin's `_start` function. The next 5
instructions are `origin::program::entry` with everything inlined into it,
including the user `main` function and the `exit_group` syscall.
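
To put a number on it, the encoded bytes from the listing can be tallied. The following sketch just copies the bytes shown above into tables (the names `START`, `ENTRY`, and `code_size` are hypothetical) and sums their lengths:

```rust
/// Encoded bytes of the three `_start` instructions, copied from the listing.
const START: &[&[u8]] = &[
    &[0x48, 0x89, 0xe7],             // mov %rsp,%rdi
    &[0x55],                         // push %rbp
    &[0xe9, 0x00, 0x00, 0x00, 0x00], // jmp 0x4000b9
];

/// Encoded bytes of the five `origin::program::entry` instructions.
const ENTRY: &[&[u8]] = &[
    &[0x50],                         // push %rax
    &[0x6a, 0x2a],                   // push $0x2a
    &[0x5f],                         // pop %rdi
    &[0xb8, 0xe7, 0x00, 0x00, 0x00], // mov $0xe7,%eax
    &[0x0f, 0x05],                   // syscall
];

/// Total encoded size in bytes of a sequence of instructions.
fn code_size(instructions: &[&[u8]]) -> usize {
    instructions.iter().map(|bytes| bytes.len()).sum()
}

fn main() {
    let start = code_size(START);
    let entry = code_size(ENTRY);
    println!("_start: {start} bytes, entry: {entry} bytes, total: {} bytes", start + entry);
}
```

That comes to 9 bytes for `_start` and 11 bytes for `entry`, so the whole program's code is 20 bytes, running from `0x4000b0` up to `0x4000c4`.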

In theory this code could be made even smaller.

That first `mov %rsp,%rdi` is moving the incoming stack pointer we got from the
OS into the first argument register to pass to `origin::program::entry` so that
it can use it to pick up the command-line arguments, environment variables, and
AUX records. However, we don't use any of those, so we don't need that argument.
In theory origin could put that behind a cargo flag, but I didn't feel like
adding separate versions of the `_start` sequence just for that optimization.

Also, in theory, `origin::program::entry` could use the
`llvm.frameaddress` intrinsic to read the incoming stack pointer value, instead
of needing an explicit argument. But having it be an explicit argument makes it
behave more like normal Rust code, which shouldn't be peeking at its caller's
stack memory.

And lastly, we could eliminate the `push %rbp`, `jmp`, and `push %rax`. These
zero out the return address so that nothing ever unwinds back into the
`_start` code, jump to the immediately following code, and align the stack
pointer, respectively, all to make a "call" from the `#[naked]` function
`_start`, written in asm, to the Rust `origin::program::entry` function. This
is the transition from assembly code to the first Rust code in the program.
There are sneaky ways to arrange for this code to be able to fall through from
`_start` into `origin::program::entry`, but as above, I'm aiming to have this
code behave like normal Rust code, which shouldn't be using control flow paths
that the compiler doesn't know about.

## Sources

Many of these optimizations came from the following websites:

 - [Minimizing Rust Binary Size](https://github.com/johnthagen/min-sized-rust),
   a great general-purpose resource.
 - [A very small Rust binary indeed](https://darkcoding.net/software/a-very-small-rust-binary-indeed/),
   a great resource for more extreme code-size optimizations.

[origin-start example]: https://github.com/sunfishcode/origin/blob/main/example-crates/origin-start/README.md
[The microarchitecture of Intel, AMD, and VIA CPUs]: https://www.agner.org/optimize/microarchitecture.pdf
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
fn main() {
    // Pass -nostartfiles to the linker. In the future this could be obviated
    // by a `no_entry` feature: <https://github.com/rust-lang/rfcs/pull/2735>
    println!("cargo:rustc-link-arg=-nostartfiles");

    // The following options optimize for code size!

    // Tell the linker to exclude the .eh_frame_hdr section.
    println!("cargo:rustc-link-arg=-Wl,--no-eh-frame-hdr");
    // Tell the linker not to page-align sections.
    println!("cargo:rustc-link-arg=-Wl,-n");
    // Tell the linker to make the text and data readable and writeable. This
    // allows them to occupy the same page.
    println!("cargo:rustc-link-arg=-Wl,-N");
    // Tell the linker to exclude the `.note.gnu.build-id` section.
    println!("cargo:rustc-link-arg=-Wl,--build-id=none");
    // Disable PIE, which adds some code size.
    println!("cargo:rustc-link-arg=-Wl,--no-pie");
    // Disable the `GNU-stack` segment, if we're using lld.
    println!("cargo:rustc-link-arg=-Wl,-z,nognustack");
}
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
//! Going for minimal size!

#![no_std]
#![no_main]
#![allow(internal_features)]
#![feature(lang_items)]
#![feature(core_intrinsics)]

extern crate origin;
extern crate compiler_builtins;

#[panic_handler]
fn panic(_panic: &core::panic::PanicInfo<'_>) -> ! {
    core::intrinsics::abort()
}

#[lang = "eh_personality"]
extern "C" fn eh_personality() {}

#[no_mangle]
extern "C" fn main(_argc: i32, _argv: *const *const u8) -> i32 {
    42
}
