Description
We currently generate a Rust project at build time and invoke Cargo to produce a model-specific encoderfile binary. This approach has become increasingly fragile and expensive:
- Docker builds trigger nested Rust builds (Cargo-in-Cargo)
- CI frequently OOMs, especially on ARM runners
- Build behavior depends on crates.io publish timing and cache state
- Version-coupled crates (`encoderfile`/`encoderfile-core`) are resolved at build time, leading to accidental mismatches
- Build times and failure modes are hard to reason about and debug
In practice, we are re-compiling Rust code solely to embed model assets (weights, tokenizer, configs), not because the executable logic itself is changing.
Proposed change
Move from “generate Rust project + build” to “post-link binary packaging”.
Instead of rebuilding Rust code to embed assets, we will:
- Build pre-compiled binaries
- Generate model assets separately
- Append those assets to the already-compiled binary (llamafile-style)
- Load and validate the embedded payload at runtime
This removes Cargo from the packaging path entirely.
What this looks like
Before
Docker build
→ encoderfile build
→ generate Rust project
→ invoke Cargo
→ resolve dependencies
→ compile Rust
→ embed assets
After
CI build
→ cargo build (once, per model type)
Packaging
→ concat binary + payload
Runtime
→ read embedded payload
→ initialize model
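The packaging step above can be sketched in plain Rust. The footer layout (payload, then a little-endian u64 length, then an 8-byte magic tag) and all names here are illustrative, pending the payload-format follow-up:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;

/// Illustrative footer tag; the real marker is defined in a follow-up.
const MAGIC: &[u8; 8] = b"ENCFILE\0";

/// Copy a pre-compiled runtime binary and append the model payload:
/// [payload][payload_len: u64 LE][magic]
fn append_payload(binary: &str, payload: &str, out: &str) -> std::io::Result<()> {
    fs::copy(binary, out)?;
    let payload_bytes = fs::read(payload)?;
    let mut f = OpenOptions::new().append(true).open(out)?;
    // The OS loader ignores trailing bytes, so appending is safe.
    f.write_all(&payload_bytes)?;
    f.write_all(&(payload_bytes.len() as u64).to_le_bytes())?;
    f.write_all(MAGIC)?;
    Ok(())
}
```

Since this is pure byte concatenation, the packager needs no toolchain at all: no Cargo, no linker, no model-specific compilation.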
Why this is better
- Eliminates nested Rust builds and CI OOMs
- Removes crates.io timing and cache dependency
- Makes Docker builds deterministic and fast
- Preserves strong compile-time typing and monomorphization
- Aligns with one-binary-per-model-type architecture
- Simplifies debugging and failure modes
Importantly, this does not require:
- Cosmopolitan / universal binaries
- C++ or linker tricks
- `include_bytes!`, custom sections, or `build.rs` hacks
The OS loader already ignores trailing bytes in executables; we simply take advantage of that.
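To make this concrete, a minimal runtime loader might read its own trailing bytes as follows. It assumes the same illustrative footer layout (payload, little-endian u64 length, 8-byte magic), and it takes a path so it can be exercised in isolation; the real loader would pass `std::env::current_exe()`:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::Path;

/// Illustrative footer tag; must match whatever the packager appended.
const MAGIC: &[u8; 8] = b"ENCFILE\0";

/// Read an appended payload from `path` (the running executable in practice).
/// Returns Ok(None) for a plain binary with no payload footer.
fn read_embedded_payload(path: &Path) -> std::io::Result<Option<Vec<u8>>> {
    let mut f = File::open(path)?;
    let size = f.metadata()?.len();
    if size < 16 {
        return Ok(None);
    }
    // Footer = [payload_len: u64 LE][magic], located at the very end.
    let mut footer = [0u8; 16];
    f.seek(SeekFrom::End(-16))?;
    f.read_exact(&mut footer)?;
    if footer[8..] != *MAGIC {
        return Ok(None);
    }
    let len = u64::from_le_bytes(footer[..8].try_into().unwrap());
    if len > size - 16 {
        return Ok(None); // corrupt footer; treat as no payload
    }
    // The payload sits immediately before the footer.
    f.seek(SeekFrom::End(-16 - len as i64))?;
    let mut payload = vec![0u8; len as usize];
    f.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Validation (checksum, version, protobuf decode) would happen after this raw read, per the payload-format follow-up.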
Scope / follow-ups
- Define payload format (footer marker, length prefix, optional checksum)
- Implement runtime payload loader
- Update CI and Docker pipelines
- Deprecate generated-project path and related macros
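For the payload-format follow-up, one possible shape is a footer carrying an integrity checksum. The sketch below uses an inline FNV-1a hash purely as a stand-in; a real implementation would more likely pick CRC32 or SHA-256, and the field layout is an assumption:

```rust
// Illustrative footer, read backwards from the end of the file:
// [payload][checksum: u64 LE][payload_len: u64 LE][magic: 8 bytes]

/// FNV-1a 64-bit hash, used here only as a placeholder checksum.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Validate an extracted payload against the checksum stored in the footer.
fn payload_is_valid(payload: &[u8], stored_checksum: u64) -> bool {
    fnv1a(payload) == stored_checksum
}
```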
Non-goals
- Supporting multiple model types in a single binary
- Re-introducing runtime dispatch or dynamic model selection
- Universal “run anywhere” binaries
This change trades compile-time asset embedding for runtime initialization, which is acceptable and significantly reduces operational complexity.
On Implementation
Note: headless mode and backends are brought up here for future planning. They are NOT in scope for this issue.
We’re standardizing on the following model going forward:
- Targets are (platform × backend) runtime binaries, installed explicitly: `encoderfile target add arm64-unknown-linux-gnu --backend cuda`
These are downloaded from GitHub Releases and cached locally. No cross-compiling for users, no auto-selection.
- Embedded encoderfiles are deployment artifacts only.
A `.encoderfile` contains a runtime binary + embedded protobuf payload and is:
- fully self-contained
- immutable
- not allowed to run headless
- does not load external weights/config at runtime
- Headless mode is only supported by pre-built runtime binaries.
Headless execution (external weights/config/tokenizer) is explicitly disallowed for embedded encoderfiles and enforced at compile time via mutually exclusive features (`embedded` vs `headless`).
- Exactly one backend per runtime binary (CPU, CUDA, Metal, etc.).
Backend choice is a build-time decision. There is no runtime backend switching and no multi-backend binaries.
This separation keeps deployment artifacts deterministic, avoids cross-compile pain for users, prevents accidental CUDA/Metal dependencies, and cleanly supports future environments (e.g. WASM) via headless runtimes.
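The compile-time mutual exclusion of `embedded` and `headless` mentioned above is commonly enforced in Rust with `compile_error!` guards; the exact feature wiring and messages below are assumptions, not the final implementation:

```rust
// In the runtime crate's lib.rs or main.rs. Both guards are evaluated
// against the enabled Cargo features at compile time.
#[cfg(all(feature = "embedded", feature = "headless"))]
compile_error!("features `embedded` and `headless` are mutually exclusive");

#[cfg(not(any(feature = "embedded", feature = "headless")))]
compile_error!("enable exactly one of `embedded` or `headless`");
```

With this in place, a misconfigured build fails immediately at `cargo build` rather than producing a binary with ambiguous behavior.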