A lab for exploring local inference, in Rust and assembly language, to understand the real-world constraints of LLMs.
This workspace documents a progressive exploration of Qwen3 models, in their various sizes and configurations. The goal is not to cover a particular type of architecture, but to examine concretely how architectural choices influence local inference constraints.
This iteration naturally led me to work on Mixture-of-Experts (MoE) models, although this was not the starting point of the project. A more in-depth exploration of these architectures will likely be the subject of future iterations.
This is the 11th iteration of an evolving workspace.
Bugs: yes. This is an experimental workspace. If something looks rough, it probably is — on purpose.
This project started from a simple question: what are the real constraints of running modern LLMs locally?
In practice, inference is often presented as a compute problem. In reality, when working on local setups, the limiting factor is frequently memory bandwidth rather than raw FLOPs. Exploring this boundary — prefill vs decode regimes, dense vs sparse architectures, quantization strategies, CPU vs GPU trade-offs — is one of the core motivations behind this repository.
The goal is not to compete with production-grade serving systems, but to understand, measure, and reason about what actually matters when running large models outside of hyperscale infrastructure.
Note: Some comments in the codebase are written in French. This reflects the exploratory nature of the project and the iterative process behind it. Future iterations will progressively standardize and clean up the documentation.
This repository is not a production inference server, nor a drop-in replacement for existing optimized frameworks.
It is a research-oriented workspace built to study the different regimes of an inference engine: prefill vs decode, dense vs sparse layers, quantized vs full precision paths, CPU vs GPU execution.
A recurring hypothesis throughout this work is that local inference is often memory-bandwidth bound rather than compute-bound. The goal is therefore not only to optimize, but to understand precisely where and why bottlenecks appear.
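To make this hypothesis concrete, here is a back-of-envelope sketch: during single-stream decode, every weight must be read from memory once per generated token, so memory bandwidth alone imposes an upper bound on token rate regardless of available FLOPs. The model size and bandwidth figures below are illustrative assumptions, not measurements from this repository.

```rust
// Upper bound on decode throughput when weight traffic dominates:
// tokens/s <= bandwidth / (parameters * bytes_per_parameter).
fn bandwidth_bound_tok_s(param_count: f64, bytes_per_param: f64, bandwidth_gb_s: f64) -> f64 {
    let bytes_per_token = param_count * bytes_per_param;
    (bandwidth_gb_s * 1e9) / bytes_per_token
}

fn main() {
    // Hypothetical: a 4B-parameter model in BF16 (2 bytes/param)
    // on a CPU with ~60 GB/s of DRAM bandwidth.
    let tok_s = bandwidth_bound_tok_s(4e9, 2.0, 60.0);
    println!("decode upper bound: {:.1} tok/s", tok_s);
}
```

Under these assumed numbers the ceiling is 7.5 tok/s no matter how fast the ALUs are, which is why quantization (fewer bytes per parameter) moves the decode ceiling directly, while prefill, which amortizes each weight read over many tokens, behaves more like a compute problem.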
Each iteration of this codebase reflects that process: implement, measure, refine, and question the underlying assumptions.
Parts of the comment cleanup were assisted by AI tools. The code, architecture, and experimental direction are designed and reviewed manually.
To make the scope explicit:
- It is not a polished, production-ready inference server.
- It is not optimized for deployment at scale.
- It is not a benchmark-driven project chasing state-of-the-art throughput numbers.
- It is not a framework abstraction layer meant to hide complexity.
Instead, this repository embraces complexity. It exposes trade-offs rather than abstracting them away. It prioritizes clarity of mechanisms over API ergonomics.
If you are looking for a stable, turnkey LLM serving solution, there are excellent projects in the ecosystem. This workspace serves a different purpose: understanding the mechanics of inference at a low level.
The long-term objective of this project is CPU-based local inference.
The core question is how far one can reasonably push inference on commodity hardware, without relying on massively parallel GPU infrastructure. In that sense, the CPU is not a fallback — it is the reference environment.
That said, it would not be intellectually honest to study inference regimes without experimenting with GPU backends. For this reason, several backends have been implemented (Metal, Vulkan, SIMD variants), not as an end goal, but as comparative tools.
These GPU implementations allow exploration of different memory hierarchies, subgroup behaviors, quantization paths, and bandwidth characteristics. They serve as instruments for measurement and contrast rather than as the primary target of the project.
The project currently includes:
- Model architectures:
  - Dense Transformers (Qwen3 family)
  - Mixture-of-Experts (MoE) layers
  - Vision-Language models (including MRoPE and image token injection)
- Numeric formats and quantization:
  - BF16 paths
  - INT8 (per-channel / VNNI)
  - Q4 (group-wise 4-bit quantization)
  - Quantization integrated directly into the model loader
- Backends:
  - Scalar reference implementation
  - Multi-threaded CPU backend
  - AVX-512 optimized backends (BF16, INT8, Q4 variants)
  - Metal (macOS)
  - Vulkan (Linux, multiple iterations and subgroup configurations)
- Runtime mechanics:
  - Prefill and decode regimes
  - KV cache (BF16 storage)
  - MoE routing implementations (multiple experimental variants)
  - Runtime auto-tuning for CPU execution
  - Instrumentation and tracing for routing analysis
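As an illustration of the MoE routing mentioned above, here is a minimal top-k routing sketch: softmax over the router logits, keep the k largest experts, renormalize their weights. This is a generic textbook formulation with hypothetical shapes, not the repository's actual implementation.

```rust
// Top-k expert routing: softmax -> select k highest -> renormalize.
// Returns (expert_index, weight) pairs whose weights sum to 1.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stabilized softmax.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|&e| e / sum).enumerate().collect();
    // Sort descending by probability and keep the k best experts.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(k);
    // Renormalize the selected weights so they sum to 1.
    let norm: f32 = probs.iter().map(|&(_, p)| p).sum();
    probs.into_iter().map(|(i, p)| (i, p / norm)).collect()
}

fn main() {
    // Four hypothetical experts, route each token to the top 2.
    let routed = top_k_route(&[0.1, 2.0, -1.0, 1.5], 2);
    assert_eq!(routed[0].0, 1); // highest logit wins
    assert_eq!(routed[1].0, 3);
    let total: f32 = routed.iter().map(|&(_, p)| p).sum();
    assert!((total - 1.0).abs() < 1e-6);
}
```

Sparsity enters here: only the selected experts' weights are touched per token, which changes the bandwidth profile relative to a dense FFN.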
This list reflects the current state of the workspace and its experimental nature. Some backends represent successive iterations of ideas rather than finalized implementations.
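For the group-wise 4-bit path, the general shape is easy to sketch: one scale per small group of weights, each weight mapped onto a 4-bit integer range. The sketch below assumes a symmetric absmax scheme with group size 32; the repository's actual Q4 format and packing may differ.

```rust
// Group-wise symmetric 4-bit quantization: one f32 scale per 32 weights,
// values mapped onto [-7, 7] (assumed scheme, for illustration only).
const GROUP: usize = 32;

fn quantize_q4(group: &[f32; GROUP]) -> (f32, [i8; GROUP]) {
    // One scale per group: max magnitude maps to the int4 extreme.
    let absmax = group.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 7.0 };
    let mut q = [0i8; GROUP];
    for (dst, &x) in q.iter_mut().zip(group) {
        *dst = (x / scale).round().clamp(-7.0, 7.0) as i8;
    }
    (scale, q)
}

fn dequantize_q4(scale: f32, q: &[i8; GROUP]) -> [f32; GROUP] {
    let mut out = [0f32; GROUP];
    for (dst, &v) in out.iter_mut().zip(q) {
        *dst = v as f32 * scale;
    }
    out
}

fn main() {
    let mut x = [0f32; GROUP];
    for (i, v) in x.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) / 4.0;
    }
    let (scale, q) = quantize_q4(&x);
    let y = dequantize_q4(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    for (a, b) in x.iter().zip(&y) {
        assert!((a - b).abs() <= scale * 0.5 + 1e-6);
    }
}
```

The bandwidth connection: at roughly 4 bits plus a shared scale per weight, each decoded token moves about a quarter of the bytes of the BF16 path, which is precisely the lever the quantized backends exploit.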
The codebase is intentionally written to be self-contained and readable without excessive commentary. The structure, module boundaries, and naming aim to make the execution flow understandable directly from the source.
I do not intend to heavily comment or over-explain every implementation detail in this iteration. A more narrative and pedagogical explanation of design decisions, trade-offs, and experimental findings will be provided separately (for instance in long-form articles).
This repository reflects a working exploration rather than a finalized, tutorial-style codebase.
Directions for future iterations include:
- A more product-oriented iteration, with a cleaner separation between experimental and stable components.
- Broader model support beyond the current Qwen3 focus (e.g. GPT-OSS, LFM2, Mistral, and related architectures).
- Further refinement of CPU-focused inference paths, with clearer performance characterization.
- Continued exploration of sparse and dense regimes under local hardware constraints.
This project is released under the MIT License. See the LICENSE file for details.