A lab for exploring local inference, in Rust and assembly language, to understand the real-world constraints of LLMs.
This workspace documents a progressive exploration of Qwen3 models, in their various sizes and configurations. The goal is not to cover a particular type of architecture, but to examine concretely how architectural choices influence local inference constraints.
This iteration naturally led me to work on Mixture-of-Experts (MoE) models, although this was not the starting point of the project. A more in-depth exploration of these architectures will likely be the subject of future iterations.
This is the 11th iteration of an evolving workspace.
Bugs: yes. This is an experimental workspace. If something looks rough, it probably is — on purpose.
This project started from a simple question: what are the real constraints of running modern LLMs locally?
In practice, inference is often presented as a compute problem. In reality, when working on local setups, the limiting factor is frequently memory bandwidth rather than raw FLOPs. Exploring this boundary — prefill vs decode regimes, dense vs sparse architectures, quantization strategies, CPU vs GPU trade-offs — is one of the core motivations behind this repository.
The goal is not to compete with production-grade serving systems, but to understand, measure, and reason about what actually matters when running large models outside of hyperscale infrastructure.
Note: Some comments in the codebase are written in French. This reflects the exploratory nature of the project and the iterative process behind it. Future iterations will progressively standardize and clean up the documentation.
This repository is not a production inference server, nor a drop-in replacement for existing optimized frameworks.
It is a research-oriented workspace built to study the different regimes of an inference engine: prefill vs decode, dense vs sparse layers, quantized vs full precision paths, CPU vs GPU execution.
A recurring hypothesis throughout this work is that local inference is often memory-bandwidth bound rather than compute-bound. The goal is therefore not only to optimize, but to understand precisely where and why bottlenecks appear.
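To make this hypothesis concrete, here is a back-of-envelope sketch: during single-stream decode, every weight must be read from memory once per generated token, so memory bandwidth alone imposes an upper bound on token rate regardless of available FLOPs. The model size and bandwidth figures below are illustrative assumptions, not measurements from this repository.

```rust
// Upper bound on decode throughput when weight traffic dominates:
// tokens/s <= bandwidth / (parameters * bytes_per_parameter).
fn bandwidth_bound_tok_s(param_count: f64, bytes_per_param: f64, bandwidth_gb_s: f64) -> f64 {
    let bytes_per_token = param_count * bytes_per_param;
    (bandwidth_gb_s * 1e9) / bytes_per_token
}

fn main() {
    // Hypothetical: a 4B-parameter model in BF16 (2 bytes/param)
    // on a CPU with ~60 GB/s of DRAM bandwidth.
    let tok_s = bandwidth_bound_tok_s(4e9, 2.0, 60.0);
    println!("decode upper bound: {:.1} tok/s", tok_s);
}
```

Under these assumed numbers the ceiling is 7.5 tok/s no matter how fast the ALUs are, which is why quantization (fewer bytes per parameter) moves the decode ceiling directly, while prefill, which amortizes each weight read over many tokens, behaves more like a compute problem.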
Each iteration of this codebase reflects that process: implement, measure, refine, and question the underlying assumptions.
Parts of the comment cleanup were assisted by AI tools. The code, architecture, and experimental direction are designed and reviewed manually.
To make the scope explicit:
- It is not a polished, production-ready inference server.
- It is not optimized for deployment at scale.
- It is not a benchmark-driven project chasing state-of-the-art throughput numbers.
- It is not a framework abstraction layer meant to hide complexity.
Instead, this repository embraces complexity. It exposes trade-offs rather than abstracting them away. It prioritizes clarity of mechanisms over API ergonomics.
If you are looking for a stable, turnkey LLM serving solution, there are excellent projects in the ecosystem. This workspace serves a different purpose: understanding the mechanics of inference at a low level.
The long-term objective of this project is CPU-based local inference.
The core question is how far one can reasonably push inference on commodity hardware, without relying on massively parallel GPU infrastructure. In that sense, the CPU is not a fallback — it is the reference environment.
That said, it would not be intellectually honest to study inference regimes without experimenting with GPU backends. For this reason, several backends have been implemented (Metal, Vulkan, SIMD variants), not as an end goal, but as comparative tools.
These GPU implementations allow exploration of different memory hierarchies, subgroup behaviors, quantization paths, and bandwidth characteristics. They serve as instruments for measurement and contrast rather than as the primary target of the project.
The project currently includes:
- Model architectures:
  - Dense Transformers (Qwen3 family)
  - Mixture-of-Experts (MoE) layers
  - Vision-Language models (including MRoPE and image token injection)
- Numeric formats and quantization:
  - BF16 paths
  - INT8 (per-channel / VNNI)
  - Q4 (group-wise 4-bit quantization)
  - Quantization integrated directly into the model loader
- Backends:
  - Scalar reference implementation
  - Multi-threaded CPU backend
  - AVX-512 optimized backends (BF16, INT8, Q4 variants)
  - Metal (macOS)
  - Vulkan (Linux, multiple iterations and subgroup configurations)
- Runtime mechanics:
  - Prefill and decode regimes
  - KV cache (BF16 storage)
  - MoE routing implementations (multiple experimental variants)
  - Runtime auto-tuning for CPU execution
  - Instrumentation and tracing for routing analysis
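As an illustration of the MoE routing mentioned above, here is a minimal top-k routing sketch: softmax over the router logits, keep the k largest experts, renormalize their weights. This is a generic textbook formulation with hypothetical shapes, not the repository's actual implementation.

```rust
// Top-k expert routing: softmax -> select k highest -> renormalize.
// Returns (expert_index, weight) pairs whose weights sum to 1.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stabilized softmax.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().map(|&e| e / sum).enumerate().collect();
    // Sort descending by probability and keep the k best experts.
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    probs.truncate(k);
    // Renormalize the selected weights so they sum to 1.
    let norm: f32 = probs.iter().map(|&(_, p)| p).sum();
    probs.into_iter().map(|(i, p)| (i, p / norm)).collect()
}

fn main() {
    // Four hypothetical experts, route each token to the top 2.
    let routed = top_k_route(&[0.1, 2.0, -1.0, 1.5], 2);
    assert_eq!(routed[0].0, 1); // highest logit wins
    assert_eq!(routed[1].0, 3);
    let total: f32 = routed.iter().map(|&(_, p)| p).sum();
    assert!((total - 1.0).abs() < 1e-6);
}
```

Sparsity enters here: only the selected experts' weights are touched per token, which changes the bandwidth profile relative to a dense FFN.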
This list reflects the current state of the workspace and its experimental nature. Some backends represent successive iterations of ideas rather than finalized implementations.
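For the group-wise 4-bit path, the general shape is easy to sketch: one scale per small group of weights, each weight mapped onto a 4-bit integer range. The sketch below assumes a symmetric absmax scheme with group size 32; the repository's actual Q4 format and packing may differ.

```rust
// Group-wise symmetric 4-bit quantization: one f32 scale per 32 weights,
// values mapped onto [-7, 7] (assumed scheme, for illustration only).
const GROUP: usize = 32;

fn quantize_q4(group: &[f32; GROUP]) -> (f32, [i8; GROUP]) {
    // One scale per group: max magnitude maps to the int4 extreme.
    let absmax = group.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 7.0 };
    let mut q = [0i8; GROUP];
    for (dst, &x) in q.iter_mut().zip(group) {
        *dst = (x / scale).round().clamp(-7.0, 7.0) as i8;
    }
    (scale, q)
}

fn dequantize_q4(scale: f32, q: &[i8; GROUP]) -> [f32; GROUP] {
    let mut out = [0f32; GROUP];
    for (dst, &v) in out.iter_mut().zip(q) {
        *dst = v as f32 * scale;
    }
    out
}

fn main() {
    let mut x = [0f32; GROUP];
    for (i, v) in x.iter_mut().enumerate() {
        *v = (i as f32 - 16.0) / 4.0;
    }
    let (scale, q) = quantize_q4(&x);
    let y = dequantize_q4(scale, &q);
    // Round-trip error is bounded by half a quantization step.
    for (a, b) in x.iter().zip(&y) {
        assert!((a - b).abs() <= scale * 0.5 + 1e-6);
    }
}
```

The bandwidth connection: at roughly 4 bits plus a shared scale per weight, each decoded token moves about a quarter of the bytes of the BF16 path, which is precisely the lever the quantized backends exploit.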
The codebase is intentionally written to be self-contained and readable without excessive commentary. The structure, module boundaries, and naming aim to make the execution flow understandable directly from the source.
I do not intend to heavily comment or over-explain every implementation detail in this iteration. A more narrative and pedagogical explanation of design decisions, trade-offs, and experimental findings will be provided separately (for instance in long-form articles).
This repository reflects a working exploration rather than a finalized, tutorial-style codebase.
Directions for future iterations include:
- A more product-oriented iteration, with a cleaner separation between experimental and stable components.
- Broader model support beyond the current Qwen3 focus (e.g. GPT-OSS, LFM2, Mistral, and related architectures).
- Further refinement of CPU-focused inference paths, with clearer performance characterization.
- Continued exploration of sparse and dense regimes under local hardware constraints.
This project is released under the MIT License. See the LICENSE file for details.