A ~950-line, minimal, extensible LLM inference engine built from scratch.
NOTE: This repository currently supports ONLY OpenAI/gpt-oss-120b on a single NVIDIA H100. Why start there? A complex target (MoE + large model + reasoning + good hardware) demonstrates that building an engine like this from scratch is viable.
| Component | Lines |
|---|---|
| `llm.py` (engine) | 563 |
| `model/model.py` | 324 |
| `model/tokenizer.py` | 92 |
The codebase is designed to be read, extended, and modified.
SimpleLLM's engine is async by default. Every request goes through a background inference loop that continuously batches work, keeping the GPU saturated and prioritizing throughput.
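A minimal sketch of that pattern (illustrative only: `AsyncEngineSketch` and its internals are hypothetical names, not SimpleLLM's actual code): callers enqueue requests, get standard-library futures back, and a daemon thread drains the queue.

```python
import queue
import threading
from concurrent.futures import Future

class AsyncEngineSketch:
    """Toy illustration of the queue + background-loop + futures pattern."""

    def __init__(self) -> None:
        self._requests: queue.Queue = queue.Queue()
        threading.Thread(target=self._inference_loop, daemon=True).start()

    def generate(self, prompt: str) -> Future:
        future: Future = Future()
        self._requests.put((prompt, future))
        return future  # caller does other work, then calls .result()

    def _inference_loop(self) -> None:
        while True:
            prompt, future = self._requests.get()
            # A real engine batches everything pending and runs one decode
            # step per loop iteration; here we just resolve immediately.
            future.set_result(f"completion for {prompt!r}")
```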
Single H100 80GB, max_tokens=1000:
| Benchmark | SimpleLLM | vLLM |
|---|---|---|
| batch_size=1 | 135 tok/s | 138 tok/s |
| batch_size=64 | 4,041 tok/s | 3,846 tok/s |
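To reproduce a rough number like the batch_size=64 row, a timing sketch along these lines could work (it uses the `generate()` API from the quickstart below and assumes every sequence decodes the full `max_tokens` budget, so the figure is an upper bound):

```python
import time

from llm import LLM

engine = LLM("./gpt-oss-120b")
prompts = ["What is the meaning of life?"] * 64

start = time.perf_counter()
outputs = engine.generate(prompts, max_tokens=1000).result()
elapsed = time.perf_counter() - start

# Assumes each of the 64 sequences decoded all 1000 tokens.
print(f"~{len(prompts) * 1000 / elapsed:.0f} tok/s")
```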
Researchers trying new ideas. If you're experimenting with novel inference techniques, you want a simple but performant starting point that already runs on good hardware with a real model. This gives you continuous batching, CUDA graphs, quantized MoE, and all the modern stuff in code you can actually read and hack on.
Research labs that would otherwise fork vLLM. If your plan is to fork a production engine and strip it down to implement your own kernels or adapt it to your infrastructure, consider starting here instead. It's already stripped down.
Students learning how inference engines work. This is a working implementation of current state-of-the-art techniques in ~950 lines. You can trace through the entire request lifecycle, from tokenization to continuous batching to CUDA graph replay, without getting lost in abstraction layers.
The engine implements several techniques to squeeze every bit of performance from the hardware:
- Async by default: all generation happens in a background thread with a request queue. You submit prompts, get futures back, and the GPU never idles waiting for you.
- Continuous batching: new requests join the active batch mid-generation instead of waiting for the current batch to finish (sketched after this list).
- CUDA graphs: decode steps are captured and replayed as graphs, which eliminates kernel launch overhead (sketched after this list).
- Slot-based KV cache: pre-allocated cache slots enable zero-copy sequence management. When a sequence finishes, its slot is immediately available for new work.
- Fused QKV projections: three matmuls collapsed into one after weight loading (sketched after this list).
- Fused RMSNorm + residual: a Triton kernel that combines normalization and residual addition in a single memory pass (sketched after this list).
- Fused RoPE: position encoding applied in-place via Triton during decode.
- Flash Attention 2: memory-efficient attention for both prefill (variable-length) and decode (with KV cache).
- Paged KV cache: pre-allocated memory pages for KV storage, so sequences can grow without reallocation and completed sequences free their pages instantly.
- GQA (grouped query attention): 8 KV heads shared across 64 query heads, reducing memory bandwidth during decode (sketched after this list).
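Continuous batching with slot-based KV reuse, as a toy sketch (the `Sequence` class, `decode_one_token`, and the slot count are hypothetical stand-ins, not SimpleLLM's internals):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Sequence:
    max_tokens: int
    tokens: list[int] = field(default_factory=list)

    def finished(self) -> bool:
        return len(self.tokens) >= self.max_tokens

def decode_one_token(seq: Sequence, slot: int) -> int:
    return 0  # placeholder for the real batched forward pass

MAX_SLOTS = 64
free_slots = deque(range(MAX_SLOTS))   # pre-allocated KV cache slots
active: dict[int, Sequence] = {}       # slot -> running sequence
pending = deque(Sequence(max_tokens=4) for _ in range(3))

def step() -> None:
    # New requests join the batch mid-generation whenever a slot is free.
    while pending and free_slots:
        active[free_slots.popleft()] = pending.popleft()
    for slot, seq in list(active.items()):
        seq.tokens.append(decode_one_token(seq, slot))
        if seq.finished():
            del active[slot]           # slot is immediately reusable
            free_slots.append(slot)

while pending or active:
    step()
```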
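CUDA graph capture and replay, sketched with PyTorch's public API (the `torch.nn.Linear` stand-in and shapes are made up; the real engine captures full decode steps):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.zeros(64, 1024, device="cuda")

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)   # kernels recorded, not launched

# Each decode step: copy fresh data into the static buffer, replay once.
static_input.copy_(torch.randn(64, 1024, device="cuda"))
graph.replay()                            # single launch, no per-kernel overhead
print(static_output.sum().item())
```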
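Fusing the Q/K/V projections, in miniature (dimensions here are illustrative, not the model's real sizes): the three weight matrices are concatenated once at load time, so each forward pass needs one matmul instead of three.

```python
import torch

d_model, d_q, d_kv = 512, 512, 64
wq = torch.randn(d_q, d_model)
wk = torch.randn(d_kv, d_model)
wv = torch.randn(d_kv, d_model)

# Done once, after weight loading.
w_qkv = torch.cat([wq, wk, wv], dim=0)

x = torch.randn(8, d_model)
qkv = x @ w_qkv.T                               # one matmul instead of three
q, k, v = qkv.split([d_q, d_kv, d_kv], dim=-1)
```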
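A simplified fused residual-add + RMSNorm Triton kernel, to show the single-memory-pass idea (a hedged sketch, not the repository's kernel; it assumes contiguous rows and a power-of-two block size):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_rmsnorm(x_ptr, res_ptr, w_ptr, out_ptr, n_cols, eps,
                      BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    r = tl.load(res_ptr + row * n_cols + cols, mask=mask, other=0.0)
    h = x + r                                 # residual add...
    var = tl.sum(h * h, axis=0) / n_cols
    h = h * tl.rsqrt(var + eps)               # ...and RMSNorm, one pass
    w = tl.load(w_ptr + cols, mask=mask, other=0.0)
    tl.store(out_ptr + row * n_cols + cols, h * w, mask=mask)

x = torch.randn(4, 1024, device="cuda")
res = torch.randn_like(x)
weight = torch.ones(1024, device="cuda")
out = torch.empty_like(x)
fused_add_rmsnorm[(4,)](x, res, weight, out, 1024, 1e-5, BLOCK=1024)
```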
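And the GQA shapes in the decode hot path, shown naively with PyTorch SDPA (FlashAttention handles the 8-way grouping natively without materializing expanded K/V; the `repeat_interleave` here just makes the shapes explicit):

```python
import torch
import torch.nn.functional as F

q_heads, kv_heads, head_dim, kv_len = 64, 8, 64, 128
q = torch.randn(1, q_heads, 1, head_dim)        # one decode token per query head
k = torch.randn(1, kv_heads, kv_len, head_dim)  # 8x fewer KV heads to read
v = torch.randn(1, kv_heads, kv_len, head_dim)

group = q_heads // kv_heads                     # 8 query heads per KV head
out = F.scaled_dot_product_attention(
    q,
    k.repeat_interleave(group, dim=1),          # naive reference expansion
    v.repeat_interleave(group, dim=1),
)
print(out.shape)  # torch.Size([1, 64, 1, 64])
```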
Requires Python 3.12+ and an NVIDIA GPU with CUDA 12.8+.
```bash
./setup.sh
source ./venv/bin/activate
```

```python
from llm import LLM

engine = LLM("./gpt-oss-120b")
outputs = engine.generate(["What is the meaning of life?"], max_tokens=100).result()
print(outputs[0].text)
```

The `kernels/triton_kernels/` directory contains community-contributed Triton kernels (a routing sketch follows the list):
- routing: expert selection and token dispatch
- matmul_ogs: quantized grouped matmuls for MoE
- numerics: MXFP4 quantization/dequantization
- swiglu: fused SwiGLU activation
- topk: efficient top-k selection
- compaction: token compaction for sparse routing
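As a hedged illustration of what the routing path computes (plain PyTorch for clarity, with illustrative expert counts; the real path runs fused Triton kernels over MXFP4-quantized weights):

```python
import torch

num_experts, top_k, d_model, n_tokens = 128, 4, 512, 16
router = torch.nn.Linear(d_model, num_experts, bias=False)

x = torch.randn(n_tokens, d_model)
logits = router(x)
gates, experts = torch.topk(logits, top_k, dim=-1)  # top-k experts per token
gates = torch.softmax(gates, dim=-1)                # normalized combine weights
# Each token is then dispatched to its chosen experts, and the expert
# outputs are summed, weighted by these gates.
```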
Some of these kernels were adapted/copied from open-source libraries like vLLM & Triton.
- Paged attention (potentially slower on a single H100; can be faster when per-user max_tokens varies widely)
- Tensor parallelism for OpenAI/gpt-oss-120b on 8x H100s
- Support for other MoE models
- Support for other architectures
Apache 2.0