A model-specialized CPU inference runtime for Sarvam-1, written in C.
The aim is to explore whether aggressively specializing for Sarvam-1's exact configuration can outperform llama.cpp. Sarvam-1 is the smallest open-weights model published by Sarvam.ai as of March 2026.
Note
Indic text does not render properly in my kitty terminal, and I could not get it to work even with fonts such as Noto.
Clone the repository, change into its directory, and then:
# fetch model weights and export them to the binary blob format
make export
# compile main.c
make
# run
make run "your prompt" <number of tokens>
# optional benchmark harness (ttft + ms/token)
make benchmark-build
make benchmark "your prompt" <number of tokens>
The architecture is identical to what I found in Karpathy's llama2.c, so the forward pass required only two major changes: RoPE is extracted into its own function and applied separately to Q and K, and the precomputed freq_cis tables are replaced with on-the-fly calculation.
The tokeniser was the messiest part. llama2.c assumed ASCII, whereas Sarvam-1 uses SentencePiece with a 68096-token vocabulary built for Indic scripts: UTF-8 multibyte characters, ▁ (U+2581) as the space marker, and [INST]/[/INST] chat template tokens. The encoder walks UTF-8 codepoints, substitutes spaces with ▁, falls back to <0xHH> hex tokens for unknown bytes, and then runs BPE merges; the decoder maps ▁ back to spaces on output.
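The encoder steps before the BPE merges can be sketched roughly as follows. This is an illustrative sketch, not the code in main.c: `utf8_len`, `vocab_find`, and `pre_tokenize` are hypothetical names, and the linear vocabulary scan stands in for whatever lookup structure the real tokeniser uses.

```c
#include <stdio.h>
#include <string.h>

/* Byte length of the UTF-8 sequence that starts with lead byte c.
 * Invalid lead bytes are treated as single bytes. */
int utf8_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Linear vocabulary lookup (hypothetical); returns the token id or -1. */
int vocab_find(const char *const *vocab, int vocab_size, const char *piece)
{
    for (int i = 0; i < vocab_size; i++)
        if (strcmp(vocab[i], piece) == 0) return i;
    return -1;
}

/* Produce the initial token ids, one per codepoint, before any BPE merge.
 * Returns the number of ids written to out (at most max_out). */
int pre_tokenize(const char *text, const char *const *vocab, int vocab_size,
                 int *out, int max_out)
{
    int n = 0;
    size_t len = strlen(text);
    for (size_t i = 0; i < len; ) {
        char piece[8];
        size_t clen = (size_t)utf8_len((unsigned char)text[i]);
        if (i + clen > len) clen = 1;          /* truncated sequence */
        if (text[i] == ' ') {
            strcpy(piece, "\xE2\x96\x81");     /* U+2581 space marker */
        } else {
            memcpy(piece, text + i, clen);
            piece[clen] = '\0';
        }
        int id = vocab_find(vocab, vocab_size, piece);
        if (id >= 0) {
            if (n < max_out) out[n++] = id;
        } else {
            /* unknown codepoint: fall back to one <0xHH> token per byte */
            for (size_t b = 0; b < clen; b++) {
                char hex[8];
                snprintf(hex, sizeof hex, "<0x%02X>",
                         (unsigned)(unsigned char)text[i + b]);
                id = vocab_find(vocab, vocab_size, hex);
                if (id >= 0 && n < max_out) out[n++] = id;
            }
        }
        i += clen;
    }
    return n;
}
```

The BPE merge loop would then run over the ids in `out`, repeatedly joining the best-scoring adjacent pair. Decoding reverses the first step by mapping ▁ back to a space.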
The goal is to be the fastest CPU-only runtime for Sarvam-1, and to prove it with numbers.
- Integrate OpenBLAS
- Fuse kernels
- Implement Q8 quantization
Email: sbcharjee.acad@gmail.com
Discord: Join the server

