
jacksonsc007/Qwen3.Ink.Cpp


Introduction

Qwen3.Ink.Cpp is a study-oriented repository that reproduces the Qwen3-8B model using pure C++.

This project integrates the quantization methods and optimized QGEMM (quantized GEMM) kernels from GGML with the SIMD-aware weight-packing strategy proposed in AWQ. By combining these techniques, it aims to get the best of both worlds.
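To make the packing idea concrete, here is a minimal, hypothetical C++ sketch of nibble-interleaved 4-bit packing; the exact layout used by this repository and by AWQ's kernels may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of SIMD-aware 4-bit packing (not the exact layout
// used by this repository). 32 unsigned 4-bit weights are stored in
// 16 bytes with elements 0..15 in the low nibbles and 16..31 in the
// high nibbles, so a SIMD kernel can recover each half with a single
// mask (x & 0x0F) or shift (x >> 4) over a 128-bit register.
void pack_q4(const uint8_t w[32], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)((w[i] & 0x0F) | ((w[i + 16] & 0x0F) << 4));
}

void unpack_q4(const uint8_t in[16], uint8_t w[32]) {
    for (int i = 0; i < 16; ++i) {
        w[i]      = in[i] & 0x0F;  // low nibbles -> weights 0..15
        w[i + 16] = in[i] >> 4;    // high nibbles -> weights 16..31
    }
}

int main() {
    uint8_t w[32], packed[16], back[32];
    for (int i = 0; i < 32; ++i) w[i] = (uint8_t)(i & 0x0F);
    pack_q4(w, packed);
    unpack_q4(packed, back);
    for (int i = 0; i < 32; ++i) assert(back[i] == w[i]);  // round-trip
}
```

The point of interleaving is that the unpack loop maps directly onto one vector AND and one vector shift, rather than per-element shuffles.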

Compared with the Qwen3-8B implementation in llama.cpp, this project delivers higher throughput during the prompt-processing phase, while remaining comparable (slightly lower) during the autoregressive generation phase.

[Warning] Please note that the benchmark results are highly dependent on my personal hardware and may not be reproducible on other machines.

Demo video: comparison.mp4

Benchmark

Experiment Settings

| Component | Specification |
| --- | --- |
| Operating System | Ubuntu 22.04 |
| CPU | Intel Core i5-13600KF |
| DRAM | 3600 MT/s, dual-channel |

Performance

| model | quantization scheme | perplexity | GSM8K |
| --- | --- | --- | --- |
| qwen3-8b | fp16 | 10.97 | 87.57 |
| qwen3-8b | w41 | 12.08 | 86.20 |
| qwen3-8b | w40 | 11.73 | 84.99 |
| qwen3-8b | w4z | 11.49 | 85.52 |
| qwen3-8b | A80W41 | 12.06 | 87.03 |
| qwen3-8b | A80W4z | 11.53 | 85.29 |
| qwen3-8b-awq | - | 11.52 | 86.35 |

Efficiency

| model | mem | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- |
| qwen3 8B Q4_0 | 8.6 GiB | CPU | 16 | pp322 | 60.36 ± 0.23 |
| qwen3 8B Q4_0 | 8.6 GiB | CPU | 16 | tg128 | 10.40 ± 0.00 |
| qwen3 8B A81Q41-repack-FP16_FP32_mix ink | 14.0 GiB | CPU | 16 | pp322 | 134.35 |
| qwen3 8B A81Q41-repack-FP16_FP32_mix ink | 14.0 GiB | CPU | 16 | tg128 | 9.26 |
  • We used llama-bench to benchmark the llama.cpp models, and our own code with custom input to benchmark our model, so this is not a strictly apples-to-apples comparison.
  • tg128 denotes generating 128 tokens in the autoregressive generation phase; pp322 denotes processing an input prompt of 322 tokens. A sketch of how such figures could be measured follows this list.
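As a reference for how a tg128-style tokens-per-second number could be computed, here is a minimal timing sketch; generate_one_token() is a hypothetical stub standing in for the real decoding step and is not part of this repository's API.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Stub standing in for one autoregressive decoding step; the real
// function in this repository is assumed, not shown here.
static void generate_one_token() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

// Tokens per second for a tg128-style run: generated tokens divided
// by wall-clock time.
static double measure_tps(int n_tokens) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_tokens; ++i) generate_one_token();
    auto t1 = std::chrono::steady_clock::now();
    return n_tokens / std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::printf("tg128: %.2f t/s\n", measure_tps(128));
}
```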

Installation

Before setting up the environment, clone the repository and its submodules:

git clone https://github.com/jacksonsc007/Qwen3.Ink.Cpp.git  
cd Qwen3.Ink.Cpp
git submodule update --init --recursive

Python Environment

We recommend using uv to set up the Python environment quickly:

uv sync

Deployment

Performing Quantization on FP32/FP16 Models

Follow the instructions in the quantization notebooks, e.g. save_A81W41_quantized_weight-qwen3_8b.ipynb and save_A80W40_quantized_weight-qwen3_8b.ipynb (see the Specification section below).
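To convey the core idea behind the weight-quantization step, here is a hedged C++ sketch of symmetric group-wise 4-bit quantization with group size 32; the repository's actual schemes (w40/w41/w4z, implemented in quantize_methods.py and the notebooks) may differ in zero-point handling, group size, and storage layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: symmetric group-wise 4-bit quantization with
// group size 32. Each group stores one fp32 scale plus 32 signed
// 4-bit codes in [-8, 7]; dequantization is w ~= scale * q.
struct QuantGroup {
    float  scale;
    int8_t q[32];
};

QuantGroup quantize_group(const float w[32]) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(w[i]));
    QuantGroup g;
    g.scale = amax / 7.0f;  // map the largest magnitude to code 7
    const float inv = g.scale > 0.0f ? 1.0f / g.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        int v = (int)std::lround(w[i] * inv);   // round to nearest
        g.q[i] = (int8_t)std::clamp(v, -8, 7);  // saturate to int4 range
    }
    return g;
}

float dequantize(const QuantGroup& g, int i) { return g.scale * g.q[i]; }
```

In practice the 4-bit codes would then be packed two per byte, as sketched in the Introduction.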

Building Qwen3 with C++

Run the build script:

bash build_qwen.sh

Chatting with the Model ε≡٩(๑>₃<)۶

Once built, you can start interacting with the model:

build/chat

Specification

Here is an overview of the essential files:

| File Name | Description |
| --- | --- |
| evaluate-qwen3_8b_W4.ipynb | Evaluates the impact of weight-only quantization on Qwen3's performance. Perplexity on WikiText and the benchmark result on GSM8K are reported. |
| evaluate-qwen3_8b_A8W4.ipynb | Evaluates the impact of activation and weight quantization on Qwen3's performance. Perplexity on WikiText and the benchmark result on GSM8K are reported. |
| save_A81W41_quantized_weight-qwen3_8b.ipynb | Applies A81W41 quantization to the FP32 model and saves the quantized weights and metadata to disk. |
| save_A80W40_quantized_weight-qwen3_8b.ipynb | Applies A80W40 quantization to the FP32 model and saves the quantized weights and metadata to disk. |
| quantize_methods.py | Contains the core quantization methods used throughout the repository. |

Acknowledgements

This repository was heavily inspired by and built upon the following resources:

Courses & Tutorials:

  1. TinyML and Efficient Deep Learning Computing
  2. TinyChatEngine
  3. TinychatTutorial

Much credit goes to Professor Han for his open-source spirit and wonderful lectures.

Foundational Repositories:

  1. sgemm.c
  2. GGML
  3. llama.cpp
  4. AWQ
  5. quantized-gemm

Additional Resources

Much appreciation to Re:Zero − Starting Life in Another World (Re:ゼロから始める異世界生活) for providing the benchmark text during development.
