
jacksonsc007/Qwen3.Ink.Cpp


Introduction

Qwen3.Ink.Cpp is a study-oriented repository that reproduces the Qwen3-8B model using pure C++.

This project integrates the quantization methods and optimized QGEMM (quantized GEMM) kernels from GGML with the SIMD-aware weight-packing strategy proposed in AWQ. By combining these techniques, it aims to get the best of both worlds.
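To make the packing idea concrete, here is a minimal, hypothetical C++ sketch of nibble-interleaved 4-bit packing; the exact layout used by this repository and by AWQ's kernels may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of SIMD-aware 4-bit packing (not the exact layout
// used by this repository). 32 unsigned 4-bit weights are stored in
// 16 bytes with elements 0..15 in the low nibbles and 16..31 in the
// high nibbles, so a SIMD kernel can recover each half with a single
// mask (x & 0x0F) or shift (x >> 4) over a 128-bit register.
void pack_q4(const uint8_t w[32], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)((w[i] & 0x0F) | ((w[i + 16] & 0x0F) << 4));
}

void unpack_q4(const uint8_t in[16], uint8_t w[32]) {
    for (int i = 0; i < 16; ++i) {
        w[i]      = in[i] & 0x0F;  // low nibbles -> weights 0..15
        w[i + 16] = in[i] >> 4;    // high nibbles -> weights 16..31
    }
}

int main() {
    uint8_t w[32], packed[16], back[32];
    for (int i = 0; i < 32; ++i) w[i] = (uint8_t)(i & 0x0F);
    pack_q4(w, packed);
    unpack_q4(packed, back);
    for (int i = 0; i < 32; ++i) assert(back[i] == w[i]);  // round-trip
}
```

The point of interleaving is that the unpack loop maps directly onto one vector AND and one vector shift, rather than per-element shuffles.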

Compared with the Qwen3-8B implementation in llama.cpp, this project delivers higher throughput during the prompt-processing phase, while remaining comparable (slightly lower) during the autoregressive generation phase.

[Warning] Please note that the benchmark results are highly dependent on my personal hardware and may not be reproducible on other machines.

Demo video: comparison.mp4

Benchmark

Experiment Settings

| Component | Specification |
| --- | --- |
| Operating System | Ubuntu 22.04 |
| CPU | Intel Core i5-13600KF |
| DRAM | 3600 MT/s, dual-channel |

Performance

| model | quantization scheme | perplexity | GSM8K |
| --- | --- | --- | --- |
| qwen3-8b | fp16 | 10.97 | 87.57 |
| qwen3-8b | w41 | 12.08 | 86.20 |
| qwen3-8b | w40 | 11.73 | 84.99 |
| qwen3-8b | w4z | 11.49 | 85.52 |
| qwen3-8b | A80W41 | 12.06 | 87.03 |
| qwen3-8b | A80W4z | 11.53 | 85.29 |
| qwen3-8b-awq | - | 11.52 | 86.35 |

Efficiency

| model | mem | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- |
| qwen3 8B Q4_0 | 8.6 GiB | CPU | 16 | pp322 | 60.36 ± 0.23 |
| qwen3 8B Q4_0 | 8.6 GiB | CPU | 16 | tg128 | 10.40 ± 0.00 |
| qwen3 8B A81Q41-repack-FP16_FP32_mix ink | 14.0 GiB | CPU | 16 | pp322 | 134.35 |
| qwen3 8B A81Q41-repack-FP16_FP32_mix ink | 14.0 GiB | CPU | 16 | tg128 | 9.26 |
  • We used llama-bench to benchmark the llama.cpp models, and our own code with custom input to benchmark our model, so this is not a strictly apples-to-apples comparison.
  • tg128 denotes generating 128 tokens in the autoregressive generation phase; pp322 denotes processing an input prompt of 322 tokens. A sketch of how such figures could be measured follows this list.
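As a reference for how a tg128-style tokens-per-second number could be computed, here is a minimal timing sketch; generate_one_token() is a hypothetical stub standing in for the real decoding step and is not part of this repository's API.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Stub standing in for one autoregressive decoding step; the real
// function in this repository is assumed, not shown here.
static void generate_one_token() {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

// Tokens per second for a tg128-style run: generated tokens divided
// by wall-clock time.
static double measure_tps(int n_tokens) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_tokens; ++i) generate_one_token();
    auto t1 = std::chrono::steady_clock::now();
    return n_tokens / std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    std::printf("tg128: %.2f t/s\n", measure_tps(128));
}
```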

Installation

Before setting up the environment, clone the repository and its submodules:

git clone https://github.com/jacksonsc007/Qwen3.Ink.Cpp.git  
cd Qwen3.Ink.Cpp
git submodule update --init --recursive

Python Environment

We recommend using uv to set up the Python environment quickly:

uv sync

Deployment

Performing Quantization on FP32/FP16 Models

Follow the instructions in the quantization notebooks, e.g. save_A81W41_quantized_weight-qwen3_8b.ipynb and save_A80W40_quantized_weight-qwen3_8b.ipynb (see the Specification section below).
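To convey the core idea behind the weight-quantization step, here is a hedged C++ sketch of symmetric group-wise 4-bit quantization with group size 32; the repository's actual schemes (w40/w41/w4z, implemented in quantize_methods.py and the notebooks) may differ in zero-point handling, group size, and storage layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: symmetric group-wise 4-bit quantization with
// group size 32. Each group stores one fp32 scale plus 32 signed
// 4-bit codes in [-8, 7]; dequantization is w ~= scale * q.
struct QuantGroup {
    float  scale;
    int8_t q[32];
};

QuantGroup quantize_group(const float w[32]) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(w[i]));
    QuantGroup g;
    g.scale = amax / 7.0f;  // map the largest magnitude to code 7
    const float inv = g.scale > 0.0f ? 1.0f / g.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        int v = (int)std::lround(w[i] * inv);   // round to nearest
        g.q[i] = (int8_t)std::clamp(v, -8, 7);  // saturate to int4 range
    }
    return g;
}

float dequantize(const QuantGroup& g, int i) { return g.scale * g.q[i]; }
```

In practice the 4-bit codes would then be packed two per byte, as sketched in the Introduction.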

Building Qwen3 with C++

Run the build script:

bash build_qwen.sh

Chatting with the Model ε≡٩(๑>₃<)۶

Once built, you can start interacting with the model:

build/chat

Specification

Here is an overview of the essential files:

| File Name | Description |
| --- | --- |
| evaluate-qwen3_8b_W4.ipynb | Evaluates the impact of weight-only quantization on Qwen3's performance. Perplexity on WikiText and the benchmark result on GSM8K are reported. |
| evaluate-qwen3_8b_A8W4.ipynb | Evaluates the impact of activation and weight quantization on Qwen3's performance. Perplexity on WikiText and the benchmark result on GSM8K are reported. |
| save_A81W41_quantized_weight-qwen3_8b.ipynb | Applies A81W41 quantization to the FP32 model and saves the quantized weights and metadata to disk. |
| save_A80W40_quantized_weight-qwen3_8b.ipynb | Applies A80W40 quantization to the FP32 model and saves the quantized weights and metadata to disk. |
| quantize_methods.py | Contains the core quantization methods used throughout the repository. |

Acknowledgements

This repository was heavily inspired by and built upon the following resources:

Courses & Tutorials:

  1. TinyML and Efficient Deep Learning Computing
  2. TinyChatEngine
  3. TinychatTutorial

Much credit goes to Professor Han for his open-source spirit and wonderful lectures.

Foundational Repositories:

  1. sgemm.c
  2. GGML
  3. llama.cpp
  4. AWQ
  5. quantized-gemm

Additional Resources

Much appreciation to Re:Zero − Starting Life in Another World (Re:ゼロから始める異世界生活) for providing the benchmark text during development.
