Tiny C++ LLM inference implementation from scratch.

Supported models:

- GPT-2
- Llama3.2
- Qwen2.5
- Qwen3
- Mistral

Features:

- Fast BPE tokenizer, inspired by tiktoken.
- CPU/CUDA inference.
- FP32/FP16/BF16 inference.
- KV Cache (see the sketch below)
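The KV cache stores the attention keys and values of every token processed so far, so each decoding step only computes the projections for the newest token instead of re-running the whole prefix. Below is a minimal sketch of the idea; the struct and its names are illustrative, not TinyGPT's actual classes.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch only -- not TinyGPT's actual API.
// One cache per layer (and head group): grows by one
// (key, value) row for every processed token.
struct KVCache {
  std::size_t headDim;
  std::vector<float> keys;    // [numTokens, headDim], row-major
  std::vector<float> values;  // [numTokens, headDim], row-major

  explicit KVCache(std::size_t dim) : headDim(dim) {}

  // Append the newest token's key/value row; the next attention step
  // reads the whole cache instead of recomputing the prefix.
  void append(const std::vector<float>& k, const std::vector<float>& v) {
    keys.insert(keys.end(), k.begin(), k.end());
    values.insert(values.end(), v.begin(), v.end());
  }

  std::size_t numTokens() const { return headDim ? keys.size() / headDim : 0; }
};
```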
`tinygpt::tokenizer` is faster than both HuggingFace Tokenizers and OpenAI tiktoken. Encoding speed was measured using the `~/benches/tokenizer.py` script on a machine with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz.
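BPE encoding splits text into initial symbols and then repeatedly merges the adjacent pair with the best (lowest) merge rank until no rule applies. A minimal sketch of that inner loop is shown below, assuming a hypothetical `ranks` table loaded from the tokenizer files; this is not the tinygpt::tokenizer implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative greedy BPE merge loop -- not tinygpt::tokenizer's code.
// `ranks` maps a mergeable symbol pair to its priority; lower merges first.
std::vector<std::string> bpeMerge(
    std::vector<std::string> symbols,
    const std::map<std::pair<std::string, std::string>, std::uint32_t>& ranks) {
  while (symbols.size() > 1) {
    std::size_t best = 0;
    std::uint32_t bestRank = std::numeric_limits<std::uint32_t>::max();
    // Find the adjacent pair with the lowest merge rank.
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
      auto it = ranks.find({symbols[i], symbols[i + 1]});
      if (it != ranks.end() && it->second < bestRank) {
        bestRank = it->second;
        best = i;
      }
    }
    if (bestRank == std::numeric_limits<std::uint32_t>::max()) break;  // no rule applies
    symbols[best] += symbols[best + 1];           // merge the pair in place
    symbols.erase(symbols.begin() + best + 1);
  }
  return symbols;
}
```

Fast tokenizers avoid rescanning the whole symbol list on every merge; most of the speed difference between implementations comes from this inner loop and the data structures behind it.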
- Paged Attention (see the sketch after this list)
- Continuous Batching
- CUDA Graph
- Kernel Fusion
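Paged attention, for example, stores the KV cache in fixed-size blocks and keeps a per-sequence block table that maps logical token positions to physical blocks, so cache memory can be allocated on demand and shared between sequences. The snippet below is a rough illustration of the indexing, not TinyGPT's implementation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative paged-KV indexing -- not TinyGPT's implementation.
constexpr std::size_t kBlockSize = 16;  // tokens per physical block (assumed)

struct BlockTable {
  // logical block index -> physical block id in the shared KV pool
  std::vector<std::size_t> blocks;

  // Translate a logical token position into (physical block, in-block offset).
  std::pair<std::size_t, std::size_t> locate(std::size_t tokenPos) const {
    return {blocks[tokenPos / kBlockSize], tokenPos % kBlockSize};
  }
};
```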
To build and run the demo, first clone the repository with submodules:

```bash
git clone --recurse-submodules https://github.com/keith2018/TinyGPT.git
```
Download the model weights you want to run from Hugging Face:

```bash
git clone https://huggingface.co/openai-community/gpt2
git clone https://huggingface.co/meta-llama/Llama-3.2-1B
git clone https://huggingface.co/meta-llama/Llama-3.2-3B
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B
git clone https://huggingface.co/Qwen/Qwen2.5-3B
git clone https://huggingface.co/Qwen/Qwen3-1.7B
git clone https://huggingface.co/mistralai/Mistral-7B-v0.3
```
Once the download succeeds, set the model path in `./demo/demo_gpt.cpp`:

```cpp
const std::string MODEL_DIR = "path to model files (huggingface repo)";
```
```bash
mkdir build
cmake -B ./build -DCMAKE_BUILD_TYPE=Release
cmake --build ./build --config Release
```
This will build the executable and copy assets into the directory `demo/bin`; then you can run the demo:

```bash
cd demo/bin
./TinyGPT_demo
```
Demo output:

```text
[INFO] Load model ...
[INFO] Load model done.
[INFO] Generated Outputs:
[INFO] ------------------------------------------------------------
[INFO] Prompt: 'Hello, my name is'
[INFO] Output: ' Max! I am Phelan and I'm the world's greatest magician! I am the world's greatest magician! You are the world's greatest magician! You'
[INFO] ------------------------------------------------------------
[INFO] Prompt: 'The president of the United States is'
[INFO] Output: ' on a temporary trip to Asia, and the Pentagon has made several announcements about what's next for the war on terror.\n\nThe next day, General Martin Dempsey'
[INFO] ------------------------------------------------------------
[INFO] Prompt: 'The capital of France is'
[INFO] Output: ' located in the eastern part of the country, so it is very easy to find houses in this part of the country. The most important houses are in Paris, and'
[INFO] ------------------------------------------------------------
[INFO] Prompt: 'The future of AI is'
[INFO] Output: ' forever. Our time is now.\n\n\nSequel to the game, The Mighty Ducks is available on Android and iOS, and a new Android app is also coming'
[INFO] ------------------------------------------------------------
[INFO] Time cost: 1907 ms, speed: 83.90 token/s
```
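For reference, the reported speed is generated tokens divided by wall-clock time: 83.90 tokens/s × 1.907 s ≈ 160 tokens across the four prompts.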
The tokenizer can also be used from Python:

```python
# pip install .
import tinygpt

enc = tinygpt.Tokenizer()
enc.init_with_config("tokenizer.json", "tokenizer_config.json")
ids = enc.encode("This is a test")
```
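`tokenizer.json` and `tokenizer_config.json` here are the standard HuggingFace tokenizer files, such as those shipped in the model repos cloned above.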
Third-party dependencies:

- Tensor: [TinyTorch](https://github.com/keith2018/TinyTorch)
- JsonParser: [RapidJSON](https://github.com/Tencent/rapidjson)
- Regex
- Unicode
- HashMap: [ankerl::unordered_dense](https://github.com/martinus/unordered_dense)
- ConcurrentQueue: [moodycamel::ConcurrentQueue](https://github.com/cameron314/concurrentqueue)
This code is licensed under the MIT License (see LICENSE).