Static suckless single batch qwen3-0.6B mini inference engine

qwen600.cu

While studying and practicing CUDA & GPGPU, I thought: why not build an inference engine from scratch? So I chose the Qwen3-0.6B model, a model small enough to run smoothly on my RTX 3050 with 8GB of VRAM. My intention was (and still is) to build an educational program for learning about LLMs & transformers while keeping my CUDA programming in practice.

Now includes an AMD ROCm/HIP version for AMD GPUs! πŸš€

I'm introducing a static mini inference engine for the Qwen3-0.6B instruct model in bf16. My CUDA benchmarks show it to be roughly 8.5% faster than llama.cpp and about 292% faster than HF with flash-attn in tokens/sec; see the benchmarks below. Performance testing of the AMD HIP version is still pending.


What qwen600 includes:

  • single batch inference engine
  • statically configured with compile-time constants for optimization
  • CUDA version: all CUDA C/C++, minimal libraries (cuBLAS, CUB, std IO)
  • HIP version: AMD ROCm/HIP support, minimal libraries (hipBLAS, hipCUB, std IO)
  • no python dependencies (except for tokenizer setup)
  • efficient memory pipeline: mmap, single GPU block, async copy
  • zero-cost pointer-based weight management on GPU (see the sketch after this list)
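
A minimal sketch of what those last two points could look like, with hypothetical names, paths, and offsets (the real loader in the repo will differ):

// Hypothetical sketch of the memory pipeline: mmap the safetensors file on the
// host, copy it into ONE device allocation, then hand out raw pointers into that
// block, one per tensor. Names and offsets here are illustrative only.
#include <cuda_runtime.h>
#include <cuda_bf16.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

struct Weights {
    const __nv_bfloat16* tok_emb;   // [VOCAB, DIM]
    const __nv_bfloat16* layer0_wq; // [DIM, DIM]
    // ... one pointer per tensor, all aliasing the same device block
};

int main() {
    int fd = open("model.safetensors", O_RDONLY);   // assumed path
    struct stat st; fstat(fd, &st);

    // 1) mmap: zero-copy, read-only view of the checkpoint on the host
    void* host = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    // 2) single GPU block: one cudaMalloc holds every weight tensor
    void* dev = nullptr;
    cudaMalloc(&dev, st.st_size);

    // 3) async copy on a stream, so other setup work can overlap the transfer
    cudaStream_t stream; cudaStreamCreate(&stream);
    cudaMemcpyAsync(dev, host, st.st_size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // 4) "zero-cost" weight management: each tensor is just a pointer offset
    Weights w;
    w.tok_emb   = reinterpret_cast<const __nv_bfloat16*>((char*)dev + 0    /* illustrative offset */);
    w.layer0_wq = reinterpret_cast<const __nv_bfloat16*>((char*)dev + 4096 /* illustrative offset */);

    munmap(host, st.st_size); close(fd);
    cudaFree(dev);
    return 0;
}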

qwen600 is inspired by:

[architecture diagram]

Design Philosophy

  • The design of qwen600.cu is heavily inspired by the suckless philosophy.
  • The goal is to create a tool that is simple, minimalist, and highly performant by avoiding feature bloat and unnecessary abstractions.
  • Configuration is done directly in the source code (config.h) as much as possible, and dependencies are kept to an absolute minimum (see the sketch below).
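
For illustration, a config.h in this spirit pins the model shape as compile-time constants. The names below are assumptions (not necessarily those used in the repo), and the values should match Qwen3-0.6B's published config.json, but double-check them against the actual file:

// Hypothetical config.h in the suckless spirit: everything the kernels need is a
// compile-time constant, so loops can be unrolled and buffers sized statically.
#pragma once

constexpr int   N_LAYERS   = 28;       // transformer blocks
constexpr int   DIM        = 1024;     // hidden size
constexpr int   N_HEADS    = 16;       // query heads
constexpr int   N_KV_HEADS = 8;        // key/value heads (GQA)
constexpr int   HEAD_DIM   = 128;      // per-head dimension
constexpr int   FFN_DIM    = 3072;     // MLP intermediate size
constexpr int   VOCAB_SIZE = 151936;   // tokenizer vocabulary
constexpr float RMS_EPS    = 1e-6f;    // RMSNorm epsilon
constexpr float ROPE_THETA = 1e6f;     // RoPE base frequency

// Because these are constexpr, launch geometry and shared-memory sizes can be
// fixed at compile time instead of being read from the checkpoint at runtime.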

WANNA TRY?!

Quick Start

Version   GPU      Build Command                                            Run Command
CUDA      NVIDIA   mkdir build && cd build && cmake .. && make -j$(nproc)   ./qwen600 <model_dir> -r 1
HIP       AMD      ./build_hip.sh                                           ./build_hip/qwen600_hip <model_dir> -r 1

Initial Setup

First, you need to clone QWEN3-0.6B. The Hugging Face documentation has a fantastic guide to cloning HF repos.

Then, as a safety check, locate the weights file (model.safetensors) and compute its checksum:

sha256sum <model_dir>/<safetensors-file-name>

The output must match the hash published on Hugging Face:

f47f71177f32bcd101b7573ec9171e6a57f4f4d31148d38e382306f42996874b

After that:

git clone https://github.com/yassa9/qwen600
cd qwen600

Assume that the downloaded Hugging Face directory is <model_dir>.

We convert the Hugging Face tokenizer into the format used by qwen600.

# IMPORTANT: <model_dir> must be a LOCAL directory (not a repo ID)
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-0.6B', local_dir='Qwen3-0.6B')"
python export.py Qwen3-0.6B

This writes tokenizer.bin and template_*.txt into <model_dir>. If you pass a repo ID like Qwen/Qwen3-0.6B, you will get an error because the script treats the argument as a filesystem path.

Building qwen600

Now we are ready to build! You just need:

  • CUDA + nvcc
  • cuBLAS + CUB
mkdir build && cd build
cmake .. && make -j$(nproc)

Just that simple: no other bulky libraries or dependencies to build.

Moment of Truth: Running the Model

You can see the arguments manual by running:

# you are now inside qwen600/build
./qwen600

The output is the following manual:

usage:   ./qwen600 <model_dir> [options]
example: ./qwen600 <model_dir> -r 1
model directory must contain:
  - model.safetensors
  - tokenizer.bin
  - template_*.txt files

arguments:
----------
  -r <int>    reasoning mode, 0 (default) = no thinking, 1 = thinking
  -s <int>    random seed, default
  -k <int>    k value in top-k sampling, default 20
  -t <float>  temperature in [0,inf], default 0.6
  -p <float>  p value in top-p (nucleus) sampling in [0,1], default 0.95
  -i <string> input prompt
  -y <string> system prompt in chat mode, default is none

For example:

./qwen600 <model_dir> -r 1 -t 0.65 -p 0.9 -k 20

or simply going with defaults:

./qwen600 <model_dir> -r 1

The official Hugging Face model card advises:

- For thinking mode (enable_thinking=True), use Temperature=0.6, TopP=0.95, TopK=20. 
- DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
- For non-thinking mode (enable_thinking=False), we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
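
As a rough sketch of how those three knobs interact, here is a generic temperature / top-k / top-p sampler for a single token; it is not lifted from qwen600's sampler, just an illustration of the mechanism the -t, -k, and -p flags control:

// Generic single-token sampler: temperature -> softmax -> top-k -> top-p -> draw.
// Illustrative only; qwen600's actual sampler may differ in structure and precision.
#include <algorithm>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

int sample_token(std::vector<float> logits, float temperature, int top_k, float top_p,
                 std::mt19937& rng) {
    const int V = (int)logits.size();

    // temperature scaling (temperature == 0, i.e. greedy argmax, is not handled here)
    for (float& l : logits) l /= temperature;

    // numerically stable softmax over the full vocabulary
    float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<std::pair<float, int>> probs(V);
    float sum = 0.f;
    for (int i = 0; i < V; ++i) { probs[i] = { std::exp(logits[i] - max_l), i }; sum += probs[i].first; }
    for (auto& pi : probs) pi.first /= sum;

    // top-k: keep only the k most probable tokens, sorted descending
    std::partial_sort(probs.begin(), probs.begin() + top_k, probs.end(),
                      [](const auto& a, const auto& b) { return a.first > b.first; });
    probs.resize(top_k);

    // top-p (nucleus): keep the smallest prefix whose cumulative mass reaches p
    float cum = 0.f; int cut = top_k;
    for (int i = 0; i < top_k; ++i) { cum += probs[i].first; if (cum >= top_p) { cut = i + 1; break; } }
    probs.resize(cut);

    // renormalize the survivors and draw one
    float mass = 0.f; for (auto& pi : probs) mass += pi.first;
    std::uniform_real_distribution<float> dist(0.f, mass);
    float r = dist(rng), acc = 0.f;
    for (auto& pi : probs) { acc += pi.first; if (r <= acc) return pi.second; }
    return probs.back().second;
}

Note that with -r 1 and no extra flags, the defaults listed in the manual above (-t 0.6, -p 0.95, -k 20) are exactly the recommended thinking-mode settings.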

Some Experiments

Without THINKING

./qwen600 <model_dir> -r 0

>> what is capital of Greece ?

The capital of Greece is Athens

[231.71 tk/s, 19 tokens in 0.08s]

>> tell me interesting fact about whales ?

One of the most fascinating facts about whales is that they are the only mammals that can swim upside down. This ability allows them to navigate through the ocean in a unique way, using their tails as a natural propeller

[632.31 tk/s, 227 tokens in 0.36s]

With the REASONING MODE, aka THINKING

ex01_

./qwen600 <model_dir> -r 1

>> what are llms used for ?

Okay, the user is asking what LLMs (Large Language Models) are used for. Let me start by recalling what I know about them. LLMs are a type of AI model that can understand and generate human language. They're used in various fields because they can process a lot of text and perform tasks like answering questions, generating content, and even creating stories.

First, I should mention their primary applications. Like, in customer service, where they can help with queries. Then, maybe some examples like writing articles, generating code, or even creating content. Also, I should think about how they're used in different industries. For instance, healthcare, finance, education, etc.

Wait, are there any specific industries where LLMs are particularly used? Maybe in customer service for chatbots, or in creative fields like art or music. Also, in research, maybe for data analysis or information retrieval. I need to make sure the answer is comprehensive but not too technical. Let me structure it step by step: definition, applications, industries, examples, and maybe a brief mention of limitations or challenges.

I should also check if there's any recent advancements in LLMs that I should include. For example, the development of more powerful models that can handle more complex tasks. But maybe that's beyond the current scope. The user might be looking for a clear and concise answer. Let me make sure to cover key areas without getting too detailed.

now the response:

Large Language Models (LLMs) are advanced AI systems designed to understand and generate human language. They are used in a variety of applications across different industries to improve efficiency, accuracy, and accessibility of information. Here are some key uses:

  1. Customer Service and Chatbots: LLMs help in answering questions, resolving customer issues, and providing real-time assistance. For example, chatbots can handle inquiries, resolve complaints, and even provide product recommendations.
  2. Writing and Content Creation: They can generate articles, creative content, and even code. For instance, they might assist in writing research papers, creating marketing content, or generating creative scripts.
  3. Data Analysis and Research: LLMs analyze large datasets, extract insights, and help with information retrieval. They are useful in fields like finance, healthcare, and education for tasks like summarizing data, identifying patterns, and generating reports.
  4. Creative Industries: In areas like art, music, and design, LLMs can generate original content, music, and visual art, assisting artists and producers.
  5. Education: They support personalized learning, provide explanations, and assist in language learning by generating interactive content.
  6. Healthcare: LLMs can help diagnose medical conditions, explain treatments, and support patient education.

Limitations: While LLMs are powerful, they may face challenges in handling complex tasks, understanding context, or adapting to new information. However, ongoing advancements in training and model architecture continue to expand their capabilities

[111.44 tk/s, 604 tokens in 5.42s]
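
(As a sanity check on how the throughput figure is computed: 604 tokens / 5.42 s ≈ 111.4 tk/s, matching the number reported above.)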

Benchmarking

CUDA Version Benchmarks

These benchmarks were run on the same machine:

  • RTX 3050 8GB + CUDA 13.0
  • AMD Ryzen 5 3500
  • 16GB RAM
  • Void Linux

Every number here is the mean of 5 consecutive runs with the same parameters, with temp = 0 for greedy decoding, done manually (no scripts).

Every test uses the same question, what are llms used for ?, in THINKING mode.

inference engine    tokens/sec
hf + flash-attn          29.57
llama.cpp               107.19
qwen600 (CUDA)          116.15

AMD HIP Version Benchmarks

⚠️ AMD benchmarking not yet completed - Performance metrics pending testing on AMD hardware.

NOTE: As I mentioned earlier, this is an EDUCATIONAL project for me. I'm not aiming to win a race, but I think the difference is caused by static compile-time optimizations and some other tweaks and tricks.

AMD ROCm/HIP Version

This repository now includes an AMD ROCm/HIP version for AMD GPUs! The HIP version keeps the same API as the CUDA version and aims for the same performance characteristics while enabling execution on AMD hardware.

πŸš€ HIP Version Features:

  • Full CUDA to HIP conversion - all kernels converted to AMD ROCm/HIP (see the sketch after this list)
  • Multi-GPU architecture support - gfx906, gfx908, gfx90a, gfx1030, gfx1100
  • Optimized softmax implementation with shared memory
  • BFloat16 support with conversion utilities
  • Same API compatibility as the CUDA version
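
To give a feel for what the conversion involves (a generic example, not code taken from this repo), the kernel bodies stay the same and only the runtime calls and headers are renamed:

// Generic CUDA -> HIP mapping sketch; each comment notes the CUDA equivalent.
#include <hip/hip_runtime.h>                            // CUDA: <cuda_runtime.h>

__global__ void scale_kernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // identical in CUDA and HIP
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    hipMalloc((void**)&d, n * sizeof(float));           // CUDA: cudaMalloc
    hipMemset(d, 0, n * sizeof(float));                 // CUDA: cudaMemset
    scale_kernel<<<(n + 255) / 256, 256>>>(d, 2.f, n);  // launch syntax works under hipcc
    hipDeviceSynchronize();                             // CUDA: cudaDeviceSynchronize
    hipFree(d);                                         // CUDA: cudaFree
    return 0;
}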

πŸ“ HIP Files:

  • main.hip - AMD ROCm/HIP main application
  • qwen_model.hip.h - HIP model implementation
  • static_loader.hip.h - HIP weight loading utilities
  • CMakeLists.hip.txt - HIP build configuration
  • build_hip.sh / build_hip.bat - Build scripts
  • README_HIP.md - Detailed HIP documentation

πŸ› οΈ Building HIP Version:

Prerequisites:

  • AMD ROCm/HIP (version 5.0+)
  • CMake 3.20+
  • AMD GPU with ROCm support

Quick Build:

# Linux/macOS
chmod +x build_hip.sh
./build_hip.sh

# Windows
build_hip.bat

Targeting your AMD GPU architecture (ROCm)

  • Detect your GPU architecture string (e.g. gfx942, gfx1030):
    rocminfo | grep -m1 -E 'Name:.*gfx' || /opt/rocm/bin/rocminfo | grep -m1 -E 'Name:.*gfx'
  • You can rebuild specifying the exact arch without editing files:
    cd build_hip
    cmake -S . -B . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_HIP_ARCHITECTURES="gfx942"
    make -j$(nproc)
  • Or edit CMakeLists.hip.txt and set:
    set(CMAKE_HIP_ARCHITECTURES "gfx942")
    target_compile_options(qwen600_hip PRIVATE $<$<COMPILE_LANGUAGE:HIP>:--offload-arch=gfx942>)

hipCUB note

hipCUB is header-only on many ROCm installs. If you hit a link error like -lhipcub not found, this repo’s CMakeLists.hip.txt already uses header includes and does not link against hipcub.

Manual Build:

mkdir build_hip && cd build_hip
cmake -f ../CMakeLists.hip.txt \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CXX_COMPILER=hipcc \
    -DCMAKE_HIP_ARCHITECTURES="gfx90a" \
    ..
make -j$(nproc)

🎯 Usage:

# Same API as CUDA version
./build_hip/qwen600_hip <model_dir> -r 1 -t 0.7 -p 0.9

# Example
./build_hip/qwen600_hip ./models/qwen600 -r 1

Notes:

  • <model_dir> must contain model.safetensors, tokenizer.bin, and template_*.txt.
  • To generate tokenizer.bin/templates, run python export.py <local_model_dir>.
  • If you see messages like HIP error in softmax kernel: invalid device function, the code falls back to a portable simple softmax kernel automatically; this is expected on some architectures and does not affect correctness (see the sketch after this list).
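
For reference, a "portable simple" softmax of that flavor might look roughly like the sketch below; it avoids shared memory and warp intrinsics entirely, trading speed for compatibility. This is an illustration, not the repo's actual fallback kernel.

// Deliberately simple, architecture-agnostic softmax: one block per row,
// with thread 0 doing serial max/exp/sum passes. Slow but hard to break.
#include <hip/hip_runtime.h>

__global__ void softmax_simple(float* x, int rows, int cols) {
    int row = blockIdx.x;
    if (row >= rows || threadIdx.x != 0) return;
    float* v = x + (size_t)row * cols;

    float m = v[0];
    for (int i = 1; i < cols; ++i) m = fmaxf(m, v[i]);               // row max for stability
    float s = 0.f;
    for (int i = 0; i < cols; ++i) { v[i] = expf(v[i] - m); s += v[i]; }
    for (int i = 0; i < cols; ++i) v[i] /= s;
}

// launch sketch: softmax_simple<<<rows, 1>>>(d_x, rows, cols);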

πŸ“Š Performance:

  • Same performance characteristics as CUDA version (theoretically)
  • Optimized for AMD GPUs with architecture-specific builds
  • ROCm profiler support for performance analysis
  • ⚠️ Note: AMD benchmarking not yet completed - performance metrics pending

πŸ”— Links:


TODOs

There are still a few catches left:

  • Fusing RMSnorm Kernel
  • Fusing skip connections with cuBLAS
  • Fix Softmax Kernel & Dispatcher
  • [ ] Exploring option of RoPE pre-computed values (see the sketch after this list)
  • AMD ROCm/HIP version
  • [ ] Benchmarking AMD version
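
On the RoPE item above: the idea would be to fill the cos/sin tables once on the host (or in a one-off kernel) and upload them, so the attention kernel only does lookups and multiplies. A rough sketch, reusing the assumed HEAD_DIM / ROPE_THETA constants from the config sketch earlier:

// Sketch of precomputing RoPE rotation tables, indexed by (position, dim-pair).
// Names and constants are assumptions, not the repo's actual identifiers.
#include <cmath>
#include <vector>

constexpr int   HEAD_DIM   = 128;
constexpr float ROPE_THETA = 1e6f;

void precompute_rope(int max_seq_len, std::vector<float>& cos_tab, std::vector<float>& sin_tab) {
    const int half = HEAD_DIM / 2;
    cos_tab.resize((size_t)max_seq_len * half);
    sin_tab.resize((size_t)max_seq_len * half);
    for (int pos = 0; pos < max_seq_len; ++pos) {
        for (int i = 0; i < half; ++i) {
            float freq  = std::pow(ROPE_THETA, -2.0f * i / HEAD_DIM);  // theta^(-2i/d)
            float angle = pos * freq;
            cos_tab[(size_t)pos * half + i] = std::cos(angle);
            sin_tab[(size_t)pos * half + i] = std::sin(angle);
        }
    }
}
// The two tables would then be copied to the GPU once and read inside the
// attention kernel when rotating q and k, instead of calling sin/cos per token.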

License

MIT
