High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, KML, and BLIS (AMD library) support.
- Note
- Features
- Installation
- Verification
- Integration with SGLang
- Direct Python API Usage
- Build Configuration
- Error Troubleshooting
- Weight Quantization
- Before Commit!
Current Support Status:
- ✅ Intel CPUs with AMX: Fully supported
- ⚠️ Universal CPU with llamafile: In preview, not yet fully complete
- ⚠️ AMD CPUs with BLIS: Upcoming, not yet fully integrated
- AMX Optimization: Intel AMX (Advanced Matrix Extensions) support for INT4/INT8 quantized MoE inference
- Multi-Backend: Unified KTMoEWrapper API supporting multiple backends (AMXINT4, AMXINT8, LLAMAFILE*)
- Flexible Backends: AVX512, AVX2 via pluggable backend architecture
- Efficient MoE: Optimized Mixture-of-Experts operations with NUMA-aware memory management
- Async Execution: Non-blocking submit_forward/sync_forward API for improved pipelining
- Easy Integration: Clean Python API with automatic backend selection
Note: *LLAMAFILE backend support is currently in preview and not yet fully complete.
First, initialize git submodules:
git submodule update --init --recursive

Step 0: Create and activate a conda environment (recommended):
conda create -n kt-kernel python=3.11 -y
conda activate kt-kernel

You can now install in two clear steps using the same script.
Option A: Two-step (explicit)
# 1) Install system prerequisites (cmake, hwloc, pkg-config)
./install.sh deps
# 2) Build and install kt-kernel (auto-detects CPU)
# By default, the script cleans the local ./build directory before compiling.
./install.sh build

Option B: One-step (deps + build)
./install.sh

The install script will:
- Auto-detect CPU capabilities (AMX support)
- Install cmake via conda (if available)
- Install system dependencies (libhwloc-dev, pkg-config) based on your OS
What gets configured automatically:
- AMX CPU detected → NATIVE + AMX=ON
- No AMX detected → NATIVE + AMX=OFF
Note: If compilation fails with NATIVE, use AVX512 or AVX2 instead of NATIVE to avoid compilation issues (see below).
If you need specific build options (e.g., for LLAMAFILE backend, compatibility, or binary distribution):
# Example for LLAMAFILE backend on AMX CPU with AVX512
export CPUINFER_CPU_INSTRUCT=AVX512 # Options: NATIVE, AVX512, AVX2, FANCY
export CPUINFER_ENABLE_AMX=OFF # Options: ON, OFF
# Run with manual mode (build only)
./install.sh build --manual

For advanced build options and binary distribution, see the Build Configuration section. If you encounter issues, refer to Error Troubleshooting.
python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"

KT-Kernel can be used standalone via the Direct Python API or integrated with SGLang for production deployment. This section describes SGLang integration to enable CPU-GPU heterogeneous inference, where "hot" experts run on GPU and "cold" experts run on CPU for optimal resource utilization.
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

You need both GPU weights and CPU weights for heterogeneous inference:
GPU Weights: Use the original / quantized model weights.
CPU Weights: Quantize to AMX-optimized format using the conversion script:
# Set --input-type to fp8, fp16, or bf16, depending on your GPU weights type
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/cpu-weights \
--quant-method int8  # or int4

Supported input formats: FP8, FP16, BF16 → INT4/INT8.
For more details, see the Weight Quantization section below.
Note: LLAMAFILE backend supports GGUF format directly, but this feature is still in preview.
Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:
KT-Kernel Parameters to Add:
- --kt-method: Backend method (AMXINT4, AMXINT8, or LLAMAFILE)
- --kt-weight-path: Path to the converted CPU weights
- --kt-cpuinfer: Number of CPU inference threads (set to physical cores)
- --kt-threadpool-count: Number of thread pools (set to NUMA node count)
- --kt-num-gpu-experts: Number of experts to keep on GPU
- --kt-max-deferred-experts-per-token: Deferred experts for pipelined execution
Example:
python -m sglang.launch_server \
[your normal SGLang parameters...] \
--kt-method AMXINT8 \
--kt-weight-path /path/to/cpu-weights \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32 \
--kt-max-deferred-experts-per-token 2

See the KT-Kernel Parameters section below for detailed parameter tuning guidelines.
This example demonstrates the full workflow from downloading weights to launching the server.
Hardware Configuration:
- GPU: NVIDIA RTX 4090 24GB
- CPU: 2x Intel Xeon Gold 6454S (64 physical cores total, 128 threads, 2 NUMA nodes)
- Model: Qwen3-30B-A3B
- GPU Weights: BF16 original weights
- CPU Weights: AMXINT8 quantized
How to verify your system configuration:
# Check CPU configuration
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)"
# Expected output example:
CPU(s): 128
Thread(s) per core: 2
Socket(s): 2
NUMA node(s): 2
# → Physical cores = CPU(s) / Thread(s) per core = 128 / 2 = 64

Parameter Rationale:
- --kt-cpuinfer 64: Set to physical cores (64), not hyperthreads (128)
- --kt-threadpool-count 2: 2 NUMA nodes detected (dual-socket system)
- --kt-num-gpu-experts 32: With 24GB GPU memory, we can fit ~32 experts on GPU for this model (varies by model architecture and actual memory usage)
- --kt-max-deferred-experts-per-token 2: Enable pipelined execution - allows the CPU to process the next batch while the GPU completes the current batch
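If you want to script these rules, here is a minimal, Linux-only sketch (a hypothetical helper, not part of kt-kernel) that derives starting values for --kt-cpuinfer and --kt-threadpool-count from /proc/cpuinfo and sysfs:

```python
# Hypothetical helper, not shipped with kt-kernel: suggest starting values
# for --kt-cpuinfer (physical cores) and --kt-threadpool-count (NUMA nodes).
import glob
import re


def physical_core_count() -> int:
    """Count unique (physical id, core id) pairs so hyperthreads are not double-counted."""
    cores = set()
    physical_id = core_id = None
    with open("/proc/cpuinfo") as f:
        for line in list(f) + [""]:  # trailing "" flushes the last processor entry
            if line.startswith("physical id"):
                physical_id = line.split(":")[1].strip()
            elif line.startswith("core id"):
                core_id = line.split(":")[1].strip()
            elif not line.strip():
                if physical_id is not None and core_id is not None:
                    cores.add((physical_id, core_id))
                physical_id = core_id = None
    return len(cores)


def numa_node_count() -> int:
    """Each NUMA node appears as /sys/devices/system/node/nodeN."""
    nodes = [d for d in glob.glob("/sys/devices/system/node/node*")
             if re.search(r"node\d+$", d)]
    return max(len(nodes), 1)


print("--kt-cpuinfer", physical_core_count())
print("--kt-threadpool-count", numa_node_count())
```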
# Install huggingface-cli if not already installed
pip install huggingface-hub
# Download model from Hugging Face
hf download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

python scripts/convert_cpu_weights.py \
--input-path /mnt/data/models/Qwen3-30B-A3B \
--input-type bf16 \
--output /mnt/data/models/Qwen3-30B-A3B-INT8 \
--quant-method int8

python -m sglang.launch_server \
--host 0.0.0.0 \
--port 8000 \
--model /mnt/data/models/Qwen3-30B-A3B \
--trust-remote-code \
--mem-fraction-static 0.92 \
--chunked-prefill-size 4096 \
--served-model-name Qwen3-30B-A3B \
--enable-mixed-chunk \
--kt-method AMXINT8 \
--kt-weight-path /mnt/data/models/Qwen3-30B-A3B-INT8 \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32 \
--kt-max-deferred-experts-per-token 2

| Parameter | Description | Example Value |
|---|---|---|
| --kt-method | CPU inference backend method | AMXINT4, AMXINT8, or LLAMAFILE (preview) |
| --kt-weight-path | Path to quantized CPU weights | /path/to/cpu-weights |
| --kt-cpuinfer | Number of CPU inference threads | 64 (adjust based on CPU cores) |
| --kt-threadpool-count | Number of thread pools for parallel execution | 2 (typically 1-4) |
| --kt-num-gpu-experts | Number of experts to keep on GPU | 32 (remaining experts go to CPU) |
| --kt-max-deferred-experts-per-token | Number of experts per token to defer for pipelined execution | 2 (0 to disable, 1-2 recommended) |
Parameter Guidelines:
- kt-method: Choose based on your CPU and weight format:
  - AMXINT4: Best performance on AMX CPUs with INT4 quantized weights (may cause a significant accuracy drop for some models, e.g., Qwen3-30B-A3B)
  - AMXINT8: Higher accuracy with INT8 quantized weights on AMX CPUs
  - LLAMAFILE: Preview support for GGUF format (not fully complete)
- kt-cpuinfer: Set to the number of physical CPU cores (not hyperthreads).
  - Check physical cores: lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core"
  - Physical cores = CPU(s) / Thread(s) per core
  - Example: If CPU(s)=128 and Thread(s) per core=2, then physical cores = 64
  - Important: Do NOT set this to the hyperthread count - it will degrade performance
- kt-threadpool-count: Set to the number of NUMA nodes.
  - Check NUMA count: lscpu | grep "NUMA node(s)" or numactl --hardware | grep "available"
  - Note: the NUMA node count is NOT necessarily the number of physical CPUs; it represents memory domains, which may be divided within a single CPU or span multiple CPUs
  - Use the NUMA node count from lscpu, regardless of physical CPU count
  - Typical values: 1-2 for single-socket, 2-4 for dual-socket systems
  - This enables better memory bandwidth utilization across NUMA domains
- kt-num-gpu-experts: Determine based on GPU memory and profiling:
  - More GPU experts = lower latency but higher GPU memory usage (may cause OOM)
- kt-max-deferred-experts-per-token: Enables pipelined execution:
  - 0: Synchronous execution (simpler, higher latency)
  - 1-2: Deferred execution (better latency, requires tuning) - recommended
  - 3-4: Higher deferred counts (possible but rarely beneficial)
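Once the server is up, you can sanity-check the deployment with any OpenAI-compatible client. A minimal sketch, assuming the example launch command above (port 8000, served model name Qwen3-30B-A3B):

```python
# Minimal smoke test against SGLang's OpenAI-compatible chat endpoint.
# Assumes the example server above is listening on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen3-30B-A3B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```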
For standalone usage without SGLang, you can use KT-Kernel directly via the Python API:
from kt_kernel import KTMoEWrapper
# Initialize the MoE wrapper
wrapper = KTMoEWrapper(
layer_idx=0,
num_experts=8,
num_experts_per_tok=2,
hidden_size=4096,
moe_intermediate_size=14336,
num_gpu_experts=2,
cpuinfer_threads=32,
threadpool_count=2,
weight_path="/path/to/weights",
chunked_prefill_size=512,
method="AMXINT4" # Options: "AMXINT4", "AMXINT8", "LLAMAFILE" (preview)
)
# Load weights (from disk - pre-quantized)
wrapper.load_weights(physical_to_logical_map)
# Or load weights from tensors (online quantization)
wrapper.load_weights_from_tensors(gate_proj, up_proj, down_proj, physical_to_logical_map)
# Run inference
output = wrapper.forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# Or use async API for better performance
wrapper.submit_forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# ... do other work ...
output = wrapper.sync_forward(hidden_states, cuda_stream)

# Initialize with additional options
wrapper = KTMoEWrapper(
layer_idx=0,
num_experts=8,
num_experts_per_tok=2,
hidden_size=4096,
moe_intermediate_size=14336,
num_gpu_experts=2,
cpuinfer_threads=32,
threadpool_count=2,
weight_path="/path/to/weights",
chunked_prefill_size=512,
method="AMXINT4",
cpu_save=False, # Keep weights in CPU memory after loading
max_deferred_experts_per_token=0 # Number of experts to defer (for pipelined execution)
)
# Pre-allocate buffers for specific batch sizes (improves performance)
KTMoEWrapper.set_capture_batch_sizes([1, 2, 4, 8, 16])
# Query captured batch sizes
batch_sizes = KTMoEWrapper.get_capture_batch_sizes()
# Clear buffer cache to free memory
KTMoEWrapper.clear_buffer_cache()

If you prefer manual installation without the install.sh script, follow these steps:
Prerequisites:
- cmake (recommended: conda install -y cmake)
- libhwloc-dev and pkg-config
Core Options:
| Variable | Options | Description |
|---|---|---|
| CPUINFER_CPU_INSTRUCT | NATIVE, AVX512, AVX2, FANCY | CPU instruction set to use |
| CPUINFER_ENABLE_AMX | ON, OFF | Enable Intel AMX support |
| CPUINFER_BUILD_TYPE | Release, Debug, RelWithDebInfo | Build type (default: Release) |
| CPUINFER_PARALLEL | Number | Parallel build jobs (default: auto-detect) |
| CPUINFER_VERBOSE | 0, 1 | Verbose build output (default: 0) |
Instruction Set Details:
- NATIVE: Auto-detect and use all available CPU instructions (-march=native) - recommended for best performance
- AVX512: Explicit AVX512 support for Skylake-SP and Cascade Lake
- AVX2: AVX2 support for maximum compatibility
- FANCY: AVX512 with full extensions (AVX512F/BW/DQ/VL/VNNI) for Ice Lake+ and Zen 4+. Use this when building pre-compiled binaries to distribute to users with modern CPUs. For local builds, prefer NATIVE for better performance.
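If you are unsure which value to pick, the CPU flags reported in /proc/cpuinfo map roughly onto these options. A Linux-only sketch of that mapping (an illustrative heuristic, not the install script's actual detection logic):

```python
# Illustrative heuristic only -- the install script performs its own detection.
# Reads CPU flags from /proc/cpuinfo (Linux) and suggests build settings.
def cpu_flags() -> set:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


flags = cpu_flags()
fancy_exts = {"avx512f", "avx512bw", "avx512dq", "avx512vl", "avx512_vnni"}

if fancy_exts <= flags:
    instruct = "FANCY (or NATIVE for a purely local build)"
elif "avx512f" in flags:
    instruct = "AVX512"
elif "avx2" in flags:
    instruct = "AVX2"
else:
    instruct = "no AVX2/AVX512 detected"

print("Suggested CPUINFER_CPU_INSTRUCT:", instruct)
print("Suggested CPUINFER_ENABLE_AMX:", "ON" if "amx_tile" in flags else "OFF")
```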
Example Configurations:
# Maximum performance on AMX CPU
export CPUINFER_CPU_INSTRUCT=NATIVE
export CPUINFER_ENABLE_AMX=ON
# AVX512 CPU without AMX
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF
# Compatibility build
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
# Debug build for development
export CPUINFER_BUILD_TYPE=Debug
export CPUINFER_VERBOSE=1

# Editable installation (for development)
pip install -e .
# Standard installation
pip install .

If CMake cannot find a CUDA compiler, you may see:

-- Looking for a CUDA compiler - NOTFOUND
CMake Error at CMakeLists.txt:389 (message):
KTRANSFORMERS_USE_CUDA=ON but CUDA compiler not found
Make sure you have the CUDA toolkit installed and nvcc is in your system PATH.
Try export CMAKE_ARGS="-D CMAKE_CUDA_COMPILER=$(which nvcc)" and reinstall again.
If hwloc is missing, run sudo apt install libhwloc-dev on a Debian-based system, or build it from source: https://www.open-mpi.org/projects/hwloc/.
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.2.tar.gz
tar -xzf hwloc-2.12.2.tar.gz
cd hwloc-2.12.2
./configure
make
sudo make install
KT-Kernel provides weight quantization tools for CPU-GPU hybrid inference (e.g., integrating with SGLang). Both tools work together to enable heterogeneous expert placement across CPUs and GPUs.
Quantize weights to INT4/INT8 format optimized for AMX inference:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4

Supported formats: FP8, FP16, BF16 → INT4/INT8
Apply GPTQ quantization to model weights:
# Install additional dependencies first
pip install accelerate transformers llmcompressor datasets
# Quantize GPU weights
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_type W4A16

Supported types: W4A16 (GPTQ4), W8A16 (GPTQ8)
For detailed documentation, advanced options, and low-memory mode, see scripts/README.md.
Your commit message should follow the Conventional Commits specification (https://www.conventionalcommits.org/).
Format your code before committing:
cmake -B build
cd build
make format

You may need clang-format 18 or newer; install it in your conda environment:
conda install -c conda-forge clang-format=18
rm -rf build

You may also need black for Python formatting:
conda install black