Skip to content

Commit 3c13435

Browse files
authored
Fix CPU Instruction Set and Installation (kvcache-ai#1729)
* [fix](kt-kernel): fix AVX512 cpu instruction set detection * [feat](kt-kernel): AVX512 fallback kernel for RAW-INT4 * [fix](kt-kernel): fix setup version issue * [fix](kt-kernel): update install for custom build * [docs](kt-kernel): new installation guide for various cpu instruction set * [fix](kt-kernel): fix _mm512_dpbusd_epi32_compat fallback implmentation * [style](kt-kernel): clang format
1 parent a8667dd commit 3c13435

File tree

11 files changed

+538
-471
lines changed

11 files changed

+538
-471
lines changed

kt-kernel/CMakeLists.txt

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ option(KTRANSFORMERS_CPU_MOE_AMD "ktransformers: CPU use moe kernel for amd" OFF
2828
# LTO control
2929
option(CPUINFER_ENABLE_LTO "Enable link time optimization (IPO)" OFF)
3030

31-
project(kt_kernel_ext VERSION 0.4.2)
31+
project(kt_kernel_ext VERSION 0.4.4)
3232
# Choose compilers BEFORE project() so CMake honors them
3333
if(USE_CONDA_TOOLCHAIN)
3434
if(NOT DEFINED ENV{CONDA_PREFIX} OR NOT EXISTS "$ENV{CONDA_PREFIX}")
@@ -378,7 +378,20 @@ if(HOST_IS_X86)
378378
target_link_libraries(${test_name} llama OpenMP::OpenMP_CXX numa)
379379
endforeach()
380380
endif()
381-
list(APPEND ARCH_FLAGS -mfma -mf16c -mavx512bf16 -mavx512vnni)
381+
# Note: AVX512 subset flags (-mavx512vnni, -mavx512bf16) are already added
382+
# in the generic x86 detection block above (lines 276-289) when corresponding
383+
# LLAMA_AVX512_* options are enabled. No need to add them again here.
384+
# -mfma is already added by LLAMA_NATIVE (line 254), LLAMA_AVX*, or LLAMA_FMA blocks.
385+
# Only add -mf16c if LLAMA_F16C is not already enabled.
386+
if(NOT LLAMA_F16C)
387+
list(APPEND ARCH_FLAGS -mf16c)
388+
endif()
389+
if(LLAMA_AVX512_VNNI)
390+
message(STATUS "AVX512_VNNI enabled")
391+
endif()
392+
if(LLAMA_AVX512_BF16)
393+
message(STATUS "AVX512_BF16 enabled")
394+
endif()
382395
endif()
383396
endif()
384397

kt-kernel/README.md

Lines changed: 75 additions & 58 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@ High-performance kernel operations for KTransformers, featuring CPU-optimized Mo
3737
-**Intel CPUs with AMX**: Fully supported (using weights converted to INT4/INT8 format)
3838
-**Universal CPU (llamafile backend)**: Supported (using GGUF-format weights)
3939
-**AMD CPUs with BLIS**: Supported (for int8 prefill & decode)
40+
-**Kimi-K2 Native INT4 (RAWINT4)**: Supported on AVX512 CPUs (CPU-GPU shared INT4 weights) - [Guide](../doc/en/Kimi-K2-Thinking-Native.md)
4041

4142
## Features
4243

@@ -49,6 +50,8 @@ High-performance kernel operations for KTransformers, featuring CPU-optimized Mo
4950

5051
### Option 1: Install from PyPI (Recommended for Most Users)
5152

53+
Coming soon...
54+
5255
Choose the version matching your CUDA installation:
5356

5457
```bash
@@ -104,76 +107,55 @@ python -c "import kt_kernel"
104107

105108
---
106109

107-
### Option 2: Install from Source (For AMD, ARM, or Custom Builds)
110+
### Option 2: Install from Source (For Local Use or Custom Builds)
108111

109-
If you need AMD (BLIS), ARM (KML), or custom CUDA versions, build from source:
112+
Build from source for local installation or when you need AMD (BLIS), ARM (KML), or custom CUDA versions.
110113

111114
#### Prerequisites
112115

113-
First, initialize git submodules:
116+
First, initialize git submodules and create a conda environment:
114117
```bash
115118
git submodule update --init --recursive
116-
```
117-
118-
#### Quick Installation
119-
120-
Step 0: Create and activate a conda environment (recommended):
121-
122-
```bash
123119
conda create -n kt-kernel python=3.11 -y
124120
conda activate kt-kernel
125121
```
126122

127-
You can now install in two clear steps using the same script.
123+
#### Quick Installation (Recommended)
128124

129-
**Option A: Two-step** (specify dependencies installation and build separately)
125+
Simply run the install script - it will auto-detect your CPU and optimize for best performance:
130126

131127
```bash
132-
# 1) Install system prerequisites (cmake, hwloc, pkg-config)
133-
./install.sh deps
134-
135-
# 2) Build and install kt-kernel (auto-detects CPU instruction set)
136-
# By default, the script cleans the local ./build directory before compiling
137-
./install.sh build
128+
./install.sh
138129
```
139130

140-
**Option B: One-step**
131+
**What happens automatically:**
132+
- Auto-detects CPU capabilities (AMX, AVX512_VNNI, AVX512_BF16)
133+
- Installs system dependencies (`cmake`, `libhwloc-dev`, `pkg-config`)
134+
- Builds optimized binary for **your CPU only** (using `-march=native`)
135+
- **Software fallbacks**: Automatically enabled for CPUs without VNNI/BF16
141136

137+
**Optional: Two-step installation**
142138
```bash
143-
./install.sh
139+
./install.sh deps # Install dependencies only
140+
./install.sh build # Build and install kt-kernel
144141
```
145142

146-
The install script will:
147-
- Auto-detect CPU capabilities (AMX support)
148-
- Install `cmake` via conda (if available)
149-
- Install system dependencies (`libhwloc-dev`, `pkg-config`) based on your OS
143+
**CPU Requirements by Backend:**
150144

151-
**What gets configured automatically:**
152-
- AMX CPU detected → `NATIVE + AMX=ON`
153-
- No AMX detected → `NATIVE + AMX=OFF`
145+
| Backend | Minimum CPU Requirement | Example CPUs | Notes |
146+
|---------|-------------------------|--------------|-------|
147+
| **LLAMAFILE** | AVX2 | Intel Haswell (2013+), AMD Zen+ | Universal compatibility |
148+
| **RAWINT4** | AVX512F + AVX512BW | Intel Skylake-X (2017+), Ice Lake, Cascade Lake | Software fallbacks for VNNI/BF16 |
149+
| **AMXINT4/INT8** | AMX | Intel Sapphire Rapids (2023+) | Best performance, requires AMX hardware |
154150

155-
⚠️ **Important for LLAMAFILE backend users:**
156-
If you have an AMX-capable CPU but plan to use the LLAMAFILE backend, do NOT use the default auto-detection build.
157-
Use "manual mode" with `CPUINFER_CPU_INSTRUCT` set to `AVX512` or `AVX2` instead of `NATIVE` to avoid compilation issues (see below).
151+
**Software Fallback Support (AVX512 backends):**
152+
- ✅ VNNI fallback: Uses AVX512BW instructions
153+
- ✅ BF16 fallback: Uses AVX512F instructions
154+
- ✅ Older AVX512 CPUs (Skylake-X, Cascade Lake) can run RAWINT4 with fallbacks
158155

159-
⚠️ **Important for BLIS AMD backend users:**
160-
for the installation guide, see this [issue](https://github.com/kvcache-ai/ktransformers/issues/1601)
156+
⚠️ **Portability Note:** The default build is optimized for your specific CPU and may not work on different/older CPUs. For portable builds or binary distribution, see [Manual Configuration](#manual-configuration-advanced) below.
161157

162-
163-
### Manual Configuration (Advanced)
164-
165-
If you need specific build options (e.g., for LLAMAFILE backend, compatibility, or binary distribution):
166-
167-
```bash
168-
# Example for LLAMAFILE backend on AMX CPU with AVX512
169-
export CPUINFER_CPU_INSTRUCT=AVX512 # Options: NATIVE, AVX512, AVX2, FANCY
170-
export CPUINFER_ENABLE_AMX=OFF # Options: ON, OFF
171-
172-
# Build only (skip auto-detection of instruction set)
173-
./install.sh build --manual
174-
```
175-
176-
For advanced build options and binary distribution, see the [Build Configuration](#build-configuration) section. If you encounter issues, refer to [Error Troubleshooting](#error-troubleshooting).
158+
⚠️ **AMD BLIS backend users:** See [installation guide](https://github.com/kvcache-ai/ktransformers/issues/1601) for AMD-specific setup.
177159

178160
## Verification
179161

@@ -482,11 +464,44 @@ batch_sizes = KTMoEWrapper.get_capture_batch_sizes()
482464
KTMoEWrapper.clear_buffer_cache()
483465
```
484466

467+
### Manual Configuration (Advanced)
468+
469+
For portable builds, binary distribution, or cross-machine deployment, you need to manually specify target instruction sets:
470+
471+
```bash
472+
# General distribution (works on any AVX512 CPU from 2017+)
473+
export CPUINFER_CPU_INSTRUCT=AVX512
474+
export CPUINFER_ENABLE_AMX=OFF
475+
./install.sh build --manual
476+
477+
# Maximum compatibility (works on any CPU from 2013+)
478+
export CPUINFER_CPU_INSTRUCT=AVX2
479+
export CPUINFER_ENABLE_AMX=OFF
480+
./install.sh build --manual
481+
482+
# Modern CPUs only (Ice Lake+, Zen 4+)
483+
export CPUINFER_CPU_INSTRUCT=FANCY
484+
export CPUINFER_ENABLE_AMX=OFF
485+
./install.sh build --manual
486+
```
487+
488+
**Optional: Override VNNI/BF16 detection**
489+
```bash
490+
# Force enable/disable VNNI and BF16 (for testing fallbacks)
491+
export CPUINFER_ENABLE_AVX512_VNNI=OFF
492+
export CPUINFER_ENABLE_AVX512_BF16=OFF
493+
./install.sh
494+
```
495+
496+
See `./install.sh --help` for all available options.
497+
498+
---
499+
485500
## Build Configuration
486501

487-
### Manual Installation
502+
### Manual Installation (Without install.sh)
488503

489-
If you prefer manual installation without the `install.sh` script, follow these steps:
504+
If you prefer manual installation without the `install.sh` script:
490505

491506
#### 1. Install System Dependencies
492507

@@ -508,27 +523,29 @@ If you prefer manual installation without the `install.sh` script, follow these
508523

509524
**Instruction Set Details:**
510525

511-
- **`NATIVE`**: Auto-detect and use all available CPU instructions (`-march=native`) - **Recommended for best performance**
512-
- **`AVX512`**: Explicit AVX512 support for Skylake-SP and Cascade Lake
513-
- **`AVX2`**: AVX2 support for maximum compatibility
514-
- **`FANCY`**: AVX512 with full extensions (AVX512F/BW/DQ/VL/VNNI) for Ice Lake+ and Zen 4+. Use this when building pre-compiled binaries to distribute to users with modern CPUs. For local builds, prefer `NATIVE` for better performance.
526+
| Option | Target CPUs | Use Case |
527+
|--------|-------------|----------|
528+
| **`NATIVE`** | Your specific CPU only | Local builds (best performance, **default**) |
529+
| **`AVX512`** | Skylake-X, Ice Lake, Cascade Lake, Zen 4+ | General distribution |
530+
| **`AVX2`** | Haswell (2013) and newer | Maximum compatibility |
531+
| **`FANCY`** | Ice Lake+, Zen 4+ | Modern CPUs with full AVX512 extensions |
515532

516533
**Example Configurations:**
517534

518535
```bash
519-
# Maximum performance on AMX CPU
536+
# Local use - maximum performance (default behavior)
520537
export CPUINFER_CPU_INSTRUCT=NATIVE
521-
export CPUINFER_ENABLE_AMX=ON
538+
export CPUINFER_ENABLE_AMX=ON # or OFF
522539

523-
# AVX512 CPU without AMX
540+
# Distribution build - works on any AVX512 CPU
524541
export CPUINFER_CPU_INSTRUCT=AVX512
525542
export CPUINFER_ENABLE_AMX=OFF
526543

527-
# Compatibility build
544+
# Maximum compatibility - works on CPUs since 2013
528545
export CPUINFER_CPU_INSTRUCT=AVX2
529546
export CPUINFER_ENABLE_AMX=OFF
530547

531-
# Debug build for development
548+
# Debug build
532549
export CPUINFER_BUILD_TYPE=Debug
533550
export CPUINFER_VERBOSE=1
534551
```

0 commit comments

Comments
 (0)