Merged

50 commits
c24dbc3
[feat]: fp8 amx kernel init
ouqingliang Dec 9, 2025
c36cda2
Fix forward_decode crash by initializing gate_up_ba_ for each activat…
ouqingliang Dec 10, 2025
8a3bed8
[fix](kt-kernel): fix moe.hpp huge mem request
ouqingliang Dec 10, 2025
7c2b56b
[feat]: update awq moe
ouqingliang Dec 10, 2025
cab5bcc
delete fp8_moe
ouqingliang Dec 11, 2025
1264956
[fix](kt-kernel): fix load weight bug
ouqingliang Dec 11, 2025
bc36a4b
[feat](kt-kernel): fp8 native finished, not test.
ouqingliang Dec 15, 2025
ebbd16a
[feat](kt-kernel): support fp8 native, speed not test.
ouqingliang Dec 16, 2025
1d21372
[feat](kt-kernel): add FP8 MoE with new AVX kernel, benchmarks and op…
ouqingliang Dec 19, 2025
908a186
[chore](kt-kernel): merge main into moe_hpp
ouqingliang Dec 19, 2025
78fb03c
[fix](kt-kernel): clamp exp input in act_fn to prevent float32 overflow
ouqingliang Dec 19, 2025
daf4e47
kt cli 1.0
ErvinXie Dec 19, 2025
94a7f0a
on going install
ErvinXie Dec 19, 2025
edb1826
install
ErvinXie Dec 19, 2025
ff6f657
[fix](kt-kernel): fix k2 write_buffer to per expert
ouqingliang Dec 19, 2025
ebfdefb
[feat](kt-kernel): support fp8 layerwise prefill write buffer.
ouqingliang Dec 20, 2025
92cad78
move kt cli to kt-kernel
ErvinXie Dec 22, 2025
403da4c
move cli
ErvinXie Dec 22, 2025
dab16cd
change completion
ErvinXie Dec 22, 2025
7908469
[fix]: rename fp8, delete amx_raw_utils.hpp
ouqingliang Dec 22, 2025
74bb60f
completions
ErvinXie Dec 23, 2025
8bf604e
update model management
ErvinXie Dec 23, 2025
2d973e9
completion ok
ErvinXie Dec 23, 2025
a8c0ee6
fix run
ErvinXie Dec 23, 2025
1174138
Merge branch 'moe_hpp' into kt-cli
ErvinXie Dec 23, 2025
00dc49b
m2 run ok
ErvinXie Dec 23, 2025
e91fdc3
kt chat ok
ErvinXie Dec 23, 2025
221ae90
fc ok
ErvinXie Dec 23, 2025
a751024
print run command
ErvinXie Dec 23, 2025
57e904d
auto compute
ErvinXie Dec 23, 2025
79d0299
calc
ErvinXie Dec 23, 2025
ca5d8cc
rm download
ErvinXie Dec 23, 2025
176d245
auto tp
ErvinXie Dec 23, 2025
c146d35
model reg
ErvinXie Dec 23, 2025
100cdea
fix cuda visible
ErvinXie Dec 23, 2025
bd9c2d7
refact install
ErvinXie Dec 23, 2025
38085be
rm install
ErvinXie Dec 23, 2025
ce2f2cb
block mb and sft
ErvinXie Dec 23, 2025
b3aa7c7
[fix]: 123
ErvinXie Dec 23, 2025
b43d52c
[docs]: add FP8 and MiniMax-M2.1 support documentation
ouqingliang Dec 24, 2025
c9d67d3
[feat]: detect sglang
ErvinXie Dec 24, 2025
798502e
Merge branch 'kt-cli' into kt-minimax
ErvinXie Dec 24, 2025
363be79
[fix]: Update MiniMax-M2.1 tutorial.
ouqingliang Dec 24, 2025
bdfc0a6
[docs]: 123
ErvinXie Dec 24, 2025
801d693
[docs]: rm python
ErvinXie Dec 24, 2025
33d068b
[docs]: add compare llama.cpp
ouqingliang Dec 24, 2025
d077169
Merge branch 'kt-minimax' of https://github.com/kvcache-ai/ktransform…
ouqingliang Dec 24, 2025
22a0354
[docs]: add readme
ouqingliang Dec 24, 2025
27db578
[docs]: fix tutorial.
ouqingliang Dec 24, 2025
35b7cfa
Merge branch 'main' into kt-minimax
ouqingliang Dec 24, 2025
132 changes: 132 additions & 0 deletions doc/en/MiniMax-M2.1-Tutorial.md
@@ -0,0 +1,132 @@
# Running MiniMax-M2.1 with Native Precision using SGLang and KT-Kernel

This tutorial demonstrates how to run MiniMax-M2.1 model inference using SGLang integrated with KT-Kernel. MiniMax-M2.1 provides native FP8 weights, enabling efficient GPU inference with reduced memory footprint while maintaining high accuracy.

## Table of Contents

- [Overview](#overview)
- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
- [Step 3: Send Inference Requests](#step-3-send-inference-requests)
- [Performance](#performance)
- [Troubleshooting](#troubleshooting)

## Overview

MiniMax-M2.1 is a large MoE (Mixture of Experts) model that provides native FP8 weights. This tutorial uses KT-Kernel's FP8 support to enable CPU-GPU heterogeneous inference:

- **FP8 GPU Inference**: Native FP8 precision for GPU-side computation, providing both memory efficiency and computational accuracy
- **CPU-GPU Heterogeneous Architecture**:
- Hot experts and attention modules run on GPU with FP8 precision
- Cold experts offloaded to CPU for memory efficiency

## Hardware Requirements

**Minimum Configuration:**
- **GPU**: NVIDIA RTX 4090 24 GB (or equivalent with at least 24GB VRAM available)
- **CPU**: x86 CPU with AVX512 support (e.g., Intel Sapphire Rapids, AMD EPYC)
- **RAM**: At least <!-- TODO: RAM requirement -->GB system memory
- **Storage**: 220 GB for model weights (GPU and CPU share the same weight directory)

**Tested Configuration:**

- **GPU**: 1 or 2 x NVIDIA GeForce RTX 5090 (32 GB each)
- **CPU**: 2 x AMD EPYC 9355 32-Core Processor (128 threads)
- **RAM**: 1TB DDR5 5600MT/s ECC
- **OS**: Linux (Ubuntu 20.04+ recommended)

## Prerequisites

Before starting, ensure you have:

1. **SGLang installed** - Follow [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
2. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)

Note: Currently, please clone our custom SGLang repository:

```bash
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
pip install -e "python[all]"
```

3. **CUDA toolkit** - CUDA 12.0+ recommended for FP8 support
4. **Hugging Face CLI** - For downloading models:
```bash
pip install -U huggingface-hub
```

## Step 1: Download Model Weights

<!-- TODO: using kt-cli -->
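Until the kt-cli download workflow is documented, the weights can be fetched directly with the Hugging Face CLI. This is a sketch: the repository id `MiniMaxAI/MiniMax-M2.1` is an assumption, so verify it on the model's Hugging Face page before downloading.

```bash
# Assumed repo id -- confirm on Hugging Face first. The full download is ~220 GB.
huggingface-cli download MiniMaxAI/MiniMax-M2.1 \
  --local-dir /path/to/MiniMax-M2.1
```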
## Step 2: Launch SGLang Server


<!-- TODO: using kt-cli -->

See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter tuning guidelines.

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--kt-method FP8` | Enable FP8 inference mode for MiniMax-M2.1 native FP8 weights. |
| `--kt-cpuinfer` | Number of CPU inference threads. Set to physical CPU cores (not hyperthreads). |
| `--kt-threadpool-count` | Number of thread pools. Set to NUMA node count. |
| `--kt-num-gpu-experts` | Number of experts kept on GPU for decoding. |
| `--chunked-prefill-size` | Maximum tokens per prefill batch. |
| `--max-total-tokens` | Maximum total tokens in KV cache. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for layerwise prefill strategy. |
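Pending the kt-cli launch instructions, the parameters above can be combined into a `sglang.launch_server` invocation like the one shown in the KT-Kernel README. The paths, port, and flag values below are illustrative assumptions and should be tuned to your hardware:

```bash
# Illustrative values only -- tune per the KT-Kernel parameter guidelines.
python -m sglang.launch_server \
  --model-path /path/to/MiniMax-M2.1 \
  --kt-method FP8 \
  --kt-weight-path /path/to/MiniMax-M2.1 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-gpu-prefill-token-threshold 1024 \
  --chunked-prefill-size 4096 \
  --max-total-tokens 32768 \
  --host 0.0.0.0 --port 30000
```

Note that `--kt-cpuinfer` should match physical cores and `--kt-threadpool-count` should match the NUMA node count of your machine.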

## Step 3: Send Inference Requests
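As a sketch, assuming the server from Step 2 is listening on `localhost:30000` and exposes SGLang's OpenAI-compatible endpoint, a chat completion request can be sent with `curl`:

```bash
# Assumes the SGLang server from Step 2 is running on localhost:30000.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Explain mixture-of-experts in one sentence."}],
    "max_tokens": 128
  }'
```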


## Performance

### Throughput (tokens/s)

The following benchmarks were measured with single concurrency (Prefill tps / Decode tps):

| GPU | CPU | PCIe | 2048 tokens | 8192 tokens | 32768 tokens |
|------------|-------------|-------------|-------------|-------------|--------------|
| 1 x RTX 4090 (24 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 129 / 21.8 | 669 / 20.9 | 1385 / 18.5 |
| 2 x RTX 4090 (24 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 139 / 23.6 | 1013 / 23.3 | 2269 / 21.6 |
| 1 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 408 / 32.1 | 1196 / 31.4 | 2540 / 27.6 |
| 2 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 414 / 34.3 | 1847 / 33.1 | 4007 / 31.8 |

Contributor comment (medium): The VRAM for an NVIDIA RTX 4090 is 24 GB, not 48 GB. This seems to be a typo in the performance table. Please correct the values for both the 1x and 2x RTX 4090 configurations to reflect the correct VRAM.

### Comparison with llama.cpp

We benchmarked KT-Kernel + SGLang against llama.cpp to demonstrate the performance advantages of our CPU-GPU heterogeneous inference approach.


<!-- TODO: Add prefill performance comparison chart -->
<!-- ![Prefill Performance Comparison](./images/minimax-m2.1-prefill-comparison.png) -->

| Input Length | llama.cpp (tokens/s) | KT-Kernel (tokens/s) | Speedup |
|--------------|----------------------|----------------------|---------|
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |

### Key Observations

<!-- TODO: Add key observations and analysis, e.g.:
- KT-Kernel achieves Xx speedup in prefill compared to llama.cpp
- Decode performance shows Xx improvement due to GPU expert caching
- Memory efficiency comparison
- Scalability with different batch sizes
-->

## Troubleshooting
<!-- TODO: -->

## Advanced Use Case: Running Claude Code with MiniMax-M2.1 Local Backend
Contributor comment (medium): There's a typo in the heading. "Casee" should be "Case".

<!-- TODO: -->
Contributor comment (medium):
This tutorial contains several TODO placeholders that need to be filled in to make it complete and useful for users. Please address the following:

  • Line 30: Specify the minimum RAM requirement.
  • Lines 63 & 67: Provide the kt-cli commands for downloading weights and launching the server.
  • Lines 104-111: Add the performance comparison chart and fill in the data in the comparison table.
  • Lines 115-120: Add the key observations and analysis of the performance benchmarks.
  • Line 123: Add content to the Troubleshooting section.
  • Line 126: Add content for the advanced use case.


## Additional Resources

- [KT-Kernel Documentation](../../kt-kernel/README.md)
- [SGLang GitHub](https://github.com/sgl-project/sglang)
- [KT-Kernel Parameters Reference](../../kt-kernel/README.md#kt-kernel-parameters)
12 changes: 7 additions & 5 deletions kt-kernel/README.md
@@ -38,6 +38,7 @@ High-performance kernel operations for KTransformers, featuring CPU-optimized Mo
- ✅ **Universal CPU (llamafile backend)**: Supported (using GGUF-format weights)
- ✅ **AMD CPUs with BLIS**: Supported (for int8 prefill & decode)
- ✅ **Kimi-K2 Native INT4 (RAWINT4)**: Supported on AVX512 CPUs (CPU-GPU shared INT4 weights) - [Guide](../doc/en/Kimi-K2-Thinking-Native.md)
- ✅ **FP8 weights (e.g., MiniMax-M2.1)**: Supported on AVX512 CPUs (CPU-GPU shared FP8 weights) - [Guide](../doc/en/MiniMax-M2.1-Tutorial.md)

## Features

@@ -361,20 +362,21 @@ python -m sglang.launch_server \

| Parameter | Description | Example Value |
|-----------|-------------|---------------|
| `--kt-method` | CPU inference backend method | `AMXINT4`, `AMXINT8`, `RAWINT4`, or `LLAMAFILE` |
| `--kt-method` | CPU inference backend method | `AMXINT4`, `AMXINT8`, `RAWINT4`, `FP8`, or `LLAMAFILE` |
| `--kt-weight-path` | Path to quantized CPU weights | `/path/to/cpu-weights` |
| `--kt-cpuinfer` | Number of CPU inference threads | `64` (adjust based on CPU cores) |
| `--kt-threadpool-count` | Number of thread pools for parallel execution | `2` (typically 1-4) |
| `--kt-num-gpu-experts` | Number of experts to keep on GPU | `32` (remaining experts go to CPU) |
| `--kt-max-deferred-experts-per-token` | Number of experts per token to defer for pipelined execution | `2` (0 to disable, 1-4 recommended) |
| `--kt-gpu-prefill-token-threshold` | Token count threshold for prefill strategy (RAWINT4 only) | ~`400` |
| `--kt-gpu-prefill-token-threshold` | Token count threshold for prefill strategy (FP8 and RAWINT4 only) | ~`1024` |

**Parameter Guidelines:**

- **`kt-method`**: Choose based on your CPU and weight format:
- `AMXINT4`: Best performance on AMX CPUs with INT4 quantized weights (May cause huge accuracy drop for some models, e.g., Qwen3-30B-A3B)
- `AMXINT8`: Higher accuracy with INT8 quantized weights on AMX CPUs
- `RAWINT4`: Native INT4 weights shared by CPU and GPU (AMX backend only, currently supports Kimi-K2-Thinking model). See [Kimi-K2-Thinking Native Tutorial](../doc/en/Kimi-K2-Thinking-Native.md) for details.
- `FP8`: FP8 weights shared by CPU and GPU
- `LLAMAFILE`: GGUF-based backend

- **`kt-cpuinfer`**: Set to the number of **physical CPU cores** (not hyperthreads).
@@ -400,10 +402,10 @@ python -m sglang.launch_server \
- `1-4`: Deferred execution (recommended range; good latency/quality balance, requires tuning)
- `5-7`: Highest latency reduction but may introduce noticeable accuracy loss; use with care

- **`kt-gpu-prefill-token-threshold`** (RAWINT4 only): Controls prefill strategy for native INT4 inference:
- **`kt-gpu-prefill-token-threshold`** (FP8 and RAWINT4 only): Controls prefill strategy for native FP8 and INT4 inference:
- **≤ threshold**: Uses hybrid CPU+GPU prefill. No extra VRAM needed, but performance degrades slowly as token count increases.
- **> threshold**: Uses layerwise GPU prefill. Performance scales better with longer sequences, but requires ~9GB+ extra VRAM.
- Only applicable when `--kt-method RAWINT4` is used. Currently supports Kimi-K2-Thinking model only.
- **> threshold**: Uses layerwise GPU prefill. Performance scales better with longer sequences, but requires extra VRAM equal to one MoE layer's weights (e.g., ~9 GB+ for Kimi-K2-Thinking and ~3.6 GB for MiniMax-M2.1).
- Only applicable when `--kt-method RAWINT4` or `--kt-method FP8` is used.

## Direct Python API Usage
