Kt minimax #1742 (Merged)
Commits (50; diff shows changes from 42):
- c24dbc3 (ouqingliang): [feat]: fp8 amx kernel init
- c36cda2 (ouqingliang): Fix forward_decode crash by initializing gate_up_ba_ for each activat…
- 8a3bed8 (ouqingliang): [fix](kt-kernel): fix moe.hpp huge mem request
- 7c2b56b (ouqingliang): [feat]: update awq moe
- cab5bcc (ouqingliang): delete fp8_moe
- 1264956 (ouqingliang): [fix](kt-kernel): fix load weight bug
- bc36a4b (ouqingliang): [feat](kt-kernel): fp8 native finished, not test.
- ebbd16a (ouqingliang): [feat](kt-kernel): support fp8 native, speed not test.
- 1d21372 (ouqingliang): [feat](kt-kernel): add FP8 MoE with new AVX kernel, benchmarks and op…
- 908a186 (ouqingliang): [chore](kt-kernel): merge main into moe_hpp
- 78fb03c (ouqingliang): [fix](kt-kernel): clamp exp input in act_fn to prevent float32 overflow
- daf4e47 (ErvinXie): kt cli 1.0
- 94a7f0a (ErvinXie): on going install
- edb1826 (ErvinXie): install
- ff6f657 (ouqingliang): [fix](kt-kernel): fix k2 write_buffer to per expert
- ebfdefb (ouqingliang): [feat](kt-kernel): support fp8 layerwise prefill write buffer.
- 92cad78 (ErvinXie): move kt cli to kt-kernel
- 403da4c (ErvinXie): move cli
- dab16cd (ErvinXie): change completion
- 7908469 (ouqingliang): [fix]: rename fp8, delete amx_raw_utils.hpp
- 74bb60f (ErvinXie): completions
- 8bf604e (ErvinXie): update model management
- 2d973e9 (ErvinXie): completion ok
- a8c0ee6 (ErvinXie): fix run
- 1174138 (ErvinXie): Merge branch 'moe_hpp' into kt-cli
- 00dc49b (ErvinXie): m2 run ok
- e91fdc3 (ErvinXie): kt chat ok
- 221ae90 (ErvinXie): fc ok
- a751024 (ErvinXie): print run command
- 57e904d (ErvinXie): auto compute
- 79d0299 (ErvinXie): calc
- ca5d8cc (ErvinXie): rm download
- 176d245 (ErvinXie): auto tp
- c146d35 (ErvinXie): model reg
- 100cdea (ErvinXie): fix cuda visible
- bd9c2d7 (ErvinXie): refact install
- 38085be (ErvinXie): rm install
- ce2f2cb (ErvinXie): block mb and sft
- b3aa7c7 (ErvinXie): [fix]: 123
- b43d52c (ouqingliang): [docs]: add FP8 and MiniMax-M2.1 support documentation
- c9d67d3 (ErvinXie): [feat]: detect sglang
- 798502e (ErvinXie): Merge branch 'kt-cli' into kt-minimax
- 363be79 (ouqingliang): [fix]: Update MiniMax-M2.1 tutorial.
- bdfc0a6 (ErvinXie): [docs]: 123
- 801d693 (ErvinXie): [docs]: rm python
- 33d068b (ouqingliang): [docs]: add compare llama.cpp
- d077169 (ouqingliang): Merge branch 'kt-minimax' of https://github.com/kvcache-ai/ktransform…
- 22a0354 (ouqingliang): [docs]: add readme
- 27db578 (ouqingliang): [docs]: fix tutorial.
- 35b7cfa (ouqingliang): Merge branch 'main' into kt-minimax
# Running MiniMax-M2.1 with Native Precision using SGLang and KT-Kernel

This tutorial demonstrates how to run MiniMax-M2.1 model inference using SGLang integrated with KT-Kernel. MiniMax-M2.1 provides native FP8 weights, enabling efficient GPU inference with a reduced memory footprint while maintaining high accuracy.

## Table of Contents

- [Overview](#overview)
- [Hardware Requirements](#hardware-requirements)
- [Prerequisites](#prerequisites)
- [Step 1: Download Model Weights](#step-1-download-model-weights)
- [Step 2: Launch SGLang Server](#step-2-launch-sglang-server)
- [Step 3: Send Inference Requests](#step-3-send-inference-requests)
- [Performance](#performance)
- [Troubleshooting](#troubleshooting)
## Overview

MiniMax-M2.1 is a large MoE (Mixture of Experts) model that provides native FP8 weights. This tutorial uses KT-Kernel's FP8 support to enable CPU-GPU heterogeneous inference:

- **FP8 GPU Inference**: Native FP8 precision for GPU-side computation, providing both memory efficiency and computational accuracy
- **CPU-GPU Heterogeneous Architecture**:
  - Hot experts and attention modules run on GPU with FP8 precision
  - Cold experts are offloaded to CPU for memory efficiency
## Hardware Requirements

**Minimum Configuration:**

- **GPU**: NVIDIA RTX 4090 24 GB (or equivalent with at least 24 GB of VRAM available)
- **CPU**: x86 CPU with AVX512 support (e.g., Intel Sapphire Rapids, AMD EPYC)
- **RAM**: At least <!-- TODO: RAM requirement --> GB of system memory
- **Storage**: 220 GB for model weights (the same weight directory is used for GPU and CPU)

**Tested Configuration:**

- **GPU**: 1 or 2 x NVIDIA GeForce RTX 5090 (32 GB)
- **CPU**: 2 x AMD EPYC 9355 32-Core Processor (128 threads)
- **RAM**: 1 TB DDR5 5600 MT/s ECC
- **OS**: Linux (Ubuntu 20.04+ recommended)
## Prerequisites

Before starting, ensure you have:

1. **SGLang installed** - Follow the [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
2. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)

   Note: Currently, please clone our custom SGLang repository:

   ```bash
   git clone https://github.com/kvcache-ai/sglang.git
   cd sglang
   pip install -e "python[all]"
   ```

3. **CUDA toolkit** - CUDA 12.0+ recommended for FP8 support
4. **Hugging Face CLI** - For downloading models:

   ```bash
   pip install -U huggingface-hub
   ```
## Step 1: Download Model Weights

<!-- TODO: using kt-cli -->
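Until the kt-cli flow above is documented, the weights can be fetched directly with the Hugging Face CLI installed in the Prerequisites. This is a sketch only: the repository ID and local directory below are illustrative assumptions, so verify the exact model repo name on Hugging Face before running.

```shell
# NOTE: "MiniMaxAI/MiniMax-M2.1" is a hypothetical repo ID used for
# illustration; check the official MiniMax organization for the real one.
# Expect roughly 220 GB of downloads (see Hardware Requirements).
huggingface-cli download MiniMaxAI/MiniMax-M2.1 \
  --local-dir /path/to/MiniMax-M2.1
```

The same `--local-dir` path is later passed to the server as the model path, since GPU and CPU share one weight directory.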
## Step 2: Launch SGLang Server

<!-- TODO: using kt-cli -->

See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter tuning guidelines.
### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--kt-method FP8` | Enable FP8 inference mode for MiniMax-M2.1 native FP8 weights. |
| `--kt-cpuinfer` | Number of CPU inference threads. Set to physical CPU cores (not hyperthreads). |
| `--kt-threadpool-count` | Number of thread pools. Set to the NUMA node count. |
| `--kt-num-gpu-experts` | Number of experts kept on GPU for decoding. |
| `--chunked-prefill-size` | Maximum tokens per prefill batch. |
| `--max-total-tokens` | Maximum total tokens in the KV cache. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for the layerwise prefill strategy. |
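To show how these flags fit together, a launch command might look like the sketch below. The model path, port, and all numeric values are placeholders chosen for illustration, not recommended settings; tune them per the table above and the linked parameter guidelines.

```shell
# Illustrative values only:
#   --kt-cpuinfer        = physical CPU cores (not hyperthreads)
#   --kt-threadpool-count = number of NUMA nodes
# Model path and port are placeholders for your environment.
python -m sglang.launch_server \
  --model-path /path/to/MiniMax-M2.1 \
  --kt-method FP8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --chunked-prefill-size 8192 \
  --max-total-tokens 32768 \
  --kt-gpu-prefill-token-threshold 2048 \
  --port 30000
```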
## Step 3: Send Inference Requests
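This step is still being filled in. As a minimal sketch: SGLang exposes an OpenAI-compatible HTTP API, so once the server from Step 2 is up, a chat request can be sent with `curl` (the port and model name below are assumptions matching the placeholders used earlier):

```shell
# Assumes the server is listening on localhost:30000 (see Step 2);
# "MiniMax-M2.1" is a placeholder model name.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.1",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 128
  }'
```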
## Performance

### Throughput (tokens/s)

The following benchmarks were measured at single concurrency (prefill tps / decode tps):

| GPU | CPU | PCIe | 2048 tokens | 8192 tokens | 32768 tokens |
|------------|-------------|-------------|-------------|-------------|--------------|
| 1 x RTX 4090 (24 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 129 / 21.8 | 669 / 20.9 | 1385 / 18.5 |
| 2 x RTX 4090 (24 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 139 / 23.6 | 1013 / 23.3 | 2269 / 21.6 |
| 1 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 408 / 32.1 | 1196 / 31.4 | 2540 / 27.6 |
| 2 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 414 / 34.3 | 1847 / 33.1 | 4007 / 31.8 |
### Comparison with llama.cpp

We benchmarked KT-Kernel + SGLang against llama.cpp to demonstrate the performance advantages of our CPU-GPU heterogeneous inference approach.

<!-- TODO: Add prefill performance comparison chart -->
<!--  -->

| Input Length | llama.cpp (tokens/s) | KT-Kernel (tokens/s) | Speedup |
|--------------|----------------------|----------------------|---------|
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
### Key Observations

<!-- TODO: Add key observations and analysis, e.g.:
- KT-Kernel achieves Xx speedup in prefill compared to llama.cpp
- Decode performance shows Xx improvement due to GPU expert caching
- Memory efficiency comparison
- Scalability with different batch sizes
-->
## Troubleshooting

<!-- TODO: -->

## Advanced Use Case: Running Claude Code with MiniMax-M2.1 Local Backend

<!-- TODO: -->
## Additional Resources

- [KT-Kernel Documentation](../../kt-kernel/README.md)
- [SGLang GitHub](https://github.com/sgl-project/sglang)
- [KT-Kernel Parameters Reference](../../kt-kernel/README.md#kt-kernel-parameters)
Review comment: The VRAM for an NVIDIA RTX 4090 is 24 GB, not 48 GB. This seems to be a typo in the performance table. Please correct the values for both the 1x and 2x RTX 4090 configurations to reflect the correct VRAM.