
Kt minimax #1742

Merged
ouqingliang merged 50 commits into main from kt-minimax
Dec 24, 2025

Conversation


@ErvinXie ErvinXie commented Dec 24, 2025

What does this PR do?

Enable FP8 support for MiniMax M2 and the DeepSeek series, and add the kt CLI.

ouqingliang and others added 30 commits December 9, 2025 13:38
…ed expert

The original code only initialized gate_up_ba_[0], but do_gate_up_gemm uses
gate_up_ba_[expert_idx] where expert_idx comes from m_expert_id_map_ and can
be any expert ID, causing access to uninitialized buffers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…timizations

- Add new AVX-512 FP8 MoE kernel implementation
- Add bench_fp8_moe.py and bench_fp8_write_buffer.py for performance testing
- Add test_fp8_write_buffer.py for correctness validation
- Add exp overflow protection with clamping in act_fn (amx.hpp)
- Add work stealing optimization for merge_results (moe_base.hpp)
- Add auto-detection for MoE naming formats (deepseek/mixtral) in loader.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts:
- ext_bindings.cpp: Keep FP8 MoE write_weight_scale_to_buffer binding
- k2-moe.hpp: Keep CRTP refactored version using moe_base.hpp

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add clamping for neg_gate_val before exp_avx512() to avoid overflow
when exp input exceeds 88 (the limit for float32).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @ErvinXie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the KT-Kernel's capabilities by introducing native FP8 support, specifically targeting models like Minimax M2/M2.1 and DeepSeek V3.2 for more efficient inference. It also delivers a brand new, user-friendly command-line interface (kt) to simplify model management, server execution, and various development tasks. Underlying architectural improvements include a refactored MoE operator base class and optimized weight handling for better performance and maintainability.

Highlights

  • FP8 Support: Introduced comprehensive support for FP8 (8-bit floating point) weights, enabling more memory-efficient and potentially faster inference for models like DeepSeek V3.2 and MiniMax M2/M2.1. This includes new FP8-specific buffers, kernels, and weight loading mechanisms.
  • Minimax M2.1 Tutorial: Added a new tutorial document (doc/en/MiniMax-M2.1-Tutorial.md) detailing how to run the MiniMax-M2.1 model with native FP8 precision using SGLang and KT-Kernel, including hardware requirements, prerequisites, and performance benchmarks.
  • DeepSeek Series Integration: Enhanced support for DeepSeek models, particularly DeepSeek V3.2, by leveraging the new FP8 capabilities for optimized inference. This is reflected in the model registry and default parameters within the new CLI.
  • New Command-Line Interface (CLI): Implemented a new kt command-line interface to streamline common workflows such as running inference servers, chatting with models, quantizing weights, benchmarking, diagnosing environment issues, and managing models/configurations.
  • MoE Operator Refactoring: Refactored the AMX MoE operators (awq-moe.hpp, k2-moe.hpp, moe.hpp) to inherit from a new moe_base.hpp class, centralizing common logic and improving code maintainability and extensibility.
  • Optimized Weight Unpacking: Introduced optimized write_weights_to_buffer functions for RAWINT4 and FP8 models, utilizing AVX512 instructions and coarse-grained task splitting for efficient unpacking of weights to GPU buffers, improving memory bandwidth.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant new functionality, including FP8 support for Minimax M2 and DeepSeek models, a comprehensive command-line interface (kt-cli), and major refactoring of the C++ MoE operator code. The changes are extensive and well-structured. The addition of FP8 support is a great performance enhancement. The new CLI will significantly improve usability. The refactoring of the C++ code into a common base class is a major improvement for maintainability. My review focuses on some documentation placeholders, a potential issue in a benchmark script, and a code duplication in the C++ bindings.

Comment on lines +30 to +126
- **RAM**: At least <!-- TODO: RAM requirement -->GB system memory
- **Storage**: 220 GB for model weights (same weight dir for GPU and CPU)

**Tested Configuration:**

- **GPU**: 1/2 x NVIDIA GeForce RTX 5090 (32 GB)
- **CPU**: 2 x AMD EPYC 9355 32-Core Processor (128 threads)
- **RAM**: 1TB DDR5 5600MT/s ECC
- **OS**: Linux (Ubuntu 20.04+ recommended)

## Prerequisites

Before starting, ensure you have:

1. **SGLang installed** - Follow [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
2. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)

Note: Currently, please clone our custom SGLang repository:

```bash
git clone https://github.com/kvcache-ai/sglang.git
cd sglang
pip install -e "python[all]"
```

3. **CUDA toolkit** - CUDA 12.0+ recommended for FP8 support
4. **Hugging Face CLI** - For downloading models:
```bash
pip install -U huggingface-hub
```

## Step 1: Download Model Weights

<!-- TODO: using kt-cli -->
## Step 2: Launch SGLang Server


<!-- TODO: using kt-cli -->

See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter tuning guidelines.

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--kt-method FP8` | Enable FP8 inference mode for MiniMax-M2.1 native FP8 weights. |
| `--kt-cpuinfer` | Number of CPU inference threads. Set to physical CPU cores (not hyperthreads). |
| `--kt-threadpool-count` | Number of thread pools. Set to NUMA node count. |
| `--kt-num-gpu-experts` | Number of experts kept on GPU for decoding. |
| `--chunked-prefill-size` | Maximum tokens per prefill batch. |
| `--max-total-tokens` | Maximum total tokens in KV cache. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for layerwise prefill strategy. |

## Step 3: Send Inference Requests


## Performance

### Throughput (tokens/s)

The following benchmarks were measured with single concurrency (Prefill tps / Decode tps):

| GPU | CPU | PCIe | 2048 tokens | 8192 tokens | 32768 tokens |
|------------|-------------|-------------|-------------|-------------|--------------|
| 1 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C| PCIe 4.0 | 129 / 21.8 | 669 / 20.9 | 1385 / 18.5 |
| 2 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C| PCIe 4.0 | 139 / 23.6 | 1013 / 23.3 | 2269 / 21.6 |
| 1 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 408 / 32.1 | 1196 / 31.4 | 2540 / 27.6 |
| 2 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 414 / 34.3 | 1847 / 33.1 | 4007 / 31.8 |

### Comparison with llama.cpp

We benchmarked KT-Kernel + SGLang against llama.cpp to demonstrate the performance advantages of our CPU-GPU heterogeneous inference approach.


<!-- TODO: Add prefill performance comparison chart -->
<!-- ![Prefill Performance Comparison](./images/minimax-m2.1-prefill-comparison.png) -->

| Input Length | llama.cpp (tokens/s) | KT-Kernel (tokens/s) | Speedup |
|--------------|----------------------|----------------------|---------|
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |

### Key Observations

<!-- TODO: Add key observations and analysis, e.g.:
- KT-Kernel achieves Xx speedup in prefill compared to llama.cpp
- Decode performance shows Xx improvement due to GPU expert caching
- Memory efficiency comparison
- Scalability with different batch sizes
-->

## Troubleshooting
<!-- TODO: -->

## Advance Use Casee: Running Claude Code with MiniMax-M2.1 Local Backend
<!-- TODO: -->


medium

This tutorial contains several TODO placeholders that need to be filled in to make it complete and useful for users. Please address the following:

  • Line 30: Specify the minimum RAM requirement.
  • Lines 63 & 67: Provide the kt-cli commands for downloading weights and launching the server.
  • Lines 104-111: Add the performance comparison chart and fill in the data in the comparison table.
  • Lines 115-120: Add the key observations and analysis of the performance benchmarks.
  • Line 123: Add content to the Troubleshooting section.
  • Line 126: Add content for the advanced use case.


| GPU | CPU | PCIe | 2048 tokens | 8192 tokens | 32768 tokens |
|------------|-------------|-------------|-------------|-------------|--------------|
| 1 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C| PCIe 4.0 | 129 / 21.8 | 669 / 20.9 | 1385 / 18.5 |


medium

The VRAM for an NVIDIA RTX 4090 is 24 GB, not 48 GB. This seems to be a typo in the performance table. Please correct the values for both the 1x and 2x RTX 4090 configurations to reflect the correct VRAM.

## Troubleshooting
<!-- TODO: -->

## Advance Use Casee: Running Claude Code with MiniMax-M2.1 Local Backend


medium

There's a typo in the heading. "Casee" should be "Case".

Suggested change
## Advance Use Casee: Running Claude Code with MiniMax-M2.1 Local Backend
## Advance Use Case: Running Claude Code with MiniMax-M2.1 Local Backend

Comment on lines +231 to +241
bandwidth = (
hidden_size
* intermediate_size
* 3
* num_experts_per_tok
* (1 / num_experts_per_tok * expert_num * (1 - (1 - num_experts_per_tok / expert_num) ** qlen))
* bytes_per_elem
* test_iter
/ total_time
/ 1e9
) # Unit: GB/s


medium

The bandwidth calculation appears to be using a complex and potentially incorrect formula. The term (1 / num_experts_per_tok * ...) is unusual. The standard formula for the expected number of unique experts accessed is expert_num * (1 - (1 - p)^qlen) where p = num_experts_per_tok / expert_num.

For a benchmark, it might be clearer and more standard to calculate bandwidth based on the total data processed per iteration, which would be qlen * num_experts_per_tok * (hidden_size * intermediate_size * 3) * bytes_per_elem, assuming no expert caching between iterations. Could you please review this formula?

Comment on lines 291 to 338
moe_cls.def("write_weight_scale_to_buffer_task", &WriteWeightScaleToBufferBindings::cpuinfer_interface,
py::arg("gpu_tp_count"), py::arg("gpu_experts_num"), py::arg("w13_weight_ptrs"),
py::arg("w13_scale_ptrs"), py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
py::arg("gpu_tp_count"), py::arg("expert_id"), py::arg("w13_weight_ptrs"), py::arg("w13_scale_ptrs"),
py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
}

// FP8 MoE: processes one expert at a time (expert_id instead of gpu_experts_num)
if constexpr (std::is_same_v<MoeTP, AMX_FP8_MOE_TP<amx::GemmKernel224FP8>>) {
struct WriteWeightScaleToBufferBindings {
struct Args {
CPUInfer* cpuinfer;
MoeClass* moe;
int gpu_tp_count;
int expert_id;
std::vector<uintptr_t> w13_weight_ptrs;
std::vector<uintptr_t> w13_scale_ptrs;
std::vector<uintptr_t> w2_weight_ptrs;
std::vector<uintptr_t> w2_scale_ptrs;
};

static void inner(void* args) {
Args* args_ = (Args*)args;
args_->cpuinfer->enqueue(&MoeClass::write_weight_scale_to_buffer, args_->moe, args_->gpu_tp_count,
args_->expert_id, args_->w13_weight_ptrs, args_->w13_scale_ptrs, args_->w2_weight_ptrs,
args_->w2_scale_ptrs);
}

static std::pair<intptr_t, intptr_t> cpuinfer_interface(std::shared_ptr<MoeClass> moe, int gpu_tp_count,
int expert_id, py::list w13_weight_ptrs,
py::list w13_scale_ptrs, py::list w2_weight_ptrs,
py::list w2_scale_ptrs) {
// Convert Python lists to std::vector<uintptr_t>
std::vector<uintptr_t> w13_weight_vec, w13_scale_vec, w2_weight_vec, w2_scale_vec;

for (auto item : w13_weight_ptrs) w13_weight_vec.push_back(py::cast<uintptr_t>(item));
for (auto item : w13_scale_ptrs) w13_scale_vec.push_back(py::cast<uintptr_t>(item));
for (auto item : w2_weight_ptrs) w2_weight_vec.push_back(py::cast<uintptr_t>(item));
for (auto item : w2_scale_ptrs) w2_scale_vec.push_back(py::cast<uintptr_t>(item));

Args* args = new Args{nullptr, moe.get(), gpu_tp_count, expert_id,
w13_weight_vec, w13_scale_vec, w2_weight_vec, w2_scale_vec};
return std::make_pair((intptr_t)&inner, (intptr_t)args);
}
};

moe_cls.def("write_weight_scale_to_buffer_task", &WriteWeightScaleToBufferBindings::cpuinfer_interface,
py::arg("gpu_tp_count"), py::arg("expert_id"), py::arg("w13_weight_ptrs"), py::arg("w13_scale_ptrs"),
py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
}


medium

The binding logic for write_weight_scale_to_buffer_task is duplicated for AMX_K2_MOE_TP and AMX_FP8_MOE_TP. To improve maintainability and reduce code duplication, consider refactoring this into a single template or using SFINAE / if constexpr with requires (C++20) to create a single binding for all classes that support this task.

@ouqingliang ouqingliang merged commit d8046e1 into main Dec 24, 2025
10 of 12 checks passed
@ouqingliang ouqingliang deleted the kt-minimax branch December 29, 2025 07:20