Conversation
…ed expert

The original code only initialized `gate_up_ba_[0]`, but `do_gate_up_gemm` uses `gate_up_ba_[expert_idx]`, where `expert_idx` comes from `m_expert_id_map_` and can be any expert ID, causing access to uninitialized buffers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
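As a sketch of the fix described in this commit: sizing the per-expert buffer array by the total expert count guarantees that any `expert_idx` taken from the id map resolves to initialized storage. The class and member names below are illustrative stand-ins, not the actual kt-kernel types.

```cpp
#include <cstddef>
#include <vector>

// Allocate gate_up buffers for every expert up front, instead of only
// element 0, so lookups via an arbitrary expert_idx are always valid.
struct GateUpBuffers {
    std::vector<std::vector<float>> gate_up_ba_;

    GateUpBuffers(int num_experts, std::size_t buf_elems)
        : gate_up_ba_(num_experts, std::vector<float>(buf_elems, 0.0f)) {}

    // at() bounds-checks, so a bad expert_idx throws instead of reading
    // uninitialized or out-of-range memory.
    float* buffer_for(int expert_idx) { return gate_up_ba_.at(expert_idx).data(); }
};
```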
…timizations

- Add new AVX-512 FP8 MoE kernel implementation
- Add bench_fp8_moe.py and bench_fp8_write_buffer.py for performance testing
- Add test_fp8_write_buffer.py for correctness validation
- Add exp overflow protection with clamping in act_fn (amx.hpp)
- Add work-stealing optimization for merge_results (moe_base.hpp)
- Add auto-detection for MoE naming formats (deepseek/mixtral) in loader.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts:

- ext_bindings.cpp: Keep FP8 MoE write_weight_scale_to_buffer binding
- k2-moe.hpp: Keep CRTP-refactored version using moe_base.hpp

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add clamping for neg_gate_val before exp_avx512() to avoid overflow when the exp input exceeds ~88.7, the largest input for which float32 exp stays finite.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
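A scalar sketch of this guard, assuming 88 is chosen as a safe clamp just below the true float32 limit of ~88.72 (ln of FLT_MAX); the real code applies the same idea vectorized before `exp_avx512()`:

```cpp
#include <algorithm>
#include <cmath>

// exp(x) in float32 overflows to +inf once x exceeds ~88.72, so clamp
// the input first; 88.0f leaves a small safety margin.
constexpr float kExpInputMax = 88.0f;

inline float safe_expf(float x) {
    return std::exp(std::min(x, kExpInputMax));
}
```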
Summary of Changes

Hello @ErvinXie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances KT-Kernel's capabilities by introducing native FP8 support, specifically targeting models like Minimax M2/M2.1 and DeepSeek V3.2 for more efficient inference. It also delivers a brand-new, user-friendly command-line interface (kt-cli).

Highlights
Code Review
This pull request introduces significant new functionality, including FP8 support for Minimax M2 and DeepSeek models, a comprehensive command-line interface (kt-cli), and major refactoring of the C++ MoE operator code. The changes are extensive and well-structured. The addition of FP8 support is a great performance enhancement. The new CLI will significantly improve usability. The refactoring of the C++ code into a common base class is a major improvement for maintainability. My review focuses on some documentation placeholders, a potential issue in a benchmark script, and a code duplication in the C++ bindings.
doc/en/MiniMax-M2.1-Tutorial.md
Outdated

- **RAM**: At least <!-- TODO: RAM requirement -->GB system memory
- **Storage**: 220 GB for model weights (same weight dir for GPU and CPU)

**Tested Configuration:**

- **GPU**: 1/2 x NVIDIA GeForce RTX 5090 (32 GB)
- **CPU**: 2 x AMD EPYC 9355 32-Core Processor (128 threads)
- **RAM**: 1 TB DDR5 5600 MT/s ECC
- **OS**: Linux (Ubuntu 20.04+ recommended)

## Prerequisites

Before starting, ensure you have:

1. **SGLang installed** - Follow the [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
2. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)

   Note: Currently, please clone our custom SGLang repository:

   ```bash
   git clone https://github.com/kvcache-ai/sglang.git
   cd sglang
   pip install -e "python[all]"
   ```

3. **CUDA toolkit** - CUDA 12.0+ recommended for FP8 support
4. **Hugging Face CLI** - For downloading models:

   ```bash
   pip install -U huggingface-hub
   ```

## Step 1: Download Model Weights

<!-- TODO: using kt-cli -->

## Step 2: Launch SGLang Server

<!-- TODO: using kt-cli -->

See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter-tuning guidelines.

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--kt-method FP8` | Enable FP8 inference mode for MiniMax-M2.1 native FP8 weights. |
| `--kt-cpuinfer` | Number of CPU inference threads. Set to physical CPU cores (not hyperthreads). |
| `--kt-threadpool-count` | Number of thread pools. Set to the NUMA node count. |
| `--kt-num-gpu-experts` | Number of experts kept on GPU for decoding. |
| `--chunked-prefill-size` | Maximum tokens per prefill batch. |
| `--max-total-tokens` | Maximum total tokens in the KV cache. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for the layerwise prefill strategy. |

## Step 3: Send Inference Requests

## Performance

### Throughput (tokens/s)

The following benchmarks were measured with single concurrency (prefill tps / decode tps):

| GPU | CPU | PCIe | 2048 tokens | 8192 tokens | 32768 tokens |
|------------|-------------|------|-------------|-------------|--------------|
| 1 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 129 / 21.8 | 669 / 20.9 | 1385 / 18.5 |
| 2 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 139 / 23.6 | 1013 / 23.3 | 2269 / 21.6 |
| 1 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 408 / 32.1 | 1196 / 31.4 | 2540 / 27.6 |
| 2 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 414 / 34.3 | 1847 / 33.1 | 4007 / 31.8 |

### Comparison with llama.cpp

We benchmarked KT-Kernel + SGLang against llama.cpp to demonstrate the performance advantages of our CPU-GPU heterogeneous inference approach.

<!-- TODO: Add prefill performance comparison chart -->
<!--  -->

| Input Length | llama.cpp (tokens/s) | KT-Kernel (tokens/s) | Speedup |
|--------------|----------------------|----------------------|---------|
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |

### Key Observations

<!-- TODO: Add key observations and analysis, e.g.:
- KT-Kernel achieves Xx speedup in prefill compared to llama.cpp
- Decode performance shows Xx improvement due to GPU expert caching
- Memory efficiency comparison
- Scalability with different batch sizes
-->

## Troubleshooting
<!-- TODO: -->

## Advanced Use Case: Running Claude Code with MiniMax-M2.1 Local Backend
<!-- TODO: -->
This tutorial contains several TODO placeholders that need to be filled in to make it complete and useful for users. Please address the following:

- Line 30: Specify the minimum RAM requirement.
- Lines 63 & 67: Provide the kt-cli commands for downloading weights and launching the server.
- Lines 104-111: Add the performance comparison chart and fill in the data in the comparison table.
- Lines 115-120: Add the key observations and analysis of the performance benchmarks.
- Line 123: Add content to the Troubleshooting section.
- Line 126: Add content for the advanced use case.
doc/en/MiniMax-M2.1-Tutorial.md
Outdated

## Troubleshooting
<!-- TODO: -->

## Advanced Use Case: Running Claude Code with MiniMax-M2.1 Local Backend
```python
bandwidth = (
    hidden_size
    * intermediate_size
    * 3
    * num_experts_per_tok
    * (1 / num_experts_per_tok * expert_num * (1 - (1 - num_experts_per_tok / expert_num) ** qlen))
    * bytes_per_elem
    * test_iter
    / total_time
    / 1e9
)  # unit: GB/s
```
The bandwidth calculation appears to use a complex and potentially incorrect formula. The term `(1 / num_experts_per_tok * ...)` is unusual. The standard formula for the expected number of unique experts accessed is `expert_num * (1 - (1 - p) ** qlen)`, where `p = num_experts_per_tok / expert_num`.
For a benchmark, it might be clearer and more standard to calculate bandwidth from the total data processed per iteration, which would be `qlen * num_experts_per_tok * (hidden_size * intermediate_size * 3) * bytes_per_elem`, assuming no expert caching between iterations. Could you please review this formula?
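For reference, the standard expected-unique-experts expression the review cites can be written as follows. Note that in the PR's formula the leading `num_experts_per_tok` and inner `1 / num_experts_per_tok` factors cancel, so the two expressions are algebraically equal; the concern is readability, not a numeric difference.

```cpp
#include <cmath>

// Expected number of distinct experts touched by qlen tokens under
// independent uniform top-k routing: E * (1 - (1 - k/E)^qlen),
// where E = expert_num and k = num_experts_per_tok.
double expected_unique_experts(int expert_num, int num_experts_per_tok, int qlen) {
    double p = static_cast<double>(num_experts_per_tok) / expert_num;
    return expert_num * (1.0 - std::pow(1.0 - p, qlen));
}
```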
```cpp
        moe_cls.def("write_weight_scale_to_buffer_task", &WriteWeightScaleToBufferBindings::cpuinfer_interface,
                    py::arg("gpu_tp_count"), py::arg("gpu_experts_num"), py::arg("w13_weight_ptrs"),
                    py::arg("w13_scale_ptrs"), py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
    }

    // FP8 MoE: processes one expert at a time (expert_id instead of gpu_experts_num)
    if constexpr (std::is_same_v<MoeTP, AMX_FP8_MOE_TP<amx::GemmKernel224FP8>>) {
        struct WriteWeightScaleToBufferBindings {
            struct Args {
                CPUInfer* cpuinfer;
                MoeClass* moe;
                int gpu_tp_count;
                int expert_id;
                std::vector<uintptr_t> w13_weight_ptrs;
                std::vector<uintptr_t> w13_scale_ptrs;
                std::vector<uintptr_t> w2_weight_ptrs;
                std::vector<uintptr_t> w2_scale_ptrs;
            };

            static void inner(void* args) {
                Args* args_ = (Args*)args;
                args_->cpuinfer->enqueue(&MoeClass::write_weight_scale_to_buffer, args_->moe, args_->gpu_tp_count,
                                         args_->expert_id, args_->w13_weight_ptrs, args_->w13_scale_ptrs,
                                         args_->w2_weight_ptrs, args_->w2_scale_ptrs);
            }

            static std::pair<intptr_t, intptr_t> cpuinfer_interface(std::shared_ptr<MoeClass> moe, int gpu_tp_count,
                                                                    int expert_id, py::list w13_weight_ptrs,
                                                                    py::list w13_scale_ptrs, py::list w2_weight_ptrs,
                                                                    py::list w2_scale_ptrs) {
                // Convert Python lists to std::vector<uintptr_t>
                std::vector<uintptr_t> w13_weight_vec, w13_scale_vec, w2_weight_vec, w2_scale_vec;

                for (auto item : w13_weight_ptrs) w13_weight_vec.push_back(py::cast<uintptr_t>(item));
                for (auto item : w13_scale_ptrs) w13_scale_vec.push_back(py::cast<uintptr_t>(item));
                for (auto item : w2_weight_ptrs) w2_weight_vec.push_back(py::cast<uintptr_t>(item));
                for (auto item : w2_scale_ptrs) w2_scale_vec.push_back(py::cast<uintptr_t>(item));

                Args* args = new Args{nullptr, moe.get(), gpu_tp_count, expert_id,
                                      w13_weight_vec, w13_scale_vec, w2_weight_vec, w2_scale_vec};
                return std::make_pair((intptr_t)&inner, (intptr_t)args);
            }
        };

        moe_cls.def("write_weight_scale_to_buffer_task", &WriteWeightScaleToBufferBindings::cpuinfer_interface,
                    py::arg("gpu_tp_count"), py::arg("expert_id"), py::arg("w13_weight_ptrs"),
                    py::arg("w13_scale_ptrs"), py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
    }
```
The binding logic for write_weight_scale_to_buffer_task is duplicated for AMX_K2_MOE_TP and AMX_FP8_MOE_TP. To improve maintainability and reduce code duplication, consider refactoring this into a single templated binding, using `if constexpr` with a detection trait (or a C++20 `requires` clause) so one block registers the task for every class that supports it.
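A minimal sketch of the suggested dedup, using the C++17 detection idiom in place of a C++20 `requires` clause. The type names are illustrative stand-ins for the real MoE TP classes, and the actual `moe_cls.def(...)` call is replaced by a boolean return so the pattern is self-contained.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the real kt-kernel MoE TP classes.
struct K2Moe  { void write_weight_scale_to_buffer(int) {} };
struct Fp8Moe { void write_weight_scale_to_buffer(int) {} };
struct PlainMoe {};  // no write_weight_scale_to_buffer

// C++17 detection idiom: does T expose write_weight_scale_to_buffer?
template <typename, typename = void>
struct has_write_weight_scale : std::false_type {};

template <typename T>
struct has_write_weight_scale<
    T, std::void_t<decltype(std::declval<T&>().write_weight_scale_to_buffer(0))>>
    : std::true_type {};

// One registration helper replacing the duplicated per-class binding blocks;
// in the real bindings the true branch would issue the moe_cls.def(...) call.
template <typename MoeClass>
bool register_write_weight_scale_task() {
    if constexpr (has_write_weight_scale<MoeClass>::value) {
        return true;   // binding registered
    } else {
        return false;  // class has no such task; nothing to bind
    }
}
```

The same helper then serves every current and future MoE class, so adding a third FP8 variant would not require copying the binding block again.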
What does this PR do?
Enable FP8 support for Minimax M2 and the DeepSeek series, and add the kt-cli command-line interface.