Conversation
…ed expert

The original code only initialized `gate_up_ba_[0]`, but `do_gate_up_gemm` uses `gate_up_ba_[expert_idx]`, where `expert_idx` comes from `m_expert_id_map_` and can be any expert ID, causing access to uninitialized buffers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
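As a sketch of the fix described in this commit: sizing the per-expert buffer array by the total expert count guarantees that any `expert_idx` taken from the id map resolves to initialized storage. The class and member names below are illustrative stand-ins, not the actual kt-kernel types.

```cpp
#include <cstddef>
#include <vector>

// Allocate gate_up buffers for every expert up front, instead of only
// element 0, so lookups via an arbitrary expert_idx are always valid.
struct GateUpBuffers {
    std::vector<std::vector<float>> gate_up_ba_;

    GateUpBuffers(int num_experts, std::size_t buf_elems)
        : gate_up_ba_(num_experts, std::vector<float>(buf_elems, 0.0f)) {}

    // at() bounds-checks, so a bad expert_idx throws instead of reading
    // uninitialized or out-of-range memory.
    float* buffer_for(int expert_idx) { return gate_up_ba_.at(expert_idx).data(); }
};
```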
…timizations

- Add new AVX-512 FP8 MoE kernel implementation
- Add bench_fp8_moe.py and bench_fp8_write_buffer.py for performance testing
- Add test_fp8_write_buffer.py for correctness validation
- Add exp overflow protection with clamping in act_fn (amx.hpp)
- Add work-stealing optimization for merge_results (moe_base.hpp)
- Add auto-detection for MoE naming formats (deepseek/mixtral) in loader.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts:

- ext_bindings.cpp: Keep FP8 MoE write_weight_scale_to_buffer binding
- k2-moe.hpp: Keep CRTP-refactored version using moe_base.hpp

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add clamping for neg_gate_val before exp_avx512() to avoid overflow when the exp input exceeds ~88.7, the largest input for which float32 exp stays finite.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
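A scalar sketch of this guard, assuming 88 is chosen as a safe clamp just below the true float32 limit of ~88.72 (ln of FLT_MAX); the real code applies the same idea vectorized before `exp_avx512()`:

```cpp
#include <algorithm>
#include <cmath>

// exp(x) in float32 overflows to +inf once x exceeds ~88.72, so clamp
// the input first; 88.0f leaves a small safety margin.
constexpr float kExpInputMax = 88.0f;

inline float safe_expf(float x) {
    return std::exp(std::min(x, kExpInputMax));
}
```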
Summary of Changes

Hello @ErvinXie, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances KT-Kernel's capabilities by introducing native FP8 support, specifically targeting models like Minimax M2/M2.1 and DeepSeek V3.2 for more efficient inference. It also delivers a brand-new, user-friendly command-line interface (kt-cli).

Highlights
Code Review
This pull request introduces significant new functionality, including FP8 support for Minimax M2 and DeepSeek models, a comprehensive command-line interface (kt-cli), and major refactoring of the C++ MoE operator code. The changes are extensive and well-structured. The addition of FP8 support is a great performance enhancement. The new CLI will significantly improve usability. The refactoring of the C++ code into a common base class is a major improvement for maintainability. My review focuses on some documentation placeholders, a potential issue in a benchmark script, and a code duplication in the C++ bindings.
doc/en/MiniMax-M2.1-Tutorial.md
Outdated

- **RAM**: At least <!-- TODO: RAM requirement -->GB system memory
- **Storage**: 220 GB for model weights (same weight dir for GPU and CPU)

**Tested Configuration:**

- **GPU**: 1/2 x NVIDIA GeForce RTX 5090 (32 GB)
- **CPU**: 2 x AMD EPYC 9355 32-Core Processor (128 threads)
- **RAM**: 1 TB DDR5 5600 MT/s ECC
- **OS**: Linux (Ubuntu 20.04+ recommended)

## Prerequisites

Before starting, ensure you have:

1. **SGLang installed** - Follow the [SGLang integration steps](./kt-kernel_intro.md#integration-with-sglang)
2. **KT-Kernel installed** - Follow the [installation guide](./kt-kernel_intro.md#installation)

   Note: Currently, please clone our custom SGLang repository:

   ```bash
   git clone https://github.com/kvcache-ai/sglang.git
   cd sglang
   pip install -e "python[all]"
   ```

3. **CUDA toolkit** - CUDA 12.0+ recommended for FP8 support
4. **Hugging Face CLI** - For downloading models:

   ```bash
   pip install -U huggingface-hub
   ```

## Step 1: Download Model Weights

<!-- TODO: using kt-cli -->

## Step 2: Launch SGLang Server

<!-- TODO: using kt-cli -->

See [KT-Kernel Parameters](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel#kt-kernel-parameters) for detailed parameter-tuning guidelines.

### Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--kt-method FP8` | Enable FP8 inference mode for MiniMax-M2.1 native FP8 weights. |
| `--kt-cpuinfer` | Number of CPU inference threads. Set to physical CPU cores (not hyperthreads). |
| `--kt-threadpool-count` | Number of thread pools. Set to the NUMA node count. |
| `--kt-num-gpu-experts` | Number of experts kept on GPU for decoding. |
| `--chunked-prefill-size` | Maximum tokens per prefill batch. |
| `--max-total-tokens` | Maximum total tokens in the KV cache. |
| `--kt-gpu-prefill-token-threshold` | Token threshold for the layerwise prefill strategy. |

## Step 3: Send Inference Requests

## Performance

### Throughput (tokens/s)

The following benchmarks were measured with single concurrency (prefill tps / decode tps):

| GPU | CPU | PCIe | 2048 tokens | 8192 tokens | 32768 tokens |
|------------|-------------|------|-------------|-------------|--------------|
| 1 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 129 / 21.8 | 669 / 20.9 | 1385 / 18.5 |
| 2 x RTX 4090 (48 GB) | 2 x Intel Xeon Platinum 8488C | PCIe 4.0 | 139 / 23.6 | 1013 / 23.3 | 2269 / 21.6 |
| 1 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 408 / 32.1 | 1196 / 31.4 | 2540 / 27.6 |
| 2 x RTX 5090 (32 GB) | 2 x AMD EPYC 9355 | PCIe 5.0 | 414 / 34.3 | 1847 / 33.1 | 4007 / 31.8 |

### Comparison with llama.cpp

We benchmarked KT-Kernel + SGLang against llama.cpp to demonstrate the performance advantages of our CPU-GPU heterogeneous inference approach.

<!-- TODO: Add prefill performance comparison chart -->
<!--  -->

| Input Length | llama.cpp (tokens/s) | KT-Kernel (tokens/s) | Speedup |
|--------------|----------------------|----------------------|---------|
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |
| <!-- TODO --> | <!-- TODO --> | <!-- TODO --> | <!-- TODO --> |

### Key Observations

<!-- TODO: Add key observations and analysis, e.g.:
- KT-Kernel achieves Xx speedup in prefill compared to llama.cpp
- Decode performance shows Xx improvement due to GPU expert caching
- Memory efficiency comparison
- Scalability with different batch sizes
-->

## Troubleshooting
<!-- TODO: -->

## Advanced Use Case: Running Claude Code with MiniMax-M2.1 Local Backend
<!-- TODO: -->
This tutorial contains several TODO placeholders that need to be filled in to make it complete and useful for users. Please address the following:

- Line 30: Specify the minimum RAM requirement.
- Lines 63 & 67: Provide the kt-cli commands for downloading weights and launching the server.
- Lines 104-111: Add the performance comparison chart and fill in the data in the comparison table.
- Lines 115-120: Add the key observations and analysis of the performance benchmarks.
- Line 123: Add content to the Troubleshooting section.
- Line 126: Add content for the advanced use case.
doc/en/MiniMax-M2.1-Tutorial.md
Outdated

## Troubleshooting
<!-- TODO: -->

## Advanced Use Case: Running Claude Code with MiniMax-M2.1 Local Backend
```python
bandwidth = (
    hidden_size
    * intermediate_size
    * 3
    * num_experts_per_tok
    * (1 / num_experts_per_tok * expert_num * (1 - (1 - num_experts_per_tok / expert_num) ** qlen))
    * bytes_per_elem
    * test_iter
    / total_time
    / 1e9
)  # unit: GB/s
```
The bandwidth calculation appears to use a complex and potentially incorrect formula. The term `(1 / num_experts_per_tok * ...)` is unusual. The standard formula for the expected number of unique experts accessed is `expert_num * (1 - (1 - p) ** qlen)`, where `p = num_experts_per_tok / expert_num`.
For a benchmark, it might be clearer and more standard to calculate bandwidth from the total data processed per iteration, which would be `qlen * num_experts_per_tok * (hidden_size * intermediate_size * 3) * bytes_per_elem`, assuming no expert caching between iterations. Could you please review this formula?
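For reference, the standard expected-unique-experts expression the review cites can be written as follows. Note that in the PR's formula the leading `num_experts_per_tok` and inner `1 / num_experts_per_tok` factors cancel, so the two expressions are algebraically equal; the concern is readability, not a numeric difference.

```cpp
#include <cmath>

// Expected number of distinct experts touched by qlen tokens under
// independent uniform top-k routing: E * (1 - (1 - k/E)^qlen),
// where E = expert_num and k = num_experts_per_tok.
double expected_unique_experts(int expert_num, int num_experts_per_tok, int qlen) {
    double p = static_cast<double>(num_experts_per_tok) / expert_num;
    return expert_num * (1.0 - std::pow(1.0 - p, qlen));
}
```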
```cpp
        moe_cls.def("write_weight_scale_to_buffer_task", &WriteWeightScaleToBufferBindings::cpuinfer_interface,
                    py::arg("gpu_tp_count"), py::arg("gpu_experts_num"), py::arg("w13_weight_ptrs"),
                    py::arg("w13_scale_ptrs"), py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
    }

    // FP8 MoE: processes one expert at a time (expert_id instead of gpu_experts_num)
    if constexpr (std::is_same_v<MoeTP, AMX_FP8_MOE_TP<amx::GemmKernel224FP8>>) {
        struct WriteWeightScaleToBufferBindings {
            struct Args {
                CPUInfer* cpuinfer;
                MoeClass* moe;
                int gpu_tp_count;
                int expert_id;
                std::vector<uintptr_t> w13_weight_ptrs;
                std::vector<uintptr_t> w13_scale_ptrs;
                std::vector<uintptr_t> w2_weight_ptrs;
                std::vector<uintptr_t> w2_scale_ptrs;
            };

            static void inner(void* args) {
                Args* args_ = (Args*)args;
                args_->cpuinfer->enqueue(&MoeClass::write_weight_scale_to_buffer, args_->moe, args_->gpu_tp_count,
                                         args_->expert_id, args_->w13_weight_ptrs, args_->w13_scale_ptrs,
                                         args_->w2_weight_ptrs, args_->w2_scale_ptrs);
            }

            static std::pair<intptr_t, intptr_t> cpuinfer_interface(std::shared_ptr<MoeClass> moe, int gpu_tp_count,
                                                                    int expert_id, py::list w13_weight_ptrs,
                                                                    py::list w13_scale_ptrs, py::list w2_weight_ptrs,
                                                                    py::list w2_scale_ptrs) {
                // Convert Python lists to std::vector<uintptr_t>
                std::vector<uintptr_t> w13_weight_vec, w13_scale_vec, w2_weight_vec, w2_scale_vec;

                for (auto item : w13_weight_ptrs) w13_weight_vec.push_back(py::cast<uintptr_t>(item));
                for (auto item : w13_scale_ptrs) w13_scale_vec.push_back(py::cast<uintptr_t>(item));
                for (auto item : w2_weight_ptrs) w2_weight_vec.push_back(py::cast<uintptr_t>(item));
                for (auto item : w2_scale_ptrs) w2_scale_vec.push_back(py::cast<uintptr_t>(item));

                Args* args = new Args{nullptr, moe.get(), gpu_tp_count, expert_id,
                                      w13_weight_vec, w13_scale_vec, w2_weight_vec, w2_scale_vec};
                return std::make_pair((intptr_t)&inner, (intptr_t)args);
            }
        };

        moe_cls.def("write_weight_scale_to_buffer_task", &WriteWeightScaleToBufferBindings::cpuinfer_interface,
                    py::arg("gpu_tp_count"), py::arg("expert_id"), py::arg("w13_weight_ptrs"),
                    py::arg("w13_scale_ptrs"), py::arg("w2_weight_ptrs"), py::arg("w2_scale_ptrs"));
    }
```
The binding logic for write_weight_scale_to_buffer_task is duplicated for AMX_K2_MOE_TP and AMX_FP8_MOE_TP. To improve maintainability and reduce code duplication, consider refactoring this into a single templated binding, using `if constexpr` with a detection trait (or a C++20 `requires` clause) so one block registers the task for every class that supports it.
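A minimal sketch of the suggested dedup, using the C++17 detection idiom in place of a C++20 `requires` clause. The type names are illustrative stand-ins for the real MoE TP classes, and the actual `moe_cls.def(...)` call is replaced by a boolean return so the pattern is self-contained.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the real kt-kernel MoE TP classes.
struct K2Moe  { void write_weight_scale_to_buffer(int) {} };
struct Fp8Moe { void write_weight_scale_to_buffer(int) {} };
struct PlainMoe {};  // no write_weight_scale_to_buffer

// C++17 detection idiom: does T expose write_weight_scale_to_buffer?
template <typename, typename = void>
struct has_write_weight_scale : std::false_type {};

template <typename T>
struct has_write_weight_scale<
    T, std::void_t<decltype(std::declval<T&>().write_weight_scale_to_buffer(0))>>
    : std::true_type {};

// One registration helper replacing the duplicated per-class binding blocks;
// in the real bindings the true branch would issue the moe_cls.def(...) call.
template <typename MoeClass>
bool register_write_weight_scale_task() {
    if constexpr (has_write_weight_scale<MoeClass>::value) {
        return true;   // binding registered
    } else {
        return false;  // class has no such task; nothing to bind
    }
}
```

The same helper then serves every current and future MoE class, so adding a third FP8 variant would not require copying the binding block again.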
What does this PR do?
Enable FP8 support for Minimax M2 and the DeepSeek series, and add the kt-cli command-line interface.