
Support Native Kimi K2 Thinking #1663

Merged
SkqLiao merged 7 commits into main from avx_kimi_k2 on Dec 5, 2025

Conversation

@ErvinXie (Collaborator) commented Dec 4, 2025

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ErvinXie, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces native support for Kimi K2 thinking by significantly optimizing the prefill operation within the Mixture of Experts (MoE) architecture. The core forward_prefill function has been refactored to utilize a highly parallelized approach, distributing computations across multiple experts and leveraging advanced vector instructions for improved efficiency. This enhancement aims to boost the performance of Kimi K2 models during their initial processing phase.

Highlights

  • Optimized Kimi K2 MoE Prefill: The forward_prefill function in k2-moe.hpp has been completely rewritten to leverage parallel processing and AVX512 intrinsics for significant performance improvements in the Mixture of Experts (MoE) prefill operation.
  • Fine-grained Expert Processing: The new prefill logic efficiently tracks and processes activated experts, distributing computation across a thread pool for various stages, including input copying, gate/up/down matrix multiplications, and activation functions.
  • Enhanced Performance Profiling: Detailed performance profiling has been integrated into the forward_prefill function, allowing for granular measurement of each stage's execution time to identify bottlenecks and further optimize.
  • Parallel GEMM Indexing Fixes: Corrections were made in amx_kernels.hpp to ensure correct sub-matrix indexing during parallelized General Matrix Multiply (GEMM) operations, which is crucial for proper data access and computation in k-group processing.
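
The batched-dispatch pattern the highlights describe can be sketched roughly as follows. This is a minimal illustration only; names like `group_by_expert` and `process_expert` are invented here and are not the identifiers used in k2-moe.hpp:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Group token indices by the expert they were routed to, so each expert's
// GEMM sees one contiguous batch instead of per-token calls.
std::vector<std::vector<size_t>> group_by_expert(
    const std::vector<int>& topk_ids, int num_experts) {
  std::vector<std::vector<size_t>> buckets(num_experts);
  for (size_t tok = 0; tok < topk_ids.size(); ++tok)
    buckets[topk_ids[tok]].push_back(tok);
  return buckets;
}

// Dispatch only the activated experts to worker threads; experts with no
// routed tokens cost nothing. Returns how many experts actually ran.
template <typename Fn>
int run_activated_experts(const std::vector<std::vector<size_t>>& buckets,
                          Fn&& process_expert) {
  std::vector<std::future<void>> tasks;
  for (size_t e = 0; e < buckets.size(); ++e)
    if (!buckets[e].empty())
      tasks.push_back(std::async(std::launch::async,
                                 [&, e] { process_expert(e, buckets[e]); }));
  for (auto& t : tasks) t.get();
  return static_cast<int>(tasks.size());
}
```

The real implementation additionally splits each expert's work into the copy, gate/up GEMM, activation, and down GEMM stages listed above; the sketch only shows the expert-level fan-out.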
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a highly optimized forward_prefill implementation for the Kimi K2 MoE layer, replacing a simple sequential loop with a batched and parallelized version that leverages a thread pool and AVX512 intrinsics. This change should significantly improve performance during the prefill phase. The PR also includes important bug fixes in amx_kernels.hpp related to indexing in a parallelized kernel, enhancing correctness. My review identifies a potential memory alignment issue in the new forward_prefill implementation that could lead to runtime crashes, and I have provided a suggestion to improve its robustness.

Comment on lines +719 to +721
auto f32out = (__m512*)((float*)output + i * config_.hidden_size + e);
f32out[0] = x0;
f32out[1] = x1;

Severity: high

The direct cast to (__m512*) and subsequent write operations assume that the output buffer is 64-byte aligned. If a non-aligned buffer is provided by the caller, this will result in a segmentation fault. The warm_up function in this file, for instance, uses a std::vector for the output buffer, which does not guarantee alignment, highlighting a scenario where this could fail. To enhance robustness, it is recommended to use unaligned store intrinsics.

            auto f32out = (float*)output + i * config_.hidden_size + e;
            _mm512_storeu_ps(f32out, x0);
            _mm512_storeu_ps(f32out + 16, x1);
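
The alignment concern can be demonstrated without intrinsics, so the sketch below compiles on any target. `alloc_64b` and `is_64b_aligned` are illustrative helpers, not code from this PR: a 64-byte-aligned pointer is safe for aligned AVX-512 stores, while anything else requires the unaligned `_mm512_storeu_ps` path suggested above.

```cpp
#include <cstdint>
#include <cstdlib>

// True if p satisfies the 64-byte alignment that an aligned __m512 store
// assumes; a misaligned pointer through that path faults at runtime.
bool is_64b_aligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// std::vector<float>::data() only guarantees alignof(float) == 4 bytes,
// which is why the aligned-cast store in the PR can crash. A buffer meant
// for aligned stores has to be allocated with an explicit alignment.
float* alloc_64b(std::size_t n_floats) {
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  std::size_t bytes = ((n_floats * sizeof(float) + 63) / 64) * 64;
  return static_cast<float*>(std::aligned_alloc(64, bytes));
}
```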

ErvinXie and others added 6 commits December 4, 2025 11:41
- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
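
The per-expert pointer scheme from the commit message above can be sketched as follows. This is a hypothetical illustration, not the actual loader: each expert keeps its own buffer, a pointer array (here `gate_projs`, borrowing the name from the commit) indexes them, and a worker pool copies each expert's tensor-parallel slice concurrently instead of first stacking everything into one contiguous tensor.

```cpp
#include <cstring>
#include <thread>
#include <vector>

// One buffer per expert plus a pointer array over them, replacing a single
// contiguous stacked tensor.
struct ExpertWeights {
  std::vector<std::vector<float>> storage;  // per-expert buffers
  std::vector<const float*> gate_projs;     // per-expert pointer array
};

// Copy this TP rank's row slice of every expert's weight matrix in parallel.
// full_weights holds one pointer per expert to a full_rows x cols matrix.
ExpertWeights load_tp_slices(const std::vector<const float*>& full_weights,
                             std::size_t full_rows, std::size_t cols,
                             std::size_t tp_rank, std::size_t tp_size) {
  std::size_t rows = full_rows / tp_size;  // rows owned by this rank
  ExpertWeights w;
  w.storage.resize(full_weights.size());
  w.gate_projs.resize(full_weights.size());
  std::vector<std::thread> pool;
  for (std::size_t e = 0; e < full_weights.size(); ++e) {
    w.storage[e].resize(rows * cols);
    pool.emplace_back([&, e] {
      // Parallel memcpy of this rank's slice of expert e.
      std::memcpy(w.storage[e].data(),
                  full_weights[e] + tp_rank * rows * cols,
                  rows * cols * sizeof(float));
      w.gate_projs[e] = w.storage[e].data();
    });
  }
  for (auto& t : pool) t.join();
  return w;
}
```

The design choice being described: the expensive step was materializing one big contiguous tensor on the Python side; keeping experts separate makes each copy independent and therefore trivially parallelizable.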
@SkqLiao SkqLiao merged commit 71f683a into main Dec 5, 2025
7 of 9 checks passed
@zhangjiekui commented:

v0.43 runs, and generation quality now seems fine (I subjectively tested long-form generation and tried simple tool calls; both are OK).

However, the following error occurs frequently:

[2025-12-10 10:36:20] INFO: 10.1.150.105:41802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-12-10 10:36:20 TP0] Prefill batch, #new-seq: 1, #new-token: 2171, #cached-token: 21, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-10 10:36:21 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/scheduler.py", line 2698, in run_scheduler_process
scheduler.event_loop_overlap()
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/scheduler.py", line 993, in event_loop_overlap
batch_result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/scheduler.py", line 1995, in run_batch
batch_result = self.model_worker.forward_batch_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/tp_worker.py", line 371, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/model_executor/model_runner.py", line 2155, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/model_executor/model_runner.py", line 2212, in _forward_raw
ret = self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/model_executor/model_runner.py", line 2100, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 3104, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 2965, in forward
hidden_states, residual = layer(
^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 2716, in forward
hidden_states = self.mlp(
^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 756, in forward
return self.forward_normal(
^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 837, in forward_normal
final_hidden_states = self.experts(
^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 845, in forward
combine_input = self.run_moe_core(
^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 867, in run_moe_core
return self.quant_method.apply(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 585, in apply
ctx = self._build_full_context(layer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 639, in _build_full_context
_SHARED_FULL_CONTEXT = SharedFullContext(
^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 98, in init
self._create_cpu_buffers()
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 167, in _create_cpu_buffers
shm = shared_memory.SharedMemory(name=shm_name, create=True, size=nbytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/shared_memory.py", line 104, in init
self._fd = _posixshmem.shm_open(
^^^^^^^^^^^^^^^^^^^^^
FileExistsError: [Errno 17] File exists: '/kt_buf_w13_weight_packed_r1'
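
For context on the traceback: Python's `shared_memory.SharedMemory(create=True)` maps to `shm_open` with `O_CREAT | O_EXCL`, so a segment left in `/dev/shm` by a crashed or killed earlier run triggers exactly this EEXIST. A defensive create-or-replace pattern could look like the C++ sketch below; `shm_create_or_replace` is an invented name, and blindly unlinking is only safe when no live process still holds the segment, so this is illustrative rather than a drop-in fix.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Try to create the shared-memory segment exclusively; if a stale segment
// with the same name survives from a previous run, unlink it and retry once.
int shm_create_or_replace(const std::string& name, std::size_t nbytes) {
  int fd = shm_open(name.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
  if (fd < 0 && errno == EEXIST) {
    shm_unlink(name.c_str());  // drop the stale segment, then retry once
    fd = shm_open(name.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
  }
  if (fd >= 0) (void)ftruncate(fd, static_cast<off_t>(nbytes));
  return fd;  // caller mmap()s the fd and eventually shm_unlink()s the name
}
```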

KMSorSMS pushed a commit that referenced this pull request Dec 11, 2025
* [feat]: fix k2 prefill

* Update Kimi-K2-Thinking.md

* Create Kimi-K2-Thinking-Native.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking-Native.md

* [perf] optimize K2 MoE weight loading with per-expert pointers

- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
4 participants