
Support Native Kimi K2 Thinking #1663

Merged
SkqLiao merged 7 commits into main from avx_kimi_k2 on Dec 5, 2025

Conversation

@ErvinXie (Collaborator) commented Dec 4, 2025

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ErvinXie, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces native support for Kimi K2 thinking by significantly optimizing the prefill operation within the Mixture of Experts (MoE) architecture. The core forward_prefill function has been refactored to utilize a highly parallelized approach, distributing computations across multiple experts and leveraging advanced vector instructions for improved efficiency. This enhancement aims to boost the performance of Kimi K2 models during their initial processing phase.

Highlights

  • Optimized Kimi K2 MoE Prefill: The forward_prefill function in k2-moe.hpp has been completely rewritten to leverage parallel processing and AVX512 intrinsics for significant performance improvements in the Mixture of Experts (MoE) prefill operation.
  • Fine-grained Expert Processing: The new prefill logic efficiently tracks and processes activated experts, distributing computation across a thread pool for various stages, including input copying, gate/up/down matrix multiplications, and activation functions.
  • Enhanced Performance Profiling: Detailed performance profiling has been integrated into the forward_prefill function, allowing for granular measurement of each stage's execution time to identify bottlenecks and further optimize.
  • Parallel GEMM Indexing Fixes: Corrections were made in amx_kernels.hpp to ensure correct sub-matrix indexing during parallelized General Matrix Multiply (GEMM) operations, which is crucial for proper data access and computation in k-group processing.
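
The batched-dispatch pattern the highlights describe can be sketched roughly as follows. This is a minimal illustration only; names like `group_by_expert` and `process_expert` are invented here and are not the identifiers used in k2-moe.hpp:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Group token indices by the expert they were routed to, so each expert's
// GEMM sees one contiguous batch instead of per-token calls.
std::vector<std::vector<size_t>> group_by_expert(
    const std::vector<int>& topk_ids, int num_experts) {
  std::vector<std::vector<size_t>> buckets(num_experts);
  for (size_t tok = 0; tok < topk_ids.size(); ++tok)
    buckets[topk_ids[tok]].push_back(tok);
  return buckets;
}

// Dispatch only the activated experts to worker threads; experts with no
// routed tokens cost nothing. Returns how many experts actually ran.
template <typename Fn>
int run_activated_experts(const std::vector<std::vector<size_t>>& buckets,
                          Fn&& process_expert) {
  std::vector<std::future<void>> tasks;
  for (size_t e = 0; e < buckets.size(); ++e)
    if (!buckets[e].empty())
      tasks.push_back(std::async(std::launch::async,
                                 [&, e] { process_expert(e, buckets[e]); }));
  for (auto& t : tasks) t.get();
  return static_cast<int>(tasks.size());
}
```

The real implementation additionally splits each expert's work into the copy, gate/up GEMM, activation, and down GEMM stages listed above; the sketch only shows the expert-level fan-out.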
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a highly optimized forward_prefill implementation for the Kimi K2 MoE layer, replacing a simple sequential loop with a batched and parallelized version that leverages a thread pool and AVX512 intrinsics. This change should significantly improve performance during the prefill phase. The PR also includes important bug fixes in amx_kernels.hpp related to indexing in a parallelized kernel, enhancing correctness. My review identifies a potential memory alignment issue in the new forward_prefill implementation that could lead to runtime crashes, and I have provided a suggestion to improve its robustness.

Comment on lines +719 to +721
auto f32out = (__m512*)((float*)output + i * config_.hidden_size + e);
f32out[0] = x0;
f32out[1] = x1;

Severity: high

The direct cast to (__m512*) and subsequent write operations assume that the output buffer is 64-byte aligned. If a non-aligned buffer is provided by the caller, this will result in a segmentation fault. The warm_up function in this file, for instance, uses a std::vector for the output buffer, which does not guarantee alignment, highlighting a scenario where this could fail. To enhance robustness, it is recommended to use unaligned store intrinsics.

            auto f32out = (float*)output + i * config_.hidden_size + e;
            _mm512_storeu_ps(f32out, x0);
            _mm512_storeu_ps(f32out + 16, x1);
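
The alignment concern can be demonstrated without intrinsics, so the sketch below compiles on any target. `alloc_64b` and `is_64b_aligned` are illustrative helpers, not code from this PR: a 64-byte-aligned pointer is safe for aligned AVX-512 stores, while anything else requires the unaligned `_mm512_storeu_ps` path suggested above.

```cpp
#include <cstdint>
#include <cstdlib>

// True if p satisfies the 64-byte alignment that an aligned __m512 store
// assumes; a misaligned pointer through that path faults at runtime.
bool is_64b_aligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// std::vector<float>::data() only guarantees alignof(float) == 4 bytes,
// which is why the aligned-cast store in the PR can crash. A buffer meant
// for aligned stores has to be allocated with an explicit alignment.
float* alloc_64b(std::size_t n_floats) {
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  std::size_t bytes = ((n_floats * sizeof(float) + 63) / 64) * 64;
  return static_cast<float*>(std::aligned_alloc(64, bytes));
}
```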

ErvinXie and others added 6 commits December 4, 2025 11:41
- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
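
The per-expert pointer scheme from the commit message above can be sketched as follows. This is a hypothetical illustration, not the actual loader: each expert keeps its own buffer, a pointer array (here `gate_projs`, borrowing the name from the commit) indexes them, and a worker pool copies each expert's tensor-parallel slice concurrently instead of first stacking everything into one contiguous tensor.

```cpp
#include <cstring>
#include <thread>
#include <vector>

// One buffer per expert plus a pointer array over them, replacing a single
// contiguous stacked tensor.
struct ExpertWeights {
  std::vector<std::vector<float>> storage;  // per-expert buffers
  std::vector<const float*> gate_projs;     // per-expert pointer array
};

// Copy this TP rank's row slice of every expert's weight matrix in parallel.
// full_weights holds one pointer per expert to a full_rows x cols matrix.
ExpertWeights load_tp_slices(const std::vector<const float*>& full_weights,
                             std::size_t full_rows, std::size_t cols,
                             std::size_t tp_rank, std::size_t tp_size) {
  std::size_t rows = full_rows / tp_size;  // rows owned by this rank
  ExpertWeights w;
  w.storage.resize(full_weights.size());
  w.gate_projs.resize(full_weights.size());
  std::vector<std::thread> pool;
  for (std::size_t e = 0; e < full_weights.size(); ++e) {
    w.storage[e].resize(rows * cols);
    pool.emplace_back([&, e] {
      // Parallel memcpy of this rank's slice of expert e.
      std::memcpy(w.storage[e].data(),
                  full_weights[e] + tp_rank * rows * cols,
                  rows * cols * sizeof(float));
      w.gate_projs[e] = w.storage[e].data();
    });
  }
  for (auto& t : pool) t.join();
  return w;
}
```

The design choice being described: the expensive step was materializing one big contiguous tensor on the Python side; keeping experts separate makes each copy independent and therefore trivially parallelizable.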
@SkqLiao SkqLiao merged commit 71f683a into main Dec 5, 2025
7 of 9 checks passed
@zhangjiekui commented:

v0.43 runs, and generation quality now seems fine (I subjectively tested long-form generation and tried simple tool calls; both are OK).

However, the following error occurs frequently:

[2025-12-10 10:36:20] INFO: 10.1.150.105:41802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-12-10 10:36:20 TP0] Prefill batch, #new-seq: 1, #new-token: 2171, #cached-token: 21, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-12-10 10:36:21 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/scheduler.py", line 2698, in run_scheduler_process
scheduler.event_loop_overlap()
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/scheduler.py", line 993, in event_loop_overlap
batch_result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/scheduler.py", line 1995, in run_batch
batch_result = self.model_worker.forward_batch_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/managers/tp_worker.py", line 371, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/model_executor/model_runner.py", line 2155, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/model_executor/model_runner.py", line 2212, in _forward_raw
ret = self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/model_executor/model_runner.py", line 2100, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 3104, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 2965, in forward
hidden_states, residual = layer(
^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 2716, in forward
hidden_states = self.mlp(
^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 756, in forward
return self.forward_normal(
^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/models/deepseek_v2.py", line 837, in forward_normal
final_hidden_states = self.experts(
^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.virtualenvs/ktsglang/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 845, in forward
combine_input = self.run_moe_core(
^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 867, in run_moe_core
return self.quant_method.apply(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 585, in apply
ctx = self._build_full_context(layer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 639, in _build_full_context
_SHARED_FULL_CONTEXT = SharedFullContext(
^^^^^^^^^^^^^^^^^^
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 98, in init
self._create_cpu_buffers()
File "/data/cppsrc/sglang-ktransformers/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 167, in _create_cpu_buffers
shm = shared_memory.SharedMemory(name=shm_name, create=True, size=nbytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/shared_memory.py", line 104, in init
self._fd = _posixshmem.shm_open(
^^^^^^^^^^^^^^^^^^^^^
FileExistsError: [Errno 17] File exists: '/kt_buf_w13_weight_packed_r1'
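
For context on the traceback: Python's `shared_memory.SharedMemory(create=True)` maps to `shm_open` with `O_CREAT | O_EXCL`, so a segment left in `/dev/shm` by a crashed or killed earlier run triggers exactly this EEXIST. A defensive create-or-replace pattern could look like the C++ sketch below; `shm_create_or_replace` is an invented name, and blindly unlinking is only safe when no live process still holds the segment, so this is illustrative rather than a drop-in fix.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Try to create the shared-memory segment exclusively; if a stale segment
// with the same name survives from a previous run, unlink it and retry once.
int shm_create_or_replace(const std::string& name, std::size_t nbytes) {
  int fd = shm_open(name.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
  if (fd < 0 && errno == EEXIST) {
    shm_unlink(name.c_str());  // drop the stale segment, then retry once
    fd = shm_open(name.c_str(), O_CREAT | O_EXCL | O_RDWR, 0600);
  }
  if (fd >= 0) (void)ftruncate(fd, static_cast<off_t>(nbytes));
  return fd;  // caller mmap()s the fd and eventually shm_unlink()s the name
}
```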

KMSorSMS pushed a commit that referenced this pull request Dec 11, 2025
* [feat]: fix k2 prefill

* Update Kimi-K2-Thinking.md

* Create Kimi-K2-Thinking-Native.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking.md

* Update Kimi-K2-Thinking-Native.md

* [perf] optimize K2 MoE weight loading with per-expert pointers

- Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
- Use per-expert pointer arrays (gate_projs) instead of contiguous memory
- C++ worker pool performs parallel memcpy for TP slicing
- Add LOAD_TIME_PROFILE for load_weights timing analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
4 participants