Conversation
Summary of Changes

This pull request introduces native support for Kimi K2 Thinking by significantly optimizing the prefill operation within the Mixture of Experts (MoE) architecture.
Code Review
This pull request introduces a highly optimized forward_prefill implementation for the Kimi K2 MoE layer, replacing a simple sequential loop with a batched and parallelized version that leverages a thread pool and AVX512 intrinsics. This change should significantly improve performance during the prefill phase. The PR also includes important bug fixes in amx_kernels.hpp related to indexing in a parallelized kernel, enhancing correctness. My review identifies a potential memory alignment issue in the new forward_prefill implementation that could lead to runtime crashes, and I have provided a suggestion to improve its robustness.
auto f32out = (__m512*)((float*)output + i * config_.hidden_size + e);
f32out[0] = x0;
f32out[1] = x1;
The direct cast to (__m512*) and subsequent write operations assume that the output buffer is 64-byte aligned. If a non-aligned buffer is provided by the caller, this will result in a segmentation fault. The warm_up function in this file, for instance, uses a std::vector for the output buffer, which does not guarantee alignment, highlighting a scenario where this could fail. To enhance robustness, it is recommended to use unaligned store intrinsics.
auto f32out = (float*)output + i * config_.hidden_size + e;
_mm512_storeu_ps(f32out, x0);
_mm512_storeu_ps(f32out + 16, x1);
v0.43 runs, and generation quality looks fine now (I subjectively tested long-text generation and verified simple tool calling); both work. However, the following error occurs frequently: [2025-12-10 10:36:20] INFO: 10.1.150.105:41802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
* [feat]: fix k2 prefill
* Update Kimi-K2-Thinking.md
* Create Kimi-K2-Thinking-Native.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking.md
* Update Kimi-K2-Thinking-Native.md
* [perf] optimize K2 MoE weight loading with per-expert pointers
  - Avoid expensive torch.stack().contiguous() in Python (was ~6.6s)
  - Use per-expert pointer arrays (gate_projs) instead of contiguous memory
  - C++ worker pool performs parallel memcpy for TP slicing
  - Add LOAD_TIME_PROFILE for load_weights timing analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: ouqingliang <1692110604@qq.com>
Co-authored-by: Claude <noreply@anthropic.com>
#1598