feat: add Kimi-K2.5 (moonshotai/Kimi-K2.5) model support in HYBRID mode #403
jasonqinzhou wants to merge 7 commits into main from
Conversation
collector/trtllm/collect_mla.py (outdated)
    dtype_list = [tensorrt_llm.bindings.DataType.BF16]  # not support f8 for trt < v1.1
    test_cases = []
-   n_list = [128]
+   n_list = [64, 128]
The latency measurement is based on the local head count, num_heads = n / tp, so adding 64 gives you a lot of duplicate measurements. Removing 64 was a previous fix here; you don't need to modify this. I'd suggest fixing that in sglang instead, by adding tp=128 to collect_mla.py for sglang.
@@ -0,0 +1,26 @@
+ {
+   "architectures": ["KimiK25ForConditionalGeneration"],
How do we handle the vision encoder?
Same here: there is no quantization field; this needs a full copy-paste.
collector/common_test_cases.py (outdated)
    # num_heads, q_lora_rank, kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
    model_config_list = [
        [128, 1536, 512, 128, 64, 128, "deepseek-ai/DeepSeek-V3"],
+       [64, 1536, 512, 128, 64, 128, "moonshotai/Kimi-K2.5"],
I think we don't need this, for the same reason as the previous explanation: num_heads=128 with tp_list [1, 2, 4, ..., 128] naturally covers num_heads=64 with tp_list [1, 2, 4, ..., 64].
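The coverage argument can be sketched in a few lines (illustrative helper, not the collector's real code): the collector benchmarks per local head count n // tp, so every local count reachable from n=64 is already produced by n=128 with a larger tp.

```python
# Sketch: MLA latency is collected per *local* head count n // tp,
# so two (n, tp_list) grids can produce overlapping measurements.
def local_head_counts(n, tp_list):
    return {n // tp for tp in tp_list if n % tp == 0}

tps_128 = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128
tps_64 = [2 ** i for i in range(7)]   # 1, 2, 4, ..., 64

covered_by_128 = local_head_counts(128, tps_128)
needed_for_64 = local_head_counts(64, tps_64)

# Every local head count reachable from n=64 is already measured via
# n=128, so adding 64 to n_list only duplicates work.
print(needed_for_64 <= covered_by_128)  # True
```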
- Add model config for Kimi-K2.5 (MLA-based MoE, 61 layers, 384 routed
experts, 64 attention heads, 262k context)
- Register KimiK25ForConditionalGeneration architecture under the DEEPSEEK
model family and add moonshotai/Kimi-K2.5 to DefaultHFModels
- Fix _parse_hf_config_json to fall back to top-level config when model
params are nested under "text_config" (required for VLM-style HF configs
like Kimi-K2.5)
- Extend MLA collector test cases and TRT-LLM collect_mla n_list to cover
num_heads=64 (Kimi-K2.5) in addition to the existing 128 (DeepSeek-V3)
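The text_config fallback in the third bullet can be sketched as follows; the function name mirrors _parse_hf_config_json, but the real helper's signature and field checks may differ.

```python
# Minimal sketch of the "text_config" fallback (illustrative; the real
# _parse_hf_config_json may inspect different fields).
def parse_hf_config_json(config: dict) -> dict:
    # VLM-style HF configs (e.g. Kimi-K2.5) nest the LLM parameters under
    # "text_config"; plain LLM configs keep them at the top level.
    text_config = config.get("text_config")
    if isinstance(text_config, dict) and "num_hidden_layers" in text_config:
        return text_config
    return config

vlm_cfg = {"architectures": ["KimiK25ForConditionalGeneration"],
           "text_config": {"num_hidden_layers": 61, "num_attention_heads": 64}}
plain_cfg = {"num_hidden_layers": 61, "num_attention_heads": 128}

print(parse_hf_config_json(vlm_cfg)["num_attention_heads"])    # 64
print(parse_hf_config_json(plain_cfg)["num_attention_heads"])  # 128
```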
Fix DeepSeekModel / TrtllmWideEPDeepSeekModel hardcoded 128-head ops:
DeepSeekModel and TrtllmWideEPDeepSeekModel hardcoded DeepSeek-V3's
128 attention heads in several MLA GEMM / attention ops, making them
produce incorrect weight-size and latency estimates for any DEEPSEEK
model with a different head count (e.g. Kimi-K2.5 with 64 heads).
Replace every affected hardcode with self._num_heads:
- context/generation q_b_proj_gemm n = num_heads * 192 // tp
- context kv_b_proj_gemm n = num_heads * 256 // tp
- context/generation_attention n_heads = num_heads // tp
- context_proj_gemm k = num_heads * 128 // tp
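The head-count parameterization above can be sketched as follows; MLAShapes is an illustrative stand-in for the model classes, with the dimension formulas taken verbatim from the bullet list.

```python
# Sketch of the fix: MLA GEMM shapes derived from the model's head count
# instead of a hardcoded 128 (class name is illustrative).
class MLAShapes:
    def __init__(self, num_heads: int, tp: int):
        self._num_heads = num_heads
        self.tp = tp

    def q_b_proj_n(self):       # context/generation q_b_proj_gemm
        return self._num_heads * 192 // self.tp

    def kv_b_proj_n(self):      # context kv_b_proj_gemm
        return self._num_heads * 256 // self.tp

    def attention_heads(self):  # context/generation attention
        return self._num_heads // self.tp

    def proj_gemm_k(self):      # context_proj_gemm
        return self._num_heads * 128 // self.tp

# DeepSeek-V3 (128 heads) vs Kimi-K2.5 (64 heads) at tp=8:
print(MLAShapes(128, 8).q_b_proj_n())  # 3072
print(MLAShapes(64, 8).q_b_proj_n())   # 1536
```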
Fix nextn (MTP) auto-assigned to all DEEPSEEK models (task.py):
nextn was unconditionally set to 1 for every DEEPSEEK model, adding a
spurious (nextn+1) activation-memory multiplier and incorrect MTP
latency scaling for models without Multi-Token Prediction support.
Now reads num_nextn_predict_layers from the raw model config (default 0),
so DeepSeek-V3/V3.1 still get nextn=1 while Kimi-K2.5 gets nextn=0.
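The nextn derivation is a one-liner once the raw HF config dict is loaded; the helper name below is illustrative.

```python
# Sketch of the nextn fix: read MTP support from the model config
# (default 0) instead of assuming it for every DEEPSEEK-family model.
def derive_nextn(raw_config: dict) -> int:
    return raw_config.get("num_nextn_predict_layers", 0)

print(derive_nextn({"num_nextn_predict_layers": 1}))  # 1 (DeepSeek-V3/V3.1)
print(derive_nextn({}))                               # 0 (Kimi-K2.5, no MTP)
```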
Fix IndexError in get_worker_candidates() when all configs OOM (inference_session.py):
Same exceptions[-1]-on-empty-list crash fixed in agg_pareto() by #378
now also fixed in DisaggInferenceSession.get_worker_candidates().
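The guard pattern behind this fix, in illustrative form; the real session code collects different candidate objects and exception types.

```python
# Sketch of the exceptions[-1]-on-empty-list fix: when every config fails,
# re-raise the last recorded error instead of indexing into an empty list.
def pick_candidates(configs, try_config):
    candidates, exceptions = [], []
    for cfg in configs:
        try:
            candidates.append(try_config(cfg))
        except MemoryError as e:  # stand-in for the session's OOM errors
            exceptions.append(e)
    if not candidates:
        if exceptions:
            raise exceptions[-1]  # previously: IndexError when both lists empty
        raise RuntimeError("no worker candidates and no recorded errors")
    return candidates

def oom(cfg):
    raise MemoryError(f"OOM for {cfg}")

try:
    pick_candidates(["tp1", "tp2"], oom)
except MemoryError as e:
    print(e)  # OOM for tp2
```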
Fix disagg per-worker GPU search space not scaling with --total-gpus (task.py):
_finalize_disagg used total_gpus only to cap max_gpu_per_replica (replica
scaling), but never updated num_gpu_per_worker / tp_list / dp_list /
moe_ep_list in the prefill and decode worker configs. Those lists were
hardcoded to [1,2,4,8], so large MoE models like Kimi-K2.5 (needing
EP=32+ to avoid OOM) were never explored regardless of --total-gpus.
_finalize_disagg now extends each non-singleton parallel list with
powers-of-2 up to total_gpus so that configurations like EP=32/64/128
are included in the sweep when sufficient GPUs are available.
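The list-extension logic can be sketched as below; extend_parallel_list is an illustrative name, and the real _finalize_disagg may filter candidates further.

```python
# Sketch of the --total-gpus fix: grow each non-singleton parallel list
# (tp_list / dp_list / moe_ep_list) with powers of two up to total_gpus.
def extend_parallel_list(values, total_gpus):
    if len(values) <= 1:  # singleton lists are left untouched
        return list(values)
    out = set(values)
    p = 1
    while p <= total_gpus:
        if p > max(values):
            out.add(p)
        p *= 2
    return sorted(out)

print(extend_parallel_list([1, 2, 4, 8], 128))
# [1, 2, 4, 8, 16, 32, 64, 128] -> EP=32/64/128 now enter the sweep
print(extend_parallel_list([1], 128))  # [1]
```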
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jasonqinzhou force-pushed from ffc79b3 to 3270e64
Walkthrough
The PR adds support for three new Kimi model variants with corresponding JSON configurations, expands MLA test case generation parameters, and introduces a new …

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
src/aiconfigurator/sdk/task.py (2)
306-307: Logic correctly derives nextn from model config.
This change properly reads num_nextn_predict_layers directly from the model's configuration instead of inferring it from the model family. This is more accurate for models like Kimi-K2.5 that may share architecture characteristics but have different MTP settings.
Consider adding defensive error handling similar to what's done in validate() (lines 773-776) to provide graceful fallback if model config loading fails:

🛡️ Optional defensive handling
 @staticmethod
 def _base_common_layer(ctx: TaskContext) -> dict:
-    raw_config = get_model_config_from_model_path(ctx.model_path).get("raw_config", {})
+    try:
+        raw_config = get_model_config_from_model_path(ctx.model_path).get("raw_config", {})
+    except Exception:
+        logger.warning("Could not load model config for %s; defaulting nextn to 0", ctx.model_path)
+        raw_config = {}
     nextn = raw_config.get("num_nextn_predict_layers", 0)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/task.py` around lines 306 - 307, The code now reads num_nextn_predict_layers via get_model_config_from_model_path(ctx.model_path) into raw_config/nextn but lacks defensive error handling; wrap the model config load in a try/except (or equivalent error check) around get_model_config_from_model_path to catch failures, log or surface the error consistently (similar to validate()'s handling), and fallback to a safe default (e.g., 0) for nextn so downstream logic won’t crash if loading the model config fails; reference get_model_config_from_model_path, ctx.model_path, raw_config, nextn and mirror the validate() pattern for logging/fallback.
773-777: Consider caching model config to avoid duplicate loading.
get_model_config_from_model_path is called here in validate() and also earlier in _base_common_layer() (line 306). For remote HuggingFace models, this results in redundant network requests. Consider caching the result on the TaskConfig instance or passing it through the context to avoid the duplicate load.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/task.py` around lines 773 - 777, The call to get_model_config_from_model_path is duplicated in validate() and _base_common_layer(), causing redundant remote loads for HuggingFace models; cache the result on the TaskConfig instance (e.g., add an attribute like self._cached_model_config) or pass the fetched value through the shared context so subsequent calls reuse it: update _base_common_layer() (or the earlier loader) to set self._cached_model_config = model_info and change validate() to check self._cached_model_config before calling get_model_config_from_model_path again (falling back to fetching and caching if absent).
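The caching suggestion can be sketched with a stubbed loader; the attribute name _cached_model_config comes from the review comment, while the class shape and loader are illustrative.

```python
# Sketch of config caching on the TaskConfig instance so validate() and
# _base_common_layer() share one load (illustrative, not the real class).
class TaskConfig:
    def __init__(self, model_path, loader):
        self.model_path = model_path
        self._loader = loader  # stand-in for get_model_config_from_model_path
        self._cached_model_config = None

    def model_config(self):
        if self._cached_model_config is None:
            self._cached_model_config = self._loader(self.model_path)
        return self._cached_model_config

calls = []
def fake_loader(path):
    calls.append(path)  # records each (potentially remote) fetch
    return {"raw_config": {"num_nextn_predict_layers": 0}}

tc = TaskConfig("moonshotai/Kimi-K2.5", fake_loader)
tc.model_config()   # loads once
tc.model_config()   # served from cache
print(len(calls))   # 1
```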
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- collector/common_test_cases.py
- collector/trtllm/collect_mla.py
- src/aiconfigurator/cli/main.py
- src/aiconfigurator/model_configs/moonshotai--Kimi-K2-Instruct_config.json
- src/aiconfigurator/model_configs/moonshotai--Kimi-K2-Thinking_config.json
- src/aiconfigurator/model_configs/moonshotai--Kimi-K2.5_config.json
- src/aiconfigurator/sdk/common.py
- src/aiconfigurator/sdk/inference_session.py
- src/aiconfigurator/sdk/models.py
- src/aiconfigurator/sdk/task.py
- src/aiconfigurator/sdk/utils.py
Model the ViT vision encoder (27-layer, 1152-dim), patch merger, and projector as GEMM/ElementWise ops prepended to context_ops. Each vision op carries _vision_num_tokens so the backend uses the correct token count (4096 pre-merge, 1024 post-merge) instead of isl. Also reverts collector changes per review comments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
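The per-op token accounting described in this commit can be sketched as follows; the op-dict layout is illustrative, with only the _vision_num_tokens key and the 4096/1024 token counts taken from the message.

```python
# Sketch: vision ops carry their own token count via _vision_num_tokens
# so the backend does not fall back to isl for them.
def vision_ops(hidden=1152):
    pre_merge, post_merge = 4096, 1024  # tokens before/after the patch merger
    return [
        {"name": "vit_block_gemm", "_vision_num_tokens": pre_merge, "dim": hidden},
        {"name": "patch_merger", "_vision_num_tokens": pre_merge, "dim": hidden},
        {"name": "projector_gemm", "_vision_num_tokens": post_merge, "dim": hidden},
    ]

def num_tokens(op, isl):
    # vision ops use their own token count; ordinary LLM ops use isl
    return op.get("_vision_num_tokens", isl)

print([num_tokens(op, 8192) for op in vision_ops()])  # [4096, 4096, 1024]
```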
@@ -0,0 +1,26 @@
+ {
We need to do a full copy-paste; this config is missing the quant field. https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/config.json
@@ -0,0 +1,26 @@
+ {
+   "architectures": ["DeepseekV3ForCausalLM"],
Looks like this model employs a 4-bit quant:
https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/config.json
Were all these fields generated by Claude? I think we need a full copy-paste to avoid misalignment.
      help="Optional end-to-end request latency target (ms). Enables request-latency optimization mode.",
  )
  parser.add_argument("--prefix", type=int, default=0, help="Prefix cache length. Default to 0.")
+ parser.add_argument(
Can we remove wideep and design in a separate PR?
Summary by CodeRabbit

New Features
- --enable-wideep CLI flag for enhanced model configuration support.

Improvements