feat: add Kimi-K2.5 (moonshotai/Kimi-K2.5) model support in HYBRID mode #403
jasonqinzhou wants to merge 7 commits into main from
Conversation
collector/trtllm/collect_mla.py (outdated)
    dtype_list = [tensorrt_llm.bindings.DataType.BF16]  # not support f8 for trt < v1.1
    test_cases = []
-   n_list = [128]
+   n_list = [64, 128]
The latency measurement is based on the local head count, num_heads = n / tp, so adding 64 gives you a lot of duplicate measurements. Removing 64 was a previous fix here; you don't need to modify this. I'd suggest fixing that in sglang instead, by adding tp=128 to collect_mla.py for sglang.
@@ -0,0 +1,26 @@
+ {
+   "architectures": ["KimiK25ForConditionalGeneration"],
How do we handle the vision encoder?
Same here: there is no quantization field; this needs a full copy-paste.
collector/common_test_cases.py (outdated)
    # num_heads, q_lora_rank, kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim, v_head_dim
    model_config_list = [
        [128, 1536, 512, 128, 64, 128, "deepseek-ai/DeepSeek-V3"],
+       [64, 1536, 512, 128, 64, 128, "moonshotai/Kimi-K2.5"],
I think we don't need this, for the same reason as the previous explanation: num_heads=128 with tp_list [1, 2, 4, ..., 128] naturally covers num_heads=64 with tp_list [1, 2, 4, ..., 64].
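The coverage argument can be sketched in a few lines (illustrative helper, not the collector's real code): the collector benchmarks per local head count n // tp, so every local count reachable from n=64 is already produced by n=128 with a larger tp.

```python
# Sketch: MLA latency is collected per *local* head count n // tp,
# so two (n, tp_list) grids can produce overlapping measurements.
def local_head_counts(n, tp_list):
    return {n // tp for tp in tp_list if n % tp == 0}

tps_128 = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128
tps_64 = [2 ** i for i in range(7)]   # 1, 2, 4, ..., 64

covered_by_128 = local_head_counts(128, tps_128)
needed_for_64 = local_head_counts(64, tps_64)

# Every local head count reachable from n=64 is already measured via
# n=128, so adding 64 to n_list only duplicates work.
print(needed_for_64 <= covered_by_128)  # True
```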
- Add model config for Kimi-K2.5 (MLA-based MoE, 61 layers, 384 routed
experts, 64 attention heads, 262k context)
- Register KimiK25ForConditionalGeneration architecture under the DEEPSEEK
model family and add moonshotai/Kimi-K2.5 to DefaultHFModels
- Fix _parse_hf_config_json to fall back to top-level config when model
params are nested under "text_config" (required for VLM-style HF configs
like Kimi-K2.5)
- Extend MLA collector test cases and TRT-LLM collect_mla n_list to cover
num_heads=64 (Kimi-K2.5) in addition to the existing 128 (DeepSeek-V3)
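The text_config fallback in the third bullet can be sketched as follows; the function name mirrors _parse_hf_config_json, but the real helper's signature and field checks may differ.

```python
# Minimal sketch of the "text_config" fallback (illustrative; the real
# _parse_hf_config_json may inspect different fields).
def parse_hf_config_json(config: dict) -> dict:
    # VLM-style HF configs (e.g. Kimi-K2.5) nest the LLM parameters under
    # "text_config"; plain LLM configs keep them at the top level.
    text_config = config.get("text_config")
    if isinstance(text_config, dict) and "num_hidden_layers" in text_config:
        return text_config
    return config

vlm_cfg = {"architectures": ["KimiK25ForConditionalGeneration"],
           "text_config": {"num_hidden_layers": 61, "num_attention_heads": 64}}
plain_cfg = {"num_hidden_layers": 61, "num_attention_heads": 128}

print(parse_hf_config_json(vlm_cfg)["num_attention_heads"])    # 64
print(parse_hf_config_json(plain_cfg)["num_attention_heads"])  # 128
```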
Fix DeepSeekModel / TrtllmWideEPDeepSeekModel hardcoded 128-head ops:
DeepSeekModel and TrtllmWideEPDeepSeekModel hardcoded DeepSeek-V3's
128 attention heads in several MLA GEMM / attention ops, making them
produce incorrect weight-size and latency estimates for any DEEPSEEK
model with a different head count (e.g. Kimi-K2.5 with 64 heads).
Replace every affected hardcode with self._num_heads:
- context/generation q_b_proj_gemm n = num_heads * 192 // tp
- context kv_b_proj_gemm n = num_heads * 256 // tp
- context/generation_attention n_heads = num_heads // tp
- context_proj_gemm k = num_heads * 128 // tp
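The head-count parameterization above can be sketched as follows; MLAShapes is an illustrative stand-in for the model classes, with the dimension formulas taken verbatim from the bullet list.

```python
# Sketch of the fix: MLA GEMM shapes derived from the model's head count
# instead of a hardcoded 128 (class name is illustrative).
class MLAShapes:
    def __init__(self, num_heads: int, tp: int):
        self._num_heads = num_heads
        self.tp = tp

    def q_b_proj_n(self):       # context/generation q_b_proj_gemm
        return self._num_heads * 192 // self.tp

    def kv_b_proj_n(self):      # context kv_b_proj_gemm
        return self._num_heads * 256 // self.tp

    def attention_heads(self):  # context/generation attention
        return self._num_heads // self.tp

    def proj_gemm_k(self):      # context_proj_gemm
        return self._num_heads * 128 // self.tp

# DeepSeek-V3 (128 heads) vs Kimi-K2.5 (64 heads) at tp=8:
print(MLAShapes(128, 8).q_b_proj_n())  # 3072
print(MLAShapes(64, 8).q_b_proj_n())   # 1536
```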
Fix nextn (MTP) auto-assigned to all DEEPSEEK models (task.py):
nextn was unconditionally set to 1 for every DEEPSEEK model, adding a
spurious (nextn+1) activation-memory multiplier and incorrect MTP
latency scaling for models without Multi-Token Prediction support.
Now reads num_nextn_predict_layers from the raw model config (default 0),
so DeepSeek-V3/V3.1 still get nextn=1 while Kimi-K2.5 gets nextn=0.
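The nextn derivation is a one-liner once the raw HF config dict is loaded; the helper name below is illustrative.

```python
# Sketch of the nextn fix: read MTP support from the model config
# (default 0) instead of assuming it for every DEEPSEEK-family model.
def derive_nextn(raw_config: dict) -> int:
    return raw_config.get("num_nextn_predict_layers", 0)

print(derive_nextn({"num_nextn_predict_layers": 1}))  # 1 (DeepSeek-V3/V3.1)
print(derive_nextn({}))                               # 0 (Kimi-K2.5, no MTP)
```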
Fix IndexError in get_worker_candidates() when all configs OOM (inference_session.py):
Same exceptions[-1]-on-empty-list crash fixed in agg_pareto() by #378
now also fixed in DisaggInferenceSession.get_worker_candidates().
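The guard pattern behind this fix, in illustrative form; the real session code collects different candidate objects and exception types.

```python
# Sketch of the exceptions[-1]-on-empty-list fix: when every config fails,
# re-raise the last recorded error instead of indexing into an empty list.
def pick_candidates(configs, try_config):
    candidates, exceptions = [], []
    for cfg in configs:
        try:
            candidates.append(try_config(cfg))
        except MemoryError as e:  # stand-in for the session's OOM errors
            exceptions.append(e)
    if not candidates:
        if exceptions:
            raise exceptions[-1]  # previously: IndexError when both lists empty
        raise RuntimeError("no worker candidates and no recorded errors")
    return candidates

def oom(cfg):
    raise MemoryError(f"OOM for {cfg}")

try:
    pick_candidates(["tp1", "tp2"], oom)
except MemoryError as e:
    print(e)  # OOM for tp2
```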
Fix disagg per-worker GPU search space not scaling with --total-gpus (task.py):
_finalize_disagg used total_gpus only to cap max_gpu_per_replica (replica
scaling), but never updated num_gpu_per_worker / tp_list / dp_list /
moe_ep_list in the prefill and decode worker configs. Those lists were
hardcoded to [1,2,4,8], so large MoE models like Kimi-K2.5 (needing
EP=32+ to avoid OOM) were never explored regardless of --total-gpus.
_finalize_disagg now extends each non-singleton parallel list with
powers-of-2 up to total_gpus so that configurations like EP=32/64/128
are included in the sweep when sufficient GPUs are available.
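The list-extension logic can be sketched as below; extend_parallel_list is an illustrative name, and the real _finalize_disagg may filter candidates further.

```python
# Sketch of the --total-gpus fix: grow each non-singleton parallel list
# (tp_list / dp_list / moe_ep_list) with powers of two up to total_gpus.
def extend_parallel_list(values, total_gpus):
    if len(values) <= 1:  # singleton lists are left untouched
        return list(values)
    out = set(values)
    p = 1
    while p <= total_gpus:
        if p > max(values):
            out.add(p)
        p *= 2
    return sorted(out)

print(extend_parallel_list([1, 2, 4, 8], 128))
# [1, 2, 4, 8, 16, 32, 64, 128] -> EP=32/64/128 now enter the sweep
print(extend_parallel_list([1], 128))  # [1]
```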
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jasonqinzhou force-pushed from ffc79b3 to 3270e64
Walkthrough
The PR adds support for three new Kimi model variants with corresponding JSON configurations, expands MLA test case generation parameters, and introduces a new …

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
src/aiconfigurator/sdk/task.py (2)
306-307: Logic correctly derives nextn from model config.
This change properly reads num_nextn_predict_layers directly from the model's configuration instead of inferring it from the model family. This is more accurate for models like Kimi-K2.5 that may share architecture characteristics but have different MTP settings.
Consider adding defensive error handling similar to what's done in validate() (lines 773-776) to provide graceful fallback if model config loading fails:

🛡️ Optional defensive handling
 @staticmethod
 def _base_common_layer(ctx: TaskContext) -> dict:
-    raw_config = get_model_config_from_model_path(ctx.model_path).get("raw_config", {})
+    try:
+        raw_config = get_model_config_from_model_path(ctx.model_path).get("raw_config", {})
+    except Exception:
+        logger.warning("Could not load model config for %s; defaulting nextn to 0", ctx.model_path)
+        raw_config = {}
     nextn = raw_config.get("num_nextn_predict_layers", 0)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/task.py` around lines 306 - 307, The code now reads num_nextn_predict_layers via get_model_config_from_model_path(ctx.model_path) into raw_config/nextn but lacks defensive error handling; wrap the model config load in a try/except (or equivalent error check) around get_model_config_from_model_path to catch failures, log or surface the error consistently (similar to validate()'s handling), and fallback to a safe default (e.g., 0) for nextn so downstream logic won’t crash if loading the model config fails; reference get_model_config_from_model_path, ctx.model_path, raw_config, nextn and mirror the validate() pattern for logging/fallback.
773-777: Consider caching model config to avoid duplicate loading.
get_model_config_from_model_path is called here in validate() and also earlier in _base_common_layer() (line 306). For remote HuggingFace models, this results in redundant network requests. Consider caching the result on the TaskConfig instance or passing it through the context to avoid the duplicate load.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiconfigurator/sdk/task.py` around lines 773 - 777, The call to get_model_config_from_model_path is duplicated in validate() and _base_common_layer(), causing redundant remote loads for HuggingFace models; cache the result on the TaskConfig instance (e.g., add an attribute like self._cached_model_config) or pass the fetched value through the shared context so subsequent calls reuse it: update _base_common_layer() (or the earlier loader) to set self._cached_model_config = model_info and change validate() to check self._cached_model_config before calling get_model_config_from_model_path again (falling back to fetching and caching if absent).
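The caching suggestion can be sketched with a stubbed loader; the attribute name _cached_model_config comes from the review comment, while the class shape and loader are illustrative.

```python
# Sketch of config caching on the TaskConfig instance so validate() and
# _base_common_layer() share one load (illustrative, not the real class).
class TaskConfig:
    def __init__(self, model_path, loader):
        self.model_path = model_path
        self._loader = loader  # stand-in for get_model_config_from_model_path
        self._cached_model_config = None

    def model_config(self):
        if self._cached_model_config is None:
            self._cached_model_config = self._loader(self.model_path)
        return self._cached_model_config

calls = []
def fake_loader(path):
    calls.append(path)  # records each (potentially remote) fetch
    return {"raw_config": {"num_nextn_predict_layers": 0}}

tc = TaskConfig("moonshotai/Kimi-K2.5", fake_loader)
tc.model_config()   # loads once
tc.model_config()   # served from cache
print(len(calls))   # 1
```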
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- collector/common_test_cases.py
- collector/trtllm/collect_mla.py
- src/aiconfigurator/cli/main.py
- src/aiconfigurator/model_configs/moonshotai--Kimi-K2-Instruct_config.json
- src/aiconfigurator/model_configs/moonshotai--Kimi-K2-Thinking_config.json
- src/aiconfigurator/model_configs/moonshotai--Kimi-K2.5_config.json
- src/aiconfigurator/sdk/common.py
- src/aiconfigurator/sdk/inference_session.py
- src/aiconfigurator/sdk/models.py
- src/aiconfigurator/sdk/task.py
- src/aiconfigurator/sdk/utils.py
Model the ViT vision encoder (27-layer, 1152-dim), patch merger, and projector as GEMM/ElementWise ops prepended to context_ops. Each vision op carries _vision_num_tokens so the backend uses the correct token count (4096 pre-merge, 1024 post-merge) instead of isl. Also reverts collector changes per review comments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
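The per-op token accounting described in this commit can be sketched as follows; the op-dict layout is illustrative, with only the _vision_num_tokens key and the 4096/1024 token counts taken from the message.

```python
# Sketch: vision ops carry their own token count via _vision_num_tokens
# so the backend does not fall back to isl for them.
def vision_ops(hidden=1152):
    pre_merge, post_merge = 4096, 1024  # tokens before/after the patch merger
    return [
        {"name": "vit_block_gemm", "_vision_num_tokens": pre_merge, "dim": hidden},
        {"name": "patch_merger", "_vision_num_tokens": pre_merge, "dim": hidden},
        {"name": "projector_gemm", "_vision_num_tokens": post_merge, "dim": hidden},
    ]

def num_tokens(op, isl):
    # vision ops use their own token count; ordinary LLM ops use isl
    return op.get("_vision_num_tokens", isl)

print([num_tokens(op, 8192) for op in vision_ops()])  # [4096, 4096, 1024]
```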
@@ -0,0 +1,26 @@
+ {
We need to do a full copy-paste; this config is missing the quant field. https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/config.json
@@ -0,0 +1,26 @@
+ {
+   "architectures": ["DeepseekV3ForCausalLM"],
Looks like this model employs a 4-bit quant:
https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/config.json
Were all these fields generated by Claude? I think we need a full copy-paste to avoid misalignment.
      help="Optional end-to-end request latency target (ms). Enables request-latency optimization mode.",
  )
  parser.add_argument("--prefix", type=int, default=0, help="Prefix cache length. Default to 0.")
+ parser.add_argument(
Can we remove wideep and design in a separate PR?
Summary by CodeRabbit

New Features
- --enable-wideep CLI flag for enhanced model configuration support.

Improvements