[feat](kt-kernel): support qwen3-vl weights convert #1648

Merged: KMSorSMS merged 1 commit into kvcache-ai:main from mrhaoxx:main on Nov 27, 2025

Conversation

@mrhaoxx (Collaborator) commented Nov 27, 2025

What does this PR do?

Before submitting

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @mrhaoxx, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the weight conversion script to accommodate Qwen3-VL models. It addresses the unique structural differences of these models, particularly how their Mixture-of-Experts (MoE) layers are configured and stored. The changes ensure that the conversion process can correctly interpret and transform the weights of VL models, enabling their use with the kt-kernel framework.

Highlights

  • Qwen3-VL Model Support: Introduced support for converting weights of Qwen3-VL models, which have a distinct configuration structure and fused MoE expert layouts.
  • Dynamic Configuration Loading: Enhanced the configuration loading mechanism to detect and correctly parse text_config for VL models, distinguishing them from base models.
  • Fused MoE Expert Handling: Implemented specialized logic within _find_expert_layers and _convert_layer_experts to correctly identify and process the fused gate, up, and down projection weights characteristic of VL models.
  • Expanded Quantization Methods: Added moe_int4 and moe_int8 as supported quantization methods, allowing for more specific quantization strategies for Mixture-of-Experts models.
  • Improved Error Handling and Logging: Added a warning for missing NUMA folders during layer loading and more informative print statements regarding model type and fused tensor shapes during conversion.
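The dynamic configuration loading described above can be sketched as follows. This is an illustrative helper, not the PR's actual code: `load_text_config` and the example field values are assumptions, though the `text_config` nesting follows the Hugging Face convention for VL checkpoints that the PR handles.

```python
def load_text_config(config: dict) -> tuple[str, dict]:
    """Return ("vl", text_config) for VL checkpoints, ("base", config) otherwise.

    Qwen3-VL checkpoints nest the language-model settings under "text_config",
    while base checkpoints keep them at the top level of config.json.
    """
    if "text_config" in config:
        return "vl", config["text_config"]
    return "base", config

# Example config shapes (field names follow HF conventions; values are made up).
vl_cfg = {"model_type": "qwen3_vl_moe",
          "text_config": {"hidden_size": 2048, "moe_intermediate_size": 768}}
base_cfg = {"model_type": "qwen3_moe",
            "hidden_size": 2048, "moe_intermediate_size": 768}

print(load_text_config(vl_cfg)[0])    # -> vl
print(load_text_config(base_cfg)[0])  # -> base
```

A converter can branch on the returned kind to select the fused-expert path for VL models and the per-expert path otherwise.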
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment:
Code Review

This pull request adds support for converting qwen3-vl model weights, which use a fused expert layout. The changes correctly identify the model type and introduce a new conversion path for these fused weights. The implementation is mostly correct, but I've identified a critical issue that will cause the script to fail at runtime due to unsupported quantization methods in a downstream component. I've also pointed out a high-severity issue regarding fragile logic that relies on alphabetical sorting of tensor names, which could lead to incorrect weight conversion. Finally, I've suggested a refactoring to improve the maintainability of a large and complex function. Addressing these points will make the implementation more robust and easier to maintain.

Comment on lines +746 to +747:

    down_fused = fused_tensors[0]
    gate_up_fused = fused_tensors[1]

Severity: high

The logic to assign down_fused and gate_up_fused relies on the alphabetical order of the projection names, as fused_tensors is populated from a sorted list of projection keys. This is fragile and could lead to incorrect weight loading if the projection names don't sort as expected.

A more robust approach would be to identify the tensors based on their shapes, which appear to be distinct. The down-like tensor is [E, I, H] and the gate_up-like tensor is [E, H, 2I]. You can use self.moe_intermediate_size (I) and self.hidden_size (H) to differentiate them.

Suggested change:

    if fused_tensors[0].shape[1] == self.moe_intermediate_size:
        down_fused = fused_tensors[0]
        gate_up_fused = fused_tensors[1]
    else:
        down_fused = fused_tensors[1]
        gate_up_fused = fused_tensors[0]
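The shape-based disambiguation the reviewer proposes can be demonstrated on plain shape tuples, without loading any weights. `identify_fused` is a hypothetical helper written for this sketch, not code from the PR:

```python
def identify_fused(shapes, moe_intermediate_size):
    """Pick (down_idx, gate_up_idx) from two fused-tensor shapes.

    Down-like tensors are [E, I, H]; gate_up-like tensors are [E, H, 2I].
    Matching dim 1 against I avoids relying on alphabetical key order.
    (If H happened to equal I, the two layouts would be ambiguous and a
    further check on the last dim would be needed.)
    """
    assert len(shapes) == 2
    if shapes[0][1] == moe_intermediate_size:
        return 0, 1
    if shapes[1][1] == moe_intermediate_size:
        return 1, 0
    raise ValueError(f"Neither shape has I={moe_intermediate_size} at dim 1: {shapes}")

# E=8 experts, I=768, H=2048: down is [8, 768, 2048], gate_up is [8, 2048, 1536].
down_idx, gate_up_idx = identify_fused([(8, 2048, 1536), (8, 768, 2048)], 768)
print(down_idx, gate_up_idx)  # -> 1 0
```

The same selection works regardless of how the projection keys happen to sort, which is the robustness the review comment asks for.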

Comment on lines +708 to 841:

    if self.kt_cvt_type == "vl":
        if self.input_type not in ["bf16", "fp16"]:
            raise ValueError(f"VL path currently supports bf16/fp16 only, got input_type={self.input_type}")

        proj_set = set()
        prefix = f"model.language_model.layers.{layer_idx}.mlp.experts."
        for key in self.tensor_file_map.keys():
            if key.startswith(prefix):
                parts = key.split(".")
                if len(parts) >= 7:
                    proj_set.add(parts[6])

        if not proj_set:
            raise ValueError(
                f"[VL] No fused MoE experts found for layer {layer_idx} under 'model.language_model.layers'"
            )

        projs = sorted(proj_set)
        print(f"  [VL] layer {layer_idx} fused proj keys: {projs}")

        if len(projs) < 2:
            raise ValueError(
                f"[VL] Expect at least 2 fused tensors (down & gate_up) in layer {layer_idx}, got {len(projs)}"
            )

        fused_tensors = []
        for p in projs:
            key = f"model.language_model.layers.{layer_idx}.mlp.experts.{p}"
            if key not in self.tensor_file_map:
                raise KeyError(f"[VL] Missing fused tensor {key} for layer {layer_idx}")
            w = self._load_tensor(key)
            if self.input_type == "fp16":
                w = w.to(torch.bfloat16)
            print(f"  [VL] tensor {p} shape: {tuple(w.shape)}")
            fused_tensors.append(w)

        # fused_tensors[0]: down-like, [E, I, H]
        # fused_tensors[1]: gate_up-like, [E, H, 2I]
        down_fused = fused_tensors[0]
        gate_up_fused = fused_tensors[1]

        # gate_up_fused: [E, H, 2I] -> [E, 2I, H] -> gate / up
        if gate_up_fused.dim() != 3:
            raise ValueError(f"[VL] Expect gate_up fused tensor to be 3D, got shape {tuple(gate_up_fused.shape)}")
        E, H, twoI = gate_up_fused.shape
        if twoI % 2 != 0:
            raise ValueError(f"[VL] gate_up last dim (2I) not even: {twoI}")
        I = twoI // 2

        gate_up_T = gate_up_fused.transpose(1, 2).contiguous()  # [E, 2I, H]
        gate_proj = gate_up_T[:, :I, :]  # [E, I, H]
        up_proj = gate_up_T[:, I:, :]  # [E, I, H]

        if down_fused.dim() != 3:
            raise ValueError(f"[VL] Expect down fused tensor to be 3D, got shape {tuple(down_fused.shape)}")
        if down_fused.shape[0] != E:
            raise ValueError(
                f"[VL] down_fused expert dim mismatch: {down_fused.shape[0]} vs gate_up {E}"
            )
        down_proj = down_fused.transpose(1, 2).contiguous()  # [E, H, I]
        del fused_tensors
        del gate_up_fused
        del down_fused
    else:
        gate_weights = []
        up_weights = []
        down_weights = []

        for expert_id in expert_ids:
            gate_key = f"model.layers.{layer_idx}.mlp.experts.{expert_id}.gate_proj.weight"
            up_key = f"model.layers.{layer_idx}.mlp.experts.{expert_id}.up_proj.weight"
            down_key = f"model.layers.{layer_idx}.mlp.experts.{expert_id}.down_proj.weight"

            if gate_key not in self.tensor_file_map:
                raise KeyError(f"Missing gate weight for layer {layer_idx}, expert {expert_id}")
            if up_key not in self.tensor_file_map:
                raise KeyError(f"Missing up weight for layer {layer_idx}, expert {expert_id}")
            if down_key not in self.tensor_file_map:
                raise KeyError(f"Missing down weight for layer {layer_idx}, expert {expert_id}")

            # Load weights based on input type
            if self.input_type == "fp8":
                # Load FP8 weights and their scale_inv tensors
                gate_scale_key = f"model.layers.{layer_idx}.mlp.experts.{expert_id}.gate_proj.weight_scale_inv"
                up_scale_key = f"model.layers.{layer_idx}.mlp.experts.{expert_id}.up_proj.weight_scale_inv"
                down_scale_key = f"model.layers.{layer_idx}.mlp.experts.{expert_id}.down_proj.weight_scale_inv"

                if gate_scale_key not in self.tensor_file_map:
                    raise KeyError(f"Missing gate weight_scale_inv for layer {layer_idx}, expert {expert_id}")
                if up_scale_key not in self.tensor_file_map:
                    raise KeyError(f"Missing up weight_scale_inv for layer {layer_idx}, expert {expert_id}")
                if down_scale_key not in self.tensor_file_map:
                    raise KeyError(f"Missing down weight_scale_inv for layer {layer_idx}, expert {expert_id}")

                # Load FP8 weights and scales
                gate_fp8 = self._load_tensor(gate_key).to("cuda")
                up_fp8 = self._load_tensor(up_key).to("cuda")
                down_fp8 = self._load_tensor(down_key).to("cuda")

                gate_scale_inv = self._load_tensor(gate_scale_key).to("cuda")
                up_scale_inv = self._load_tensor(up_scale_key).to("cuda")
                down_scale_inv = self._load_tensor(down_scale_key).to("cuda")

                # Dequantize FP8 to BF16 using block-wise scaling
                gate_weight = weight_dequant(gate_fp8, gate_scale_inv).to("cpu").to(torch.bfloat16).contiguous()
                up_weight = weight_dequant(up_fp8, up_scale_inv).to("cpu").to(torch.bfloat16).contiguous()
                down_weight = weight_dequant(down_fp8, down_scale_inv).to("cpu").to(torch.bfloat16).contiguous()

            elif self.input_type == "fp16":
                # Load FP16 and convert to BF16
                gate_weight = self._load_tensor(gate_key).to(torch.bfloat16)
                up_weight = self._load_tensor(up_key).to(torch.bfloat16)
                down_weight = self._load_tensor(down_key).to(torch.bfloat16)

            elif self.input_type == "bf16":
                # Load BF16 directly
                gate_weight = self._load_tensor(gate_key)
                up_weight = self._load_tensor(up_key)
                down_weight = self._load_tensor(down_key)

            else:
                raise ValueError(f"Unsupported input_type for INT4 conversion: {self.input_type}")

            gate_weights.append(gate_weight)
            up_weights.append(up_weight)
            down_weights.append(down_weight)

        # Stack weights into single tensors: [num_experts, ...]
        gate_proj = torch.stack(gate_weights, dim=0).contiguous()
        up_proj = torch.stack(up_weights, dim=0).contiguous()
        down_proj = torch.stack(down_weights, dim=0).contiguous()
        del gate_weights, up_weights, down_weights
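The gate_up split in the VL path above ([E, H, 2I] -> transpose -> first I rows as gate, last I rows as up) can be checked with a tiny pure-Python model of the tensor. Nested lists stand in for torch tensors here, and `split_gate_up` is illustrative only:

```python
def split_gate_up(fused):
    """Split a fused gate_up tensor laid out as [E][H][2I] nested lists.

    Mirrors the VL path: transpose the last two dims to [E][2I][H],
    then take the first I rows as gate_proj and the last I as up_proj.
    """
    E, H, two_i = len(fused), len(fused[0]), len(fused[0][0])
    assert two_i % 2 == 0, "last dim must be 2I"
    i = two_i // 2
    # transpose (H, 2I) -> (2I, H) for each expert
    t = [[[fused[e][h][c] for h in range(H)] for c in range(two_i)] for e in range(E)]
    gate = [expert[:i] for expert in t]  # [E][I][H]
    up = [expert[i:] for expert in t]    # [E][I][H]
    return gate, up

# One expert, H=2, I=1: column 0 belongs to gate, column 1 to up.
gate, up = split_gate_up([[[1, 2], [3, 4]]])
print(gate)  # -> [[[1, 3]]]
print(up)    # -> [[[2, 4]]]
```

The toy example makes the layout assumption explicit: within each row of length 2I, the first I entries feed the gate projection and the last I feed the up projection, exactly as the `gate_up_T[:, :I, :]` / `gate_up_T[:, I:, :]` slices assume.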

Severity: medium

The _convert_layer_experts function has become very large with the addition of the vl conversion path. The if/else block for self.kt_cvt_type is substantial, making the function difficult to read and maintain.

Consider refactoring the weight loading logic for "vl" and "base" types into separate helper methods. This would make _convert_layer_experts cleaner and more focused on the quantization process itself.

For example:

    def _load_vl_expert_weights(self, layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # ... logic from the `if self.kt_cvt_type == "vl":` block ...
        return gate_proj, up_proj, down_proj

    def _load_base_expert_weights(self, layer_idx: int, expert_ids: List[int]) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # ... logic from the `else:` block ...
        return gate_proj, up_proj, down_proj

    def _convert_layer_experts(self, layer_idx: int, expert_ids: List[int]) -> Dict[str, torch.Tensor]:
        """Convert all experts in a layer using online quantization via AMXMoEWrapper"""
        start_time = time.time()
        print(f"Converting layer {layer_idx} with {len(expert_ids) if self.kt_cvt_type == 'base' else 'fused'} experts via online quantization...")

        if self.kt_cvt_type == "vl":
            gate_proj, up_proj, down_proj = self._load_vl_expert_weights(layer_idx)
        else:
            gate_proj, up_proj, down_proj = self._load_base_expert_weights(layer_idx, expert_ids)

        # ... rest of the quantization logic ...

This would significantly improve the readability and maintainability of the code.

@KMSorSMS KMSorSMS merged commit 637c49c into kvcache-ai:main Nov 27, 2025
9 checks passed