
Commit 90e3400

fix(z-image): Fix padding token shape mismatch for GGUF models (#8690)
## Summary

Fix a shape mismatch when loading GGUF-quantized Z-Image transformer models. GGUF Z-Image models store `x_pad_token` and `cap_pad_token` with shape `[3840]`, but diffusers' `ZImageTransformer2DModel` expects `[1, 3840]` (with a batch dimension). This caused a `RuntimeError` on Linux systems when loading models like `z_image_turbo-Q4_K.gguf`.

The fix:

- Dequantizes GGMLTensors first (since they don't support `unsqueeze`)
- Reshapes the tensors to add the missing batch dimension

A minimal sketch of the shape change appears after the checklist below.

## Related Issues / Discussions

Reported by a Linux user using:

- https://huggingface.co/leejet/Z-Image-Turbo-GGUF/resolve/main/z_image_turbo-Q4_K.gguf
- https://huggingface.co/worstplayer/Z-Image_Qwen_3_4b_text_encoder_GGUF/resolve/main/Qwen_3_4b-Q6_K.gguf

## QA Instructions

1. Install a GGUF-quantized Z-Image model (e.g., `z_image_turbo-Q4_K.gguf`)
2. Install a Qwen3 GGUF text encoder
3. Run a Z-Image generation
4. Verify that no `RuntimeError: size mismatch for x_pad_token` error occurs

## Merge Plan

None; this is a straightforward fix.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a changelog_
- [ ] _Tests added / updated (if applicable)_
- [ ] _❗Changes to a redux slice have a corresponding migration_
- [ ] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
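For illustration, here is a minimal sketch of the shape change the fix performs, using a plain tensor in place of the GGMLTensor loaded from the checkpoint (the `3840` width matches the pad-token size mentioned above):

```python
import torch

# GGUF checkpoints store the pad tokens as a 1-D tensor of shape [3840].
pad_token = torch.zeros(3840)

# The converter adds the batch dimension that ZImageTransformer2DModel expects.
fixed = torch.as_tensor(pad_token).reshape(1, -1)

assert fixed.shape == (1, 3840)  # [dim] -> [1, dim]
```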
2 parents aa764f8 + 7068cf9 commit 90e3400

File tree

1 file changed: +12 −0 lines

  • invokeai/backend/model_manager/load/model_loaders/z_image.py


invokeai/backend/model_manager/load/model_loaders/z_image.py

Lines changed: 12 additions & 0 deletions
```diff
@@ -42,6 +42,7 @@ def _convert_z_image_gguf_to_diffusers(sd: dict[str, Any]) -> dict[str, Any]:
     - x_embedder.* -> all_x_embedder.2-1.*
     - final_layer.* -> all_final_layer.2-1.*
     - norm_final.* -> skipped (diffusers uses non-learnable LayerNorm)
+    - x_pad_token, cap_pad_token: [dim] -> [1, dim] (diffusers expects batch dimension)
     """
     new_sd: dict[str, Any] = {}
 
@@ -50,6 +51,17 @@ def _convert_z_image_gguf_to_diffusers(sd: dict[str, Any]) -> dict[str, Any]:
             new_sd[key] = value
             continue
 
+        # Handle padding tokens: GGUF has shape [dim], diffusers expects [1, dim]
+        if key in ("x_pad_token", "cap_pad_token"):
+            if hasattr(value, "shape") and len(value.shape) == 1:
+                # GGMLTensor doesn't support unsqueeze, so dequantize first if needed
+                if hasattr(value, "get_dequantized_tensor"):
+                    value = value.get_dequantized_tensor()
+                # Use reshape instead of unsqueeze for better compatibility
+                value = torch.as_tensor(value).reshape(1, -1)
+            new_sd[key] = value
+            continue
+
         # Handle x_embedder -> all_x_embedder.2-1
         if key.startswith("x_embedder."):
            suffix = key[len("x_embedder.") :]
```
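For QA beyond the manual steps in the summary, a unit test of the new branch might look like the sketch below. The import path mirrors the file shown above; it is assumed here that the helper accepts a partial state dict containing only the pad tokens.

```python
import torch

from invokeai.backend.model_manager.load.model_loaders.z_image import (
    _convert_z_image_gguf_to_diffusers,
)


def test_pad_tokens_gain_batch_dimension():
    # Simulate a GGUF-style state dict where the pad tokens are stored as [3840].
    sd = {
        "x_pad_token": torch.zeros(3840),
        "cap_pad_token": torch.zeros(3840),
    }

    converted = _convert_z_image_gguf_to_diffusers(sd)

    # After conversion, both tokens should carry the batch dimension diffusers expects.
    assert converted["x_pad_token"].shape == (1, 3840)
    assert converted["cap_pad_token"].shape == (1, 3840)
```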
