Qwen3.5 MoE VLM crashes on text-only inference (SmallVector out of range) #148

@dirvine

Description

Loading mlx-community/Qwen3.5-35B-A3B-4bit via VLMModelFactory.shared.loadContainer() works fine — the container loads and weights map correctly. But the first text-only inference call crashes with:

MLX/ErrorHandler.swift:343: Fatal error: SmallVector out of range.
at mlx-c/mlx/c/array.cpp:335

The same model loads and runs fine through LLMModelFactory.

Root Cause

Qwen3VLLanguage.getRopeIndex() (Qwen3VL.swift L1341) assumes inputIds is always 2D [batch, seq]:

let (batchSize, seqLength) = (inputIds.dim(0), inputIds.dim(1))

When the VLM container is used for text-only generation (no images), the token tensor can be 1D [seq]. inputIds.dim(1) on a 1D array triggers the SmallVector fatal error.
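The rank dependence can be illustrated without MLX by modeling shapes as plain Swift arrays. This is only a sketch: batchAndSequence(of:) and UnsupportedRank are hypothetical names that do not exist in mlx-swift-lm, and the sketch assumes a 1D [seq] tensor should be read as a single batch.

```swift
/// Hypothetical error type for this sketch; not part of mlx-swift-lm.
struct UnsupportedRank: Error {
    let shape: [Int]
}

/// Reads (batch, seq) from a token shape modeled as a plain [Int].
/// A 1D [seq] shape is treated as one batch; a 2D [batch, seq] shape is read directly.
/// Any other rank throws instead of indexing out of range.
func batchAndSequence(of shape: [Int]) throws -> (batch: Int, seq: Int) {
    switch shape.count {
    case 1:
        return (1, shape[0])
    case 2:
        return (shape[0], shape[1])
    default:
        throw UnsupportedRank(shape: shape)
    }
}
```

Reading index 1 unconditionally, as the line quoted above does, is only valid for the 2D case; the switch makes the 1D case explicit and turns any other rank into a recoverable error.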

The call path is:

  1. Qwen35.prepare(input:cache:windowSize:) — no image/video, so pixelValues is nil
  2. languageModel(inputIds, ...) — forwards raw token IDs
  3. Qwen35Language.LanguageModel.callAsFunction, where positionIds == nil and ropeDeltas == nil, so it enters the getRopeIndex branch
  4. Qwen3VLLanguage.getRopeIndex(inputIds:...), which crashes on inputIds.dim(1) because the tensor is 1D

This shows up concretely when WiredMemoryUtils.tune() runs its prefill measurement — it creates a plain LMInput(tokens:) with 1D tokens, calls model.prepare(), and the VLM path crashes.

Reproduction

import MLXLMCommon
import MLXVLM

let config = ModelConfiguration(id: "mlx-community/Qwen3.5-35B-A3B-4bit")
let container = try await VLMModelFactory.shared.loadContainer(configuration: config)

// This crashes:
try await container.perform { context in
    try await WiredMemoryUtils.tune(
        context: context,
        tokenCount: 512,
        parameters: GenerateParameters(maxTokens: 1)
    )
}

Suggested Fix

The simplest fix would be to ensure inputIds is 2D before calling getRopeIndex. Something like:

// In Qwen35Language.LanguageModel.callAsFunction, before calling getRopeIndex:
var inputsForRope = inputs
if inputsForRope.ndim == 1 {
    inputsForRope = inputsForRope.expandedDimensions(axis: 0)
}

Alternatively, getRopeIndex itself could handle 1D input gracefully.
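As a rough sketch of that second option, getRopeIndex could normalize its input on entry with a small helper. The name ensureBatched is hypothetical and not part of mlx-swift-lm; it uses only MLXArray.ndim and expandedDimensions(axis:), the same calls as the snippet above, and the sample token values are made up for illustration.

```swift
import MLX

/// Hypothetical helper (not in mlx-swift-lm): returns token IDs as [batch, seq].
/// A 1D [seq] input gains a leading batch axis of size 1; other ranks pass through unchanged.
func ensureBatched(_ inputIds: MLXArray) -> MLXArray {
    inputIds.ndim == 1 ? inputIds.expandedDimensions(axis: 0) : inputIds
}

// Assumed use at the top of getRopeIndex, before reading dim(0) and dim(1).
// The token values here are placeholders.
let tokens = ensureBatched(MLXArray([1, 2, 3] as [Int32]))
let (batchSize, seqLength) = (tokens.dim(0), tokens.dim(1))
```

Putting the normalization inside getRopeIndex would cover every caller, whereas the callAsFunction variant above only covers that one call site.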

Environment

  • mlx-swift-lm: main branch (HEAD = bc3c20e)
  • mlx-swift: main branch (HEAD = b6e128c)
  • Model: mlx-community/Qwen3.5-35B-A3B-4bit (model_type: qwen3_5_moe)
  • macOS 26.0, Apple M4 Max 96GB
