
Conversation

ddh0 (Contributor) commented on Oct 15, 2025

Add support for the zai-org/GLM-4.5V vision model to llama.cpp. I currently plan to support only images + text in this PR; no video inputs.

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe"). Internally, this consists of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM (text model glm4v_moe_text)

  • Based on GLM-4.5-Air
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE": in apply_multimodal_rotary_pos_emb, rotary embeddings are applied across the temporal, height, and width dimensions for visual tokens (see the sketch below)
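
For reference, here is a minimal PyTorch sketch of the 3D-RoPE idea, loosely following the mrope-style head-dimension split used for similar vision LLMs in transformers: the head dimension is divided into temporal/height/width sections, and each section is rotated using that axis's position id. The section split (16, 24, 24), head_dim = 128, and the toy position ids are illustrative assumptions, not the model's actual configuration.

```python
# Hedged sketch of a "multimodal 3D RoPE". Section sizes and head_dim are
# illustrative assumptions, not GLM-4.5V's real hyperparameters.
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: pair dim j with dim j + head_dim/2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def build_rope_cache(pos_ids_3d: torch.Tensor, head_dim: int, theta: float = 10_000.0):
    # pos_ids_3d: (3, batch, seq) holding temporal/height/width positions.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = pos_ids_3d[..., None].float() * inv_freq          # (3, B, S, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                    # (3, B, S, head_dim)
    return emb.cos(), emb.sin()

def apply_multimodal_rope(q, k, cos, sin, sections=(16, 24, 24)):
    # q, k: (B, n_heads, S, head_dim); cos, sin: (3, B, S, head_dim).
    # Each axis contributes sections[i] rotation pairs; the split is repeated
    # so that dim j and dim j + head_dim/2 always use the same axis's angles.
    split = list(sections) + list(sections)
    cos = torch.cat([c[i % 3] for i, c in enumerate(cos.split(split, dim=-1))], dim=-1)
    sin = torch.cat([s_[i % 3] for i, s_ in enumerate(sin.split(split, dim=-1))], dim=-1)
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)              # broadcast over heads
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Toy usage: 2 text tokens followed by a 2x2 grid of image patches.
B, H, S, D = 1, 2, 6, 128
pos = torch.tensor([
    [[0, 1, 2, 2, 2, 2]],   # temporal position ids
    [[0, 1, 2, 2, 3, 3]],   # height position ids
    [[0, 1, 2, 3, 2, 3]],   # width position ids
])
cos, sin = build_rope_cache(pos, D)
q, k = torch.randn(B, H, S, D), torch.randn(B, H, S, D)
q_rot, k_rot = apply_multimodal_rope(q, k, cos, sin)
```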

ViT (vision adapter glm4v_moe)

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers (depth: 24)
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14
    • num_channels: 3
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions (see the sketch after this list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
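
As a rough illustration of the positional-embedding interpolation, here is a sketch using torch.nn.functional.interpolate with mode="bicubic". The 24x24 base grid follows from image_size 336 / patch_size 14; the function name and tensor layout are assumptions for illustration, not the actual Glm4vMoeVisionEmbeddings code.

```python
# Hedged sketch: resize a learned 2D position-embedding table to a new patch
# grid via bicubic interpolation. Layout and naming are assumptions.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """pos_embed: (num_patches, hidden), learned for a square base grid
    (24*24 = 576 positions for 336px images with 14px patches).
    Returns (grid_h * grid_w, hidden) embeddings for the target grid."""
    num_patches, hidden = pos_embed.shape
    base = int(num_patches ** 0.5)                              # 24 for the base grid
    grid = pos_embed.reshape(1, base, base, hidden).permute(0, 3, 1, 2)  # (N, C, H, W)
    grid = F.interpolate(grid, size=(grid_h, grid_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(grid_h * grid_w, hidden)

# Toy usage: adapt the 24x24 table to a 30x20 patch grid for a non-square image.
base_pos = torch.randn(24 * 24, 1536)                           # n_embd = 1536 per the config
resized = interpolate_pos_embed(base_pos, 30, 20)               # -> (600, 1536)
```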

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ) is 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air); both values need to be written out at conversion time (see the sketch after this list)
  • The model supports video input, but this PR targets images only; video input is out of scope here
  • The tokenizer has video-related special tokens that need to be handled during conversion
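
As a rough sketch of recording the text-model values above at conversion time with the gguf Python package: the architecture string "glm4moe" and the "clip.vision.*" key names below are illustrative assumptions, not necessarily the names this PR will end up using, and the real conversion goes through convert_hf_to_gguf.py (with the vision adapter typically converted into a separate mmproj GGUF).

```python
# Hedged sketch of writing the hyperparameters noted above into a GGUF file.
# Arch string and clip.vision.* keys are assumptions for illustration.
from gguf import GGUFWriter

writer = GGUFWriter("glm-4.5v-text.gguf", arch="glm4moe")

# Text-model values from the notes above.
writer.add_context_length(65536)        # native context, vs 131,072 for GLM-4.5-Air
writer.add_rope_freq_base(10000.0)      # RoPE theta, vs 100,000.0 for GLM-4.5-Air

# Vision-adapter values from the ViT config above (illustrative key names).
writer.add_uint32("clip.vision.block_count", 24)
writer.add_uint32("clip.vision.embedding_length", 1536)
writer.add_uint32("clip.vision.feed_forward_length", 4096)
writer.add_uint32("clip.vision.image_size", 336)
writer.add_uint32("clip.vision.patch_size", 14)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```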

