
Conversation

ddh0 (Contributor) commented on Oct 15, 2025

Add support for the zai-org/GLM-4.5V vision model to llama.cpp. I currently plan to support only images + text in this PR; no video inputs.

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe"). Internally, this consists of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM (text model glm4v_moe_text)

  • Based on GLM-4.5-Air
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE": in apply_multimodal_rotary_pos_emb, rotary embeddings are applied across the temporal, height, and width dimensions for visual tokens (see the sketch below)
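
For reference, here is a minimal PyTorch sketch of the 3D-RoPE idea, loosely following the mrope-style head-dimension split used for similar vision LLMs in transformers: the head dimension is divided into temporal/height/width sections, and each section is rotated using that axis's position id. The section split (16, 24, 24), head_dim = 128, and the toy position ids are illustrative assumptions, not the model's actual configuration.

```python
# Hedged sketch of a "multimodal 3D RoPE". Section sizes and head_dim are
# illustrative assumptions, not GLM-4.5V's real hyperparameters.
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: pair dim j with dim j + head_dim/2.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def build_rope_cache(pos_ids_3d: torch.Tensor, head_dim: int, theta: float = 10_000.0):
    # pos_ids_3d: (3, batch, seq) holding temporal/height/width positions.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = pos_ids_3d[..., None].float() * inv_freq          # (3, B, S, head_dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)                    # (3, B, S, head_dim)
    return emb.cos(), emb.sin()

def apply_multimodal_rope(q, k, cos, sin, sections=(16, 24, 24)):
    # q, k: (B, n_heads, S, head_dim); cos, sin: (3, B, S, head_dim).
    # Each axis contributes sections[i] rotation pairs; the split is repeated
    # so that dim j and dim j + head_dim/2 always use the same axis's angles.
    split = list(sections) + list(sections)
    cos = torch.cat([c[i % 3] for i, c in enumerate(cos.split(split, dim=-1))], dim=-1)
    sin = torch.cat([s_[i % 3] for i, s_ in enumerate(sin.split(split, dim=-1))], dim=-1)
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)              # broadcast over heads
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Toy usage: 2 text tokens followed by a 2x2 grid of image patches.
B, H, S, D = 1, 2, 6, 128
pos = torch.tensor([
    [[0, 1, 2, 2, 2, 2]],   # temporal position ids
    [[0, 1, 2, 2, 3, 3]],   # height position ids
    [[0, 1, 2, 3, 2, 3]],   # width position ids
])
cos, sin = build_rope_cache(pos, D)
q, k = torch.randn(B, H, S, D), torch.randn(B, H, S, D)
q_rot, k_rot = apply_multimodal_rope(q, k, cos, sin)
```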

ViT (vision adapter glm4v_moe)

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers (depth: 24)
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14
    • num_channels: 3
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions (see the sketch after this list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
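
As a rough illustration of the positional-embedding interpolation, here is a sketch using torch.nn.functional.interpolate with mode="bicubic". The 24x24 base grid follows from image_size 336 / patch_size 14; the function name and tensor layout are assumptions for illustration, not the actual Glm4vMoeVisionEmbeddings code.

```python
# Hedged sketch: resize a learned 2D position-embedding table to a new patch
# grid via bicubic interpolation. Layout and naming are assumptions.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """pos_embed: (num_patches, hidden), learned for a square base grid
    (24*24 = 576 positions for 336px images with 14px patches).
    Returns (grid_h * grid_w, hidden) embeddings for the target grid."""
    num_patches, hidden = pos_embed.shape
    base = int(num_patches ** 0.5)                              # 24 for the base grid
    grid = pos_embed.reshape(1, base, base, hidden).permute(0, 3, 1, 2)  # (N, C, H, W)
    grid = F.interpolate(grid, size=(grid_h, grid_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(grid_h * grid_w, hidden)

# Toy usage: adapt the 24x24 table to a 30x20 patch grid for a non-square image.
base_pos = torch.randn(24 * 24, 1536)                           # n_embd = 1536 per the config
resized = interpolate_pos_embed(base_pos, 30, 20)               # -> (600, 1536)
```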

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ) is 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air); both values need to be written out at conversion time (see the sketch after this list)
  • The model supports video input, but this PR targets images only; video input is out of scope here
  • The tokenizer has video-related special tokens that need to be handled during conversion
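
As a rough sketch of recording the text-model values above at conversion time with the gguf Python package: the architecture string "glm4moe" and the "clip.vision.*" key names below are illustrative assumptions, not necessarily the names this PR will end up using, and the real conversion goes through convert_hf_to_gguf.py (with the vision adapter typically converted into a separate mmproj GGUF).

```python
# Hedged sketch of writing the hyperparameters noted above into a GGUF file.
# Arch string and clip.vision.* keys are assumptions for illustration.
from gguf import GGUFWriter

writer = GGUFWriter("glm-4.5v-text.gguf", arch="glm4moe")

# Text-model values from the notes above.
writer.add_context_length(65536)        # native context, vs 131,072 for GLM-4.5-Air
writer.add_rope_freq_base(10000.0)      # RoPE theta, vs 100,000.0 for GLM-4.5-Air

# Vision-adapter values from the ViT config above (illustrative key names).
writer.add_uint32("clip.vision.block_count", 24)
writer.add_uint32("clip.vision.embedding_length", 1536)
writer.add_uint32("clip.vision.feed_forward_length", 4096)
writer.add_uint32("clip.vision.image_size", 336)
writer.add_uint32("clip.vision.patch_size", 14)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```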

