
Conversation

@ddh0 ddh0 commented Oct 14, 2025

Implementation of GLM-4.5V in llama.cpp

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe"). Internally, it consists of an LLM (text model) and a ViT (vision adapter / multimodal projector):
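
To make the split concrete, the sub-configs can be inspected from Python. This is a sketch, not part of the PR: it assumes a transformers build that ships the glm4v_moe architecture, and the repo id and attribute names follow the usual HF layout for nested multimodal configs.

```python
from transformers import AutoConfig

# Assumption: the public checkpoint lives at zai-org/GLM-4.5V and the
# installed transformers version knows the glm4v_moe architecture.
cfg = AutoConfig.from_pretrained("zai-org/GLM-4.5V")
print(cfg.architectures)             # expect ["Glm4vMoeForConditionalGeneration"]
print(cfg.model_type)                # expect "glm4v_moe"
print(cfg.text_config.model_type)    # expect "glm4v_moe_text"
print(cfg.vision_config.model_type)  # expect "glm4v_moe", per the split above
```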

LLM (text model glm4v_moe_text)

  • Based on GLM-4.5-Air
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE": in apply_multimodal_rotary_pos_emb, rotary embeddings are applied across separate temporal, height, and width position streams for visual tokens (see the sketch after this list)
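
For reference, the sketch below condenses the HF transformers implementation of apply_multimodal_rotary_pos_emb (written for Qwen2-VL and reused by this model family). Treat the shapes in the comments as assumptions: the per-axis split comes from the checkpoint's rope_scaling.mrope_section (e.g. [16, 24, 24] in Qwen2-VL), and GLM-4.5V's values may differ.

```python
import torch

def rotate_half(x):
    # (x1, x2) -> (-x2, x1) over the last dimension
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_multimodal_rotary_pos_emb(q, k, cos, sin, mrope_section, unsqueeze_dim=1):
    # cos/sin: [3, batch, seq, head_dim], one position stream per axis
    # (temporal, height, width). mrope_section says how many rotary dims
    # each axis owns; the list is repeated (not scaled) because cos/sin
    # store each frequency twice, as cat(freqs, freqs).
    mrope_section = mrope_section * 2  # Python list repetition
    cos = torch.cat(
        [m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1
    ).unsqueeze(unsqueeze_dim)
    sin = torch.cat(
        [m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1
    ).unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

For text-only tokens all three streams carry the same position index, so the scheme reduces to ordinary 1D RoPE for them.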

ViT (vision adapter glm4v_moe)

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers (depth: 24)
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14 (so a native 336px image yields 336/14 = 24 patches per side, 576 patches total)
    • num_channels: 3
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varying image resolutions (see the sketch after this list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
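
A minimal sketch of that resizing step, assuming the usual flat [num_patches, n_embd] table layout (the real Glm4vMoeVisionEmbeddings adds per-image bookkeeping on top of this):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, src: int, dst_h: int, dst_w: int) -> torch.Tensor:
    # pos_embed: [src * src, n_embd]; for this ViT, src = 336 // 14 = 24.
    n_embd = pos_embed.shape[-1]
    # [N, D] -> [1, D, src, src] so interpolate treats embeddings as channels
    grid = pos_embed.reshape(src, src, n_embd).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(dst_h, dst_w), mode="bicubic", align_corners=False)
    # back to a flat [dst_h * dst_w, n_embd] table
    return grid.squeeze(0).permute(1, 2, 0).reshape(dst_h * dst_w, n_embd)

# e.g. adapt the native 24x24 table to a 32x18 patch grid:
# new_table = resize_pos_embed(table, 24, 32, 18)
```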

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ): 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air); both values can be sanity-checked against the checkpoint's config, as sketched after this list
  • The model supports video input, but this PR targets images only; video support is out of scope for now
  • The tokenizer includes video-related special tokens, which need to be handled during conversion
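
A hypothetical sanity check of those two hyperparameters, assuming the key names follow the usual HF layout for nested text/vision configs (verify against the actual config.json):

```python
import json

# Assumption: the checkpoint directory contains a standard HF config.json
# with the text model's hyperparameters nested under "text_config".
with open("GLM-4.5V/config.json") as f:
    cfg = json.load(f)

text_cfg = cfg.get("text_config", cfg)
print(text_cfg["max_position_embeddings"])  # expect 65536
print(text_cfg["rope_theta"])               # expect 10000.0
```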


@ddh0 ddh0 marked this pull request as draft October 14, 2025 07:38
* cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
  * remove legacy copy-op pointer indirection code
  * further removal of copy-op indirection code
  * renamed check_node_graph_compatibility_and_refresh_copy_ops function
* CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
  * CUDA: kernel for larger batch sizes for MoE
  * WIP (×6)
  * fixup
  * tests
  * Move mmq_ids_helper to mmid
  * cleanup
  * Remove redundant checks
* CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
  * use bf16 directly + fix formatting
  * Add exception for HIP code
* CUDA: enable FA for FP32 KV cache (ggml-org#16546)
* vulkan: Improve build time for MSVC (ggml-org#16545)
  * Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel; enable /MP so source files are compiled in parallel
* vulkan: Support FA with K/V in F32 (ggml-org#16543)
* CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
* vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
* metal : avoid using Metal's gpuAddress property (ggml-org#16576)
  * metal : fix rope kernels buffer check

Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Anav Prasad <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Co-authored-by: SavicStefan <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
@ddh0 ddh0 changed the title from "support GLM-4.5V (108B multimodal)" to "support GLM-4.5V (108B VLM)" Oct 14, 2025
@ddh0 ddh0 commented Oct 15, 2025

Moved to ggml-org#16600

@ddh0 ddh0 closed this Oct 15, 2025