
Conversation

pockers21 (Contributor)

mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview

  • Converter: write jina-bert-v3 text tower params into GGUF (supports both merged-LoRA checkpoints and adapter-based inputs), and export vision metadata (projector_type=jinaclip, vision.rope_theta, image_size, patch_size, projection_dim, etc.).
  • Runtime: introduce PROJECTOR_TYPE_JINACLIP in the MTMD path (JinaCLIP v2 vision tower: 2D RoPE with shared frequency cache, attention/FFN internal LayerNorm, single-token output), and normalize with common_embd_normalize(..., 2).
  • CLI (core): add a minimal validation tool, llama-jinaclip-cli (built by default), for text/image embedding numerical and performance checks; it depends only on common, mtmd, and Threads, builds cross-platform, and has no third-party deps.
  • Compatibility: only activates when related GGUF metadata exists; doesn’t affect other projectors (e.g., LLaVA/Qwen2VL); no ggml op changes; no external dependencies.

Scope of changes

  • convert_hf_to_gguf.py
    • Text: support both merged-LoRA single checkpoints and adapter-based export.
    • Vision (JinaCLIP v2): export clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV (a minimal converter sketch follows this list).
  • tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h
    • Add PROJECTOR_TYPE_JINACLIP: JinaCLIP v2 vision tower (2D RoPE with a shared frequency cache), attention-internal LayerNorm, FFN sub-layer LayerNorm (enabled when both weight and bias are present), single-token (CLS-equivalent) output, and unified L2 normalization.
    • clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
  • tools/mtmd/jinaclip-cli.cpp, tools/mtmd/CMakeLists.txt
    • Add llama-jinaclip-cli target (built by default); one command covers minimal text/image validation, thread scaling, and encode_ms reporting, and saves embeddings for Python parity checks.
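
A minimal converter sketch (hedged) for the vision metadata described above; the real code lives in convert_hf_to_gguf.py (an excerpt is quoted in the review below), and projection_dim here is an assumed placeholder:

# Illustrative only: writes the JinaCLIP vision metadata keys named in this PR.
# image_size/patch_size/hidden/depth follow the jina-clip-v2 numbers in the
# Performance section; projection_dim is an assumption and is model-dependent.
import gguf

hparams = {"image_size": 512, "patch_size": 14, "hidden_size": 1024,
           "num_hidden_layers": 24, "projection_dim": 1024}  # projection_dim assumed

writer = gguf.GGUFWriter("mmproj-jina-vision.gguf", "clip")
writer.add_string("clip.projector_type", "jinaclip")
writer.add_float32("clip.vision.rope_theta", 10000.0)  # configurable; 10000 is the default in this PR
writer.add_uint32("clip.vision.image_size", hparams["image_size"])
writer.add_uint32("clip.vision.patch_size", hparams["patch_size"])
writer.add_uint32("clip.vision.embedding_length", hparams["hidden_size"])
writer.add_uint32("clip.vision.block_count", hparams["num_hidden_layers"])
writer.add_uint32("clip.vision.projection_dim", hparams["projection_dim"])
# tensor writes and the final write_*_to_file()/close() calls are omitted here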

Validation summary

  • CI: CPU-only ci/run.sh passes locally; no ggml op changes in this PR.
  • Correctness: perplexity does not apply to embedding models, so correctness is verified via C++ vs Python output parity.
    • TEXT (CPU, minimal sample): cosine=0.999996, RMSE=0.000125
    • IMAGE (CPU, minimal sample): cosine=0.990261, RMSE=0.006168
  • Performance: checked with CLI encode_ms and thread scaling; no regression observed. More data can be added if requested.
  • Compatibility: activated only when GGUF metadata (projector_type=jinaclip, etc.) is present; other projectors unaffected.
  • Reference: ModelScope uniontech-yourong/split_jina (used for Python-side parity).

Performance (absolute metrics, CPU-only minimal samples)

  • Environment
    • OS: Ubuntu 22.04.5 LTS
    • CPU: Intel Xeon Platinum 8352V (dual-socket, 2×32C/64T, SMT on), 128 threads total
    • Build: Release, GGML_CUDA=OFF (CPU-only), GCC 11.4, CMake 3.22
    • Model: JinaCLIP v2 vision tower (image_size=512, patch=14, depth=24, hidden=1024; official: https://huggingface.co/jinaai/jina-clip-v2); text tower (Jina Embeddings v3, output truncated to 512 dims)
    • Threads: primarily 8 threads for both text/image (with 1-thread comparison)
  • Metric definitions
    • Text: use CLI-reported JINACLIP_ENCODE_MS (pure inference, excludes load)
    • Image: use CLI line “image … done in … ms” (pure inference, excludes load)
  • Results (single sample, minimal)
    • Text (“hello world”, ≈5 tokens)
      • 1 thread: encode_ms ≈ 180.48 ms
      • 8 threads: encode_ms ≈ 34.08 ms
    • Image (512×512, single)
      • 8 threads: image done in ≈ 6154 ms (stabilizes ~6.1–6.4 s after warm-up)
  • Notes
    • The numbers above are CPU-only pure-inference times; end-to-end latency (including model load) is higher and not included.

GPU group (absolute metrics, minimal samples)

  • Environment
    • GPU: NVIDIA vGPU-32GB (cc=8.9, 32 GB), Driver 550.107, CUDA 12.4
    • Build: Release, GGML_CUDA=ON (CUDA backend), CUDA arch=89
    • Threads: -t 8 (host-side preprocessing threads)
  • Results (pure inference, excludes load)
    • Text (“hello world”, ≈5 tokens): encode_ms ≈ 84.88 ms
    • Image (512×512, single): image done in ≈ 827 ms

Reproduction (optional)

Minimal commands & data (CPU)
  • Produce GGUF (with ST pooling metadata)
    • Text: jina-bert-v3.pooling_type = MEAN/CLS/LAST
    • Vision: clip.projector_type = jinaclip, clip.vision.rope_theta = 10000 (default)
  • Text parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0
    • Python: python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE
  • Image parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0
    • Python: python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE (a parity-check sketch follows this list)
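
A hedged parity-check sketch: the file names and .npy dump format are assumptions (the PR only states that embeddings are saved for Python parity); the metrics are the cosine and RMSE reported above.

# Parity check between the C++ and Python 512-d embeddings (illustrative).
# File names and .npy format are assumptions; adjust to however the dumps are saved.
import numpy as np

cpp_emb = np.load("cpp_embedding.npy").astype(np.float32).ravel()     # from llama-jinaclip-cli
ref_emb = np.load("python_embedding.npy").astype(np.float32).ravel()  # from debug.py

cosine = float(np.dot(cpp_emb, ref_emb) / (np.linalg.norm(cpp_emb) * np.linalg.norm(ref_emb)))
rmse   = float(np.sqrt(np.mean((cpp_emb - ref_emb) ** 2)))
print(f"cosine={cosine:.6f} rmse={rmse:.6f}")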

Files in this PR

  • convert_hf_to_gguf.py
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
  • tools/mtmd/jinaclip-cli.cpp
  • tools/mtmd/CMakeLists.txt

@pockers21 pockers21 requested review from CISC and ngxson as code owners October 14, 2025 09:04
@github-actions github-actions bot added the examples, python, and python script changes labels on Oct 14, 2025

@ngxson ngxson (Collaborator) left a comment

add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;

I don't see why we need to add this new CLI. The mtmd-cli can do this with -p and --image params.

Comment on lines +63 to +67

# JinaCLIP CLI (align style with other targets above)
set(TARGET llama-jinaclip-cli)
add_executable (${TARGET} jinaclip-cli.cpp)
target_link_libraries (${TARGET} PRIVATE common mtmd Threads::Threads)
Collaborator

we should try to merge this with mtmd-cli to avoid the "fragmentation" trap of the old llava-cli binary

Comment on lines +6306 to +6312
self.gguf_writer.add_uint32("clip.vision.image_size", img_sz)
self.gguf_writer.add_uint32("clip.vision.patch_size", patch_sz)
self.gguf_writer.add_uint32("clip.vision.embedding_length", n_embd)
self.gguf_writer.add_uint32("clip.vision.block_count", n_layer)
self.gguf_writer.add_uint32("clip.vision.projection_dim", proj_dim)
self.gguf_writer.add_uint32("clip.vision.feed_forward_length", n_ff)
self.gguf_writer.add_uint32("clip.vision.attention.head_count", n_head)
Collaborator

We had specific functions and constants to add these metadata keys. Use them instead
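
For context, a hedged sketch of the helper-based variant the reviewer suggests; the gguf-py method names below are assumptions and should be checked against gguf_writer.py before use:

# Assumed gguf-py vision helpers (verify names in gguf_writer.py); same values as above.
self.gguf_writer.add_vision_image_size(img_sz)
self.gguf_writer.add_vision_patch_size(patch_sz)
self.gguf_writer.add_vision_embedding_length(n_embd)
self.gguf_writer.add_vision_block_count(n_layer)
self.gguf_writer.add_vision_projection_dim(proj_dim)
self.gguf_writer.add_vision_feed_forward_length(n_ff)
self.gguf_writer.add_vision_head_count(n_head)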


# Top-level direct mappings
if src_no_vm == 'cls_token':
    return [('v.cls_token', data_torch)]
Collaborator

Use proper mapping instead

Comment on lines +2229 to +2237
if (!ctx->jinaclip_rope_initialized) {
    const int half_dim = rope_dim / 2;
    std::vector<float> base_freqs(half_dim);
    for (int i = 0; i < half_dim; i++) {
        float arange_val = i * 2.0f;                       // [0, 2, 4, ..., 30]
        float normalized = arange_val / rope_dim;          // [0, 2/32, 4/32, ..., 30/32]
        float theta_powered = powf(freq_base, normalized); // theta^normalized
        base_freqs[i] = 1.0f / theta_powered;              // 1.0 / theta^normalized
    }
Collaborator

Not sure what you're trying to do here, is this just 2D RoPE? (which we already supported)
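
For reference, the quoted loop computes the standard RoPE inverse-frequency schedule, inv_freq[i] = 1 / theta^(2i/rope_dim), here built per grid axis; a minimal check (rope_dim=32 per the code comments, theta=10000 per the default clip.vision.rope_theta above):

# Checks that the loop above matches the usual RoPE schedule 1 / theta**(2*i/rope_dim).
# rope_dim=32 follows the "[0, 2, 4, ..., 30]" comment; theta=10000 is the PR default.
import numpy as np

rope_dim, theta = 32, 10000.0
i = np.arange(rope_dim // 2)
inv_freq = 1.0 / theta ** (2.0 * i / rope_dim)
base_freqs = np.array([1.0 / theta ** ((k * 2.0) / rope_dim) for k in range(rope_dim // 2)])
assert np.allclose(base_freqs, inv_freq)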

}

clip_image_u8 resized_keep_ratio;
image_manipulation::bicubic_pil_resize(*img, resized_keep_ratio, out_w, out_h);
Collaborator

Generally pre-processing doesn't need to be byte-exact. I would prefer keeping the old bicubic_resize to keep it simple.

Comment on lines -5029 to +5038
self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3

# Jina v3 (RoPE) without LoRA should export as jina-bert-v3 to avoid expecting absolute position embeddings
try:
    text_cfg = hparams.get("text_config", {}) if isinstance(hparams.get("text_config", {}), dict) else {}
    pe_type = (text_cfg.get("position_embedding_type") or hparams.get("position_embedding_type") or "").lower()
    rope_base = text_cfg.get("rotary_emb_base", hparams.get("rotary_emb_base"))
    name_path = (hparams.get("_name_or_path") or "").lower()
    is_v3 = (pe_type == "rotary" or rope_base is not None) and ("jina" in name_path and "v3" in name_path)
    if is_v3 and not self._lora_names:
        self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3
Collaborator

Please explain this: first off, it breaks jina-embeddings-v3 conversion; secondly, jina-clip-v2 looks like it loads jina-embeddings-v3 and uses the retrieval.query LoRA/prompt, but load_trained_adapters being set to false suggests it's not applied?
https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38
