mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16574
Conversation
add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;
I don't see why we need to add this new CLI. The mtmd-cli can do this with the -p and --image params.
# JinaCLIP CLI (align style with other targets above)
set(TARGET llama-jinaclip-cli)
add_executable (${TARGET} jinaclip-cli.cpp)
target_link_libraries (${TARGET} PRIVATE common mtmd Threads::Threads)
We should try to merge this with mtmd-cli to avoid the "fragmentation" trap of the old llava-cli binary.
self.gguf_writer.add_uint32("clip.vision.image_size", img_sz) | ||
self.gguf_writer.add_uint32("clip.vision.patch_size", patch_sz) | ||
self.gguf_writer.add_uint32("clip.vision.embedding_length", n_embd) | ||
self.gguf_writer.add_uint32("clip.vision.block_count", n_layer) | ||
self.gguf_writer.add_uint32("clip.vision.projection_dim", proj_dim) | ||
self.gguf_writer.add_uint32("clip.vision.feed_forward_length", n_ff) | ||
self.gguf_writer.add_uint32("clip.vision.attention.head_count", n_head) |
We had specific functions and constants to add these metadata keys. Use them instead.
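A minimal sketch of what this could look like, assuming the add_vision_* helper methods that other mmproj converters use on gguf-py's GGUFWriter (method names should be verified against the current API):

# Sketch: same metadata as above, written through the gguf-py helpers instead
# of raw string keys (helper names assumed from other mmproj converters).
self.gguf_writer.add_vision_image_size(img_sz)
self.gguf_writer.add_vision_patch_size(patch_sz)
self.gguf_writer.add_vision_embedding_length(n_embd)
self.gguf_writer.add_vision_block_count(n_layer)
self.gguf_writer.add_vision_projection_dim(proj_dim)
self.gguf_writer.add_vision_feed_forward_length(n_ff)
self.gguf_writer.add_vision_head_count(n_head)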
# Top-level direct mappings
if src_no_vm == 'cls_token':
    return [('v.cls_token', data_torch)]
Use proper mapping instead
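A minimal sketch of the suggested direction, assuming a 'cls_token' entry exists (or is added) in gguf-py's TensorNameMap for this architecture:

# Sketch: route the name through the shared tensor-name mapping instead of
# hard-coding the output name; the 'cls_token' mapping entry is an assumption.
if src_no_vm == 'cls_token':
    return [(self.map_tensor_name(src_no_vm), data_torch)]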
if (!ctx->jinaclip_rope_initialized) {
    const int half_dim = rope_dim / 2;
    std::vector<float> base_freqs(half_dim);
    for (int i = 0; i < half_dim; i++) {
        float arange_val    = i * 2.0f;                     // [0, 2, 4, ..., 30]
        float normalized    = arange_val / rope_dim;        // [0, 2/32, 4/32, ..., 30/32]
        float theta_powered = powf(freq_base, normalized);  // theta^normalized
        base_freqs[i] = 1.0f / theta_powered;               // 1.0 / theta^normalized
    }
Not sure what you're trying to do here, is this just 2D RoPE? (which we already supported)
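For reference, the loop above reduces to the standard RoPE inverse-frequency table, which is also what axial/2D RoPE applies per axis. A small numpy sketch, using rope_dim = 32 and freq_base = 10000 as assumed example values (implied by the inline comments and the PR's default rope_theta):

import numpy as np

# Same values as the C++ loop: freq[i] = 1 / theta^(2i / rope_dim)
rope_dim, freq_base = 32, 10000.0   # assumed example values
half_dim = rope_dim // 2
base_freqs = 1.0 / freq_base ** (np.arange(0, rope_dim, 2) / rope_dim)
assert base_freqs.shape == (half_dim,)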
clip_image_u8 resized_keep_ratio;
image_manipulation::bicubic_pil_resize(*img, resized_keep_ratio, out_w, out_h);
Generally pre-processing doesn't need to be byte-exact. I would prefer keeping the old bicubic_resize to keep it simple.
self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3

# Jina v3 (RoPE) without LoRA should export as jina-bert-v3 to avoid expecting absolute position embeddings
try:
    text_cfg = hparams.get("text_config", {}) if isinstance(hparams.get("text_config", {}), dict) else {}
    pe_type = (text_cfg.get("position_embedding_type") or hparams.get("position_embedding_type") or "").lower()
    rope_base = text_cfg.get("rotary_emb_base", hparams.get("rotary_emb_base"))
    name_path = (hparams.get("_name_or_path") or "").lower()
    is_v3 = (pe_type == "rotary" or rope_base is not None) and ("jina" in name_path and "v3" in name_path)
    if is_v3 and not self._lora_names:
        self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3
Please explain this. First off, it breaks jina-embeddings-v3 conversion; secondly, jina-clip-v2 looks like it loads jina-embeddings-v3 and uses the retrieval.query LoRA/prompt, but load_trained_adapters set to false suggests it's not applied?
https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38
mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)
Overview
- Output embeddings are normalized with common_embd_normalize(..., 2).
- Add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks; depends only on common+mtmd+Threads, cross-platform buildable, no third-party deps.

Scope of changes
- Converter: clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV.
- clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
- New llama-jinaclip-cli target (default); one command covers text/image minimal validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.

Validation summary
- ci/run.sh passes locally; no ggml op changes in this PR.
- encode_ms and thread scaling: no regression observed. More data can be added if requested.

Performance (absolute metrics, CPU-only minimal samples)
GPU group (absolute metrics, minimal samples)

Reproduction (optional)
Minimal commands & data (CPU)
- GGUF metadata: jina-bert-v3.pooling_type = MEAN/CLS/LAST; clip.projector_type = jinaclip, clip.vision.rope_theta = 10000 (default)
- Text (CPU): CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0
- Text reference: python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
- Image (CPU): CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0
- Image reference: python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
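The parity check itself is not specified by this PR; a minimal sketch of how the saved embeddings could be compared, where the file names and the flat float32 layout are assumptions:

import numpy as np

# Hypothetical parity check between embeddings saved by the CLI and by the
# reference script; file names and float32 layout are assumptions.
a = np.fromfile("cpp_text_embedding.bin", dtype=np.float32)
b = np.fromfile("ref_text_embedding.bin", dtype=np.float32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
print(f"cosine similarity: {float(a @ b):.6f}")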
Files in this PR
convert_hf_to_gguf.py
tools/mtmd/clip.cpp
tools/mtmd/clip-impl.h
tools/mtmd/jinaclip-cli.cpp
tools/mtmd/CMakeLists.txt