Conversation

@ngxson ngxson commented Oct 15, 2025

This PR allows llama-quantize to work with mmproj files. It supports quantizing an mmproj to the Qx_K and Qx_0 variants (without imatrix), which reduces memory usage for mobile deployments.
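
(Back-of-envelope only, not a measured number from this PR: F16 stores 16 bits per weight, while Q4_K averages roughly 4.5 bits per weight, so a Q4_K_M projector should be on the order of 3-4x smaller than the F16 one.)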

Tested with https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

To quantize the mmproj:

llama-quantize mmproj-model-f16.gguf mmproj-model-Q4_K_M.gguf Q4_K_M

Then, use it as usual:

llama-mtmd-cli -m language_model.gguf --mmproj mmproj-model-Q4_K_M.gguf
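
To double-check what the quantizer produced, here is a minimal sketch (not part of this PR) that lists the tensor quantization types in the output file. It assumes the gguf-py package that ships with llama.cpp (pip install gguf); the GGUFReader attribute names are taken from gguf-py and may differ between versions.

from collections import Counter

from gguf import GGUFReader  # gguf-py, bundled with llama.cpp

# Open the quantized projector produced by llama-quantize
reader = GGUFReader("mmproj-model-Q4_K_M.gguf")

# Count tensors per quantization type (e.g. F32, F16, Q4_K, Q6_K)
counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in sorted(counts.items()):
    print(f"{qtype}: {n} tensors")

A mix of types is expected, since some small tensors (norms, biases) typically stay in F32/F16 even in a Q4_K_M file.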

Ref discussion: #15453

@ngxson ngxson requested review from CISC and ggerganov as code owners October 15, 2025 09:42
ngxson and others added 2 commits October 15, 2025 12:19
@ngxson ngxson merged commit 3e3cb19 into ggml-org:master Oct 15, 2025
70 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* llama-quant: add support for mmproj

* Update src/llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* check prefix instead

* small fix

---------

Co-authored-by: Georgi Gerganov <[email protected]>
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 15, 2025
* origin/master:
Add server-driven parameter defaults and syncing (ggml-org#16515)
metal: optimise `GGML_OP_SUM` (ggml-org#16559)
server : fix img token logs (ggml-org#16595)
llama-quant: add support for mmproj (ggml-org#16592)
CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
server : fix mtmd checkpoints (ggml-org#16591)
metal : avoid using Metal's gpuAddress property (ggml-org#16576)
vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
vulkan: Support FA with K/V in F32 (ggml-org#16543)
vulkan: Improve build time for MSVC (ggml-org#16545)
CUDA: enable FA for FP32 KV cache (ggml-org#16546)
CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
server : dynamic token limit for prompt cache (ggml-org#16560)