
Conversation

ikawrakow (Owner) commented Feb 11, 2025

I hear reports that people are having trouble creating imatrix data for models with many experts (e.g., DeepSeek-R1, Arctic). For such models it may be very hard to activate all experts in all layers, and it turns out that when some experts have missing data, the data for the entire tensor containing experts is not stored in the imatrix file at all. That in turn prevents using the imatrix for low-bit quantization of such models.

It wasn't like this when I added the imatrix to llama.cpp, but it turns out the protection police have been at work and added these checks, which I then inherited when syncing with upstream. Thanks to @saood06 for making me aware of this unfortunate situation.

This PR reduces the powers of the protection police. If a tensor is found that has only partial contributions to the imatrix data, instead of simply skipping it we now do the following (a rough sketch of the logic is given after the list):

  • Check if it is a tensor containing experts
  • If so, count how many experts are missing data
  • If less than 5% of the experts are missing data, we
    • Warn the user, but still store the data in the imatrix file
    • Set the imatrix weights to 1 for the experts missing data
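
A minimal sketch of the idea, assuming per-expert activation counts are kept alongside the accumulated sums; the function and variable names below are illustrative, not the actual ik_llama.cpp code:

```cpp
#include <cstdio>
#include <vector>

// Sketch only: decide whether a partially-collected expert tensor is kept.
// `values` holds n_expert contiguous blocks of n_per_expert accumulated sums,
// `counts[e]` is how many times expert e was activated during calibration.
static bool keep_partial_expert_tensor(std::vector<float> & values,
                                       const std::vector<int> & counts,
                                       int n_expert, int n_per_expert,
                                       const char * tensor_name) {
    int n_missing = 0;
    for (int e = 0; e < n_expert; ++e) {
        if (counts[e] == 0) ++n_missing;
    }
    if (n_missing == 0) return true;                 // complete data, nothing to do
    if (n_missing >= 0.05f*n_expert) return false;   // 5% or more missing: skip as before
    fprintf(stderr, "%s: %d of %d experts have no data; storing with unit weights\n",
            tensor_name, n_missing, n_expert);
    for (int e = 0; e < n_expert; ++e) {
        if (counts[e] > 0) continue;
        for (int j = 0; j < n_per_expert; ++j) {
            values[(size_t)e*n_per_expert + j] = 1.0f;  // same importance for every column
        }
    }
    return true;
}
```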

The rationale behind this approach is that if an expert was never activated after processing a significant amount of calibration data, it cannot be very important, so we can afford to quantize it with low-bpw quants even without guidance on the importance of its columns.

Strictly speaking, it would be better to leave the zeros in the imatrix data of experts that were never activated. But that would require adding proper protection against all-zero imatrices, along with the appropriate corrective action, for all quants, and not just for IQ1_S_R4 as I did in #191. So, for now, we go with same-importance columns for never-activated experts.
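
To illustrate what such protection could look like, here is a minimal, hypothetical sketch (not code from this repo or from #191): a quantization helper that falls back to uniform column importance when the imatrix slice for a row block is missing or all zeros.

```cpp
#include <vector>

// Hypothetical helper: pick per-column weights for quantizing one row block.
// If the imatrix slice is missing or all zeros, fall back to equal importance
// instead of feeding a zero weight vector into the quantization search.
static void column_weights_from_imatrix(const float * imatrix, int n, std::vector<float> & w) {
    bool have_data = false;
    if (imatrix) {
        for (int j = 0; j < n; ++j) {
            if (imatrix[j] > 0) { have_data = true; break; }
        }
    }
    if (!have_data) {
        w.assign(n, 1.0f);              // no guidance: all columns equally important
    } else {
        w.assign(imatrix, imatrix + n); // use the collected importance data as-is
    }
}
```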

saood06 (Collaborator) commented Feb 11, 2025

for the entire tensor containing experts

Not entirely related to this, but do you know why GGUF stores all the experts together? (I just checked the initial MoE PR in mainline and no rationale was given for this.)

I plan to port over code that lets you override where certain tensors are allocated, which allows you to store the non-shared experts in RAM and everything else in VRAM. If the experts were not consolidated into one large tensor, this could easily allow expert parallelism, which would benefit NUMA systems.

ikawrakow (Owner, Author) commented:

but do you know why GGUF stores all the experts together?

No, I don't. The initial MoE implementation was not like that, and then it got changed. I have kept the ability to use the original version in my fork (so I don't need to re-download MoE models that were created before the change).

