Switching from llama.cpp/ktransformers, seeking advice/guidance #242
Replies: 9 comments 24 replies
-
Is the 72 GB VRAM from 3 x 24 GB GPUs? Your setup is somewhat unusual as you "only" have 128 GB of RAM. If you want to use a ready-made model, there is basically only one option. If you are willing to do your own custom quantization, it will require a manual setup, as there isn't an out-of-the-box mix that best takes advantage of your amount of RAM+VRAM. I guess I should add functionality similar to the tensor overrides from #232 to llama-quantize as well.
Once you have a model that you want to use, I think the best way to distribute the model weights between CPU RAM and GPU VRAM will be to use several tensor override (-ot) arguments.
What is the CPU in this system?
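As an illustration only (not a tested recipe; the regular expressions and device assignments below are made up for this example), such a split with several -ot arguments could look like this:
# Hypothetical weight placement: early routed experts on the GPUs,
# everything matching the catch-all rule stays in CPU RAM.
# (If a tensor matches more than one rule, check which rule wins on your build.)
./llama-server -m DeepSeek-R1-custom.gguf -c 8192 -t 16 \
  -ot "blk\.[3-9]\.ffn_.*_exps\.weight=CUDA0" \
  -ot "blk\.1[0-8]\.ffn_.*_exps\.weight=CUDA1" \
  -ot "ffn_.*_exps\.weight=CPU"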
-
PR #244 has been merged, so hopefully this will help you with making your custom DeepSeek-R1 quantization. The new --custom-q option lets you define per-tensor quantization rules as comma-separated regex=type pairs.
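A minimal sketch of how this can be used (the rules and types below are placeholders, not a recommendation; see #244 for the exact syntax):
# Hypothetical example: per-tensor-group quantization rules as comma-separated "regex=type" pairs.
./llama-quantize --imatrix imatrix.dat \
    --custom-q "token_embd\.weight=q8_0,blk\.[0-2]\..*=q5_K,ffn_.*_exps\.weight=iq2_xxs" \
    DeepSeek-R1-F16.gguf DeepSeek-R1-custom.gguf Q6_K 32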
-
Could the following work in your 3x24 GiB VRAM + 128 GiB RAM:
Oh, I forgot: the tensors that go on the CPU should be quantized to the corresponding _R4 (row-interleaved) variants.
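For illustration, a rule set along these lines could look as follows (the layer split and quant types are placeholders; the full recipe the user arrived at appears in a later reply):
custom="
# Experts offloaded to VRAM: regular quant types
blk\.[3-9]\.ffn_.*_exps\.weight=iq2_xxs
# Experts that stay in CPU RAM: the row-interleaved _r4 counterparts
blk\.[1-5][0-9]\.ffn_.*_exps\.weight=iq2_xxs_r4
blk\.60\.ffn_.*_exps\.weight=iq2_xxs_r4
"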
-
The NaNs are concerning. If we got NaN probabilities (logits) out of the forward pass, the imatrix will be useless (it will likely contain NaNs). Another way to get a NaN in the perplexity is if the predicted probability for the observed token is zero. You may be better off getting an imatrix from somewhere else. Have you tried running the same calculation with mainline llama.cpp?
The messages about partial data are to be expected. Only 8 out of 256 experts get activated per token, so if the batch was short, it is likely that some experts were never activated, so the imatrix for those contains just zeros. If one tries to use such an imatrix to quantize a model, this can lead to bad results (including NaNs in the model). That's why mainline llama.cpp warns about imatrix entries with partial data.
Concerning offloading specific experts: I haven't gathered statistics myself, so I don't know how useful that could be. I have seen claims around the Internet that one can gain that way (by offloading often-used experts). On the other hand, this is such an obvious thing to do but has not become widely used, so my guess is that it may not really be true. The term "expert" is kind of misleading in the sense that it implies that a given set of experts will be active when dealing with a given kind of context. But this is absolutely not true: if you process a paragraph of, say, 500 tokens on some specific topic, you will observe that basically all "experts" were active at least once.
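For reference, a minimal imatrix run looks roughly like this (file names are placeholders; running the same command with mainline llama.cpp's llama-imatrix makes for an easy comparison):
# Compute an importance matrix over a calibration text file.
./llama-imatrix -m DeepSeek-R1-F16.gguf -f calibration.txt -o imatrix.dat -t 16
# If the per-chunk values printed during the run contain NaNs, the resulting imatrix.dat is not usable.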
-
You calculate the imatrix with MLA enabled (and no FA, because FA skips one of the activations). This gives you imatrix data for the extra tensors used by the MLA attention path. For imatrix data computed with standard attention, the data for those tensors will be missing.
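A hedged sketch of such a run, assuming llama-imatrix in this repo accepts the same -mla switch as llama-server (flash attention left disabled):
# Collect imatrix data with the MLA attention path active and no FA,
# so the MLA-specific tensors also get imatrix entries.
# (Use the -mla value you plan to run with.)
./llama-imatrix -m DeepSeek-R1-F16.gguf -f calibration.txt -o imatrix-mla.dat -mla 2 -t 16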
-
So here's what I came up with following your instructions:
#!/bin/bash
cd /home/user/nvme/gguf/DeepSeek-R1
rm -f DeepSeek-R1-custom.gguf
custom="
# Token embedding and output tensors
token_embd\.weight=q8_0
output\.weight=q6_K
output_norm\.weight=q5_K
# First 3 dense layers (GPU0)
blk\.[0-2]\..*=q5_K
# Layers 3-4 (GPU0) - MoE experts
blk\.[3-4]\.ffn_down_exps\.weight=iq4_xs
blk\.[3-4]\.ffn_gate_exps\.weight=iq2_xxs
blk\.[3-4]\.ffn_up_exps\.weight=iq2_xxs
# Layers 5-11 (GPU1) - MoE experts
blk\.[5-9]\.ffn_down_exps\.weight=iq3_xxs
blk\.[5-9]\.ffn_gate_exps\.weight=iq2_xxs
blk\.[5-9]\.ffn_up_exps\.weight=iq2_xxs
blk\.1[0-1]\.ffn_down_exps\.weight=iq3_xxs
blk\.1[0-1]\.ffn_gate_exps\.weight=iq2_xxs
blk\.1[0-1]\.ffn_up_exps\.weight=iq2_xxs
# Layers 12-18 (GPU2) - MoE experts
blk\.1[2-8]\.ffn_down_exps\.weight=iq3_xxs
blk\.1[2-8]\.ffn_gate_exps\.weight=iq2_xxs
blk\.1[2-8]\.ffn_up_exps\.weight=iq2_xxs
# Layers 19-60 (CPU) - MoE experts
blk\.19\.ffn_down_exps\.weight=iq2_k_r4
blk\.[2-5][0-9]\.ffn_down_exps\.weight=iq2_k_r4
blk\.60\.ffn_down_exps\.weight=iq2_k_r4
blk\.19\.ffn_gate_exps\.weight=iq2_xxs_r4
blk\.[2-5][0-9]\.ffn_gate_exps\.weight=iq2_xxs_r4
blk\.60\.ffn_gate_exps\.weight=iq2_xxs_r4
blk\.19\.ffn_up_exps\.weight=iq2_xxs_r4
blk\.[2-5][0-9]\.ffn_up_exps\.weight=iq2_xxs_r4
blk\.60\.ffn_up_exps\.weight=iq2_xxs_r4
# All attention tensors for MoE layers (3-60)
blk\.[3-9]\.attn_.*=q5_K
blk\.[1-5][0-9]\.attn_.*=q5_K
blk\.60\.attn_.*=q5_K
# Norm weights and bias for MoE layers (3-60)
blk\.[3-9]\.ffn_norm\.weight=q5_K
blk\.[1-5][0-9]\.ffn_norm\.weight=q5_K
blk\.60\.ffn_norm\.weight=q5_K
blk\.[3-9]\.exp_probs_b\.bias=q5_K
blk\.[1-5][0-9]\.exp_probs_b\.bias=q5_K
blk\.60\.exp_probs_b\.bias=q5_K
# Shared experts weights for MoE layers (3-60)
blk\.3\.ffn_.*shexp\.weight=q5_K
blk\.[4-9]\.ffn_.*shexp\.weight=q5_K
blk\.[1-5][0-9]\.ffn_.*shexp\.weight=q5_K
blk\.60\.ffn_.*shexp\.weight=q5_K
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
/home/user/files/ai/llama/ik_llama.cpp/llama-quantize \
--imatrix imatrix.dat \
--token-embedding-type q8_0 \
--output-tensor-type q6_K \
--ignore-imatrix-rules \
--custom-q "$custom" \
DeepSeek-R1-F16.gguf DeepSeek-R1-custom.gguf Q6_K 32

#!/bin/bash
/home/user/files/ai/llama/ik_llama.cpp/llama-server \
-m /home/user/nvme/gguf/DeepSeek-R1/DeepSeek-R1-custom.gguf \
--api-key "$LOCAL_API_KEY" \
--host 0.0.0.0 \
--port 5000 \
-c 8192 \
-t 16 \
-sm layer \
-mg 1 \
-mla 2 \
-fmoe \
-ot "output\.weight=CUDA1" \
-ot "output_norm\.weight=CUDA1" \
-ot "token_embd\.weight=CUDA1" \
-ot "blk\.[0-4]\..*=CUDA1" \
-ot "blk\.[3-9]\.attn_.*=CUDA1" \
-ot "blk\.[1-5][0-9]\.attn_.*=CUDA1" \
-ot "blk\.60\.attn_.*=CUDA1" \
-ot "blk\.[3-9]\.ffn_norm\.weight=CUDA1" \
-ot "blk\.[1-5][0-9]\.ffn_norm\.weight=CUDA1" \
-ot "blk\.60\.ffn_norm\.weight=CUDA1" \
-ot "blk\.[3-9]\.ffn_.*shexp\.weight=CUDA1" \
-ot "blk\.[1-5][0-9]\.ffn_.*shexp\.weight=CUDA1" \
-ot "blk\.60\.ffn_.*shexp\.weight=CUDA1" \
-ot "blk\.[5-9]\.ffn_down_exps\.weight=CUDA0" \
-ot "blk\.[5-9]\.ffn_gate_exps\.weight=CUDA0" \
-ot "blk\.[5-9]\.ffn_up_exps\.weight=CUDA0" \
-ot "blk\.1[0-1]\.ffn_down_exps\.weight=CUDA0" \
-ot "blk\.1[0-1]\.ffn_gate_exps\.weight=CUDA0" \
-ot "blk\.1[0-1]\.ffn_up_exps\.weight=CUDA0" \
-ot "blk\.1[2-8]\.ffn_down_exps\.weight=CUDA2" \
-ot "blk\.1[2-8]\.ffn_gate_exps\.weight=CUDA2" \
-ot "blk\.1[2-8]\.ffn_up_exps\.weight=CUDA2" \ Even though I haven't spent much time playing with the settings, the speed is already at 7.1-7.3 tok/s with very short prompt and generation, 6.6-6.8tok/s with a few hundred tokens and 6.2-6.4tok/s for 1k. Also, a ~1k token ingestion goes at 35-40tok/s. I don't really know if those numbers make sense given the setup, but I am already very happy with these speeds. VRAM use is 23.59GB on the main GPU and 23.00GB on the other two. So 2.3/2.4GB is free to play with for longer context. Next steps:
Also, it seems that I can't use Edit: Main GPU usage is at 25% and other cards are at 0% when generating. Is it because of the RAM speed limitations? |
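If the spare ~2.3 GB per card goes towards longer context, one option is simply a larger -c together with an 8-bit K cache; a hypothetical variation of the server command above, assuming -ctk q8_0 works together with -mla 2 on this build:
# Same tensor placement as above (append the same -ot overrides as before),
# just a longer context and an 8-bit K cache.
/home/user/files/ai/llama/ik_llama.cpp/llama-server \
  -m /home/user/nvme/gguf/DeepSeek-R1/DeepSeek-R1-custom.gguf \
  --host 0.0.0.0 --port 5000 -t 16 -sm layer -mg 1 \
  -mla 2 -fmoe -c 16384 -ctk q8_0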
-
Here are some early results for wiki.test:
PPL for IQ2_XXS unsloth (size-equivalent with your custom quant) and for IQ1_S_R4/IQ1_M_R4 is still running. In the meantime, is there any reason why you didn't recommend your new SOTA quant types like IQ2_K or IQ4_KSS?
I see you added a Q8 KV cache for MLA=2. Nice! I will test performance after the PPL tests.
Finally, I stumbled upon this paper I thought you might find interesting: https://arxiv.org/pdf/2503.05840
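For context, the wiki.test numbers come from the perplexity tool; a minimal sketch of such a run (file names are placeholders, and the same -mla/-fmoe/-ot flags as for the server can be appended):
# Perplexity over the wikitext-2 test split for a given quantization.
./llama-perplexity -m DeepSeek-R1-custom.gguf -f wiki.test.raw -t 16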
-
Someone else was observing issues (NaNs) with
Yes, I know about this paper. MLA=2 does the same thing: there is only a K cache, and the V is computed from it.
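To get a feeling for the numbers, here is a back-of-the-envelope estimate of the MLA K-cache size for R1; the 512+64 entries per layer and token are an assumption taken from the published DeepSeek model configuration, and the 61 layers match blk.0-blk.60 above:
#!/bin/bash
# Rough MLA K-cache size estimate (assumed per-token width: 512 latent + 64 RoPE entries per layer).
layers=61
entries=$((512 + 64))
bytes_per_entry=1      # ~1 byte/entry for a q8_0 cache; use 2 for fp16
ctx=8192
total=$((layers * entries * ctx * bytes_per_entry))
echo "approx. K cache for ${ctx} tokens: $((total / 1024 / 1024)) MiB"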
-
Do you have the results now? I'm curious to know.
-
Hello,
I discovered this repo today, and I'm very excited to try all the new features and optimizations made here.
I am currently downloading R1 in BF16 (I can't do the FP8 conversion myself on a 3090, which lacks FP8 support), and in the meantime I am trying to learn as much as possible.
The goal is to run R1 with a reasonable PPL using 72GB VRAM and 128 GB RAM. Looking at the PRs and comments, the new IQ1_S_R4 (#185) and IQ1_M_R4 (#187) quants look really promising, as well as all the fancy stuff related to MLA and context cache (#208, #240, #241, ...), but it's a bit overwhelming at first glance.
I guess that the best option right now is to run one of these R4 quants, writing rules that are equivalent to a Ktransformers config for partial offload of critical sections of the model (#232), and to try poking around with --mla values. For the cache, I guess I can play with the new Q8_KV if applicable. Regarding CUDA, MLA and/or FA, I am not sure what is compatible for CPU / GPU / multi-GPU, or what combinations of parameters could work.
Do you have any advice regarding this type of setup? Is there a way to use more VRAM by selectively offloading individual experts/layers? If I read it right, R4 quants do not support offloading yet. Are there other tweaks or resources I can learn from to try and use your work as efficiently as possible?
I'd be happy to share my benchmarks and params when I am done quanting the model.
Thank you very much