-
You can use llama-quantize to repack the model offline.
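A minimal sketch of what such an offline repack invocation could look like, assuming ik_llama.cpp's llama-quantize exposes a --repack flag with the usual positional arguments; the file names and the trailing quant-type argument below are placeholders, not taken from this thread:

```bash
# Assumed sketch: repack an existing GGUF offline so -rtr is no longer needed at load time.
# The input file stays untouched; the repacked copy is written to a new file.
# The trailing type argument follows the usual llama-quantize syntax (an assumption here).
./build/bin/llama-quantize --repack \
  DeepSeek-V3-0324-UD-Q4_K_XL.gguf \
  DeepSeek-V3-0324-UD-Q4_K_XL-R4.gguf \
  q4_k_m
```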
The command will not overwrite the existing model, so you need to have enough free disk space for both models. In your command that starts the server, you can also simplify the -ot overrides.
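For illustration only (this particular regex is an assumed example, not the one from the original reply), a single override of this form keeps every routed-expert tensor on the CPU without listing each tensor individually:

```bash
# The pattern is a regular expression matched against tensor names, so one rule
# covers blk.N.ffn_up_exps, blk.N.ffn_gate_exps and blk.N.ffn_down_exps for every layer N.
-ot "ffn_.*_exps=CPU"
```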
It is a regular expression, so it is equivalent to explicitly listing every tensor name it matches. More generally, any regular expression that matches the same set of tensors is equivalent.
-
A few thoughts here:
So yeah, like ik mentions, you will want to use the offline repacking route described above. Haha, hope I didn't confuse things too much. This is indeed a more straightforward way than rolling your own quant, which would involve the same steps and then some. Cheers!
-
@ikawrakow With the original Unsloth quant and the -rtr option I get more than 7 tokens/s, while with the converted quant without -rtr I get 4-5 tokens/s. Maybe the conversion turned some tensors into more compute-intensive equivalents? Perhaps there are other options I should have passed as well. The command I used was:
Here is the full conversion log, which includes all the output produced during the conversion:
Three runs using the original Unsloth quant with the -rtr option (timings line only for each run):
Three runs using the same prompt with the converted quant (without the -rtr option):
-
@saood06 As for VRAM usage, I think it depends on context length. To be more precise, with 80K context I get around 19 GB of VRAM utilization on each GPU, so around 76-80 GB of VRAM usage in total. If I try to increase the context size too much, I get CUDA OOM errors, confirming that the VRAM is being used for context.

Maybe I could put some additional ffn_down_exps, ffn_up_exps or ffn_gate_exps tensors on each GPU, but I am not sure yet which of them is more beneficial to keep in VRAM. I already experimented with blk.3.ffn_gate_exps=CUDA0 and so on, but since I cannot place too many of them (I do not have that much VRAM free), I did not notice a difference in performance. I have not tried the non-gate ones yet.

With my workflow, which involves loading a 72B vision model into VRAM, processing images, and then loading V3, not being able to get mmap working with good performance is the biggest bottleneck at the moment. I am still trying to figure out whether there are options I could use to achieve the same kind of conversion the -rtr option does, i.e. to create a new GGUF that would perform the same but would no longer require -rtr.
-
The offline repacking command should produce a result that is 100% equivalent to what happens with online repacking. But the two runs will not be equivalent, as memory will be allocated and assigned to tensors in a different way. I have seen performance differences between offline and online repacking on my hardware, but never as large as the ones you are reporting. Can you try dropping caches before using the offline repacked model?
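On Linux, dropping the page cache between runs can be done with the standard sysctl; this is generic Linux administration, not anything specific to ik_llama.cpp:

```bash
# Flush dirty pages to disk, then drop the page cache, dentries and inodes so the
# next model load starts from a cold cache.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
```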
-
If you have spare VRAM, the best strategy is to use tensor overrides so that all attention and shared-expert tensors, plus the routed experts of the first 20 layers, end up on the GPUs.
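A hedged sketch of what such overrides might look like on a four-GPU box; the layer split, device assignments and regexes below are assumptions for illustration, not the exact recommendation:

```bash
# Illustrative only: -ngl pushes attention and shared-expert tensors to the GPUs,
# the first four -ot rules pin the routed experts of layers 0-19 to the four GPUs,
# and the final catch-all keeps all remaining routed experts on the CPU.
# Overrides are applied in order, so the CPU catch-all must come last.
-ngl 99 \
-ot "blk\.[0-4]\.ffn_.*_exps=CUDA0" \
-ot "blk\.[5-9]\.ffn_.*_exps=CUDA1" \
-ot "blk\.1[0-4]\.ffn_.*_exps=CUDA2" \
-ot "blk\.1[5-9]\.ffn_.*_exps=CUDA3" \
-ot "exps=CPU"
```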
-
First, I loaded the repacked model with the -rtr option - obviously this should be unnecessary, but I was curious whether it makes a difference, and to my surprise it did: I got good performance again (full log: https://pastebin.com/5d6R2GDG):
Then I ran the same repacked model again, this time without -rtr:
I tried adding --mlock, but the performance did not improve much (I was still getting at most 4-5 tokens/s no matter how many times I tried). Since the -rtr option disables mmap, I decided to disable mmap explicitly with --no-mmap and run without -rtr, to see whether it is mmap that ruins the performance:
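A sketch of that kind of invocation; the model path, context size and overrides are placeholders, not the actual command used:

```bash
# Illustrative only: load the offline-repacked GGUF with mmap explicitly disabled
# and without run-time repacking.
./build/bin/llama-server \
  -m DeepSeek-V3-0324-UD-Q4_K_XL-R4.gguf \
  --no-mmap \
  -ctk q8_0 -c 73728 \
  -ngl 99 -ot "exps=CPU"
```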
...and with the repacked quant and the --no-mmap option, performance was back to normal. So it seems it is something about mmap that drastically reduces performance, and there is nothing wrong with the quant file itself. Very strange. In theory I would expect the performance to be about the same, since either way the same memory is used and I have plenty of it free. Please let me know if there is some kind of performance profiling or additional logging I could do on my side. As for putting more ffn_up_exps and ffn_gate_exps tensors on the GPU, I will try that with as many layers as I can fit - thank you very much for the suggestion.
-
I was able to achieve similar speed with mmap after resetting my BIOS and changing only the absolutely necessary settings. Before that, no matter what I did, it ran at 30%-50% reduced speed. I am not sure exactly which setting was messing up the results, maybe the performance-tuning settings for memory throughput. But all is good now; this is my current performance with mmap enabled using the repacked quant (with around 2.5K tokens filled in the context window):
With 32K filled, I get lower performance, but still good:
I did not save the exact stats for 64K+ context fill, but output was slightly above 3 tokens/s. Input processing was generally within the 50-80 tokens/s range. Reloading the model with mmap enabled takes about 45 seconds, which is great. My final command to repack R1 and V3 was like this:
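A hedged sketch of what that repack command might look like; the --repack and --repack-pattern flags, whether the pattern selects the tensors to repack or to skip, and all file names are assumptions here, not the author's actual command:

```bash
# Assumed sketch: repack everything except the ffn_up_exps/ffn_gate_exps tensors of
# blocks 3-6, which are meant to be placed on the GPUs at run time and should
# therefore keep their original, GPU-friendly layout.
./build/bin/llama-quantize --repack \
  --repack-pattern 'ffn_down_exps|blk\.([0-2]|[7-9]|[1-5][0-9]|60)\.ffn_(up|gate)_exps' \
  DeepSeek-V3-0324-UD-Q4_K_XL.gguf \
  DeepSeek-V3-0324-UD-Q4_K_XL-R4.gguf
```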
My pattern for llama-quantize is crafted in a way that avoids repacking the tensors I intend to use on the GPUs. This is the command I use to run it:
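A sketch of that kind of run command; the model path and any options not mentioned here are placeholders, while the layer-to-GPU mapping and context size follow the description in the next paragraph:

```bash
# Illustrative only: one ffn_up_exps/ffn_gate_exps pair per GPU (blocks 3-6),
# CUDA overrides listed before the CPU catch-all so they take effect,
# context reduced to 72K (73728), K cache quantized to q8_0.
./build/bin/llama-server \
  -m DeepSeek-V3-0324-UD-Q4_K_XL-R4.gguf \
  -ngl 99 -ctk q8_0 -c 73728 \
  -ot "blk\.3\.ffn_(up|gate)_exps=CUDA0" \
  -ot "blk\.4\.ffn_(up|gate)_exps=CUDA1" \
  -ot "blk\.5\.ffn_(up|gate)_exps=CUDA2" \
  -ot "blk\.6\.ffn_(up|gate)_exps=CUDA3" \
  -ot "exps=CPU"
```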
I also noticed that I need to specify the CPU overrides last, rather than first, for the CUDA overrides to have an effect. I used multiple -ot arguments because a single one could not handle a multi-line format, but with many -ot flags I can split them across lines in my script for better readability. Putting ffn_up_exps and ffn_gate_exps from blocks 3-6 on my GPUs (one pair per GPU) is all I could fit; I even had to reduce the context length to 72K (73728). Thank you so very much, @ikawrakow and @ubergarm, for helping me figure this out!
-
So to repack, do I use the inverse of my CUDA regex? Can the quant type also be converted, or does it just become the same type with _R4? mmap or not, the entire model gets cached on my system, at least for Qwen 235B sizes.
-
@Ph0rk0z
-
DeepSeek-V3-0324-GGUF-UD-Q4_K_XL works great for me when I load it using --run-time-repack: I get more than 7 tokens/s with an EPYC 7763, 1TB of 3200MHz RAM and 4x3090 GPUs. But this unfortunately disables mmap and requires a lot of compute on each reload - and since I need to switch models often in some tasks (for example, a separate model to process input images and describe them, then continue with DeepSeek V3), it slows things down.
So, what I am looking for: is it possible to repack DeepSeek-V3-0324-GGUF-UD-Q4_K_XL offline into a new GGUF that would work well with ik_llama.cpp and that I could load without --run-time-repack?
I know there are some existing quants made specifically for ik_llama.cpp, like https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF, but I noticed that DeepSeek-V3-0324-GGUF-IQ4_K_R4, for example, gives me 4-5 tokens/s at most - my guess is that it is quantized very differently, even though it is about the same size. This also suggests that creating my own quant from scratch may be very difficult: not only would I have to download the full-size models for V3 and R1 (which would take weeks over the 4G connection I have), but I could also end up with a quant that does not perform as well as the original Unsloth quant, since I do not have any experience with creating GGUF quants. This is why I would prefer to find a way to repack an existing quant rather than create one from scratch, if that is possible.
In case it matters, here is the command I use to run the model (I specify only -ctk q8_0 because, as I understand it, -ctv has no effect when the V cache is not actually used due to the enabled optimizations):
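A hedged sketch of that kind of launch; the model path, context size and overrides below are placeholders, not the actual command:

```bash
# Illustrative only: run-time repacking (-rtr), quantized K cache, routed experts
# kept on the CPU while attention and shared experts are offloaded to the GPUs.
./build/bin/llama-server \
  -m DeepSeek-V3-0324-UD-Q4_K_XL.gguf \
  -rtr -ctk q8_0 -c 81920 \
  -ngl 99 -ot "exps=CPU"
```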
My command utilizes about 20GB of VRAM on each 24GB GPU. The main issue is that I have yet to figure out a way to repack this GGUF so I could run it without the -rtr option. I would appreciate any help with resolving this.