Recent performance testing with DeepSeek R1 #223
Replies: 8 comments 7 replies
-
Thank you so much for these results. Also, was the test conducted the same way as before, with a 500-token prompt and a 300-token response, or something different?
I can make a branch containing what fairydreaming used to evaluate PP and TG performance. From its readme:
-
The fairydreaming benchmark includes a Python script that generates a graph comparing multiple configurations against each other; here are two examples of its output from fairydreaming (1 and 2). We could tell you which configs to run, and then you just pass all the JSONL output from each config into the script and it outputs a graph. Edit: Fixed image link to show the PP instead of the TG graph.
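For reference, a comparison plot along these lines can be produced with a few lines of Python. This is only a minimal sketch: the JSONL field names ("n_prompt", "t/s") are assumptions about the benchmark output format, not taken from the actual script.

```python
# Hedged sketch: plot PP throughput from several benchmark JSONL files,
# one curve per configuration passed on the command line.
import json
import sys

import matplotlib.pyplot as plt

def load_jsonl(path):
    """Read one benchmark run: a list of dicts, one per measured point."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

for path in sys.argv[1:]:
    rows = load_jsonl(path)
    x = [r["n_prompt"] for r in rows]   # prompt tokens (assumed field name)
    y = [r["t/s"] for r in rows]        # throughput in tokens/second (assumed)
    plt.plot(x, y, marker="o", label=path)

plt.xlabel("prompt tokens")
plt.ylabel("PP t/s")
plt.legend()
plt.savefig("pp_comparison.png")
```

Usage would be something like `python plot_pp.py config_a.jsonl config_b.jsonl`, producing one curve per configuration.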
-
Thank you for this! What is the hardware configuration? (EPYC model, single or dual socket, how many RAM sticks and what type?) How many threads do you use when running the benchmarks?

I think the most pressing issue is to understand why TG performance with FA enabled is so low. Is it possible to run one FA configuration with a varying number of threads (e.g., …)?

The MLA failures are also concerning, but solving them would require debugging.

CUDA does not support FA with different K and V head sizes, as in the DeepSeekV3/R1 models, so no need to run those. I guess I should add a check for that.

Run-time repacking seems to be adding 2-3 minutes to the load time. This is better than I expected, but I guess it could be very annoying if used regularly. I should try to optimize it, or perhaps create a tool to repack an existing model.
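A thread sweep like the one asked for here could be driven with a small script. This is a hedged sketch only: the binary and model paths are placeholders, and the flag set is limited to the standard llama-bench options `-m`, `-fa`, and `-t`.

```python
# Hedged sketch: run the FA configuration at several thread counts with llama-bench.
import subprocess

BENCH = "./build/bin/llama-bench"   # assumed binary location
MODEL = "DeepSeek-R1-Q4_K_M.gguf"   # assumed model path

for threads in (16, 24, 32, 48, 64):
    # -fa 1 enables flash attention, -t sets the thread count
    cmd = [BENCH, "-m", MODEL, "-fa", "1", "-t", str(threads)]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```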
-
EPYC 7773X (64 cores, 128 threads), one socket, 8x128GB RAM. For the above I used 63 threads as a balance between prefill and generation.

Is run-time repacking equivalent to using Q4_K_S versus quantizing a model with Q4_K_R4? Also, there is no repacking for Q4_K_M? If so, some of the comparisons are off, as the models being compared are in fact different.

I don't think repacking time is important for such a large model; I can't imagine loading it on demand in many environments.

Here is a table of the benchmarks you asked for above.
-
Thanks! So, what is the difference between the above and the original table? Here we see FA having lower performance than std/MLA, but only 10-20% lower, not 2.5x lower as in the original table. FA having slightly lower TG performance is in line with expectations: its main benefit is prefill performance, so depending on context (number of tokens generated vs. prompt length), it will often win against std or MLA in terms of total processing time. But not when TG performance is 2.5x lower...
63 or 64? 63 is really bad, as suddenly the number of rows in the tensors is no longer a multiple of the number of threads, so threads process differently sized portions, and one likely even ends up with false sharing (threads writing into the same cache line, triggering cache syncs with potentially disastrous effects on performance). You see a little bit of that in the FA column above at 24, 48 and 96 threads, but these are still relatively "nice" thread counts compared to 63.
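To make the uneven-split point concrete, here is a small illustration (not code from the repo): with 64 threads every thread gets the same whole number of rows, while with 63 the chunks differ in size and the chunk boundaries drift relative to cache-line boundaries, which is where false sharing can creep in. The row count of 7168 is just an example size.

```python
# Illustration: per-thread row counts for an even vs. uneven thread split.
def split_rows(n_rows, n_threads):
    """Split n_rows as evenly as possible; the first `extra` threads get one extra row."""
    base, extra = divmod(n_rows, n_threads)
    return [base + 1 if t < extra else base for t in range(n_threads)]

for n_threads in (64, 63):
    chunks = split_rows(7168, n_threads)
    print(n_threads, "threads ->", sorted(set(chunks)), "rows per thread")

# 64 threads -> [112]        every thread gets the same whole number of rows
# 63 threads -> [113, 114]   uneven chunks; boundaries no longer stay aligned
```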
Run-time-repacking (rtr) does not change the mix of quantization types.
OK, so this is Zen3, so it is using the vanilla AVX2 implementation. If the information I find on the Internet is correct, it should have ~200 GB/s memory bandwidth. We have 37B active parameters at about 4.8 bpw for …
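A back-of-envelope sketch of where this reasoning seems to be heading, assuming a memory-bandwidth bound on token generation; the numbers are the ones quoted in the thread (37B active parameters, ~4.8 bits per weight, ~200 GB/s):

```python
# Bandwidth-bound estimate of the TG ceiling for this platform.
active_params = 37e9        # active parameters per token (DeepSeek R1 MoE)
bits_per_weight = 4.8       # roughly the quantization discussed above
bandwidth = 200e9           # ~200 GB/s quoted for the Zen3 EPYC

bytes_per_token = active_params * bits_per_weight / 8   # ~22.2 GB of weights read per token
tg_upper_bound = bandwidth / bytes_per_token             # ~9 tokens/second ceiling
print(f"{bytes_per_token / 1e9:.1f} GB/token -> at most ~{tg_upper_bound:.1f} t/s")
```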
-
Really curious to see what happens with PR #232.
-
Basically whatever command you use for your standard testing, but add …
This bothers me too, but that's how it got implemented in the unmerged llama.cpp PR that the MLA implementation here originally came from (but there have been quite a few improvements compared to the PR in …).

On KV cache size: to match KTransformers, …

Of note: MLA is ~20% slower than standard attention with less than a few hundred tokens in the cache. It becomes competitive performance-wise only beyond 16k tokens. With MLA there are two matrix multiplications that are extremely slow on CUDA. I'm trying to improve that, but no luck so far.
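To make the KV-cache point concrete, here is a rough per-token comparison. The head dimensions and layer count are the published DeepSeek-V3/R1 values as I understand them, used here as assumptions for the sketch rather than numbers taken from this thread.

```python
# Rough per-token KV-cache comparison for DeepSeek-V3/R1: standard attention vs MLA.
# Assumed dimensions: 61 layers, 128 heads, 192-dim K (128 nope + 64 rope),
# 128-dim V, and a 512-dim compressed latent + 64-dim decoupled RoPE key for MLA.
n_layers = 61
n_heads = 128
k_head_dim = 192
v_head_dim = 128
mla_latent = 512 + 64
bytes_per_elem = 2        # f16 cache

std_per_token = n_layers * n_heads * (k_head_dim + v_head_dim) * bytes_per_elem
mla_per_token = n_layers * mla_latent * bytes_per_elem

ctx = 163840
print(f"standard: {std_per_token/1e6:.1f} MB/token -> {std_per_token*ctx/1e9:.0f} GB at {ctx} tokens")
print(f"MLA:      {mla_per_token/1e3:.1f} KB/token -> {mla_per_token*ctx/1e9:.1f} GB at {ctx} tokens")
```

With these assumed numbers, a full 163840-token context needs on the order of 11-12 GB with MLA versus hundreds of GB with the standard cache layout, which is consistent with only the MLA configuration surviving the 163840-token test reported later in the thread.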
-
So I was going to try to get a bunch of benchmarks with recent code, and I encountered a problem when using any GPU offloading. This was a feature that was working, though poorly, the last time I did some hand testing. The model is DeepSeek R1 Q8_0.
Command lines like this one with GPU offloading failed:

CUDA_VISIBLE_DEVICES=0 ~/llmla/ik_llama.cpp/build/bin/llama-cli -mla 1 -rtr -b 1024 -ub 1024 -m DeepSeek-R1-Q8_0.gguf -c 8192 -t 64 --mlock -n 300 -f /mnt/data/prompt-prefill-benchmark.txt -ngl 999 -ot ".ffn_.*_exps.=CPU"
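For readers unfamiliar with the `-ot ".ffn_.*_exps.=CPU"` override: tensors whose names match the regex are kept on the CPU while everything else is offloaded. The example tensor names below follow the usual GGUF naming for DeepSeek-style MoE layers and are illustrative assumptions, not a dump from the model.

```python
# Hedged illustration of which tensors the -ot override pins to the CPU.
import re

pattern = re.compile(r".ffn_.*_exps.")   # the pattern part of the -ot argument

tensor_names = [
    "blk.3.attn_q_a.weight",          # attention tensor: not matched, offloaded to GPU
    "blk.3.ffn_gate_exps.weight",     # routed-expert tensors: matched, stay on CPU
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_down_exps.weight",
    "blk.3.ffn_gate_shexp.weight",    # shared expert: not matched, goes to GPU
]

for name in tensor_names:
    where = "CPU" if pattern.search(name) else "GPU"
    print(f"{name:32s} -> {where}")
```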
-
I'm open to a more rigorous set of tests using accepted benchmark files; just point me to them. I can run this periodically if it's scripted. Available hardware: 2x24GB GPUs and 1TB of RAM on an EPYC CPU.
Tested with:
commit 4b45b82 (HEAD -> main, origin/main, origin/HEAD)
Author: Iwan Kawrakow [email protected]
Date: Thu Feb 20 17:42:07 2025 +0200
Honor attn_output specified in the command line also for low-bit quants
DeepSeek R1 Q4_K_M
Only the MLA configuration worked at a 163840-token context; everything else was OOM.