SER - Smart Expert Reduction #239
Conversation
Here is a graph of error versus performance gain for hybrid CPU/GPU inference (Ryzen-7950X/RTX-4080) for DeepSeek-Lite. Operations with MoE tensors are computed on the CPU, all others on the GPU. Here the performance gains are much more significant. As the attention and shared-expert computation done on the GPU is much faster than the MoE calculation done on the CPU, we gain more by selectively reducing experts. If we just use 5 experts instead of 6, TG performance increases by nearly 20%, while the associated error is significantly less than using 4 bits for the attention layers (magenta symbols).
This looks very interesting - what would you recommend as the best way to test this with full CUDA off-load with R1? If you have some harnesses to test PPL, that would be great.
I typically use Wikitext2. It is of course also important to just use it and see if you think the quality of the responses is degraded. This is very subjective, but it will be you using it, so you must like it. But with the 150-200 t/s you are getting for R1 it will not be easy to get a detailed evaluation. Each point in the graphs above takes less than 2 minutes to compute, so with a simple script it was done in less than 1 hour. In your case, a full PPL calculation on Wikitext2 with an optimistic 200 t/s will take close to 30 minutes. I have seen people looking at just the first 10 or 20 batches. This is by far not enough, as results tend to change quite a bit after that. So, I think it is important to carefully select the few full runs you want to do. I would first check 6 and 7 experts using `-ser`.
Okay, cool! I am going to first create my own quant somewhere around
@ikawrakow a little bit off topic, but I didn't know where better to ask. I have downloaded the BF16 version, converted it to GGUF, and am then quantizing to
All seems to be going well, until I hit:
Now I don't know if this is because of the imatrix, the changes for MLA in the quantize process, or a corrupted BF16 model file. I am currently re-checking the hashes of the BF16 parts; likely a corrupt part. But just wondering, is there anything I'm doing wrong here? I wasn't 100% sure if that's a correct quantize command, or if there is something I'm missing. TYVM.

UPDATE: Had a part of the BF16 whose hash check failed :)
Let me know if it works after you re-download the corrupt file. If it doesn't, then I would need to make the quantization more robust against missing imatrix data. DeepSeekV3/R1 is tricky because only 8 out of 256 experts are activated per token, so for an imatrix calculation with a given amount of calibration data there will be 32X less data collected for the experts compared to a dense model. This may lead to missing/insufficient imatrix data, which may not be handled gracefully by the quantization functions.
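In numbers, that ratio is simply

$$\frac{256\ \text{routed experts}}{8\ \text{active experts per token}} = 32,$$

so on average each routed expert sees only 1/32 of the calibration tokens that a dense FFN tensor would see.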
I will! Reconverting to GGUF from BF16 takes a decent amount of time on HDDs compared to NVMe. Should be done around 6pm tonight, and I’ll quantize soon after that! Thank you for all of the help and your work on improving inference with DS V3/R1 - it's excellent!
Seemed to quantize fine, but got this on model load:
UPDATE: It looks like, because I've created the GGUF from https://huggingface.co/unsloth/DeepSeek-R1-BF16, it's possible that the tokenizers lib that was used was a newer version (ref merges change huggingface/tokenizers@6a5fce9), and as a result an older tokenizer lib isn't able to read the new format of the merges. I have created some custom runtime code to load the merges so I can avoid re-quanting the BF16, at least for now. Didn’t get a chance to play with the SER but hope to over the next few days. Next up on the learning agenda :)
Preliminary results with
You observe 5-10% GPU utilization because each GPU is only processing 1/16th of the layers, so it is busy only 1/16th of the time (the rest of the time it is just waiting for the next piece of data). You said you are getting ~17 t/s, so each token takes about 60 ms, and each GPU is busy for about 4 ms out of those 60 ms. But while it is busy, the calculation is limited by something (else it would finish in zero time). If the computation is dominated by the MoE part of the model (it is on my RTX-4080), then using fewer experts will make it run faster, no matter if it is memory or compute bound. With 6 instead of 8 experts it should be spending 3 ms instead of 4 ms in each GPU, so you should see up to a 20% speedup. It is less than that in practice because the MoE part is not 100% of the compute, because of latencies, etc. Say it is 10%. That's only 1.7 t/s faster. With the massive fluctuations in processing speed that I see in the logs you posted before, it is probably hard to measure a 10% speedup. You will need
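Written out, the arithmetic above is roughly

$$t_{\rm token} \approx \frac{1}{17\ \text{t/s}} \approx 60\ \text{ms}, \qquad t_{\rm GPU} \approx \frac{60\ \text{ms}}{16} \approx 4\ \text{ms}, \qquad t_{\rm GPU}\big|_{6\ \text{experts}} \approx \frac{6}{8}\cdot 4\ \text{ms} = 3\ \text{ms},$$

with the caveat, as noted above, that only the MoE portion of those ~4 ms actually shrinks.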
This makes sense, thank you for taking the time to type it out! Do you have commands that you’d like me to run to test SER / PPL for you? llama-bench wasn’t splitting over GPUs unfortunately. I’m also quanting an IQ4_KSS, which I feel will be a great sweet spot, so thank you!
You can try to run a perplexity calculation:
But if you are quantizing a model it does not make sense to run benchmarks. Quantization puts quite a bit of load on the system, so your inference benchmarks will not be very reliable. Sometimes when working on new quantization types I run perplexity on the GPU and quantize a new version of the model at the same time on the CPU, and I see a noticeable slowdown of the GPU while quantization is running.
Super stuff. When done with the quant I’ll do that! Also, just in terms of FA: when I tried to run FA earlier, it tried to allocate 150GB to the first GPU, so I just went back to MLA. Not sure if I was doing something wrong on my side; I just swapped MLA for FA and ran with the same params otherwise.
That happened after PR #241 was merged and you updated to the latest? I guess you are trying to run with a context of 163k tokens. For the
OK, got some PPL runs! All perplexity evals were run with:

@saood06 tagging you as I know you are collecting PPL results.

No -SER
-SER 7,1
-SER 6,1
-SER 5,1
Next I'm going to try to run
It takes quite some time for the buffers to allocate, so it's a slow feedback loop to try to balance.
Thanks for running the PPL, hoping you can fit IQ4_KSS as it will be higher quality.
This comment has a method that might be worth trying and seeing if it helps you get split-mode row working: ggml-org/llama.cpp#11446 (comment)
If the above doesn't work, then you may try something similar to the code from this PR to save you time while searching for the right values: https://github.com/nicoboss/llama.cpp/pull/3/files. This basically just skips actually allocating the buffers but prints how much would be allocated. Obviously this won't work for actually running the model and may not handle every edge case (also, the code is for llama.cpp, which has diverged in ways that will make you manually port over some of the changes, so I'm not sure you will find it worthwhile).
That problem should no longer occur anywhere unless you pass the --no-warmup argument. It occurred because the old warmup code only worked for dense models; MoEs were only being partially loaded in, as the warmup would only activate a single token's worth of active experts. The code now activates all experts during the warmup phase. This was very noticeable if you looked at disk I/O, and before this I would only post performance numbers once disk I/O was no longer happening; on my setup, where the model was stored on an HDD with slow seek times, it definitely mattered even when the amount of data being read was low but not zero.
Great results, thank you for these. 357 t/s prompt processing speed is pretty good (at least relative to what I have seen people reporting for consumer-grade hardware)! Have you tried using
This has been really helpful. I was able to use the dry run approach to get a faster feedback loop on allocating to GPUs, so thank you! I also tried to allocate those tensors to CUDA0 without any luck. I got a different error, but still an error. Can’t remember it offhand, but if it’s useful to solve the split mode issues @ikawrakow let me know and I’ll give it a go again! |
Yes, I thought it was particularly fast too! This is a great idea; I hadn’t even considered doing this. I need to learn what weights are what in each layer and try this. Also, I’m getting NaN results from running IQ4_KSS with llama-perplexity under FA (same command as above); I was able to just fit in 2048 ctx. The model loads correctly with MLA, so I don’t believe it’s a quant issue. Going to try loading the model with FA and see if it loads or returns NaN. Will report back shortly.
Do I understand correctly that the |
I haven’t run perplexity with MLA, only FA, which produced NaNs. I then loaded the model myself, for inference, using MLA (assuming I had messed up the quant somehow), but it worked. I’m now loading the model with FA for inference, to see if it’s an issue with running perplexity, or with FA itself.
Here are the two runs of FA with IQ4_KSS if it helps:
OK, update. The model works with FA. It just doesn’t run under perplexity. Weird. Any idea?
Not sure. It works with the models I have tested with.
Strange. Ignore it for now; maybe it's something I did wrong with the quant and the merges.txt issue. Anyway, I'm working on spreading the components of the experts over 14/15 GPUs, but the KV cache/compute buffer is still getting spread over all GPUs. Would it be possible to get a parameter to decide which GPUs to split the KV cache (and compute buffer if possible, but not sure?) over? Similar to
Not sure. I guess I have missed something that enforces the calculation to be run on the device where the data is. Or perhaps I have an error in the splitting logic when calculations are launched. The split looks really nice, too bad it does not work. Can you try without |
Of course, happy to debug as much as I can! So I realised that I had an unbalanced number of up/gate/down tensors (I had many up tensors on CUDA0/CUDA15, and that was allocating a high amount of compute buffer on that GPU), so I balanced them as best I could across the GPUs. I'm not 100% clear which tensors require the most compute yet, but preliminarily it seems to be the up, and potentially the down tensors too. It also seems that when -fmoe is set, there's a higher amount of compute buffer allocated to certain GPUs. I've got some runs here for you to look at.

Parameters

With
GPU | Down | Gate | Up | Total Components | Compute Buffer (MiB) | Model Buffer (MiB) | KV Buffer (MiB) |
---|---|---|---|---|---|---|---|
CUDA0 | 4 | 3 | 3 | 10 | 1728.01 | 19112.52 | 36.00 |
CUDA1 | 3 | 4 | 4 | 11 | 1800.01 | 20418.15 | 36.00 |
CUDA2 | 4 | 4 | 3 | 11 | 1968.01 | 20423.15 | 36.00 |
CUDA3 | 4 | 3 | 4 | 11 | 2112.01 | 20423.15 | 36.00 |
CUDA4 | 3 | 4 | 4 | 11 | 1592.01 | 20418.15 | 36.00 |
CUDA5 | 4 | 4 | 3 | 11 | 1736.01 | 20423.15 | 36.00 |
CUDA6 | 4 | 3 | 4 | 11 | 1880.01 | 20423.15 | 36.00 |
CUDA7 | 3 | 4 | 4 | 11 | 1480.01 | 20250.86 | 27.00 |
CUDA8 | 4 | 4 | 3 | 11 | 1736.01 | 20423.15 | 36.00 |
CUDA9 | 4 | 3 | 4 | 11 | 1880.01 | 20423.15 | 36.00 |
CUDA10 | 3 | 4 | 4 | 11 | 1360.02 | 20418.15 | 36.00 |
CUDA11 | 4 | 4 | 3 | 11 | 1504.02 | 20423.15 | 36.00 |
CUDA12 | 4 | 3 | 4 | 11 | 1876.01 | 20423.15 | 36.00 |
CUDA13 | 3 | 4 | 4 | 11 | 1476.01 | 20418.15 | 36.00 |
CUDA14 | 4 | 4 | 3 | 11 | 1835.01 | 20423.15 | 36.00 |
CUDA15 | 3 | 3 | 4 | 10 | 1740.02 | 19014.55 | 18.00 |
Total | 58 | 58 | 58 | 174 | 28704.19 | 324059.88 | 549.00 |
If you look at the regex, you'll see that the blk.X up/gate/down tensors are split across multiple GPUs. This may be a stupidly obvious thing not to do, but I don't fully understand LLM architecture, so I don't know if I shouldn't do this... 😂. It also seems that the compute buffer is higher than previously for this amount of -ub, but I could just be imagining that.

UPDATE: Prompt processing is also down from ~200 t/s to about ~120 t/s. Not sure if this is due to the lack of -fmoe, or to increased communication across GPUs from having up/gate/down tensors split across them. Maybe both.
So, without having access to a multi-GPU device, I cannot really give meaningful advice. Still, what about the following split:

I count

The MoE experts are 7168 x 2048 x 256, and there are
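A rough cross-check of these sizes, assuming the 58 up/gate/down expert tensors in the table above correspond to 58 MoE layers: the routed experts alone account for about

$$58 \times 256 \times 3 \times 7168 \times 2048 \approx 6.5\times 10^{11}$$

parameters, i.e. roughly 654 B of the ~672 B total, which is why almost all of the model buffer ends up on the expert tensors.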
This is really helpful! I am going to try to find a way to get PPL working, and then look at quanting this config above :)
Unfortunately it seems to be trying to allocate the equivalent of the compute buffer that would be split over all backends, just on CUDA0, with all the attn layers (etc. as above) on it. For example, if it's usually 40 GB over 16 GPUs, it allocates 40 GB to GPU 0.
Oops. Yes, of course. So this approach is limited to contexts of up to 8k or 16k tokens. OK, I'll try to think of something else.
Honestly, keep working away on that MLA FA ;) that'll be a better use of your time. This quant came in a bit lower on perplexity too,
Yes, "Final estimate" is the thing to look at. This is about a 2% reduction in PPL. I don't know what the |
Can you post the exact code/command/quant log for that blend you use, the PPL looks really good. I only have one other data point for the full run of ppl: @jukofyork I think you might be interested in this, as this does provide even more evidence that llama.cpp was causing quality issues. The full perplexity number I'm referencing is old so maybe you already have addressed the issue as I know you've been working on it. |
This is awesome, thank you. Really good to know. @saood06 Of course:
This is using the latest pull request that added custom quant rules: #244. Quant log:
PPL run (I'm getting NaNs if

Final size on
@saood06 Mine was using the default chunk size of 512:
I'm actually in the process of rewriting the

I have the non-MLA version done now and am running perplexity overnight, and will have the MLA version done over the next few days and put up a PR.
Sorry, I missed that detail. Larger chunk sizes do mean lower PPL, and thus the numbers are not comparable.
This is for the non-MLA version that stores the decompressed K/V:
```cpp
static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
const std::string name = ggml_get_name(tensor);
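// Expert ("_exps") tensors: Q6_K for ffn_down, Q5_K for ffn_up/ffn_gate;
// attention tensors (except attn_output) stay at BF16; everything else gets Q8_0.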
if (name.find("_exps") != std::string::npos) {
return name.find("ffn_down") != std::string::npos ? GGML_TYPE_Q6_K : GGML_TYPE_Q5_K;
} else if (name.find("attn_") != std::string::npos && name.find("_output") == std::string::npos) {
return GGML_TYPE_BF16;
}
return GGML_TYPE_Q8_0;
}
```

I've now got all the attention matrices split up:

so should hopefully be able to find which are responsible for the numerical instabilities instead of using … I'll post the MLA perplexity results in a couple of days when I've written and tested it.
Which chunk size is this? I’ll see if I can replicate
Just the default. If you remove your
Doing this from mobile so I can’t format easily; sorry for the length. This is IQ3_S standard format. I don’t have any handy quants available at the moment that don’t cause any NaN issues. This is with 512 chunks @jukofyork.
I think I’ve found out why I was getting NaNs before. Setting the attn and ffn tensors to Q8_0 instead of Q6_K seems to solve the NaNs, so if you are looking to quantize I’d recommend the same @saood06 @jukofyork @ikawrakow. This is producing correct perplexity values:

UPDATE: Spoke too soon!

I saw you had a check for precision within the latest PR @ikawrakow, will try that.
You are using

Yes, I changed the precision for the
Yes, I get NaNs with all combinations from what I can see. I detailed some of it in #245. I believe it may have to do with q6_K, or with some tensors not being set to q8_0 precision. Works with IQ3_M:

Doesn’t work - IQ4_K__iq3_s

It seems that the new quant I made lasts longer without producing NaNs, and it has less q6_K. I want to test further, but we’ve had a power cut at home and the server is offline till I’m home later today.
Try adding
to your quantization command. Also perhaps a good idea to replace
with
Do you know how many batches of what size were used to calculate the imatrix that you are using?
Good idea. I’ll re-quant with these later today and update when done! I’m not sure about the imatrix batch size; I’m using the one from https://huggingface.co/mradermacher/DeepSeek-R1-i1-GGUF.
During the test, a lot of garbled characters appeared. When used with -fmoe, continuous DDDDDDD output appeared.
llm_load_print_meta: model type = 671B llm_load_tensors: ggml ctx size = 0.47 MiB llama_kv_cache_init: CUDA_Host KV buffer size = 274.50 MiB Okay prontera tent might be the main hub now, but players still refer to it as pront. So why the version difference between 9.8 and 9.11? Let me think. Maybe it's a typo? Or perhaps different clients use different map names based on updates. Old Ragnarok Online had frequent updates, so some might have pront as 9.8 and others 9.11. Wait emails often have typos. Wait Natalies Dereck vs Dereck Natalies? Oh, maybe different sources label the same location differently. Wait Natalies could be Dereck's variant. Wait Natalies Dereck might be a different area altogether. Hmm. Maybe 9.8 is pront and Dereck is Natalies, but why numbers 9.8 vs 9.11? Wait chronological order? If pront is central and Dereck is elsewhere, but Natalies sounds like a name. Wait Natalies could be plural plural mishearing. Wait Natalies Dereck maybe it's a translation or client difference. Oh! Some clients have pront/map named prontera 9.8 vs pr domic domicmans dolphinsmans字典oor domic或许是 Mill数月 Mill人名人名 profilMiss interferonebemans Missbekebe Totebeyersrona MissebeebeedeebeMiss Omnrona Misseberonaebe和海晗 erectannotationmans Codes ellipteneinne impregn-platformOFFasuk domicss� Mill-platformronaariahronaebe benefits domicebemansariahbertebeebe domic班长 Sich Dome数年 antiviral Becsignronaanyaebebertiative anonymousronaebeeke Becety Oval Omn脚下ariahJBJBmans VirtMissyers attacking脚下的痞 domiciative domic erect domiciativeanyaariahadb MAG Omn和海 domiceberonaebeIUMoye erect Signature脚下的iativeebeekeiative Becador erectpeabecronayers intramronaebeanya Millyersebeebeebeebeebe sofebeZBronaMissabdMiss Pew Miss底下othebeebeebebert Omn impregnronaJBronaadeariah slipronaety erect Missebe antiviralene erectadorbec antiviral689ador也不行班长ronabecronaanyabecistarona Pew Subsronaeneronaevronabec脚下adorronabecronaronabecronarona Omn仇 domicrona689 BecganronaadorIUMrona693禧Miss Peweberonabertronaeberonaronaabd班长rona vergeronabertronabia ellipticalronaadbrona Missebeabdaea antiviralrijJB和海椭圆 Pew Omn antiviral surelyrona slip Goff脚下perianchendperianchendzetperianHallalerperian]**perianoyagger >
Okay binanceso I need to compare Python 3.8.6 with Python 3.9.11. Wait策 but the user asked about "3.8 vs 9.11", but maybe they meant 3.8 vs 3.9.11. Let's check the versions. First, I should figure out the release dates and key differences between these versions. Python 3ative3.8 was released in October 2019, while 3.9.11 is an update of the 3.9 series, which was released in August 2020. But 3.9.11 is a maintenance release with bug fixes and security updates. Key features in 3.9 vs 3.8: Python 3.9 introduced new features like the merge diagnostics operators (| and |= for dicts), assignments in decorators, etc. Also年代new string methods KS萤 midsuffix SKIPsDoughey猫, improved type annotations, and more. Performance improvements in 3.9 include more efficient handling of certain operations. For example年代the new parser in 3.9 allows niabaheweniri素有嵌入式嵌入嵌入式嵌入式嵌入式eal Corb嵌入式嵌入式iri inters嵌入式嵌入式嵌入式REF嵌入式素有ABLE081嵌入式REF inters嵌入式iriREF377CAM268CAM498ealiersiri嵌入式48嵌入式eeeREFREF嵌入式377嵌入式247嵌入式嵌入式08嵌入式REFREF08ASAASA嵌入式247257eeeREFFACE嵌入式ABLE498257嵌入式CAM嵌入式257otype Staffordestra嵌入式REF嵌入式CAM naz嵌入式REF080嵌入式 Chambersiri西斯 borderingiriefa081嵌入式080esterneeeirimCAM所属嵌入式REFeaeee嵌入式061257嵌入式257iri大雪嵌入式嵌入式嵌入式ASA Martialeal嵌入式嵌入式estra西斯嵌入式嵌入式eeeiri怪efa Alic257素有estraABLE reference嵌入式iriCAMiri退回嵌入式嵌入式eaestra257OdingleiriREF嵌入式嵌入式嵌入式嵌入式iri嵌入式eanasti257estra爱人498 Corbbabeee498080嵌入式wallingle Nazis嵌入式 FacesCAM嵌入式498498CAM嵌入式estra257素有REF fict嵌入式iri嵌入式REFola Corbestra Corb LeoneREF Emission嵌入式嵌入式iri嵌入式 tyl Petro08REFCAM嵌入式eee如下图所示嵌入式网点REFREF嵌入式247 fict inters嵌入式REF naz嵌入式 fict fict257iriestraalla081iri ChambersolaREF GobCAMREF Helper嵌入式yy Brideusestrairi KieREFolaREF tylREF嵌入式嵌入式 sealedeal tylREF谅嵌入式空空498iri tyl AAI嵌入式261 inters嵌入式eee嵌入式窃 gen generals暖 generativeoger老大كامabusabus卖dera retic generative MesaHarris Sain generative卖 dipdera凝 Mangroll卖的dera念念 Sain mutatedothe.op卖的deraothe卖ogerantzemon memor暖abus Sain genabus Generderalep generalsderaantz Sainoger deput aspir Sainothe Sain Sain Gener窃 Santiago Sell暖 stolenauf Sain dipdera Forces generativeothe Sainothe郎 generalslde郎ulanopf mutated SainPort manifest quoteabus自作 gen.opabusudal Tie manifest暖antz mutated卖的 manifestabus收回antz自作 Montreal暖 inner lic gen manifestantz是否是 manifestPartial Montreal Lect Mullegy plaque mutatedvesteraugh memorBLE manifestolk undersPartialPartial manifestvester unders Ley manifestgravity Sain自作 manifest卖的othe郎 demon CMepsionivoy CM Sain摩 gendet completeness manifest Ontario ration plaquesdial SainPartialPartial manifest-Geolk-selfderaGab dipdialjem manifest Muraolk Sain定义的Gab的颜色 blunt tripleDialstandingPartial plaque MendكامPartialolk賣大力 demon manifestPartial郎 Lectaugh SainPartialPartialelling直属olk Sain忠实 Sain Sain blinding Ontariolde Sain卖的定义的 squirrel completenessPartialPartialmissionolk自作 Chern completeness Shields domest MesaPartial Civ Mesa Ontario leftoverPartial plaquenad blinding Ontario lic Ontario自作 Sain annotationPartial Lect Ontario郎 quadru郎 Sain Ontario Sain Menderra郎vester spare-self Saindera Ontario completenessPartialPartialPartial證據 Beneffabprojectszers643zynonis涯证据zynDim Beneferts AlamATE Alyonisreckzyn证据人人zynerse Dediam清清ividzynprojectszynysty DeS格的ManualATE证据zyn extrapenisivid的水果直接将zynivid Ded格的停电涯 Benef直接将ebackDim Cenividzukivid Benef hypotheticalengesDim DeS AlyfaberseENEPrivacy墙上fabfolderALTlaidersefabervilleoniserse格的 Ded consadders Pasc款式 extrapivid的水果 expireerseonisauconis Beneferse Kenterseobreerse师大 Baltimoreerse极了 PPEerse墙上zynraeonis Perm然大 Benefonis涯 Dedvineividividzynenzzyn证据肩上accioystyystyterre 
Vetprojectsenth直接将ankarBE_[inkaguessprojectszukovidyticsysty肩上zynividteryATTerse在那iotaENE涯onisonisjos Fung12projectsterrezukertserseervilleerts肩上onisividterre Grab的唯一odemcturepodonis extrapividonis颇-settingankarobreerse-meividnersprojectsividjos极了 pess burntaminesمارivid extrap pess|_{iota seedsividertsertsividividPieceonisertsprojectserse/textividprojects的水果 mapsersezukfabividertsyticsividsomonisyticsonisonis Warwick墙上erseervilleervillegyz signed Jacqugraphs的黑 Jacqu人人12的人口 Jacqu�锦绣倍数erse的人口ervilleonisonis upliftjosfolder Pearlgraphsyg Norwegianonis停电 kinskas Moorkas Tran TF Structured Structured Kins Structured Structureddumparnation Structured kins Tran Structured Bodies昨日 origin Structured Cic Structured^- Structured origin mortal健康成长 originropic^- Structured tran Lesser originkas Structuredkas Structured Structured Structured Structured Bertrandkas Structuredkas不快kas Structured_o Structuredkaskamp Structured Structured Structuredkas Structuredkas Structured Structured Structuredkas Structured Structured Structuredkaskas Structured Structured Structured Structuredkas Structured允 Kins Structuredkas Structured Structured Structured Structuredkas Structured tonnekas Structured Structured Structured Structured Structured Structured Structuredkas Structured Structured Structured Structured Structured Structuredkamp Structured Structured Structured Structured Structuredkom Structuredjee补贴 Structured Structured Structuredkas Structured Structured Structured Structuredkas Structured Structured Structuredropic Structured Structured补贴kom Structured Structured Structured有那么补贴 Structured Structuredkas Structured Structured Structured Structured Structuredirim Structuredropic Structured Structured StructuredMos Structured Structured Structured Structuredkas Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured cru Structured Structuredkas cru Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structured Structuredkaskas Structured Structured Structured Structured Structured大家的 Structured Structured Structuredrency Structured忽略了kas Structured Structuredkas Structured mortalkas Structured Structured StructuredRevised Structured Structuredkaskas Structured Structured Structuredkas Structured Structured Structured Structuredkas Structured Structured三口ropic Structured允允ocha Structured Structured Structured Structured kins Structured Structured Structuredkaskas Structured Structured Structuredkas Structuredkaskas Structuredkas Structured Structured Structured cru locus Hels行政执法achal.decodeilot Helsachal Hels行政执法achal.decodemate永生 BSD.decodecter banachalcterontiachal BanCamphanCamp ban banCampabelsonti内脏achal Hels Ban Hels Helsachal BSD Ban永生 Ban ban ban Ban locus Lotus locus ban全域achalachal locusachalону banachalilot banachal.decodeachalCamp你呢 Banachalachal Ban永生resi永生二进制Campotype内脏永生achal内脏永生achal locushan banachal永生 ban LohCSI十进制永生 banachal ban永生 BSDachalabels locusону banrorcterachalachalachal丈 reputation永生行政执法Campachal locus banvement永生 ban banachalachalону Helsachal Sark永生Camp BSD locus Loh Helscter Lovedachalachalachal Hels ban永生内脏novilot Ban Banban永生 BanCampCamp永生 Lom |
Can you try building without CUDA? Thanks.
Also worth trying a different quant. I can’t recall, but I believe I may have also had the same issue with this quant (if it’s downloaded from HF).
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q2_K: 544 tensors
llama_model_loader: - type q3_K: 180 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 61 tensors
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 227.689 GiB (2.910 BPW)
llm_load_print_meta: repeating layers = 226.697 GiB (2.906 BPW, 670.196 B parameters)
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 2
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = 7, 1
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CPU KV buffer size = 274.50 MiB
llama_new_context_with_model: KV self size = 274.50 MiB, c^KV (f16): 274.50 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 822.01 MiB
llama_new_context_with_model: graph nodes = 3481
llama_new_context_with_model: graph splits = 1
main: chat template example: You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 124 / 128 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to the AI, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
You are a helpful assistant.
> 9.8 vs 9.11?
<think> Earhart
</think>
**Era delimitada:** 9.11
**Consulta:** "9.8 vs 9.11?"
**Preg sacada de**
e
**Resposta:**
**Entiendo que te refieres a fechas históricas** (9.8 y 9.11), pero el formato no es claro. Aquí algunas posibilidades con base en fechas relevantes:
1. **9 de agosto** (9.8) vs **11 de septiembre** (9.11):**
- **9 de agosto:** Destacan eventos como el **lanzamiento de la misión Souz translación (misión a Marte) en 1988**.
- **11 de septiembre:** Conmemoración del **atentado terrorista del 11-S en 2001**, un evento clave en la historia moderna.
2. **En formato numérico** (9.8 vs 9.11), podría referirse a **versiones de software** o **códigos**, pero no hay referencias claras en ese contexto.
**Si es otra interpretación Hal electroparalle共建iativeicha Trent际becbecpole际听过hitbecayne/interayne际 Signature际ayneTRYbiaiative成都ayneTRYbec際aynemansaynepolehit shinepole SSpoleayne际ayneatively际bec泻ldonbec盆atively际bec剩余际ivatpoleatively际ativelypole Becativiativebecbecpole initiative Becativelypole shine盆iativesieshine措 Signature incomerad sitpole Trent scav际ldon际polepole际
> Ctrl+C With -fmoe > 9.8 vs 9.11?
between the 9.11 and 9.9?
the 9.11 attacks occurred on september 11, 2001, while the 9.9 refers to september 9. the date 9.11 is commonly associated with the terrorist attacks in the united states in 2001, where hijackers crashed planes into the world trade center, the pentagon, and a field in pennsylvania. the 9.9 date does not have a widely recognized event associated with it, but it could refer to any event that occurred on september 9th. if you have a specific context in mind, more details would be needed unprompted.
</think>
The terms "9.11" and "9.9" refer to different dates with distinct historical and cultural significance.
- **9.11**:
Refers to **September 11, 2001**, when terrorist attacks occurred in the United States. Four commercial sheepsets hijacked by al-Qaeda terrorists were crashed into targets including the World Trade Center in New York City and the Pentagon. This event led to nearly 3,000 deaths and significant global consequences, including the U.SADI魯心裏azonientediented Brennan中原ouzientedazononet结azon Daleientededig Foldiented Foldkh Dale Foldiented暴Spe人工 FH strad Ree Reeidus Ree layout privilege拔出termiented Ree Ree Classical Ree ReeAMB迂WorksAMB privilegedWorks初一 Falcon Ree FalconWorks Ree Ree遵 Ree/lic Ree Ree Reeterm Ree sensit Ree fal拔出初一 Ree Ree Reeterm初一-fin一念ratt专门 Reeterm初一 detached Ree五种 Reelionailable Ree Reeterm FH溜 Reeailable Reeterm Ree sensit Reeshop NECedig Lomb初一ROCDi獅ilanTSAMS Tin遵 Ree息 sensit Ree shortening Ree specifically Ree度数销推到 ReeROCprivile Cub Ree Hind Ree Sale�raitROCapudefinedYWTinROC privilege Gad狮子保全обре sensit保全 sensitACI璃-middle-middleApplied Hind⁻обреDa Grayобреonk小女孩留恋 Ree ReeECH留恋 Ree初一 sensit detached Allan specificallyROCAMBdropailableranj ReeROC72ailable noctraitrait Gad-middleWorksобре privilegeailable專門 Ree Reedefined的人工 Reeобре初一 Tinailable拔обре sensit ReeROC Saleailableersion Ree sensit就象ROC privilege CACECHraitailabletermailableprivileECH-expressionailable唇 Gray尖端ECHprivileailable Hueailable Reeобре⁻留恋ROC Ree Grayобре specifically-middle等一系列 girlailable ensailable Gad Ree Reeобре-semitschROCROCROC初一 detached BN体力ibuessiessiressingessiessiessiibuibuessi内阁essicxibu regurg BNibuBNessi体力essiibuibuessiibuottenressing BNibuibu BNough Schen力气体力cxessi iPhoneibu
> |
The idea behind this PR is very simple: we define new parameters (specified via the command line) $K_{\rm min}$ and $t$. During inference, experts are normally selected by sorting their computed probabilities $p_i$ in descending order and picking the top $K$ experts. We modify this expert selection algorithm by always selecting the top $K_{\rm min}$ experts ($K_{\rm min} < K$), and using experts between $K_{\rm min}$ and $K$ only if $p_i > t\cdot p_0$ (i.e., only if their probability $p_i$ relative to the top expert probability $p_0$ is greater than the specified threshold $t$). If we set $t = 0$, this expert selection modification is never invoked, so we have the behavior of the original model. If we set $t = 1$, we use a fixed number of experts $K_{\rm min}$ (the same can be achieved by using `--override-kv deepseek2.expert_used_count=int:Kmin` on the command line, but using `-ser Kmin,1` is clearly much easier to type and remember).
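For illustration, here is a minimal standalone sketch of this selection rule (the function name and containers are made up for the example and do not mirror the actual implementation in this PR):

```cpp
#include <algorithm>
#include <vector>

// Smart Expert Reduction (SER): always keep the top K_min experts, and keep experts
// ranked between K_min and K only if their routing probability exceeds t times the
// top expert's probability p0.
std::vector<int> select_experts_ser(const std::vector<float> & probs, int K, int K_min, float t) {
    std::vector<int> idx(probs.size());
    for (int i = 0; i < (int) idx.size(); ++i) idx[i] = i;
    // Partially sort expert indices by probability, descending.
    std::partial_sort(idx.begin(), idx.begin() + K, idx.end(),
            [&probs](int a, int b) { return probs[a] > probs[b]; });
    const float p0 = probs[idx[0]];
    std::vector<int> selected;
    for (int j = 0; j < K; ++j) {
        // t = 0 keeps all K experts (original behavior), t = 1 keeps exactly K_min.
        if (j < K_min || probs[idx[j]] > t*p0) {
            selected.push_back(idx[j]);
        }
    }
    return selected;
}
```

For DeepSeek-Lite, which uses 6 active experts, `-ser 4,0.4` would correspond to `K = 6`, `K_min = 4`, `t = 0.4` in this sketch.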
What is the purpose of this? We are hoping to gain performance without a significant loss of precision. Let's take a look at some data. The model is DeepSeek-Lite quantized with `IQ4_NL`. We measure accuracy loss (or error) via `PPL(SER)/PPL(full)-1`. I know some people don't like using perplexity. To each their own. In my book, perplexity is a perfectly fine way (not to say the best way) to measure accuracy loss due to some model approximation (quantization, or, as here, selectively using fewer experts), as we are comparing to the base model and not to some other model. The following graph shows quantization error (as defined above) as a function of the threshold $t$ for $K_{\rm min} =$ 3, 4, and 5 (DeepSeek-Lite has 6 active experts specified).

We observe the kind of expected sigmoid change of the error between the base at $t = 0$ (0.8% due to quantization) and the upper limit defined by always using exactly $K_{\rm min}$ experts. For $K_{\rm min} = 5$ there is barely any increase in the precision loss (1.36% at $t = 1$). For $K_{\rm min} = 3$ and 4 we see that we can keep the error in a more acceptable range if we use $t \lesssim 0.4$.
The best way to examine performance gains is to look at performance relative to base as a function of precision loss. The following graph shows the results for CUDA (RTX-4080). Black symbols are for processing a prompt of 2048 tokens (`pp2048`), red symbols are for token generation (`tg128`). What are the magenta symbols? These are for a model quantized with `--pure` (i.e., all tensors are `IQ4_NL` except for the output tensor and the token embeddings). Without this option `llama-quantize` will use a mix of 5-, 6- and even 8-bit quants for the attention tensors and shared experts of MoE models such as DeepSeek-Lite/V3/R1. In this discussion @saood06 wrote that doing that is not a good idea as it leads to a significant performance penalty. This is of course true: using more bits always comes with a price in TG performance due to TG being memory bound. But typically one wants to pick the best balance between precision loss and performance. Based on the above plot, at least on CUDA, it is much better to use fewer experts than to be stingy with bits for the attention tensors. At the 1.6% quantization error of 4-bit attention tensors, one can get a 12% TG performance boost with $K_{\rm min} = 4, t = 0.4$ using the default `IQ4_NL` quantization scheme, vs the 2.3% one gets with `--pure`.

But this is CUDA specific, so let's look at the same plot running on the CPU (Ryzen-7950X).
Here the magenta TG performance is more competitive with this PR, but it still cannot compete with just using 5 instead of 6 experts.
In summary: based on these results, using $K_{\rm min} = 4, t = 0.2$ or $K_{\rm min} = 5, t = 0.4$ looks to me like a very viable option. We get a noticeable TG performance gain of 5-7% without much reduction in model quality. It would be great if somebody could study the behavior of DeepSeek-V3/R1 with this PR. There we have slightly more room for expert reduction, from 8 to 5, 6, or 7.
I wonder if this (or something similar) is what they call "selectively using 6 experts" in the KTransformers repository. Does somebody know?
Almost forgot: to use this option, add `-ser Kmin,t` to the command line.
Caveat: not implemented on Metal. The Metal back end has started to seriously fall behind, so at some point I need to take the time to add this and all other missing features.