Replies: 3 comments 1 reply
-
| To partially answer my own question: the modified GPTQ that turboderp is working on for ExLlama v2 looks really promising, even down to 3 bits. 3B, 7B, and 13B models have only been lightly tested, but going by early results, each step up in parameter count is notably more resistant to quantization loss than the last, and 3-bit 13B already looks like it could be a winner. Assuming the trend continues, I wouldn't be surprised if 3-bit 70B with the new quantization method equals or even outperforms current ungrouped 4-bit GPTQ, and if so, that's a very respectable memory saving. I have also seen one report that P100 performance is acceptable with ExLlama (unlike the P40), though mixing cards from different generations can be sketchy. Either way, it still looks like this setup may eventually be viable. |
-
| 
 Have you tried the 22b merges? The couple I used seemed alright as a midpoint. 
 Would have to turn it back on again. In SD, I'm finding that just setting the attention upcast to FP32 returns most of the speed on a P40; at that point, running full precision and keeping the model in FP32 makes no difference while using the xformers optimizer. People have also had luck adding a P40 to a faster card and splitting the model, and they still got respectable speeds in exllama. |
-
| I have 2 P100s that I'm looking to run for inference. I'm thinking about using the dual-GPU setup to run the Mixtral MoE model, but I'm not very familiar with the newer terminology (e.g. Q4, Q5) or all the different finetunes coming out. Any luck dealing with PyTorch disabling FP16 on the P100? Given the current developments with exl2 and the P100's FP16 performance, what configuration do you suggest? |
-
| I just tried Oobabooga -> ExLlama yesterday and it works fine on a P100. I just had to disable flash attention; I suppose it needs tensor cores or something, which is weird. But bottom line: ExLlama works on a P100. |
-
So, P40s have already been discussed, and despite the nice 24GB chunk of VRAM, they unfortunately aren't viable with ExLlama on account of their abysmal FP16 performance.
I was looking at card specs earlier and realized something interesting: P100s, despite being slightly older and from the same generation as P40s, actually have very good FP16 performance!
Early Pascal (P100) runs at a 2:1 FP16:FP32 ratio, which is great on paper for ExLlama. Later Pascal runs at a really awful 1:64 ratio, which makes FP16 math completely unviable. Turing/Volta also run at a 2:1 ratio, while Ampere and Lovelace/Hopper are both just 1:1. In absolute terms, Nvidia claims 18.7 TFLOP/s for FP16 on a P100; by comparison, a 3090 is listed at 29-35 TFLOP/s, so on paper a 3090 is roughly 1.5-1.9× as fast.
Memory bandwidth on the P100 is also excellent thanks to HBM2. It is listed at 732 GB/s, which is not that far off the ~900-1000 GB/s of a 3090/4090.
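As a quick sanity check on the figures above, here is a small Python sketch that turns the quoted peak numbers into ratios. It only uses the values from this post (vendor-listed peaks, not measurements); the helper name `ratio_range` is made up for illustration.

```python
# Peak figures as quoted in this post (spec-sheet numbers, not benchmarks).
P100_FP16_TFLOPS = 18.7
RTX3090_FP16_TFLOPS = (29.0, 35.0)   # low/high end of the quoted range
P100_BW_GBPS = 732.0
FAST_CARD_BW_GBPS = (900.0, 1000.0)  # rough 3090/4090 range from the post


def ratio_range(reference, others):
    """Return each value in `others` divided by `reference`, rounded to 2 places."""
    return tuple(round(x / reference, 2) for x in others)


fp16_speedup = ratio_range(P100_FP16_TFLOPS, RTX3090_FP16_TFLOPS)
bw_speedup = ratio_range(P100_BW_GBPS, FAST_CARD_BW_GBPS)

print(f"3090 vs P100, FP16 compute: {fp16_speedup[0]}x to {fp16_speedup[1]}x")
print(f"3090/4090 vs P100, memory bandwidth: {bw_speedup[0]}x to {bw_speedup[1]}x")
```

On paper the bandwidth gap is smaller than the compute gap. How much that helps in practice would depend on whether token generation is bandwidth-bound on these cards, which this arithmetic does not tell you.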
16GB P100s are dirt cheap on eBay; for example, here's a current listing (no affiliation) of at least 100 cards for sale at $150 each.
A caveat is that software support for FP16 on a P100 is reportedly spotty; for example, PyTorch apparently disabled FP16 math on these cards, citing "numerical instability". It's unclear to me whether that's really meaningful, or whether it just means there's slightly more rounding error, which probably wouldn't make any practical difference for LLM inference. (Also, would PyTorch weirdness here actually matter for ExLlama, since most of the math is being done by a custom kernel anyway?)
Of course, all that means nothing if the available VRAM doesn't pass certain thresholds. Right now, Meta withholding LLaMA 2 34B puts single 24GB card users in an awkward position: LLaMA 2 13B is arguably not that far off LLaMA 1 33B, which leaves a lot of VRAM unused, and it takes quite a bit more to fit 70B. By my napkin math, adding a second 16GB card for a total of 40GB gives almost enough VRAM to run 70B (i.e. it might load, but if it did, context would be extremely limited). I know @turboderp has been working on quantization improvements for ExLlama v2, and by my math it would only take something in the ballpark of a 10-15% reduction in overall quant size to make room for 70B at full context on this hardware configuration. I understand if you can't promise anything yet, but is at least that much a realistic hope for ExLlama v2?
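For anyone who wants to redo this kind of napkin math with their own numbers, here is a rough Python sketch. It is not the calculation behind the figures above. Everything in it is an assumption: a 70e9 parameter count, the LLaMA 2 70B attention shape (80 layers, 8 KV heads, head dimension 128), an FP16 cache, a 4096-token context, "GB" treated as GiB, and a hypothetical 1.5 GiB overhead allowance. It ignores activation buffers, fragmentation, and uneven splits across two cards, so it will come out more optimistic than real usage.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of the key/value cache in bytes (keys plus values, every layer)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def max_bits_per_weight(budget_bytes, n_params, kv_bytes, overhead_bytes):
    """Largest average bits per weight that still fits in the VRAM budget."""
    return (budget_bytes - kv_bytes - overhead_bytes) * 8 / n_params


# Assumed inputs (not taken from the post, adjust to taste).
N_PARAMS = 70e9
KV = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=4096)
OVERHEAD = 1.5 * GIB       # hypothetical allowance for everything else
BUDGET = (24 + 16) * GIB   # one 24GB card plus one 16GB P100

for bpw in (4.0, 3.5, 3.0):
    total = weight_bytes(N_PARAMS, bpw) + KV + OVERHEAD
    print(f"{bpw:.1f} bpw: {total / GIB:.1f} GiB needed of {BUDGET / GIB:.0f} GiB")

print(f"max average bpw that fits: {max_bits_per_weight(BUDGET, N_PARAMS, KV, OVERHEAD):.2f}")
```

The useful part is `max_bits_per_weight`: plug in the VRAM that is really usable on your cards and a realistic overhead, and it gives the average bits per weight a quant would need to hit to fit.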
Anyway, I was just curious if anyone had any thoughts on this, and especially if anyone has already tried running ExLlama on a P100. Given how popular 3090s and 4090s are, being able to run 70B well just by adding a cheap $150 secondary card, instead of spending 4-5× as much on a second 3090 or 10× as much on a second 4090, could make this a very practical low-budget way to get 70B running on a system like mine.