Replies: 3 comments 1 reply
-
I tried other combinations. The Flux Q4_K_S just seems to be faster than the Flux Q3_K_S, despite the latter being smaller. T5 FP8 + Flux Q3_K_S obviously don't fit together in 8 GB VRAM, and still the Flux Q3_K_S was slower.

With 512 x 512 images:
- T5 FP8 + Flux Q3_K_S (…)
- T5 Q3_K_L + Flux Q4_K_S (…)
-
I think that the Q4 quants are, in general, faster to dequantise than the Q3 or Q5 quants. I suspect this is because 4-bit quants fit more simply into the 8-bit pseudo-tensors than 3-bit or 5-bit quants do?
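Roughly what I mean, as a toy sketch in plain NumPy (not the actual GGUF kernels or block layouts, which also carry per-block scales): two 4-bit codes pack exactly into one byte, so unpacking is a shift and a mask, while 3-bit codes straddle byte boundaries and need per-element bit offsets.

```python
import numpy as np

# Two 4-bit codes fit exactly into one byte: unpacking is a shift and a mask.
def pack_q4(codes):                      # codes: uint8 values 0..15, even length
    return (codes[0::2] & 0x0F) | ((codes[1::2] & 0x0F) << 4)

def unpack_q4(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

# 3-bit codes don't align to byte boundaries: 8 codes span 3 bytes, so each
# code has its own bit offset and can straddle two bytes (LSB-first packing here).
def pack_q3(codes):                      # codes: uint8 values 0..7
    out = np.zeros((codes.size * 3 + 7) // 8, dtype=np.uint8)
    for i, v in enumerate(codes):
        byte, off = divmod(i * 3, 8)
        word = (int(v) & 0x07) << off
        out[byte] |= word & 0xFF
        if word > 0xFF:                  # code spills into the next byte
            out[byte + 1] |= word >> 8
    return out

def unpack_q3(packed, n):
    out = np.empty(n, dtype=np.uint8)
    for i in range(n):
        byte, off = divmod(i * 3, 8)
        window = int(packed[byte])
        if byte + 1 < packed.size:
            window |= int(packed[byte + 1]) << 8
        out[i] = (window >> off) & 0x07
    return out

codes4 = np.random.randint(0, 16, size=32, dtype=np.uint8)
codes3 = np.random.randint(0, 8, size=32, dtype=np.uint8)
assert np.array_equal(unpack_q4(pack_q4(codes4)), codes4)
assert np.array_equal(unpack_q3(pack_q3(codes3), 32), codes3)
```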
-
The code I'm using from city96 has on-the-fly encoding to Q4_1, Q5_1, and Q8_0 - not Q6. So doing a Q6 mix requires using the patching mechanism (downloading a Q6 model and mixing it in). I tend to do Q4, Q8, and bfloat16 mixes. But I'm working on some objective measures of the accuracy of different quants in different bits of the model.
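The kind of measurement I have in mind looks roughly like this (my own simplified sketch, not the ComfyUI-GGUF code; the Q8_0 round trip is an approximation of the scheme and the checkpoint path is a placeholder): quantise each weight tensor, dequantise it, and report the relative error against the bfloat16 original.

```python
import torch

def q8_0_roundtrip(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    # Q8_0-style round trip: per-block int8 codes with an fp16 scale
    # (simplified illustration, not the actual GGML block layout).
    flat = x.float().reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 127.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    q = torch.round(flat / scale).clamp(-127, 127)
    return (q * scale.half().float()).reshape(x.shape)

def quant_error_report(state_dict, block: int = 32):
    # Relative L2 error per tensor: a rough, objective proxy for how much
    # a given quant hurts each part of the model.
    for name, w in state_dict.items():
        if w.ndim < 2 or w.numel() % block:   # skip biases and odd shapes
            continue
        deq = q8_0_roundtrip(w, block)
        rel = (deq - w.float()).norm() / w.float().norm()
        print(f"{name:60s} rel_err={rel.item():.5f}")

# Usage (placeholder path for a bf16 state dict):
# sd = torch.load("flux_bf16_state_dict.pt", map_location="cpu")
# quant_error_report(sd)
```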
-
My system is an old one with low VRAM, so Flux was always slow. But with the Flux Q3_K_S model and the T5 Q3_K_L encoder, I was able to generate without offloading to RAM.
https://github.com/city96/ComfyUI-GGUF
I thought this would improve speeds, but it's about the same and in some cases even worse when "loaded completely" is used. So is there a different bottleneck here?
This is with 1024 x 1024 images.
The speed was about the same as with larger models, where Flux needs to be "loaded partially":

- T5 Q3_K_L + Flux Q3_K_S (loaded completely)
- T5 FP8 + Flux Q4_K_S (loaded partially)

And with 512 x 512 images, generation is even faster when the Flux model is "loaded partially"!

- T5 Q3_K_L + Flux Q3_K_S (loaded completely)
- T5 FP8 + Flux Q4_K_S (loaded partially)
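For reference, a rough way to compare the two setups outside the ComfyUI graph (just a sketch; `generate` is a stand-in for whatever runs one sampling pass, not a real API): time one run and record the VRAM high-water mark. If the fully loaded Q3_K_S shows about the same time per run as the partially loaded Q4_K_S, the bottleneck is probably dequantisation rather than RAM-to-VRAM transfers.

```python
import time
import torch

def profile_run(fn, label):
    # Crude profiler: wall-clock time plus the VRAM high-water mark for one run.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{label}: {dt:.1f} s, peak VRAM {peak_gib:.2f} GiB")

# Usage (hypothetical callables that each run one 1024 x 1024 generation):
# profile_run(lambda: generate(flux_q3_k_s, steps=20), "Q3_K_S, loaded completely")
# profile_run(lambda: generate(flux_q4_k_s, steps=20), "Q4_K_S, loaded partially")
```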