Replies: 3 comments 1 reply
-
I tried other combinations. The Flux Q4_K_S just seems to be faster than the Flux Q3_K_S, despite the latter being smaller. T5 FP8 + Flux Q3_K_S obviously don't fit together in 8 GB VRAM, and still the Flux Q3_K_S was slower.

With 512 x 512 images:
- T5 FP8 + Flux Q3_K_S (…)
- T5 Q3_K_L + Flux Q4_K_S (…)
-
I think that the Q4 quants are, in general, faster to dequantise than the Q3 or Q5 quants. I suspect this is because 4-bit quants fit more simply into the 8-bit pseudo-tensors than 3-bit or 5-bit quants do?
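Roughly what I mean, as a toy sketch in plain NumPy (not the actual GGUF kernels or block layouts, which also carry per-block scales): two 4-bit codes pack exactly into one byte, so unpacking is a shift and a mask, while 3-bit codes straddle byte boundaries and need per-element bit offsets.

```python
import numpy as np

# Two 4-bit codes fit exactly into one byte: unpacking is a shift and a mask.
def pack_q4(codes):                      # codes: uint8 values 0..15, even length
    return (codes[0::2] & 0x0F) | ((codes[1::2] & 0x0F) << 4)

def unpack_q4(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

# 3-bit codes don't align to byte boundaries: 8 codes span 3 bytes, so each
# code has its own bit offset and can straddle two bytes (LSB-first packing here).
def pack_q3(codes):                      # codes: uint8 values 0..7
    out = np.zeros((codes.size * 3 + 7) // 8, dtype=np.uint8)
    for i, v in enumerate(codes):
        byte, off = divmod(i * 3, 8)
        word = (int(v) & 0x07) << off
        out[byte] |= word & 0xFF
        if word > 0xFF:                  # code spills into the next byte
            out[byte + 1] |= word >> 8
    return out

def unpack_q3(packed, n):
    out = np.empty(n, dtype=np.uint8)
    for i in range(n):
        byte, off = divmod(i * 3, 8)
        window = int(packed[byte])
        if byte + 1 < packed.size:
            window |= int(packed[byte + 1]) << 8
        out[i] = (window >> off) & 0x07
    return out

codes4 = np.random.randint(0, 16, size=32, dtype=np.uint8)
codes3 = np.random.randint(0, 8, size=32, dtype=np.uint8)
assert np.array_equal(unpack_q4(pack_q4(codes4)), codes4)
assert np.array_equal(unpack_q3(pack_q3(codes3), 32), codes3)
```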
-
The code I'm using from city96 has on-the-fly encoding to Q4_1, Q5_1, and Q8_0 - not Q6. So doing a Q6 mix requires using the patching mechanism (downloading a Q6 model and mixing it in). I tend to do Q4, Q8, and bfloat16 mixes. But I'm working on some objective measures of the accuracy of different quants in different bits of the model.
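The kind of measurement I have in mind looks roughly like this (my own simplified sketch, not the ComfyUI-GGUF code; the Q8_0 round trip is an approximation of the scheme and the checkpoint path is a placeholder): quantise each weight tensor, dequantise it, and report the relative error against the bfloat16 original.

```python
import torch

def q8_0_roundtrip(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    # Q8_0-style round trip: per-block int8 codes with an fp16 scale
    # (simplified illustration, not the actual GGML block layout).
    flat = x.float().reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True) / 127.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    q = torch.round(flat / scale).clamp(-127, 127)
    return (q * scale.half().float()).reshape(x.shape)

def quant_error_report(state_dict, block: int = 32):
    # Relative L2 error per tensor: a rough, objective proxy for how much
    # a given quant hurts each part of the model.
    for name, w in state_dict.items():
        if w.ndim < 2 or w.numel() % block:   # skip biases and odd shapes
            continue
        deq = q8_0_roundtrip(w, block)
        rel = (deq - w.float()).norm() / w.float().norm()
        print(f"{name:60s} rel_err={rel.item():.5f}")

# Usage (placeholder path for a bf16 state dict):
# sd = torch.load("flux_bf16_state_dict.pt", map_location="cpu")
# quant_error_report(sd)
```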
-
My system is an old one with low VRAM, so Flux was always slow. But with the Flux Q3_K_S model and the T5 Q3_K_L encoder, I was able to generate without offloading to RAM.
https://github.com/city96/ComfyUI-GGUF
I thought this would improve speeds, but it's about the same and in some cases even worse when "loaded completely" is used. So is there a different bottleneck here?
This is with 1024 x 1024 images.
The speed was about the same as with larger models, where Flux needs to be "loaded partially":

- T5 Q3_K_L + Flux Q3_K_S (loaded completely)
- T5 FP8 + Flux Q4_K_S (loaded partially)

And with 512 x 512 images, generation is even faster when the Flux model is "loaded partially"!

- T5 Q3_K_L + Flux Q3_K_S (loaded completely)
- T5 FP8 + Flux Q4_K_S (loaded partially)
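For reference, a rough way to compare the two setups outside the ComfyUI graph (just a sketch; `generate` is a stand-in for whatever runs one sampling pass, not a real API): time one run and record the VRAM high-water mark. If the fully loaded Q3_K_S shows about the same time per run as the partially loaded Q4_K_S, the bottleneck is probably dequantisation rather than RAM-to-VRAM transfers.

```python
import time
import torch

def profile_run(fn, label):
    # Crude profiler: wall-clock time plus the VRAM high-water mark for one run.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{label}: {dt:.1f} s, peak VRAM {peak_gib:.2f} GiB")

# Usage (hypothetical callables that each run one 1024 x 1024 generation):
# profile_run(lambda: generate(flux_q3_k_s, steps=20), "Q3_K_S, loaded completely")
# profile_run(lambda: generate(flux_q4_k_s, steps=20), "Q4_K_S, loaded partially")
```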