On a 3090 (24 GB VRAM) I can run a batch size of 30 for SDXL (1280x720), but only 12 with TGATE enabled (33 steps, gate_step=10). I'm using diffusers and running in a text console without a GUI loaded, so all VRAM is available. Is this high VRAM cost for TGATE expected?
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                Persistence-M  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap |            Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090       Off  |   00000000:07:00.0  On |                  N/A |
| 54%  64C   P0            298W /  300W  |   23642MiB /  24576MiB  |    100%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      8392      C   python3                                   23624MiB   |
+-----------------------------------------------------------------------------------------+
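For reference, the batch sizes above imply a rough per-image VRAM cost. This is only a back-of-the-envelope sketch: it assumes memory scales linearly with batch size and ignores the fixed footprint of the model weights, so the real per-activation overhead is somewhat higher than the ratio shown.

```python
# Back-of-the-envelope per-image VRAM estimate from the numbers above.
# Assumption: usage scales roughly linearly with batch size (ignores
# the fixed cost of the SDXL weights, which inflates both estimates).
total_mib = 23624        # python3 process usage reported by nvidia-smi
baseline_batch = 30      # max batch size without TGATE
tgate_batch = 12         # max batch size with TGATE (gate_step=10)

per_image_baseline = total_mib / baseline_batch   # ~787 MiB per image
per_image_tgate = total_mib / tgate_batch         # ~1969 MiB per image
overhead = per_image_tgate / per_image_baseline   # = 30/12 = 2.5x

print(f"{per_image_baseline:.0f} MiB -> {per_image_tgate:.0f} MiB "
      f"per image ({overhead:.1f}x)")
```

So under this (crude) model, enabling TGATE costs roughly 2.5x the VRAM per image in the batch.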
I'm seeing a 30% speedup. Quality-wise, there are a few more unusable images. The output has more high-frequency detail, so things like hair are improved, but in a noisy, high-frequency output, problems and noise caused by the model are also highlighted, making it less likely that the output is acceptable. I could try prompting for bokeh or using CLIP skip to reduce sharpness.
I wonder if SDXL or other models could be fine-tuned for TGATE to reduce the VRAM cost?