What is the correct way of using Forge for Flux checkpoints (CUDA OOM error) #1844

nitinmukesh · 2024-09-17T09:39:19Z

nitinmukesh
Sep 17, 2024

So I was using Auto1111 all the time before switching to Forge. I never get CUDA OOM error before while using Auto1111.
Something which I learned (from variety of tools like Auto1111, Diffsynth studio, Talking head, etc) is that there is GPU VRAM and then Shared memory from system RAM. First the GPU VRAM memory is utilized fully and then the rest will be spilled to shared RAM and if both are full and more is needed than you get CUDA OOM error.

However in Forge, it is not fully utilizing the VRAM and starts utilizing shared RAM. To support this check this image

4.8 Gb of VRAM is used and then started to use shared. So I got CUDA OOM error on 7.5 GB / 15.9 GB, which is not correct. It should be for e.g. 14 GB / 15.9 GB and application request for another 2 GB one should get CUDA OOM error.

Please find the console log and settings I am using. What am I doing wrong, Please advice.

venv "C:\usable\webui_forge_cu121_torch231\webui\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: f2.0.1v1.10.1-previous-526-gc13b26ba
Commit hash: c13b26ba271bac327879d32f01307fc21a012321
Launching Web UI with arguments: --disable-gpu-warning
Total VRAM 8188 MB, total RAM 16108 MB
pytorch version: 2.3.1+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4060 Laptop GPU : native
Hint: your device supports --cuda-malloc for potential speed improvements.
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: False
Using pytorch cross attention
Using pytorch attention for VAE
ControlNet preprocessor location: C:\usable\webui_forge_cu121_torch231\webui\models\ControlNetPreprocessor
2024-09-17 14:54:19,225 - ControlNet - INFO - ControlNet UI callback registered.
Model selected: {'checkpoint_info': {'filename': 'C:\\usable\\webui_forge_cu121_torch231\\webui\\models\\Stable-diffusion\\FLUX\\FLUX.1-schnell-dev-merged-fp8-4step.safetensors', 'hash': '9e0fb423'}, 'additional_modules': [], 'unet_storage_dtype': 'nf4'}
Using online LoRAs in FP16: False
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 42.1s (prepare environment: 8.4s, import torch: 17.2s, initialize shared: 0.4s, other imports: 1.7s, list extensions: 0.1s, list SD models: 0.2s, load scripts: 5.1s, initialize extra networks: 0.1s, create ui: 6.1s, gradio launch: 2.9s).
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': True}
[GPU Setting] You will use 87.49% GPU memory (7163.00 MB) to load weights, and use 12.51% GPU memory (1024.00 MB) to do matrix computation.
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': False}
[GPU Setting] You will use 87.49% GPU memory (7163.00 MB) to load weights, and use 12.51% GPU memory (1024.00 MB) to do matrix computation.
Model selected: {'checkpoint_info': {'filename': 'C:\\usable\\webui_forge_cu121_torch231\\webui\\models\\Stable-diffusion\\FLUX\\FLUX.1-schnell-dev-merged-fp8-4step.safetensors', 'hash': '9e0fb423'}, 'additional_modules': [], 'unet_storage_dtype': None}
Using online LoRAs in FP16: False
Environment vars changed: {'stream': False, 'inference_memory': 1024.0, 'pin_shared_memory': True}
[GPU Setting] You will use 87.49% GPU memory (7163.00 MB) to load weights, and use 12.51% GPU memory (1024.00 MB) to do matrix computation.
Model selected: {'checkpoint_info': {'filename': 'C:\\usable\\webui_forge_cu121_torch231\\webui\\models\\Stable-diffusion\\FLUX\\FLUX.1-schnell-dev-merged-fp8-4step.safetensors', 'hash': '9e0fb423'}, 'additional_modules': [], 'unet_storage_dtype': 'nf4'}
Using online LoRAs in FP16: False
Loading Model: {'checkpoint_info': {'filename': 'C:\\usable\\webui_forge_cu121_torch231\\webui\\models\\Stable-diffusion\\FLUX\\FLUX.1-schnell-dev-merged-fp8-4step.safetensors', 'hash': '9e0fb423'}, 'additional_modules': [], 'unet_storage_dtype': 'nf4'}
[Unload] Trying to free all memory for cuda:0 with 0 models keep loaded ... Done.
StateDict Keys: {'transformer': 776, 'vae': 244, 'text_encoder': 198, 'text_encoder_2': 220, 'ignore': 0}
Using Default T5 Data Type: torch.float16
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
K-Model Created: {'storage_dtype': 'nf4', 'computation_dtype': torch.bfloat16}
Model loaded in 101.7s (unload existing model: 0.3s, forge model load: 101.4s).
Skipping unconditional conditioning when CFG = 1. Negative Prompts are ignored.
[Unload] Trying to free 13465.80 MB for cuda:0 with 0 models keep loaded ... Done.
Traceback (most recent call last):
  File "C:\usable\webui_forge_cu121_torch231\webui\modules_forge\main_thread.py", line 30, in work
    self.result = self.func(*self.args, **self.kwargs)
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\txt2img.py", line 121, in txt2img_function
    processed = processing.process_images(p)
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\processing.py", line 816, in process_images
    res = process_images_inner(p)
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\processing.py", line 929, in process_images_inner
    p.setup_conds()
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\processing.py", line 1519, in setup_conds
    super().setup_conds()
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\processing.py", line 501, in setup_conds
    self.c = self.get_conds_with_caching(prompt_parser.get_multicond_learned_conditioning, prompts, total_steps, [self.cached_c], self.extra_network_data)
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\processing.py", line 470, in get_conds_with_caching
    cache[1] = function(shared.sd_model, required_prompts, steps, hires_steps, shared.opts.use_old_scheduling)
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\prompt_parser.py", line 262, in get_multicond_learned_conditioning
    learned_conditioning = get_learned_conditioning(model, prompt_flat_list, steps, hires_steps, use_old_scheduling)
  File "C:\usable\webui_forge_cu121_torch231\webui\modules\prompt_parser.py", line 189, in get_learned_conditioning
    conds = model.get_learned_conditioning(texts)
  File "C:\usable\webui_forge_cu121_torch231\webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\usable\webui_forge_cu121_torch231\webui\backend\diffusion_engine\flux.py", line 77, in get_learned_conditioning
    memory_management.load_model_gpu(self.forge_objects.clip.patcher)
  File "C:\usable\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 689, in load_model_gpu
    return load_models_gpu([model])
  File "C:\usable\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 679, in load_models_gpu
    loaded_model.model_load(model_gpu_memory_when_using_cpu_swap)
  File "C:\usable\webui_forge_cu121_torch231\webui\backend\memory_management.py", line 501, in model_load
    m.weight = utils.tensor2parameter(m.weight.to(self.model.offload_device).pin_memory())
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[Memory Management] Target: JointTextEncoder, Free GPU: 7086.73 MB, Model Require: 9570.62 MB, Previously Loaded: 0.00 MB, Inference Require: 1024.00 MB, Remaining: -3507.89 MB, CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I think GPU weight plays a role in this but looking at the value it seems correct, use 7xxx VRAM. Or maybe my understanding is incorrect.

nitinmukesh · 2024-09-17T10:08:08Z

nitinmukesh
Sep 17, 2024
Author

And I am surprised to learn that changing GPU weight to 13000 worked. Though it defaulted to the max VRAM automatically.

Funny thing is it used only 5.x GB VRAM and generate the image. Last time it crashed and CUDA OOM error while using 7.8 GB.

Can someone explain plz

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

What is the correct way of using Forge for Flux checkpoints (CUDA OOM error) #1844

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

What is the correct way of using Forge for Flux checkpoints (CUDA OOM error) #1844

Uh oh!

Uh oh!

nitinmukesh Sep 17, 2024

Replies: 1 comment

Uh oh!

Uh oh!

nitinmukesh Sep 17, 2024 Author

nitinmukesh
Sep 17, 2024

nitinmukesh
Sep 17, 2024
Author