
Global PyTorch settings cause massive memory usage and compilation delays across all ComfyUI workflows #10

@Shoe-Lac3

Description


Hi, I ran into a pretty serious issue where having this custom node installed was causing massive memory and inference startup problems throughout my entire ComfyUI setup, even for workflows that don't use any MiniCPM nodes at all.

After installing it, every model started exhibiting strange behavior:

  • 1–3 minutes of "compilation" time on first load for every model (VAE, diffusion, CLIP, etc.)
  • A 6GB SDXL model was consuming 96GB VRAM and ~100GB system RAM
  • This happened even in workflows with zero MiniCPM nodes


The issue seems to be in both AILab_MiniCPM.py and AILab_MiniCPM_GGUF.py, around lines 21-28 and 31-38 respectively. These files set global PyTorch configurations at import time that affect the entire ComfyUI backend:

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    if hasattr(torch.backends, 'cuda'):
        if hasattr(torch.backends.cuda, 'matmul'):
            torch.backends.cuda.matmul.allow_tf32 = True
        if hasattr(torch.backends.cuda, 'allow_tf32'):
            torch.backends.cuda.allow_tf32 = True
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

Simply commenting out those lines seems to resolve the issue: MiniCPM performance stays the same, ComfyUI memory usage returns to normal levels, and the startup delay disappears.
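
If the tuning flags are genuinely useful for MiniCPM inference, a less invasive option might be to apply them only around the inference call and restore ComfyUI's previous values afterwards, rather than setting them at import time. A rough sketch (hypothetical helper, not code from this repo):

import contextlib
import torch

@contextlib.contextmanager
def scoped_cuda_tuning():
    # Save the process-wide flags, apply the tuning values, and restore the
    # originals when the MiniCPM call finishes.
    if not torch.cuda.is_available():
        yield
        return
    old_benchmark = torch.backends.cudnn.benchmark
    old_matmul_tf32 = torch.backends.cuda.matmul.allow_tf32
    try:
        torch.backends.cudnn.benchmark = True
        torch.backends.cuda.matmul.allow_tf32 = True
        yield
    finally:
        torch.backends.cudnn.benchmark = old_benchmark
        torch.backends.cuda.matmul.allow_tf32 = old_matmul_tf32

# Hypothetical usage inside the node's inference method:
# with scoped_cuda_tuning():
#     output = model.generate(...)

Note that PYTORCH_CUDA_ALLOC_CONF is read once when the CUDA allocator initializes, so it can't be scoped this way and is probably best left to the user's own launch configuration.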

Thanks for the great work on this node otherwise.

Some background on these settings, per Claude (so there may be inaccuracies); a snippet to check the current values in your own ComfyUI process follows the list:

  1. torch.backends.cudnn.benchmark = True
  • Purpose: Enables cuDNN's auto-tuning to find the fastest convolution algorithms
  • When it helps: Models with fixed input sizes doing many repeated inferences
  • Why it was problematic: ComfyUI workflows have variable input sizes (different image resolutions, batch sizes), so cuDNN keeps re-benchmarking algorithms for every new shape, which is counterproductive and causes the "compilation" delays
  2. torch.backends.cuda.matmul.allow_tf32 = True
  • Purpose: Uses TensorFloat-32 (TF32) for matrix operations on Ampere+ GPUs for speed
  • When it helps: Large matrix multiplications in transformers
  • Why it was problematic: Recent PyTorch releases leave TF32 matmul off by default, so enabling it globally silently changes numerical precision for every other node in the process
  3. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
  • Purpose: Forces smaller memory chunks to reduce fragmentation
  • When it helps: When VRAM is limited and a large model barely fits
  • Why it was harmful: In this setup it forced unnecessary fragmentation, causing the allocator to fall back on far more system RAM
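
A quick way to verify whether these globals have been changed in a running ComfyUI process is to print them before and after the custom node package is imported; the snippet below is only an illustration:

import os
import torch

# Current process-wide values; compare before and after the node is imported.
print("cudnn.benchmark:        ", torch.backends.cudnn.benchmark)
print("cuda.matmul.allow_tf32: ", torch.backends.cuda.matmul.allow_tf32)
print("cudnn.allow_tf32:       ", torch.backends.cudnn.allow_tf32)
print("PYTORCH_CUDA_ALLOC_CONF:", os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<unset>"))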
