Description
Hi, I ran into a pretty serious issue where having this custom node installed was causing massive memory usage and slow inference startup throughout my entire ComfyUI setup, even for workflows that don't use any MiniCPM nodes at all.
After installing, every model started exhibiting weird behavior:
- 1-3 minutes of "compilation" time on first load for every model (including VAE, diffusion, CLIP, etc.)
- A 6GB SDXL model was consuming 96GB VRAM and ~100GB system RAM
- This happened even in workflows with zero MiniCPM nodes
The issue seems to be in both AILab_MiniCPM.py and AILab_MiniCPM_GGUF.py around lines 21-28 and 31-38 respectively. These files set global PyTorch configuration at import time, which affects the entire ComfyUI backend:
if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    if hasattr(torch.backends, 'cuda'):
        if hasattr(torch.backends.cuda, 'matmul'):
            torch.backends.cuda.matmul.allow_tf32 = True
        if hasattr(torch.backends.cuda, 'allow_tf32'):
            torch.backends.cuda.allow_tf32 = True

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
Simply commenting out those lines seems to resolve the issue: MiniCPM performance remains the same, ComfyUI memory usage returns to normal levels, and there's no more startup delay.
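If the optimizations are worth keeping for MiniCPM itself, one option would be to scope them to the node's own inference call and restore the previous global state afterwards, rather than setting them at module import. A minimal sketch (my own suggestion, not code from this repo; the context manager name and usage are hypothetical):

import contextlib
import torch

@contextlib.contextmanager
def minicpm_cuda_settings():
    # Apply the speed tweaks only while a MiniCPM node is actually running.
    if not torch.cuda.is_available():
        yield
        return
    prev_benchmark = torch.backends.cudnn.benchmark
    prev_matmul_tf32 = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cudnn.benchmark = True
    torch.backends.cuda.matmul.allow_tf32 = True
    try:
        yield
    finally:
        # Restore whatever ComfyUI had configured before this node ran.
        torch.backends.cudnn.benchmark = prev_benchmark
        torch.backends.cuda.matmul.allow_tf32 = prev_matmul_tf32

The node would then wrap its generate/chat call in the context manager, so the rest of ComfyUI never sees the changed globals.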
Thanks for the great work on this node otherwise.
Some background on these settings, per Claude (there may be inaccuracies):
torch.backends.cudnn.benchmark = True
- Purpose: Enables cuDNN's auto-tuning to find the fastest convolution algorithms
- When it helps: For models with fixed input sizes doing many repeated inferences
- Why it was problematic: ComfyUI workflows have variable input sizes (different image resolutions, batch sizes), so the auto-tuning becomes counterproductive and causes compilation delays
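A rough way to see the re-tuning cost (an illustrative sketch I put together, assuming a CUDA-capable machine; not code from this repo):

import time
import torch

torch.backends.cudnn.benchmark = True
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()

# The first call at each new input shape triggers cuDNN's algorithm search;
# repeated shapes reuse the cached choice and run much faster.
for h, w in [(512, 512), (512, 512), (768, 768)]:
    x = torch.randn(1, 3, h, w, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    conv(x)
    torch.cuda.synchronize()
    print(f"{h}x{w}: {time.time() - start:.3f}s")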
torch.backends.cuda.matmul.allow_tf32 = True
- Purpose: Uses TensorFloat-32 (TF32) for matrix operations on Ampere+ GPUs for speed
- When it helps: Large matrix multiplications in transformers
- Why it was of limited benefit: Modern PyTorch already enables TF32 for cuDNN convolutions by default on supported hardware, while matmul TF32 remains opt-in
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
- Purpose: Forces smaller memory chunks to reduce fragmentation
- When it helps: When you have limited VRAM and need to fit large models
- Why it was harmful: This setting was forcing unnecessary memory fragmentation, causing the allocator to request far more system RAM as backup
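To verify in a given ComfyUI session whether these globals have been touched, here is a quick diagnostic (run it from a Python console in the same environment, or drop it temporarily into any custom node):

import os
import torch

# Print the current global state; compare runs with and without this node installed.
print("cudnn.benchmark:        ", torch.backends.cudnn.benchmark)
print("cudnn.allow_tf32:       ", torch.backends.cudnn.allow_tf32)
print("cuda.matmul.allow_tf32: ", torch.backends.cuda.matmul.allow_tf32)
print("PYTORCH_CUDA_ALLOC_CONF:", os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))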
