
Global PyTorch settings cause massive memory usage and compilation delays across all ComfyUI workflows #10

@Shoe-Lac3

Description


Hi, I ran into a pretty serious issue where having this custom node installed was causing massive memory and inference startup problems throughout my entire ComfyUI setup, even for workflows that don't use any MiniCPM nodes at all.

After installing it, every model started exhibiting strange behavior:

  • 1–3 minutes of "compilation" time on first load for every model (VAE, diffusion, CLIP, etc.)
  • A 6GB SDXL model was consuming 96GB VRAM and ~100GB system RAM
  • This happened even in workflows with zero MiniCPM nodes


The issue seems to be in both AILab_MiniCPM.py and AILab_MiniCPM_GGUF.py, around lines 21-28 and 31-38 respectively. These files set global PyTorch configurations at import time that affect the entire ComfyUI backend:

if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    if hasattr(torch.backends, 'cuda'):
        if hasattr(torch.backends.cuda, 'matmul'):
            torch.backends.cuda.matmul.allow_tf32 = True
        if hasattr(torch.backends.cuda, 'allow_tf32'):
            torch.backends.cuda.allow_tf32 = True
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

Simply commenting out those lines seems to resolve the issue: MiniCPM performance stays the same, ComfyUI memory usage returns to normal levels, and the startup delay disappears.
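
If the tuning flags are genuinely useful for MiniCPM inference, a less invasive option might be to apply them only around the inference call and restore ComfyUI's previous values afterwards, rather than setting them at import time. A rough sketch (hypothetical helper, not code from this repo):

import contextlib
import torch

@contextlib.contextmanager
def scoped_cuda_tuning():
    # Save the process-wide flags, apply the tuning values, and restore the
    # originals when the MiniCPM call finishes.
    if not torch.cuda.is_available():
        yield
        return
    old_benchmark = torch.backends.cudnn.benchmark
    old_matmul_tf32 = torch.backends.cuda.matmul.allow_tf32
    try:
        torch.backends.cudnn.benchmark = True
        torch.backends.cuda.matmul.allow_tf32 = True
        yield
    finally:
        torch.backends.cudnn.benchmark = old_benchmark
        torch.backends.cuda.matmul.allow_tf32 = old_matmul_tf32

# Hypothetical usage inside the node's inference method:
# with scoped_cuda_tuning():
#     output = model.generate(...)

Note that PYTORCH_CUDA_ALLOC_CONF is read once when the CUDA allocator initializes, so it can't be scoped this way and is probably best left to the user's own launch configuration.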

Thanks for the great work on this node otherwise.

Some background on these settings, per Claude (so there may be inaccuracies); a snippet to check the current values in your own ComfyUI process follows the list:

  1. torch.backends.cudnn.benchmark = True
  • Purpose: Enables cuDNN's auto-tuning to find the fastest convolution algorithms
  • When it helps: Models with fixed input sizes doing many repeated inferences
  • Why it was problematic: ComfyUI workflows have variable input sizes (different image resolutions, batch sizes), so cuDNN keeps re-benchmarking algorithms for every new shape, which is counterproductive and causes the "compilation" delays
  2. torch.backends.cuda.matmul.allow_tf32 = True
  • Purpose: Uses TensorFloat-32 (TF32) for matrix operations on Ampere+ GPUs for speed
  • When it helps: Large matrix multiplications in transformers
  • Why it was problematic: Recent PyTorch releases leave TF32 matmul off by default, so enabling it globally silently changes numerical precision for every other node in the process
  3. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
  • Purpose: Forces smaller memory chunks to reduce fragmentation
  • When it helps: When VRAM is limited and a large model barely fits
  • Why it was harmful: In this setup it forced unnecessary fragmentation, causing the allocator to fall back on far more system RAM
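
A quick way to verify whether these globals have been changed in a running ComfyUI process is to print them before and after the custom node package is imported; the snippet below is only an illustration:

import os
import torch

# Current process-wide values; compare before and after the node is imported.
print("cudnn.benchmark:        ", torch.backends.cudnn.benchmark)
print("cuda.matmul.allow_tf32: ", torch.backends.cuda.matmul.allow_tf32)
print("cudnn.allow_tf32:       ", torch.backends.cudnn.allow_tf32)
print("PYTORCH_CUDA_ALLOC_CONF:", os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<unset>"))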
