Wan2.1 VAE takes more GPU memory after compile (#12082)

### Describe the bug

After wrapping the Wan2.1 VAE decoder in `torch.compile`, it consumes more GPU memory than the uncompiled (eager) version, which I did not expect.

**compiled**
<img width="3006" height="1158" alt="Image" src="https://github.com/user-attachments/assets/3eff903c-af4f-422a-b407-9afdd77ef843" />

**no-compile**
<img width="2850" height="1108" alt="Image" src="https://github.com/user-attachments/assets/9e7c16e2-bf43-426a-8570-3f11f59f9c57" />

### Reproduction

```python
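import sys

import torch
from diffusers import AutoencoderKLWan


def compile_wan_vae(compile):
    model_id = 'Wan-AI/Wan2.1-T2V-14B-Diffusers'
    dtype = torch.float32
    device = 'cuda'

    # Start recording allocator events so the snapshot includes allocation history.
    torch.cuda.memory._record_memory_history()
    vae = AutoencoderKLWan.from_pretrained(
        model_id, subfolder="vae", torch_dtype=dtype
    ).to(device)

    if compile:
        # Only the decoder is compiled; the rest of the VAE stays in eager mode.
        vae.decoder = torch.compile(vae.decoder)

    shape = (1, 16, 13, 120, 120)
    # Warm-up decode; in the compiled case this is where compilation happens.
    with torch.no_grad():
        latents = torch.randn(shape, device=device, dtype=dtype)
        video = vae.decode(latents, return_dict=False)[0]
    torch.cuda.empty_cache()

    # Steady-state decodes after warm-up.
    with torch.no_grad():
        for _ in range(3):
            latents = torch.randn(shape, device=device, dtype=dtype)
            video = vae.decode(latents, return_dict=False)[0]
    # Writes "True-compile.pickle" or "False-compile.pickle".
    torch.cuda.memory._dump_snapshot(f"{compile}-compile.pickle")


if __name__ == '__main__':
    # Pass "compile" as the first argument to enable torch.compile;
    # any other argument (or none) runs the eager baseline.
    compile_wan_vae(len(sys.argv) > 1 and sys.argv[1] == 'compile')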
```
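
To compare the two runs numerically rather than by eye, the dumped snapshots can be summarized with a small helper. This is a minimal sketch, not part of the original report. It assumes the snapshot pickle is a plain dict with a `"segments"` list whose entries carry `total_size` (reserved bytes) and `allocated_size` (bytes in use), as described in PyTorch's memory snapshot documentation. The helper names `summarize_snapshot`, `load_snapshot`, and `compare_snapshots` are hypothetical.

```python
import pickle


def summarize_snapshot(snapshot):
    """Sum reserved and allocated bytes over all segments of a snapshot dict.

    Assumes each segment has "total_size" and "allocated_size" keys;
    missing keys are treated as 0.
    """
    segments = snapshot.get("segments", [])
    return {
        "reserved_bytes": sum(seg.get("total_size", 0) for seg in segments),
        "allocated_bytes": sum(seg.get("allocated_size", 0) for seg in segments),
        "num_segments": len(segments),
    }


def load_snapshot(path):
    """Load a snapshot written by torch.cuda.memory._dump_snapshot."""
    with open(path, "rb") as f:
        return pickle.load(f)


def compare_snapshots(compiled_path="True-compile.pickle",
                      eager_path="False-compile.pickle"):
    """Print reserved/allocated MiB for the compiled and eager snapshots."""
    for label, path in (("compiled", compiled_path), ("eager", eager_path)):
        summary = summarize_snapshot(load_snapshot(path))
        print(
            f"{label}: reserved={summary['reserved_bytes'] / 2**20:.1f} MiB, "
            f"allocated={summary['allocated_bytes'] / 2**20:.1f} MiB, "
            f"segments={summary['num_segments']}"
        )
```

After running the reproduction script once with `compile` and once without, calling `compare_snapshots()` in the same directory prints one line per run. These totals describe the allocator state at the moment the snapshot was dumped, not the peak. For peak usage, `torch.cuda.max_memory_allocated()` can be read at the end of each run.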

### Logs

```shell

```

### System Info


- 🤗 Diffusers version: 0.34.0
- Platform: Linux-5.10.134-16.1.3.vip.an8.x86_64-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.7.1+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.34.2
- Transformers version: 4.54.0
- Accelerate version: 1.9.0
- PEFT version: 0.16.0
- Bitsandbytes version: not installed
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: 8 × NVIDIA L20, 46068 MiB each
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

_No response_
