Wan2.1 T2V 1.3B VAE decoding needs too much VRAM/RAM #872

@ReloadProcz

Hi there! Thanks for your great work! However, there is a problem that is giving me a real headache:

Even though I'm running the small wan2.1-t2v-1.3b model with the command below, the VAE decoding step ALWAYS needs a lot of VRAM/RAM.

The command I use:

sudo ./build/bin/sd -M vid_gen --diffusion-model /mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors --vae /mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors --t5xxl /mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf -p "An Asian Ballet young girl dancing" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 512 -H 768 --diffusion-fa --video-frames 33 --flow-shift 3.0 --offload-to-cpu --vae-tiling --clip-on-cpu --vae-on-cpu

Then I got the following debug output:

[DEBUG] stable-diffusion.cpp:144  - Using CUDA backend
[INFO ] ggml_extend.hpp:65   - ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[INFO ] ggml_extend.hpp:65   - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:65   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:65   -   Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:208  - loading diffusion model from '/mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors'
[INFO ] model.cpp:1044 - load /mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '/mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:248  - loading t5xxl from '/mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf'
[INFO ] model.cpp:1041 - load /mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf using gguf format
[DEBUG] model.cpp:1058 - init from '/mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf'
[INFO ] stable-diffusion.cpp:255  - loading vae from '/mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors'
[INFO ] model.cpp:1044 - load /mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '/mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors', prefix = 'vae.'
[DEBUG] model.cpp:1784 - patch_embedding_channels 24576
[INFO ] stable-diffusion.cpp:267  - Version: Wan 2.x
[INFO ] stable-diffusion.cpp:298  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:299  - Conditioner weight type:     q4_K
[INFO ] stable-diffusion.cpp:300  - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:301  - VAE weight type:             NONE
[DEBUG] stable-diffusion.cpp:303  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:338  - CLIP: Using CPU backend
[INFO ] stable-diffusion.cpp:342  - Using flash attention in the diffusion model
[INFO ] wan.hpp:2131 - Wan2.1-T2V-1.3B
[DEBUG] ggml_extend.hpp:1725 - t5 params backend buffer size =  6513.95 MB(RAM) (242 tensors)
[DEBUG] ggml_extend.hpp:1725 - Wan2.1-T2V-1.3B params backend buffer size =  2708.92 MB(RAM) (825 tensors)
[INFO ] stable-diffusion.cpp:456  - VAE Autoencoder: Using CPU backend
[DEBUG] ggml_extend.hpp:1725 - wan_vae params backend buffer size =  242.10 MB(RAM) (194 tensors)
[DEBUG] stable-diffusion.cpp:565  - loading weights
[DEBUG] model.cpp:1961 - using 16 threads for model loading
[DEBUG] model.cpp:2044 - loading tensors from /mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors
  |================================>                 | 825/1261 - 183.82it/s
[DEBUG] model.cpp:2044 - loading tensors from /mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf
  |==========================================>       | 1067/1261 - 71.94it/s
[DEBUG] model.cpp:2044 - loading tensors from /mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors
  |==================================================| 1261/1261 - 82.63it/s
[INFO ] model.cpp:2282 - loading tensors completed, taking 15.27s (process: 0.01s, read: 9.10s, memcpy: 0.00s, convert: 0.15s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:646  - total params memory size = 9464.97MB (VRAM 2708.92MB, RAM 6756.05MB): text_encoders 6513.95MB(RAM), diffusion_model 2708.92MB(VRAM), vae 242.10MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:701  - running in FLOW mode
[DEBUG] stable-diffusion.cpp:725  - finished loaded file
[INFO ] stable-diffusion.cpp:2493 - generate_video 512x768x33
[INFO ] stable-diffusion.cpp:874  - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:894  - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:895  - prompt after extract and remove lora: "A sexy Asian Ballet young girl dancing"
[DEBUG] conditioner.hpp:1267 - parse 'An Asian Ballet young girl dancing' to [['A sexy Asian Ballet young girl dancing', 1], ]
[DEBUG] t5.hpp:402  - token length: 512
[DEBUG] ggml_extend.hpp:1550 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1359 - computing condition graph completed, taking 7025 ms
[DEBUG] conditioner.hpp:1267 - parse '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走' to [['色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走', 1], ]
[DEBUG] t5.hpp:402  - token length: 512
[DEBUG] ggml_extend.hpp:1550 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1359 - computing condition graph completed, taking 7205 ms
[INFO ] stable-diffusion.cpp:2757 - get_learned_condition completed, taking 14239 ms
[DEBUG] stable-diffusion.cpp:2819 - sample 64x96x9
[INFO ] ggml_extend.hpp:1648 - Wan2.1-T2V-1.3B offload params (2708.92 MB, 825 tensors) to runtime backend (CUDA0), taking 0.94s
[DEBUG] ggml_extend.hpp:1550 - Wan2.1-T2V-1.3B compute buffer size: 659.38 MB(VRAM)
  |==================================================| 20/20 - 5.96s/it
[INFO ] stable-diffusion.cpp:2846 - sampling completed, taking 120.77s
[INFO ] stable-diffusion.cpp:2867 - generating latent video completed, taking 121.16s
[DEBUG] ggml_extend.hpp:1550 - wan_vae compute buffer size: 19349.82 MB(RAM)
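
To make the numbers in the log easier to compare, here is a minimal sketch that pulls the "buffer size" lines out of the output above and finds the largest one. The regular expression and the helper name `buffer_sizes` are my own assumptions based on the lines shown here; they are not part of stable-diffusion.cpp.

```python
import re

# Buffer-size lines copied from the debug output above.
LOG = """\
[DEBUG] ggml_extend.hpp:1725 - t5 params backend buffer size =  6513.95 MB(RAM) (242 tensors)
[DEBUG] ggml_extend.hpp:1725 - Wan2.1-T2V-1.3B params backend buffer size =  2708.92 MB(RAM) (825 tensors)
[DEBUG] ggml_extend.hpp:1725 - wan_vae params backend buffer size =  242.10 MB(RAM) (194 tensors)
[DEBUG] ggml_extend.hpp:1550 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] ggml_extend.hpp:1550 - Wan2.1-T2V-1.3B compute buffer size: 659.38 MB(VRAM)
[DEBUG] ggml_extend.hpp:1550 - wan_vae compute buffer size: 19349.82 MB(RAM)
"""

# Hypothetical pattern, written only against the line shapes seen in this log.
PATTERN = re.compile(
    r" - (?P<name>\S+) (?P<kind>params backend|compute) buffer size\s*[=:]\s*"
    r"(?P<mb>[\d.]+) MB\((?P<where>V?RAM)\)"
)


def buffer_sizes(log_text):
    """Return (name, kind, size_in_mb, location) for every buffer-size line."""
    rows = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            rows.append(
                (match["name"], match["kind"], float(match["mb"]), match["where"])
            )
    return rows


rows = buffer_sizes(LOG)
largest = max(rows, key=lambda row: row[2])

# Print the buffers from largest to smallest.
for name, kind, mb, where in sorted(rows, key=lambda row: row[2], reverse=True):
    print(f"{name:18s} {kind:15s} {mb:10.2f} MB ({where})")
```

The wan_vae compute buffer (19349.82 MB) is by far the largest entry, roughly 80 times the 242.10 MB of VAE weights.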

It's a real headache that VAE decoding needs this much memory (a 19349.82 MB compute buffer in RAM) and takes so long. I have no idea how to optimize it. 🤔
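
For scale, here is a rough back-of-the-envelope sketch of the shapes involved. It assumes the VAE compresses 8x spatially and 4x temporally with the first frame kept separately, which is consistent with the log going from `generate_video 512x768x33` to `sample 64x96x9`. It also assumes the decoded video is held as float32 RGB and that "MB" in the log means MiB. Both function names are made up for illustration.

```python
def latent_shape(width, height, frames, spatial=8, temporal=4):
    """Latent (width, height, frames) for the video VAE.

    Assumption: 8x spatial and 4x temporal compression, with the first
    frame kept on its own, so latent_frames = (frames - 1) // temporal + 1.
    """
    return (width // spatial, height // spatial, (frames - 1) // temporal + 1)


def decoded_video_mib(width, height, frames, channels=3, bytes_per_value=4):
    """Size in MiB of the decoded video, assuming float32 RGB values."""
    total_bytes = width * height * frames * channels * bytes_per_value
    return total_bytes / (1024 * 1024)


print(latent_shape(512, 768, 33))       # matches "sample 64x96x9" in the log
print(decoded_video_mib(512, 768, 33))  # size of the final output alone
```

Under these assumptions the finished 512x768x33 video is only about 148.5 MiB, so nearly all of the 19349.82 MB compute buffer would have to be intermediate activations inside the decoder, not the output itself.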
