Hi there! Thanks for your great work! However, there is one problem that is giving me a real headache:
Even though I'm running wan2.1-t2v-1.3b with the command below, the decoding process ALWAYS needs a lot of VRAM/RAM:
The command I use:
sudo ./build/bin/sd -M vid_gen --diffusion-model /mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors --vae /mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors --t5xxl /mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf -p "An Asian Ballet young girl dancing" --cfg-scale 6.0 --sampling-method euler -v -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" -W 512 -H 768 --diffusion-fa --video-frames 33 --flow-shift 3.0 --offload-to-cpu --vae-tiling --clip-on-cpu --vae-on-cpu
Then I got the following debug output:
[DEBUG] stable-diffusion.cpp:144 - Using CUDA backend
[INFO ] ggml_extend.hpp:65 - ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[INFO ] ggml_extend.hpp:65 - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:65 - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:65 - Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:208 - loading diffusion model from '/mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors'
[INFO ] model.cpp:1044 - load /mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '/mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors', prefix = 'model.diffusion_model.'
[INFO ] stable-diffusion.cpp:248 - loading t5xxl from '/mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf'
[INFO ] model.cpp:1041 - load /mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf using gguf format
[DEBUG] model.cpp:1058 - init from '/mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf'
[INFO ] stable-diffusion.cpp:255 - loading vae from '/mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors'
[INFO ] model.cpp:1044 - load /mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors using safetensors format
[DEBUG] model.cpp:1151 - init from '/mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors', prefix = 'vae.'
[DEBUG] model.cpp:1784 - patch_embedding_channels 24576
[INFO ] stable-diffusion.cpp:267 - Version: Wan 2.x
[INFO ] stable-diffusion.cpp:298 - Weight type: f16
[INFO ] stable-diffusion.cpp:299 - Conditioner weight type: q4_K
[INFO ] stable-diffusion.cpp:300 - Diffusion model weight type: f16
[INFO ] stable-diffusion.cpp:301 - VAE weight type: NONE
[DEBUG] stable-diffusion.cpp:303 - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:338 - CLIP: Using CPU backend
[INFO ] stable-diffusion.cpp:342 - Using flash attention in the diffusion model
[INFO ] wan.hpp:2131 - Wan2.1-T2V-1.3B
[DEBUG] ggml_extend.hpp:1725 - t5 params backend buffer size = 6513.95 MB(RAM) (242 tensors)
[DEBUG] ggml_extend.hpp:1725 - Wan2.1-T2V-1.3B params backend buffer size = 2708.92 MB(RAM) (825 tensors)
[INFO ] stable-diffusion.cpp:456 - VAE Autoencoder: Using CPU backend
[DEBUG] ggml_extend.hpp:1725 - wan_vae params backend buffer size = 242.10 MB(RAM) (194 tensors)
[DEBUG] stable-diffusion.cpp:565 - loading weights
[DEBUG] model.cpp:1961 - using 16 threads for model loading
[DEBUG] model.cpp:2044 - loading tensors from /mnt/f/ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors
|================================> | 825/1261 - 183.82it/s
[DEBUG] model.cpp:2044 - loading tensors from /mnt/f/ComfyUI/models/text_encoders/umt5-xxl-encoder-Q4_K_S.gguf
|==========================================> | 1067/1261 - 71.94it/s
[DEBUG] model.cpp:2044 - loading tensors from /mnt/f/ComfyUI/models/vae/wan_2.1_vae.safetensors
|==================================================| 1261/1261 - 82.63it/s
[INFO ] model.cpp:2282 - loading tensors completed, taking 15.27s (process: 0.01s, read: 9.10s, memcpy: 0.00s, convert: 0.15s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:646 - total params memory size = 9464.97MB (VRAM 2708.92MB, RAM 6756.05MB): text_encoders 6513.95MB(RAM), diffusion_model 2708.92MB(VRAM), vae 242.10MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:701 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:725 - finished loaded file
[INFO ] stable-diffusion.cpp:2493 - generate_video 512x768x33
[INFO ] stable-diffusion.cpp:874 - attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:894 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:895 - prompt after extract and remove lora: "A sexy Asian Ballet young girl dancing"
[DEBUG] conditioner.hpp:1267 - parse 'An Asian Ballet young girl dancing' to [['A sexy Asian Ballet young girl dancing', 1], ]
[DEBUG] t5.hpp:402 - token length: 512
[DEBUG] ggml_extend.hpp:1550 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1359 - computing condition graph completed, taking 7025 ms
[DEBUG] conditioner.hpp:1267 - parse '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走' to [['色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部, 畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走', 1], ]
[DEBUG] t5.hpp:402 - token length: 512
[DEBUG] ggml_extend.hpp:1550 - t5 compute buffer size: 297.00 MB(RAM)
[DEBUG] conditioner.hpp:1359 - computing condition graph completed, taking 7205 ms
[INFO ] stable-diffusion.cpp:2757 - get_learned_condition completed, taking 14239 ms
[DEBUG] stable-diffusion.cpp:2819 - sample 64x96x9
[INFO ] ggml_extend.hpp:1648 - Wan2.1-T2V-1.3B offload params (2708.92 MB, 825 tensors) to runtime backend (CUDA0), taking 0.94s
[DEBUG] ggml_extend.hpp:1550 - Wan2.1-T2V-1.3B compute buffer size: 659.38 MB(VRAM)
|==================================================| 20/20 - 5.96s/it
[INFO ] stable-diffusion.cpp:2846 - sampling completed, taking 120.77s
[INFO ] stable-diffusion.cpp:2867 - generating latent video completed, taking 121.16s
[DEBUG] ggml_extend.hpp:1550 - wan_vae compute buffer size: 19349.82 MB(RAM)
It's a real headache that the VAE decoding process takes so long, and I have no idea how to optimize it. 🤔
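To put the numbers from the log in perspective, here is a rough sketch of the arithmetic. It assumes the Wan 2.1 VAE compresses 8x spatially and 4x temporally (which matches the `sample 64x96x9` line for a 512x768x33 request), that the decoded video is held as 3-channel float32, and that the "MB" in the log means 1024*1024 bytes. The function names are made up for illustration and are not part of stable-diffusion.cpp.

```python
# Assumption: Wan 2.1 VAE compression factors (8x spatial, 4x temporal).
SPATIAL_FACTOR = 8
TEMPORAL_FACTOR = 4


def latent_shape(width, height, frames):
    """Return (latent_width, latent_height, latent_frames) for a video request."""
    return (
        width // SPATIAL_FACTOR,
        height // SPATIAL_FACTOR,
        (frames - 1) // TEMPORAL_FACTOR + 1,
    )


def decoded_video_mb(width, height, frames, channels=3, bytes_per_value=4):
    """Size of the raw decoded video, assuming float32 RGB, in units of 1024*1024 bytes."""
    return width * height * frames * channels * bytes_per_value / (1024 * 1024)


# Values copied from the command line and the debug log above.
width, height, frames = 512, 768, 33
reported_vae_buffer_mb = 19349.82

shape = latent_shape(width, height, frames)
output_mb = decoded_video_mb(width, height, frames)
ratio = reported_vae_buffer_mb / output_mb

print(f"latent shape: {shape[0]}x{shape[1]}x{shape[2]}")
print(f"raw decoded video: {output_mb:.1f} MB")
print(f"VAE compute buffer is about {ratio:.0f}x the raw output size")
```

Under those assumptions the final video itself is only about 148.5 MB, so the 19349.82 MB compute buffer is roughly 130 times larger than the output it produces, which is why I suspect the intermediate activations of the decoder are what dominate here, even with `--vae-tiling` enabled.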