csirikak commented Oct 10, 2025

The WanVideoT5TextEncoderLoader is missing an fp16 option. Running the text encoder at bf16 causes a small deviation from both the fp32 ground truth and the default ComfyUI encoder. If a user has the umt5_xxl_fp16.safetensors version, running the encoder in fp16 mode would be best.

The following tensors are the text encoder's output for an empty prompt.

WanVideoWrapper @ bf16

```
'prompt_embeds':
[tensor([[-0.0007, -0.0096,  0.0081,  ...,  0.0006, -0.0264, -0.0022]], device='cuda:0', dtype=torch.bfloat16)]
```

WanVideoWrapper @ fp16

```
'prompt_embeds':
[tensor([[-0.0007, -0.0099,  0.0083,  ...,  0.0006, -0.0261, -0.0021]], device='cuda:0', dtype=torch.float16)]
```

WanVideoWrapper @ fp32

```
'prompt_embeds':
[tensor([[-0.0007, -0.0099,  0.0083,  ...,  0.0006, -0.0261, -0.0021]], device='cuda:0', dtype=torch.float32)]
```

ComfyUI

```
'prompt_embeds':
tensor([[[-0.0007, -0.0099,  0.0083,  ...,  0.0006, -0.0261, -0.0021],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         ...,
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000],
         [ 0.0000, -0.0000,  0.0000,  ...,  0.0000, -0.0000, -0.0000]]], device='cuda:0'),
```

Test Wan Text Encoder.json
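For context on why bf16 drifts more here: fp16 keeps 10 mantissa bits to bf16's 7, so for the small magnitudes in these embeddings fp16 rounds closer to fp32. A quick self-contained sketch (pure Python, using `struct`'s half-precision `'e'` format; the bf16 helper is my own round-to-nearest truncation for illustration, not WanVideoWrapper code):

```python
import struct

def to_fp16(x: float) -> float:
    """Round a float through IEEE 754 half precision (10 mantissa bits)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x: float) -> float:
    """Round a float through bfloat16 (float32 with the mantissa cut to 7 bits),
    using round-to-nearest-even on the discarded low 16 bits."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# Values taken from the fp32 dump above
for x in (-0.0099, 0.0083, -0.0261):
    print(f"{x: .4f}  fp16 err {abs(to_fp16(x) - x):.2e}  bf16 err {abs(to_bf16(x) - x):.2e}")
```

Since the fp16 grid is a strict refinement of the bf16 grid over this exponent range, fp16 error is never larger, which matches fp16 agreeing with fp32 to the printed precision above while bf16 does not.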

kijai (Owner) commented Oct 10, 2025

Did you actually test it in practice, though? In my previous experience it doesn't work: running it in fp16 causes NaNs, and even with the fallback in the code, that ends up with corrupt video. The original T5 was always bf16 too; I don't know how Comfy got it to work with the native encoding in fp16.

csirikak (Author) commented Oct 10, 2025

> Did you actually test it in practice, though? In my previous experience it doesn't work: running it in fp16 causes NaNs, and even with the fallback in the code, that ends up with corrupt video. The original T5 was always bf16 too; I don't know how Comfy got it to work with the native encoding in fp16.

@kijai Did you use the bf16 or fp16 version of the text encoder model when running at fp16?

I compared the generated video output against the bf16/fp32 baselines and observed no significant differences. To ensure a fresh comparison, use_disk_cache was disabled.

I also used a script to inspect the text encoder's output tensors for NaN or Inf values across a range of prompts when the fp16 model is run at fp16, and found none. I'll download the bf16 version to see whether running it at fp16 causes overflow issues. If that's the case, it could be smart to have the node check the format of the weights and inform the user that there's a precision mismatch between the execution dtype and the model.
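As a sketch of that weight-format check (helper names are hypothetical; it relies only on the safetensors container layout: an 8-byte little-endian header length followed by a JSON header that records each tensor's dtype, spelled "F16", "BF16", "F32", etc.):

```python
import json
import struct

# Hypothetical mapping from a node's precision options to safetensors dtype names.
DTYPE_NAMES = {"fp16": "F16", "bf16": "BF16", "fp32": "F32"}

def weight_dtypes(path: str) -> set:
    """Read the dtypes declared in a .safetensors header without loading weights.

    The container starts with an 8-byte little-endian header length, followed
    by a JSON header recording each tensor's dtype, shape, and data offsets.
    """
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return {info["dtype"] for name, info in header.items() if name != "__metadata__"}

def warn_on_mismatch(path: str, requested: str) -> bool:
    """Return True (and print a warning) if the requested execution dtype
    differs from every dtype stored in the checkpoint."""
    stored = weight_dtypes(path)
    if DTYPE_NAMES.get(requested) not in stored:
        print(f"Warning: weights are stored as {sorted(stored)} but execution "
              f"was requested at {requested}; expect extra rounding error "
              f"(or possible overflow when downcasting bf16 weights to fp16).")
        return True
    return False
```

Reading only the header keeps the check essentially free, since the weight data itself is never loaded.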

Feel free to inspect the text encoder's tensors with my script; it reports any NaNs or Infs and shows a histogram. To use it, make sure the disk cache is empty, then run the text encoder node with the disk cache option enabled.

```
python textEmbeddingAnalyzer.py "./path/to/text_encoder_cache/hash_of_prompt"
```
textEmbeddingAnalyzer.py
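The core NaN/Inf check the script performs can be sketched in a few lines of plain Python (a minimal stand-in, not the attached textEmbeddingAnalyzer.py, which works on the cached tensors and also renders the histogram):

```python
import math

def count_bad_values(values):
    """Count NaNs and Infs in a flat iterable of floats."""
    nans = infs = 0
    for v in values:
        if math.isnan(v):
            nans += 1
        elif math.isinf(v):
            infs += 1
    return nans, infs

nans, infs = count_bad_values([-0.0007, float("nan"), float("inf"), 0.0083])
print(f"NaNs: {nans}, Infs: {infs}")
```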
