
Using FP8 for inference without CPU offloading can introduce noise. #10302

@todochenxi

Description


Describe the bug

If I use pipe.enable_model_cpu_offload(device=device), the model performs inference correctly after a warmup pass. However, if I comment out that line and instead move the pipeline to the GPU directly with .to(device, dtype=dtype), the inference results are noisy.

Reproduction

from diffusers import (
    AutoencoderKL,
    FlowMatchEulerDiscreteScheduler,
    FluxPipeline,
    FluxTransformer2DModel,
)
from transformers import T5EncoderModel, CLIPTextModel, CLIPTokenizer, T5TokenizerFast
from optimum.quanto import freeze, qfloat8, quantize
import torch

dtype = torch.bfloat16
bfl_repo = "black-forest-labs/FLUX.1-dev"
device = "cuda"

scheduler      = FlowMatchEulerDiscreteScheduler.from_pretrained(bfl_repo, subfolder="scheduler", torch_dtype=dtype)
text_encoder   = CLIPTextModel.from_pretrained(bfl_repo, subfolder="text_encoder", torch_dtype=dtype)
tokenizer      = CLIPTokenizer.from_pretrained(bfl_repo, subfolder="tokenizer", clean_up_tokenization_spaces=True)
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
tokenizer_2    = T5TokenizerFast.from_pretrained(bfl_repo, subfolder="tokenizer_2", clean_up_tokenization_spaces=True)
vae            = AutoencoderKL.from_pretrained(bfl_repo, subfolder="vae", torch_dtype=dtype)

transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype
)
quantize(transformer, weights=qfloat8)
freeze(transformer)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

pipe = FluxPipeline(
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    text_encoder_2=text_encoder_2,
    tokenizer_2=tokenizer_2,
    vae=vae,
    transformer=transformer,
).to(device, dtype=dtype)  # with this, results are noisy

# pipe.enable_model_cpu_offload(device=device)  # with this instead, results are correct

# warmup pass
params = {
    "prompt": "a cat",
    "num_images_per_prompt": 1,
    "num_inference_steps": 1,
    "width": 64,
    "height": 64,
    "guidance_scale": 7,
}
image = pipe(**params).images[0]

# actual inference
params = {
    "prompt": "a cat",
    "num_images_per_prompt": 1,
    "num_inference_steps": 25,
    "width": 512,
    "height": 512,
    "guidance_scale": 7,
}
image = pipe(**params).images[0]
image.save("1.jpg")

Logs

No response

System Info

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.5.1+cu121 with CUDA 1201 (you have 2.4.1+cu121)
Python 3.10.15 (you have 3.10.13)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details


  • πŸ€— Diffusers version: 0.32.0.dev0
  • Platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.10.13
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.26.2
  • Transformers version: 4.46.2
  • Accelerate version: 0.31.0
  • PEFT version: 0.14.0
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.3
  • xFormers version: 0.0.28.post3
  • Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB
    NVIDIA GeForce RTX 3090, 24576 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@yiyixuxu @DN6
