huggingface · yiyixuxu · Jan 16, 2025 · Jan 10, 2025 · Jan 10, 2025 · Jan 10, 2025
diff --git a/examples/research_projects/pytorch_xla/inference/flux/README.md b/examples/research_projects/pytorch_xla/inference/flux/README.md
@@ -0,0 +1,100 @@
+# Generating images using Flux and PyTorch/XLA
+
+The `flux_inference` script shows how to do image generation using Flux on TPU devices using PyTorch/XLA. It uses the pallas kernel for flash attention for faster generation.
+
+It has been tested on [Trillium](https://cloud.google.com/blog/products/compute/introducing-trillium-6th-gen-tpus) TPU versions. No other TPU types have been tested.
+
+## Create TPU
+
+To create a TPU on Google Cloud, follow [this guide](https://cloud.google.com/tpu/docs/v6e)
+
+## Setup TPU environment
+
+SSH into the VM and install Pytorch, Pytorch/XLA
+
+```bash
+pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
+pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
+```
+
+Verify that PyTorch and PyTorch/XLA were installed correctly:
+
+```bash
+python3 -c "import torch; import torch_xla;"
+```
+
+Install dependencies
+
+```bash
+pip install transformers accelerate sentencepiece structlog
+pushd ../../..
+pip install .
+popd
+```
+
+## Run the inference job
+
+### Authenticate
+
+Run the following command to authenticate your token in order to download Flux weights.
+
+```bash
+huggingface-cli login
+```
+
+Then run:
+
+```bash
+python flux_inference.py
+```
+
+The script loads the text encoders onto the CPU and the Flux transformer and VAE models onto the TPU. The first time the script runs, the compilation time is longer, while the cache stores the compiled programs. On subsequent runs, compilation is much faster and the subsequent passes being the fastest. 
+
+On a Trillium v6e-4, you should expect ~9 sec / 4 images or 2.25 sec / image (as devices run generation in parallel):
+
+```bash
+WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
+Loading checkpoint shards: 100%|███████████████████████████████| 2/2 [00:00<00:00,  7.01it/s]
+Loading pipeline components...:  40%|██████████▍               | 2/5 [00:00<00:00,  3.78it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
+Loading pipeline components...: 100%|██████████████████████████| 5/5 [00:00<00:00,  6.72it/s]
+2025-01-10 00:51:25 [info     ] loading flux from black-forest-labs/FLUX.1-dev
+2025-01-10 00:51:25 [info     ] loading flux from black-forest-labs/FLUX.1-dev
+2025-01-10 00:51:26 [info     ] loading flux from black-forest-labs/FLUX.1-dev
+2025-01-10 00:51:26 [info     ] loading flux from black-forest-labs/FLUX.1-dev
+Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00,  4.29it/s]
+Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00,  3.26it/s]
+Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00,  3.27it/s]
+Loading pipeline components...: 100%|██████████████████████████| 3/3 [00:00<00:00,  3.25it/s]
+2025-01-10 00:51:34 [info     ] starting compilation run...   
+2025-01-10 00:51:35 [info     ] starting compilation run...   
+2025-01-10 00:51:37 [info     ] starting compilation run...   
+2025-01-10 00:51:37 [info     ] starting compilation run...   
+2025-01-10 00:52:52 [info     ] compilation took 78.5155531649998 sec.
+2025-01-10 00:52:53 [info     ] starting inference run...     
+2025-01-10 00:52:57 [info     ] compilation took 79.52986721400157 sec.
+2025-01-10 00:52:57 [info     ] compilation took 81.91776501700042 sec.
+2025-01-10 00:52:57 [info     ] compilation took 80.24951512600092 sec.
+2025-01-10 00:52:57 [info     ] starting inference run...     
+2025-01-10 00:52:57 [info     ] starting inference run...     
+2025-01-10 00:52:58 [info     ] starting inference run...     
+2025-01-10 00:53:22 [info     ] inference time: 25.112665320000815
+2025-01-10 00:53:30 [info     ] inference time: 7.7019307739992655
+2025-01-10 00:53:38 [info     ] inference time: 7.693858365000779
+2025-01-10 00:53:46 [info     ] inference time: 7.690621814001133
+2025-01-10 00:53:53 [info     ] inference time: 7.679490454000188
+2025-01-10 00:54:01 [info     ] inference time: 7.68949568500102
+2025-01-10 00:54:09 [info     ] inference time: 7.686633744000574
+2025-01-10 00:54:16 [info     ] inference time: 7.696786873999372
+2025-01-10 00:54:24 [info     ] inference time: 7.691988694999964
+2025-01-10 00:54:32 [info     ] inference time: 7.700649563999832
+2025-01-10 00:54:39 [info     ] inference time: 7.684993574001055
+2025-01-10 00:54:47 [info     ] inference time: 7.68343457499941
+2025-01-10 00:54:55 [info     ] inference time: 7.667921153999487
+2025-01-10 00:55:02 [info     ] inference time: 7.683585194001353
+2025-01-10 00:55:06 [info     ] avg. inference over 15 iterations took 8.61202360273334 sec.
+2025-01-10 00:55:07 [info     ] avg. inference over 15 iterations took 8.952725123600006 sec.
+2025-01-10 00:55:10 [info     ] inference time: 7.673799695001435
+2025-01-10 00:55:10 [info     ] avg. inference over 15 iterations took 8.849190365400379 sec.
+2025-01-10 00:55:10 [info     ] saved metric information as /tmp/metrics_report.txt
+2025-01-10 00:55:12 [info     ] avg. inference over 15 iterations took 8.940161458400205 sec.
+```
diff --git a/examples/research_projects/pytorch_xla/inference/flux/flux_inference.py b/examples/research_projects/pytorch_xla/inference/flux/flux_inference.py
@@ -0,0 +1,102 @@
+from time import perf_counter
+from pathlib import Path
+from argparse import ArgumentParser
+
+import structlog
+
+import torch
+import torch_xla.core.xla_model as xm
+import torch_xla.runtime as xr
+import torch_xla.debug.profiler as xp
+import torch_xla.debug.metrics as met
+from diffusers import FluxPipeline
+import torch_xla.distributed.xla_multiprocessing as xmp
+
+logger = structlog.get_logger()
+metrics_filepath = '/tmp/metrics_report.txt'
+
+def _main(index, args, text_pipe, ckpt_id):
+
+    cache_path = Path('/tmp/data/compiler_cache_tRiLlium_eXp')
+    cache_path.mkdir(parents=True, exist_ok=True)
+    xr.initialize_cache(str(cache_path), readonly=False)
+
+    profile_path = Path('/tmp/data/profiler_out_tRiLlium_eXp')
+    profile_path.mkdir(parents=True, exist_ok=True)
+    profiler_port = 9012
+    profile_duration = args.profile_duration
+    if args.profile:
+        logger.info(f'starting profiler on port {profiler_port}')
+        _ = xp.start_server(profiler_port)
+    device0 = xm.xla_device()
+
+    logger.info(f'loading flux from {ckpt_id}')
+    flux_pipe = FluxPipeline.from_pretrained(ckpt_id, text_encoder=None, tokenizer=None,
+                                            text_encoder_2=None, tokenizer_2=None, torch_dtype=torch.bfloat16).to(device0)
+
+    prompt = 'photograph of an electronics chip in the shape of a race car with trillium written on its side'
+    width = args.width
+    height = args.height
+    guidance = args.guidance
+    n_steps = 4 if args.schnell else 28
+
+    logger.info('starting compilation run...')
+    ts = perf_counter()
+    with torch.no_grad():
+        prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
+            prompt=prompt, prompt_2=None, max_sequence_length=512)
+    prompt_embeds = prompt_embeds.to(device0)
+    pooled_prompt_embeds = pooled_prompt_embeds.to(device0)
+
+    image = flux_pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds,
+                    num_inference_steps=28, guidance_scale=guidance, height=height, width=width).images[0]
+    logger.info(f'compilation took {perf_counter() - ts} sec.')
+    image.save('/tmp/compile_out.png')
+
+    base_seed = 4096 if args.seed is None else args.seed
+    seed_range = 1000
+    unique_seed = base_seed + index * seed_range
+    xm.set_rng_state(seed=unique_seed, device=device0)
+    times = []
+    logger.info('starting inference run...')
+    for _ in range(args.itters):
+        ts = perf_counter()
+        with torch.no_grad():
+            prompt_embeds, pooled_prompt_embeds, text_ids = text_pipe.encode_prompt(
+                prompt=prompt, prompt_2=None, max_sequence_length=512)
+        prompt_embeds = prompt_embeds.to(device0)
+        pooled_prompt_embeds = pooled_prompt_embeds.to(device0)
+
+        if args.profile:
+            xp.trace_detached(f"localhost:{profiler_port}", str(profile_path), duration_ms=profile_duration)
+        image = flux_pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=pooled_prompt_embeds,
+                        num_inference_steps=n_steps, guidance_scale=guidance, height=height, width=width).images[0]
+        inference_time = perf_counter() - ts
+        if index == 0:
+            logger.info(f"inference time: {inference_time}")
+        times.append(inference_time)
+    logger.info(f'avg. inference over {args.itters} iterations took {sum(times)/len(times)} sec.')
+    image.save(f'/tmp/inference_out-{index}.png')
+    if index == 0:
+        metrics_report = met.metrics_report()
+        with open(metrics_filepath, 'w+') as fout:
+            fout.write(metrics_report)
+        logger.info(f'saved metric information as {metrics_filepath}')
+
+if __name__ == '__main__':
+    parser = ArgumentParser()
+    parser.add_argument('--schnell', action='store_true', help='run flux schnell instead of dev')
+    parser.add_argument('--width', type=int, default=1024, help='width of the image to generate')
+    parser.add_argument('--height', type=int, default=1024, help='height of the image to generate')
+    parser.add_argument('--guidance', type=float, default=3.5, help='gauidance strentgh for dev')
+    parser.add_argument('--seed', type=int, default=None, help='seed for inference')
+    parser.add_argument('--profile', action='store_true', help='enable profiling')
+    parser.add_argument('--profile-duration', type=int, default=10000, help='duration for profiling in msec.')
+    parser.add_argument('--itters', type=int, default=15, help='tiems to run inference and get avg time in sec.')
+    args = parser.parse_args()
+    if args.schnell:
+        ckpt_id = "black-forest-labs/FLUX.1-schnell"
+    else:
+        ckpt_id = "black-forest-labs/FLUX.1-dev"
+    text_pipe = FluxPipeline.from_pretrained(ckpt_id, transformer=None, vae=None, torch_dtype=torch.bfloat16).to('cpu')
+    xmp.spawn(_main, args=(args, text_pipe, ckpt_id))
diff --git a/...s/research_projects/pytorch_xla/README.md → ...orch_xla/training/text_to_image/README.md b/...s/research_projects/pytorch_xla/README.md → ...orch_xla/training/text_to_image/README.md
diff --git a/...rch_projects/pytorch_xla/requirements.txt → ...a/training/text_to_image/requirements.txt b/...rch_projects/pytorch_xla/requirements.txt → ...a/training/text_to_image/requirements.txt
diff --git a/...ts/pytorch_xla/train_text_to_image_xla.py → .../text_to_image/train_text_to_image_xla.py b/...ts/pytorch_xla/train_text_to_image_xla.py → .../text_to_image/train_text_to_image_xla.py
diff --git a/src/diffusers/models/attention_processor.py b/src/diffusers/models/attention_processor.py
@@ -2318,9 +2318,12 @@ def __call__(
             query = apply_rotary_emb(query, image_rotary_emb)
             key = apply_rotary_emb(key, image_rotary_emb)
 
-        hidden_states = F.scaled_dot_product_attention(
-            query, key, value, attn_mask=attention_mask, dropout_p=0.0, is_causal=False
-        )
+        if XLA_AVAILABLE:
+           query /= math.sqrt(head_dim)
+           hidden_states = flash_attention(query, key, value, causal=False)
+        else:
+            hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
+
         hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
         hidden_states = hidden_states.to(query.dtype)
 
@@ -2521,7 +2524,12 @@ def __call__(
             query = apply_rotary_emb(query, image_rotary_emb)
             key = apply_rotary_emb(key, image_rotary_emb)
 
-        hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
+        if XLA_AVAILABLE:
+          query /= math.sqrt(head_dim)
+          hidden_states = flash_attention(query, key, value)
+        else:    
+          hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
+
         hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
         hidden_states = hidden_states.to(query.dtype)