entrpn
diff --git a/‎examples/research_projects/pytorch_xla/training/text_to_image/README.md‎
Lines changed: 0 additions & 170 deletions b/‎examples/research_projects/pytorch_xla/training/text_to_image/README.md‎
Lines changed: 0 additions & 170 deletions
diff --git a/‎examples/research_projects/pytorch_xla/training/text_to_image/README_sdxl.md‎
Lines changed: 54 additions & 93 deletions b/‎examples/research_projects/pytorch_xla/training/text_to_image/README_sdxl.md‎
Lines changed: 54 additions & 93 deletions
diff --git a/‎examples/research_projects/pytorch_xla/training/text_to_image/inference_sdxl.py‎
Lines changed: 0 additions & 1 deletion b/‎examples/research_projects/pytorch_xla/training/text_to_image/inference_sdxl.py‎
Lines changed: 0 additions & 1 deletion
@@ -1,21 +1,20 @@
 # Stable Diffusion XL text-to-image fine-tuning using PyTorch/XLA
 
-The `train_text_to_image_xla.py` script shows how to fine-tune stable diffusion model on TPU devices using PyTorch/XLA.
+The `train_text_to_image_sdxl.py` script shows how to fine-tune stable diffusion model on TPU devices using PyTorch/XLA.
 
-It has been tested on v4 and v5p TPU versions. Training code has been tested on multi-host. 
+It has been tested on v5p TPU versions. Training code has been tested on a single v5p-8.
 
 This script implements Distributed Data Parallel using GSPMD feature in XLA compiler
 where we shard the input batches over the TPU devices. 
 
-As of 10-31-2024, these are some expected step times.
+As of 04-04-2025, these are some expected step times.
 
 | accelerator | global batch size | step time (seconds) |
 | ----------- | ----------------- | --------- |
-| v5p-512 | 16384 | 1.01 |
-| v5p-256 | 8192 | 1.01 |
-| v5p-128 | 4096 | 1.0 |
-| v5p-64 | 2048 | 1.01 |
-
+| v5p-8 | 32 | 1.02 |
+| v5p-8 | 48 (with gradient checkpointing) | 1.42 |
+| v5p-8 | 64 (with gradient checkpointing) | 1.66 |
+|
 ## Create TPU
 
 To create a TPU on Google Cloud first set these environment variables:
@@ -39,65 +38,61 @@ You can also use other ways to reserve TPUs like GKE or queued resources.
 
 ## Setup TPU environment
 
+Assuming that you have conda setup on the VM
 Install PyTorch and PyTorch/XLA nightly versions:
 ```bash
-gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
---project=${PROJECT_ID} --zone=${ZONE} --worker=all \
---command='
-pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
-pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
-pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
-'
+conda create -n torch310 python=3.10
+conda activate torch310
+pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-2.8.0a0%2Bgit01cb351-cp310-cp310-linux_x86_64.whl
+pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.8.0%2Bgite341ff0-cp310-cp310-linux_x86_64.whl
+pip install https://storage.googleapis.com/libtpu-nightly-releases/wheels/libtpu/libtpu-0.0.11.dev20250303+nightly-py3-none-manylinux_2_27_x86_64.whl
+pip install torchax
+pip install jax==0.5.4.dev20250321 jaxlib==0.5.4.dev20250321 \
+    -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html
+pip install --no-deps torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
+
+git clone -b sdxl_training_bbahl [email protected]:entrpn/diffusers.git
+cd diffusers/
+pip install -r examples/research_projects/pytorch_xla/training/text_to_image/requirements_sdxl.txt
+
+export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
+ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ${CMAKE_PREFIX_PATH}/lib/libstdc++.so.6
+
+pip install -e .
 ```
 
-Verify that PyTorch and PyTorch/XLA were installed correctly:
+## Run the training job
 
-```bash
-gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
---project ${PROJECT_ID} --zone ${ZONE} --worker=all \
---command='python3 -c "import torch; import torch_xla;"'
-```
+Run the following command to authenticate your token.
 
-Install dependencies:
 ```bash
-gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
---project=${PROJECT_ID} --zone=${ZONE} --worker=all \
---command='
-git clone https://github.com/huggingface/diffusers.git
-cd diffusers
-git checkout main
-cd examples/research_projects/pytorch_xla/training/text_to_image/
-pip3 install -r requirements_sdxl.txt
-pip3 install pillow --upgrade
-cd ../../../../../
-pip3 install .'
+huggingface-cli login
 ```
 
-## Run the training job
-
-### Authenticate
-
-Run the following command to authenticate your token.
+Please update the variables in `train.sh` as needed
 
 ```bash
-huggingface-cli login
+export XLA_DISABLE_FUNCTIONALIZATION=1
+export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
+export PROFILE_DIR=/tmp/xla_profile/
+export CACHE_DIR=/tmp/xla_cache/
+export DATASET_NAME=lambdalabs/naruto-blip-captions
+export PER_HOST_BATCH_SIZE=64 # This is known to work on TPU v5p with gradient checkpointing.
+export TRAIN_STEPS=50
+export PROFILE_START_STEP=10
+export OUTPUT_DIR=/tmp/docker/trained-model/
+export HF_HOME="/tmp/hf_home/"
+# export XLA_HLO_DEBUG=1
+# export XLA_IR_DEBUG=1
+python examples/research_projects/pytorch_xla/training/text_to_image/train_text_to_image_sdxl.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 --dataset_name=$DATASET_NAME --resolution=1024 --center_crop --random_flip --train_batch_size=$PER_HOST_BATCH_SIZE  --max_train_steps=$TRAIN_STEPS --measure_start_step=$PROFILE_START_STEP --learning_rate=1e-06 --mixed_precision=bf16 --profile_duration=5000 --output_dir=$OUTPUT_DIR --dataloader_num_workers=8 --loader_prefetch_size=4 --device_prefetch_size=4 --xla_gradient_checkpointing
 ```
 
+
 This script only trains the unet part of the network. The VAE and text encoder
 are fixed.
 
 ```bash
-gcloud compute tpus tpu-vm ssh ${TPU_NAME} \
---project=${PROJECT_ID} --zone=${ZONE} --worker=all \
---command='
-export XLA_DISABLE_FUNCTIONALIZATION=0
-export PROFILE_DIR=/tmp/
-export CACHE_DIR=/tmp/
-export DATASET_NAME=lambdalabs/naruto-blip-captions
-export PER_HOST_BATCH_SIZE=32 # This is known to work on TPU v4. Can set this to 64 for TPU v5p
-export TRAIN_STEPS=50
-export OUTPUT_DIR=/tmp/trained-model/
-python diffusers/examples/research_projects/pytorch_xla/training/text_to_image/train_text_to_image_sdxl.py --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 --dataset_name=$DATASET_NAME --resolution=512 --center_crop --random_flip --train_batch_size=$PER_HOST_BATCH_SIZE  --max_train_steps=$TRAIN_STEPS --learning_rate=1e-06 --mixed_precision=bf16 --profile_duration=80000 --output_dir=$OUTPUT_DIR --dataloader_num_workers=8 --loader_prefetch_size=4 --device_prefetch_size=4'
+./train.sh
 ```
 
 Pass `--print_loss` if you would like to see the loss printed at every step. Be aware that printing the loss at every step disrupts the optimized flow execution, thus the step time will be longer. 
@@ -119,52 +114,18 @@ input prompts. The first pass will compile the graph and takes longer with the f
 
 ```bash
 export CACHE_DIR=/tmp/
+export OUTPUT_DIR=/tmp/trained-model
+python inference_sdxl.py
 ```
-
-```python
-import torch
-import os
-import sys
-import  numpy as np
-
-import torch_xla.core.xla_model as xm
-from time import time
-from diffusers import StableDiffusionPipeline
-import torch_xla.runtime as xr
-
-CACHE_DIR = os.environ.get("CACHE_DIR", None)
-if CACHE_DIR:
-    xr.initialize_cache(CACHE_DIR, readonly=False)
-
-def main():
-    device = xm.xla_device()
-    model_path = "jffacevedo/pxla_trained_model"
-    pipe = StableDiffusionPipeline.from_pretrained(
-        model_path, 
-        torch_dtype=torch.bfloat16
-    )
-    pipe.to(device)
-    prompt = ["A naruto with green eyes and red legs."]
-    start = time()
-    print("compiling...")
-    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
-    print(f"compile time: {time() - start}")
-    print("generate...")
-    start = time()
-    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
-    print(f"generation time (after compile) : {time() - start}")
-    image.save("naruto.png")
-
-if __name__ == '__main__':
-    main()
-```
-
 Expected Results:
 
 ```bash
 compiling...
-100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [10:03<00:00, 20.10s/it]
-compile time: 720.656970500946
+  0%|                                                                                                                                                                                                                                      | 0/30 [00:00<?, ?it/s]/mnt/bbahl/miniconda3/envs/torchverify/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:351: UserWarning: Device capability of jax unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
+  warnings.warn(
+100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [01:29<00:00,  2.97s/it]
+compile time: 226.64892053604126
 generate...
-100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:01<00:00, 17.65it/s]
-generation time (after compile) : 1.8461642265319824
+100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  6.17it/s]
+generation time (after compile) : 5.120622396469116
+```
@@ -27,7 +27,6 @@
 # pipe.vae.enable_xla_attention()
 start = time()
 print("compiling...")
-import pdb; pdb.set_trace()
 image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
 print(f"compile time: {time() - start}")
 print("generate...")