
Conversation

@entrpn (Contributor) commented Sep 6, 2024

What does this PR do?

@sayakpaul Enables PyTorch XLA training on TPUs for Stable Diffusion 2.x models.
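
For readers new to PyTorch/XLA, here is a minimal sketch of the training-step pattern such a script typically uses; the model and loss below are placeholders, not the PR's actual code.

```python
import torch
import torch_xla.core.xla_model as xm

def train_step(model, batch, optimizer):
    # Placeholder objective; the real script computes the diffusion
    # denoising loss on pixel_values / input_ids.
    optimizer.zero_grad()
    loss = model(batch).mean()
    loss.backward()
    # All-reduce gradients across TPU cores and apply the update.
    xm.optimizer_step(optimizer)
    # Cut the lazily built XLA graph and dispatch it to the TPU.
    xm.mark_step()
    return loss
```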


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul (Member) commented:

Thanks for your contribution! Could we maybe move this to the research_projects folder, as we cannot test it at the moment?


@entrpn (Contributor, Author) commented Sep 9, 2024

@sayakpaul I moved the files to research_projects. Please review.


```python
def main(args):
    device = xm.xla_device()
    model_path = <output_dir>
```
@sayakpaul (Member) commented Sep 10, 2024

Can we use a repo id on the Hub that could be loaded here? This way, users can directly try out the snippet without having to look for one.

@entrpn (Contributor, Author) replied:

Uploaded a trained model to the Hub and updated the model_path.
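
For reference, a minimal sketch of what the updated snippet might look like after that change; the repo id below is illustrative, not necessarily the checkpoint that was actually uploaded.

```python
import torch
import torch_xla.core.xla_model as xm
from diffusers import DiffusionPipeline

device = xm.xla_device()

# Illustrative repo id; substitute the fine-tuned checkpoint pushed to the Hub.
model_path = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe = pipe.to(device)

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```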

```python
    pixel_values,
    input_ids,
):
    with xp.Trace("model.forward"):
```
Member:

I am assuming these traces are thin enough to NOT introduce any unnecessary latency?

Reply:

I think xp.Trace calls are very lightweight and can help with profiling.
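
For context, this is roughly how `xp.Trace` annotations fit into the PyTorch/XLA profiler; a sketch under assumptions, not the PR's exact code (the port and span name here are arbitrary).

```python
import torch_xla.debug.profiler as xp

# Start the profiler server once per process; traces can then be
# captured from TensorBoard or the capture-profile tooling.
server = xp.start_server(9012)

def forward_pass(model, pixel_values, input_ids):
    # xp.Trace adds a named span to the captured profile; when no
    # capture is running it is close to a no-op, which is why the
    # overhead is negligible.
    with xp.Trace("model.forward"):
        return model(pixel_values, input_ids)
```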

@sayakpaul (Member) left a review:

Thanks! Some minor comments here and there, but from an implementation standpoint this looks excellent.

In the README, I think it'd be nice to include some gotchas that users need to be aware of:

  • Would the example work on a multi-node TPU host?
  • How much wall-clock time can users expect?
  • Would the inference snippet work on multiple TPU chips? (A sketch of the multi-chip pattern follows this list.)
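
On the multi-chip question above, the usual PyTorch/XLA answer is to spawn one process per TPU device; a minimal sketch, with `_mp_fn` standing in for hypothetical per-device work:

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process owns one TPU device.
    device = xm.xla_device()
    # Hypothetical per-device work (e.g. loading the pipeline and
    # generating images); not part of the PR itself.
    print(f"process {index} running on {device}")

if __name__ == "__main__":
    # Spawn one process per available TPU device.
    xmp.spawn(_mp_fn, args=())
```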


The `train_text_to_image_xla.py` script shows how to fine-tune a Stable Diffusion model on TPU devices using PyTorch/XLA.

It has been tested on v4 and v5p TPU versions.

Member:

Can you be a little more specific about the TPU models you used? It would be nice for someone who wants to reproduce this to know how many TPUs are required to avoid hitting an OOM.

@entrpn (Contributor, Author) replied Sep 11, 2024

I've added this to the README. Please review.

@entrpn (Contributor, Author) commented Sep 11, 2024

@sayakpaul @tengomucho added changes per the comments. Please review.

@sayakpaul (Member) left a review:

Looking really good!

Comment on lines +10 to +16:

As of 9-11-2024, these are some expected step times.

| accelerator | global batch size | step time (seconds) |
| ----------- | ----------------- | ------------------- |
| v5p-128     | 1024              | 0.245               |
| v5p-256     | 2048              | 0.234               |
| v5p-512     | 4096              | 0.2498              |
Member:

This is very helpful, thanks so much!

On the `## Create TPU` section of the README, a member commented:

If there is official GCP documentation that could be linked here, feel free to add it.

@sayakpaul merged commit 45aa8bb into huggingface:main on Sep 12, 2024. 15 checks passed.
@sayakpaul (Member) commented:

Thank you for this contribution!

@sayakpaul (Member) commented:

@entrpn do you think it could make sense to have something similar for Flux? It's the most popular text-to-image generation model right now.

Cc'ing @linoytsaban and @apolinario for awareness, as you do a lot of fine-tuning.

@entrpn (Contributor, Author) commented Sep 13, 2024

> @entrpn do you think it could make sense to have something similar for Flux? It's the most popular text-to-image generation model right now.

We can revisit this in the future. Thank you for your help merging this.

sayakpaul added a commit that referenced this pull request Dec 23, 2024
* enable pxla training of stable diffusion 2.x models.

* run linter/style and run pipeline test for stable diffusion and fix issues.

* update xla libraries

* fix read me newline.

* move files to research folder.

* update per comments.

* rename readme.

---------

Co-authored-by: Juan Acevedo <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>