From b87a96219fa5263a4db60d204266e93e4229e033 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Thu, 15 May 2025 09:30:02 +0530
Subject: [PATCH 1/4] add regional compilation docs.

---
 docs/source/en/optimization/torch2.0.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
index 01ea00310a75..d3ee6cbfc22e 100644
--- a/docs/source/en/optimization/torch2.0.md
+++ b/docs/source/en/optimization/torch2.0.md
@@ -78,6 +78,24 @@ For more information and different options about `torch.compile`, refer to the [
 > [!TIP]
 > Learn more about other ways PyTorch 2.0 can help optimize your model in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion) tutorial.

+### Regional compilation
+
+Compiling the whole model usually has a big problem space for optimization. Models are often composed of multiple repeated blocks. [Regional compilation](https://pytorch.org/tutorials/recipes/regional_compilation.html) compiles the repeated block first (a transformer encoder block, for example), so that the Torch compiler re-uses its cached/optimized generated code for the other blocks, reducing the cold start compilation time observed on the first inference call.
+
+Enabling regional compilation might require simple yet intrusive changes to the
+modeling code. However, 🤗 Accelerate provides a utility [`compile_regions()`](https://huggingface.co/docs/accelerate/main/en/usage_guides/compilation#how-to-use-regional-compilation) which automatically _only_ compiles
+the repeated blocks of the provided `nn.Module`.
+
+```py
+# Make sure you're on the latest `accelerate`: `pip install -U accelerate`.
+from accelerate.utils import compile_regions
+
+pipe.unet = compile_regions(pipe.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+As you may have noticed, `compile_regions()` takes the same arguments as `torch.compile()`, allowing
+flexibility.
+
 ## Benchmark

 We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on 🤗 Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details).

From 581cba452f24328e87fb2689d216174bf9d5308c Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Thu, 15 May 2025 09:30:30 +0530
Subject: [PATCH 2/4] minor.

---
 docs/source/en/optimization/torch2.0.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
index d3ee6cbfc22e..5849d71b92a8 100644
--- a/docs/source/en/optimization/torch2.0.md
+++ b/docs/source/en/optimization/torch2.0.md
@@ -80,7 +80,7 @@ For more information and different options about `torch.compile`, refer to the [
 ### Regional compilation

-Compiling the whole model usually has a big problem space for optimization. Models are often composed of multiple repeated blocks. [Regional compilation](https://pytorch.org/tutorials/recipes/regional_compilation.html) compiles the repeated block first (a transformer encoder block, for example), so that the Torch compiler re-uses its cached/optimized generated code for the other blocks, reducing the cold start compilation time observed on the first inference call.
+Compiling the whole model usually has a big problem space for optimization. Models are often composed of multiple repeated blocks. [Regional compilation](https://pytorch.org/tutorials/recipes/regional_compilation.html) compiles the repeated block first (a transformer encoder block, for example), so that the Torch compiler re-uses its cached/optimized generated code for the other blocks, reducing (often massively) the cold start compilation time observed on the first inference call.

 Enabling regional compilation might require simple yet intrusive changes to the
 modeling code. However, 🤗 Accelerate provides a utility [`compile_regions()`](https://huggingface.co/docs/accelerate/main/en/usage_guides/compilation#how-to-use-regional-compilation) which automatically _only_ compiles
 the repeated blocks of the provided `nn.Module`.

From 8881dc6c4921e4ad36d788272c4942439c23f1fd Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Thu, 15 May 2025 18:44:23 +0530
Subject: [PATCH 3/4] reviewer feedback.

---
 docs/source/en/optimization/torch2.0.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
index 5849d71b92a8..288e898a9092 100644
--- a/docs/source/en/optimization/torch2.0.md
+++ b/docs/source/en/optimization/torch2.0.md
@@ -83,8 +83,8 @@ For more information and different options about `torch.compile`, refer to the [
 Compiling the whole model usually has a big problem space for optimization. Models are often composed of multiple repeated blocks. [Regional compilation](https://pytorch.org/tutorials/recipes/regional_compilation.html) compiles the repeated block first (a transformer encoder block, for example), so that the Torch compiler re-uses its cached/optimized generated code for the other blocks, reducing (often massively) the cold start compilation time observed on the first inference call.

 Enabling regional compilation might require simple yet intrusive changes to the
-modeling code. However, 🤗 Accelerate provides a utility [`compile_regions()`](https://huggingface.co/docs/accelerate/main/en/usage_guides/compilation#how-to-use-regional-compilation) which automatically _only_ compiles
-the repeated blocks of the provided `nn.Module`.
+modeling code. However, 🤗 Accelerate provides a utility [`compile_regions()`](https://huggingface.co/docs/accelerate/main/en/usage_guides/compilation#how-to-use-regional-compilation) which automatically compiles
+the repeated blocks of the provided `nn.Module` along with other parts of it that are non-repeating. This helps with not only the cold start time but also the inference latency.

 ```py
 # Make sure you're on the latest `accelerate`: `pip install -U accelerate`.
@@ -93,8 +93,7 @@ from accelerate.utils import compile_regions
 pipe.unet = compile_regions(pipe.unet, mode="reduce-overhead", fullgraph=True)
 ```

-As you may have noticed, `compile_regions()` takes the same arguments as `torch.compile()`, allowing
-flexibility.
+As you may have noticed, `compile_regions()` takes the same arguments as `torch.compile()`, allowing flexibility.

 ## Benchmark

From bacd4034d7aaf697154ab2a4f940954e19d1bdba Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Thu, 15 May 2025 19:03:06 +0530
Subject: [PATCH 4/4] Update docs/source/en/optimization/torch2.0.md

Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
---
 docs/source/en/optimization/torch2.0.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
index 288e898a9092..cc69eceff3af 100644
--- a/docs/source/en/optimization/torch2.0.md
+++ b/docs/source/en/optimization/torch2.0.md
@@ -84,7 +84,7 @@ Compiling the whole model usually has a big problem space for optimization. Mode

 Enabling regional compilation might require simple yet intrusive changes to the
 modeling code. However, 🤗 Accelerate provides a utility [`compile_regions()`](https://huggingface.co/docs/accelerate/main/en/usage_guides/compilation#how-to-use-regional-compilation) which automatically compiles
-the repeated blocks of the provided `nn.Module` along with other parts of it that are non-repeating. This helps with not only the cold start time but also the inference latency.
+the repeated blocks of the provided `nn.Module` sequentially, and the rest of the model separately. This helps reduce cold start time while keeping most (if not all) of the speedup you would get from full compilation.

 ```py
 # Make sure you're on the latest `accelerate`: `pip install -U accelerate`.
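
For context, here is a minimal end-to-end sketch of where the documented snippet slots into a pipeline. Only the `compile_regions(pipe.unet, ...)` call comes from the patches above; the checkpoint name, dtype, and CUDA device are illustrative assumptions.

```py
# A minimal sketch, assuming a CUDA GPU and an example checkpoint
# ("runwayml/stable-diffusion-v1-5"); any UNet-based pipeline works the same way.
import torch
from diffusers import StableDiffusionPipeline
from accelerate.utils import compile_regions  # needs a recent `accelerate`

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile the repeated UNet blocks (and the rest of the model separately);
# `compile_regions()` forwards `mode` and `fullgraph` to `torch.compile()`.
pipe.unet = compile_regions(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call pays the (reduced) compilation cost; later calls are fast.
image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```

`mode="reduce-overhead"` enables CUDA graphs to cut kernel-launch overhead, which suits the fixed shapes of a denoising loop; drop it if your input shapes vary between calls.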