|
8 | 8 | techniques often can be implemented by changing only a few lines of code and can |
9 | 9 | be applied to a wide range of deep learning models across all domains. |
10 | 10 |
|
| 11 | +.. grid:: 2 |
| 12 | +
|
| 13 | + .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn |
| 14 | + :class-card: card-prerequisites |
| 15 | +
|
| 16 | + * General optimization techniques for PyTorch models |
| 17 | + * CPU-specific performance optimizations |
| 18 | + * GPU acceleration strategies |
| 19 | + * Distributed training optimizations |
| 20 | +
|
| 21 | + .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites |
| 22 | + :class-card: card-prerequisites |
| 23 | +
|
| 24 | + * PyTorch 2.0 or later |
| 25 | + * Python 3.8 or later |
| 26 | + * CUDA-capable GPU (recommended for GPU optimizations) |
| 27 | + * Linux, macOS, or Windows operating system |
| 28 | +
|
| 29 | +Overview |
| 30 | +-------- |
| 31 | +
|
| 32 | +Performance optimization is crucial for efficient deep learning model training and inference. |
| 33 | +This tutorial covers a comprehensive set of techniques to accelerate PyTorch workloads across |
| 34 | +different hardware configurations and use cases. |
| 35 | +
|
11 | 36 | General optimizations |
12 | 37 | --------------------- |
13 | 38 | """ |
14 | 39 |
|
| 40 | +import torch |
| 41 | +import torchvision |
| 42 | + |
15 | 43 | ############################################################################### |
16 | 44 | # Enable asynchronous data loading and augmentation |
17 | 45 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
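| | +#
| | +# As a quick illustration, below is a minimal sketch of enabling asynchronous
| | +# data loading with ``DataLoader`` (the dataset here is a hypothetical
| | +# in-memory stand-in): ``num_workers > 0`` loads and augments batches in
| | +# background worker processes, and ``pin_memory=True`` allocates page-locked
| | +# host memory to speed up host-to-GPU copies.
| | +
| | +from torch.utils.data import DataLoader, TensorDataset
| | +
| | +# hypothetical dataset used only for illustration
| | +illustrative_dataset = TensorDataset(
| | +    torch.randn(256, 3, 64, 64), torch.randint(0, 10, (256,))
| | +)
| | +
| | +train_loader = DataLoader(
| | +    illustrative_dataset,
| | +    batch_size=32,
| | +    shuffle=True,
| | +    num_workers=4,    # load and augment batches asynchronously in worker processes
| | +    pin_memory=True,  # page-locked memory enables faster host-to-GPU transfers
| | +)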
|
90 | 118 | # setting it to zero, for more details refer to the |
91 | 119 | # `documentation <https://pytorch.org/docs/master/optim.html#torch.optim.Optimizer.zero_grad>`_. |
92 | 120 | # |
93 | | -# Alternatively, starting from PyTorch 1.7, call ``model`` or |
| 121 | +# Alternatively, call ``model.zero_grad(set_to_none=True)`` or
94 | 122 | # ``optimizer.zero_grad(set_to_none=True)``. |
95 | 123 |
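| | +
| | +###############################################################################
| | +# As a minimal sketch (with a hypothetical model and optimizer),
| | +# ``set_to_none=True`` leaves each ``.grad`` as ``None`` instead of a
| | +# zero-filled tensor, which skips a memset and lets the next backward pass
| | +# write gradients with ``=`` rather than accumulate them with ``+=``.
| | +
| | +model = torch.nn.Linear(16, 4)
| | +optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
| | +
| | +model(torch.randn(8, 16)).sum().backward()
| | +optimizer.step()
| | +
| | +# reset gradients to ``None`` rather than filling them with zeros
| | +optimizer.zero_grad(set_to_none=True)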
|
96 | 124 | ############################################################################### |
@@ -129,7 +157,7 @@ def gelu(x): |
129 | 157 | ############################################################################### |
130 | 158 | # Enable channels_last memory format for computer vision models |
131 | 159 | # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
132 | | -# PyTorch 1.5 introduced support for ``channels_last`` memory format for |
| 160 | +# PyTorch supports ``channels_last`` memory format for |
133 | 161 | # convolutional networks. This format is meant to be used in conjunction with |
134 | 162 | # `AMP <https://pytorch.org/docs/stable/amp.html>`_ to further accelerate |
135 | 163 | # convolutional neural networks with |
@@ -250,65 +278,6 @@ def gelu(x): |
250 | 278 | # |
251 | 279 | # export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD |
252 | 280 |
|
253 | | -############################################################################### |
254 | | -# Use oneDNN Graph with TorchScript for inference |
255 | | -# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
256 | | -# oneDNN Graph can significantly boost inference performance. It fuses some compute-intensive operations such as convolution, matmul with their neighbor operations. |
257 | | -# In PyTorch 2.0, it is supported as a beta feature for ``Float32`` & ``BFloat16`` data-types. |
258 | | -# oneDNN Graph receives the model’s graph and identifies candidates for operator-fusion with respect to the shape of the example input. |
259 | | -# A model should be JIT-traced using an example input. |
260 | | -# Speed-up would then be observed after a couple of warm-up iterations for inputs with the same shape as the example input. |
261 | | -# The example code-snippets below are for resnet50, but they can very well be extended to use oneDNN Graph with custom models as well. |
262 | | - |
263 | | -# Only this extra line of code is required to use oneDNN Graph |
264 | | -torch.jit.enable_onednn_fusion(True) |
265 | | - |
266 | | -############################################################################### |
267 | | -# Using the oneDNN Graph API requires just one extra line of code for inference with Float32. |
268 | | -# If you are using oneDNN Graph, please avoid calling ``torch.jit.optimize_for_inference``. |
269 | | - |
270 | | -# sample input should be of the same shape as expected inputs |
271 | | -sample_input = [torch.rand(32, 3, 224, 224)] |
272 | | -# Using resnet50 from torchvision in this example for illustrative purposes, |
273 | | -# but the line below can indeed be modified to use custom models as well. |
274 | | -model = getattr(torchvision.models, "resnet50")().eval() |
275 | | -# Tracing the model with example input |
276 | | -traced_model = torch.jit.trace(model, sample_input) |
277 | | -# Invoking torch.jit.freeze |
278 | | -traced_model = torch.jit.freeze(traced_model) |
279 | | - |
280 | | -############################################################################### |
281 | | -# Once a model is JIT-traced with a sample input, it can then be used for inference after a couple of warm-up runs. |
282 | | - |
283 | | -with torch.no_grad(): |
284 | | - # a couple of warm-up runs |
285 | | - traced_model(*sample_input) |
286 | | - traced_model(*sample_input) |
287 | | - # speedup would be observed after warm-up runs |
288 | | - traced_model(*sample_input) |
289 | | - |
290 | | -############################################################################### |
291 | | -# While the JIT fuser for oneDNN Graph also supports inference with ``BFloat16`` datatype, |
292 | | -# performance benefit with oneDNN Graph is only exhibited by machines with AVX512_BF16 |
293 | | -# instruction set architecture (ISA). |
294 | | -# The following code snippets serves as an example of using ``BFloat16`` datatype for inference with oneDNN Graph: |
295 | | - |
296 | | -# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart |
297 | | -torch._C._jit_set_autocast_mode(False) |
298 | | - |
299 | | -with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16): |
300 | | - # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used |
301 | | - import torch.fx.experimental.optimization as optimization |
302 | | - # Please note that optimization.fuse need not be called when AMP is not used |
303 | | - model = optimization.fuse(model) |
304 | | - model = torch.jit.trace(model, (example_input)) |
305 | | - model = torch.jit.freeze(model) |
306 | | - # a couple of warm-up runs |
307 | | - model(example_input) |
308 | | - model(example_input) |
309 | | - # speedup would be observed in subsequent runs. |
310 | | - model(example_input) |
311 | | - |
312 | 281 |
|
313 | 282 | ############################################################################### |
314 | 283 | # Train a model on CPU with PyTorch ``DistributedDataParallel`` (DDP) functionality
@@ -426,9 +395,8 @@ def gelu(x): |
426 | 395 | # * enable AMP |
427 | 396 | # |
428 | 397 | # * Introduction to Mixed Precision Training and AMP: |
429 | | -# `video <https://www.youtube.com/watch?v=jF4-_ZK_tyc&feature=youtu.be>`_, |
430 | 398 | # `slides <https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf>`_ |
431 | | -# * native PyTorch AMP is available starting from PyTorch 1.6: |
| 399 | +# * native PyTorch AMP is available (see the sketch after this list):
432 | 400 | # `documentation <https://pytorch.org/docs/stable/amp.html>`_, |
433 | 401 | # `examples <https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples>`_, |
434 | 402 | # `tutorial <https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html>`_ |
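| | +
| | +###############################################################################
| | +# A minimal AMP sketch (hypothetical model, optimizer, and random data;
| | +# guarded by a CUDA check so it is skipped on CPU-only machines):
| | +# ``autocast`` runs eligible ops in reduced precision while ``GradScaler``
| | +# scales the loss to avoid underflow of float16 gradients.
| | +
| | +if torch.cuda.is_available():
| | +    device = torch.device("cuda")
| | +    model = torch.nn.Linear(64, 8).to(device)
| | +    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
| | +    scaler = torch.cuda.amp.GradScaler()
| | +
| | +    inputs = torch.randn(32, 64, device=device)
| | +    targets = torch.randn(32, 8, device=device)
| | +
| | +    optimizer.zero_grad(set_to_none=True)
| | +    with torch.cuda.amp.autocast():
| | +        loss = torch.nn.functional.mse_loss(model(inputs), targets)
| | +    scaler.scale(loss).backward()
| | +    scaler.step(optimizer)
| | +    scaler.update()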
@@ -536,3 +504,31 @@ def gelu(x): |
536 | 504 | # approximately constant number of tokens (and variable number of sequences in a |
537 | 505 | # batch), other models solve imbalance by bucketing samples with similar |
538 | 506 | # sequence length or even by sorting dataset by sequence length. |
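| | +
| | +###############################################################################
| | +# A minimal sketch of the bucketing idea above, using a hypothetical list of
| | +# variable-length token sequences: sorting sample indices by sequence length
| | +# before slicing them into batches keeps the amount of padding and the work
| | +# per batch (and per rank) roughly balanced.
| | +
| | +sequences = [torch.randint(0, 1000, (length,)) for length in (5, 42, 7, 31, 18, 3)]
| | +
| | +# sort sample indices by sequence length so each batch holds similar lengths
| | +sorted_indices = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
| | +
| | +batch_size = 2
| | +batches = [
| | +    sorted_indices[start : start + batch_size]
| | +    for start in range(0, len(sorted_indices), batch_size)
| | +]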
| 507 | + |
| 508 | +############################################################################### |
| 509 | +# Conclusion |
| 510 | +# ---------- |
| 511 | +# |
| 512 | +# This tutorial covered a comprehensive set of performance optimization techniques |
| 513 | +# for PyTorch models. The key takeaways include: |
| 514 | +# |
| 515 | +# * **General optimizations**: Enable async data loading, disable gradients for |
| 516 | +# inference, fuse operations with ``torch.compile``, and use efficient memory formats |
| 517 | +# * **CPU optimizations**: Leverage NUMA controls, optimize OpenMP settings, and |
| 518 | +# use efficient memory allocators |
| 519 | +# * **GPU optimizations**: Enable Tensor Cores, use CUDA graphs, enable cuDNN
| 520 | +# autotuner, and implement mixed precision training |
| 521 | +# * **Distributed optimizations**: Use DistributedDataParallel, optimize gradient |
| 522 | +# synchronization, and balance workloads across devices |
| 523 | +# |
| 524 | +# Many of these optimizations can be applied with minimal code changes and provide |
| 525 | +# significant performance improvements across a wide range of deep learning models. |
| 526 | +# |
| 527 | +# Further Reading |
| 528 | +# --------------- |
| 529 | +# |
| 530 | +# * `PyTorch Performance Tuning Documentation <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_ |
| 531 | +# * `CUDA Best Practices <https://pytorch.org/docs/stable/notes/cuda.html>`_ |
| 532 | +# * `Distributed Training Documentation <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ |
| 533 | +# * `Mixed Precision Training <https://pytorch.org/docs/stable/amp.html>`_ |
| 534 | +# * `torch.compile Tutorial <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_ |