|
28 | 28 | "a few lines of code and can be applied to a wide range of deep learning\n", |
29 | 29 | "models across all domains.\n", |
30 | 30 | "\n", |
| 31 | + "<div style=\"width: 45%; float: left; padding: 20px;\"><h2>What you will learn</h2><ul><li>General optimization techniques for PyTorch models</li><li>CPU-specific performance optimizations</li><li>GPU acceleration strategies</li><li>Distributed training optimizations</li></ul></div><div style=\"width: 45%; float: right; padding: 20px;\"><h2>Prerequisites</h2><ul><li>PyTorch 2.0 or later</li><li>Python 3.8 or later</li><li>CUDA-capable GPU (recommended for GPU optimizations)</li><li>Linux, macOS, or Windows operating system</li></ul></div>\n",
| 32 | + "\n", |
| 33 | + "Overview\n", |
| 34 | + "--------\n", |
| 35 | + "\n", |
| 36 | + "Performance optimization is crucial for efficient deep learning model\n", |
| 37 | + "training and inference. This tutorial covers a comprehensive set of\n", |
| 38 | + "techniques to accelerate PyTorch workloads across different hardware\n", |
| 39 | + "configurations and use cases.\n", |
| 40 | + "\n", |
31 | 41 | "General optimizations\n", |
32 | 42 | "---------------------\n" |
33 | 43 | ] |
34 | 44 | }, |
| 45 | + { |
| 46 | + "cell_type": "code", |
| 47 | + "execution_count": null, |
| 48 | + "metadata": { |
| 49 | + "collapsed": false |
| 50 | + }, |
| 51 | + "outputs": [], |
| 52 | + "source": [ |
| 53 | + "import torch\n", |
| 54 | + "import torchvision" |
| 55 | + ] |
| 56 | + }, |
35 | 57 | { |
36 | 58 | "cell_type": "markdown", |
37 | 59 | "metadata": {}, |
|
157 | 179 | "than setting it to zero. For more details, refer to the\n",
158 | 180 | "[documentation](https://pytorch.org/docs/master/optim.html#torch.optim.Optimizer.zero_grad).\n", |
159 | 181 | "\n", |
160 | | - "Alternatively, starting from PyTorch 1.7, call `model` or\n", |
161 | | - "`optimizer.zero_grad(set_to_none=True)`.\n" |
| 182 | + "Alternatively, call `model.zero_grad(set_to_none=True)` or `optimizer.zero_grad(set_to_none=True)`.\n"
162 | 183 | ] |
163 | 184 | }, |
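| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a minimal sketch of a training step with `set_to_none=True` (the\n",
| | + "model, optimizer, and batch below are toy placeholders for\n",
| | + "illustration):\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {
| | + "collapsed": false
| | + },
| | + "outputs": [],
| | + "source": [
| | + "# Toy model, optimizer, and batch used only for illustration\n",
| | + "model = torch.nn.Linear(128, 10)\n",
| | + "optimizer = torch.optim.SGD(model.parameters(), lr=0.01)\n",
| | + "inputs, targets = torch.randn(32, 128), torch.randint(0, 10, (32,))\n",
| | + "\n",
| | + "# Setting gradients to None skips the memset and avoids a\n",
| | + "# read-modify-write in the backward pass\n",
| | + "optimizer.zero_grad(set_to_none=True)\n",
| | + "loss = torch.nn.functional.cross_entropy(model(inputs), targets)\n",
| | + "loss.backward()\n",
| | + "optimizer.step()"
| | + ]
| | + },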
164 | 185 | { |
|
222 | 243 | "Enable channels\\_last memory format for computer vision models\n", |
223 | 244 | "==============================================================\n", |
224 | 245 | "\n", |
225 | | - "PyTorch 1.5 introduced support for `channels_last` memory format for\n", |
226 | | - "convolutional networks. This format is meant to be used in conjunction\n", |
227 | | - "with [AMP](https://pytorch.org/docs/stable/amp.html) to further\n", |
228 | | - "accelerate convolutional neural networks with [Tensor\n", |
| 246 | + "PyTorch supports `channels_last` memory format for convolutional\n", |
| 247 | + "networks. This format is meant to be used in conjunction with\n", |
| 248 | + "[AMP](https://pytorch.org/docs/stable/amp.html) to further accelerate\n", |
| 249 | + "convolutional neural networks with [Tensor\n", |
229 | 250 | "Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).\n", |
230 | 251 | "\n", |
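| | + "As a minimal sketch of opting in (assuming a convolutional `model` and\n",
| | + "an NCHW `input` tensor are already defined):\n",
| | + "\n",
| | + "```python\n",
| | + "model = model.to(memory_format=torch.channels_last)  # convert weights\n",
| | + "input = input.to(memory_format=torch.channels_last)  # convert activations\n",
| | + "output = model(input)  # convolutions select channels_last kernels\n",
| | + "```\n",
| | + "\n",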
231 | 252 | "Support for `channels_last` is experimental, but it\\'s expected to work\n", |
|
439 | 460 | "```\n" |
440 | 461 | ] |
441 | 462 | }, |
442 | | - { |
443 | | - "cell_type": "markdown", |
444 | | - "metadata": {}, |
445 | | - "source": [ |
446 | | - "Use oneDNN Graph with TorchScript for inference\n", |
447 | | - "===============================================\n", |
448 | | - "\n", |
449 | | - "oneDNN Graph can significantly boost inference performance. It fuses\n", |
450 | | - "some compute-intensive operations such as convolution, matmul with their\n", |
451 | | - "neighbor operations. In PyTorch 2.0, it is supported as a beta feature\n", |
452 | | - "for `Float32` & `BFloat16` data-types. oneDNN Graph receives the model's\n", |
453 | | - "graph and identifies candidates for operator-fusion with respect to the\n", |
454 | | - "shape of the example input. A model should be JIT-traced using an\n", |
455 | | - "example input. Speed-up would then be observed after a couple of warm-up\n", |
456 | | - "iterations for inputs with the same shape as the example input. The\n", |
457 | | - "example code-snippets below are for resnet50, but they can very well be\n", |
458 | | - "extended to use oneDNN Graph with custom models as well.\n" |
459 | | - ] |
460 | | - }, |
461 | | - { |
462 | | - "cell_type": "code", |
463 | | - "execution_count": null, |
464 | | - "metadata": { |
465 | | - "collapsed": false |
466 | | - }, |
467 | | - "outputs": [], |
468 | | - "source": [ |
469 | | - "# Only this extra line of code is required to use oneDNN Graph\n", |
470 | | - "torch.jit.enable_onednn_fusion(True)" |
471 | | - ] |
472 | | - }, |
473 | | - { |
474 | | - "cell_type": "markdown", |
475 | | - "metadata": {}, |
476 | | - "source": [ |
477 | | - "Using the oneDNN Graph API requires just one extra line of code for\n", |
478 | | - "inference with Float32. If you are using oneDNN Graph, please avoid\n", |
479 | | - "calling `torch.jit.optimize_for_inference`.\n" |
480 | | - ] |
481 | | - }, |
482 | | - { |
483 | | - "cell_type": "code", |
484 | | - "execution_count": null, |
485 | | - "metadata": { |
486 | | - "collapsed": false |
487 | | - }, |
488 | | - "outputs": [], |
489 | | - "source": [ |
490 | | - "# sample input should be of the same shape as expected inputs\n", |
491 | | - "sample_input = [torch.rand(32, 3, 224, 224)]\n", |
492 | | - "# Using resnet50 from torchvision in this example for illustrative purposes,\n", |
493 | | - "# but the line below can indeed be modified to use custom models as well.\n", |
494 | | - "model = getattr(torchvision.models, \"resnet50\")().eval()\n", |
495 | | - "# Tracing the model with example input\n", |
496 | | - "traced_model = torch.jit.trace(model, sample_input)\n", |
497 | | - "# Invoking torch.jit.freeze\n", |
498 | | - "traced_model = torch.jit.freeze(traced_model)" |
499 | | - ] |
500 | | - }, |
501 | | - { |
502 | | - "cell_type": "markdown", |
503 | | - "metadata": {}, |
504 | | - "source": [ |
505 | | - "Once a model is JIT-traced with a sample input, it can then be used for\n", |
506 | | - "inference after a couple of warm-up runs.\n" |
507 | | - ] |
508 | | - }, |
509 | | - { |
510 | | - "cell_type": "code", |
511 | | - "execution_count": null, |
512 | | - "metadata": { |
513 | | - "collapsed": false |
514 | | - }, |
515 | | - "outputs": [], |
516 | | - "source": [ |
517 | | - "with torch.no_grad():\n", |
518 | | - " # a couple of warm-up runs\n", |
519 | | - " traced_model(*sample_input)\n", |
520 | | - " traced_model(*sample_input)\n", |
521 | | - " # speedup would be observed after warm-up runs\n", |
522 | | - " traced_model(*sample_input)" |
523 | | - ] |
524 | | - }, |
525 | | - { |
526 | | - "cell_type": "markdown", |
527 | | - "metadata": {}, |
528 | | - "source": [ |
529 | | - "While the JIT fuser for oneDNN Graph also supports inference with\n", |
530 | | - "`BFloat16` datatype, performance benefit with oneDNN Graph is only\n", |
531 | | - "exhibited by machines with AVX512\\_BF16 instruction set architecture\n", |
532 | | - "(ISA). The following code snippets serves as an example of using\n", |
533 | | - "`BFloat16` datatype for inference with oneDNN Graph:\n" |
534 | | - ] |
535 | | - }, |
536 | | - { |
537 | | - "cell_type": "code", |
538 | | - "execution_count": null, |
539 | | - "metadata": { |
540 | | - "collapsed": false |
541 | | - }, |
542 | | - "outputs": [], |
543 | | - "source": [ |
544 | | - "# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart\n", |
545 | | - "torch._C._jit_set_autocast_mode(False)\n", |
546 | | - "\n", |
547 | | - "with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):\n", |
548 | | - " # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used\n", |
549 | | - " import torch.fx.experimental.optimization as optimization\n", |
550 | | - " # Please note that optimization.fuse need not be called when AMP is not used\n", |
551 | | - " model = optimization.fuse(model)\n", |
552 | | - " model = torch.jit.trace(model, (example_input))\n", |
553 | | - " model = torch.jit.freeze(model)\n", |
554 | | - " # a couple of warm-up runs\n", |
555 | | - " model(example_input)\n", |
556 | | - " model(example_input)\n", |
557 | | - " # speedup would be observed in subsequent runs.\n", |
558 | | - " model(example_input)" |
559 | | - ] |
560 | | - }, |
561 | 463 | { |
562 | 464 | "cell_type": "markdown", |
563 | 465 | "metadata": {}, |
|
751 | 653 | " NLP models\n", |
752 | 654 | "- enable AMP\n", |
753 | 655 | " - Introduction to Mixed Precision Training and AMP:\n", |
754 | | - " [video](https://www.youtube.com/watch?v=jF4-_ZK_tyc&feature=youtu.be),\n", |
755 | 656 | " [slides](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf)\n", |
756 | | - " - native PyTorch AMP is available starting from PyTorch 1.6:\n", |
| 657 | + " - native PyTorch AMP:\n",
757 | 658 | " [documentation](https://pytorch.org/docs/stable/amp.html),\n", |
758 | 659 | " [examples](https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples),\n", |
759 | 660 | " [tutorial](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html)\n" |
|
894 | 795 | "by bucketing samples with similar sequence length or even by sorting\n", |
895 | 796 | "the dataset by sequence length.\n"
896 | 797 | ] |
| 798 | + }, |
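| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "As a minimal sketch of length-bucketing (the random `sequences` list\n",
| | + "below is a stand-in for a real dataset):\n"
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "metadata": {
| | + "collapsed": false
| | + },
| | + "outputs": [],
| | + "source": [
| | + "from torch.nn.utils.rnn import pad_sequence\n",
| | + "\n",
| | + "# Stand-in dataset: 256 variable-length 1-D tensors\n",
| | + "sequences = [torch.randn(torch.randint(10, 100, (1,)).item()) for _ in range(256)]\n",
| | + "\n",
| | + "# Sort indices by length so each batch pads to a similar size\n",
| | + "order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))\n",
| | + "\n",
| | + "# Batches built from sorted indices carry minimal padding overhead\n",
| | + "batch_size = 32\n",
| | + "batches = [\n",
| | + "    pad_sequence([sequences[i] for i in order[b:b + batch_size]], batch_first=True)\n",
| | + "    for b in range(0, len(order), batch_size)\n",
| | + "]"
| | + ]
| | + },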
| 799 | + { |
| 800 | + "cell_type": "markdown", |
| 801 | + "metadata": {}, |
| 802 | + "source": [ |
| 803 | + "Conclusion\n", |
| 804 | + "==========\n", |
| 805 | + "\n", |
| 806 | + "This tutorial covered a comprehensive set of performance optimization\n", |
| 807 | + "techniques for PyTorch models. The key takeaways include:\n", |
| 808 | + "\n", |
| 809 | + "- **General optimizations**: Enable async data loading, disable\n", |
| 810 | + " gradients for inference, fuse operations with `torch.compile`, and\n", |
| 811 | + " use efficient memory formats\n", |
| 812 | + "- **CPU optimizations**: Leverage NUMA controls, optimize OpenMP\n", |
| 813 | + " settings, and use efficient memory allocators\n", |
| 814 | + "- **GPU optimizations**: Enable Tensor Cores, use CUDA graphs, enable\n",
| 815 | + " cuDNN autotuner, and implement mixed precision training\n", |
| 816 | + "- **Distributed optimizations**: Use DistributedDataParallel, optimize\n", |
| 817 | + " gradient synchronization, and balance workloads across devices\n", |
| 818 | + "\n", |
| 819 | + "Many of these optimizations can be applied with minimal code changes and\n", |
| 820 | + "provide significant performance improvements across a wide range of deep\n", |
| 821 | + "learning models.\n", |
| 822 | + "\n", |
| 823 | + "Further Reading\n", |
| 824 | + "===============\n", |
| 825 | + "\n", |
| 826 | + "- [PyTorch Performance Tuning\n", |
| 827 | + " Documentation](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html)\n", |
| 828 | + "- [CUDA Best\n", |
| 829 | + " Practices](https://pytorch.org/docs/stable/notes/cuda.html)\n", |
| 830 | + "- [Distributed Training\n", |
| 831 | + " Documentation](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)\n", |
| 832 | + "- [Mixed Precision Training](https://pytorch.org/docs/stable/amp.html)\n", |
| 833 | + "- [torch.compile\n", |
| 834 | + " Tutorial](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)\n" |
| 835 | + ] |
897 | 836 | } |
898 | 837 | ], |
899 | 838 | "metadata": { |
|