@@ -1,12 +1,13 @@
 Asynchronous Saving with Distributed Checkpoint (DCP)
 =====================================================
 
+**Author:** `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
+
 Checkpointing is often a bottle-neck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow.
 One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example
 from the `Getting Started with Distributed Checkpoint Tutorial <https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst>`__
 to show how this can be integrated quite easily with ``torch.distributed.checkpoint.async_save``.
 
-**Author**: , `Lucas Pasqualin <https://github.com/lucasllc>`__, `Iris Zhang <https://github.com/wz337>`__, `Rodrigo Kumpera <https://github.com/kumpera>`__, `Chien-Chin Huang <https://github.com/fegin>`__
 
 .. grid:: 2
 
@@ -156,9 +157,12 @@ If the above optimization is still not performant enough, you can take advantage
 Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copying to checkpointing buffers. By maintaining a pinned memory buffer between
 checkpoint requests users can take advantage of direct memory access to speed up this copy.
 
-.. note:: The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above),
-any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same
-peak memory pressure being sustained through the application life.
+.. note::
+   The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without
+   the pinned memory optimization (as demonstrated above), any checkpointing buffers are released as soon as
+   checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps,
+   leading to the same
+   peak memory pressure being sustained through the application life.
 
 
 .. code-block:: python
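The tutorial this diff edits is built around ``torch.distributed.checkpoint.async_save``, which blocks only for the in-memory copy of the state dict and persists it to storage in the background, returning a future. As a rough mental model only — this is a stdlib-only toy, not DCP, and ``toy_async_save`` is a hypothetical name invented here — the pattern can be sketched like this:

```python
import copy
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# One background worker, analogous to the checkpointing thread/process
# that dcp.async_save hands the persistence work to.
_executor = ThreadPoolExecutor(max_workers=1)


def toy_async_save(state_dict, path):
    """Toy model of asynchronous checkpointing (NOT the DCP API):
    block only for the in-memory snapshot, then persist it from a
    background thread and return a future."""
    snapshot = copy.deepcopy(state_dict)  # the blocking in-memory copy

    def _persist():
        with open(path, "w") as f:
            json.dump(snapshot, f)  # slow storage I/O happens off the critical path
        return path

    return _executor.submit(_persist)  # training can resume immediately


# Usage: as in the tutorial's pattern, wait on the previous future
# before issuing the next checkpoint request.
ckpt = os.path.join(tempfile.mkdtemp(), "step10.json")
future = toy_async_save({"step": 10, "loss": 0.25}, ckpt)
future.result()  # ensure the prior checkpoint finished before the next one
```

The pinned-memory variant discussed in the second hunk changes one thing in this picture: instead of allocating a fresh snapshot buffer per request, a persistent (pinned) buffer is reused across checkpoints to speed up the copy, at the cost of holding that buffer's memory for the life of the application.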