
Commit 4fe1b2d
Author: Vincent Moens
Parent: 72f8951

    spelling

2 files changed, +15 -8 lines

en-wordlist.txt

Lines changed: 9 additions & 3 deletions
@@ -1,3 +1,4 @@
+
 ACL
 ADI
 AOT
@@ -50,6 +51,7 @@ DDP
 DDPG
 DDQN
 DLRM
+DMA
 DNN
 DQN
 DataLoaders
@@ -139,6 +141,7 @@ MKLDNN
 MLP
 MLPs
 MNIST
+MPS
 MUC
 MacBook
 MacOS
@@ -219,6 +222,7 @@ STR
 SVE
 SciPy
 Sequentials
+Sharding
 Sigmoid
 SoTA
 Sohn
@@ -254,6 +258,7 @@ VLDB
 VQA
 VS Code
 ViT
+Volterra
 WMT
 WSI
 WSIs
@@ -336,11 +341,11 @@ dataset’s
 deallocation
 decompositions
 decorrelated
-devicemesh
 deserialize
 deserialized
 desynchronization
 deterministically
+devicemesh
 dimensionality
 dir
 discontiguous
@@ -384,6 +389,7 @@ hessian
 hessians
 histoencoder
 histologically
+homonymous
 hotspot
 hvp
 hyperparameter
@@ -459,6 +465,7 @@ optimizer's
 optimizers
 otsu
 overfitting
+pageable
 parallelizable
 parallelization
 parametrization
@@ -522,7 +529,6 @@ runtime
 runtimes
 scalable
 sharded
-Sharding
 softmax
 sparsified
 sparsifier
@@ -609,4 +615,4 @@ warmstarting
 warmup
 webp
 wsi
-wsis
+wsis

intermediate_source/pinmem_nonblock.py

Lines changed: 6 additions & 5 deletions
@@ -10,7 +10,8 @@
 
 - Calling `tensor.pin_memory().to(device, non_blocking=True)` can be as twice as slow as a plain `tensor.to(device)`;
 - `tensor.to(device, non_blocking=True)` is usually a good choice;
-- `cpu_tensor.to("cuda", non_blocking=True).mean()` is ok, but `cuda_tensor.to("cpu", non_blocking=True).mean()` will produce garbage.
+- `cpu_tensor.to("cuda", non_blocking=True).mean()` will work, but `cuda_tensor.to("cpu", non_blocking=True).mean()`
+  will produce garbage.
 
 """
 
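As a side note to the reworded bullet, here is a minimal sketch (illustrative only, not part of this diff, and assuming a CUDA device): a CPU-to-CUDA non_blocking copy can be consumed directly by downstream CUDA ops, while a CUDA-to-CPU non_blocking copy must be synchronized before its values are read.

import torch

if torch.cuda.is_available():
    cpu_tensor = torch.ones(1024)

    # Host-to-device: the copy is stream-ordered, so CUDA ops that follow see the right data.
    cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
    print(cuda_tensor.mean())

    # Device-to-host: the call returns before the copy finishes; reading now may yield garbage.
    back_on_cpu = cuda_tensor.to("cpu", non_blocking=True)
    torch.cuda.synchronize()   # wait for the device-to-host copy to complete
    print(back_on_cpu.mean())  # safe only after the synchronization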

@@ -196,7 +197,7 @@ def profile_mem(cmd):
 
 ######################################################################
 # The results are without any doubt better when using `non_blocking=True`, as all transfers are initiated simultaneously on the host side.
-# Note that, interestingly, `to("cuda")` actually performs the same asynchrous device casting operation as the one with `non_blocking=True` with a synchronization point after each copy.
+# Note that, interestingly, `to("cuda")` actually performs the same asynchronous device casting operation as the one with `non_blocking=True` with a synchronization point after each copy.
 #
 # The benefit will vary depending on the number and the size of the tensors as well as depending on the hardware being used.
 #
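To make the comparison in this hunk concrete, here is a hedged micro-benchmark sketch (not part of the commit; `copy_sync` and `copy_async` are illustrative names) contrasting the per-copy synchronization of plain `to("cuda")` with a single synchronization after all non_blocking copies are queued.

import torch
from torch.utils.benchmark import Timer

tensors = [torch.randn(1024, 1024) for _ in range(10)]

def copy_sync():
    # Plain `to("cuda")`: the host synchronizes after each copy.
    return [t.to("cuda") for t in tensors]

def copy_async():
    # Queue every copy first, then synchronize once.
    out = [t.to("cuda", non_blocking=True) for t in tensors]
    torch.cuda.synchronize()
    return out

if torch.cuda.is_available():
    print(Timer("copy_sync()", globals=globals()).blocked_autorange())
    print(Timer("copy_async()", globals=globals()).blocked_autorange())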
@@ -286,9 +287,9 @@ def pin_copy_to_device_nonblocking(*tensors):
 # -------------------------
 #
 # We can now wrap up some early recommendations based on our observations:
-# In general, `non_blocking=True` will provide a good speed of transfer, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pin memory manually is a blocking operation on the host and hence will anihilate much of the benefit of using `non_blocking=True` (and CUDA does the `pin_memory` transfer anyway).
+# In general, `non_blocking=True` will provide a good speed of transfer, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pin memory manually is a blocking operation on the host and hence will annihilate much of the benefit of using `non_blocking=True` (and CUDA does the `pin_memory` transfer anyway).
 #
-# One might now legitimetely ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
+# One might now legitimately ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
 #
 # Additional considerations
 # -------------------------
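A short sketch (illustrative only, assuming a CUDA device; the function names are hypothetical) of the two patterns weighed in the recommendation above:

import torch

def transfer(tensors):
    # Recommended default: rely on non_blocking=True and synchronize once at the end.
    out = [t.to("cuda", non_blocking=True) for t in tensors]
    torch.cuda.synchronize()
    return out

def transfer_with_manual_pin(tensors):
    # Each pin_memory() call blocks the host, which offsets most of the async benefit.
    out = [t.pin_memory().to("cuda", non_blocking=True) for t in tensors]
    torch.cuda.synchronize()
    return out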
@@ -298,7 +299,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # The answer is resides in the fact that the dataloader reserves a separate thread to copy the data from pageable to pinned memory, thereby avoiding to block the main thread with this. Consider the following example, where we send a list of tensors to cuda after calling pin_memory on a separate thread:
 #
-# A more isolated example of this is the TensorDict primitive from the homonymous library: when calling `TensorDict.to(device)`, the default behaviour is to send these tensors to the device asynchronously and make a `device.synchronize()` call after. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
+# A more isolated example of this is the TensorDict primitive from the homonymous library: when calling `TensorDict.to(device)`, the default behavior is to send these tensors to the device asynchronously and make a `device.synchronize()` call after. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
 # This can further speed up the copies as the following example shows:
 #
 # .. code-block:: bash
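For context on the two mechanisms this hunk mentions, here is a hedged sketch (not part of the commit; it assumes a CUDA device and the `tensordict` package) showing the DataLoader's background pinning thread and TensorDict's `non_blocking_pin` option.

import torch
from torch.utils.data import DataLoader, TensorDataset
from tensordict import TensorDict

# DataLoader: pin_memory=True copies each batch to pinned memory on a side thread,
# so the main thread is free to launch the non_blocking transfer to the device.
loader = DataLoader(TensorDataset(torch.randn(256, 32)), batch_size=64, pin_memory=True)
for (batch,) in loader:
    batch = batch.to("cuda", non_blocking=True)

# TensorDict: .to(device) sends the tensors asynchronously and synchronizes afterwards;
# non_blocking_pin=True pins the tensors in worker threads before the copies are launched.
td = TensorDict({"a": torch.randn(256, 32), "b": torch.randn(256, 16)}, batch_size=[256])
td_cuda = td.to("cuda")
td_cuda_pinned = td.to("cuda", non_blocking_pin=True)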
