
Commit 4fe1b2d
Author: Vincent Moens
Parent: 72f8951

    spelling

2 files changed, +15 -8 lines

en-wordlist.txt

Lines changed: 9 additions & 3 deletions
@@ -1,3 +1,4 @@
+
 ACL
 ADI
 AOT
@@ -50,6 +51,7 @@ DDP
 DDPG
 DDQN
 DLRM
+DMA
 DNN
 DQN
 DataLoaders
@@ -139,6 +141,7 @@ MKLDNN
 MLP
 MLPs
 MNIST
+MPS
 MUC
 MacBook
 MacOS
@@ -219,6 +222,7 @@ STR
 SVE
 SciPy
 Sequentials
+Sharding
 Sigmoid
 SoTA
 Sohn
@@ -254,6 +258,7 @@ VLDB
 VQA
 VS Code
 ViT
+Volterra
 WMT
 WSI
 WSIs
@@ -336,11 +341,11 @@ dataset’s
 deallocation
 decompositions
 decorrelated
-devicemesh
 deserialize
 deserialized
 desynchronization
 deterministically
+devicemesh
 dimensionality
 dir
 discontiguous
@@ -384,6 +389,7 @@ hessian
 hessians
 histoencoder
 histologically
+homonymous
 hotspot
 hvp
 hyperparameter
@@ -459,6 +465,7 @@ optimizer's
 optimizers
 otsu
 overfitting
+pageable
 parallelizable
 parallelization
 parametrization
@@ -522,7 +529,6 @@ runtime
 runtimes
 scalable
 sharded
-Sharding
 softmax
 sparsified
 sparsifier
@@ -609,4 +615,4 @@ warmstarting
 warmup
 webp
 wsi
-wsis
+wsis

intermediate_source/pinmem_nonblock.py

Lines changed: 6 additions & 5 deletions
@@ -10,7 +10,8 @@
 
 - Calling `tensor.pin_memory().to(device, non_blocking=True)` can be as twice as slow as a plain `tensor.to(device)`;
 - `tensor.to(device, non_blocking=True)` is usually a good choice;
-- `cpu_tensor.to("cuda", non_blocking=True).mean()` is ok, but `cuda_tensor.to("cpu", non_blocking=True).mean()` will produce garbage.
+- `cpu_tensor.to("cuda", non_blocking=True).mean()` will work, but `cuda_tensor.to("cpu", non_blocking=True).mean()`
+  will produce garbage.
 
 """
 
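As a side note to the reworded bullet, here is a minimal sketch (illustrative only, not part of this diff, and assuming a CUDA device): a CPU-to-CUDA non_blocking copy can be consumed directly by downstream CUDA ops, while a CUDA-to-CPU non_blocking copy must be synchronized before its values are read.

import torch

if torch.cuda.is_available():
    cpu_tensor = torch.ones(1024)

    # Host-to-device: the copy is stream-ordered, so CUDA ops that follow see the right data.
    cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
    print(cuda_tensor.mean())

    # Device-to-host: the call returns before the copy finishes; reading now may yield garbage.
    back_on_cpu = cuda_tensor.to("cpu", non_blocking=True)
    torch.cuda.synchronize()   # wait for the device-to-host copy to complete
    print(back_on_cpu.mean())  # safe only after the synchronization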

@@ -196,7 +197,7 @@ def profile_mem(cmd):
 
 ######################################################################
 # The results are without any doubt better when using `non_blocking=True`, as all transfers are initiated simultaneously on the host side.
-# Note that, interestingly, `to("cuda")` actually performs the same asynchrous device casting operation as the one with `non_blocking=True` with a synchronization point after each copy.
+# Note that, interestingly, `to("cuda")` actually performs the same asynchronous device casting operation as the one with `non_blocking=True` with a synchronization point after each copy.
 #
 # The benefit will vary depending on the number and the size of the tensors as well as depending on the hardware being used.
 #
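To make the comparison in this hunk concrete, here is a hedged micro-benchmark sketch (not part of the commit; `copy_sync` and `copy_async` are illustrative names) contrasting the per-copy synchronization of plain `to("cuda")` with a single synchronization after all non_blocking copies are queued.

import torch
from torch.utils.benchmark import Timer

tensors = [torch.randn(1024, 1024) for _ in range(10)]

def copy_sync():
    # Plain `to("cuda")`: the host synchronizes after each copy.
    return [t.to("cuda") for t in tensors]

def copy_async():
    # Queue every copy first, then synchronize once.
    out = [t.to("cuda", non_blocking=True) for t in tensors]
    torch.cuda.synchronize()
    return out

if torch.cuda.is_available():
    print(Timer("copy_sync()", globals=globals()).blocked_autorange())
    print(Timer("copy_async()", globals=globals()).blocked_autorange())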
@@ -286,9 +287,9 @@ def pin_copy_to_device_nonblocking(*tensors):
 # -------------------------
 #
 # We can now wrap up some early recommendations based on our observations:
-# In general, `non_blocking=True` will provide a good speed of transfer, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pin memory manually is a blocking operation on the host and hence will anihilate much of the benefit of using `non_blocking=True` (and CUDA does the `pin_memory` transfer anyway).
+# In general, `non_blocking=True` will provide a good speed of transfer, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pin memory manually is a blocking operation on the host and hence will annihilate much of the benefit of using `non_blocking=True` (and CUDA does the `pin_memory` transfer anyway).
 #
-# One might now legitimetely ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
+# One might now legitimately ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
 #
 # Additional considerations
 # -------------------------
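A short sketch (illustrative only, assuming a CUDA device; the function names are hypothetical) of the two patterns weighed in the recommendation above:

import torch

def transfer(tensors):
    # Recommended default: rely on non_blocking=True and synchronize once at the end.
    out = [t.to("cuda", non_blocking=True) for t in tensors]
    torch.cuda.synchronize()
    return out

def transfer_with_manual_pin(tensors):
    # Each pin_memory() call blocks the host, which offsets most of the async benefit.
    out = [t.pin_memory().to("cuda", non_blocking=True) for t in tensors]
    torch.cuda.synchronize()
    return out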
@@ -298,7 +299,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # The answer is resides in the fact that the dataloader reserves a separate thread to copy the data from pageable to pinned memory, thereby avoiding to block the main thread with this. Consider the following example, where we send a list of tensors to cuda after calling pin_memory on a separate thread:
 #
-# A more isolated example of this is the TensorDict primitive from the homonymous library: when calling `TensorDict.to(device)`, the default behaviour is to send these tensors to the device asynchronously and make a `device.synchronize()` call after. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
+# A more isolated example of this is the TensorDict primitive from the homonymous library: when calling `TensorDict.to(device)`, the default behavior is to send these tensors to the device asynchronously and make a `device.synchronize()` call after. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
 # This can further speed up the copies as the following example shows:
 #
 # .. code-block:: bash
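For context on the two mechanisms this hunk mentions, here is a hedged sketch (not part of the commit; it assumes a CUDA device and the `tensordict` package) showing the DataLoader's background pinning thread and TensorDict's `non_blocking_pin` option.

import torch
from torch.utils.data import DataLoader, TensorDataset
from tensordict import TensorDict

# DataLoader: pin_memory=True copies each batch to pinned memory on a side thread,
# so the main thread is free to launch the non_blocking transfer to the device.
loader = DataLoader(TensorDataset(torch.randn(256, 32)), batch_size=64, pin_memory=True)
for (batch,) in loader:
    batch = batch.to("cuda", non_blocking=True)

# TensorDict: .to(device) sends the tensors asynchronously and synchronizes afterwards;
# non_blocking_pin=True pins the tensors in worker threads before the copies are launched.
td = TensorDict({"a": torch.randn(256, 32), "b": torch.randn(256, 16)}, batch_size=[256])
td_cuda = td.to("cuda")
td_cuda_pinned = td.to("cuda", non_blocking_pin=True)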
