
Commit f9471ef

Vincent Moens authored (with janeyx99, mikaylagawarecki, shagunsodhani)

Apply suggestions from code review

Co-authored-by: Jane (Yuan) Xu <[email protected]>
Co-authored-by: mikaylagawarecki <[email protected]>
Co-authored-by: Shagun Sodhani <[email protected]>
1 parent 8aa882d commit f9471ef

File tree

1 file changed: +13 -12 lines changed


intermediate_source/pinmem_nonblock.py

Lines changed: 13 additions & 12 deletions
@@ -44,7 +44,7 @@
 #
 # - :ref:`pin_memory <pinmem_pinmem>`
 # - :ref:`non_blocking=True <pinmem_nb>`
-# - :ref:`Synergies <synergies>`
+# - :ref:`Synergies <pinmem_synergies>`
 # - :ref:`Other copy directions (GPU -> CPU) <pinmem_otherdir>`
 #
 # - :ref:`Practical recommendations <pinmem_recom>`
@@ -65,18 +65,18 @@
 #
 # When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
 # in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
-# We distinguish two types of memories that are handled by the Memory Management Unit: the main memory (for simplicity)
-# and the disk (which may or may not be the hard drive). Together, the available space in disk and RAM (physical memory)
+# We distinguish two types of memory that are handled by the Memory Management Unit: the main memory (for simplicity)
+# and the swap space on disk (which may or may not be the hard drive). Together, the available space in disk and RAM (physical memory)
 # make up the virtual memory, which is an abstraction of the total resources available.
 # In short, the virtual memory makes it so that the available space is larger than what can be found on RAM in isolation
 # and creates the illusion that the main memory is larger than it actually is.
 #
-# In normal circumstances, a regular CPU tensor is _paged_, which means that it is divided in blocks called _pages_ that
+# In normal circumstances, a regular CPU tensor is pageable which means that it is divided in blocks called pages that
 # can live anywhere in the virtual memory (both in RAM or on disk). As mentioned earlier, this has the advantage that
 # the memory seems larger than what the main memory actually is.
 #
 # Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
-# back this page into RAM (_swap in_ or _page in_).
+# back this page into RAM ("swap in" or "page in").
 # In turn, the OS may have to _swap out_ (or _page out_) another page to make room for the new page.
 #
 # In contrast to pageable memory, a _pinned_ (or _page-locked_ or _non-pageable_) memory is a type of memory that cannot
@@ -93,6 +93,7 @@
 # .. _pinmem_cuda_pageable_mem:
 #
 # To understand how CUDA copies a tensor from CPU to CUDA, let's consider the two scenarios above:
+#
 # - If the memory is page-locked, the device can access the memory directly in the main memory. The memory addresses are well
 #   defined and functions that need to read these data can be significantly accelerated.
 # - If the memory is pageable, all the pages will have to be brought to the main memory before being sent to the GPU.
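For reference, a minimal sketch (not part of this commit) contrasting the two scenarios above; it assumes a machine with a CUDA device, and the tensor size and use of torch.utils.benchmark are illustrative only:

    import torch
    from torch.utils.benchmark import Timer

    pageable = torch.randn(1_000_000)             # regular, pageable CPU tensor
    pinned = torch.randn(1_000_000).pin_memory()  # page-locked CPU tensor

    # Copying from pinned memory lets the device read the host buffer directly,
    # whereas the pageable copy first goes through an intermediate staging buffer.
    t_pageable = Timer("pageable.to('cuda')", globals={"pageable": pageable}).blocked_autorange()
    t_pinned = Timer("pinned.to('cuda')", globals={"pinned": pinned}).blocked_autorange()
    print(t_pageable.median, t_pinned.median)
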
@@ -143,7 +144,7 @@
 #
 # PyTorch offers the possibility to create and send tensors to page-locked memory through the
 # :meth:`~torch.Tensor.pin_memory` method and constructor arguments.
-# Cpu tensors on a machine where a cuda is initialized can be cast to pinned memory through the :meth:`~torch.Tensor.pin_memory`
+# CPU tensors on a machine where CUDA is initialized can be cast to pinned memory through the :meth:`~torch.Tensor.pin_memory`
 # method. Importantly, ``pin_memory`` is blocking on the main thread of the host: it will wait for the tensor to be copied to
 # page-locked memory before executing the next operation.
 # New tensors can be directly created in pinned memory with functions like :func:`~torch.zeros`, :func:`~torch.ones` and other
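As a quick illustration of the two APIs this hunk refers to (a sketch, assuming CUDA is available on the machine):

    import torch

    # Cast an existing CPU tensor to page-locked memory; this call blocks the host
    # until the content has been copied into pinned memory.
    t = torch.randn(1_000_000)
    t_pinned = t.pin_memory()
    print(t_pinned.is_pinned())  # True

    # Allocate a new tensor directly in pinned memory via the constructor argument.
    z = torch.zeros(1_000_000, pin_memory=True)
    print(z.is_pinned())  # True
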
@@ -299,7 +300,7 @@ def profile_mem(cmd):
 print("Call to `to(device)`", profile_mem("copy_to_device(*tensors)"))
 
 ######################################################################
-# and now the ``non_blocing`` version:
+# and now the ``non_blocking`` version:
 #
 
 print(
@@ -316,7 +317,7 @@ def profile_mem(cmd):
 # used.
 #
 # .. note:: Interestingly, the blocking ``to("cuda")`` actually performs the same asynchronous device casting operation
-#    (``cudaMemcpyAsync``) as the one with ```non_blocking=True`` with a synchronization point after each copy.
+#    (``cudaMemcpyAsync``) as the one with ``non_blocking=True`` with a synchronization point after each copy.
 #
 # Synergies
 # ~~~~~~~~~
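To make the note about ``cudaMemcpyAsync`` concrete, here is a hedged sketch (not from the diff) of a non-blocking transfer followed by an explicit synchronization, assuming a CUDA device:

    import torch

    t = torch.randn(1_000_000).pin_memory()  # a pinned source allows a truly asynchronous copy

    # Queue the host-to-device copy on the current CUDA stream and return immediately.
    t_gpu = t.to("cuda", non_blocking=True)

    # Block the host until all queued work, including the copy, has completed.
    torch.cuda.synchronize()
    print(t_gpu.device)
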
@@ -429,7 +430,7 @@ def pin_copy_to_device_nonblocking(*tensors):
     )
     print("No test failed with non_blocking")
 except AssertionError:
-    print(f"One test failed with non_blocking: {i}th assertion!")
+    print(f"{i}th test failed with non_blocking. Skipping remaining tests")
 try:
     i = -1
     for i in range(100):
@@ -459,7 +460,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # We can now wrap up some early recommendations based on our observations:
 #
-# In general, ``non_blocking=True`` will provide a good throughput, regardless of whether the original tensor is or
+# In general, ``non_blocking=True`` will provide good throughput, regardless of whether the original tensor is or
 # isn't in pinned memory.
 # If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to
 # pin memory manually from python main thread is a blocking operation on the host, and hence will annihilate much of
@@ -473,7 +474,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # .. _pinmem_considerations:
 #
-# PyTorch notoriously provides a :class:`~torch.utils.data.DataLoader` class which constructor accepts a
+# PyTorch notoriously provides a :class:`~torch.utils.data.DataLoader` class whose constructor accepts a
 # ``pin_memory`` argument.
 # Considering our previous discussion on ``pin_memory``, you might wonder how the ``DataLoader`` manages to
 # accelerate data transfers if memory pinning is inherently blocking.
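A small sketch of the ``DataLoader`` usage being discussed (illustrative; the dataset, batch size, and worker count are placeholders):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(10_000, 128))

    # With pin_memory=True the DataLoader places each fetched batch in page-locked
    # memory using a dedicated pinning thread, so the main thread is not blocked.
    loader = DataLoader(dataset, batch_size=256, num_workers=2, pin_memory=True)

    for (batch,) in loader:
        batch = batch.to("cuda", non_blocking=True)  # asynchronous host-to-device copy
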
@@ -498,7 +499,7 @@ def pin_copy_to_device_nonblocking(*tensors):
 from tensordict import TensorDict
 import torch
 from torch.utils.benchmark import Timer
-
+import matplotlib.pyplot as plt
 # Create the dataset
 td = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})
 