#
# - :ref:`pin_memory <pinmem_pinmem>`
# - :ref:`non_blocking=True <pinmem_nb>`
- # - :ref:`Synergies <synergies>`
+ # - :ref:`Synergies <pinmem_synergies>`
# - :ref:`Other copy directions (GPU -> CPU) <pinmem_otherdir>`
#
# - :ref:`Practical recommendations <pinmem_recom>`

#
# When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
# in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
- # We distinguish two types of memories that are handled by the Memory Management Unit: the main memory (for simplicity)
- # and the disk (which may or may not be the hard drive). Together, the available space in disk and RAM (physical memory)
+ # We distinguish two types of memory that are handled by the Memory Management Unit: the main memory (for simplicity)
+ # and the swap space on disk (which may or may not be the hard drive). Together, the available space on disk and in RAM (physical memory)
# make up the virtual memory, which is an abstraction of the total resources available.
# In short, the virtual memory makes it so that the available space is larger than what can be found in RAM alone
# and creates the illusion that the main memory is larger than it actually is.
#
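######################################################################
# As a small optional aside, assuming the third-party ``psutil`` package is available, the two pools
# that back the virtual memory can be inspected directly:

import psutil

# Physical memory (RAM) and swap space, reported in gigabytes
print(f"RAM:  {psutil.virtual_memory().total / 1e9:.1f} GB")
print(f"Swap: {psutil.swap_memory().total / 1e9:.1f} GB")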
- # In normal circumstances, a regular CPU tensor is _paged_, which means that it is divided in blocks called _pages_ that
+ # In normal circumstances, a regular CPU tensor is pageable, which means that it is divided into blocks called pages that
# can live anywhere in the virtual memory (either in RAM or on disk). As mentioned earlier, this has the advantage that
# the memory seems larger than what the main memory actually is.
#
# Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
- # back this page into RAM (_swap in_ or _page in_ ).
+ # back this page into RAM ("swap in" or "page in").
# In turn, the OS may have to _swap out_ (or _page out_) another page to make room for the new page.
#
# In contrast to pageable memory, a _pinned_ (or _page-locked_ or _non-pageable_) memory is a type of memory that cannot

# .. _pinmem_cuda_pageable_mem:
#
# To understand how CUDA copies a tensor from CPU to CUDA, let's consider the two scenarios above:
+ #
# - If the memory is page-locked, the device can access the memory directly in the main memory. The memory addresses are well
#   defined and functions that need to read these data can be significantly accelerated.
# - If the memory is pageable, all the pages will have to be brought to the main memory before being sent to the GPU.
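######################################################################
# As a rough illustration of the two scenarios above, one can time a host-to-device copy from a
# regular (pageable) tensor against one from a pinned tensor. This is only a minimal sketch: the
# tensor size is arbitrary and the measured gap depends on the hardware.

import torch
from torch.utils.benchmark import Timer

if torch.cuda.is_available():
    pageable = torch.randn(1_000_000)              # regular, pageable CPU tensor
    pinned = torch.randn(1_000_000).pin_memory()   # page-locked CPU tensor
    for name, t in [("pageable", pageable), ("pinned", pinned)]:
        timer = Timer(
            stmt="t.to('cuda', non_blocking=True); torch.cuda.synchronize()",
            globals={"t": t, "torch": torch},
        )
        print(name, timer.blocked_autorange().median)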

#
# PyTorch offers the possibility to create and send tensors to page-locked memory through the
# :meth:`~torch.Tensor.pin_memory` method and constructor arguments.
- # Cpu tensors on a machine where a cuda is initialized can be cast to pinned memory through the :meth:`~torch.Tensor.pin_memory`
+ # CPU tensors on a machine where CUDA is initialized can be cast to pinned memory through the :meth:`~torch.Tensor.pin_memory`
# method. Importantly, ``pin_memory`` is blocking on the main thread of the host: it will wait for the tensor to be copied to
# page-locked memory before executing the next operation.
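######################################################################
# For example, a plain CPU tensor can be moved to page-locked memory as follows (a minimal sketch;
# the shape is arbitrary and :meth:`~torch.Tensor.is_pinned` is only used to check the result):

import torch

if torch.cuda.is_available():
    t = torch.randn(1024, 1024)   # regular, pageable CPU tensor
    print(t.is_pinned())          # False
    t_pinned = t.pin_memory()     # blocking call: copies the data into page-locked memory
    print(t_pinned.is_pinned())   # True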
# New tensors can be directly created in pinned memory with functions like :func:`~torch.zeros`, :func:`~torch.ones` and other
@@ -299,7 +300,7 @@ def profile_mem(cmd):
print("Call to `to(device)`", profile_mem("copy_to_device(*tensors)"))

######################################################################
- # and now the ``non_blocing`` version:
+ # and now the ``non_blocking`` version:
#

print(
@@ -316,7 +317,7 @@ def profile_mem(cmd):
# used.
#
# .. note:: Interestingly, the blocking ``to("cuda")`` actually performs the same asynchronous device casting operation
- # (``cudaMemcpyAsync``) as the one with ``` non_blocking=True`` with a synchronization point after each copy.
+ # (``cudaMemcpyAsync``) as the one with ``non_blocking=True``, followed by a synchronization point after each copy.
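######################################################################
# In practice this means that with ``non_blocking=True`` the copy is queued and the call returns
# immediately: later CUDA work on the same stream is still ordered after the copy, but for host-side
# timing, or before reading the data back on the CPU, an explicit synchronization is needed. A
# minimal sketch (the tensor size is arbitrary):

import torch

if torch.cuda.is_available():
    src = torch.randn(1_000_000).pin_memory()
    dst = src.to("cuda", non_blocking=True)  # returns as soon as the copy is queued
    torch.cuda.synchronize()                 # wait for the copy before timing or inspecting the result
    print(dst.sum())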
#
# Synergies
# ~~~~~~~~~
@@ -429,7 +430,7 @@ def pin_copy_to_device_nonblocking(*tensors):
        )
    print("No test failed with non_blocking")
except AssertionError:
-     print(f"One test failed with non_blocking: {i}th assertion!")
+     print(f"{i}th test failed with non_blocking. Skipping remaining tests")
try:
    i = -1
    for i in range(100):
@@ -459,7 +460,7 @@ def pin_copy_to_device_nonblocking(*tensors):
#
# We can now wrap up some early recommendations based on our observations:
#
- # In general, ``non_blocking=True`` will provide a good throughput, regardless of whether the original tensor is or
+ # In general, ``non_blocking=True`` will provide good throughput, regardless of whether the original tensor is or
# isn't in pinned memory.
# If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to
# pinned memory manually from the Python main thread is a blocking operation on the host, and hence will annihilate much of
@@ -473,7 +474,7 @@ def pin_copy_to_device_nonblocking(*tensors):
#
# .. _pinmem_considerations:
#
- # PyTorch notoriously provides a :class:`~torch.utils.data.DataLoader` class which constructor accepts a
+ # PyTorch notoriously provides a :class:`~torch.utils.data.DataLoader` class whose constructor accepts a
# ``pin_memory`` argument.
# Considering our previous discussion on ``pin_memory``, you might wonder how the ``DataLoader`` manages to
# accelerate data transfers if memory pinning is inherently blocking.
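######################################################################
# The short answer is that the pinning is not done on the main thread: when ``pin_memory=True``, the
# ``DataLoader`` pins each batch in a background thread, so the training loop only has to issue the
# cheap ``non_blocking=True`` device copy. The usual pattern looks roughly like the sketch below,
# where the dataset, batch size, and worker count are placeholder values:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 128))
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

if torch.cuda.is_available():
    for (batch,) in loader:
        # batches arrive already pinned, so this copy can be asynchronous
        batch = batch.to("cuda", non_blocking=True)
        ...  # forward / backward pass would go here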
@@ -498,7 +499,7 @@ def pin_copy_to_device_nonblocking(*tensors):
from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer
-
+ import matplotlib.pyplot as plt

# Create the dataset
td = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})
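######################################################################
# From here, the different transfer strategies can be compared on the whole ``TensorDict`` just
# created, reusing the ``Timer`` import above. The sketch below is illustrative rather than the
# tutorial's exact benchmark: the labels and statements are placeholders, and pinning inside the
# timed statement deliberately includes the cost of the blocking ``pin_memory()`` call itself.

if torch.cuda.is_available():
    strategies = {
        "to('cuda')": "td.to('cuda')",
        "to('cuda', non_blocking=True)": "td.to('cuda', non_blocking=True); torch.cuda.synchronize()",
        "pin_memory() then non_blocking": "td.pin_memory().to('cuda', non_blocking=True); torch.cuda.synchronize()",
    }
    for label, stmt in strategies.items():
        timer = Timer(stmt=stmt, globals={"td": td, "torch": torch})
        print(label, timer.blocked_autorange().median)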