Commit 0974b34
Author: Vincent Moens
Commit message: init
1 parent cb3e0ac commit 0974b34

2 files changed: +334 -0

_static/img/pinmem.png (72 KB)

@@ -0,0 +1,334 @@
# -*- coding: utf-8 -*-
"""
A guide on good usage of `non_blocking` and `pin_memory()` in PyTorch
=====================================================================

TL;DR
-----

Sending tensors from CPU to GPU can be made faster by using asynchronous transfers and memory pinning, but:

- Calling `tensor.pin_memory().to(device, non_blocking=True)` can be up to twice as slow as a plain `tensor.to(device)`;
- `tensor.to(device, non_blocking=True)` is usually a good choice;
- `cpu_tensor.to("cuda", non_blocking=True).mean()` is fine, but `cuda_tensor.to("cpu", non_blocking=True).mean()` will produce garbage.

"""

import torch

assert torch.cuda.is_available(), "A CUDA device is required to run this tutorial"


######################################################################
# Introduction
# ------------
#
# Sending data from CPU to GPU is a cornerstone of many applications that use PyTorch.
# Given this, users should have a good understanding of what tools and options they should be using
# when moving data from one device to another.
#
# This tutorial focuses on two aspects of device-to-device transfer: `Tensor.pin_memory()` and `Tensor.to(device, non_blocking=True)`.
# We start by outlining the theory surrounding these concepts, and then move to concrete test examples of the features.
#
# - [Background](#background)
#   - [Memory management basics](#memory-management-basics)
#   - [CUDA and (non-)pageable memory](#cuda-and-non-pageable-memory)
#   - [Asynchronous vs synchronous operations](#asynchronous-vs-synchronous-operations)
# - [Deep dive](#deep-dive)
#   - [`pin_memory()`](#pin_memory)
#   - [`non_blocking=True`](#non_blockingtrue)
#   - [Synergies](#synergies)
#   - [Other directions (GPU -> CPU etc.)](#other-directions)
# - [Practical recommendations](#practical-recommendations)
# - [Case studies](#case-studies)
# - [Conclusion](#conclusion)
# - [Additional resources](#additional-resources)
#
#
# Background
# ----------
#
# Memory management basics
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
# in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
# We distinguish two types of memory that are handled by the Memory Management Unit: the RAM (for simplicity)
# and the swap space on disk (which may or may not be the hard drive). Together, the available space on disk and in RAM (physical memory)
# make up the virtual memory, which is an abstraction of the total resources available.
# In short, virtual memory makes the available space larger than what can be found in RAM alone
# and creates the illusion that the main memory is larger than it actually is.
#
# In normal circumstances, a regular CPU tensor is _paged_, which means that it is divided into blocks called _pages_ that
# can live anywhere in the virtual memory (in RAM or on disk). As mentioned earlier, this has the advantage that
# the memory seems larger than what the main memory actually is.
#
# Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
# this page back into RAM (_swap in_ or _page in_).
# In turn, the OS may have to _swap out_ (or _page out_) another page to make room for the new page.
#
# In contrast to pageable memory, _pinned_ (or _page-locked_ or _non-pageable_) memory is a type of memory that cannot be swapped out to disk.
# It allows for faster and more predictable access times, but has the downside that it is a more limited resource than
# pageable memory (aka the main memory).
#
# .. figure:: /_static/img/pinmem.png
#    :alt:
#
# CUDA and (non-)pageable memory
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# To understand how CUDA copies a tensor from CPU to CUDA, let's consider the two scenarios above:
#
# - If the memory is page-locked, the device can access the memory directly in the main memory. The memory addresses are well
#   defined and functions that need to read this data can be significantly accelerated.
# - If the memory is pageable, all the pages will have to be brought to the main memory before being sent to the GPU.
#   This operation may take time and is less predictable than when executed on page-locked tensors.
#
# More precisely, when CUDA sends pageable data from CPU to GPU, it must first create a page-locked copy of that data
# before making the transfer.
#
# Asynchronous vs. Synchronous Operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# When executing a copy from a host (e.g., CPU) to a device (e.g., GPU), the CUDA toolkit offers modalities to do these
# operations synchronously or asynchronously with respect to the host. In the synchronous case, the call to `cudaMemcpy`
# that is queued by `tensor.to(device)` is blocking in the Python main thread, which means that the code will stop until
# the data has been transferred to the device.
#
# When calling `tensor.to(device)`, PyTorch always makes a call to [`cudaMemcpyAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79). If `non_blocking=False` (the default), a `cudaStreamSynchronize` will be called after each and every `cudaMemcpyAsync`. If `non_blocking=True`, no synchronization is triggered, and the main thread on the host is not blocked.
# Therefore, from the host perspective, multiple tensors can be sent to the device simultaneously in the latter case, as the thread does not need to wait for one transfer to complete before initiating the next.
#
# .. note:: In general, the transfer is blocking on the device side even if it is not on the host side: the copy on the device cannot
#    occur while another operation is being executed. However, in some advanced scenarios, multiple copies or copy and kernel
#    executions can be done simultaneously on the GPU side. To enable this, three requirements must be met (see the sketch after this list):
#
#    1. The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volta,
#       Tesla or H100 devices have more than one DMA engine.
#
#    2. The transfer must be done on a separate, non-default CUDA stream. In PyTorch, CUDA streams can be handled using
#       `torch.cuda.Stream`.
#
#    3. The source data must be in pinned memory.
#
#
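# A minimal sketch of these requirements (an illustrative snippet, not a tuned overlap benchmark):
# the source tensor is pinned and the copy is queued on a dedicated, non-default stream.

copy_stream = torch.cuda.Stream()
pinned_source = torch.zeros(1_000_000, pin_memory=True)
with torch.cuda.stream(copy_stream):
    # The copy is enqueued on `copy_stream`, leaving the default stream free for other work
    # while a DMA engine performs the transfer.
    gpu_tensor = pinned_source.to("cuda:0", non_blocking=True)
copy_stream.synchronize()  # wait for the copy to finish before reading `gpu_tensor`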
######################################################################
# A PyTorch perspective
# ---------------------
#
# `pin_memory()`
# ~~~~~~~~~~~~~~
#
# PyTorch offers the possibility to create and send tensors to page-locked memory through the `pin_memory()` method and
# constructor arguments.
# Any CPU tensor on a machine where CUDA is initialized can be sent to pinned memory through the `pin_memory`
# method. Importantly, `pin_memory` is blocking on the host: the main thread will wait for the tensor to be copied to
# page-locked memory before executing the next operation.
# New tensors can be directly created in pinned memory by passing `pin_memory=True` to constructors like `torch.zeros`,
# `torch.ones` and other factory functions.
#
# Let us check the speed of pinning memory and sending tensors to CUDA:


import torch
import gc
from torch.utils.benchmark import Timer

tensor_pageable = torch.randn(100_000)

tensor_pinned = torch.randn(100_000, pin_memory=True)

print("Regular to(device)",
      Timer("tensor_pageable.to('cuda:0')", globals=globals()).adaptive_autorange())
print("Pinned to(device)",
      Timer("tensor_pinned.to('cuda:0')", globals=globals()).adaptive_autorange())
print("pin_memory() alone",
      Timer("tensor_pageable.pin_memory()", globals=globals()).adaptive_autorange())
print("pin_memory() + to(device)",
      Timer("tensor_pageable.pin_memory().to('cuda:0')", globals=globals()).adaptive_autorange())
del tensor_pageable, tensor_pinned
gc.collect()


######################################################################
# We can observe that copying a pinned-memory tensor to GPU is indeed much faster than copying a pageable tensor, because under the hood, a pageable tensor must be copied to pinned memory before being sent to GPU.
#
# However, calling `pin_memory()` on a pageable tensor before casting it to GPU does not bring any speed-up; on the contrary, this call is actually slower than just executing the transfer. Again, this makes sense, since we are asking Python to execute an operation that CUDA will perform anyway before copying the data from host to device.
#
# `non_blocking=True`
# ~~~~~~~~~~~~~~~~~~~
#
# As mentioned earlier, many PyTorch operations have the option of being executed asynchronously with respect to the host through the `non_blocking` argument.
# Here, to accurately account for the benefits of using `non_blocking`, we will design a slightly more involved experiment, since we want to assess how fast it is to send multiple tensors to GPU with and without setting `non_blocking=True`.
#


def copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0"))
    return result


def copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0", non_blocking=True))
    # We need to synchronize before accessing the results
    torch.cuda.synchronize()
    return result


tensors = [torch.randn(1000) for _ in range(1000)]
print("Call to `to(device)`", Timer("copy_to_device(*tensors)", globals=globals()).adaptive_autorange())
print("Call to `to(device, non_blocking=True)`", Timer("copy_to_device_nonblocking(*tensors)",
                                                       globals=globals()).adaptive_autorange())


######################################################################
# To get a better sense of what is happening here, let us profile these two functions:


from torch.profiler import profile, ProfilerActivity


def profile_mem(cmd):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        exec(cmd)
    print(cmd)
    print(prof.key_averages().table(row_limit=10))


print("Call to `to(device)`")
profile_mem("copy_to_device(*tensors)")
print("Call to `to(device, non_blocking=True)`")
profile_mem("copy_to_device_nonblocking(*tensors)")


######################################################################
# The results are without any doubt better when using `non_blocking=True`, as all transfers are initiated simultaneously on the host side.
# Note that, interestingly, `to("cuda")` actually performs the same asynchronous device copy as the call with `non_blocking=True`, with a synchronization point after each copy.
#
# The benefit will vary depending on the number and the size of the tensors as well as on the hardware being used.
#
# Synergies
# ~~~~~~~~~
#
# Now that we have established that transferring tensors already in pinned memory to GPU is faster than transferring them from pageable memory, and that asynchronous transfers are faster than synchronous ones, we can benchmark the various combinations at hand:


def pin_copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0"))
    return result


def pin_copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0", non_blocking=True))
    # We need to synchronize before accessing the results
    torch.cuda.synchronize()
    return result


print("\nCall to `pin_memory()` + `to(device)`")
print("pin_memory().to(device)",
      Timer("pin_copy_to_device(*tensors)", globals=globals()).adaptive_autorange())
print("pin_memory().to(device, non_blocking=True)",
      Timer("pin_copy_to_device_nonblocking(*tensors)",
            globals=globals()).adaptive_autorange())

print("\nCall to `to(device)`")
print("to(device)",
      Timer("copy_to_device(*tensors)", globals=globals()).adaptive_autorange())
print("to(device, non_blocking=True)",
      Timer("copy_to_device_nonblocking(*tensors)",
            globals=globals()).adaptive_autorange())

print("\nCall to `to(device)` from pinned tensors")
tensors_pinned = [torch.zeros(1000, pin_memory=True) for _ in range(1000)]
print("tensor_pinned.to(device)",
      Timer("copy_to_device(*tensors_pinned)", globals=globals()).adaptive_autorange())
print("tensor_pinned.to(device, non_blocking=True)",
      Timer("copy_to_device_nonblocking(*tensors_pinned)",
            globals=globals()).adaptive_autorange())

del tensors, tensors_pinned
gc.collect()


######################################################################
# Other directions (GPU -> CPU, CPU -> MPS, etc.)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# So far, we have assumed that doing asynchronous copies from CPU to GPU was safe.
# Indeed, it is a safe thing to do because CUDA will synchronize whenever it is needed to make sure that the data being read is not garbage.
# However, any other copy (e.g., from GPU to CPU) has no guarantee whatsoever that the copy will be completed when the data is read. In fact, if no explicit synchronization is done, the data on the host can be garbage:
#


tensor = torch.arange(1, 1_000_000, dtype=torch.double, device="cuda").expand(100, 999999).clone()
torch.testing.assert_close(tensor.mean(), torch.tensor(500_000, dtype=torch.double, device="cuda"))
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.testing.assert_close(cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double))
    print("No test failed with non_blocking")
except AssertionError:
    print(f"One test failed with non_blocking: {i}th assertion!")

try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.cuda.synchronize()
        torch.testing.assert_close(cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double))
    print("No test failed with synchronize")
except AssertionError:
    print(f"One test failed with synchronize: {i}th assertion!")


######################################################################
# The same observation could be made with copies from CPU to a non-CUDA device such as MPS.
#
# In summary, copying data from CPU to GPU is safe when using `non_blocking=True`, but for any other direction, `non_blocking=True` can still be used, provided that the user makes sure a device synchronization is executed before the data is accessed.
#
# Practical recommendations
# -------------------------
#
# We can now wrap up some early recommendations based on our observations:
# In general, `non_blocking=True` will provide a good transfer speed, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pinned memory manually from Python is a blocking operation on the host, and hence will annihilate much of the benefit of using `non_blocking=True` (since CUDA does the `pin_memory` transfer anyway). A minimal sketch of the resulting recommended pattern follows.
#
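# The snippet below illustrates this recommendation (the tensor shape is arbitrary and chosen
# purely for illustration):

batch = torch.randn(1024, 1024)                    # pageable CPU tensor, no manual pinning
batch_gpu = batch.to("cuda:0", non_blocking=True)  # asynchronous host-to-device copy
# Subsequent CUDA ops on the default stream are ordered after the copy, so this is safe:
mean_value = batch_gpu.mean()
torch.cuda.synchronize()  # only needed if the host must wait for the transfer (e.g., for timing)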
######################################################################
# One might now legitimately ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
#
# Additional considerations
# -------------------------
#
# PyTorch notably provides a `DataLoader` class whose constructor accepts a `pin_memory` argument.
# Given everything we have said so far about calls to `pin_memory`, how does the `DataLoader` manage to accelerate data transfers?
#
# The answer resides in the fact that the `DataLoader` reserves a separate thread to copy the data from pageable to pinned memory, thereby avoiding blocking the main thread with this operation. Consider the following example, where we send a list of tensors to CUDA after calling `pin_memory` on a separate thread (a sketch of this idea is shown below):
#
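# The sketch below only illustrates the idea of off-loading `pin_memory()` calls to worker
# threads; the helper is hypothetical and is not how `DataLoader` is implemented internally.

from concurrent.futures import ThreadPoolExecutor


def pin_on_threads_then_copy(tensors, max_workers=4):
    # Pin each tensor on a worker thread so the main thread is not blocked by the
    # pageable-to-page-locked copies.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pinned = list(executor.map(torch.Tensor.pin_memory, tensors))
    # The pinned tensors can now be sent to the device asynchronously.
    result = [t.to("cuda:0", non_blocking=True) for t in pinned]
    torch.cuda.synchronize()
    return result


copied = pin_on_threads_then_copy([torch.randn(1000) for _ in range(10)])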
######################################################################
# A more isolated example of this is the `TensorDict` primitive from the library of the same name: when calling `TensorDict.to(device)`, the default behaviour is to send the tensors to the device asynchronously and make a `device.synchronize()` call afterwards. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
# This can further speed up the copies as the following example shows:
#
# .. code-block:: bash
#
#    pip3 install git+https://github.com/pytorch/tensordict
#


from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer

td = TensorDict({str(i): torch.randn(1_000_000) for i in range(100)})

print(Timer("td.to('cuda:0', non_blocking=False)", globals=globals()).adaptive_autorange())
print(Timer("td.to('cuda:0')", globals=globals()).adaptive_autorange())
print(Timer("td.to('cuda:0', non_blocking=True, non_blocking_pin=True)", globals=globals()).adaptive_autorange())


######################################################################
# As a side note, it may be tempting to create everlasting buffers in pinned memory, copy tensors from pageable memory into these buffers, and use them as a shuttle before sending the data to GPU (a sketch of this pattern follows).
# Unfortunately, this does not speed up computation because the bottleneck of copying data to pinned memory is still present.
#
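# The following sketch makes the "shuttle buffer" pattern concrete (the buffer size and helper
# name are illustrative assumptions): even with a reusable pinned buffer, every pageable-to-pinned
# copy still pays the pinning cost on the host.

pinned_buffer = torch.empty(1_000_000, pin_memory=True)


def shuttle_to_gpu(pageable_tensor):
    # The copy into the pinned buffer is the same work CUDA would otherwise do internally.
    pinned_buffer.copy_(pageable_tensor)
    return pinned_buffer.to("cuda:0", non_blocking=True)


shuttled = shuttle_to_gpu(torch.randn(1_000_000))
torch.cuda.synchronize()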
######################################################################
# Another consideration is that transferring data that is stored on disk (in shared memory or files) to GPU will usually require the data to be copied to pinned memory (which lives in RAM) as an intermediate step.
#
# Using `non_blocking` in this context for large amounts of data may have devastating effects on RAM consumption. In practice, there is no silver bullet, and the performance of any combination of multithreaded `pin_memory` and `non_blocking` will depend on multiple factors such as the system being used, the OS, the hardware and the tasks being performed.
#
# Finally, creating a large number of tensors or a few large tensors in pinned memory will effectively reserve more RAM than pageable tensors would, thereby lowering the amount of RAM available for other operations (such as swapping pages in and out), which can have a negative impact on the overall runtime of an algorithm.

######################################################################
# Conclusion
# ----------
#
# Additional resources
# --------------------
#
