# -*- coding: utf-8 -*-
"""
A guide on good usage of `non_blocking` and `pin_memory()` in PyTorch
======================================================================

TL;DR
-----

Sending tensors from CPU to GPU can be made faster by using asynchronous transfers and memory pinning, but:

- Calling `tensor.pin_memory().to(device, non_blocking=True)` can be up to twice as slow as a plain `tensor.to(device)`;
- `tensor.to(device, non_blocking=True)` is usually a good choice;
- `cpu_tensor.to("cuda", non_blocking=True).mean()` is fine, but `cuda_tensor.to("cpu", non_blocking=True).mean()` will produce garbage.

"""

import torch
assert torch.cuda.is_available(), "A cuda device is required to run this tutorial"


######################################################################
# Introduction
# ------------
#
# Sending data from CPU to GPU is a cornerstone of many applications that use PyTorch.
# Given this, users should have a good understanding of what tools and options they should be using
# when moving data from one device to another.
#
# This tutorial focuses on two aspects of device-to-device transfer: `Tensor.pin_memory()` and `Tensor.to(device, non_blocking=True)`.
# We start by outlining the theory surrounding these concepts, and then move to concrete test examples of the features.
#
# - [Background](#background)
#   - [Memory management basics](#memory-management-basics)
#   - [CUDA and (non-)pageable memory](#cuda-and-non-pageable-memory)
#   - [Asynchronous vs synchronous operations](#asynchronous-vs-synchronous-operations)
# - [Deep dive](#deep-dive)
#   - [`pin_memory()`](#pin_memory)
#   - [`non_blocking=True`](#non_blockingtrue)
#   - [Synergies](#synergies)
#   - [Other directions (GPU -> CPU etc.)](#other-directions)
# - [Practical recommendations](#practical-recommendations)
# - [Case studies](#case-studies)
# - [Conclusion](#conclusion)
# - [Additional resources](#additional-resources)
#
#
# Background
# ----------
#
# Memory management basics
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
# in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
# We distinguish two types of memory that are handled by the Memory Management Unit: the main memory (RAM, for simplicity)
# and the swap space on disk (which may or may not be the hard drive). Together, the available space on disk and in RAM (physical memory)
# make up the virtual memory, which is an abstraction of the total resources available.
# In short, virtual memory makes the available space larger than what can be found in RAM alone and
# creates the illusion that the main memory is larger than it actually is.
#
# In normal circumstances, a regular CPU tensor is _paged_, which means that it is divided into blocks called _pages_ that
# can live anywhere in the virtual memory (either in RAM or on disk). As mentioned earlier, this has the advantage that
# the memory seems larger than what the main memory actually is.
#
# Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
# back this page into RAM (_swap in_ or _page in_).
# In turn, the OS may have to _swap out_ (or _page out_) another page to make room for the new page.
#
# In contrast to pageable memory, _pinned_ (or _page-locked_ or _non-pageable_) memory is a type of memory that cannot be swapped out to disk.
# It allows for faster and more predictable access times, but has the downside that it is more limited than
# pageable memory (aka the main memory).
#
# .. figure:: /_static/img/pinmem.png
#    :alt:
#
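# As a quick, minimal illustration (the tensor size below is arbitrary), a CPU tensor can be pinned
# explicitly with `Tensor.pin_memory()`, and `Tensor.is_pinned()` tells us which kind of memory backs
# a given tensor:

pageable_example = torch.ones(1024)
print(f"Pageable tensor pinned? {pageable_example.is_pinned()}")   # False: regular, pageable memory
pinned_example = pageable_example.pin_memory()                     # returns a copy in page-locked memory
print(f"Pinned tensor pinned?   {pinned_example.is_pinned()}")     # True


######################################################################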
# CUDA and (non-)pageable memory
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# To understand how CUDA copies a tensor from CPU to GPU, let's consider the two scenarios above:
#
# - If the memory is page-locked, the device can access the memory directly in the main memory. The memory addresses are well
#   defined and functions that need to read this data can be significantly accelerated.
# - If the memory is pageable, all the pages will have to be brought to the main memory before being sent to the GPU.
#   This operation may take time and is less predictable than when executed on page-locked tensors.
#
# More precisely, when CUDA sends pageable data from CPU to GPU, it must first create a page-locked copy of that data
# before making the transfer.
#
# Asynchronous vs. synchronous operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# When executing a copy from a host (e.g., CPU) to a device (e.g., GPU), the CUDA toolkit offers ways to perform these
# operations synchronously or asynchronously with respect to the host. In the synchronous case, the call to `cudaMemcpy`
# that is issued by `tensor.to(device)` blocks the Python main thread, which means that the code will stop until
# the data has been transferred to the device.
#
# When calling `tensor.to(device)`, PyTorch always makes a call to [`cudaMemcpyAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79). If `non_blocking=False` (default), a `cudaStreamSynchronize` will be called after each and every `cudaMemcpyAsync`. If `non_blocking=True`, no synchronization is triggered, and the main thread on the host is not blocked.
# Therefore, from the host perspective, multiple tensors can be sent to the device simultaneously in the latter case, as the thread does not need to wait for one transfer to complete before initiating another.
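#
# As a quick, minimal illustration of this host-side behaviour (the tensor size is arbitrary and the
# wall-clock numbers are only indicative), we can time how quickly the call returns and how long the
# full transfer actually takes. A pinned source tensor is used so that the copy can be fully asynchronous:

import time

pinned_source_tensor = torch.randn(10_000_000, pin_memory=True)
start = time.perf_counter()
gpu_tensor = pinned_source_tensor.to("cuda:0", non_blocking=True)
print(f"Time for the host to issue the copy: {time.perf_counter() - start:.6f} s")
torch.cuda.synchronize()  # wait until the copy has actually completed on the device
print(f"Time until the copy is guaranteed to be complete: {time.perf_counter() - start:.6f} s")
del pinned_source_tensor, gpu_tensor


######################################################################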
#
# .. note:: In general, the transfer is blocking on the device side even if it isn't on the host side: the copy on the device cannot
#    occur while another operation is being executed. However, in some advanced scenarios, multiple copies or copy and kernel
#    executions can be done simultaneously on the GPU side. To enable this, three requirements must be met:
#
#    1. The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volta,
#       Ampere or Hopper (e.g., V100, A100 or H100 devices) have more than one DMA engine.
#
#    2. The transfer must be done on a separate, non-default CUDA stream. In PyTorch, CUDA streams can be handled using
#       `torch.cuda.Stream` (a short sketch of this pattern follows the note below).
#
#    3. The source data must be in pinned memory.
#
#
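# As a minimal sketch of requirements 2 and 3 (the tensor size and variable names here are arbitrary,
# and this is not a benchmark), a copy can be issued from a pinned source tensor on a dedicated,
# non-default stream:

copy_stream = torch.cuda.Stream()                        # a separate, non-default CUDA stream
pinned_source = torch.randn(1_000_000, pin_memory=True)  # requirement 3: pinned source data
with torch.cuda.stream(copy_stream):
    gpu_copy = pinned_source.to("cuda:0", non_blocking=True)
# Make the default stream wait for the copy before the result is used elsewhere
torch.cuda.current_stream().wait_stream(copy_stream)
print(gpu_copy.device)


######################################################################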
# A PyTorch perspective
# ---------------------
#
# `pin_memory()`
# ~~~~~~~~~~~~~~
#
# PyTorch offers the possibility to create and send tensors to page-locked memory through the `pin_memory` functions and
# arguments.
# Any CPU tensor on a machine where CUDA is initialized can be sent to pinned memory through the `pin_memory`
# method. Importantly, `pin_memory` is blocking on the host: the main thread will wait for the tensor to be copied to
# page-locked memory before executing the next operation.
# New tensors can also be created directly in pinned memory by passing `pin_memory=True` to constructors like `torch.zeros`,
# `torch.ones` and `torch.randn`.
#
# Let us check the speed of pinning memory and sending tensors to CUDA:


import torch
import gc
from torch.utils.benchmark import Timer

tensor_pageable = torch.randn(100_000)

tensor_pinned = torch.randn(100_000, pin_memory=True)

print("Regular to(device)",
      Timer("tensor_pageable.to('cuda:0')", globals=globals()).adaptive_autorange())
print("Pinned to(device)",
      Timer("tensor_pinned.to('cuda:0')", globals=globals()).adaptive_autorange())
print("pin_memory() alone",
      Timer("tensor_pageable.pin_memory()", globals=globals()).adaptive_autorange())
print("pin_memory() + to(device)",
      Timer("tensor_pageable.pin_memory().to('cuda:0')", globals=globals()).adaptive_autorange())
del tensor_pageable, tensor_pinned
gc.collect()


######################################################################
# We can observe that casting a pinned-memory tensor to GPU is indeed much faster than casting a pageable tensor, because under the hood, a pageable tensor must be copied to pinned memory before being sent to the GPU.
#
# However, calling `pin_memory()` on a pageable tensor before casting it to GPU does not bring any speed-up; on the contrary, this call is actually slower than just executing the transfer. Again, this makes sense, since we're actually asking Python to execute an operation that CUDA will perform anyway before copying the data from host to device.
#
# `non_blocking=True`
# ~~~~~~~~~~~~~~~~~~~
#
# As mentioned earlier, many PyTorch operations have the option of being executed asynchronously with respect to the host through the `non_blocking` argument.
# Here, to accurately account for the benefits of using `non_blocking`, we will design a slightly more involved experiment, since we want to assess how fast it is to send multiple tensors to GPU with and without calling `non_blocking`.
#


def copy_to_device(*tensors, display_peak_mem=False):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0"))
    return result


def copy_to_device_nonblocking(*tensors, display_peak_mem=False):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0", non_blocking=True))
    # We need to synchronize to make sure all the asynchronous copies have completed
    torch.cuda.synchronize()
    return result


tensors = [torch.randn(1000) for _ in range(1000)]
print("Call to `to(device)`",
      Timer("copy_to_device(*tensors)", globals=globals()).adaptive_autorange())
print("Call to `to(device, non_blocking=True)`",
      Timer("copy_to_device_nonblocking(*tensors)", globals=globals()).adaptive_autorange())


######################################################################
# To get a better sense of what is happening here, let us profile these two code executions:


from torch.profiler import profile, record_function, ProfilerActivity

def profile_mem(cmd):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        exec(cmd)
    print(cmd)
    print(prof.key_averages().table(row_limit=10))

print("Call to `to(device)`")
profile_mem("copy_to_device(*tensors)")
print("Call to `to(device, non_blocking=True)`")
profile_mem("copy_to_device_nonblocking(*tensors)")


######################################################################
# The results are without any doubt better when using `non_blocking=True`, as all transfers are initiated simultaneously on the host side.
# Note that, interestingly, `to("cuda")` actually performs the same asynchronous device casting operation as the one with `non_blocking=True`, but with a synchronization point after each copy.
#
# The benefit will vary depending on the number and the size of the tensors involved, as well as on the hardware being used.
#
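# As a small sketch of this (the sizes below are arbitrary, the helpers reuse the functions defined
# above, and the absolute numbers will differ from machine to machine), we can sweep over a couple of
# tensor sizes and compare the blocking and non-blocking variants:

for size in (1_000, 100_000):
    sweep_tensors = [torch.randn(size) for _ in range(100)]
    t_blocking = Timer("copy_to_device(*sweep_tensors)", globals=globals()).adaptive_autorange()
    t_non_blocking = Timer("copy_to_device_nonblocking(*sweep_tensors)", globals=globals()).adaptive_autorange()
    print(f"size={size}: blocking median {t_blocking.median:.4e} s, non-blocking median {t_non_blocking.median:.4e} s")
del sweep_tensors
gc.collect()


######################################################################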
# Synergies
# ~~~~~~~~~
#
# Now that we have made the point that transferring tensors already in pinned memory to GPU is faster than transferring them from pageable memory, and that asynchronous transfers are faster than synchronous ones, we can benchmark the various combinations at hand:


def pin_copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0"))
    return result


def pin_copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0", non_blocking=True))
    # We need to synchronize to make sure all the asynchronous copies have completed
    torch.cuda.synchronize()
    return result


print("\nCall to `pin_memory()` + `to(device)`")
print("pin_memory().to(device)",
      Timer("pin_copy_to_device(*tensors)", globals=globals()).adaptive_autorange())
print("pin_memory().to(device, non_blocking=True)",
      Timer("pin_copy_to_device_nonblocking(*tensors)", globals=globals()).adaptive_autorange())

print("\nCall to `to(device)`")
print("to(device)",
      Timer("copy_to_device(*tensors)", globals=globals()).adaptive_autorange())
print("to(device, non_blocking=True)",
      Timer("copy_to_device_nonblocking(*tensors)", globals=globals()).adaptive_autorange())

print("\nCall to `to(device)` from pinned tensors")
tensors_pinned = [torch.zeros(1000, pin_memory=True) for _ in range(1000)]
print("tensors_pinned.to(device)",
      Timer("copy_to_device(*tensors_pinned)", globals=globals()).adaptive_autorange())
print("tensors_pinned.to(device, non_blocking=True)",
      Timer("copy_to_device_nonblocking(*tensors_pinned)", globals=globals()).adaptive_autorange())

del tensors, tensors_pinned
gc.collect()


######################################################################
# Other directions (GPU -> CPU, CPU -> MPS etc.)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# So far, we have assumed that doing asynchronous copies from CPU to GPU was safe.
# Indeed, it is a safe thing to do because CUDA will synchronize whenever it is needed to make sure that the data being read is not garbage.
# However, any other copy (e.g., from GPU to CPU) has no guarantee whatsoever that the copy will be completed when the data is read. In fact, if no explicit synchronization is done, the data on the host can be garbage:


tensor = torch.arange(1, 1_000_000, dtype=torch.double, device="cuda").expand(100, 999999).clone()
torch.testing.assert_close(tensor.mean(), torch.tensor(500_000, dtype=torch.double, device="cuda"))
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.testing.assert_close(cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double))
    print("No test failed with non_blocking")
except AssertionError:
    print(f"One test failed with non_blocking: {i}th assertion!")

try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.cuda.synchronize()
        torch.testing.assert_close(cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double))
    print("No test failed with synchronize")
except AssertionError:
    print(f"One test failed with synchronize: {i}th assertion!")


######################################################################
# The same observation could be made with copies from CPU to a non-CUDA device, such as MPS.
#
# In summary, copying data from CPU to GPU is safe when using `non_blocking=True`, but for any other direction, `non_blocking=True` can still be used provided that the user makes sure a device synchronization is executed before the data is accessed.
#
# Practical recommendations
# -------------------------
#
# We can now wrap up some early recommendations based on our observations:
# In general, `non_blocking=True` will provide good transfer speed, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to pinned memory manually from Python is a blocking operation on the host and hence will annihilate much of the benefit of using `non_blocking=True` (as CUDA does the `pin_memory` transfer anyway).
#
# One might now legitimately ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
#
# Additional considerations
# -------------------------
#
# PyTorch famously provides a `DataLoader` class that accepts a `pin_memory` argument.
# Given everything we have said so far about calls to `pin_memory`, how does the dataloader manage to accelerate data transfers?
#
# The answer resides in the fact that the dataloader reserves a separate thread to copy the data from pageable to pinned memory, thereby avoiding blocking the main thread with this operation. Consider the following example, where we send a list of tensors to CUDA after calling `pin_memory` on a separate thread:
#
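# The sketch below is a minimal illustration of this idea (the helper name, thread count and tensor
# sizes are arbitrary, and this is not how `DataLoader` implements it internally): each tensor is
# pinned in a worker thread, and the asynchronous copy is launched as soon as a pinned copy is ready.

from concurrent.futures import ThreadPoolExecutor, as_completed

def pin_and_copy_threaded(*tensors, num_threads=4):
    result = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Pin each tensor in a worker thread so the main thread is not blocked by `pin_memory`
        futures = [pool.submit(torch.Tensor.pin_memory, tensor) for tensor in tensors]
        for future in as_completed(futures):
            # Launch the asynchronous copy as soon as a pinned tensor becomes available.
            # Note that `result` follows completion order, not the order of the inputs.
            result.append(future.result().to("cuda:0", non_blocking=True))
    # Wait for every copy to complete before the data is used
    torch.cuda.synchronize()
    return result

copied_tensors = pin_and_copy_threaded(*[torch.randn(1_000_000) for _ in range(10)])


######################################################################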
# A self-contained example of this is the `TensorDict` primitive from the library of the same name: when calling `TensorDict.to(device)`, the default behaviour is to send the tensors to the device asynchronously and to synchronize the device afterwards. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
# This can further speed up the copies, as the following example shows:
#
# .. code-block:: bash
#
#    pip3 install git+https://github.com/pytorch/tensordict
#

from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer

td = TensorDict({str(i): torch.randn(1_000_000) for i in range(100)})

print(Timer("td.to('cuda:0', non_blocking=False)", globals=globals()).adaptive_autorange())
print(Timer("td.to('cuda:0')", globals=globals()).adaptive_autorange())
print(Timer("td.to('cuda:0', non_blocking=True, non_blocking_pin=True)", globals=globals()).adaptive_autorange())


######################################################################
# As a side note, it may be tempting to create long-lived buffers in pinned memory, copy tensors from pageable memory into them, and use them as a shuttle before sending the data to GPU.
# Unfortunately, this does not speed up computation because the bottleneck of copying data to pinned memory is still present.
#
# Another consideration is that transferring data that is stored on disk (shared memory or files) to GPU will usually require the data to be copied to pinned memory (which is in RAM) as an intermediate step.
#
# Using `non_blocking` in this context for large amounts of data may have devastating effects on RAM consumption. In practice, there is no silver bullet, and the performance of any combination of multithreaded `pin_memory` and `non_blocking` will depend on multiple factors such as the system being used, the OS, the hardware and the tasks being performed.
#
# Finally, creating a large number of tensors or a few large tensors in pinned memory will effectively reserve more RAM than pageable tensors would, thereby lowering the amount of RAM available for other operations (such as swapping pages in and out), which can have a negative impact on the overall runtime of an algorithm.

######################################################################
# Conclusion
# ----------
#
# Additional resources
# --------------------
#