Commit cc31f3f

Automated tutorials push
1 parent 671cdb0 commit cc31f3f

File tree

196 files changed: +13393 additions, -12968 deletions


_downloads/562d6bd0e2a429f010fcf8007f6a7cac/pinmem_nonblock.py

Lines changed: 48 additions & 7 deletions
@@ -108,7 +108,7 @@
 #
 # .. _pinned_memory_async_sync:
 #
-# When executing a copy from a host (e.g., CPU) to a device (e.g., GPU), the CUDA toolkit offers modalities to do these
+# When executing a copy from a host (such as, CPU) to a device (such as, GPU), the CUDA toolkit offers modalities to do these
 # operations synchronously or asynchronously with respect to the host.
 #
 # In practice, when calling :meth:`~torch.Tensor.to`, PyTorch always makes a call to
@@ -512,12 +512,54 @@ def pin_copy_to_device_nonblocking(*tensors):
 #
 # Until now, we have operated under the assumption that asynchronous copies from the CPU to the GPU are safe.
 # This is generally true because CUDA automatically handles synchronization to ensure that the data being accessed is
-# valid at read time.
-# However, this guarantee does not extend to transfers in the opposite direction, from GPU to CPU.
-# Without explicit synchronization, these transfers offer no assurance that the copy will be complete at the time of
-# data access. Consequently, the data on the host might be incomplete or incorrect, effectively rendering it garbage:
+# valid at read time __whenever the tensor is in pageable memory__.
 #
+# However, in other cases we cannot make the same assumption: when a tensor is placed in pinned memory, mutating the
+# original copy after calling the host-to-device transfer may corrupt the data received on GPU.
+# Similarly, when a transfer is achieved in the opposite direction, from GPU to CPU, or from any device that is not CPU
+# or GPU to any device that is not a CUDA-handled GPU (such as, MPS), there is no guarantee that the data read on GPU is
+# valid without explicit synchronization.
+#
+# In these scenarios, these transfers offer no assurance that the copy will be complete at the time of
+# data access. Consequently, the data on the host might be incomplete or incorrect, effectively rendering it garbage.
+#
+# Let's first demonstrate this with a pinned-memory tensor:
+DELAY = 100000000
+try:
+    i = -1
+    for i in range(100):
+        # Create a tensor in pin-memory
+        cpu_tensor = torch.ones(1024, 1024, pin_memory=True)
+        torch.cuda.synchronize()
+        # Send the tensor to CUDA
+        cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
+        torch.cuda._sleep(DELAY)
+        # Corrupt the original tensor
+        cpu_tensor.zero_()
+        assert (cuda_tensor == 1).all()
+    print("No test failed with non_blocking and pinned tensor")
+except AssertionError:
+    print(f"{i}th test failed with non_blocking and pinned tensor. Skipping remaining tests")
 
+######################################################################
+# Using a pageable tensor always works:
+#
+
+i = -1
+for i in range(100):
+    # Create a tensor in pin-memory
+    cpu_tensor = torch.ones(1024, 1024)
+    torch.cuda.synchronize()
+    # Send the tensor to CUDA
+    cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
+    torch.cuda._sleep(DELAY)
+    # Corrupt the original tensor
+    cpu_tensor.zero_()
+    assert (cuda_tensor == 1).all()
+print("No test failed with non_blocking and pageable tensor")
+
+######################################################################
+# Now let's demonstrate that CUDA to CPU also fails to produce reliable outputs without synchronization:
 
 tensor = (
     torch.arange(1, 1_000_000, dtype=torch.double, device="cuda")
@@ -551,9 +593,8 @@ def pin_copy_to_device_nonblocking(*tensors):
 
 
 ######################################################################
-# The same considerations apply to copies from the CPU to non-CUDA devices, such as MPS.
 # Generally, asynchronous copies to a device are safe without explicit synchronization only when the target is a
-# CUDA-enabled device.
+# CUDA-enabled device and the original tensor is in pageable memory.
 #
 # In summary, copying data from CPU to GPU is safe when using ``non_blocking=True``, but for any other direction,
 # ``non_blocking=True`` can still be used but the user must make sure that a device synchronization is executed before
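As a rough illustration of the synchronization requirement summarized above, the following is a minimal sketch (not part of this commit) of a device-to-host copy done safely, assuming a CUDA-capable PyTorch build:

import torch

# Issue an asynchronous GPU-to-CPU copy, then synchronize before reading.
gpu_tensor = torch.rand(1024, 1024, device="cuda")
cpu_tensor = gpu_tensor.to("cpu", non_blocking=True)
# Without this synchronization, cpu_tensor may still be incomplete when read.
torch.cuda.synchronize()
print(cpu_tensor.sum())  # safe to access only after the synchronization

The copy itself remains asynchronous with respect to the host; the explicit torch.cuda.synchronize() call is what guarantees the host data is valid before it is used.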

_downloads/6a760a243fcbf87fb3368be3d4d860ee/pinmem_nonblock.ipynb

Lines changed: 86 additions & 13 deletions
@@ -147,9 +147,9 @@
 "Asynchronous vs. Synchronous Operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)\n",
 "-----------------------------------------------------------------------------------------\n",
 "\n",
-"When executing a copy from a host (e.g., CPU) to a device (e.g., GPU),\n",
-"the CUDA toolkit offers modalities to do these operations synchronously\n",
-"or asynchronously with respect to the host.\n",
+"When executing a copy from a host (such as, CPU) to a device (such as,\n",
+"GPU), the CUDA toolkit offers modalities to do these operations\n",
+"synchronously or asynchronously with respect to the host.\n",
 "\n",
 "In practice, when calling `~torch.Tensor.to`{.interpreted-text\n",
 "role=\"meth\"}, PyTorch always makes a call to\n",
@@ -696,12 +696,86 @@
 "Until now, we have operated under the assumption that asynchronous\n",
 "copies from the CPU to the GPU are safe. This is generally true because\n",
 "CUDA automatically handles synchronization to ensure that the data being\n",
-"accessed is valid at read time. However, this guarantee does not extend\n",
-"to transfers in the opposite direction, from GPU to CPU. Without\n",
-"explicit synchronization, these transfers offer no assurance that the\n",
-"copy will be complete at the time of data access. Consequently, the data\n",
-"on the host might be incomplete or incorrect, effectively rendering it\n",
-"garbage:\n"
+"accessed is valid at read time \\_\\_whenever the tensor is in pageable\n",
+"memory\\_\\_.\n",
+"\n",
+"However, in other cases we cannot make the same assumption: when a\n",
+"tensor is placed in pinned memory, mutating the original copy after\n",
+"calling the host-to-device transfer may corrupt the data received on\n",
+"GPU. Similarly, when a transfer is achieved in the opposite direction,\n",
+"from GPU to CPU, or from any device that is not CPU or GPU to any device\n",
+"that is not a CUDA-handled GPU (such as, MPS), there is no guarantee\n",
+"that the data read on GPU is valid without explicit synchronization.\n",
+"\n",
+"In these scenarios, these transfers offer no assurance that the copy\n",
+"will be complete at the time of data access. Consequently, the data on\n",
+"the host might be incomplete or incorrect, effectively rendering it\n",
+"garbage.\n",
+"\n",
+"Let\\'s first demonstrate this with a pinned-memory tensor:\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"DELAY = 100000000\n",
+"try:\n",
+"    i = -1\n",
+"    for i in range(100):\n",
+"        # Create a tensor in pin-memory\n",
+"        cpu_tensor = torch.ones(1024, 1024, pin_memory=True)\n",
+"        torch.cuda.synchronize()\n",
+"        # Send the tensor to CUDA\n",
+"        cuda_tensor = cpu_tensor.to(\"cuda\", non_blocking=True)\n",
+"        torch.cuda._sleep(DELAY)\n",
+"        # Corrupt the original tensor\n",
+"        cpu_tensor.zero_()\n",
+"        assert (cuda_tensor == 1).all()\n",
+"    print(\"No test failed with non_blocking and pinned tensor\")\n",
+"except AssertionError:\n",
+"    print(f\"{i}th test failed with non_blocking and pinned tensor. Skipping remaining tests\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Using a pageable tensor always works:\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"i = -1\n",
+"for i in range(100):\n",
+"    # Create a tensor in pin-memory\n",
+"    cpu_tensor = torch.ones(1024, 1024)\n",
+"    torch.cuda.synchronize()\n",
+"    # Send the tensor to CUDA\n",
+"    cuda_tensor = cpu_tensor.to(\"cuda\", non_blocking=True)\n",
+"    torch.cuda._sleep(DELAY)\n",
+"    # Corrupt the original tensor\n",
+"    cpu_tensor.zero_()\n",
+"    assert (cuda_tensor == 1).all()\n",
+"print(\"No test failed with non_blocking and pageable tensor\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Now let\\'s demonstrate that CUDA to CPU also fails to produce reliable\n",
+"outputs without synchronization:\n"
 ]
 },
 {
@@ -747,10 +821,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The same considerations apply to copies from the CPU to non-CUDA\n",
-"devices, such as MPS. Generally, asynchronous copies to a device are\n",
-"safe without explicit synchronization only when the target is a\n",
-"CUDA-enabled device.\n",
+"Generally, asynchronous copies to a device are safe without explicit\n",
+"synchronization only when the target is a CUDA-enabled device and the\n",
+"original tensor is in pageable memory.\n",
 "\n",
 "In summary, copying data from CPU to GPU is safe when using\n",
 "`non_blocking=True`, but for any other direction, `non_blocking=True`
Binary image files changed, including _images/sphx_glr_coding_ddpg_001.png (+5.72 KB); other image size deltas: -245 Bytes, -178 Bytes, -2.41 KB, -110 Bytes, -567 Bytes, 5 Bytes, 527 Bytes.
