|
147 | 147 | "Asynchronous vs. Synchronous Operations with `non_blocking=True` (CUDA `cudaMemcpyAsync`)\n",
|
148 | 148 | "-----------------------------------------------------------------------------------------\n",
|
149 | 149 | "\n",
|
150 |
| - "When executing a copy from a host (e.g., CPU) to a device (e.g., GPU),\n", |
151 |
| - "the CUDA toolkit offers modalities to do these operations synchronously\n", |
152 |
| - "or asynchronously with respect to the host.\n", |
| 150 | + "When executing a copy from a host (such as a CPU) to a device (such as\n", |
| 151 | + "a GPU), the CUDA toolkit provides ways to perform these operations\n", |
| 152 | + "synchronously or asynchronously with respect to the host.\n", |
153 | 153 | "\n",
|
154 | 154 | "In practice, when calling `~torch.Tensor.to`{.interpreted-text\n",
|
155 | 155 | "role=\"meth\"}, PyTorch always makes a call to\n",
|
|
696 | 696 | "Until now, we have operated under the assumption that asynchronous\n",
|
697 | 697 | "copies from the CPU to the GPU are safe. This is generally true because\n",
|
698 | 698 | "CUDA automatically handles synchronization to ensure that the data being\n",
|
699 |
| - "accessed is valid at read time. However, this guarantee does not extend\n", |
700 |
| - "to transfers in the opposite direction, from GPU to CPU. Without\n", |
701 |
| - "explicit synchronization, these transfers offer no assurance that the\n", |
702 |
| - "copy will be complete at the time of data access. Consequently, the data\n", |
703 |
| - "on the host might be incomplete or incorrect, effectively rendering it\n", |
704 |
| - "garbage:\n" |
| 699 | + "accessed is valid at read time **whenever the tensor is in pageable\n", |
| 700 | + "memory**.\n", |
| 701 | + "\n", |
| 702 | + "However, in other cases we cannot make the same assumption: when a\n", |
| 703 | + "tensor is placed in pinned memory, mutating the original copy after\n", |
| 704 | + "calling the host-to-device transfer may corrupt the data received on\n", |
| 705 | + "GPU. Similarly, when a transfer is achieved in the opposite direction,\n", |
| 706 | + "from GPU to CPU, or in any direction other than CPU to a CUDA-handled\n", |
| 707 | + "GPU (such as MPS), there is no guarantee that the data read on the\n", |
| 708 | + "target device is valid without explicit synchronization.\n", |
| 709 | + "\n", |
| 710 | + "In these scenarios, these transfers offer no assurance that the copy\n", |
| 711 | + "will be complete at the time of data access. Consequently, the data on\n", |
| 712 | + "the host might be incomplete or incorrect, effectively rendering it\n", |
| 713 | + "garbage.\n", |
| 714 | + "\n", |
| 715 | + "Let\\'s first demonstrate this with a pinned-memory tensor:\n" |
| 716 | + ] |
| 717 | + }, |
| 718 | + { |
| 719 | + "cell_type": "code", |
| 720 | + "execution_count": null, |
| 721 | + "metadata": { |
| 722 | + "collapsed": false |
| 723 | + }, |
| 724 | + "outputs": [], |
| 725 | + "source": [ |
| 726 | + "DELAY = 100000000\n", |
| 727 | + "try:\n", |
| 728 | + " i = -1\n", |
| 729 | + " for i in range(100):\n", |
| 730 | + "    # Create a tensor in pinned memory\n", |
| 731 | + " cpu_tensor = torch.ones(1024, 1024, pin_memory=True)\n", |
| 732 | + " torch.cuda.synchronize()\n", |
| 733 | + " # Send the tensor to CUDA\n", |
| 734 | + " cuda_tensor = cpu_tensor.to(\"cuda\", non_blocking=True)\n", |
| 735 | + " torch.cuda._sleep(DELAY)\n", |
| 736 | + " # Corrupt the original tensor\n", |
| 737 | + " cpu_tensor.zero_()\n", |
| 738 | + " assert (cuda_tensor == 1).all()\n", |
| 739 | + " print(\"No test failed with non_blocking and pinned tensor\")\n", |
| 740 | + "except AssertionError:\n", |
| 741 | + " print(f\"{i}th test failed with non_blocking and pinned tensor. Skipping remaining tests\")" |
| 742 | + ] |
| 743 | + }, |
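The corruption above can be avoided by synchronizing before mutating the source: once `torch.cuda.synchronize()` has returned, the host-to-device copy is complete and the pinned tensor can be reused safely. A minimal sketch (assuming a CUDA-capable PyTorch build; the script skips itself otherwise):

```python
# Sketch: synchronizing before mutating a pinned source tensor.
# Assumes a CUDA-capable PyTorch build; skips gracefully otherwise.
try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    HAS_CUDA = False

failures = 0
if HAS_CUDA:
    for _ in range(100):
        cpu_tensor = torch.ones(1024, 1024, pin_memory=True)
        cuda_tensor = cpu_tensor.to("cuda", non_blocking=True)
        # Wait until the async copy has completed before touching the source
        torch.cuda.synchronize()
        cpu_tensor.zero_()
        if not (cuda_tensor == 1).all():
            failures += 1
    print(f"{failures} corrupted transfers after synchronizing")
else:
    print("CUDA not available; skipping demonstration")
```

A device-wide `torch.cuda.synchronize()` is coarse; in real code, recording a `torch.cuda.Event` after the copy and synchronizing on that event gives the same guarantee while stalling the host less.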
| 744 | + { |
| 745 | + "cell_type": "markdown", |
| 746 | + "metadata": {}, |
| 747 | + "source": [ |
| 748 | + "Using a pageable tensor always works:\n" |
| 749 | + ] |
| 750 | + }, |
| 751 | + { |
| 752 | + "cell_type": "code", |
| 753 | + "execution_count": null, |
| 754 | + "metadata": { |
| 755 | + "collapsed": false |
| 756 | + }, |
| 757 | + "outputs": [], |
| 758 | + "source": [ |
| 759 | + "i = -1\n", |
| 760 | + "for i in range(100):\n", |
| 761 | + "    # Create a (pageable) tensor\n", |
| 762 | + " cpu_tensor = torch.ones(1024, 1024)\n", |
| 763 | + " torch.cuda.synchronize()\n", |
| 764 | + " # Send the tensor to CUDA\n", |
| 765 | + " cuda_tensor = cpu_tensor.to(\"cuda\", non_blocking=True)\n", |
| 766 | + " torch.cuda._sleep(DELAY)\n", |
| 767 | + " # Corrupt the original tensor\n", |
| 768 | + " cpu_tensor.zero_()\n", |
| 769 | + " assert (cuda_tensor == 1).all()\n", |
| 770 | + "print(\"No test failed with non_blocking and pageable tensor\")" |
| 771 | + ] |
| 772 | + }, |
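As a side note, you can query whether a host tensor's storage is pinned with `Tensor.is_pinned()`. A quick sketch (the pinned case assumes CUDA is available, since pinning requires a CUDA context):

```python
# Sketch: checking whether a host tensor is in pinned or pageable memory.
import torch

pageable = torch.ones(4, 4)
print(pageable.is_pinned())  # ordinary pageable host memory -> False

if torch.cuda.is_available():
    pinned = torch.ones(4, 4, pin_memory=True)
    print(pinned.is_pinned())  # -> True
```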
| 773 | + { |
| 774 | + "cell_type": "markdown", |
| 775 | + "metadata": {}, |
| 776 | + "source": [ |
| 777 | + "Now let\\'s demonstrate that CUDA to CPU also fails to produce reliable\n", |
| 778 | + "outputs without synchronization:\n" |
705 | 779 | ]
|
706 | 780 | },
|
707 | 781 | {
|
|
747 | 821 | "cell_type": "markdown",
|
748 | 822 | "metadata": {},
|
749 | 823 | "source": [
|
750 |
| - "The same considerations apply to copies from the CPU to non-CUDA\n", |
751 |
| - "devices, such as MPS. Generally, asynchronous copies to a device are\n", |
752 |
| - "safe without explicit synchronization only when the target is a\n", |
753 |
| - "CUDA-enabled device.\n", |
| 824 | + "Generally, asynchronous copies to a device are safe without explicit\n", |
| 825 | + "synchronization only when the target is a CUDA-enabled device and the\n", |
| 826 | + "original tensor is in pageable memory.\n", |
754 | 827 | "\n",
|
755 | 828 | "In summary, copying data from CPU to GPU is safe when using\n",
|
756 | 829 | "`non_blocking=True`, but for any other direction, `non_blocking=True`\n",
|
|