# The results are without any doubt better when using `non_blocking=True`, as all transfers are initiated simultaneously on the host side.
# Note that, interestingly, `to("cuda")` actually performs the same asynchronous device casting operation as the `non_blocking=True` version, but with a synchronization point after each copy.
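#
# As a rough sketch of what the two variants look like in practice (assuming a CUDA device is available; the tensor sizes and count are illustrative):

import torch

tensors = [torch.randn(1024, 1024) for _ in range(10)]

# Default copies: a synchronization point follows each individual transfer.
synced_copies = [t.to("cuda") for t in tensors]

# Non-blocking copies: all transfers are queued on the host side first, and we
# synchronize once before using the results.
queued_copies = [t.to("cuda", non_blocking=True) for t in tensors]
torch.cuda.synchronize()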
#
# The benefit will vary depending on the number and size of the tensors, as well as on the hardware being used.
# We can now wrap up some early recommendations based on our observations:
# In general, `non_blocking=True` will provide good transfer speed, regardless of whether the original tensor is or isn't in pinned memory. If the tensor is already in pinned memory, the transfer can be accelerated, but moving it to pinned memory manually is a blocking operation on the host and hence will annihilate much of the benefit of using `non_blocking=True` (and CUDA does the `pin_memory` transfer anyway).
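#
# A minimal illustration of that last point (the tensor name and size are illustrative, and a CUDA device is assumed):

import torch

data = torch.randn(4096, 4096)

# Manually pinning first: ``pin_memory()`` blocks the host, so much of the
# benefit of the non-blocking copy is lost.
on_gpu = data.pin_memory().to("cuda", non_blocking=True)

# Letting CUDA stage the pageable tensor itself is usually the better default
# for a one-off transfer.
on_gpu = data.to("cuda", non_blocking=True)
torch.cuda.synchronize()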
#
# One might now legitimately ask what use there is for the `pin_memory()` method within the `torch.Tensor` class. In the following section, we will explore further how this can be used to accelerate the data transfer even more.
# The answer resides in the fact that the dataloader reserves a separate thread to copy the data from pageable to pinned memory, thereby avoiding blocking the main thread. Consider the following example, where we send a list of tensors to CUDA after calling `pin_memory` on a separate thread:
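#
# A minimal sketch of that idea (assuming a CUDA device; the sizes and thread pool configuration are illustrative): the tensors are pinned by worker threads while the main thread launches the non-blocking copies as soon as each pinned tensor becomes available.

from concurrent.futures import ThreadPoolExecutor

import torch

tensors = [torch.randn(1024, 1024) for _ in range(10)]

with ThreadPoolExecutor(max_workers=2) as executor:
    # Pin each tensor on the worker threads.
    pinned_futures = [executor.submit(torch.Tensor.pin_memory, t) for t in tensors]
    # Issue the device copies from the main thread as results come in.
    on_gpu = [future.result().to("cuda", non_blocking=True) for future in pinned_futures]
torch.cuda.synchronize()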
#
# A more isolated example of this is the TensorDict primitive from the homonymous library: when calling `TensorDict.to(device)`, the default behavior is to send these tensors to the device asynchronously and make a `device.synchronize()` call after. `TensorDict.to()` also offers a `non_blocking_pin` argument which will spawn multiple threads to do the calls to `pin_memory()` before launching the calls to `to(device)`.
# This can further speed up the copies as the following example shows:
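#
# A short sketch of what such a comparison could look like (this requires the `tensordict` package and a CUDA device; the keys and tensor sizes are illustrative):

import torch
from tensordict import TensorDict

td = TensorDict({str(i): torch.randn(1024, 1024) for i in range(10)}, batch_size=[])

# Default behavior: asynchronous copies followed by a synchronization.
td_cuda = td.to("cuda")

# Pin the tensors on worker threads before launching the non-blocking copies.
td_cuda = td.to("cuda", non_blocking_pin=True)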