Commit a4ca26f
Device-to-Host LazyAwaitable (#3477)
Summary:
Pull Request resolved: #3477
workplace post: https://fb.workplace.com/groups/811751593969209/permalink/1285164823294548/
# TL;DR
* A new `DeviceToHostTensorAwaitable` class wraps the device-to-host data transfer and defers the `cudaEventSynchronize` call until the data is actually used on the host.
* It helps remove sync points in training optimization, which often suffers from CPU-blocking sync points.
# why awaitable
* as shown in the following diagram, a comms op is best overlapped with another, independent compute op to better utilize the device
* the idea is to **defer** the `wait()` call until the function that consumes the result of the comms op actually runs
* a convenient way to achieve this deferral is the `lazy_awaitable` concept, which is already [implemented in torchrec](https://github.com/meta-pytorch/torchrec/blob/main/torchrec/distributed/types.py#L368)
* diagram of (lazy_)awaitable in torchrec
{F1982900178}
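The deferral idea can be illustrated without any device code. The following is a minimal sketch, not torchrec's `LazyAwaitable`: the class name `LazyResult` and the helper `blocking_wait` are hypothetical and exist only to show that the blocking call runs once, at the point where the result is first consumed.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResult(Generic[T]):
    """Hypothetical stand-in for a lazy awaitable: the wait runs at most once, on first use."""

    def __init__(self, wait_fn: Callable[[], T]) -> None:
        self._wait_fn = wait_fn
        self._result: Optional[T] = None
        self._done = False

    def wait(self) -> T:
        # Run the (potentially blocking) wait only the first time the result is requested.
        if not self._done:
            self._result = self._wait_fn()
            self._done = True
        return self._result


events = []


def blocking_wait() -> int:
    # Stands in for a blocking sync point; records when it actually runs.
    events.append("wait")
    return 42


pending = LazyResult(blocking_wait)  # "launch" the op; nothing blocks yet
events.append("compute")             # unrelated work proceeds first
value = pending.wait()               # block only where the result is consumed
value_again = pending.wait()         # cached; blocking_wait is not called again
```

In this sketch the recorded order is compute first, then wait, which is the overlap the diagram describes.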
# why device-to-host transfer
* there are scenarios where on-device data is needed on the host side, such as metrics logging and data-dependent shape operations.
* these patterns create a device-to-host sync (data transfer) that often blocks CPU execution, and the correct implementation (with `.to(non_blocking=True)` and a CUDA event: [PR 3436](#3436)) usually spans multiple code domains, making it difficult to optimize.
* here we borrow the `LazyAwaitable` concept used for device-side comms and wrap (1) the non-blocking device-to-host data transfer and (2) `cuda_event.wait()` inside a `DeviceToHostTensorAwaitable` class for a better user experience.
* diagram of lazy_awaitable for device-to-host data transfer
{F1982900233}
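The pattern described above can be sketched roughly as follows. This is not the `DeviceToHostTensorAwaitable` implementation from this commit: the class name `D2HSketch` is hypothetical, the host-side block uses `torch.cuda.Event.synchronize()` as an assumption, and the CPU fallback is there only so the example runs on machines without a GPU.

```python
import torch


class D2HSketch:
    """Hypothetical sketch: start a non-blocking copy now, synchronize on first wait()."""

    def __init__(self, device_tensor: torch.Tensor) -> None:
        self._event = None
        if device_tensor.is_cuda:
            # Enqueue the copy on the current stream without blocking the host.
            self._host = device_tensor.to("cpu", non_blocking=True)
            # Record an event right after the copy so completion can be checked later.
            self._event = torch.cuda.Event()
            self._event.record()
        else:
            # Assumption for illustration: on CPU there is nothing to transfer or wait for.
            self._host = device_tensor

    def wait(self) -> torch.Tensor:
        # The host tensor is only safe to read once the recorded event has completed.
        if self._event is not None:
            self._event.synchronize()
            self._event = None
        return self._host


device = "cuda" if torch.cuda.is_available() else "cpu"
check = torch.arange(4, device=device).sum()  # small on-device result
pending = D2HSketch(check)                    # the transfer starts; the host keeps going
# ... more device work can be enqueued here before the result is needed ...
total = pending.wait().item()                 # the host blocks only at this point
```

Keeping the copy and the event inside one object means the caller that launches the transfer and the caller that consumes the result do not need to share any CUDA-specific state.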
# results
* the "comms check" result lives on the device and is needed for validation (a host-side assertion)
* `DeviceToHostTensorAwaitable.wait()` **defers** the `cudaEventSynchronize` call until the very end, where the result is actually needed by the host.
* the post-comms computes are scheduled before the assertion on the host side.
{F1982900468}
NOTE: this version of the implementation does not use a separate stream (as shown in the diagram above) for the non-blocking device-to-host data transfer, because the data volume is usually relatively small.
{F1982901286}
Reviewed By: spmex
Differential Revision: D85211205
fbshipit-source-id: 41d03230dd9b190085545cfb76192d59375646c4
1 parent d26aa0d commit a4ca26f
2 files changed, +63 -2 lines changed