Description
Compare the following two approaches:
- Allocate memory for the input data (common to both approaches)
(1)
Do the following in a loop:
- launch a kernel that computes one part of the big input array (a sleep might be sufficient)
- return the address of the computed data from the kernel to the CPU
- wait for the kernel to finish
- send the data to the other node using RDMA
- continue the loop with the next part of the input data
(2)
- Start a single kernel
- Do the following in a loop inside that kernel:
  - compute part of the data (a sleep is sufficient)
  - send the computed part of the data
  - wait for the send operation to complete (nvshmem_quiet)
  - repeat for the next portion of the data
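As a rough back-of-envelope model of what the two loops should differ by, the sketch below adds up per-chunk costs for each approach. Everything in it is an assumption made for illustration: the `ChunkCosts` struct, both function names, and the idea that the send costs the same in both paths. It also ignores any overlap between compute and communication, so it only indicates where the difference is expected to come from (repeated launch and host synchronization overhead), not how large it will actually be.

```cpp
// Hypothetical per-chunk costs in microseconds; all values are placeholders
// to be replaced by measurements.
struct ChunkCosts {
    double launch_us;   // overhead of launching one kernel
    double compute_us;  // computing one part of the data
    double sync_us;     // host waiting for the kernel to finish
    double send_us;     // sending one part, including completion
};

// Approach (1): every chunk pays for a launch, the compute, a host sync, and a send.
double multi_launch_us(const ChunkCosts& c, int num_chunks) {
    return num_chunks * (c.launch_us + c.compute_us + c.sync_us + c.send_us);
}

// Approach (2): one launch in total; every chunk pays only for compute and send.
double single_launch_us(const ChunkCosts& c, int num_chunks) {
    return c.launch_us + num_chunks * (c.compute_us + c.send_us);
}
```

Under this model the gap between the two approaches grows linearly with the number of chunks, which suggests running the benchmark with several chunk counts rather than a single one.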
Time both approaches. The comparison should show the potential benefit of a single kernel launch with NVSHMEM over multiple kernel launches with device RDMA.
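A minimal host-side timing helper could look like the sketch below. The names `time_ms` and `run_placeholder` are hypothetical; `run_placeholder` only sleeps and stands in for whichever approach is being measured. Because kernel launches are asynchronous with respect to the host, the real callable would need to include whatever synchronization guarantees that all device work and all sends have completed before the timer stops.

```cpp
#include <chrono>
#include <thread>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical placeholder for one of the two approaches: sleeps 1 ms per
// chunk instead of computing and sending anything.
void run_placeholder(int num_chunks) {
    for (int chunk = 0; chunk < num_chunks; ++chunk) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```

Both approaches would then be wrapped in the same helper, for example `time_ms([] { run_placeholder(5); })`, ideally repeated several times so that warm-up effects can be discarded.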
NOTE: For simplicity, all data can be written to the same destination, so that each send overwrites the data from the previous one. In reality, we would have to flush the data from the receive buffer to some other storage medium.
NOTE: For simplicity, it is sufficient to have only one sending PE and one receiving PE.
NOTE: We might implement a second version of the NVSHMEM sending that uses double buffering, as our shuffle does. This would allow us to evaluate the effect of double buffering independently of the remainder of the shuffle code.