compare one kernel launch with nvshmem vs. multiple kernel launches with RDMA

Compare the following two approaches. Both start with the same setup step:

- Allocate memory for the input data

(1)
Do the following in a loop:
- compute a part of the big input array (a sleep might be sufficient)
- return the address of the computed data from the kernel launch to the CPU
- wait for the kernel to finish
- send the data to the other node using RDMA
- continue the loop with the next part of the input data
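
A minimal sketch of approach (1), assuming a GPU with compute capability 7.0 or newer (for `__nanosleep`). The chunk count, chunk size, and sleep duration are arbitrary assumptions, and `rdma_send_blocking` is a hypothetical placeholder that does nothing here; the real RDMA send would replace its body. In this sketch the host computes each chunk address itself, so nothing has to be copied back from the kernel.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real computation: sleep, then fill one chunk.
__global__ void compute_chunk(float *chunk, size_t n, unsigned sleep_ns) {
    __nanosleep(sleep_ns);  // needs compute capability 7.0 or newer
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = 1.0f;
}

// HYPOTHETICAL placeholder, not a real library call. Replace the body with the
// project's blocking RDMA send (for example an ibverbs work request plus
// completion polling on a GPUDirect-registered buffer).
static void rdma_send_blocking(const float *gpu_src, size_t bytes) {
    (void)gpu_src;
    (void)bytes;
}

int main() {
    const size_t num_chunks = 64;        // assumed number of parts
    const size_t chunk_elems = 1 << 20;  // assumed elements per part
    const unsigned sleep_ns = 1000;      // assumed compute stand-in
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((chunk_elems + threads - 1) / threads);

    float *input = nullptr;
    if (cudaMalloc(&input, num_chunks * chunk_elems * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    auto t0 = std::chrono::steady_clock::now();
    for (size_t c = 0; c < num_chunks; ++c) {
        float *chunk = input + c * chunk_elems;  // address is known on the host
        compute_chunk<<<blocks, threads>>>(chunk, chunk_elems, sleep_ns);
        if (cudaDeviceSynchronize() != cudaSuccess) {  // wait for the kernel
            fprintf(stderr, "kernel failed\n");
            return 1;
        }
        rdma_send_blocking(chunk, chunk_elems * sizeof(float));
    }
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("approach 1: %zu launches, %.3f ms\n", num_chunks, ms);
    cudaFree(input);
    return 0;
}
```

With the placeholder left empty, the measured time covers only launch and synchronization overhead plus the sleep, which can serve as a lower bound for this approach.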

(2)
- Start a single kernel
- do the following in a loop inside the kernel:
- - compute part of the data (a sleep is sufficient)
- - send the computed part of the data using NVSHMEM
- - wait for the send operation to complete (nvshmem_quiet)
- - repeat for the next portion of the data
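
A minimal sketch of approach (2), assuming two PEs (PE 0 sends, PE 1 receives), a single thread block, and the same arbitrary chunk count, chunk size, and sleep duration on every PE. The send buffer is allocated on the symmetric heap so that it is a valid source for device-initiated puts on any transport. All chunks are written to the same receive buffer. NVSHMEM programs are typically built with relocatable device code (`-rdc=true`) and linked against the NVSHMEM library.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// One kernel computes every chunk and sends it from inside the loop.
__global__ void compute_and_send(float *send_buf, float *recv_buf,
                                 size_t num_chunks, size_t chunk_elems,
                                 unsigned sleep_ns, int dest_pe) {
    for (size_t c = 0; c < num_chunks; ++c) {
        float *chunk = send_buf + c * chunk_elems;
        __nanosleep(sleep_ns);  // stand-in for the real computation
        for (size_t i = threadIdx.x; i < chunk_elems; i += blockDim.x)
            chunk[i] = 1.0f;
        __syncthreads();  // the chunk is complete before it is sent
        if (threadIdx.x == 0) {
            nvshmem_putmem_nbi(recv_buf, chunk, chunk_elems * sizeof(float), dest_pe);
            nvshmem_quiet();  // wait for the send operation to complete
        }
        __syncthreads();
    }
}

int main() {
    const size_t num_chunks = 64;        // assumed number of parts
    const size_t chunk_elems = 1 << 20;  // assumed elements per part
    const size_t chunk_bytes = chunk_elems * sizeof(float);

    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    const int me = nvshmem_my_pe();
    if (nvshmem_n_pes() < 2) {
        if (me == 0) fprintf(stderr, "run with at least 2 PEs\n");
        nvshmem_finalize();
        return 1;
    }

    // Symmetric allocations: every PE calls nvshmem_malloc with the same size.
    float *send_buf = (float *)nvshmem_malloc(num_chunks * chunk_bytes);
    float *recv_buf = (float *)nvshmem_malloc(chunk_bytes);

    nvshmem_barrier_all();
    auto t0 = std::chrono::steady_clock::now();
    if (me == 0) {
        compute_and_send<<<1, 256>>>(send_buf, recv_buf, num_chunks,
                                     chunk_elems, 1000, /*dest_pe=*/1);
        cudaDeviceSynchronize();
    }
    auto t1 = std::chrono::steady_clock::now();
    nvshmem_barrier_all();

    if (me == 0) {
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("approach 2: 1 launch, %.3f ms\n", ms);
    }
    nvshmem_free(send_buf);
    nvshmem_free(recv_buf);
    nvshmem_finalize();
    return 0;
}
```

A regular `<<<...>>>` launch is used because the kernel only issues puts and `nvshmem_quiet`; kernels that use device-side NVSHMEM synchronization or collectives would need `nvshmemx_collective_launch` instead.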

Measure the runtime of both approaches. This should show the potential of a single kernel launch with NVSHMEM compared to multiple kernel launches with device RDMA.

NOTE: For simplicity, all data can be written to the same destination, overwriting the data from previous send operations. In reality, we would have to flush the data from the receive buffer to some other storage medium.

NOTE: For simplicity, it is sufficient to have only one sending PE and one receiving PE.

NOTE: We might implement a second version of the NVSHMEM sending that uses double buffering like our shuffle does. This allows us to evaluate the effect of double buffering independently of the remainder of the shuffle code.
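
How the shuffle implements double buffering is not shown here, so the following is only a sketch of one common scheme, assuming two PEs, a single thread block, and arbitrary chunk count, chunk size, and sleep duration. Two staging buffers alternate: while the put of the previous chunk is still in flight from one buffer, the next chunk is computed into the other. `nvshmem_quiet` is called just before the next put is issued, which guarantees a staging buffer is no longer being read when it is overwritten.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// stage points to two consecutive chunk-sized staging buffers.
__global__ void compute_and_send_db(float *stage, float *recv_buf,
                                    size_t num_chunks, size_t chunk_elems,
                                    unsigned sleep_ns, int dest_pe) {
    for (size_t c = 0; c < num_chunks; ++c) {
        float *cur = stage + (c & 1) * chunk_elems;  // alternate buffers
        // Computing chunk c overlaps with the in-flight put of chunk c-1.
        __nanosleep(sleep_ns);  // stand-in for the real computation
        for (size_t i = threadIdx.x; i < chunk_elems; i += blockDim.x)
            cur[i] = 1.0f;
        __syncthreads();
        if (threadIdx.x == 0) {
            nvshmem_quiet();  // put of chunk c-1 done, its buffer is reusable
            nvshmem_putmem_nbi(recv_buf, cur, chunk_elems * sizeof(float), dest_pe);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) nvshmem_quiet();  // complete the final put
}

int main() {
    const size_t num_chunks = 64;        // assumed number of parts
    const size_t chunk_elems = 1 << 20;  // assumed elements per part
    const size_t chunk_bytes = chunk_elems * sizeof(float);

    nvshmem_init();
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    const int me = nvshmem_my_pe();
    if (nvshmem_n_pes() < 2) {
        if (me == 0) fprintf(stderr, "run with at least 2 PEs\n");
        nvshmem_finalize();
        return 1;
    }

    // Symmetric allocations: every PE calls nvshmem_malloc with the same size.
    float *stage = (float *)nvshmem_malloc(2 * chunk_bytes);
    float *recv_buf = (float *)nvshmem_malloc(chunk_bytes);

    nvshmem_barrier_all();
    auto t0 = std::chrono::steady_clock::now();
    if (me == 0) {
        compute_and_send_db<<<1, 256>>>(stage, recv_buf, num_chunks,
                                        chunk_elems, 1000, /*dest_pe=*/1);
        cudaDeviceSynchronize();
    }
    auto t1 = std::chrono::steady_clock::now();
    nvshmem_barrier_all();

    if (me == 0) {
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("double buffered: 1 launch, %.3f ms\n", ms);
    }
    nvshmem_free(stage);
    nvshmem_free(recv_buf);
    nvshmem_finalize();
    return 0;
}
```

Comparing this timing against a version that calls `nvshmem_quiet` immediately after every put would isolate the effect of overlapping computation with communication.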
