Compare one kernel launch with NVSHMEM vs. multiple kernel launches with RDMA #17

@Alex2804

Compare the following two approaches. Both start with the same setup step:

  • Allocate memory for the input data

(1)
Do the following in a loop:

  • compute a part of the big input array (a sleep might be sufficient)
  • return the address of the computed data from the kernel launch to the CPU
  • wait for the kernel to finish
  • send the data to the other node using RDMA
  • continue the loop with the next part of the input data

(2)

  • Start a kernel
  • do the following in a loop:
    • compute part of the data (a sleep is sufficient)
    • send the computed part of the data
    • wait for the send operation to complete (nvshmem_quiet)
    • repeat for the next portion of the data

Time the runtime of both approaches. This should show the potential of a single kernel launch with NVSHMEM compared to multiple kernel launches with device RDMA.
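To make the structure of the two timed loops concrete, here is a minimal host-only sketch in C++. It is not the benchmark itself: `Hooks`, `run_multi_launch`, and `run_single_launch` are hypothetical names, and the three hooks are placeholders for what would really be a CUDA kernel launch plus synchronization, the per-chunk computation, and the RDMA or NVSHMEM transfer (including waiting for it to complete). The only point is where the launch cost sits relative to the loop.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical hooks. In the real benchmark these would wrap CUDA and
// RDMA/NVSHMEM calls; here they can be sleeps or counters.
struct Hooks {
    std::function<void()> launch;               // kernel launch + wait-for-finish overhead
    std::function<void(std::size_t)> compute;   // compute chunk i
    std::function<void(std::size_t)> send;      // send chunk i and wait for completion
};

using Clock = std::chrono::steady_clock;

static double elapsed_ms(Clock::time_point start) {
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

// Approach (1): one kernel launch per chunk, the host sends after each kernel.
double run_multi_launch(const Hooks& h, std::size_t chunks) {
    const auto start = Clock::now();
    for (std::size_t i = 0; i < chunks; ++i) {
        h.launch();    // paid once per chunk
        h.compute(i);
        h.send(i);
    }
    return elapsed_ms(start);
}

// Approach (2): a single kernel launch, compute and send inside the loop.
double run_single_launch(const Hooks& h, std::size_t chunks) {
    const auto start = Clock::now();
    h.launch();        // paid once in total
    for (std::size_t i = 0; i < chunks; ++i) {
        h.compute(i);
        h.send(i);
    }
    return elapsed_ms(start);
}
```

Both functions return the wall-clock time in milliseconds, so the same `Hooks` instance (for example with `std::this_thread::sleep_for` calls as stand-ins) can be passed to each and the results compared directly. In the real measurement, the timer for approach (1) should also cover returning the data address to the CPU.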

NOTE: For simplicity, all data can be written to the same destination, thereby overwriting the data from previous send operations. In reality, we would have to flush the data from the receive buffer to some other storage medium.

NOTE: For simplicity, it is sufficient to have only one sending PE and one receiving PE.

NOTE: We might implement a second version of NVSHMEM sending that uses double buffering like our shuffle does. This allows us to evaluate the effect of double buffering independently of the remainder of the shuffle code.
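For reference, the double-buffering idea can be sketched on the host in plain C++. This is only an illustration of the overlap pattern, not how the shuffle implements it: `run_double_buffered`, `Buffer`, and the `compute`/`send` hooks are hypothetical, and the background thread stands in for a non-blocking send whose completion would be awaited with something like nvshmem_quiet. Chunk i is computed into one buffer while the send of chunk i-1 is still reading the other.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

using Buffer = std::vector<int>;

// compute(i, buf) fills buf with chunk i.
// send(i, buf) transfers buf and returns once the transfer has completed.
void run_double_buffered(
    std::size_t chunks,
    const std::function<void(std::size_t, Buffer&)>& compute,
    const std::function<void(std::size_t, const Buffer&)>& send) {
    std::array<Buffer, 2> buffers;
    std::future<void> pending;  // the send currently in flight, if any

    for (std::size_t i = 0; i < chunks; ++i) {
        Buffer& current = buffers[i % 2];

        // Overlaps with the in-flight send of chunk i-1, which reads the
        // other buffer.
        compute(i, current);

        // Wait for the previous send before starting the next one, so at
        // most one send is in flight and the other buffer is free again.
        if (pending.valid()) pending.get();

        pending = std::async(std::launch::async,
                             [&send, &current, i] { send(i, current); });
    }

    // Drain the last send before the buffers go out of scope.
    if (pending.valid()) pending.get();
}
```

Because each send is awaited before the next one starts, sends stay in order and a buffer is never overwritten while it is still being read. Timing this variant against the single-buffer loop would isolate the benefit of overlapping compute and send.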
