Implement a version of the shuffle that sends data directly, tuple by tuple, using nvshmem_put.
For this, we need a thread-level histogram to compute thread-level write offsets on the destination PEs.
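A rough host-side sketch of the offset calculation, to make the idea concrete. It is not the intended kernel: the strided tuple-to-thread assignment, the modulo partitioning in `dest_pe`, and the name `thread_write_offsets` are all assumptions made for illustration. Each thread counts its tuples per destination PE, and an exclusive prefix sum over the threads turns those counts into the first write position of each thread on each PE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical partitioning function: maps a tuple key to a destination PE.
// The real shuffle may hash or range-partition instead.
inline int dest_pe(std::uint64_t key, int num_pes) {
    return static_cast<int>(key % static_cast<std::uint64_t>(num_pes));
}

// Returns offsets[t][p]: the index, relative to this sender's region on PE p,
// at which thread t writes its first tuple destined for PE p.
// Assumption: tuple i is handled by thread i % num_threads (strided loop).
std::vector<std::vector<std::size_t>> thread_write_offsets(
    const std::vector<std::uint64_t>& keys, int num_threads, int num_pes) {
    assert(num_threads > 0 && num_pes > 0);

    // Pass 1: thread-level histogram, hist[t][p] = tuples of thread t for PE p.
    std::vector<std::vector<std::size_t>> hist(
        num_threads, std::vector<std::size_t>(num_pes, 0));
    for (std::size_t i = 0; i < keys.size(); ++i) {
        int t = static_cast<int>(i % static_cast<std::size_t>(num_threads));
        ++hist[t][dest_pe(keys[i], num_pes)];
    }

    // Pass 2: exclusive prefix sum over the threads, separately for each PE.
    std::vector<std::vector<std::size_t>> offsets(
        num_threads, std::vector<std::size_t>(num_pes, 0));
    for (int p = 0; p < num_pes; ++p) {
        std::size_t running = 0;
        for (int t = 0; t < num_threads; ++t) {
            offsets[t][p] = running;
            running += hist[t][p];
        }
    }
    return offsets;
}
```

The offsets here are relative to the sending PE's own region on each destination. Turning them into absolute positions would additionally require the senders to exchange their per-PE totals, which this sketch leaves out. On the device, each thread would then advance its private offset for a PE after every tuple it puts there.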
We also need to make sure that the input data is in symmetric device memory: either register it as symmetric memory, if possible, or allocate symmetric device memory and copy the data over.