atomicAdd(reinterpret_cast<unsigned long long*>(dispatch_wait_recv_cost_stats + src_rank), wait_recv_cost);
This atomicAdd accumulates the wait time observed by each warp for a given src_rank, so the stat ends up as the sum of the wait times across all warps.
What is the utility of this metric? The warps wait in parallel, so their summed wait time overstates the wall-clock wait. Are we trying to infer slow ranks from it?
The max across warps would be more informative, since it reflects the actual wait time for a given src_rank.
We could then take the max of these per-rank maxima across src_ranks to estimate the pure network communication latency of the recv kernel.
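
To make the suggestion concrete, here is a rough, standalone sketch rather than a patch against the actual kernel. It assumes the per-warp wait time is already available as an unsigned 64-bit value and that lane 0 of each warp reports it. The names `record_wait_stats`, `sum_stats`, `max_stats`, `WaitStats`, `run_wait_stats_demo`, and the synthetic `warp_wait` input are all hypothetical and only serve the illustration. It records the current sum next to the proposed max so the two can be compared:

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <vector>

using u64 = unsigned long long;

// Hypothetical kernel: lane 0 of each warp reports a synthetic wait time per src_rank.
// warp_wait is laid out as [num_warps][num_ranks].
__global__ void record_wait_stats(const u64* warp_wait, u64* sum_stats, u64* max_stats,
                                  int num_ranks) {
    const int lane_id = threadIdx.x % 32;
    const int warp_id = threadIdx.x / 32;  // single-block launch assumed
    if (lane_id != 0) return;
    for (int src_rank = 0; src_rank < num_ranks; ++src_rank) {
        const u64 wait_recv_cost = warp_wait[warp_id * num_ranks + src_rank];
        atomicAdd(sum_stats + src_rank, wait_recv_cost);  // current behaviour: sum across warps
        atomicMax(max_stats + src_rank, wait_recv_cost);  // proposed: max across warps
    }
}

struct WaitStats {
    std::vector<u64> sum;  // per src_rank, summed across warps
    std::vector<u64> max;  // per src_rank, max across warps
    u64 overall_max;       // max of the per-rank maxima
};

// Hypothetical host helper: runs the kernel on synthetic data and reduces across src_ranks.
WaitStats run_wait_stats_demo(const std::vector<u64>& warp_wait, int num_warps, int num_ranks) {
    assert(num_warps * 32 <= 1024);  // one block only in this sketch
    const size_t wait_bytes = warp_wait.size() * sizeof(u64);
    const size_t stat_bytes = num_ranks * sizeof(u64);

    u64 *d_wait = nullptr, *d_sum = nullptr, *d_max = nullptr;
    cudaMalloc(&d_wait, wait_bytes);
    cudaMalloc(&d_sum, stat_bytes);
    cudaMalloc(&d_max, stat_bytes);
    cudaMemcpy(d_wait, warp_wait.data(), wait_bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, stat_bytes);
    cudaMemset(d_max, 0, stat_bytes);

    record_wait_stats<<<1, num_warps * 32>>>(d_wait, d_sum, d_max, num_ranks);

    WaitStats out{std::vector<u64>(num_ranks), std::vector<u64>(num_ranks), 0};
    // Blocking copies on the default stream also wait for the kernel to finish.
    cudaMemcpy(out.sum.data(), d_sum, stat_bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(out.max.data(), d_max, stat_bytes, cudaMemcpyDeviceToHost);
    out.overall_max = *std::max_element(out.max.begin(), out.max.end());

    cudaFree(d_wait);
    cudaFree(d_sum);
    cudaFree(d_max);
    return out;
}
```

For example, with two warps and two ranks where warp 0 waits 10 and 40 and warp 1 waits 30 and 20, the summed stats are 40 and 60, while the per-rank maxima are 30 and 40 and the overall max is 40. Only the max-based numbers correspond to a wait that any single warp actually experienced.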