`dispatch_wait_recv_cost_stats` accumulates wait time from every warp; a max across warps would be more useful #473

```
atomicAdd(reinterpret_cast<unsigned long long*>(dispatch_wait_recv_cost_stats + src_rank), wait_recv_cost);
```
This op adds the wait time observed by a single warp for a given `src_rank`. Because every warp executes it, the stored value is the sum of the per-warp wait times.

What is this metric meant to be used for? The warps wait in parallel, so their summed wait does not correspond to any elapsed time. Is the goal to infer slow ranks from it?

A more informative value would be the max across warps, since that reflects the actual time spent waiting on a `src_rank`.
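A minimal sketch of what recording a max instead of a sum could look like. Everything here other than the `atomicMax` intrinsic is an assumption for illustration: `record_max_wait`, `wait_max_stats`, `simulated_wait`, and the host driver `max_wait_for_rank` are hypothetical names, the per-warp wait times are passed in rather than measured, and only lane 0 of each warp updates the slot. It also assumes the wait values are non-negative (so comparing them as `unsigned long long` preserves ordering) and that the stats buffer is zeroed before each measurement.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;
constexpr int kThreadsPerBlock = 128;  // 4 warps per block

// Each warp reports its own wait time for `src_rank`. Lane 0 folds it into the
// per-rank slot with atomicMax, so the slot ends up holding the longest wait
// seen by any warp rather than the sum over all warps.
__global__ void record_max_wait(int64_t* wait_max_stats,        // hypothetical: one slot per src_rank
                                const int64_t* simulated_wait,  // hypothetical: one wait time per warp
                                int num_warps, int src_rank) {
    const int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int warp_id = thread_id / kWarpSize;
    const int lane_id = thread_id % kWarpSize;
    if (warp_id >= num_warps || lane_id != 0) return;

    // Assumes non-negative wait times, so the unsigned comparison is order-preserving.
    const auto wait_recv_cost = static_cast<unsigned long long>(simulated_wait[warp_id]);
    atomicMax(reinterpret_cast<unsigned long long*>(wait_max_stats + src_rank), wait_recv_cost);
}

// Hypothetical host driver: launches one warp per entry of `simulated_wait` and
// returns the value stored for `src_rank`. Error checking is omitted for brevity.
int64_t max_wait_for_rank(const std::vector<int64_t>& simulated_wait, int src_rank, int num_ranks) {
    const int num_warps = static_cast<int>(simulated_wait.size());
    if (num_warps == 0) return 0;

    int64_t* d_stats = nullptr;
    int64_t* d_wait = nullptr;
    cudaMalloc(&d_stats, num_ranks * sizeof(int64_t));
    cudaMalloc(&d_wait, num_warps * sizeof(int64_t));
    cudaMemset(d_stats, 0, num_ranks * sizeof(int64_t));  // max starts from zero
    cudaMemcpy(d_wait, simulated_wait.data(), num_warps * sizeof(int64_t), cudaMemcpyHostToDevice);

    const int num_threads = num_warps * kWarpSize;
    const int num_blocks = (num_threads + kThreadsPerBlock - 1) / kThreadsPerBlock;
    record_max_wait<<<num_blocks, kThreadsPerBlock>>>(d_stats, d_wait, num_warps, src_rank);

    int64_t result = 0;
    // Blocking copy on the default stream, so it also waits for the kernel to finish.
    cudaMemcpy(&result, d_stats + src_rank, sizeof(int64_t), cudaMemcpyDeviceToHost);
    cudaFree(d_stats);
    cudaFree(d_wait);
    return result;
}
```

With four warps reporting waits of 5, 40, 12, and 7 for the same rank, the slot holds 40, whereas `atomicAdd` would have stored 64.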

Taking the max of these per-rank maxima across all `src_rank`s would then give an estimate of the pure network communication latency of the recv kernel.
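As a rough illustration of the cross-rank reduction, the host-side sketch below assumes the stats buffer has already been copied back and holds one maximum wait per `src_rank`. `SlowestRank` and `slowest_src_rank` are hypothetical names, and the unit of the wait values is whatever the kernel records. Returning the index along with the value also addresses the slow-rank question.

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical result type: which src_rank had the longest wait, and how long it was.
struct SlowestRank {
    int rank;           // index of the slowest src_rank, or -1 if there are no ranks
    int64_t wait_cost;  // that rank's wait, in the unit recorded by the kernel
};

// Reduces per-rank maxima to a single worst-case wait. On ties, the lowest
// rank index is reported, because std::max_element returns the first maximum.
inline SlowestRank slowest_src_rank(const std::vector<int64_t>& per_rank_max_wait) {
    if (per_rank_max_wait.empty()) return {-1, 0};
    const auto it = std::max_element(per_rank_max_wait.begin(), per_rank_max_wait.end());
    return {static_cast<int>(it - per_rank_max_wait.begin()), *it};
}
```

The same reduction could be done on the device, but with only one value per rank the host version is likely cheap enough.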
