I am using NCCL 2.27 with nccl-tests. I captured the all-reduce kernel on B200 with:
ncu --section MemoryWorkloadAnalysis_Chart --replay-mode app-range -o nccl_allreduce_2cards_2.27_b200 ./build/all_reduce_perf -g 2 -n 1 -w 0 -b 2M -e 2M -c 0
The report still shows peer memory traffic as 0 B. Is this expected, or is the feature not ready yet?