[Performance]: The relationship between transfer engine performance and the number of network cards #1617

@yoqiu-amd

Describe your performance question

Hi Team,

I found that using all NICs in an 8-NIC environment (8 × 400 Gb/s) does not give the best performance. My test results are below. A single NIC gets close to its theoretical bandwidth (48.30 GB/s against a 50 GB/s line rate). As NICs are added, throughput first rises, peaking at 135.03 GB/s with 3 NICs, and then falls: the 4-, 7-, and 8-NIC runs are all slower than the 3-NIC run.
Can this be fixed by tuning configuration parameters?

8 NICs => throughput 118.40 GB/s
7 NICs => throughput 84.60 GB/s
4 NICs => throughput 75.94 GB/s
3 NICs => throughput 135.03 GB/s
2 NICs => throughput 94.98 GB/s
1 NIC  => throughput 48.30 GB/s
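
To make the scaling behaviour easier to read, the results above can be expressed as a fraction of aggregate line rate. The following is a minimal sketch, not part of the benchmark itself. It assumes each NIC is a 400 Gb/s link (from "8 * 400gb") and that the "GB/s" reported by `transfer_engine_bench` means 10^9 bytes per second. The names `measured`, `LINE_RATE_GB`, and `scaling_efficiency` are illustrative only.

```python
# Measured throughput (GB/s) keyed by NIC count, copied from the results above.
measured = {1: 48.30, 2: 94.98, 3: 135.03, 4: 75.94, 7: 84.60, 8: 118.40}

LINK_GBIT = 400               # per-NIC link speed in Gb/s (assumed from "8 * 400gb")
LINE_RATE_GB = LINK_GBIT / 8  # 50 GB/s per NIC, assuming 1 GB = 10**9 bytes


def scaling_efficiency(nics, throughput_gb):
    """Return measured throughput as a fraction of nics * per-NIC line rate."""
    return throughput_gb / (nics * LINE_RATE_GB)


for nics in sorted(measured):
    eff = scaling_efficiency(nics, measured[nics])
    print(f"{nics} NIC(s): {measured[nics]:7.2f} GB/s, {eff:6.1%} of line rate")
```

Under these assumptions, the 1-, 2-, and 3-NIC runs reach roughly 97%, 95%, and 90% of line rate, while the 4-, 7-, and 8-NIC runs reach only about 38%, 24%, and 30%.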

root@mi355-gpu-17:/workspace# MC_GID_INDEX=1 \
transfer_engine_bench \
  --mode=initiator \
  --device_name=rocep105s0,rocep121s0,rocep137s0,rocep233s0,rocep249s0,rocep25s0,rocep9s0 \
  --metadata_server=P2PHANDSHAKE \
  --segment_id=mi355-gpu-3:16526
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 02:44:23.946800 157141 transfer_engine_impl.cpp:586] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0306 02:44:23.946867 157141 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: mi355-gpu-17 port: 12001
I0306 02:44:23.946899 157141 transfer_engine_impl.cpp:172] Transfer Engine RPC using P2P handshake, listening on mi355-gpu-17:16255
I0306 02:44:23.947069 157141 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.
I0306 02:44:23.947129 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:23.948887 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep105s0/ (with network device)
I0306 02:44:24.047127 157141 rdma_context.cpp:140] RDMA device: rocep105s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0d:06:90:81:ff:fe:39:e3:e8
I0306 02:44:24.047142 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:24.047366 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep121s0/ (with network device)
I0306 02:44:24.104812 157141 rdma_context.cpp:140] RDMA device: rocep121s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0e:06:90:81:ff:fe:39:01:c8
I0306 02:44:24.104827 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:24.105098 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep137s0/ (with network device)
I0306 02:44:24.175522 157141 rdma_context.cpp:140] RDMA device: rocep137s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0c:06:90:81:ff:fe:39:f4:20
I0306 02:44:24.175537 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:24.175807 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep233s0/ (with network device)
I0306 02:44:24.240439 157141 rdma_context.cpp:140] RDMA device: rocep233s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:09:06:90:81:ff:fe:3a:48:c8
I0306 02:44:24.240453 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:24.240726 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep249s0/ (with network device)
I0306 02:44:24.300143 157141 rdma_context.cpp:140] RDMA device: rocep249s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0a:06:90:81:ff:fe:39:f0:c0
I0306 02:44:24.300160 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:24.300446 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep25s0/ (with network device)
I0306 02:44:24.371096 157141 rdma_context.cpp:140] RDMA device: rocep25s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0f:06:90:81:ff:fe:39:ef:10
I0306 02:44:24.371109 157141 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:44:24.371378 157141 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep9s0/ (with network device)
I0306 02:44:24.434262 157141 rdma_context.cpp:140] RDMA device: rocep9s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:10:06:90:81:ff:fe:39:f9:78
I0306 02:44:24.434329 157141 transfer_engine_bench.cpp:294] DRAM is used, numa node num: 2
I0306 02:44:34.647830 157184 transfer_engine_bench.cpp:397] Worker 10 stopped!
I0306 02:44:34.647848 157174 transfer_engine_bench.cpp:397] Worker 0 stopped!
I0306 02:44:34.647881 157185 transfer_engine_bench.cpp:397] Worker 11 stopped!
I0306 02:44:34.647894 157176 transfer_engine_bench.cpp:397] Worker 2 stopped!
I0306 02:44:34.647903 157179 transfer_engine_bench.cpp:397] Worker 5 stopped!
I0306 02:44:34.647886 157183 transfer_engine_bench.cpp:397] Worker 9 stopped!
I0306 02:44:34.647951 157181 transfer_engine_bench.cpp:397] Worker 7 stopped!
I0306 02:44:34.647933 157177 transfer_engine_bench.cpp:397] Worker 3 stopped!
I0306 02:44:34.647962 157182 transfer_engine_bench.cpp:397] Worker 8 stopped!
I0306 02:44:34.647926 157180 transfer_engine_bench.cpp:397] Worker 6 stopped!
I0306 02:44:34.647979 157175 transfer_engine_bench.cpp:397] Worker 1 stopped!
I0306 02:44:34.648010 157178 transfer_engine_bench.cpp:397] Worker 4 stopped!
I0306 02:44:34.648324 157141 transfer_engine_bench.cpp:513] Test completed: duration 10.00, batch count 100873, throughput 84.60 GB/s
I0306 02:44:34.750689 157141 transfer_metadata.cpp:301] removeSegmentDesc mi355-gpu-17:16255 finish
root@mi355-gpu-17:/workspace# MC_GID_INDEX=1 \
transfer_engine_bench \
  --mode=initiator \
  --device_name=rocep105s0 \
  --metadata_server=P2PHANDSHAKE \
  --segment_id=mi355-gpu-3:16358
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 02:45:59.641531 157201 transfer_engine_impl.cpp:586] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0306 02:45:59.641594 157201 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: mi355-gpu-17 port: 12001
I0306 02:45:59.641625 157201 transfer_engine_impl.cpp:172] Transfer Engine RPC using P2P handshake, listening on mi355-gpu-17:16539
I0306 02:45:59.641780 157201 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.
I0306 02:45:59.641837 157201 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:45:59.643596 157201 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep105s0/ (with network device)
I0306 02:45:59.716431 157201 rdma_context.cpp:140] RDMA device: rocep105s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0d:06:90:81:ff:fe:39:e3:e8
I0306 02:45:59.716462 157201 transfer_engine_bench.cpp:294] DRAM is used, numa node num: 2
I0306 02:46:09.848446 157215 transfer_engine_bench.cpp:397] Worker 0 stopped!
I0306 02:46:09.848616 157217 transfer_engine_bench.cpp:397] Worker 2 stopped!
I0306 02:46:09.848790 157216 transfer_engine_bench.cpp:397] Worker 1 stopped!
I0306 02:46:09.848963 157218 transfer_engine_bench.cpp:397] Worker 3 stopped!
I0306 02:46:09.849133 157219 transfer_engine_bench.cpp:397] Worker 4 stopped!
I0306 02:46:09.849308 157220 transfer_engine_bench.cpp:397] Worker 5 stopped!
I0306 02:46:09.849478 157221 transfer_engine_bench.cpp:397] Worker 6 stopped!
I0306 02:46:09.849651 157222 transfer_engine_bench.cpp:397] Worker 7 stopped!
I0306 02:46:09.849823 157224 transfer_engine_bench.cpp:397] Worker 9 stopped!
I0306 02:46:09.849988 157226 transfer_engine_bench.cpp:397] Worker 11 stopped!
I0306 02:46:09.850165 157225 transfer_engine_bench.cpp:397] Worker 10 stopped!
I0306 02:46:09.850339 157223 transfer_engine_bench.cpp:397] Worker 8 stopped!
I0306 02:46:09.850471 157201 transfer_engine_bench.cpp:513] Test completed: duration 10.00, batch count 57600, throughput 48.30 GB/s
I0306 02:46:09.925596 157201 transfer_metadata.cpp:301] removeSegmentDesc mi355-gpu-17:16539 finish
root@mi355-gpu-17:/workspace# MC_GID_INDEX=1 \
transfer_engine_bench \
  --mode=initiator \
  --device_name=rocep105s0,rocep121s0 \
  --metadata_server=P2PHANDSHAKE \
  --segment_id=mi355-gpu-3:15319
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 02:47:17.595677 157228 transfer_engine_impl.cpp:586] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0306 02:47:17.595747 157228 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: mi355-gpu-17 port: 12001
I0306 02:47:17.595781 157228 transfer_engine_impl.cpp:172] Transfer Engine RPC using P2P handshake, listening on mi355-gpu-17:16636
I0306 02:47:17.595934 157228 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.
I0306 02:47:17.596071 157228 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:47:17.597848 157228 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep105s0/ (with network device)
I0306 02:47:17.680841 157228 rdma_context.cpp:140] RDMA device: rocep105s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0d:06:90:81:ff:fe:39:e3:e8
I0306 02:47:17.680855 157228 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:47:17.681088 157228 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep121s0/ (with network device)
I0306 02:47:17.745155 157228 rdma_context.cpp:140] RDMA device: rocep121s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0e:06:90:81:ff:fe:39:01:c8
I0306 02:47:17.745196 157228 transfer_engine_bench.cpp:294] DRAM is used, numa node num: 2
I0306 02:47:27.914700 157256 transfer_engine_bench.cpp:397] Worker 10 stopped!
I0306 02:47:27.914722 157251 transfer_engine_bench.cpp:397] Worker 5 stopped!
I0306 02:47:27.914850 157255 transfer_engine_bench.cpp:397] Worker 9 stopped!
I0306 02:47:27.914865 157247 transfer_engine_bench.cpp:397] Worker 1 stopped!
I0306 02:47:27.914942 157252 transfer_engine_bench.cpp:397] Worker 6 stopped!
I0306 02:47:27.914961 157248 transfer_engine_bench.cpp:397] Worker 2 stopped!
I0306 02:47:27.915051 157257 transfer_engine_bench.cpp:397] Worker 11 stopped!
I0306 02:47:27.915071 157250 transfer_engine_bench.cpp:397] Worker 4 stopped!
I0306 02:47:27.915164 157246 transfer_engine_bench.cpp:397] Worker 0 stopped!
I0306 02:47:27.915184 157253 transfer_engine_bench.cpp:397] Worker 7 stopped!
I0306 02:47:27.915261 157254 transfer_engine_bench.cpp:397] Worker 8 stopped!
I0306 02:47:27.915285 157249 transfer_engine_bench.cpp:397] Worker 3 stopped!
I0306 02:47:27.915509 157228 transfer_engine_bench.cpp:513] Test completed: duration 10.00, batch count 113245, throughput 94.98 GB/s
I0306 02:47:28.007961 157228 transfer_metadata.cpp:301] removeSegmentDesc mi355-gpu-17:16636 finish
root@mi355-gpu-17:/workspace# MC_GID_INDEX=1 \
transfer_engine_bench \
  --mode=initiator \
  --device_name=rocep105s0,rocep121s0,rocep137s0 \
  --metadata_server=P2PHANDSHAKE \
  --segment_id=mi355-gpu-3:16186
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 02:50:39.494439 157263 transfer_engine_impl.cpp:586] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0306 02:50:39.494508 157263 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: mi355-gpu-17 port: 12001
I0306 02:50:39.494542 157263 transfer_engine_impl.cpp:172] Transfer Engine RPC using P2P handshake, listening on mi355-gpu-17:16998
I0306 02:50:39.494692 157263 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.
I0306 02:50:39.494757 157263 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:50:39.496498 157263 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep105s0/ (with network device)
I0306 02:50:39.505214 157263 rdma_context.cpp:140] RDMA device: rocep105s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0d:06:90:81:ff:fe:39:e3:e8
I0306 02:50:39.505223 157263 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:50:39.505373 157263 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep121s0/ (with network device)
I0306 02:50:39.593290 157263 rdma_context.cpp:140] RDMA device: rocep121s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0e:06:90:81:ff:fe:39:01:c8
I0306 02:50:39.593300 157263 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:50:39.593482 157263 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep137s0/ (with network device)
I0306 02:50:39.615041 157263 rdma_context.cpp:140] RDMA device: rocep137s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0c:06:90:81:ff:fe:39:f4:20
I0306 02:50:39.615085 157263 transfer_engine_bench.cpp:294] DRAM is used, numa node num: 2
I0306 02:50:49.786646 157290 transfer_engine_bench.cpp:397] Worker 5 stopped!
I0306 02:50:49.786700 157287 transfer_engine_bench.cpp:397] Worker 2 stopped!
I0306 02:50:49.786803 157288 transfer_engine_bench.cpp:397] Worker 3 stopped!
I0306 02:50:49.786849 157285 transfer_engine_bench.cpp:397] Worker 0 stopped!
I0306 02:50:49.786900 157286 transfer_engine_bench.cpp:397] Worker 1 stopped!
I0306 02:50:49.786967 157294 transfer_engine_bench.cpp:397] Worker 9 stopped!
I0306 02:50:49.787019 157293 transfer_engine_bench.cpp:397] Worker 8 stopped!
I0306 02:50:49.787066 157295 transfer_engine_bench.cpp:397] Worker 10 stopped!
I0306 02:50:49.787087 157289 transfer_engine_bench.cpp:397] Worker 4 stopped!
I0306 02:50:49.787109 157296 transfer_engine_bench.cpp:397] Worker 11 stopped!
I0306 02:50:49.787153 157291 transfer_engine_bench.cpp:397] Worker 6 stopped!
I0306 02:50:49.787196 157292 transfer_engine_bench.cpp:397] Worker 7 stopped!
I0306 02:50:49.787349 157263 transfer_engine_bench.cpp:513] Test completed: duration 10.00, batch count 161008, throughput 135.03 GB/s
I0306 02:50:49.845264 157263 transfer_metadata.cpp:301] removeSegmentDesc mi355-gpu-17:16998 finish
root@mi355-gpu-17:/workspace# MC_GID_INDEX=1 \
transfer_engine_bench \
  --mode=initiator \
  --device_name=rocep105s0,rocep121s0,rocep137s0,rocep153s0,rocep233s0,rocep249s0,rocep25s0,rocep9s0 \
  --metadata_server=P2PHANDSHAKE \
  --segment_id=mi355-gpu-3:15717
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 02:53:50.225337 157304 transfer_engine_impl.cpp:586] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0306 02:53:50.225410 157304 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: mi355-gpu-17 port: 12001
I0306 02:53:50.225441 157304 transfer_engine_impl.cpp:172] Transfer Engine RPC using P2P handshake, listening on mi355-gpu-17:16354
I0306 02:53:50.225615 157304 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.
I0306 02:53:50.225711 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.227766 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep105s0/ (with network device)
I0306 02:53:50.276499 157304 rdma_context.cpp:140] RDMA device: rocep105s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0d:06:90:81:ff:fe:39:e3:e8
I0306 02:53:50.276512 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.276713 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep121s0/ (with network device)
I0306 02:53:50.313146 157304 rdma_context.cpp:140] RDMA device: rocep121s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0e:06:90:81:ff:fe:39:01:c8
I0306 02:53:50.313156 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.313344 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep137s0/ (with network device)
I0306 02:53:50.317991 157304 rdma_context.cpp:140] RDMA device: rocep137s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0c:06:90:81:ff:fe:39:f4:20
I0306 02:53:50.318003 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.318261 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep153s0/ (with network device)
I0306 02:53:50.358019 157304 rdma_context.cpp:140] RDMA device: rocep153s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0b:06:90:81:ff:fe:39:e7:18
I0306 02:53:50.358042 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.358292 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep233s0/ (with network device)
I0306 02:53:50.408221 157304 rdma_context.cpp:140] RDMA device: rocep233s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:09:06:90:81:ff:fe:3a:48:c8
I0306 02:53:50.408231 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.408473 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep249s0/ (with network device)
I0306 02:53:50.467978 157304 rdma_context.cpp:140] RDMA device: rocep249s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0a:06:90:81:ff:fe:39:f0:c0
I0306 02:53:50.467989 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.468284 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep25s0/ (with network device)
I0306 02:53:50.476724 157304 rdma_context.cpp:140] RDMA device: rocep25s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0f:06:90:81:ff:fe:39:ef:10
I0306 02:53:50.476734 157304 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:53:50.477012 157304 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep9s0/ (with network device)
I0306 02:53:50.549619 157304 rdma_context.cpp:140] RDMA device: rocep9s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:10:06:90:81:ff:fe:39:f9:78
I0306 02:53:50.549677 157304 transfer_engine_bench.cpp:294] DRAM is used, numa node num: 2
I0306 02:54:00.777693 157343 transfer_engine_bench.cpp:397] Worker 3 stopped!
I0306 02:54:00.777719 157351 transfer_engine_bench.cpp:397] Worker 11 stopped!
I0306 02:54:00.777737 157344 transfer_engine_bench.cpp:397] Worker 4 stopped!
I0306 02:54:00.777751 157346 transfer_engine_bench.cpp:397] Worker 6 stopped!
I0306 02:54:00.781872 157342 transfer_engine_bench.cpp:397] Worker 2 stopped!
I0306 02:54:00.781885 157340 transfer_engine_bench.cpp:397] Worker 0 stopped!
I0306 02:54:00.781942 157347 transfer_engine_bench.cpp:397] Worker 7 stopped!
I0306 02:54:00.781960 157348 transfer_engine_bench.cpp:397] Worker 8 stopped!
I0306 02:54:00.781966 157350 transfer_engine_bench.cpp:397] Worker 10 stopped!
I0306 02:54:00.781977 157349 transfer_engine_bench.cpp:397] Worker 9 stopped!
I0306 02:54:00.781998 157345 transfer_engine_bench.cpp:397] Worker 5 stopped!
I0306 02:54:00.782004 157341 transfer_engine_bench.cpp:397] Worker 1 stopped!
I0306 02:54:00.782276 157304 transfer_engine_bench.cpp:513] Test completed: duration 10.01, batch count 141295, throughput 118.40 GB/s
I0306 02:54:00.883960 157304 transfer_metadata.cpp:301] removeSegmentDesc mi355-gpu-17:16354 finish
root@mi355-gpu-17:/workspace# MC_GID_INDEX=1 \
transfer_engine_bench \
  --mode=initiator \
  --device_name=rocep105s0,rocep121s0,rocep137s0,rocep153s0 \
  --metadata_server=P2PHANDSHAKE \
  --segment_id=mi355-gpu-3:15563
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 02:56:41.366161 157369 transfer_engine_impl.cpp:586] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I0306 02:56:41.366235 157369 transfer_engine_impl.cpp:105] Transfer Engine parseHostNameWithPort. server_name: mi355-gpu-17 port: 12001
I0306 02:56:41.366266 157369 transfer_engine_impl.cpp:172] Transfer Engine RPC using P2P handshake, listening on mi355-gpu-17:16613
I0306 02:56:41.366431 157369 rdma_transport.cpp:63] [RDMA] Relaxed ordering disabled via MC_IB_PCI_RELAXED_ORDERING=0. Falling back to strict ordering.
I0306 02:56:41.366472 157369 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:56:41.368603 157369 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep105s0/ (with network device)
I0306 02:56:41.401108 157369 rdma_context.cpp:140] RDMA device: rocep105s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0d:06:90:81:ff:fe:39:e3:e8
I0306 02:56:41.401121 157369 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:56:41.401405 157369 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep121s0/ (with network device)
I0306 02:56:41.445807 157369 rdma_context.cpp:140] RDMA device: rocep121s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0e:06:90:81:ff:fe:39:01:c8
I0306 02:56:41.445820 157369 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:56:41.446089 157369 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep137s0/ (with network device)
I0306 02:56:41.451256 157369 rdma_context.cpp:140] RDMA device: rocep137s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0c:06:90:81:ff:fe:39:f4:20
I0306 02:56:41.451267 157369 rdma_context.cpp:77] Using SIEVE endpoint store
I0306 02:56:41.451516 157369 rdma_context.cpp:594] Using user-specified GID index: 1 on rocep153s0/ (with network device)
I0306 02:56:41.502600 157369 rdma_context.cpp:140] RDMA device: rocep153s0, LID: 0, GID: (GID_Index 1) fd:93:16:d3:59:b6:01:0b:06:90:81:ff:fe:39:e7:18
I0306 02:56:41.502647 157369 transfer_engine_bench.cpp:294] DRAM is used, numa node num: 2
I0306 02:56:51.704783 157403 transfer_engine_bench.cpp:397] Worker 10 stopped!
I0306 02:56:51.704821 157399 transfer_engine_bench.cpp:397] Worker 6 stopped!
I0306 02:56:51.704851 157400 transfer_engine_bench.cpp:397] Worker 7 stopped!
I0306 02:56:51.704885 157395 transfer_engine_bench.cpp:397] Worker 2 stopped!
I0306 02:56:51.704910 157402 transfer_engine_bench.cpp:397] Worker 9 stopped!
I0306 02:56:51.704941 157397 transfer_engine_bench.cpp:397] Worker 4 stopped!
I0306 02:56:51.704955 157404 transfer_engine_bench.cpp:397] Worker 11 stopped!
I0306 02:56:51.704993 157398 transfer_engine_bench.cpp:397] Worker 5 stopped!
I0306 02:56:51.705013 157396 transfer_engine_bench.cpp:397] Worker 3 stopped!
I0306 02:56:51.709245 157401 transfer_engine_bench.cpp:397] Worker 8 stopped!
I0306 02:56:51.709246 157393 transfer_engine_bench.cpp:397] Worker 0 stopped!
I0306 02:56:51.709291 157394 transfer_engine_bench.cpp:397] Worker 1 stopped!
I0306 02:56:51.709574 157369 transfer_engine_bench.cpp:513] Test completed: duration 10.01, batch count 90621, throughput 75.94 GB/s
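
For anyone who wants to tabulate runs like the ones above without reading each log by hand, here is a small sketch that pairs every `--device_name` list with the next "Test completed" line. It relies only on the line formats visible in these logs; whether other versions of `transfer_engine_bench` print the same summary line is an assumption. The function name `summarize` and the abridged `sample` text are illustrative.

```python
import re

# Summary line format as it appears in the logs above (transfer_engine_bench.cpp:513).
SUMMARY_RE = re.compile(
    r"Test completed: duration (?P<duration>[\d.]+), "
    r"batch count (?P<batches>\d+), "
    r"throughput (?P<throughput>[\d.]+) GB/s"
)
DEVICES_RE = re.compile(r"--device_name=(\S+)")


def summarize(log_text):
    """Pair each --device_name list with the next 'Test completed' line."""
    runs = []
    devices = None
    for line in log_text.splitlines():
        m = DEVICES_RE.search(line)
        if m:
            # A later --device_name replaces an earlier one, so an aborted
            # command (one with no summary line) is silently dropped.
            devices = m.group(1).split(",")
            continue
        m = SUMMARY_RE.search(line)
        if m and devices is not None:
            throughput = float(m.group("throughput"))
            runs.append({
                "nics": len(devices),
                "devices": devices,
                "duration_s": float(m.group("duration")),
                "batches": int(m.group("batches")),
                "throughput_gb_s": throughput,
                "per_nic_gb_s": throughput / len(devices),
            })
            devices = None
    return runs


# Abridged excerpt of the logs above: one aborted command, then two completed runs.
sample = """
  --device_name=rocep105s0,rocep121s0
  --segment_id=mi355-gpu-3:16526^C
  --device_name=rocep105s0,rocep121s0
I0306 02:47:27.915509 157228 transfer_engine_bench.cpp:513] Test completed: duration 10.00, batch count 113245, throughput 94.98 GB/s
  --device_name=rocep105s0
I0306 02:46:09.850471 157201 transfer_engine_bench.cpp:513] Test completed: duration 10.00, batch count 57600, throughput 48.30 GB/s
"""

for run in summarize(sample):
    print(run["nics"], run["throughput_gb_s"], round(run["per_nic_gb_s"], 2))
```

Dividing by the NIC count gives an average per-NIC figure only. It does not show whether individual NICs are unevenly loaded, which would need per-device counters.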

show_gids

root@mi355-gpu-17:/workspace# show_gid
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
rocep105s0      1       0       fe80:0000:0000:0000:0690:81ff:fe39:e3e8                 v2      enp105s0
rocep105s0      1       1       fd93:16d3:59b6:010d:0690:81ff:fe39:e3e8                 v2      enp105s0
rocep121s0      1       0       fe80:0000:0000:0000:0690:81ff:fe39:01c8                 v2      enp121s0
rocep121s0      1       1       fd93:16d3:59b6:010e:0690:81ff:fe39:01c8                 v2      enp121s0
rocep137s0      1       0       fe80:0000:0000:0000:0690:81ff:fe39:f420                 v2      enp137s0
rocep137s0      1       1       fd93:16d3:59b6:010c:0690:81ff:fe39:f420                 v2      enp137s0
rocep153s0      1       0       fe80:0000:0000:0000:0690:81ff:fe39:e718                 v2      enp153s0
rocep153s0      1       1       fd93:16d3:59b6:010b:0690:81ff:fe39:e718                 v2      enp153s0
rocep193s0f0    1       0       fe80:0000:0000:0000:7ec2:55ff:feba:ff88                 v1      enp193s0f0np0
rocep193s0f0    1       1       fe80:0000:0000:0000:7ec2:55ff:feba:ff88                 v2      enp193s0f0np0
rocep193s0f0    1       2       0000:0000:0000:0000:0000:ffff:2d3f:4cc8 45.63.76.200    v1      enp193s0f0np0
rocep193s0f0    1       3       0000:0000:0000:0000:0000:ffff:2d3f:4cc8 45.63.76.200    v2      enp193s0f0np0
rocep193s0f1    1       0       fe80:0000:0000:0000:7ec2:55ff:feba:ff89                 v1      enp193s0f1np1
rocep193s0f1    1       1       fe80:0000:0000:0000:7ec2:55ff:feba:ff89                 v2      enp193s0f1np1
rocep193s0f1    1       2       0000:0000:0000:0000:0000:ffff:0a02:9017 10.2.144.23     v1      enp193s0f1np1
rocep193s0f1    1       3       0000:0000:0000:0000:0000:ffff:0a02:9017 10.2.144.23     v2      enp193s0f1np1
rocep233s0      1       0       fe80:0000:0000:0000:0690:81ff:fe3a:48c8                 v2      enp233s0
rocep233s0      1       1       fd93:16d3:59b6:0109:0690:81ff:fe3a:48c8                 v2      enp233s0
rocep249s0      1       0       fe80:0000:0000:0000:0690:81ff:fe39:f0c0                 v2      enp249s0
rocep249s0      1       1       fd93:16d3:59b6:010a:0690:81ff:fe39:f0c0                 v2      enp249s0
rocep25s0       1       0       fe80:0000:0000:0000:0690:81ff:fe39:ef10                 v2      enp25s0
rocep25s0       1       1       fd93:16d3:59b6:010f:0690:81ff:fe39:ef10                 v2      enp25s0
rocep9s0        1       0       fe80:0000:0000:0000:0690:81ff:fe39:f978                 v2      enp9s0
rocep9s0        1       1       fd93:16d3:59b6:0110:0690:81ff:fe39:f978                 v2      enp9s0
n_gids_found=24

Before submitting a new issue...

  • Make sure you already searched for relevant issues and read the documentation
