Rebase and merge min_bw from downstream#4

Open
liayan wants to merge 200 commits into master from ly/merge-min-bw-from-downstream
liayan commented Aug 21, 2025

No description provided.

Huai-En, Tseng and others added 30 commits May 18, 2023 10:57
Issue: When the send_rcredit flag is set to 1, credit_mr, ctrl_buf, ctrl_sge_list and ctrl_wr are allocated.
       However, these entities are deallocated only under additional conditions, such as work_rdma_cm == ON.
       Those conditions do not always hold: ib_send_bw can be run without rdma cm, which leads to this error message during PD deallocation:
       "Failed to deallocate PD - Device or resource busy"

Fix: If the send_rcredit flag was used when creating these entities, it is also used when deallocating them, without additional conditions.
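
The fix above amounts to guarding allocation and release with the same flag. A minimal sketch of that pattern follows; `struct credit_ctx`, `credit_ctx_init` and `credit_ctx_destroy` are hypothetical names used for illustration only and are not perftest's actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the credit-related fields of the test context. */
struct credit_ctx {
    int   send_rcredit;  /* mirrors the send_rcredit flag */
    int   use_rdma_cm;   /* mirrors work_rdma_cm == ON */
    void *ctrl_buf;      /* allocated only when send_rcredit is set */
};

/* Allocate the control buffer when, and only when, send_rcredit is set. */
static int credit_ctx_init(struct credit_ctx *c, size_t len)
{
    assert(c != NULL);
    c->ctrl_buf = NULL;
    if (c->send_rcredit) {
        c->ctrl_buf = calloc(1, len);
        if (c->ctrl_buf == NULL)
            return -1;
    }
    return 0;
}

/* Release under the same condition that guarded the allocation.
 * use_rdma_cm is deliberately not consulted here. */
static void credit_ctx_destroy(struct credit_ctx *c)
{
    assert(c != NULL);
    if (c->send_rcredit) {
        free(c->ctrl_buf);
        c->ctrl_buf = NULL;
    }
}
```

With this shape, a run without rdma cm still releases everything that send_rcredit caused to be allocated.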
Modify --source_ip to --bind_source_ip to fix initial connection establishment with a specific interface
Revert "Perftest: replace rand() with getrandom() during MR buffer initialization"
When using the send verb with a shared queue in perftest, the WQE
length is set to the message size, which is harmful to the
scatter entry caching feature.

This commit aligns the WQE length to the MTU, which improves caching
and results in better performance.

Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
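
One plausible reading of "align WQE length to MTU" is rounding the length up to the next multiple of the MTU. The sketch below shows only that arithmetic; `align_len_to_mtu` is a hypothetical helper, and the commit's actual implementation may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Round len up to the next multiple of mtu.
 * Assumption: mtu is non-zero and len + mtu does not overflow 32 bits. */
static uint32_t align_len_to_mtu(uint32_t len, uint32_t mtu)
{
    assert(mtu != 0);
    return ((len + mtu - 1) / mtu) * mtu;
}
```

For example, a 4097-byte message with a 4096-byte MTU would be posted with a length of 8192 under this reading.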
Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
Currently, initial negotiation is performed via IPv4, which is not suitable
for modern IPv6-only topologies.
This patch allows specifying which address family to use; the default
behaviour is unchanged.

New option:  --ipv6-addr

Usage example:
./ib_write_bw -d mlx5_0 --ipv6-addr
./ib_write_bw -d mlx5_0 --ipv6-addr 2a02:6b8:c0e:97f:0:441d:9fbd:3f1e

signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
Fix and optimize some code sections in initial communication
functions.

Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
Add ipv6 address support for initial communication.
When rdmacm is not used, ctx_close_connection() does a handshake and
then sends "done" over the socket before closing it. This is racy and
can lead to a spurious error:

    HOST A                                  HOST B

    ctx_close_connection()
                                            ctx_close_connection()
      ctx_hand_shake() [ succeeds ]
                                              ctx_hand_shake() [ succeeds ]
                                              write(sockfd, "done")
                                              close(sockfd)

      write(sockfd, "done") <-- fails since HOST B has closed the
                                socket and replies with a TCP RST

Fix this simply by deleting the write(). The ctx_hand_shake() already
ensures the two sides are in sync and can proceed to the close().

Signed-off-by: Roland Dreier <roland@enfabrica.net>
Neuron introduced an API for exporting DMA buffers for an allocated tensor
address. The API was introduced in Neuron runtime library 2.13.6. Add
Neuron dmabuf support and an additional flag to specify whether to use
DMA buffers or peer-to-peer communication.

Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
The Neuron and Habana HW accelerator flags don't appear in the man page;
add them.

Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Fix ib_send_bw bidir duration mode case to check if the bandwidth
really crossed limit_bw.

Fixes: 3528004
Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
Add support for Neuron HW accelerator DMA buffers
Fix race in non-rdmacm ctx_close_connection()
Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
Signed-off-by: Hassan Khadour <hkhadour@nvidia.com>
For RoCE, the UDP source port (sport) is selected randomly when flow_label is 0.
This makes the traffic take different paths from one test run
to the next.
Add a flow_label option so that tests use the same UDP sport and
therefore take the same path.

Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com>
perftest: support set flow_label through env variable FLOW_LABEL
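
As an illustration of reading a flow label from the FLOW_LABEL environment variable, here is a minimal sketch. The helper name `flow_label_from_env` and its error convention are assumptions for this example, not perftest's code; the only fact relied on is that an IPv6 flow label is a 20-bit field.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define FLOW_LABEL_MAX 0xFFFFFu  /* an IPv6 flow label is 20 bits wide */

/* Hypothetical helper: parse FLOW_LABEL (decimal, octal or 0x-prefixed hex).
 * Returns 0 and stores the value in *out on success; returns -1 if the
 * variable is unset, empty, malformed, or larger than 20 bits. */
static int flow_label_from_env(uint32_t *out)
{
    const char *s = getenv("FLOW_LABEL");
    char *end = NULL;
    unsigned long v;

    assert(out != NULL);
    if (s == NULL || *s == '\0')
        return -1;

    errno = 0;
    v = strtoul(s, &end, 0);
    if (errno != 0 || *end != '\0' || v > FLOW_LABEL_MAX)
        return -1;

    *out = (uint32_t)v;
    return 0;
}
```

A caller would typically fall back to the default of 0 (random UDP sport) when this returns -1.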
Update cuda_memory_init to error out if cuDeviceGetByPCIBusId fails. Otherwise, it silently picks device 0 (i.e., the value taken from perftest_parameters, which is initially memset to 0).
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Error messages should not be printed to stdout.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Error out if cuDeviceGetByPCIBusId fails
Avoid unnecessary memory allocation when the "mr_per_qp" flag is not set
Guofeng Yue and others added 27 commits May 20, 2025 19:13
In perform_warm_up mode, if the length of post_list is 1 and the
message size is less than or equal to 8192, all send_flags in WRs
are 0 and no CQEs are generated, since IBV_SEND_SIGNALED is
not set. As a result, the perform_warm_up process gets stuck in
an infinite poll-CQ loop.

Set IBV_SEND_SIGNALED in this case to request a CQE, and clear the
flag after post_send_method to avoid affecting subsequent tests.

Fixes: 56d025e ("Allow overriding CQ moderation on post list mode (linux-rdma#58)")
Signed-off-by: Guofeng Yue <yueguofeng@h-partners.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Perftest: Do not align SRQ recv length to MTU for hns
Cuda: Use pcie mapping regardless of data direct
To build with MLU DMA-BUF support, use:
./configure --enable-mlu  --with-mlu=</usr/local/neuware>

To run with MLU DMA-BUF enabled, use:
ib_write_bw --use_mlu=<device id> --use_mlu_dmabuf

Signed-off-by: hancheng <hancheng@cambricon.com>
Signed-off-by: hancheng <hancheng@cambricon.com>
Add GitHub actions support.
Add a test to build perftest on top of ubuntu24.04 and cuda12.9 to
verify build process.

Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
Perftest: Add GitHub actions support
Currently perftest supports null-mr on the client side only (sender).
This commit adds support on the server side (receiver).

Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
The device IDs of some of our equipment overlap with Mellanox's,
so we need to add vendor ID and dev types for proper differentiation.

signed-off-by: tianx@yunsilicon.com
Add support for DMA-buffers in Cambricon devices
The mlx5 driver enables the SCATTER2CQE feature by default for payloads of up to 64B.
That is, messages of up to 64B can be scattered to the CQE on the
responder side.

Peer memory in general, and specifically GPUDirect here,
doesn't work with this feature, so it must be disabled.

Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
Enabling the HAVE_MLX5DV and HAVE_OOO_RECV_WRS flags during perftest
compilation leads to an issue when testing non-Mellanox devices.
Specifically, the create_qp function invokes mlx5dv_query_device,
which is intended for Mellanox devices. This call causes the test to
terminate prematurely, as third-party devices do not support the mlx5dv
interface.
This commit bypasses mlx5dv_query_device when a non-Mellanox device is used.

Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
Warn if blueflame is not supported, as it can impact latency.
Add the warning to all latency tests.

Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
This commit refactors the CUDA integration in Perftest by dynamically
loading the CUDA library (`libcuda.so`) instead of linking it
statically.

Changes include:
- Introduced `cuda_loader.c` to handle dynamic loading of CUDA
functions.
- Modified `cuda_memory.c` to use dynamically loaded function pointers
  instead of direct CUDA API calls.
- Ensured proper cleanup of resources by introducing
`unload_cuda_library()`.
- Find the CUDA header path automatically and set the related defines if it exists.

This change increases flexibility, allowing Perftest to be compiled on
systems with CUDA and run on systems both with and without CUDA.

Signed-off-by: Shmuel Shaul <sshaul@nvidia.com>
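
Dynamic loading of this kind is usually built on `dlopen` and `dlsym`. The sketch below is not the contents of `cuda_loader.c`; the struct, the function names and the choice of the `libcuda.so.1` soname are assumptions for illustration. It resolves a single driver symbol, `cuInit`, and fails gracefully when the library is absent.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* cuInit returns a CUresult enum; int is used here as a stand-in so that
 * the sketch compiles without the CUDA headers. */
typedef int (*cu_init_fn)(unsigned int flags);

/* Hypothetical holder for the library handle and resolved entry points. */
struct cuda_lib {
    void      *handle;
    cu_init_fn cu_init;
};

/* Try to load the CUDA driver library at run time.
 * Returns 0 on success, -1 if the library or the symbol is unavailable. */
static int cuda_lib_load(struct cuda_lib *lib)
{
    assert(lib != NULL);
    lib->cu_init = NULL;
    lib->handle = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
    if (lib->handle == NULL)
        return -1;  /* no CUDA driver on this system: not fatal */

    lib->cu_init = (cu_init_fn)dlsym(lib->handle, "cuInit");
    if (lib->cu_init == NULL) {
        dlclose(lib->handle);
        lib->handle = NULL;
        return -1;
    }
    return 0;
}

/* Release the library; safe to call after a failed load. */
static void cuda_lib_unload(struct cuda_lib *lib)
{
    assert(lib != NULL);
    if (lib->handle != NULL)
        dlclose(lib->handle);
    lib->handle = NULL;
    lib->cu_init = NULL;
}
```

On systems with an older glibc this needs `-ldl` at link time. The binary then starts on hosts without CUDA and only reports an error if a CUDA feature is actually requested.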
@liayan liayan requested a review from antgun42 August 22, 2025 15:25
liayan commented Aug 22, 2025

Only the last five commits are the actual change; the others come from the new perftest version.
It passed all IB tests in the RoCE environment.
I will test it on non-RoCE nodes later.

#define USEC "usec"
/* The format of the results */

#define RESULT_FMT " #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps] BW min[MB/sec]"
This seems redundant
