Skip to content

Comments

Refactor and enhance RDMA implementation (verbs and tcp providers)#400

Merged
moleksy merged 6 commits intomainfrom
rdma_verbs
Jul 1, 2025
Merged

Refactor and enhance RDMA implementation (verbs and tcp providers)#400
moleksy merged 6 commits intomainfrom
rdma_verbs

Conversation

@moleksy
Copy link
Collaborator

@moleksy moleksy commented Jun 17, 2025

  • Introduced Verbs RDMA provider
  • User can choose provider and number of endpoints used for transmission
  • Introduced a new test suite for libfabric development, including comprehensive tests for RDMA initialisation and teardown.
  • Added custom fake functions for RDMA address info handling to improve test isolation and reliability.
  • Created a new header file for libfabric development tests to encapsulate test setup and teardown logic.
  • Implemented performance testing for RDMA receiver and transmitter, focusing on TTLB and bandwidth metrics across varying payload and queue sizes.
  • Enhanced the mcm_rdma_args structure to include the provider and the number of endpoints for improved configuration flexibility.
  • Updated existing tests to remove unnecessary calls to fi_getinfo, ensuring tests focus on relevant functionality.
  • Improved memory management in tests to prevent leaks and ensure proper cleanup of allocated resources.

- Introduced a new test suite for libfabric development, including comprehensive tests for RDMA initialization and teardown.
- Added custom fake functions for RDMA address info handling to improve test isolation and reliability.
- Created a new header file for libfabric development tests to encapsulate test setup and teardown logic.
- Implemented performance testing for RDMA receiver and transmitter, focusing on TTLB and bandwidth metrics across varying payload and queue sizes.
- Enhanced the mcm_rdma_args structure to include provider and number of endpoints for better configuration flexibility.
- Updated existing tests to remove unnecessary calls to fi_getinfo, ensuring tests focus on relevant functionality.
- Improved memory management in tests to prevent leaks and ensure proper cleanup of allocated resources.
Copilot AI review requested due to automatic review settings June 17, 2025 14:01
@moleksy moleksy requested a review from soopel as a code owner June 17, 2025 14:01
@moleksy moleksy requested review from Sakoram, ko80 and tszumski June 17, 2025 14:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Refactors and enhances the RDMA provider implementation, adds configurable providers and multiple endpoints support, and bolsters test coverage with new physical-RDMA tests.

  • Extend mcm_rdma_args with provider and num_endpoints
  • Refactor Rdma to support multiple endpoints, round-robin CQ handling, and both "verbs" and "tcp" providers
  • Add dedicated physical-RDMA TX/RX test executables and expand unit tests for endpoint and device initialization

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/include/mcm_dp.h Add provider and num_endpoints to RDMA arguments
media-proxy/src/mesh/conn_rdma.cc Full rewrite of Rdma to handle multi-EP, providers, ports
media-proxy/src/mesh/conn_rdma_tx.cc Update TX CQ/thread logic for multi-EP and error handling
media-proxy/src/mesh/conn_rdma_rx.cc Update RX CQ/thread logic with reorder buffer and back-off
media-proxy/tests/{rdma_tx,rdma_rx}.cpp New physical-RDMA TX/RX performance tests
media-proxy/tests/libfabric_* Revise and consolidate libfabric initialization tests
media-proxy/tests/conn_rdma_tests.cc Update unit tests for multi-EP behaviors
media-proxy/tests/CMakeLists.txt Add executables for physical-RDMA tests
media-proxy/include/mesh/conn_rdma.h Introduce fields for multi-EP support
media-proxy/include/libfabric_* Extend device/context types for provider, ports, kind
media-proxy/src/libfabric_{dev,ep}.c Update libfabric calls for fi_getinfo v2.0 and verbs, CMA
Comments suppressed due to low confidence (1)

media-proxy/include/mesh/conn_rdma.h:17

  • Header adds a std::vector<ep_ctx_t*> member but does not #include <vector>, leading to compilation errors; please add #include <vector>.
#include <array>

@ko80 ko80 removed the request for review from soopel June 17, 2025 14:14
moleksy added 3 commits June 30, 2025 11:30
- Moved variables `next_tx_idx` and `next_rx_idx` from `Rdma` constructor to tx and rx respectively.
- Updated default value for `rdma_num_eps` to 1 in `Rdma::configure` method.
- Changed endpoint kind definitions from `KIND_RECEIVER` and `KIND_TRANSMITTER` to `FI_KIND_RECEIVER` and `FI_KIND_TRANSMITTER`.
- Introduced cleanup helper function to manage memory for endpoint clones in `Rdma::on_establish`.
- Improved error handling and resource cleanup during RDMA endpoint initialization.
- Added `metrics.h` to define structures for wire headers and statistics messages.
- Enhanced `PerfReceiver` class to track latency and throughput metrics.
- Updated RDMA receiver and transmitter tests to incorporate new metrics and validate performance.
- Implemented a new test structure for measuring latency and bandwidth across varying payload sizes and queue sizes.
…y in connection kind enumeration.

Remove unnecessary double allocation of certain attributes in hints.
Copy link
Collaborator

@ko80 ko80 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@moleksy moleksy merged commit 81e563b into main Jul 1, 2025
9 checks passed
@moleksy moleksy deleted the rdma_verbs branch July 1, 2025 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants