Commit 12ce58c
support concurrency (#1943)
Summary:
Pull Request resolved: #1943
TL;DR:
BEFORE: flow was controlled by requiring the Python caller to obtain ownership of a QP and hold it for the duration of the call (.read_from/.write_into).
AFTER: QPs can now be cloned cheaply; wr_id values are generated with atomics, and posting relies on ibverbs' internal locks (ibv_post_send is thread-safe). The added complexity comes from work completion (WC) events, which may be returned out of order and are delivered only once, so WCs need to be stored in a separate cache.
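The out-of-order, deliver-once behavior of work completions can be illustrated with a minimal sketch. This is not the implementation from this commit: `Completion`, `CompletionCache`, and `poll_for` are hypothetical names, and the `polled` argument stands in for whatever a real CQ poll would return.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a work completion; real ibverbs WCs carry more fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Completion {
    pub wr_id: u64,
    pub ok: bool,
}

/// Parks completions that were polled before their owner asked for them.
#[derive(Default)]
pub struct CompletionCache {
    by_wr_id: HashMap<u64, Completion>,
}

impl CompletionCache {
    /// Store everything the (simulated) CQ poll returned, then hand back the
    /// completion for `wanted` if it has been seen so far.
    pub fn poll_for(
        &mut self,
        wanted: u64,
        polled: impl IntoIterator<Item = Completion>,
    ) -> Option<Completion> {
        // A completion is delivered only once, so it must be kept even when
        // it belongs to a different caller.
        for wc in polled {
            self.by_wr_id.insert(wc.wr_id, wc);
        }
        // Remove on hit so each completion is consumed exactly once.
        self.by_wr_id.remove(&wanted)
    }
}
```

In this sketch a caller waiting on wr_id 1 that happens to poll the completion for wr_id 2 leaves it in the cache, and the caller waiting on wr_id 2 later finds it there without touching the CQ.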
### Atomic Counters in rdmaxcel_qp_t for Lock-Free Operations
The rdmaxcel_qp_t wrapper uses atomic counters to enable concurrent, lock-free work request posting:
```
typedef struct rdmaxcel_qp {
  struct ibv_qp* ibv_qp;
  struct ibv_cq* send_cq;
  struct ibv_cq* recv_cq;

  // Atomic counters for lock-free concurrent access
  _Atomic uint64_t send_wqe_idx;  // Next send WQE slot
  _Atomic uint64_t send_db_idx;   // Last doorbell rung
  _Atomic uint64_t recv_wqe_idx;  // Next recv WQE slot
  _Atomic uint64_t recv_db_idx;   // Last recv doorbell
  _Atomic uint64_t rts_timestamp; // Ready-to-send timestamp

  // Completion caches for efficient polling
  completion_cache_t* send_completion_cache;
  completion_cache_t* recv_completion_cache;
} rdmaxcel_qp_t;
```
Key Benefits:
- Multiple threads can post work requests concurrently using fetch_add on atomic indices
- No locks needed for the hot path (posting operations)
- Each thread gets a unique WQE slot atomically
- Completion polling uses cached results to avoid redundant CQ polls
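The unique-slot property listed above follows directly from fetch_add semantics. The sketch below mirrors only the `send_wqe_idx` counter; `SendCounters`, `reserve_slot`, and `reserve_concurrently` are hypothetical names, and the `Relaxed` ordering is an assumption made for illustration rather than a statement about what the C code uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical mirror of the send-side counter in `rdmaxcel_qp_t`.
pub struct SendCounters {
    pub send_wqe_idx: AtomicU64,
}

/// Reserve the next send slot. fetch_add returns the previous value, so no
/// two callers can observe the same index.
pub fn reserve_slot(counters: &SendCounters) -> u64 {
    // Assumption: Relaxed is enough here because only uniqueness is needed.
    counters.send_wqe_idx.fetch_add(1, Ordering::Relaxed)
}

/// Spawn `threads` workers that each reserve `per_thread` slots and return
/// every slot that was handed out.
pub fn reserve_concurrently(threads: usize, per_thread: usize) -> Vec<u64> {
    let counters = Arc::new(SendCounters {
        send_wqe_idx: AtomicU64::new(0),
    });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counters);
            thread::spawn(move || {
                (0..per_thread).map(|_| reserve_slot(&c)).collect::<Vec<u64>>()
            })
        })
        .collect();
    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

Calling `reserve_concurrently(4, 1000)` hands out 4000 distinct indices without any mutex, which is the property the posting hot path relies on.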
### Mutex-Protected Queue Pair Creation
While operations are lock-free, QP creation is serialized using Rust Arc<Mutex<HashSet>>:
```
pub struct RdmaManagerActor {
    // Track QPs currently being created to prevent duplicate creation
    pending_qp_creation: Arc<Mutex<HashSet<(String, ActorId, String)>>>,
    // ...
}
```
Creation Flow:
1. Thread checks if QP exists (lock-free read from HashMap)
2. If not, acquires mutex and checks pending_qp_creation set
3. If another thread is creating it, waits without holding lock
4. Otherwise, inserts key into set, releases lock, and creates QP
5. After creation, removes key from set
This prevents race conditions where multiple threads try to create the same QP simultaneously while keeping the common path (using existing QPs) lock-free.
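The claim-then-create part of this flow can be sketched as follows. This is an illustration under stated assumptions, not the actor's actual code: `PendingCreations`, `try_begin`, and `finish` are hypothetical names, and `QpKey` replaces `ActorId` with a plain `String` so the example is self-contained.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical key type; the real set uses (String, ActorId, String).
pub type QpKey = (String, String, String);

/// Tracks which QPs are currently being created.
#[derive(Clone, Default)]
pub struct PendingCreations {
    inner: Arc<Mutex<HashSet<QpKey>>>,
}

impl PendingCreations {
    /// Try to claim creation of the QP for `key`. Returns true if the caller
    /// should create it, false if another caller already claimed it.
    pub fn try_begin(&self, key: &QpKey) -> bool {
        // The guard is a temporary, so the lock is released as soon as the
        // insert returns; QP creation itself runs without holding it.
        self.inner
            .lock()
            .expect("pending set mutex poisoned")
            .insert(key.clone())
    }

    /// Release the claim once creation has finished.
    pub fn finish(&self, key: &QpKey) {
        self.inner
            .lock()
            .expect("pending set mutex poisoned")
            .remove(key);
    }
}
```

A caller for which `try_begin` returns false would wait and re-check for the finished QP rather than start a second creation; how that wait is implemented is not shown here.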
### Resource Lifecycle Management
Simplified cleanup via rdmaxcel_qp_destroy:
- Previously: Rust manually destroyed ibv_qp and CQs separately (error-prone with concurrent access)
- Now: Single C function destroys all resources atomically
- Changed register_segments(pd, rdmaxcel_qp_t*) to work with wrapper instead of raw ibv_qp
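One way to picture the single-entry-point cleanup from the Rust side is an owning handle whose `Drop` makes exactly one destroy call. The commit message does not say that `Drop` is used, so treat this purely as a sketch: `MockQp`, `mock_qp_destroy`, and `QpHandle` are hypothetical, and the mock only counts calls instead of freeing verbs resources.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the C-side wrapper; it records destroy calls
/// instead of owning a real QP and CQs.
pub struct MockQp {
    destroy_calls: Arc<AtomicUsize>,
}

/// Stand-in for a single destroy entry point such as `rdmaxcel_qp_destroy`.
fn mock_qp_destroy(qp: &MockQp) {
    // A real implementation would release the QP and both CQs here.
    qp.destroy_calls.fetch_add(1, Ordering::SeqCst);
}

/// Owning handle: dropping it triggers exactly one destroy call.
pub struct QpHandle {
    qp: MockQp,
}

impl QpHandle {
    pub fn new(destroy_calls: Arc<AtomicUsize>) -> Self {
        QpHandle {
            qp: MockQp { destroy_calls },
        }
    }
}

impl Drop for QpHandle {
    fn drop(&mut self) {
        // All teardown goes through one function, so callers cannot free the
        // QP and its CQs in the wrong order or only partially.
        mock_qp_destroy(&self.qp);
    }
}
```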
Reviewed By: casteryh
Differential Revision: D87021168
fbshipit-source-id: c8e5fbca0c2a775801dc37e4e154b24daaddfa2a
1 parent b57ad38 commit 12ce58c
File tree
9 files changed (+1339 -544 lines changed)
- monarch_rdma
- src
- rdmaxcel-sys
- src