UCT/DEVICE: optimize doorbell ring #10989
base: master
Conversation
Walkthrough
GDA send-queue reservation now batches doorbell updates: a new sq_db_index field tracks the last ready WQE index for which the doorbell was rung, so only the thread that advances the ready index rings the doorbell.
Sequence Diagram(s)

sequenceDiagram
    participant EP as Endpoint
    participant SQ as SendQueue (wqe_base)
    participant Lock as sq_lock
    participant DB as Doorbell/DBR
    Note over EP,SQ: optimistic reservation -> compute wqe_next
    EP->>SQ: read wqe_base_orig
    loop atomic advance
        EP->>SQ: CAS wqe_base -> wqe_next
    end
    alt NODELAY & ready == wqe_next
        EP->>Lock: lock
        Lock->>EP: if sq_db_index != sq_ready_index
        EP->>DB: ring_db / update_dbr / ring_db
        EP->>Lock: unlock
    else batch boundary (wrap) reached
        EP->>Lock: lock
        Lock->>EP: if sq_db_index != sq_ready_index
        EP->>DB: ring_db / update_dbr / ring_db
        EP->>Lock: unlock
    else
        Note right of DB: DBR update deferred (batched)
    end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
✨ Finishing touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- src/uct/ib/mlx5/gdaki/gdaki.c (2 hunks)
- src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
- src/uct/ib/mlx5/gdaki/gdaki_dev.h (0 hunks)
- test/gtest/ucp/cuda/test_kernels.cu (0 hunks)
- test/gtest/ucp/cuda/test_kernels.h (0 hunks)
- test/gtest/ucp/test_ucp_device.cc (0 hunks)
💤 Files with no reviewable changes (4)
- test/gtest/ucp/test_ucp_device.cc
- test/gtest/ucp/cuda/test_kernels.h
- test/gtest/ucp/cuda/test_kernels.cu
- src/uct/ib/mlx5/gdaki/gdaki_dev.h
uct_rc_gdaki_dev_ep_t *ep, unsigned count)
{
    uint16_t wqe_cnt;
    uint64_t pi;

    uct_rc_mlx5_gda_read_cqe(ep, &wqe_cnt, NULL);
    pi = uct_rc_mlx5_gda_calc_pi(ep, wqe_cnt);
    return pi + ep->sq_wqe_num + 1 - count;
}
Avoid overbooking the SQ
uct_rc_mlx5_gda_max_alloc_wqe_base() currently returns pi + sq_wqe_num + 1 - count. When the SQ is already full (e.g. sq_rsvd_index == pi + sq_wqe_num and count == 1), the guard at Line 149 still passes, atomicAdd() hands back wqe_base == pi + sq_wqe_num, and we reuse the oldest in-flight WQE slot. That corrupts outstanding operations. Drop the extra + 1 so the cap stays at exactly sq_wqe_num.
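To make the off-by-one concrete, here is a small self-contained worked example with assumed values (sq_wqe_num = 64, pi = 0, count = 1) and a guard of the assumed form rsvd_index <= max_wqe_base; the variable names are illustrative, not the actual UCX fields:

```cuda
#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t sq_wqe_num    = 64;
    const uint64_t pi            = 0;
    const uint64_t count         = 1;
    const uint64_t sq_rsvd_index = pi + sq_wqe_num; /* SQ already full */

    /* With the extra "+ 1" the guard still passes, so the reservation
     * would hand out slot (pi + sq_wqe_num) % sq_wqe_num == 0, which is
     * still occupied by the oldest in-flight WQE. */
    uint64_t max_with_plus_one = pi + sq_wqe_num + 1 - count; /* == 64 */
    assert(sq_rsvd_index <= max_with_plus_one);               /* overbooked */

    /* Without the "+ 1" the cap is exactly sq_wqe_num and the full SQ is
     * correctly rejected. */
    uint64_t max_fixed = pi + sq_wqe_num - count;             /* == 63 */
    assert(sq_rsvd_index > max_fixed);                        /* refused */
    return 0;
}
```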
Apply this diff:
- return pi + ep->sq_wqe_num + 1 - count;
+ return pi + ep->sq_wqe_num - count;

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
uct_rc_gdaki_dev_ep_t *ep, unsigned count)
{
    uint16_t wqe_cnt;
    uint64_t pi;

    uct_rc_mlx5_gda_read_cqe(ep, &wqe_cnt, NULL);
    pi = uct_rc_mlx5_gda_calc_pi(ep, wqe_cnt);
    return pi + ep->sq_wqe_num - count;
}
🤖 Prompt for AI Agents
In src/uct/ib/mlx5/gdaki/gdaki.cuh around lines 132 to 140, the function
uct_rc_mlx5_gda_max_alloc_wqe_base() adds an extra "+ 1" to its returned value
which allows allocation to return pi + sq_wqe_num when the SQ is full, causing
reuse of the oldest in-flight WQE; remove the extra "+ 1" so the return becomes
pi + ep->sq_wqe_num - count, ensuring the cap is exactly sq_wqe_num and
preventing overbooking of the SQ.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
131-140: Avoid overbooking the SQ. As noted in the previous review, uct_rc_mlx5_gda_max_alloc_wqe_base() returns pi + sq_wqe_num + 1 - count, which allows allocation to exceed SQ capacity. When the SQ is full, this can cause reuse of in-flight WQE slots and corrupt outstanding operations. The extra + 1 should be removed. Apply this diff:
- return pi + ep->sq_wqe_num + 1 - count;
+ return pi + ep->sq_wqe_num - count;
🧹 Nitpick comments (2)
src/uct/ib/mlx5/gdaki/gdaki.cuh (2)
243-253: Consider adding exponential backoff to reduce contention. The spinlock implementation uses busy-waiting without backoff, which can waste GPU cycles under contention. While the PR's doorbell optimization reduces lock acquisition frequency, adding a minimal exponential backoff could further improve behavior under unexpectedly high contention.
Example with simple backoff:
 UCS_F_DEVICE void uct_rc_mlx5_gda_lock(int *lock)
 {
+    int backoff = 1;
-    while (atomicCAS(lock, 0, 1) != 0)
-        ;
+    while (atomicCAS(lock, 0, 1) != 0) {
+        for (int i = 0; i < backoff; i++) {
+            __nanosleep(1);
+        }
+        backoff = min(backoff * 2, 64);
+    }
 #ifdef DOCA_GPUNETIO_VERBS_HAS_FENCE_ACQUIRE_RELEASE_PTX
     asm volatile("fence.acquire.gpu;");
582-585: Consider documenting or removing the no-op progress function. The uct_rc_mlx5_gda_ep_progress function is now a no-op, consistent with removing per-EP progress tracking. If this function is required for API compatibility, consider adding a comment explaining why it's kept. Otherwise, if it's no longer called, it could be removed to reduce code clutter.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
🔇 Additional comments (7)
src/uct/ib/mlx5/gdaki/gdaki.cuh (7)
123-128: LGTM! Correct 16-bit wraparound handling. The PI calculation correctly handles 16-bit hardware counter wraparound within the 64-bit software-tracked reserved index. The formula properly reconstructs the full PI by adjusting for counter differences modulo 2^16.
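The exact expression lives in the source; as a minimal sketch of the general technique being praised here (anchoring a 16-bit hardware counter to a nearby 64-bit reference), with hypothetical names:

```cuda
#include <cstdint>

/* Sketch only: reconstruct a full 64-bit producer index from a 16-bit
 * hardware counter, using a 64-bit reference (the reserved index) that is
 * assumed to run at most 2^16 - 1 entries ahead of the completed PI. */
__device__ uint64_t gda_reconstruct_pi(uint64_t rsvd_index, uint16_t wqe_cnt)
{
    /* Distance from the counter up to the reference, taken modulo 2^16;
     * subtracting it yields the largest value <= rsvd_index that is
     * congruent to wqe_cnt modulo 2^16. */
    uint64_t delta = (rsvd_index - wqe_cnt) & 0xffff;
    return rsvd_index - delta;
}
```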
142-187: Reservation logic is sound, but depends on fixing the overbooking issue. The atomic reservation with rollback logic correctly handles concurrent WQE allocation attempts. The rollback loop properly uses CAS to undo over-reservations and re-evaluates availability. However, the correctness of this function depends on fixing the + 1 overbooking bug in uct_rc_mlx5_gda_max_alloc_wqe_base() (see previous comment). Once the overbooking fix is applied, verify that under heavy concurrent load:
- No WQE slots are double-allocated
- The rollback mechanism successfully prevents starvation
255-258: LGTM! Correct lock release with proper memory ordering. The unlock implementation correctly uses atomic store with release ordering to ensure that modifications made within the critical section are visible to the thread that next acquires the lock.
296-300: LGTM! Simplified flow control check. The flow control function correctly uses a bitwise mask to determine when to signal flow control updates. This returns true when the WQE index is a multiple of the flow control period, which is standard behavior for periodic signaling.
588-613: LGTM! Correct CQE-based completion checking with error handling. The completion check correctly uses the CQE-based PI to determine if an operation has completed. The comparison pi < comp->wqe_idx properly detects in-progress operations, and the error handling dumps relevant diagnostics (WQE and CQE contents) for debugging. The 64-bit PI space is large enough that wraparound is not a practical concern.
260-294: Doorbell ring-update-ring pattern is consistent with GDA memory model requirements; however, the pattern lacks documentation. The ring-update-ring sequence (ring_db → update_dbr → ring_db) is used consistently in both code paths (lines 278-280 and 288-290). The functions employ GPU-specific synchronization: ring_db writes directly to the doorbell register, while update_dbr uses a system-scope atomic (cuda::thread_scope_system) to ensure the doorbell record is visible across GPU-system boundaries. This differs from CPU-side MLX5 semantics and appears intentional for GPU Direct Async operations. However, the pattern lacks explanatory comments. The implementation is sound, but the hardware requirement rationale is not documented in the code.
Please confirm whether the ring-update-ring pattern is mandated by MLX5 GDA hardware specifications. If this is a known requirement, consider adding an explanatory comment to clarify the synchronization semantics for future maintainers.
110-121: CQE field layout extraction is correct and matches the mlx5_cqe64 structure definition. The mlx5_cqe64 layout has wqe_counter as a 16-bit big-endian field, followed by signature (1 byte) and op_own (1 byte). The opcode is extracted from op_own via the upper 4 bits (op_own >> 4). The code correctly reads 32 bits from the wqe_counter address, extracts the lower 16 bits and byte-swaps to get the host-order counter value, and extracts bits 28-31 (the upper nibble of the op_own byte) to obtain the opcode. The byte-ordering assumptions align with the network byte order conventions used throughout the codebase.
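As a rough, self-contained illustration of the layout described above (hypothetical helper name; not the actual UCX code), splitting the 32-bit tail of an mlx5_cqe64 into the counter and opcode could look like:

```cuda
#include <cstdint>

/* Assumes a little-endian GPU and that `tail32` holds the last 4 bytes of
 * the CQE starting at wqe_counter: be16 wqe_counter, u8 signature, u8 op_own. */
__device__ void gda_parse_cqe_tail(uint32_t tail32, uint16_t *wqe_cnt,
                                   uint8_t *opcode)
{
    uint16_t be_cnt = (uint16_t)(tail32 & 0xffff);        /* big-endian on the wire */
    *wqe_cnt = (uint16_t)((be_cnt >> 8) | (be_cnt << 8)); /* convert to host order */
    *opcode  = (uint8_t)(tail32 >> 28);                   /* upper nibble of op_own */
}
```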
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/uct/ib/mlx5/gdaki/gdaki.cuh (2)
96-108: Critical: Incorrect PRMT mask for byte swap (already flagged). This issue was already identified in previous reviews. The mask value 0x1 does not perform the required 16-bit byte swap; it should be 0x3210.
131-140: Critical: SQ overbooking issue (already flagged). This issue was already identified in previous reviews. The extra + 1 on Line 139 allows allocation to exceed SQ capacity, corrupting in-flight WQEs.
🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
142-187: Verify: TOCTOU at pre-check and rollback loop efficiency. The pre-check at Line 149 reads sq_rsvd_index non-atomically, creating a TOCTOU gap before the atomicAdd at Line 153. While the rollback loop (Lines 176-184) correctly handles over-allocation, multiple threads passing the stale check simultaneously will all atomicAdd, then contend in rollback, each re-reading the CQE (Line 183) under lock. Under high thread counts, this could degrade to spinning. Consider removing the pre-check and relying solely on post-atomicAdd validation, or profile contention with many threads to confirm acceptable performance.
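For readers unfamiliar with the pattern under discussion, a stripped-down sketch of optimistic reservation with CAS rollback (hypothetical names; the lock and CQE re-read are only hinted at in comments) might look like:

```cuda
#include <cstdint>

__device__ bool gda_try_reserve(unsigned long long *rsvd_index,
                                const volatile uint64_t *max_base,
                                unsigned count, uint64_t *wqe_base_out)
{
    /* Optimistically grab `count` slots by advancing the reserved index. */
    uint64_t wqe_base = atomicAdd(rsvd_index, (unsigned long long)count);

    while (wqe_base > *max_base) {
        /* Over-reserved: try to hand the slots back. The CAS succeeds only
         * if no other thread reserved after us. */
        if (atomicCAS(rsvd_index, (unsigned long long)(wqe_base + count),
                      (unsigned long long)wqe_base) ==
            (unsigned long long)(wqe_base + count)) {
            return false; /* rolled back; the caller can retry later */
        }
        /* Rollback raced with a later reservation; wait until *max_base
         * advances (the real code re-reads the CQE here) and re-check. */
    }

    *wqe_base_out = wqe_base;
    return true;
}
```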
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
🔇 Additional comments (5)
src/uct/ib/mlx5/gdaki/gdaki.cuh (5)
123-128: LGTM: Correct 16-bit wraparound handling. The arithmetic correctly aligns the 16-bit CQE wqe_cnt with the 64-bit monotonic sq_rsvd_index, handling counter wraparound via the & 0xffff mask.
243-258: LGTM: Correct spinlock implementation. The lock/unlock functions properly use atomic CAS for acquisition and release ordering for unlock. Memory fences ensure correct ordering across GPU threads.
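For reference, a minimal GPU spinlock of the kind described here (acquire via atomicCAS, release via an atomic store) can be sketched as follows; the real code uses PTX acquire/release fences, approximated below with __threadfence(), and the names are hypothetical:

```cuda
/* Minimal device spinlock sketch; not the actual UCX helpers. */
__device__ void gda_lock(int *lock)
{
    while (atomicCAS(lock, 0, 1) != 0) {
        /* spin until the lock is free */
    }
    __threadfence(); /* make prior writes of the previous holder visible */
}

__device__ void gda_unlock(int *lock)
{
    __threadfence();     /* publish writes made in the critical section */
    atomicExch(lock, 0); /* release the lock */
}
```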
298-302: LGTM: Correct flow control check. The bitwise mask check correctly implements periodic flow control, returning true when wqe_idx has no bits set in the sq_fc_mask.
595-614: Correct completion checking logic (depends on the bswap16 fix). The PI-based completion check and error handling are correctly implemented. However, accurate operation requires the bswap16 bug (Lines 96-108) to be fixed first, as read_cqe depends on it.
260-296: Double doorbell ring pattern is pre-existing hardware protocol. Git history confirms the ring_db → update_dbr → ring_db sequence pre-dates this optimization and is replicated intentionally across both the no_delay and optimized paths. This is MLX5/DOCA hardware protocol, not duplication.
The lock protection (missing in commit 4680a6b) was correctly added in the most recent fix (01a58e0). The optimization logic is sound: the double-checked lock pattern correctly limits doorbell signaling to only the thread advancing the tail, reducing contention.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
131-140: Critical: SQ overbooking due to extra + 1 in max allocation calculation. Line 139 adds an extra + 1 to the returned value. When the SQ is already full (e.g., sq_rsvd_index == pi + sq_wqe_num and count == 1), the guard at Line 149 still passes, atomicAdd() returns wqe_base == pi + sq_wqe_num, and the code reuses the oldest in-flight WQE slot, corrupting outstanding operations. Apply this diff to prevent overbooking:
- return pi + ep->sq_wqe_num + 1 - count;
+ return pi + ep->sq_wqe_num - count;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
- src/uct/ib/mlx5/gdaki/gdaki_dev.h (1 hunks)
🔇 Additional comments (7)
src/uct/ib/mlx5/gdaki/gdaki_dev.h (1)
22-22: Struct field addition supports doorbell optimization. The addition of sq_db_index is consistent with the PR objective to optimize doorbell ringing by tracking the last WQE index for which the doorbell was rung. This enables the double-check at Line 280 of gdaki.cuh to avoid redundant doorbell rings.
src/uct/ib/mlx5/gdaki/gdaki.cuh (6)
123-128: PI calculation logic is sound but depends on correct CQE reads. The arithmetic for deriving the producer index from the reserved index and WQE counter is correct. However, this function depends on uct_rc_mlx5_gda_read_cqe returning a valid wqe_cnt, which is currently broken due to the incorrect byte-swap mask.
142-187: Reservation logic depends on fixing max_alloc_wqe_base. The atomicAdd/rollback pattern with atomicCAS (Lines 153-184) is a sound approach for concurrent WQE reservation. However, this function calls uct_rc_mlx5_gda_max_alloc_wqe_base (Lines 148, 183), which has a critical overbooking bug. Until that is corrected, the reservation logic will allow over-commitment of the SQ.
243-258: Lock/unlock primitives are correctly implemented. The spinlock at Lines 243-253 uses atomicCAS with acquire fence semantics, and the unlock at Lines 255-258 uses an atomic store with release ordering. This provides proper synchronization for the doorbell optimization.
260-288: Doorbell optimization correctly implements the PR objective. The logic successfully implements the batched doorbell ringing strategy:
- Lines 271-274: Update sq_ready_index atomically to mark WQEs as ready.
- Line 280: Double-check prevents redundant doorbell rings when sq_db_index already matches the latest ready_index.
- Lines 278-286: Lock-protected doorbell ring ensures only the thread with the latest ready WQE rings the doorbell.
This aligns with the PR objective to reduce doorbell lock contention by allowing only the last thread to post a WQE to ring the doorbell.
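Putting those pieces together, the double-checked doorbell path described above has roughly this shape (a sketch with hypothetical names and stubbed-out MMIO/DBR helpers, not the actual implementation):

```cuda
#include <cstdint>

/* Hypothetical stubs standing in for the real MMIO doorbell / DBR helpers. */
__device__ void gda_ring_db(void *ep, uint64_t idx)    { (void)ep; (void)idx; }
__device__ void gda_update_dbr(void *ep, uint64_t idx) { (void)ep; (void)idx; }

/* Double-checked doorbell update: ring only when the ready index has moved
 * past what was last signalled, and only under the SQ lock. */
__device__ void gda_maybe_ring(void *ep, int *sq_lock, uint64_t *sq_db_index,
                               const uint64_t *sq_ready_index)
{
    if (*sq_db_index == *sq_ready_index) {
        return;                         /* fast path: nothing new to signal */
    }

    while (atomicCAS(sq_lock, 0, 1) != 0) {
        /* spin until the lock is free */
    }

    uint64_t ready = *sq_ready_index;
    if (*sq_db_index != ready) {        /* re-check under the lock */
        *sq_db_index = ready;
        gda_ring_db(ep, ready);         /* ring / update DBR / ring, as in the source */
        gda_update_dbr(ep, ready);
        gda_ring_db(ep, ready);
    }

    __threadfence();
    atomicExch(sq_lock, 0);             /* release the lock */
}
```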
582-607: Completion check logic is sound but depends on correct CQE reads. The PI-based completion check (Lines 591-596) and error handling with diagnostic dumps (Lines 598-603) are well-structured. However, this function depends on uct_rc_mlx5_gda_read_cqe and uct_rc_mlx5_gda_calc_pi, which will return incorrect values until the bswap16 mask is fixed.
290-294: Verify flow control logic change carefully. sq_fc_mask is calculated as (qp_attr.max_tx >> 1) - 1, which creates a mask with all lower bits set. However, without visibility into where uct_rc_mlx5_gda_fc() is called and how the previous implementation behaved, it cannot be definitively verified whether the logic inversion is correct or represents a bug. The change from (wqe_idx & ep->sq_fc_mask) == 1 to !(wqe_idx & ep->sq_fc_mask) is fundamentally different in semantics and could potentially break flow control if not coordinated with other changes in the mask setup or caller logic.
The FC condition has been inverted from (wqe_idx & ep->sq_fc_mask) == 1 to !(wqe_idx & ep->sq_fc_mask). While sq_fc_mask is set to (qp_attr.max_tx >> 1) - 1, the mask comparison now triggers at different WQE indices:
- Old: triggers when masked bits equal 1
- New: triggers when masked bits equal 0
Without visibility into all call sites and the accompanying mask definition changes, manually verify this inverted flow control logic is intentional and correctly coordinates with how CQ updates and completions are requested throughout the GDA code path.
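To illustrate the difference, a small worked example with an assumed max_tx of 64 (so sq_fc_mask = (64 >> 1) - 1 = 31) shows where each condition fires:

```cuda
#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t sq_fc_mask = (64 >> 1) - 1; /* assumed max_tx = 64 -> mask = 31 */

    /* Old condition, (wqe_idx & mask) == 1: triggers at wqe_idx = 1, 33, 65, ... */
    assert(((uint64_t)33 & sq_fc_mask) == 1);
    assert(((uint64_t)32 & sq_fc_mask) != 1);

    /* New condition, !(wqe_idx & mask): triggers at wqe_idx = 0, 32, 64, ... */
    assert(!((uint64_t)32 & sq_fc_mask));
    assert(((uint64_t)33 & sq_fc_mask));
    return 0;
}
```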
UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
{
    uint32_t ret;
    asm volatile("{\n\t"
                 ".reg .b32 mask;\n\t"
                 ".reg .b32 ign;\n\t"
                 "mov.b32 mask, 0x1;\n\t"
                 "prmt.b32 %0, %1, ign, mask;\n\t"
                 "}"
                 : "=r"(ret)
                 : "r"((uint32_t)x));
    return ret;
}
Critical: PTX PRMT mask remains incorrect despite past review.
The byte-swap mask is still 0x1 (Line 102), which does not perform byte swapping. A previous review flagged this and marked it as "Addressed in commit 01a58e0", but the code remains unchanged. For 16-bit byte swapping, the selector must be 0x3210 where each 4-bit nibble selects one of the source bytes.
Apply this diff to fix the mask:
- "mov.b32 mask, 0x1;\n\t"
+ "mov.b32 mask, 0x3210;\n\t"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
{
    uint32_t ret;
    asm volatile("{\n\t"
                 ".reg .b32 mask;\n\t"
                 ".reg .b32 ign;\n\t"
                 "mov.b32 mask, 0x3210;\n\t"
                 "prmt.b32 %0, %1, ign, mask;\n\t"
                 "}"
                 : "=r"(ret)
                 : "r"((uint32_t)x));
    return ret;
}
🤖 Prompt for AI Agents
In src/uct/ib/mlx5/gdaki/gdaki.cuh around lines 96-108 the PTX prmt.b32 mask is
incorrectly set to 0x1 which does not perform a 16-bit byte swap; change the
immediate mask to 0x3210 so each 4-bit selector maps the correct source byte for
a 16-bit byteswap, keep the asm template and operand order the same (input as
(uint32_t)x) and ensure the function still returns the lower 16 bits
(cast/truncate the uint32_t ret to uint16_t before returning).
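Independent of the exact PRMT selector, a portable equivalent of the intended operation (swapping the two bytes of a value held in the low 16 bits) is simply:

```cuda
#include <cstdint>

/* Portable 16-bit byte swap; what the PRMT-based helper is meant to compute
 * for a value held in the low 16 bits of a register. */
__device__ __forceinline__ uint16_t gda_bswap16_portable(uint16_t x)
{
    return (uint16_t)((uint16_t)(x >> 8) | (uint16_t)(x << 8));
}
```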
Force-pushed from a712778 to abd798f (compare)
Force-pushed from abd798f to 3a8e9ba (compare)
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/uct/ib/mlx5/gdaki/gdaki.cuh (3)
96-108: Critical: PTX byte-swap mask still incorrect. This issue was previously flagged as critical but remains unresolved. The mask 0x1 at Line 102 does not perform 16-bit byte swapping. The correct selector for prmt.b32 must be 0x3210, where each 4-bit nibble maps the source bytes properly. Apply this diff:
- "mov.b32 mask, 0x1;\n\t"
+ "mov.b32 mask, 0x3210;\n\t"
110-125: Critical: CQE parsing returns corrupted wqe_cnt. Line 119 calls uct_rc_mlx5_gda_bswap16, which uses the broken PTX mask. Until that mask is corrected, all CQE reads will produce incorrect wqe_cnt values, breaking completion tracking and producer-index calculations throughout the codebase.
127-135: Critical: SQ overbooking still present. The extra + 1 at Line 134 allows allocation to return pi + sq_wqe_num when the SQ is full, reusing the oldest in-flight WQE slot and corrupting outstanding operations. This issue was flagged in a previous review but remains unresolved. Apply this diff:
- return pi + ep->sq_wqe_num + 1 - count;
+ return pi + ep->sq_wqe_num - count;
🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki_dev.h (1)
22-22: Add inline documentation for the new sq_db_index field. The field lacks documentation explaining its purpose. Add a comment above or next to the field (line 22 in gdaki_dev.h) describing that it tracks the last rung doorbell index to avoid redundant rings.
Note: the field is not yet used in the codebase and appears preparatory for future optimization logic. The zero-initialization via {} in gdaki.c:74 is appropriate and requires no changes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/uct/ib/mlx5/gdaki/gdaki.cuh (1 hunks)
- src/uct/ib/mlx5/gdaki/gdaki_dev.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
285-289: Inversion of FC logic is confirmed; verify this semantic change was intentional. The flow-control readiness check was indeed inverted in commit ea65cc6. The original logic (wqe_idx & ep->sq_fc_mask) == 1 has been changed to !(wqe_idx & ep->sq_fc_mask), which reverses when flow control triggers CQ updates. All three call sites (lines 313, 433, 527 in gdaki.cuh) now use the inverted logic to set the DOCA_GPUNETIO_MLX5_WQE_CTRL_CQ_UPDATE flag. This is a significant semantic change that alters which WQE indices receive completion notifications. Ensure this inversion aligns with the FC behavior intended by the "Collapsed CQ" changes and that all downstream consumers correctly interpret the new return values.
Actionable comments posted: 0
♻️ Duplicate comments (3)
src/uct/ib/mlx5/gdaki/gdaki.cuh (3)
96-108: Fix uct_rc_mlx5_gda_bswap16: wrong PRMT mask and missing truncation. The PTX sequence still uses mov.b32 mask, 0x1, which does not perform a 16-bit byte swap, and the function returns the full 32-bit ret instead of truncating to 16 bits. This keeps CQE wqe_cnt decoding wrong and can corrupt completion tracking. Please switch to the proper 16-bit byteswap selector and truncate on return, e.g.:
 UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
 {
     uint32_t ret;
     asm volatile("{\n\t"
                  ".reg .b32 mask;\n\t"
                  ".reg .b32 ign;\n\t"
-                 "mov.b32 mask, 0x1;\n\t"
+                 "mov.b32 mask, 0x3210;\n\t"
                  "prmt.b32 %0, %1, ign, mask;\n\t"
                  "}"
                  : "=r"(ret)
                  : "r"((uint32_t)x));
-    return ret;
+    return (uint16_t)ret;
 }
127-135: Prevent SQ overbooking in uct_rc_mlx5_gda_max_alloc_wqe_base. uct_rc_mlx5_gda_max_alloc_wqe_base() still returns pi + ep->sq_wqe_num + 1 - count. When the SQ is full (e.g., sq_rsvd_index == pi + sq_wqe_num and count == 1), this allows max_wqe_base == pi + sq_wqe_num, so the precheck in uct_rc_mlx5_gda_reserv_wqe_thread() passes and atomicAdd() can hand out wqe_base == pi + sq_wqe_num, reusing the oldest in-flight WQE slot and corrupting outstanding operations. Cap at exactly sq_wqe_num by dropping the extra + 1:
- return pi + ep->sq_wqe_num + 1 - count;
+ return pi + ep->sq_wqe_num - count;
259-297: Use atomic loads for sq_ready_index (avoid mixed atomic/non-atomic access) and briefly document the double doorbell. ep->sq_ready_index is updated via cuda::atomic_ref (ref.compare_exchange_strong), but later read with READ_ONCE(ep->sq_ready_index) (Line 286) and a plain load (Line 289). Mixing atomic_ref operations with non-atomic reads on the same object is undefined by the C++ memory model and can introduce subtle races. You can keep the current algorithm and fix the data race by using the same atomic_ref for all reads, e.g.:
-    if ((no_delay && READ_ONCE(ep->sq_ready_index) == wqe_next) ||
+    if ((no_delay && ref.load(cuda::std::memory_order_relaxed) == wqe_next) ||
         (!no_delay && ((wqe_base ^ wqe_next) & UCT_RC_GDA_DB_BATCH_SIZE))) {
         uct_rc_mlx5_gda_lock(&ep->sq_lock);
-        const uint64_t ready_index = ep->sq_ready_index;
+        const uint64_t ready_index =
+                ref.load(cuda::std::memory_order_relaxed);
         if (ep->sq_db_index != ready_index) {
             ep->sq_db_index = ready_index;
             uct_rc_mlx5_gda_ring_db(ep, ready_index);
             uct_rc_mlx5_gda_update_dbr(ep, ready_index);
             uct_rc_mlx5_gda_ring_db(ep, ready_index);
         }
         uct_rc_mlx5_gda_unlock(&ep->sq_lock);
     }
Additionally, the new comment nicely explains the NODELAY vs batched paths and the double read, but it still does not say why we ring the doorbell twice around uct_rc_mlx5_gda_update_dbr(). A short note (e.g., "hardware requires the DBR update to be bracketed by two doorbells for X/Y reason") would address the earlier review concern and make this pattern self-explanatory.
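As a self-contained sketch of the access pattern this comment suggests (all reads and the CAS go through the same cuda::atomic_ref; names and logic are hypothetical and simplified):

```cuda
#include <cuda/atomic>
#include <cstdint>

/* Sketch: every access to the shared 64-bit ready index uses one
 * cuda::atomic_ref, so no plain load races with the CAS updates. */
__device__ void gda_advance_ready_index(uint64_t *sq_ready_index,
                                        uint64_t wqe_base, uint64_t wqe_next)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_device> ref(*sq_ready_index);

    /* Publish wqe_next only once the index has reached wqe_base,
     * i.e. all earlier WQEs have been marked ready. */
    uint64_t expected = wqe_base;
    while (!ref.compare_exchange_weak(expected, wqe_next,
                                      cuda::std::memory_order_release,
                                      cuda::std::memory_order_relaxed)) {
        expected = wqe_base; /* retry until predecessors have published */
    }

    /* Later readers use the same atomic_ref rather than a plain load. */
    uint64_t ready = ref.load(cuda::std::memory_order_relaxed);
    (void)ready;
}
```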
🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
15-19: Clarify semantics/constraints of UCT_RC_GDA_DB_BATCH_SIZE. UCT_RC_GDA_DB_BATCH_SIZE is used as a bit mask in ((wqe_base ^ wqe_next) & UCT_RC_GDA_DB_BATCH_SIZE) (Line 287), so it must stay a power of two and effectively defines a batch boundary bit rather than an arbitrary "size". Consider either renaming it to make the mask-like nature explicit or adding a short comment (e.g., "must be power-of-two; used as WQE-index boundary bit for DB batching") to prevent accidental changes that would break the batching condition.
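For example, assuming a boundary bit of 32 (i.e. UCT_RC_GDA_DB_BATCH_SIZE == 32, an illustrative value), the XOR test fires exactly when a single-WQE reservation crosses a multiple of 32:

```cuda
#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t batch_bit = 32; /* assumed power-of-two boundary bit */

    /* Reservation 30 -> 31 stays inside the batch: bit 5 does not change. */
    assert((((uint64_t)30 ^ (uint64_t)31) & batch_bit) == 0);

    /* Reservation 31 -> 32 crosses the boundary: bit 5 flips, doorbell path taken. */
    assert((((uint64_t)31 ^ (uint64_t)32) & batch_bit) != 0);
    return 0;
}
```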
What?
Based on PR 10959.
Ring the doorbell only for the latest ready WQE index.
Why?
Doorbell lock contention becomes severe with large numbers of threads.
This PR lets a GPU thread acquire the lock and ring the doorbell only when the WQE it filled is the last one to post.
How?
Add a double check of the ready WQE index when trying to ring the doorbell.