Conversation

@jeynmann (Contributor) commented Nov 5, 2025

What?

Based on PR10959: ring the doorbell only for the latest ready WQE index.

| case_name | msg_size (KB) | Version | FC Window | Latency (us) | BW (MB/s) | Ops (msg/s) |
|---|---|---|---|---|---|---|
| ucp_put_single_bw | 8 | pr | default | 7.857 | 15909.3 | 2036390 |
| ucp_put_single_bw | 8 | master | default | 18.944 | 6598.49 | 844607 |
| ucp_put_single_lat | 8 | pr | default | 13.314 | 9388.54 | 1201730 |
| ucp_put_single_lat | 8 | master | default | 40.706 | 3070.83 | 393066 |
| ucp_put_multi_bw | 8 | pr | default | 74.377 | 1680.63 | 215121 |
| ucp_put_multi_bw | 8 | master | default | 74.404 | 1680.01 | 215041 |
| ucp_put_multi_lat | 8 | pr | default | 45.413 | 2752.49 | 352319 |
| ucp_put_multi_lat | 8 | master | default | 44.917 | 2782.89 | 356210 |
| ucp_put_partial_bw | 8 | pr | default | 74.438 | 1679.24 | 214943 |
| ucp_put_partial_bw | 8 | master | default | 74.396 | 1680.19 | 215064 |
| ucp_put_partial_lat | 8 | pr | default | 45.421 | 2752.03 | 352260 |
| ucp_put_partial_lat | 8 | master | default | 45.16 | 2767.94 | 354296 |
| ucp_put_single_bw | 8 | pr | inf | 8.088 | 15455 | 1978250 |
| ucp_put_single_bw | 8 | master | inf | 7.954 | 15714.5 | 2011460 |
| ucp_put_single_lat | 8 | pr | inf | 13.372 | 9347.83 | 1196520 |
| ucp_put_single_lat | 8 | master | inf | 40.805 | 3063.38 | 392113 |
| ucp_put_multi_bw | 8 | pr | inf | 78.56 | 1591.15 | 203667 |
| ucp_put_multi_bw | 8 | master | inf | 80.629 | 1550.31 | 198439 |
| ucp_put_multi_lat | 8 | pr | inf | 46.886 | 2666.01 | 341250 |
| ucp_put_multi_lat | 8 | master | inf | 47.851 | 2612.28 | 334372 |
| ucp_put_partial_bw | 8 | pr | inf | 78.618 | 1589.97 | 203516 |
| ucp_put_partial_bw | 8 | master | inf | 80.534 | 1552.14 | 198674 |
| ucp_put_partial_lat | 8 | pr | inf | 45.999 | 2717.46 | 347835 |
| ucp_put_partial_lat | 8 | master | inf | 47.798 | 2615.18 | 334742 |

Why?

Doorbell lock contention is severe with a large number of threads.
This PR lets a GPU thread acquire the lock and ring the doorbell only when the WQE it filled is the last one to post.

How?

Add a double check of the ready WQE index before ringing the doorbell.
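The double check can be sketched as a host-side Python model. The field names mirror the PR (`sq_ready_index`, `sq_db_index`); the lock and the doorbell write are stand-ins for the CUDA primitives, not the real code:

```python
class MockEp:
    def __init__(self):
        self.sq_ready_index = 0   # highest fully-written WQE index
        self.sq_db_index = 0      # last index the doorbell was rung for
        self.rings = []           # record of doorbell writes (demo only)

def ring_db_if_latest(ep):
    # real code: uct_rc_mlx5_gda_lock(&ep->sq_lock)
    ready = ep.sq_ready_index
    if ep.sq_db_index != ready:   # the double check under the lock
        ep.sq_db_index = ready
        ep.rings.append(ready)    # stand-in for ring_db/update_dbr/ring_db
    # real code: uct_rc_mlx5_gda_unlock(&ep->sq_lock)

ep = MockEp()
ep.sq_ready_index = 4             # four WQEs became ready
ring_db_if_latest(ep)             # first caller rings for index 4
ring_db_if_latest(ep)             # later callers find db == ready: no-op
assert ep.rings == [4]
```

Only the caller that observes a doorbell index behind the ready index pays for the actual ring; everyone else exits the critical section immediately.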

Summary by CodeRabbit

  • Bug Fixes

    • Conditional send-queue progression and synchronized completion updates to doorbell signaling, reducing races and improving GPU-direct reliability.
    • Adjusted flow-control readiness to more accurately determine send-queue availability.
  • New Features

    • Introduced batching mode for doorbell updates to reduce signaling overhead and improve throughput; supports a no-delay mode for lower latency.
  • Chores

    • Endpoint tracking and locking enhancements to support the new doorbell synchronization behavior.

@coderabbitai coderabbitai bot commented Nov 5, 2025

Walkthrough

GDA send-queue reservation now batches doorbell updates: a new UCT_RC_GDA_DB_BATCH_SIZE macro and sq_db_index field were added; reservation advances to wqe_next atomically and doorbell/DBR updates are gated by NODELAY or batch-wrap conditions and protected by sq_lock.
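The gating described in this walkthrough can be paraphrased in a few lines of Python. This is a model of the condition, not the literal CUDA source; `128` stands for `UCT_RC_GDA_DB_BATCH_SIZE`:

```python
BATCH = 128  # UCT_RC_GDA_DB_BATCH_SIZE (power of two, used as a boundary bit)

def should_ring(no_delay, ready, wqe_base, wqe_next):
    if no_delay:
        # NODELAY path: ring only when this thread's WQEs are the latest ready
        return ready == wqe_next
    # batched path: ring when the reservation crossed a 128-WQE boundary
    return bool((wqe_base ^ wqe_next) & BATCH)

assert should_ring(True, ready=10, wqe_base=8, wqe_next=10)
assert not should_ring(True, ready=9, wqe_base=8, wqe_next=10)
assert should_ring(False, ready=0, wqe_base=120, wqe_next=130)   # crosses 128
assert not should_ring(False, ready=0, wqe_base=10, wqe_next=20)
```

The XOR-with-mask trick detects a boundary crossing because advancing past any multiple of 128 flips that bit of the index.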

Changes

Cohort / File(s) Summary
GDA DB batching & doorbell flow
src/uct/ib/mlx5/gdaki/gdaki.cuh
Added #define UCT_RC_GDA_DB_BATCH_SIZE 128. Reworked uct_rc_mlx5_gda_db to compute/preserve wqe_next/wqe_base_orig, use an atomic advance loop to move wqe_base to wqe_next, and conditionally perform sq DBR/update only when NODELAY and sq_ready_index == wqe_next, or when a WQE batch boundary/wrap occurs. Doorbell update sequence now acquires sq_lock, checks/updates sq_db_index, performs ring_db/update_dbr/ring_db, then releases the lock. Removed unconditional doorbell after the loop. FC/readiness and DB update sites now follow the new batching/NODELAY gating.
GDA device struct
src/uct/ib/mlx5/gdaki/gdaki_dev.h
Added sq_db_index (uint64_t) to uct_rc_gdaki_dev_ep_t for tracking the last doorbell index.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant EP as Endpoint
    participant SQ as SendQueue (wqe_base)
    participant Lock as sq_lock
    participant DB as Doorbell/DBR

    Note over EP,SQ: optimistic reservation -> compute wqe_next
    EP->>SQ: read wqe_base_orig
    loop atomic advance
        EP->>SQ: CAS wqe_base -> wqe_next
    end

    alt NODELAY & ready == wqe_next
        EP->>Lock: lock
        Lock->>EP: if sq_db_index != sq_ready_index
        EP->>DB: ring_db / update_dbr / ring_db
        EP->>Lock: unlock
    else batch boundary (wrap) reached
        EP->>Lock: lock
        Lock->>EP: if sq_db_index != sq_ready_index
        EP->>DB: ring_db / update_dbr / ring_db
        EP->>Lock: unlock
    else
        Note right of DB: DBR update deferred (batched)
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify atomic compare-exchange correctness around wqe_next / wqe_base_orig.
  • Check sq_lock usage and that lock scope prevents races with concurrent DBR updates.
  • Confirm sq_db_index initialization and all read/update call-sites.
  • Validate NODELAY vs batched boundary conditions against callers and FC readiness semantics.

Suggested reviewers

  • ofirfarjun7

Poem

🐰 I hopped through WQEs and counted the next line,
A flag said “no delay” — ring only on time.
I tuck the DB index beneath a lock so tight,
Batch the bells and wait for the right light,
Nibbling bytes and keeping the SQs polite. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'UCT/DEVICE: optimize doorbell ring' accurately summarizes the main change: optimizing doorbell ring behavior to reduce lock contention by only ringing the doorbell for the latest ready WQE index. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b3747d and 1e9db41.

📒 Files selected for processing (6)
  • src/uct/ib/mlx5/gdaki/gdaki.c (2 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h (0 hunks)
  • test/gtest/ucp/cuda/test_kernels.cu (0 hunks)
  • test/gtest/ucp/cuda/test_kernels.h (0 hunks)
  • test/gtest/ucp/test_ucp_device.cc (0 hunks)
💤 Files with no reviewable changes (4)
  • test/gtest/ucp/test_ucp_device.cc
  • test/gtest/ucp/cuda/test_kernels.h
  • test/gtest/ucp/cuda/test_kernels.cu
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h

Comment on lines 132 to 135
```c
                               uct_rc_gdaki_dev_ep_t *ep, unsigned count)
{
    uint16_t wqe_cnt;
    uint64_t pi;

    uct_rc_mlx5_gda_read_cqe(ep, &wqe_cnt, NULL);
    pi = uct_rc_mlx5_gda_calc_pi(ep, wqe_cnt);
    return pi + ep->sq_wqe_num + 1 - count;
}
```

⚠️ Potential issue | 🔴 Critical

Avoid overbooking the SQ

uct_rc_mlx5_gda_max_alloc_wqe_base() currently returns pi + sq_wqe_num + 1 - count. When the SQ is already full (e.g. sq_rsvd_index == pi + sq_wqe_num and count == 1), the guard at Line 149 still passes, atomicAdd() hands back wqe_base == pi + sq_wqe_num, and we reuse the oldest in-flight WQE slot. That corrupts outstanding operations. Drop the extra + 1 so the cap stays at exactly sq_wqe_num.

Apply this diff:

```diff
-    return pi + ep->sq_wqe_num + 1 - count;
+    return pi + ep->sq_wqe_num - count;
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

```c
                               uct_rc_gdaki_dev_ep_t *ep, unsigned count)
{
    uint16_t wqe_cnt;
    uint64_t pi;

    uct_rc_mlx5_gda_read_cqe(ep, &wqe_cnt, NULL);
    pi = uct_rc_mlx5_gda_calc_pi(ep, wqe_cnt);
    return pi + ep->sq_wqe_num + 1 - count;
}
```

After:

```c
                               uct_rc_gdaki_dev_ep_t *ep, unsigned count)
{
    uint16_t wqe_cnt;
    uint64_t pi;

    uct_rc_mlx5_gda_read_cqe(ep, &wqe_cnt, NULL);
    pi = uct_rc_mlx5_gda_calc_pi(ep, wqe_cnt);
    return pi + ep->sq_wqe_num - count;
}
```
🤖 Prompt for AI Agents
In src/uct/ib/mlx5/gdaki/gdaki.cuh around lines 132 to 140, the function
uct_rc_mlx5_gda_max_alloc_wqe_base() adds an extra "+ 1" to its returned value
which allows allocation to return pi + sq_wqe_num when the SQ is full, causing
reuse of the oldest in-flight WQE; remove the extra "+ 1" so the return becomes
pi + ep->sq_wqe_num - count, ensuring the cap is exactly sq_wqe_num and
preventing overbooking of the SQ.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

131-140: Avoid overbooking the SQ.

As noted in the previous review, uct_rc_mlx5_gda_max_alloc_wqe_base() returns pi + sq_wqe_num + 1 - count, which allows allocation to exceed SQ capacity. When the SQ is full, this can cause reuse of in-flight WQE slots and corrupt outstanding operations. The extra + 1 should be removed.

Apply this diff:

-    return pi + ep->sq_wqe_num + 1 - count;
+    return pi + ep->sq_wqe_num - count;
🧹 Nitpick comments (2)
src/uct/ib/mlx5/gdaki/gdaki.cuh (2)

243-253: Consider adding exponential backoff to reduce contention.

The spinlock implementation uses busy-waiting without backoff, which can waste GPU cycles under contention. While the PR's doorbell optimization reduces lock acquisition frequency, adding a minimal exponential backoff could further improve behavior under unexpected high contention scenarios.

Example with simple backoff:

```diff
 UCS_F_DEVICE void uct_rc_mlx5_gda_lock(int *lock) {
+    int backoff = 1;
-    while (atomicCAS(lock, 0, 1) != 0)
-        ;
+    while (atomicCAS(lock, 0, 1) != 0) {
+        for (int i = 0; i < backoff; i++) {
+            __nanosleep(1);
+        }
+        backoff = min(backoff * 2, 64);
+    }
 #ifdef DOCA_GPUNETIO_VERBS_HAS_FENCE_ACQUIRE_RELEASE_PTX
     asm volatile("fence.acquire.gpu;");
```
582-585: Consider documenting or removing the no-op progress function.

The uct_rc_mlx5_gda_ep_progress function is now a no-op, consistent with removing per-EP progress tracking. If this function is required for API compatibility, consider adding a comment explaining why it's kept. Otherwise, if it's no longer called, it could be removed to reduce code clutter.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e9db41 and 4680a6b.

📒 Files selected for processing (1)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
🔇 Additional comments (7)
src/uct/ib/mlx5/gdaki/gdaki.cuh (7)

123-128: LGTM! Correct 16-bit wraparound handling.

The PI calculation correctly handles 16-bit hardware counter wraparound within the 64-bit software-tracked reserved index. The formula properly reconstructs the full PI by adjusting for counter differences modulo 2^16.
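The wraparound reconstruction described here can be illustrated with a small host-side model. This is a plausible rendering of what the review describes, not the literal `uct_rc_mlx5_gda_calc_pi` source; it holds as long as the completed index trails `sq_rsvd_index` by less than 2^16:

```python
MASK16 = 0xffff

def calc_pi(sq_rsvd_index, wqe_cnt):
    """Recover the full 64-bit producer index from a 16-bit HW counter.

    The result is congruent to wqe_cnt mod 2**16 and lies within 2**16
    below the software-tracked reserved index.
    """
    return sq_rsvd_index - ((sq_rsvd_index - wqe_cnt) & MASK16)

# Across the 16-bit wrap: software has reserved up to 0x10005,
# hardware reports completion counter 0xfffe.
assert calc_pi(0x10005, 0xfffe) == 0xfffe
# No wrap involved:
assert calc_pi(0x0042, 0x0040) == 0x0040
```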


142-187: Reservation logic is sound, but depends on fixing the overbooking issue.

The atomic reservation with rollback logic correctly handles concurrent WQE allocation attempts. The rollback loop properly uses CAS to undo over-reservations and re-evaluates availability. However, the correctness of this function depends on fixing the + 1 overbooking bug in uct_rc_mlx5_gda_max_alloc_wqe_base() (see previous comment).

Once the overbooking fix is applied, verify that under heavy concurrent load:

  • No WQE slots are double-allocated
  • The rollback mechanism successfully prevents starvation

255-258: LGTM! Correct lock release with proper memory ordering.

The unlock implementation correctly uses atomic store with release ordering to ensure that modifications made within the critical section are visible to the thread that next acquires the lock.


296-300: LGTM! Simplified flow control check.

The flow control function correctly uses a bitwise mask to determine when to signal flow control updates. This returns true when the WQE index is a multiple of the flow control period, which is standard behavior for periodic signaling.


588-613: LGTM! Correct CQE-based completion checking with error handling.

The completion check correctly uses the CQE-based PI to determine if an operation has completed. The comparison pi < comp->wqe_idx properly detects in-progress operations, and the error handling dumps relevant diagnostics (WQE and CQE contents) for debugging. The 64-bit PI space is large enough that wraparound is not a practical concern.


260-294: Doorbell ring-update-ring pattern is consistent with GDA memory model requirements; however, pattern lacks documentation.

The ring-update-ring sequence (ring_db → update_dbr → ring_db) is used consistently in both code paths (lines 278-280 and lines 288-290). The functions employ GPU-specific synchronization: ring_db writes directly to the doorbell register, while update_dbr uses a system-scope atomic (cuda::thread_scope_system) to ensure the doorbell record is visible across GPU-system boundaries. This differs from CPU-side MLX5 semantics and appears intentional for GPU Direct Async operations.

However, the pattern lacks explanatory comments. The implementation is sound, but the hardware requirement rationale is not documented in the code.

Please confirm whether the ring-update-ring pattern is mandated by MLX5 GDA hardware specifications. If this is a known requirement, consider adding an explanatory comment to clarify the synchronization semantics for future maintainers.


110-121: CQE field layout extraction is correct and matches mlx5_cqe64 structure definition.

The mlx5_cqe64 structure layout has wqe_counter as a 16-bit big-endian field, followed by signature (1 byte) and op_own (1 byte). The opcode is extracted from op_own via the upper 4 bits (op_own >> 4). The code correctly reads 32 bits from the wqe_counter address, extracts the lower 16 bits and byte-swaps to get the host-order counter value, and extracts bits 28-31 (the upper nibble of the op_own byte) to obtain the opcode. The byte ordering assumptions align with network byte order conventions used throughout the codebase.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/uct/ib/mlx5/gdaki/gdaki.cuh (2)

96-108: Critical: Incorrect PRMT mask for byte swap (already flagged).

This issue was already identified in previous reviews. The mask value 0x1 does not perform the required 16-bit byte swap; it should be 0x3210.


131-140: Critical: SQ overbooking issue (already flagged).

This issue was already identified in previous reviews. The extra + 1 on Line 139 allows allocation to exceed SQ capacity, corrupting in-flight WQEs.

🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

142-187: Verify: TOCTOU at pre-check and rollback loop efficiency.

The pre-check at Line 149 reads sq_rsvd_index non-atomically, creating a TOCTOU gap before the atomicAdd at Line 153. While the rollback loop (Lines 176–184) correctly handles over-allocation, multiple threads passing the stale check simultaneously will all atomicAdd, then contend in rollback, each re-reading the CQE (Line 183) under lock. Under high thread counts, this could degrade to spinning.

Consider removing the pre-check and relying solely on post-atomicAdd validation, or profile contention with many threads to confirm acceptable performance.
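A single-threaded model of the optimistic reserve-and-rollback flow discussed above may help. Names mirror the review (`sq_rsvd_index`, `max_alloc_wqe_base`); atomics are modeled as plain operations since there is no concurrency here, the guard direction is an assumption, and the off-by-one fix from the earlier comment is applied:

```python
class EP:
    def __init__(self, sq_wqe_num):
        self.sq_wqe_num = sq_wqe_num
        self.sq_rsvd_index = 0
        self.pi = 0  # completed producer index (from the CQE in the real code)

def max_alloc_wqe_base(ep, count):
    return ep.pi + ep.sq_wqe_num - count   # with the suggested fix applied

def reserve(ep, count):
    if ep.sq_rsvd_index > max_alloc_wqe_base(ep, count):
        return None                        # precheck: queue looks full (TOCTOU gap)
    base = ep.sq_rsvd_index                # atomicAdd in the real code
    ep.sq_rsvd_index += count
    if base > max_alloc_wqe_base(ep, count):
        ep.sq_rsvd_index -= count          # CAS rollback in the real code
        return None
    return base

ep = EP(sq_wqe_num=4)
assert [reserve(ep, 1) for _ in range(4)] == [0, 1, 2, 3]
assert reserve(ep, 1) is None              # SQ full: reservation rejected
ep.pi = 2                                  # two completions arrive
assert reserve(ep, 1) == 4
```

In the real device code the precheck and the atomic add can interleave across threads, which is exactly the TOCTOU window the comment describes; the post-add check plus rollback is what restores correctness.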

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4680a6b and 01a58e0.

📒 Files selected for processing (1)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
🔇 Additional comments (5)
src/uct/ib/mlx5/gdaki/gdaki.cuh (5)

123-128: LGTM: Correct 16-bit wraparound handling.

The arithmetic correctly aligns the 16-bit CQE wqe_cnt with the 64-bit monotonic sq_rsvd_index, handling counter wraparound via the & 0xffff mask.


243-258: LGTM: Correct spinlock implementation.

The lock/unlock functions properly use atomic CAS for acquisition and release ordering for unlock. Memory fences ensure correct ordering across GPU threads.


298-302: LGTM: Correct flow control check.

The bitwise mask check correctly implements periodic flow control, returning true when wqe_idx has no bits set in the sq_fc_mask.


595-614: Correct completion checking logic (depends on bswap16 fix).

The PI-based completion check and error handling are correctly implemented. However, accurate operation requires the bswap16 bug (Lines 96–108) to be fixed first, as read_cqe depends on it.


260-296: Double doorbell ring pattern is pre-existing hardware protocol.

Git history confirms the ring_db → update_dbr → ring_db sequence pre-dates this optimization and is replicated intentionally across both the no_delay and optimized paths. This is MLX5/DOCA hardware protocol, not duplication.

The lock protection (missing in commit 4680a6b) was correctly added in the most recent fix (01a58e0). The optimization logic is sound: the double-checked lock pattern correctly limits doorbell signaling to only the thread advancing the tail, reducing contention.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

131-140: Critical: SQ overbooking due to extra + 1 in max allocation calculation.

Line 139 adds an extra + 1 to the returned value. When the SQ is already full (e.g., sq_rsvd_index == pi + sq_wqe_num and count == 1), the guard at Line 149 still passes, atomicAdd() returns wqe_base == pi + sq_wqe_num, and the code reuses the oldest in-flight WQE slot, corrupting outstanding operations.

Apply this diff to prevent overbooking:

```diff
-    return pi + ep->sq_wqe_num + 1 - count;
+    return pi + ep->sq_wqe_num - count;
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01a58e0 and a712778.

📒 Files selected for processing (2)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (3 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h (1 hunks)
🔇 Additional comments (7)
src/uct/ib/mlx5/gdaki/gdaki_dev.h (1)

22-22: Struct field addition supports doorbell optimization.

The addition of sq_db_index is consistent with the PR objective to optimize doorbell ringing by tracking the last WQE index for which the doorbell was rung. This enables the double-check at Line 280 of gdaki.cuh to avoid redundant doorbell rings.

src/uct/ib/mlx5/gdaki/gdaki.cuh (6)

123-128: PI calculation logic is sound but depends on correct CQE reads.

The arithmetic for deriving the producer index from the reserved index and WQE counter is correct. However, this function depends on uct_rc_mlx5_gda_read_cqe returning a valid wqe_cnt, which is currently broken due to the incorrect byte-swap mask.


142-187: Reservation logic depends on fixing max_alloc_wqe_base.

The atomicAdd/rollback pattern with atomicCAS (Lines 153-184) is a sound approach for concurrent WQE reservation. However, this function calls uct_rc_mlx5_gda_max_alloc_wqe_base (Lines 148, 183), which has a critical overbooking bug. Until that is corrected, the reservation logic will allow over-commitment of the SQ.


243-258: Lock/unlock primitives are correctly implemented.

The spin-lock at Lines 243-253 uses atomicCAS with acquire fence semantics, and the unlock at Lines 255-258 uses atomic store with release ordering. This provides proper synchronization for the doorbell optimization.


260-288: Doorbell optimization correctly implements PR objective.

The logic successfully implements the batched doorbell ringing strategy:

  • Lines 271-274: Update sq_ready_index atomically to mark WQEs as ready.
  • Line 280: Double-check prevents redundant doorbell rings when sq_db_index already matches the latest ready_index.
  • Lines 278-286: Lock-protected doorbell ring ensures only the thread with the latest ready WQE rings the doorbell.

This aligns with the PR objective to reduce doorbell lock contention by allowing only the last thread to post a WQE to ring the doorbell.


582-607: Completion check logic is sound but depends on correct CQE reads.

The PI-based completion check (Lines 591-596) and error handling with diagnostic dumps (Lines 598-603) are well-structured. However, this function depends on uct_rc_mlx5_gda_read_cqe and uct_rc_mlx5_gda_calc_pi, which will return incorrect values until the bswap16 mask is fixed.


290-294: sq_fc_mask is calculated as (qp_attr.max_tx >> 1) - 1, which produces a mask with all lower bits set. However, without visibility into where uct_rc_mlx5_gda_fc() is called, and without the old code showing the previous implementation, it cannot be definitively verified whether the logic inversion is correct or represents a bug.

The change from (wqe_idx & ep->sq_fc_mask) == 1 to !(wqe_idx & ep->sq_fc_mask) is fundamentally different in semantics and could potentially break flow control if not coordinated with other changes in the mask setup or caller logic.

Verify flow control logic change carefully.

The FC condition has been inverted from (wqe_idx & ep->sq_fc_mask) == 1 to !(wqe_idx & ep->sq_fc_mask). While sq_fc_mask is set to (qp_attr.max_tx >> 1) - 1, the mask comparison logic now triggers at different WQE indices:

  • Old: triggers when masked bits equal 1
  • New: triggers when masked bits equal 0

Without visibility into all call sites and the accompanying mask definition changes, manually verify this inverted flow control logic is intentional and correctly coordinates with how CQ updates and completions are requested throughout the GDA code path.
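The practical effect of the inversion can be seen by listing which WQE indices trigger flow control under each condition, using the document's own mask formula with a hypothetical `max_tx = 16`:

```python
# sq_fc_mask = (qp_attr.max_tx >> 1) - 1, per the review; max_tx = 16 -> 0b111
mask = (16 >> 1) - 1

old = [i for i in range(32) if (i & mask) == 1]    # previous condition
new = [i for i in range(32) if not (i & mask)]     # inverted condition

assert old == [1, 9, 17, 25]    # old: fires when masked bits equal 1
assert new == [0, 8, 16, 24]    # new: fires at multiples of the FC period
```

Both variants fire once per period; the inversion shifts which index within each period requests a completion, which is why coordination with the CQ-update call sites matters.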

Comment on lines 96 to 108

```c
UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
{
    /* Try to reserve optimistically */
    int32_t prev = atomicAdd(&ep->avail_count, -(int32_t)count);
    if (prev < (int32_t)count) {
        /* Rollback */
        atomicAdd(&ep->avail_count, count);
    uint32_t ret;
    asm volatile("{\n\t"
                 ".reg .b32 mask;\n\t"
                 ".reg .b32 ign;\n\t"
                 "mov.b32 mask, 0x1;\n\t"
                 "prmt.b32 %0, %1, ign, mask;\n\t"
                 "}"
                 : "=r"(ret)
                 : "r"((uint32_t)x));
    return ret;
}
```

⚠️ Potential issue | 🔴 Critical

Critical: PTX PRMT mask remains incorrect despite past review.

The byte-swap mask is still 0x1 (Line 102), which does not perform byte swapping. A previous review flagged this and marked it as "Addressed in commit 01a58e0", but the code remains unchanged. For 16-bit byte swapping, the selector must be 0x3210 where each 4-bit nibble selects one of the source bytes.

Apply this diff to fix the mask:

```diff
-                 "mov.b32 mask, 0x1;\n\t"
+                 "mov.b32 mask, 0x3210;\n\t"
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

```c
UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
{
    /* Try to reserve optimistically */
    int32_t prev = atomicAdd(&ep->avail_count, -(int32_t)count);
    if (prev < (int32_t)count) {
        /* Rollback */
        atomicAdd(&ep->avail_count, count);
    uint32_t ret;
    asm volatile("{\n\t"
                 ".reg .b32 mask;\n\t"
                 ".reg .b32 ign;\n\t"
                 "mov.b32 mask, 0x1;\n\t"
                 "prmt.b32 %0, %1, ign, mask;\n\t"
                 "}"
                 : "=r"(ret)
                 : "r"((uint32_t)x));
    return ret;
}
```

After:

```c
UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
{
    uint32_t ret;
    asm volatile("{\n\t"
                 ".reg .b32 mask;\n\t"
                 ".reg .b32 ign;\n\t"
                 "mov.b32 mask, 0x3210;\n\t"
                 "prmt.b32 %0, %1, ign, mask;\n\t"
                 "}"
                 : "=r"(ret)
                 : "r"((uint32_t)x));
    return ret;
}
```
🤖 Prompt for AI Agents
In src/uct/ib/mlx5/gdaki/gdaki.cuh around lines 96-108 the PTX prmt.b32 mask is
incorrectly set to 0x1 which does not perform a 16-bit byte swap; change the
immediate mask to 0x3210 so each 4-bit selector maps the correct source byte for
a 16-bit byteswap, keep the asm template and operand order the same (input as
(uint32_t)x) and ensure the function still returns the lower 16 bits
(cast/truncate the uint32_t ret to uint16_t before returning).

@jeynmann jeynmann changed the title [WIP][DNM] UCT/DEVICE: optimize doorbell ring UCT/DEVICE: optimize doorbell ring Nov 10, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
src/uct/ib/mlx5/gdaki/gdaki.cuh (3)

96-108: Critical: PTX byte-swap mask still incorrect.

This issue was previously flagged as critical but remains unresolved. The mask 0x1 at Line 102 does not perform 16-bit byte swapping. The correct selector for prmt.b32 must be 0x3210 where each 4-bit nibble maps the source bytes properly.

Apply this diff:

```diff
-                 "mov.b32 mask, 0x1;\n\t"
+                 "mov.b32 mask, 0x3210;\n\t"
```

110-125: Critical: CQE parsing returns corrupted wqe_cnt.

Line 119 calls uct_rc_mlx5_gda_bswap16, which uses the broken PTX mask. Until that mask is corrected, all CQE reads will produce incorrect wqe_cnt values, breaking completion tracking and producer-index calculations throughout the codebase.


127-135: Critical: SQ overbooking still present.

The extra + 1 at Line 134 allows allocation to return pi + sq_wqe_num when the SQ is full, reusing the oldest in-flight WQE slot and corrupting outstanding operations. This issue was flagged in a previous review but remains unresolved.

Apply this diff:

```diff
-    return pi + ep->sq_wqe_num + 1 - count;
+    return pi + ep->sq_wqe_num - count;
```
🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki_dev.h (1)

22-22: Add inline documentation for the new sq_db_index field.

The field lacks documentation explaining its purpose. Add a comment above or next to the field (line 22 in gdaki_dev.h) describing that it tracks the last rung doorbell index to avoid redundant rings.

Note: The field is not yet used in the codebase and appears preparatory for future optimization logic. The zero-initialization via {} in gdaki.c:74 is appropriate and requires no changes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a712778 and 3a8e9ba.

📒 Files selected for processing (2)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (1 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

285-289: Inversion of FC logic is confirmed—verify this semantic change was intentional.

The flow-control readiness check was indeed inverted in commit ea65cc6. The original logic (wqe_idx & ep->sq_fc_mask) == 1 has been changed to !(wqe_idx & ep->sq_fc_mask), which reverses when flow-control triggers CQ updates.

All three call sites (lines 313, 433, 527 in gdaki.cuh) now use the inverted logic to set the DOCA_GPUNETIO_MLX5_WQE_CTRL_CQ_UPDATE flag. This is a significant semantic change that alters which WQE indices receive completion notifications. Ensure this inversion aligns with the FC behavior intended by the "Collapsed CQ" changes and that all downstream consumers correctly interpret the new return values.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
src/uct/ib/mlx5/gdaki/gdaki.cuh (3)

96-108: Fix uct_rc_mlx5_gda_bswap16: wrong PRMT mask and missing truncation.

The PTX sequence still uses mov.b32 mask, 0x1, which does not perform a 16‑bit byte swap, and the function returns the full 32‑bit ret instead of truncating to 16 bits. This keeps CQE wqe_cnt decoding wrong and can corrupt completion tracking.

Please switch to the proper 16‑bit byteswap selector and truncate on return, e.g.:

```diff
 UCS_F_DEVICE uint16_t uct_rc_mlx5_gda_bswap16(uint16_t x)
 {
     uint32_t ret;
     asm volatile("{\n\t"
                  ".reg .b32 mask;\n\t"
                  ".reg .b32 ign;\n\t"
-                 "mov.b32 mask, 0x1;\n\t"
+                 "mov.b32 mask, 0x3210;\n\t"
                  "prmt.b32 %0, %1, ign, mask;\n\t"
                  "}"
                  : "=r"(ret)
                  : "r"((uint32_t)x));
-    return ret;
+    return (uint16_t)ret;
 }
```

127-135: Prevent SQ overbooking in uct_rc_mlx5_gda_max_alloc_wqe_base.

uct_rc_mlx5_gda_max_alloc_wqe_base() still returns pi + ep->sq_wqe_num + 1 - count. When the SQ is full (e.g. sq_rsvd_index == pi + sq_wqe_num and count == 1), this allows max_wqe_base == pi + sq_wqe_num, so the precheck in uct_rc_mlx5_gda_reserv_wqe_thread() passes and atomicAdd() can hand out wqe_base == pi + sq_wqe_num, reusing the oldest in‑flight WQE slot and corrupting outstanding operations.

Cap at exactly sq_wqe_num by dropping the extra + 1:

```diff
-    return pi + ep->sq_wqe_num + 1 - count;
+    return pi + ep->sq_wqe_num - count;
```

259-297: Use atomic loads for sq_ready_index (avoid mixed atomic/non‑atomic access) and briefly document the double doorbell.

ep->sq_ready_index is updated via cuda::atomic_ref (ref.compare_exchange_strong), but later read with READ_ONCE(ep->sq_ready_index) (Line 286) and a plain load (Line 289). Mixing atomic_ref operations with non‑atomic reads on the same object is undefined by the C++ memory model and can introduce subtle races.

You can keep the current algorithm and fix the data‑race by using the same atomic_ref for all reads, e.g.:

```diff
-    if ((no_delay && READ_ONCE(ep->sq_ready_index) == wqe_next) ||
+    if ((no_delay && ref.load(cuda::std::memory_order_relaxed) == wqe_next) ||
         (!no_delay && ((wqe_base ^ wqe_next) & UCT_RC_GDA_DB_BATCH_SIZE))) {
         uct_rc_mlx5_gda_lock(&ep->sq_lock);
-        const uint64_t ready_index = ep->sq_ready_index;
+        const uint64_t ready_index =
+                ref.load(cuda::std::memory_order_relaxed);
         if (ep->sq_db_index != ready_index) {
             ep->sq_db_index = ready_index;
             uct_rc_mlx5_gda_ring_db(ep, ready_index);
             uct_rc_mlx5_gda_update_dbr(ep, ready_index);
             uct_rc_mlx5_gda_ring_db(ep, ready_index);
         }
         uct_rc_mlx5_gda_unlock(&ep->sq_lock);
     }
```

Additionally, the new comment nicely explains the NODELAY vs batched paths and the double read, but it still does not say why we ring the doorbell twice around uct_rc_mlx5_gda_update_dbr(). A short note (e.g., “hardware requires DBR update to be bracketed by two doorbells for X/Y reason”) would address the earlier review concern and make this pattern self‑explanatory.

🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

15-19: Clarify semantics/constraints of UCT_RC_GDA_DB_BATCH_SIZE.

UCT_RC_GDA_DB_BATCH_SIZE is used as a bit mask in ((wqe_base ^ wqe_next) & UCT_RC_GDA_DB_BATCH_SIZE) (Line 287), so it must stay a power-of-two and effectively defines a batch boundary bit rather than an arbitrary “size”. Consider either renaming to make the mask-like nature explicit or adding a short comment (e.g., “must be power-of-two; used as WQE-index boundary bit for DB batching”) to prevent accidental changes that would break the batching condition.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a8e9ba and c5841d5.

📒 Files selected for processing (1)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (2 hunks)
