Skip to content

Conversation

@shasson5
Copy link
Contributor

@shasson5 shasson5 commented Nov 5, 2025

What?

Add support for flush_ep

Why?

Required for cuda_ipc to be used in gtests

Summary by CodeRabbit

  • New Features
    • Added per-endpoint CUDA stream flush API that synchronizes all active streams and supports completion callbacks.
  • Bug Fixes
    • Improved flush progress behavior to avoid treating flush events as regular events, preventing premature completions.
    • Enhanced flush error handling and resource cleanup to reduce leaks and ensure reliable rollback on failure.

@coderabbitai
Copy link

coderabbitai bot commented Nov 5, 2025

Walkthrough

Adds a CUDA-specific endpoint flush API that enqueues per-stream flush events tied to a shared flush descriptor, introduces flush event types and a flush descriptor pool, updates progress to treat flush events specially, and adds rollback and cleanup on allocation/enqueue failures.

Changes

Cohort / File(s) Summary
Flush Type Definitions
src/uct/cuda/base/cuda_iface.h
Adds uct_cuda_flush_desc_t (stream counter + completion) and uct_cuda_flush_stream_desc_t (per-stream event embedding); extends uct_cuda_iface_t with a flush_mpool member and declares uct_cuda_base_ep_flush().
Flush Implementation
src/uct/cuda/base/cuda_iface.c
Implements uct_cuda_base_ep_flush() public API, per-stream callback uct_cuda_base_stream_flushed_cb() that decrements the shared flush counter and completes when zero, helper uct_cuda_base_event_is_flush(), introduces flush_mpool ops and pool lifecycle (init/cleanup), updates progress to skip treating flush events as regular completions, and adds rollback on allocation/enqueue failures.
Interface Handlers
src/uct/cuda/cuda_copy/cuda_copy_iface.c, src/uct/cuda/cuda_ipc/cuda_ipc_iface.c
Replaces ep_flush op bindings from generic uct_base_ep_flush to the new uct_cuda_base_ep_flush for CUDA copy and CUDA IPC ifaces.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant EP as Endpoint
    participant IFACE as CUDA_Base_IFACE
    participant Stream as GPU_Stream
    participant Progress as Progress_Loop

    Caller->>EP: ep_flush(comp)
    activate EP
    EP->>IFACE: uct_cuda_base_ep_flush()
    activate IFACE
    IFACE->>IFACE: alloc flush_desc (stream_counter = N)
    loop per active stream (N)
        IFACE->>Stream: enqueue per-stream flush event (flush_stream_desc)
    end
    EP-->>Caller: return (pending)
    deactivate EP
    deactivate IFACE

    par Stream completions & progress polling
        Stream->>IFACE: stream event triggers callback
        activate IFACE
        IFACE->>IFACE: uct_cuda_base_stream_flushed_cb() — decrement counter
        alt counter == 0
            IFACE->>Caller: complete user completion, free flush_desc
        end
        deactivate IFACE
        Progress->>Progress: poll events (skip flush events)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Check flush_mpool initialization/destruction and error-path cleanup.
  • Verify atomic/thread-safe handling of stream_counter in callbacks.
  • Confirm progress loop correctly identifies and skips flush events.
  • Review rollback logic for partially enqueued flush_stream_desc items.

Poem

🐇 I nudged each stream to hop in line,

small flags that wait, then tick and shine;
the counter counts each tiny feat,
when last one lands, the flush is sweet —
the rabbit hums, the buffers clear.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'UCT/CUDA: Add support for flush_ep' accurately reflects the main change: implementing flush_ep functionality for CUDA endpoints, as evidenced by the new uct_cuda_base_ep_flush API and integration across CUDA interfaces.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as spam.

@openucx openucx deleted a comment from coderabbitai bot Nov 5, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bbcc1d8 and e3d088d.

📒 Files selected for processing (1)
  • src/uct/cuda/base/cuda_iface.c (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/uct/cuda/base/cuda_iface.c (2)
src/ucs/debug/memtrack.c (2)
  • ucs_free (368-372)
  • ucs_malloc (328-334)
src/ucs/datastruct/mpool.c (7)
  • ucs_mpool_get (215-218)
  • ucs_mpool_put (220-223)
  • ucs_mpool_chunk_malloc (326-330)
  • ucs_mpool_chunk_free (332-335)
  • ucs_mpool_params_reset (60-73)
  • ucs_mpool_init (81-147)
  • ucs_mpool_cleanup (149-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (7)
src/uct/cuda/base/cuda_iface.c (7)

260-265: LGTM: Clean flush event detection.

The function pointer comparison is an effective way to identify flush events. The dual check (non-NULL comp and matching function) ensures correctness.


275-290: LGTM: Correct flush event handling in progress path.

The updated logic properly distinguishes flush events from regular events:

  • Flush events are extracted immediately (serving as queue markers)
  • Regular events are queried for completion before extraction
  • Flush events skip complete_event invocation (line 285-287) since they represent flush markers rather than actual CUDA operations

The flush-as-marker design is sound: when a flush event reaches the front of the queue, all preceding operations are guaranteed complete.


451-457: LGTM: Appropriate mpool configuration for flush descriptors.

Unlike regular event descriptors that require CUDA event initialization, flush descriptors are simple structures that use queue ordering for synchronization. NULL init/cleanup callbacks are correct here.


465-486: LGTM: Proper flush mpool initialization.

The constructor correctly:

  • Initializes mpool parameters with appropriate element size
  • Checks initialization status and returns errors properly
  • Follows the same pattern as event descriptor mpool initialization elsewhere in the codebase

502-502: LGTM: Proper flush mpool cleanup with leak checking.

The destructor correctly cleans up the flush mpool with leak checking enabled, matching the cleanup pattern for other mpools in the codebase.


209-211: No issues found after verification.

The NULL completion handling is correct and follows expected UCT semantics. When comp == NULL, the function returns UCS_INPROGRESS if there are active streams, indicating work is pending without a callback request—consistent with other async transports like UD. The mechanism is well-documented in the code comments explaining the flush event queue mechanism.


242-257: Rollback logic appears sound — no issues found.

The verification confirms the implementation is correct:

  1. ucs_queue_iter_elem macro correctly uses container_of to retrieve queue descriptors from the iterator over the embedded queue field
  2. Tail element assumption is valid — the rollback retrieves exactly stream_counter elements (matching successful pushes) from each queue; since push appends to tail and retrieval grabs tail, the most-recently-pushed item is correctly identified for rollback
  3. Type cast is safesuper is the first field of uct_cuda_flush_stream_desc_t, so casting uct_cuda_event_desc_t* back to the parent type is valid

The code relies on single-threaded enqueue semantics with no concurrent progress drains during the error path, which appears consistent with the endpoint context.

@openucx openucx deleted a comment from coderabbitai bot Nov 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants