Skip to content

Conversation

@nsarka
Copy link
Member

@nsarka nsarka commented Jan 5, 2026

No description provided.

@nsarka nsarka self-assigned this Jan 5, 2026
@nsarka nsarka force-pushed the nsarka/rs-gemm-cuda branch from a250a75 to d5a36ee Compare January 5, 2026 18:28
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR extends the stream parallel type lowering pass to support the kCuda backend for matrix multiplication with reduce-scatter (MM+RS) operations. Previously, only the kNccl backend was supported for this lowering path.

Key Changes

  • Added kCuda backend handling in two critical lowering paths (MM+RS and AG+MM algorithms) in stream_parallel_type.cpp

    • Implemented ShareMemHandles for memory handle sharing between P2P operations
    • Added protocol-specific ordering (P2pProtocol::Get vs P2pProtocol::Put) for send/recv operations
    • Implemented deferred wait semantics: wait_recv executes immediately while wait_send is deferred to loop epilogue
    • Both code paths now have consistent structure and error handling
  • Refactored protocol selection from if-else chains to switch statements for better code clarity and exhaustive enum handling

  • Enhanced test coverage by converting ReduceScatterP2p test to a parameterized test that validates both kCuda and kNccl backends

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation follows established patterns from existing AG+MM path, maintains consistent structure across both lowering paths, properly handles all enum cases with exhaustive switch statements, includes comprehensive test coverage for both backends, and adds appropriate error handling for unsupported backends
  • No files require special attention

Important Files Changed

Filename Overview
csrc/host_ir/pass/stream_parallel_type.cpp Added kCuda backend support for MM+RS stream lowering with proper P2P communication handling, deferred wait semantics, and refactored protocol selection to use switch statements
tests/cpp/test_multidevice_stream_parallel_type.cpp Converted ReduceScatterP2p test to parameterized test that validates both kCuda and kNccl backends

Sequence Diagram

sequenceDiagram
    participant Lowering as Stream Parallel Lowering
    participant Backend as Communicator Backend
    participant P2P as P2P Communication
    participant Wait as Wait Handler
    
    Lowering->>Backend: Check communicator_backend
    
    alt kNccl Backend
        Backend->>P2P: StartCoalescing
        P2P->>P2P: Create RECV
        P2P->>P2P: Create SEND
        P2P->>P2P: EndCoalescing
        P2P->>Wait: Wait(end_coalescing)
    else kCuda Backend
        Backend->>P2P: ShareMemHandles(recv, send)
        alt P2pProtocol::Get
            P2P->>P2P: SEND first
            P2P->>P2P: RECV second
        else P2pProtocol::Put
            P2P->>P2P: RECV first
            P2P->>P2P: SEND second
        end
        P2P->>Wait: Wait(recv) - immediate
        P2P->>Wait: Wait(send) - deferred to epilogue
    else Unsupported
        Backend->>Lowering: NVF_THROW
    end
Loading

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile's behavior is changing!

From now on, if a review finishes with no comments, we will not post an additional "statistics" comment to confirm that our review found nothing to comment on. However, you can confirm that we reviewed your changes in the status check section.

This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".

@nsarka nsarka force-pushed the nsarka/rs-gemm-cuda branch from 487bc2b to 162c0c7 Compare January 5, 2026 18:50
@nsarka
Copy link
Member Author

nsarka commented Jan 5, 2026

Thanks for the suggestions @wujingyue

@nsarka nsarka requested a review from wujingyue January 5, 2026 19:05
if_sending_to_self->elseBody().pushBack(send);
break;
}
default:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@nsarka nsarka force-pushed the nsarka/rs-gemm-cuda branch from 557cbc4 to 49d62fd Compare January 5, 2026 20:04
@nsarka nsarka requested a review from wujingyue January 5, 2026 20:04
@nsarka
Copy link
Member Author

nsarka commented Jan 5, 2026

!test

@nsarka nsarka merged commit 1bbd375 into NVIDIA:main Jan 5, 2026
54 of 56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants