Binyang2014
Contributor

This PR contains the following changes:
Python side:

  • Channel-based DSL implementation: decouple channels from chunks.
    • Users create channels explicitly; only local_rank, remote_rank and channel_type are needed.
    • Adjust the executor JSON file: add remote_buffer fields so that different ops can use different combinations of channels and remote buffers.
  • Reimplement operation fusion and the data dependency check mechanism.
  • Add new ops such as semaphore and pipeline.
  • Clean up code and improve documentation.

C++ side:

  • Support the new execution file JSON format.
  • Support semaphore and pipeline operations.
  • Clean up code; support the non-zero-copy scenario.

caiomcbr and others added 21 commits May 29, 2025 10:21
This PR covers the implementation of the following operations:
- nop
- copy_packet
- transform_to_packet
- reduce_copy
- reduce_copy_packet
- relaxed_signal
- relaxed_wait
- barrier
- flush
- get
- put_packet
- read_put_packet
- put_with_signal
- put_with_signal_and_flush
- reduce_copy_send
- reduce_copy_send_packet
- read_reduce_copy
- read_reduce_copy_send
- group_load_reduce
- group_store

---------

Co-authored-by: Binyang Li <[email protected]>
Fuse Operations

Rank:

- Sync + Sync = Sync

Memory Channel:

- Signal + Signal = Signal (Operations should use different channels)
- Wait + Wait = Wait (Operations should use different channels)
- Relaxed Signal + Relaxed Signal = Relaxed Signal (Operations should use
different channels)
- Relaxed Wait + Relaxed Wait = Relaxed Wait (Operations should use different
channels)
- Get + Get = Get (Operation sizes should match)
- Put + Put = Put (Operation sizes should match)
- Put Packet + Put Packet = Put Packet (Operation sizes should match)
- Reduce + Reduce = Reduce (Operations should have the same
local_src_buff, local_dst_buff and reduce_operation)
- Reduce Packet + Reduce Packet = Reduce Packet (Operations should have
the same local_src_buff, local_dst_buff and reduce_operation)
- Read Reduce + Read Reduce = Read Reduce (Operations should have the
same local_src_buff, local_dst_buff and reduce_operation)

- Reduce + Put = Reduce Send (Operations should match the reduce
dst_buff and the put src_buff)
- Reduce Send + Put = Reduce Send (Operations should match the reduce
dst_buff and the put src_buff)
- Read Reduce + Put = Read Reduce Send (Operations should match the
reduce dst_buff and the put src_buff)
- Read Reduce Send + Put = Read Reduce Send (Operations should match the
reduce dst_buff and the put src_buff)
- Reduce Packet + Put Packet = Reduce Send Packet (Operations should
match the reduce dst_buff and the put src_buff)
- Reduce Send Packet + Put Packet = Reduce Send Packet (Operations
should match the reduce dst_buff and the put src_buff)

Port Channel:

- Signal + Signal = Signal (Operations should use different
channels)
- Wait + Wait = Wait (Operations should use different channels)
- Flush + Flush = Flush
- Put + Put = Put (Operation sizes should match)
- Put With Signal + Put With Signal = Put With Signal (Operation sizes
should match)
- Put With Signal and Flush + Put With Signal and Flush = Put With
Signal and Flush (Operation sizes should match)
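
A few of the fusion rules above can be sketched in Python as follows. This is only an illustration, not the PR's implementation: the `Op` record, `try_fuse`, and `fuse_all` are hypothetical names, and only the Signal + Signal, Reduce + Put, and Reduce Send + Put rules are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Op:
    """Hypothetical operation record; not the DSL's real class."""
    kind: str                        # "signal", "put", "reduce", "reduce_send", ...
    channels: Tuple[int, ...] = ()   # ids of the channels the op uses
    src_buff: Optional[str] = None
    dst_buff: Optional[str] = None


def try_fuse(a: Op, b: Op) -> Optional[Op]:
    """Return the fused op for the pair (a, b), or None if no rule applies."""
    # Signal + Signal = Signal, only if the two ops use different channels.
    if a.kind == b.kind == "signal" and not set(a.channels) & set(b.channels):
        return Op("signal", a.channels + b.channels)
    # Reduce + Put = Reduce Send and Reduce Send + Put = Reduce Send,
    # only if the reduce dst_buff matches the put src_buff.
    if a.kind in ("reduce", "reduce_send") and b.kind == "put" and a.dst_buff == b.src_buff:
        return Op("reduce_send", a.channels + b.channels, a.src_buff, a.dst_buff)
    return None


def fuse_all(ops: List[Op]) -> List[Op]:
    """Greedily fuse each op into its predecessor when a rule allows it."""
    fused: List[Op] = []
    for op in ops:
        merged = try_fuse(fused[-1], op) if fused else None
        if merged is None:
            fused.append(op)
        else:
            fused[-1] = merged
    return fused


ops = [
    Op("signal", (0,)),
    Op("signal", (1,)),
    Op("reduce", src_buff="input", dst_buff="output"),
    Op("put", (2,), src_buff="output"),
    Op("put", (3,), src_buff="output"),
]
result = fuse_all(ops)
```

In this sketch the five operations collapse into two: one Signal over channels 0 and 1, and one Reduce Send over channels 2 and 3. Two signals on the same channel are left unfused.
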
This PR ensures that if two or more operations touch the same memory
location, a thread block synchronization is inserted between these
operations to prevent conflicts. For instance:

copy chunk 0 -> chunk 1
put chunk 1 -> chunk 2

Here the copy operation writes chunk 1 and the put operation reads it.
The copy operation must therefore finish before the put operation
starts, so a synchronization mechanism is needed between them to avoid
any conflicts.

In this PR, we insert a synchronization mechanism between two
operations under the following conditions:

- An operation with write access followed by an operation with write
access to the same memory location
- An operation with read access followed by an operation with write
access to the same memory location
- An operation with write access followed by an operation with read
access to the same memory location
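
The three conditions can be sketched as a small hazard check. This is a minimal illustration under simplifying assumptions, not the PR's code: `MemOp`, `conflicts`, and `insert_syncs` are hypothetical names, and memory locations are plain strings rather than (rank, buffer, offset, size) ranges.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MemOp:
    """Hypothetical op record listing the locations it reads and writes."""
    name: str
    reads: Set[str] = field(default_factory=set)
    writes: Set[str] = field(default_factory=set)


def conflicts(prev: MemOp, nxt: MemOp) -> bool:
    """True if running nxt after prev needs a synchronization."""
    write_after_write = prev.writes & nxt.writes
    write_after_read = prev.reads & nxt.writes
    read_after_write = prev.writes & nxt.reads
    return bool(write_after_write or write_after_read or read_after_write)


def insert_syncs(ops: List[MemOp]) -> List[str]:
    """Return the op names with "sync" inserted wherever a conflict occurs."""
    schedule: List[str] = []
    pending: List[MemOp] = []  # ops issued since the last sync
    for op in ops:
        if any(conflicts(p, op) for p in pending):
            schedule.append("sync")
            pending = []
        schedule.append(op.name)
        pending.append(op)
    return schedule


# The example from the text: copy writes chunk 1, put reads chunk 1.
copy_op = MemOp("copy", reads={"chunk0"}, writes={"chunk1"})
put_op = MemOp("put", reads={"chunk1"}, writes={"chunk2"})
schedule = insert_syncs([copy_op, put_op])
```

Two operations that only read the same location do not conflict, so no synchronization is inserted between them.
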

---------

Co-authored-by: Binyang Li <[email protected]>
Part I of the channel-based DSL on the C++ side
Load the new JSON format and run ops on the device.
The remaining kernel operations and the pipeline operations will be
completed in the next PR.
Introduces an asynchronous synchronization mechanism for threads within
the same rank.
Introduces a mechanism for creating a pipeline over a sequence of
operations, specifying both the number of chunks and the unit size.
Previously, the switch channel operation would automatically create the
scratch buffer if it was referenced and had not yet been created by the
user. We have now removed this behavior; instead, an exception will be
thrown if the user does not properly create the scratch buffer for all
ranks.
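
The new behavior can be pictured as an explicit validation step. The sketch below is an assumption-laden illustration, not the actual check: `check_scratch_buffers`, `MissingScratchBufferError`, and the rank-to-size mapping are hypothetical.

```python
from typing import Dict, Iterable


class MissingScratchBufferError(RuntimeError):
    """Hypothetical error type for a rank without a scratch buffer."""


def check_scratch_buffers(scratch_sizes: Dict[int, int], ranks: Iterable[int]) -> None:
    """Raise if any rank has no user-created scratch buffer.

    scratch_sizes maps a rank to the size of the scratch buffer the user
    created for it; a missing entry or a size of 0 counts as not created.
    """
    missing = [rank for rank in ranks if scratch_sizes.get(rank, 0) <= 0]
    if missing:
        raise MissingScratchBufferError(
            f"scratch buffer not created for ranks {missing}"
        )


# Passes: both ranks have a scratch buffer.
check_scratch_buffers({0: 1024, 1: 1024}, range(2))
```

Failing early with an exception makes a missing buffer visible when the program is built, instead of silently allocating one on the user's behalf.
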
- Temporarily disable the DSL CI pipeline
- Fix executor issue: only add constOffset for remote buffers
Fix ROCm build failure
@Binyang2014
Contributor Author

/azp run

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@Binyang2014 Binyang2014 marked this pull request as ready for review July 25, 2025 17:06
@Binyang2014
Contributor Author

/azp run

Azure Pipelines successfully started running 3 pipeline(s).

Contributor

@chhwang chhwang left a comment

In progress

Contributor

@chhwang chhwang left a comment

Suggestions for future PRs

  • Since many methods need a tb parameter, let's make its position consistent (e.g., make it the first parameter)
  • I think local_dst_chunk should be mandatory and local_src_chunk should be optional instead. Because we may not have local data to accumulate.
  • How about combining the put*() methods into one, like def put(self, ..., signal: bool = False, flush: bool = False)
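
The combined put() suggested above might look like the following sketch. It is not the DSL's API: `PortChannelSketch` is a hypothetical stand-in that only records which op name would be emitted, and the parameter order simply follows the first two suggestions (tb first, destination chunk before source chunk). Rejecting flush without signal is an assumption made here because the op list has no matching fused op; a real implementation might emit put followed by flush instead.

```python
from typing import List


class PortChannelSketch:
    """Hypothetical channel that records the names of the ops it emits."""

    def __init__(self) -> None:
        self.ops: List[str] = []

    def put(self, tb: int, dst_chunk: str, src_chunk: str,
            signal: bool = False, flush: bool = False) -> str:
        """Single entry point replacing the separate put*() methods."""
        if flush and not signal:
            # Assumption of this sketch: there is no fused put-and-flush op.
            raise ValueError("flush=True requires signal=True in this sketch")
        if signal and flush:
            name = "put_with_signal_and_flush"
        elif signal:
            name = "put_with_signal"
        else:
            name = "put"
        self.ops.append(name)
        return name


channel = PortChannelSketch()
channel.put(0, "remote_chunk", "local_chunk")
channel.put(0, "remote_chunk", "local_chunk", signal=True)
channel.put(0, "remote_chunk", "local_chunk", signal=True, flush=True)
```
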

@chhwang chhwang merged commit be6a941 into main Aug 9, 2025
17 of 29 checks passed
@chhwang chhwang deleted the feature/dsl branch August 9, 2025 07:36
3 participants