New DSL implementation #579
Merged
Co-authored-by: Binyang Li <[email protected]>
This PR covers the implementation of the following operations:
- nop
- copy_packet
- transform_to_packet
- reduce_copy
- reduce_copy_packet
- relaxed_signal
- relaxed_wait
- barrier
- flush
- get
- put_packet
- read_put_packet
- put_with_signal
- put_with_signal_and_flush
- reduce_copy_send
- reduce_copy_send_packet
- read_reduce_copy
- read_reduce_copy_send
- group_load_reduce
- group_store

---------
Co-authored-by: Binyang Li <[email protected]>
Fuse Operations

Rank:
- Sync + Sync = Sync

Memory Channel:
- Signal + Signal = Signal (Operations should use different channels)
- Wait + Wait = Wait (Operations should use different channels)
- Relax Signal + Relax Signal = Relax Signal (Operations should use different channels)
- Relax Wait + Relax Wait = Relax Wait (Operations should use different channels)
- Get + Get = Get (Operations size should match)
- Put + Put = Put (Operations size should match)
- Put Packet + Put Packet = Put Packet (Operations size should match)
- Reduce + Reduce = Reduce (Operations should have the same local_src_buff, local_dst_buff and reduce_operation)
- Reduce Packet + Reduce Packet = Reduce Packet (Operations should have the same local_src_buff, local_dst_buff and reduce_operation)
- Read Reduce + Read Reduce = Read Reduce (Operations should have the same local_src_buff, local_dst_buff and reduce_operation)
- Reduce + Put = Reduce Send (Operations should match the reduce dst_buff and the put src_buff)
- Reduce Send + Put = Reduce Send (Operations should match the reduce dst_buff and the put src_buff)
- Read Reduce + Put = Read Reduce Send (Operations should match the reduce dst_buff and the put src_buff)
- Read Reduce Send + Put = Read Reduce Send (Operations should match the reduce dst_buff and the put src_buff)
- Reduce Packet + Put Packet = Reduce Send Packet (Operations should match the reduce dst_buff and the put src_buff)
- Reduce Send Packet + Put Packet = Reduce Send Packet (Operations should match the reduce dst_buff and the put src_buff)

Port Channel:
- Signal + Signal = Signal (the operations should use different channels)
- Wait + Wait = Wait (the operations should use different channels)
- Flush + Flush = Flush
- Put + Put = Put (Operations size should match)
- Put With Signal + Put With Signal = Put With Signal (Operations size should match)
- Put With Signal and Flush + Put With Signal and Flush = Put With Signal and Flush (Operations size should match)
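As a rough illustration of the "Reduce + Put" family of rules above, the sketch below models fusion as a lookup table plus the buffer-match check. The `Op` record, its field names, and the `fuse` helper are hypothetical and written only for this example; they are not the classes used in the PR.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical operation record (not the PR's actual class)."""
    kind: str                          # e.g. "reduce", "put", "reduce_send"
    src_buff: str                      # buffer the operation reads from
    dst_buff: str                      # buffer the operation writes to
    remote_dsts: Tuple[str, ...] = ()  # destinations collected by fused sends


# Subset of the rule list: (first kind, second kind) -> fused kind.
REDUCE_PUT_RULES = {
    ("reduce", "put"): "reduce_send",
    ("reduce_send", "put"): "reduce_send",
    ("read_reduce", "put"): "read_reduce_send",
    ("read_reduce_send", "put"): "read_reduce_send",
    ("reduce_packet", "put_packet"): "reduce_send_packet",
    ("reduce_send_packet", "put_packet"): "reduce_send_packet",
}


def fuse(first: Op, second: Op) -> Optional[Op]:
    """Return the fused operation, or None when the pair cannot be fused."""
    fused_kind = REDUCE_PUT_RULES.get((first.kind, second.kind))
    if fused_kind is None:
        return None
    # The put must read exactly what the reduce wrote.
    if first.dst_buff != second.src_buff:
        return None
    return Op(
        kind=fused_kind,
        src_buff=first.src_buff,
        dst_buff=first.dst_buff,
        remote_dsts=first.remote_dsts + (second.dst_buff,),
    )
```

Because the fused operation keeps the reduce's `dst_buff`, a later put that reads the same buffer can be folded into it again, which mirrors the "Reduce Send + Put = Reduce Send" rule.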
The PR ensures that if two or more operations touch the same memory location, a mechanism is inserted to synchronize the thread block between these operations, preventing any errors. For instance:

    copy chunk 0 -> chunk 1
    put chunk 1 -> chunk 2

Here the copy operation writes chunk 1 and the put operation reads it. In this case, the copy operation must finish before the put operation starts, so a synchronization mechanism is needed between them to avoid any conflicts.

In this PR, synchronization mechanisms are inserted between operations under the following conditions:
- An operation with write access followed by an operation with write access to the same memory location
- An operation with read access followed by an operation with write access to the same memory location
- An operation with write access followed by an operation with read access to the same memory location

---------
Co-authored-by: Binyang Li <[email protected]>
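The three conditions above can be sketched as a simple hazard check over the chunks each operation reads and writes. This is a minimal illustration under assumptions: the `Op` record, the `"sync"` marker, and both helper functions are hypothetical names, not the PR's implementation.

```python
from dataclasses import dataclass

SYNC = "sync"  # hypothetical marker for a thread-block synchronization


@dataclass(frozen=True)
class Op:
    """Hypothetical operation record: a name plus the chunks it touches."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def needs_sync(first: Op, second: Op) -> bool:
    """True when both ops touch the same chunk and at least one writes it."""
    write_then_write = first.writes & second.writes
    read_then_write = first.reads & second.writes
    write_then_read = first.writes & second.reads
    return bool(write_then_write or read_then_write or write_then_read)


def insert_syncs(ops):
    """Return a new list with SYNC placed before any op that conflicts
    with an op issued since the previous SYNC."""
    result, pending = [], []
    for op in ops:
        if any(needs_sync(prev, op) for prev in pending):
            result.append(SYNC)
            pending = []  # everything before the sync is now complete
        result.append(op)
        pending.append(op)
    return result
```

For the example in the description, `copy` writes chunk 1 and `put` reads chunk 1, so `insert_syncs` places one marker between them. Two operations that only read the same chunk do not trigger a synchronization.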
Part I of the channel-based DSL on the C++ side: load the new JSON format and run ops on the device. The remaining kernel operations and the pipeline operations will be finished in the next PR.
Introduces an asynchronous synchronization mechanism for threads within the same rank.
Introduces a mechanism for creating a pipeline over a sequence of operations, specifying both the number of chunks and the unit size.
Previously, the switch channel operation would automatically create the scratch buffer if it was referenced and had not yet been created by the user. We have now removed this behavior; instead, an exception will be thrown if the user does not properly create the scratch buffer for all ranks.
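The new behavior can be pictured as an explicit validation step. The sketch below is an assumption-laden illustration: `MissingScratchBufferError`, `require_scratch_buffers`, and the rank-to-size mapping are hypothetical names chosen for this example and do not come from the PR.

```python
class MissingScratchBufferError(RuntimeError):
    """Hypothetical error type used only in this sketch."""


def require_scratch_buffers(scratch_buffers: dict, ranks) -> None:
    """Raise if any referenced rank has no user-created scratch buffer.

    scratch_buffers maps rank -> buffer size (hypothetical representation).
    Nothing is created implicitly; a missing entry is an error.
    """
    missing = sorted(rank for rank in ranks if rank not in scratch_buffers)
    if missing:
        raise MissingScratchBufferError(
            f"scratch buffer not created for rank(s): {missing}"
        )
```

Calling `require_scratch_buffers({0: 4}, [0, 1])` raises because rank 1 has no buffer, whereas the earlier behavior would have created one silently.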
- Temporarily disable the DSL CI pipeline
- Fix executor issue: only add constOffset for remote buffer
Co-authored-by: Caio Rocha <[email protected]>
Fix rocm build failure issue
/azp run
Azure Pipelines could not run because the pipeline triggers exclude this branch/path.
Co-authored-by: Binyang Li <[email protected]>
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
chhwang
reviewed
Jul 31, 2025
Co-authored-by: Binyang Li <[email protected]>
Binyang2014
commented
Jul 31, 2025
chhwang
reviewed
Aug 2, 2025
In progress
chhwang
approved these changes
Aug 7, 2025
Suggestions for future PRs
- Since many methods need a `tb` parameter, let's make its position consistent (e.g., make it the first parameter)
- I think `local_dst_chunk` should be mandatory and `local_src_chunk` should be optional instead, because we may not have local data to accumulate.
- How about combining the `put*()` methods into one, like `def put(self, ..., signal: bool = False, flush: bool = False)`
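One way to read the last suggestion is as a flag-to-operation mapping onto the existing operation names. The sketch below is only an illustration of that idea: `select_put_operation` is a hypothetical helper, and rejecting `flush=True` without `signal=True` is an assumption made here because no such operation appears in the PR's operation list.

```python
def select_put_operation(signal: bool = False, flush: bool = False) -> str:
    """Map the proposed keyword flags to an existing operation name."""
    if flush and not signal:
        # Assumption: no flush-only put exists in the listed operations.
        raise ValueError("flush=True requires signal=True in this sketch")
    if signal and flush:
        return "put_with_signal_and_flush"
    if signal:
        return "put_with_signal"
    return "put"
```

With this shape, callers would pass `signal=True, flush=True` instead of choosing among three separately named methods.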
chhwang
approved these changes
Aug 9, 2025
The PR contains the following changes:
Python side:
C++ side: