-
Notifications
You must be signed in to change notification settings - Fork 117
Optimize scan KT and support user defined binary operators #2357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR reworks the scan kernel template implementation by transitioning from an SLM-driven approach to a sub-group-driven approach that primarily operates on hardware registers. It also adds support for user-defined binary operators and optimizes the implementation.
Key changes:
- Introduces modular cooperative lookback implementation with bit packing optimizations
- Implements optimized work-group and sub-group level scans with register-based operations
- Adds support for custom binary operators beyond SYCL function objects with known identities
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
test/kt/single_pass_scan.cpp |
Adds custom binary operator test (my_bit_xor ) to verify user-defined function object support |
test/kt/CMakeLists.txt |
Expands test coverage by adding uint8_t and uint16_t data types and additional test configurations |
include/oneapi/dpl/experimental/kt/single_pass_scan.h |
Core implementation rewrite - switches to sub-group driven scan with register-based operations and modular lookback |
include/oneapi/dpl/experimental/kt/internal/work_group/work_group_scan.h |
New optimized work-group scan implementation using sub-group level operations |
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h |
New sub-group scan implementation operating on register arrays in strided layout |
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h |
Extracted cooperative lookback implementation with specialized bit-packed storage for small types |
Comments suppressed due to low confidence (1)
include/oneapi/dpl/experimental/kt/single_pass_scan.h:314
- There is an extra closing brace that doesn't match any opening brace. This will cause a compilation error.
}
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made it through cooperative_lookback.h. I still need to review the rest.
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h
Outdated
Show resolved
Hide resolved
f9aa36c
to
1486a10
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through the changes and I don't see any glaring issues. I just have a few nitpicks but I think it's in a good shape to merge.
} | ||
|
||
// For initialization routines, we do not need atomicity, so we can write through the | ||
// reference directly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this comment to line 128 (right before the direct write)? I think it would be clearer what it is referring to. Same for the set_init
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done.
_StatusValues __status_vals_partial; | ||
std::size_t __current_num_items; | ||
_TileVals __tile_vals; | ||
std::size_t __num_tiles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::size_t __num_tiles; | |
_TileIdxT __num_tiles; |
The tile index is 32-bits, so we can probably restrict the size of the number of tiles here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I had to change a few other types as well.
include/oneapi/dpl/experimental/kt/internal/work_group/work_group_scan.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h
Outdated
Show resolved
Hide resolved
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h
Outdated
Show resolved
Hide resolved
Overall this looks good. If we fix the outdated comment mentioning default initialized value type #2357 (comment), and either fix or make an issue for the zeroth tile performance issue #2357 (comment), then I'm ready to approve. The rest are nitpicks which you can apply or ignore. |
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
The overhead of avoiding decoupled lookback and atomic counter is really low. Additionally, we avoid temporary storage allocation and init kernel at runtime. In total, we just compile init kernel + main scan kernel for both single tile and multi tile versions. Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
It does not always look back anymore for the single tile case and lookback is now implemented separate from the core implementation. Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
…_submitter Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
a32b4de
to
1bc2b67
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with green CI. Thanks for the quick responses.
* Implements optimized sub-group and work-group scans for use in our scan KT. * Optimizes cooperative lookback to minimize global atomic memory traffic. * Enables support for user-defined binary operators. --------- Signed-off-by: Matthew Michel <[email protected]>
This PR reworks our scan kernel template by switching from an SLM driven scan where all data is loaded and operated in SLM to a sub-group driven scan where data is primarily stored and operated on in hardware registers. Additionally, this PR adds the building blocks necessary to implement future scan-like kernel templates.
The following changes are made:
internal/cooperative_lookback.h
is introduced to separate the lookback implementation from inclusive scan. We additionally provide an optimized path based on bit packing status and prefix flags for certain types.internal/work_group/work_group_scan.h
andinternal/sub_group/sub_group_scan.h
implement optimized work-group and sub-group level scans, enabling support for user defined operators beyond SYCL function objects with known identities. This implementation and API may be used to propose improvements to SYCL itself as it has been shown to perform significantly better.The initial implementation only supported SYCL function objects with known identities. User defined operators are now supported so long as the type satisfies the following conditions: