Skip to content

Conversation

mmichel11
Copy link
Contributor

@mmichel11 mmichel11 commented Jul 27, 2025

This PR reworks our scan kernel template by switching from an SLM driven scan where all data is loaded and operated in SLM to a sub-group driven scan where data is primarily stored and operated on in hardware registers. Additionally, this PR adds the building blocks necessary to implement future scan-like kernel templates.

The following changes are made:

  • internal/cooperative_lookback.h is introduced to separate the lookback implementation from inclusive scan. We additionally provide an optimized path based on bit packing status and prefix flags for certain types.
  • internal/work_group/work_group_scan.h and internal/sub_group/sub_group_scan.h implement optimized work-group and sub-group level scans, enabling support for user defined operators beyond SYCL function objects with known identities. This implementation and API may be used to propose improvements to SYCL itself as it has been shown to perform significantly better.
  • The reworked implementation also serves as a single work-group scan without any extra kernels compiled. At runtime we determine if a single tile is present and skip temporary storage allocation and the init kernel if so. The existing single work-group path is removed as it does not perform better.

The initial implementation only supported SYCL function objects with known identities. User defined operators are now supported so long as the type satisfies the following conditions:

  • Scan input type must be convertible to output type
  • Type must be copy constructible / assignable.

@mmichel11 mmichel11 added this to the 2022.10.0 milestone Jul 27, 2025
@mmichel11 mmichel11 requested a review from Copilot July 28, 2025 17:03
Copilot

This comment was marked as outdated.

@mmichel11 mmichel11 requested a review from Copilot July 28, 2025 17:35
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR reworks the scan kernel template implementation by transitioning from an SLM-driven approach to a sub-group-driven approach that primarily operates on hardware registers. It also adds support for user-defined binary operators and optimizes the implementation.

Key changes:

  • Introduces modular cooperative lookback implementation with bit packing optimizations
  • Implements optimized work-group and sub-group level scans with register-based operations
  • Adds support for custom binary operators beyond SYCL function objects with known identities

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
test/kt/single_pass_scan.cpp Adds custom binary operator test (my_bit_xor) to verify user-defined function object support
test/kt/CMakeLists.txt Expands test coverage by adding uint8_t and uint16_t data types and additional test configurations
include/oneapi/dpl/experimental/kt/single_pass_scan.h Core implementation rewrite - switches to sub-group driven scan with register-based operations and modular lookback
include/oneapi/dpl/experimental/kt/internal/work_group/work_group_scan.h New optimized work-group scan implementation using sub-group level operations
include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h New sub-group scan implementation operating on register arrays in strided layout
include/oneapi/dpl/experimental/kt/internal/cooperative_lookback.h Extracted cooperative lookback implementation with specialized bit-packed storage for small types
Comments suppressed due to low confidence (1)

include/oneapi/dpl/experimental/kt/single_pass_scan.h:314

  • There is an extra closing brace that doesn't match any opening brace. This will cause a compilation error.
    }

@mmichel11 mmichel11 marked this pull request as ready for review July 28, 2025 21:57
@mmichel11 mmichel11 changed the title [Draft] Optimize scan KT and support user defined binary operators Optimize scan KT and support user defined binary operators Jul 28, 2025
Copy link
Member

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it through cooperative_lookback.h. I still need to review the rest.

@mmichel11 mmichel11 force-pushed the dev/mmichel11/kt_scan_optimize branch from f9aa36c to 1486a10 Compare September 2, 2025 13:38
Copy link
Member

@adamfidel adamfidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through the changes and I don't see any glaring issues. I just have a few nitpicks but I think it's in a good shape to merge.

}

// For initialization routines, we do not need atomicity, so we can write through the
// reference directly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this comment to line 128 (right before the direct write)? I think it would be clearer what it is referring to. Same for the set_init function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.

_StatusValues __status_vals_partial;
std::size_t __current_num_items;
_TileVals __tile_vals;
std::size_t __num_tiles;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::size_t __num_tiles;
_TileIdxT __num_tiles;

The tile index is 32-bits, so we can probably restrict the size of the number of tiles here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I had to change a few other types as well.

@danhoeflinger
Copy link
Contributor

Overall this looks good. If we fix the outdated comment mentioning default initialized value type #2357 (comment), and either fix or make an issue for the zeroth tile performance issue #2357 (comment), then I'm ready to approve.

The rest are nitpicks which you can apply or ignore.

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
The overhead of avoiding decoupled lookback and atomic counter is
really low. Additionally, we avoid temporary storage allocation and init
kernel at runtime.

In total, we just compile init kernel + main scan kernel for both single
tile and multi tile versions.

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
It does not always look back anymore for the single tile case and lookback
is now implemented separate from the core implementation.

Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
Signed-off-by: Matthew Michel <[email protected]>
@mmichel11 mmichel11 force-pushed the dev/mmichel11/kt_scan_optimize branch from a32b4de to 1bc2b67 Compare September 11, 2025 16:19
Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with green CI. Thanks for the quick responses.

@mmichel11 mmichel11 merged commit a7091ba into main Sep 11, 2025
19 checks passed
@mmichel11 mmichel11 deleted the dev/mmichel11/kt_scan_optimize branch September 11, 2025 18:25
timmiesmith pushed a commit that referenced this pull request Sep 11, 2025
* Implements optimized sub-group and work-group scans for use in our scan KT.
* Optimizes cooperative lookback to minimize global atomic memory traffic.
* Enables support for user-defined binary operators.

---------

Signed-off-by: Matthew Michel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants