Partition Set Algorithm Balanced Path #2318
Pull Request Overview
This PR introduces a partitioning phase to the set algorithms, enhancing performance for large input sizes by establishing binary search boundaries that optimize cache usage. Key changes include updating the __gen_set_balanced_path template to accept an additional bounds provider parameter, adding new helper functions (__decode_balanced_path_temp_data, __encode_balanced_path_temp_data) for balanced path processing, and integrating a new partition kernel for the balanced path phase.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/general/implementation_details/device_copyable.pass.cpp | Updated static_asserts to include new bounds provider parameter |
| include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h | Modified __gen_set_balanced_path specialization to include _BoundsProvider |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h | Added helper functions & modified balanced path computation to include partitioning support and safeguard against out-of-bound element access |
| include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h | Updated __parallel_set_reduce_then_scan integration with new bounds provider and partitioning kernel |
Comments suppressed due to low confidence (4)
include/oneapi/dpl/pstl/hetero/dpcpp/sycl_traits.h:458
- The specialization of __gen_set_balanced_path now includes the _BoundsProvider parameter; please verify that all downstream usages are updated accordingly to maintain consistent API behavior.
template <typename _SetOpCount, typename _BoundsProvider, typename _Compare>
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl.h:1086
- [nitpick] When constructing _GenReduceInput with the new _BoundsProvider, it would be helpful to document the role of __diagonal_spacing and __partition_size in determining partition sizes, ensuring that readers understand how these values impact performance.
_BoundsProvider{__diagonal_spacing, __partition_size}, __comp};
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:698
- The change from returning 0 to clamping __i_elem to __rng1.size() + __rng2.size() - 1 may affect the algorithm's edge-case handling; please confirm that this adjusted behavior correctly reflects the intended semantics.
if (__i_elem >= __rng1.size() + __rng2.size())
include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_reduce_then_scan.h:821
- [nitpick] Retrieving __tile_size from __gen_input.__get_bounds is critical for partitioning; please ensure that __tile_size is always correctly initialized and consistent for various input sizes to avoid unexpected partition boundaries.
std::size_t __tile_size = __gen_input.__get_bounds.__tile_size;
I have made a first pass over the implementation. I like how the patch unifies handling between the partitioned and non-partitioned bounds.
This PR looks near ready to me. The conflicts with main need to be addressed.
// Calculate the max location to search in the second set for future repeats, limiting to the edge of the range
_Index __fwd_search_bound = std::min(__merge_path_rng2 + __fwd_search_count, __rng2.size());
using _SizeType = decltype(std::get<0>(__in_rng.tuple()).size());
No action is necessary on your side here, but I see we use something along the lines of decltype(__rng.size()) frequently in this patch. It looks like we have a oneapi::dpl::__internal::__range_size_t, but only for C++20 or later. Perhaps it would be worth extending this to C++17 with internal ranges.
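To illustrate the suggestion, here is a minimal sketch of a size-type alias that works in C++17. It is not the oneDPL implementation: `range_size_t`, `small_range`, and `combined_size` are hypothetical names chosen for this example, and the sketch only assumes that the range exposes a `size()` member, as the `decltype(__rng.size())` pattern in the patch does.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>

// Hypothetical C++17 alias: the type returned by calling size() on a range.
// It replaces repeated decltype(rng.size()) spellings with one name.
template <typename Range>
using range_size_t = decltype(std::declval<Range&>().size());

// Hypothetical range whose size type is narrower than std::size_t.
struct small_range
{
    constexpr std::uint16_t size() const { return 2; }
};

// The alias reports whatever size() returns, not a fixed type.
static_assert(std::is_same_v<range_size_t<std::array<int, 3>>, std::size_t>);
static_assert(std::is_same_v<range_size_t<small_range>, std::uint16_t>);

// Example use: add the sizes of two ranges in their common size type.
template <typename Range1, typename Range2>
constexpr auto combined_size(const Range1& r1, const Range2& r2)
{
    using size_type = std::common_type_t<range_size_t<const Range1>, range_size_t<const Range2>>;
    return static_cast<size_type>(r1.size()) + static_cast<size_type>(r2.size());
}
```

An alias like this keeps the size type tied to the range rather than hard-coding `std::size_t`, which is the property the repeated `decltype` expressions in the patch rely on.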
Signed-off-by: Dan Hoeflinger <[email protected]>
LGTM
Reapproving
LGTM
LGTM
Add a partitioning kernel to the balanced path algorithm of the set APIs.
Adds a partitioning phase which does a sparse pass over the input data to establish binary search boundaries for the main run. This allows the memory access pattern of the main kernels to fit within L1 cache when they perform the binary searches that establish balanced path intersections.
This improves performance of the set algorithms for large input sizes. (When combined with #2317, it provides a nice combination of performance improvements for both large and small sizes of the set algorithms.)
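The two-phase idea can be sketched on the host in plain C++. This is a simplified illustration, not the code from this PR: it uses an ordinary merge path search rather than the balanced path variant (which also accounts for repeated elements), it runs serially instead of in SYCL kernels, and every name in it (`search_diagonal`, `partition_bounds`, `search_diagonal_partitioned`, `tile_size`) is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Binary search along one diagonal of the merge grid of sorted inputs a and b.
// Returns how many of the first `diag` merged elements come from `a`
// (on ties, elements of `a` are taken first). Only counts in [lo, hi] are considered.
std::size_t search_diagonal(const std::vector<int>& a, const std::vector<int>& b,
                            std::size_t diag, std::size_t lo, std::size_t hi)
{
    assert(diag <= a.size() + b.size());
    // Intersect the caller's bounds with the range that keeps both indices valid.
    lo = std::max(lo, diag > b.size() ? diag - b.size() : std::size_t{0});
    hi = std::min(hi, std::min(diag, a.size()));
    while (lo < hi)
    {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] <= b[diag - 1 - mid])
            lo = mid + 1; // a[mid] is among the first `diag` merged elements
        else
            hi = mid;
    }
    return lo;
}

// Phase 1 (sparse pass): search only every tile_size-th diagonal over the full input.
std::vector<std::size_t> partition_bounds(const std::vector<int>& a, const std::vector<int>& b,
                                          std::size_t tile_size)
{
    assert(tile_size > 0);
    const std::size_t total = a.size() + b.size();
    const std::size_t num_tiles = (total + tile_size - 1) / tile_size;
    std::vector<std::size_t> bounds(num_tiles + 1);
    for (std::size_t t = 0; t <= num_tiles; ++t)
        bounds[t] = search_diagonal(a, b, std::min(t * tile_size, total), 0, a.size());
    return bounds;
}

// Phase 2 (main run): search any diagonal, but only between the two
// precomputed intersections that enclose its tile.
std::size_t search_diagonal_partitioned(const std::vector<int>& a, const std::vector<int>& b,
                                        const std::vector<std::size_t>& bounds,
                                        std::size_t tile_size, std::size_t diag)
{
    const std::size_t tile = diag / tile_size;
    const std::size_t next = std::min(tile + 1, bounds.size() - 1);
    return search_diagonal(a, b, diag, bounds[tile], bounds[next]);
}
```

Because the intersection index never decreases as the diagonal advances, the bounded search returns the same result as a search over the whole input while touching only elements near the tile. That narrower working set is what the description above credits for the improved cache behavior.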