Draft
Commits
110 commits
3f00f89
first draft (unchecked)
danhoeflinger Dec 19, 2025
aa57a76
access pattern now contiguous within wi
danhoeflinger Dec 29, 2025
74f52cc
remove peer kernel; use subgroup scan if avail
danhoeflinger Dec 29, 2025
f68f634
move lazy storage out of loop
danhoeflinger Dec 29, 2025
a09415f
remove subgroup scan
danhoeflinger Dec 29, 2025
d01e9e5
handle large radix compared to sg size
danhoeflinger Dec 29, 2025
01b1b89
simplify heirarchical scan
danhoeflinger Dec 29, 2025
8dcb9d5
parallelize across subgroups
danhoeflinger Dec 30, 2025
a282dc9
testing out subgroup (assume 32 sgsize)
danhoeflinger Dec 30, 2025
d4a0bb6
limiting overhead
danhoeflinger Dec 30, 2025
c9d12f5
typo
danhoeflinger Dec 30, 2025
c705c5b
undo parallelization across subgroup
danhoeflinger Dec 30, 2025
7e29487
back to sycl builtin
danhoeflinger Dec 30, 2025
64e646b
debug wg / sg size
danhoeflinger Dec 30, 2025
337cf52
increasing wg size
danhoeflinger Dec 30, 2025
e773aa4
fix over/underflow issues
danhoeflinger Dec 30, 2025
a17b9fc
up workgroup sizes
danhoeflinger Dec 30, 2025
d808865
remove debugging
danhoeflinger Dec 30, 2025
22a2e2e
debugging, reorder wg size
danhoeflinger Dec 30, 2025
000d263
tuning, remove debugging
danhoeflinger Jan 6, 2026
1ae8004
restructure count kernel
danhoeflinger Jan 12, 2026
8ffe5a9
killing debug output
danhoeflinger Jan 12, 2026
61a2f41
some tuning
danhoeflinger Jan 12, 2026
64bf05b
aligning wg size
danhoeflinger Jan 12, 2026
72e770e
try other layout
danhoeflinger Jan 12, 2026
8e1eb8a
onewg try better copy out access pattern
danhoeflinger Jan 12, 2026
17f4a77
tuning experiments
danhoeflinger Jan 13, 2026
8cd28ac
tuning tests
danhoeflinger Jan 14, 2026
ebfc351
unroll to try to improve read access pattern
danhoeflinger Jan 15, 2026
aed6ae0
testing alternate strategy reorder
danhoeflinger Jan 15, 2026
3b37ceb
Revert "testing alternate strategy reorder"
danhoeflinger Jan 15, 2026
8fe8fde
reorder kernel with SLM copy of input
danhoeflinger Jan 15, 2026
71a2dfd
Revert "reorder kernel with SLM copy of input"
danhoeflinger Jan 16, 2026
111c6fe
Revert "Revert "testing alternate strategy reorder""
danhoeflinger Jan 16, 2026
7bcbcd9
small steps using registers
danhoeflinger Jan 19, 2026
971aa0d
crank back up tuning
danhoeflinger Jan 19, 2026
8a9c235
cache to SLM, not registers
danhoeflinger Jan 20, 2026
9aa4da6
global offsets to SLM
danhoeflinger Jan 20, 2026
f3e8072
back to registers for values
danhoeflinger Jan 20, 2026
267605e
Revert "back to registers for values"
danhoeflinger Jan 20, 2026
34a4e86
Revert "global offsets to SLM"
danhoeflinger Jan 20, 2026
0b97bd5
Revert "cache to SLM, not registers"
danhoeflinger Jan 20, 2026
097c9d0
Revert "crank back up tuning"
danhoeflinger Jan 20, 2026
f59b444
Revert "small steps using registers"
danhoeflinger Jan 20, 2026
89abbb3
Revert "Revert "Revert "testing alternate strategy reorder"""
danhoeflinger Jan 20, 2026
9d95fe5
test opus solution
danhoeflinger Jan 20, 2026
1dcae81
wider vector
danhoeflinger Jan 20, 2026
896cab0
going back to simpler version
danhoeflinger Jan 20, 2026
1b0635c
Revert "going back to simpler version"
danhoeflinger Jan 20, 2026
36dd487
allow better vectorization
danhoeflinger Jan 20, 2026
c398c94
Revert "allow better vectorization"
danhoeflinger Jan 20, 2026
ca9258b
Revert "Revert "going back to simpler version""
danhoeflinger Jan 20, 2026
29f01e2
attempt better vectorized writes with simple approach
danhoeflinger Jan 20, 2026
4bf5cd0
Revert "attempt better vectorized writes with simple approach"
danhoeflinger Jan 20, 2026
bddb8d2
simplify count kernel
danhoeflinger Jan 21, 2026
eceb262
getting back unrolling for global memory access
danhoeflinger Jan 21, 2026
96d8203
swap memory layout
danhoeflinger Jan 21, 2026
605f088
partial tree reduction
danhoeflinger Jan 21, 2026
13a618c
cutting broadcasts
danhoeflinger Jan 21, 2026
e0ef3b8
simplified tuning (none)
danhoeflinger Jan 21, 2026
8e128d2
fixing keys per work item
danhoeflinger Jan 21, 2026
ca3ce78
adding some tuning for smaller sizes to keep up number of wgs
danhoeflinger Jan 21, 2026
c6442e0
fixing tuning min, remove unnecessary assert
danhoeflinger Jan 21, 2026
0d28440
separate full vs remainder and explicitly tune
danhoeflinger Jan 21, 2026
9215b28
Revert "separate full vs remainder and explicitly tune"
danhoeflinger Jan 22, 2026
e52c4f8
Revert "partial tree reduction"
danhoeflinger Jan 22, 2026
82a6b79
Revert "swap memory layout"
danhoeflinger Jan 22, 2026
cf3909c
Revert "getting back unrolling for global memory access"
danhoeflinger Jan 22, 2026
ff45c89
Revert "simplify count kernel"
danhoeflinger Jan 22, 2026
dfb50a5
separate full blocks and remainder
danhoeflinger Jan 22, 2026
6b98123
remove unnecessary assert
danhoeflinger Jan 22, 2026
fbcc680
adjust tuning
danhoeflinger Jan 22, 2026
4741908
allow more threads to participate in reduction
danhoeflinger Jan 22, 2026
dadc8c9
remove kernel bundle dependency
danhoeflinger Jan 22, 2026
20648bc
slm init with 32 bit at a time
danhoeflinger Jan 22, 2026
2a384f4
better fix for removing kernel compilation
danhoeflinger Jan 22, 2026
2986866
count into registers, not SLM
danhoeflinger Jan 22, 2026
fd0d977
Revert "count into registers, not SLM"
danhoeflinger Jan 22, 2026
ab8f66f
loop reordering to better vectorization
danhoeflinger Jan 22, 2026
168eb3e
tree reduction
danhoeflinger Jan 23, 2026
79da613
updated tuning
danhoeflinger Jan 23, 2026
695d4eb
unifying kernels even / odd
danhoeflinger Jan 26, 2026
b8dd541
improvements for copy optim
danhoeflinger Jan 27, 2026
a422f94
re-checking non unrolled version
danhoeflinger Jan 27, 2026
9788884
going back to previous copy kernel
danhoeflinger Jan 27, 2026
9f688b7
remove const
danhoeflinger Jan 27, 2026
3ca849c
unrolling with contiguous reads
danhoeflinger Jan 27, 2026
ca3c9ec
unroll by 8
danhoeflinger Jan 27, 2026
f84e530
remove extra loop for destruction
danhoeflinger Jan 27, 2026
48ef069
refactor to better abstract out code from iterations
danhoeflinger Jan 27, 2026
2551082
eliminating dead code
danhoeflinger Jan 28, 2026
918e9ac
scan with more threads
danhoeflinger Jan 28, 2026
b8101ca
fix for case where less subgroups than radix states
danhoeflinger Jan 28, 2026
c636df7
Signed-off-by: Dan Hoeflinger <[email protected]>
danhoeflinger Jan 28, 2026
4f463f1
cleanup
danhoeflinger Jan 28, 2026
531b371
make wgsize a runtime param
danhoeflinger Jan 28, 2026
f44c76f
further reduce number of kernels
danhoeflinger Jan 28, 2026
4917cf9
formatting
danhoeflinger Jan 28, 2026
64a1f1c
cleanup
danhoeflinger Jan 28, 2026
07af9ad
remove onewg specific changes
danhoeflinger Jan 28, 2026
2d72869
cleanup
danhoeflinger Jan 29, 2026
47b9859
adding static assertion for overflow protection
danhoeflinger Jan 29, 2026
34cb172
reorder wg size
danhoeflinger Jan 29, 2026
2fd6fd3
removing unused variables
danhoeflinger Jan 29, 2026
5f4f12b
Signed-off-by: Dan Hoeflinger <[email protected]>
danhoeflinger Jan 29, 2026
eb7a0d6
formatting
danhoeflinger Jan 29, 2026
70bdc5c
remove unused types
danhoeflinger Jan 29, 2026
6ab231d
adding comment
danhoeflinger Jan 29, 2026
de151b1
formatting
danhoeflinger Jan 29, 2026
4099d7f
removing redundant SLM storage in reorder (factor of 2 reduction)
danhoeflinger Jan 29, 2026