feat(cuda): single-pass dispatch plan builder #7197

Open
0ax1 wants to merge 6 commits into develop from ad/cuda-next

0ax1 commented Mar 27, 2026

Refactor the dynamic dispatch plan builder to walk the encoding tree exactly once, discovering unfusable subtrees and computing shared memory requirements in the same pass. The result is a 3-variant enum (`Fused`, `PartiallyFused`, `Unfusable`) that replaces the previous `Result<Option<>>` API and eliminates the separate `find_unfusable_nodes` traversal.
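
As a rough illustration of the single-pass idea, the sketch below walks a toy encoding tree once, accumulating shared memory bytes and collecting unfusable subtrees in the same traversal. Everything except the three variant names is an assumption made for the example: `Node`, `Walk`, `visit`, `build_plan`, the fields on the variants, and the rule that an unfusable root yields `Unfusable` are hypothetical and not taken from the PR.

```rust
/// Hypothetical encoding tree; node kinds and byte counts are illustrative only.
enum Node {
    /// A leaf the fused kernel can read directly.
    Primitive,
    /// A fusable encoding that needs `smem_bytes` of shared memory.
    Fusable { smem_bytes: usize, children: Vec<Node> },
    /// An encoding the fused kernel cannot handle (assumed to need its own kernel).
    Opaque { name: &'static str },
}

/// Three-variant result, named after the variants in the PR description.
/// The fields are assumptions for this sketch.
#[derive(Debug, PartialEq)]
enum DispatchPlan {
    Fused { smem_bytes: usize },
    PartiallyFused { smem_bytes: usize, unfusable: Vec<&'static str> },
    Unfusable,
}

/// Accumulator threaded through the single walk.
#[derive(Default)]
struct Walk {
    smem_bytes: usize,
    unfusable: Vec<&'static str>,
}

/// Visit each node exactly once, updating both pieces of state together.
fn visit(node: &Node, acc: &mut Walk) {
    match node {
        Node::Primitive => {}
        Node::Fusable { smem_bytes, children } => {
            acc.smem_bytes += *smem_bytes;
            for child in children {
                visit(child, acc);
            }
        }
        // Do not descend: the whole subtree is handed off as one unit.
        Node::Opaque { name } => acc.unfusable.push(*name),
    }
}

fn build_plan(root: &Node) -> DispatchPlan {
    // Assumption: if the root itself is unfusable there is nothing to fuse into.
    if matches!(root, Node::Opaque { .. }) {
        return DispatchPlan::Unfusable;
    }
    let mut acc = Walk::default();
    visit(root, &mut acc);
    if acc.unfusable.is_empty() {
        DispatchPlan::Fused { smem_bytes: acc.smem_bytes }
    } else {
        DispatchPlan::PartiallyFused {
            smem_bytes: acc.smem_bytes,
            unfusable: acc.unfusable,
        }
    }
}
```

Returning an enum lets callers `match` on the three outcomes directly instead of unpacking a nested `Result<Option<_>>`.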

Shared memory is now validated upfront in `DispatchPlan::new`, before any subtree kernels are executed, so we never pay GPU cost for a plan that will not fit.
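
A minimal sketch of the validate-before-launch ordering, under stated assumptions: `Plan`, `Subtree`, `PlanError`, and `execute` are hypothetical stand-ins, and the description does not say how the real `DispatchPlan::new` reports a plan that does not fit, so the `Result` return here is only one possible shape.

```rust
/// Hypothetical description of one subtree that needs its own kernel launch.
struct Subtree {
    name: &'static str,
}

/// Hypothetical error type; the PR does not specify how the failure is surfaced.
#[derive(Debug, PartialEq)]
enum PlanError {
    SharedMemoryExceeded { required: usize, limit: usize },
}

struct Plan {
    smem_bytes: usize,
    subtrees: Vec<Subtree>,
}

impl Plan {
    /// Check the shared memory requirement at construction time,
    /// so a plan that cannot fit is rejected before any kernel runs.
    fn new(
        smem_bytes: usize,
        subtrees: Vec<Subtree>,
        smem_limit: usize,
    ) -> Result<Self, PlanError> {
        if smem_bytes > smem_limit {
            return Err(PlanError::SharedMemoryExceeded {
                required: smem_bytes,
                limit: smem_limit,
            });
        }
        Ok(Plan { smem_bytes, subtrees })
    }

    /// Stand-in for launching the subtree kernels; returns names in launch order.
    /// Only reachable for plans that passed validation in `new`.
    fn execute(&self) -> Vec<&'static str> {
        self.subtrees.iter().map(|s| s.name).collect()
    }
}
```

Because `execute` only exists on a successfully constructed `Plan`, the type system enforces that validation happens first.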

The plan stages are split into `smem_stages` (fully decoded into persistent shared memory) and `output_stage` (tiled through a scratch region), making the two-phase kernel execution model explicit in the host-side data structures. Shared memory allocation invariants are documented on `FusedPlan`.
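
To make the split concrete, here is a hedged sketch of how a host-side layout could be derived from the two kinds of stages. Only the names `FusedPlan`, `smem_stages`, and `output_stage` come from the description; `Stage`, `Layout`, `layout`, and the choice to place the scratch region after the persistent stages are assumptions, and alignment is ignored for brevity.

```rust
/// Hypothetical stage descriptor: how many shared memory bytes it occupies.
struct Stage {
    smem_bytes: usize,
}

struct FusedPlan {
    /// Stages fully decoded into persistent shared memory.
    smem_stages: Vec<Stage>,
    /// Final stage; its `smem_bytes` is the scratch tile size.
    output_stage: Stage,
}

#[derive(Debug, PartialEq)]
struct Layout {
    /// Byte offset of each persistent stage, in order.
    stage_offsets: Vec<usize>,
    /// Byte offset where the scratch region begins.
    scratch_offset: usize,
    /// Total shared memory the kernel would request.
    total_bytes: usize,
}

impl FusedPlan {
    /// Pack persistent stages back to back, then append the scratch region.
    fn layout(&self) -> Layout {
        let mut offset = 0usize;
        let mut stage_offsets = Vec::with_capacity(self.smem_stages.len());
        for stage in &self.smem_stages {
            stage_offsets.push(offset);
            offset += stage.smem_bytes;
        }
        Layout {
            stage_offsets,
            scratch_offset: offset,
            total_bytes: offset + self.output_stage.smem_bytes,
        }
    }
}
```

In this sketch `total_bytes` is the figure an upfront shared memory check would compare against the device limit.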

0ax1 added the changelog/feature (A new feature) label Mar 27, 2026
0ax1 requested review from a10y and robert3005 March 27, 2026 20:17
0ax1 enabled auto-merge (squash) March 27, 2026 20:17

codspeed-hq bot commented Mar 27, 2026

Merging this PR will degrade performance by 15.51%

❌ 2 regressed benchmarks
✅ 1104 untouched benchmarks
⏩ 1522 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 477.2 ns | 535.6 ns | -10.89% |
| Simulation | bitwise_not_vortex_buffer_mut[128] | 317.8 ns | 376.1 ns | -15.51% |

Comparing ad/cuda-next (b4adfcd) with develop (d0ed3fc)

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

0ax1 force-pushed the ad/cuda-next branch 2 times, most recently from 92fe0d8 to 4246f4e Compare March 27, 2026 20:26
0ax1 requested a review from AdamGS March 27, 2026 20:32

0ax1 commented Mar 27, 2026

Right now we're loading all dict values and all runend ends & codes into GPU shared memory. I didn't change this in this PR, to keep it scoped. In a follow-up I want to change the logic to tile/window through dict and runend with a fixed shared memory budget. For dict, this might be based on some LRU cache indirection or similar.
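
For illustration only, the windowing part of that follow-up could look roughly like the sketch below: given a fixed byte budget, split the values into index ranges that each fit. The function name `windows` and its parameters are hypothetical, and the LRU indirection mentioned for dict is not modeled here.

```rust
/// Split `len` elements of `elem_bytes` each into half-open index ranges
/// `(start, end)` such that every range fits in `budget_bytes`.
fn windows(len: usize, elem_bytes: usize, budget_bytes: usize) -> Vec<(usize, usize)> {
    assert!(elem_bytes > 0, "element size must be non-zero");
    let per_window = budget_bytes / elem_bytes;
    assert!(per_window > 0, "budget must hold at least one element");
    (0..len)
        .step_by(per_window)
        // The last window is clamped so it does not run past `len`.
        .map(|start| (start, (start + per_window).min(len)))
        .collect()
}
```

With a budget like this, shared memory use stays constant regardless of how many dict values or run ends the array holds, at the cost of one pass per window.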

0ax1 force-pushed the ad/cuda-next branch 9 times, most recently from ee300a4 to 3dced6e Compare March 28, 2026 00:11
0ax1 added 3 commits March 28, 2026 19:03
Refactor the dynamic dispatch plan builder to walk the encoding tree
exactly once, discovering unfusable subtrees and computing shared memory
requirements in the same pass. The result is a 3-variant enum (`Fused`,
`PartiallyFused`, `Unfusable`) that replaces the previous
`Result<Option<>>` API and eliminates the separate
`find_unfusable_nodes` traversal.

Shared memory is now validated upfront in `DispatchPlan::new` — before
any subtree kernels are executed — so we never pay GPU cost for a plan
that will not fit.

The plan stages are split into `smem_stages` (fully decoded into
persistent shared memory) and `output_stage` (tiled through a scratch
region), making the two-phase kernel execution model explicit in the
host-side data structures. Shared memory allocation invariants are
documented on `FusedPlan`.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 added 3 commits March 28, 2026 19:56
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>