feat(cuda): single-pass dispatch plan builder by 0ax1 · Pull Request #7197 · vortex-data/vortex

0ax1 · 2026-03-27T20:14:07Z

Refactor the dynamic dispatch plan builder to walk the encoding tree exactly once, discovering unfusable subtrees and computing shared memory requirements in the same pass. The result is a 3-variant enum (`Fused`, `PartiallyFused`, `Unfusable`) that replaces the previous `Result<Option<>>` API and eliminates the separate `find_unfusable_nodes` traversal.

Shared memory is now validated upfront in `DispatchPlan::new` - before any subtree kernels are executed - so we never pay GPU cost for a plan that will not fit.

The plan stages are split into `smem_stages` (fully decoded into persistent shared memory) and `output_stage` (tiled through a scratch region), making the two-phase kernel execution model explicit in the host-side data structures. Shared memory allocation invariants are documented on `FusedPlan`.
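For readers without the diff at hand, here is a rough sketch of the shape described above: one walk that collects stages, records unfusable subtrees, and sums shared memory, followed by an upfront limit check. Only the names `DispatchPlan`, `FusedPlan`, `Fused`, `PartiallyFused`, `Unfusable`, `smem_stages`, and `output_stage` come from this description. `Node`, `PlanError`, `walk`, the field layout, and the `new` signature are assumptions made for illustration and likely differ from the actual code.

```rust
// Hypothetical sketch; not the actual vortex-cuda types or signatures.

/// Toy encoding tree node (assumed shape).
#[derive(Debug, Clone)]
struct Node {
    name: &'static str,
    /// Whether this node can run inside the fused kernel.
    fusable: bool,
    /// Persistent shared memory needed when this node is fully decoded.
    smem_bytes: usize,
    children: Vec<Node>,
}

#[derive(Debug, PartialEq)]
struct FusedPlan {
    /// Stages decoded fully into persistent shared memory.
    smem_stages: Vec<&'static str>,
    /// Final stage, tiled through a scratch region.
    output_stage: &'static str,
    /// Persistent bytes plus scratch bytes.
    smem_total: usize,
}

#[derive(Debug, PartialEq)]
enum DispatchPlan {
    Fused(FusedPlan),
    PartiallyFused { plan: FusedPlan, pending_subtrees: Vec<&'static str> },
    Unfusable,
}

#[derive(Debug, PartialEq)]
enum PlanError {
    SharedMemoryExceeded { needed: usize, limit: usize },
}

/// Single traversal: collects fusable stages, records unfusable subtrees,
/// and accumulates persistent shared memory in the same pass.
fn walk(
    node: &Node,
    smem_stages: &mut Vec<&'static str>,
    pending: &mut Vec<&'static str>,
    persistent: &mut usize,
) {
    for child in &node.children {
        if child.fusable {
            // Visit inputs first so they precede their consumers.
            walk(child, smem_stages, pending, persistent);
            smem_stages.push(child.name);
            *persistent += child.smem_bytes;
        } else {
            // Do not descend: the whole subtree must run as separate kernels.
            pending.push(child.name);
        }
    }
}

impl DispatchPlan {
    fn new(root: &Node, scratch_bytes: usize, smem_limit: usize) -> Result<Self, PlanError> {
        if !root.fusable {
            return Ok(DispatchPlan::Unfusable);
        }
        let (mut smem_stages, mut pending, mut persistent) = (Vec::new(), Vec::new(), 0usize);
        walk(root, &mut smem_stages, &mut pending, &mut persistent);

        // Validate before any pending subtree would be executed.
        let needed = persistent + scratch_bytes;
        if needed > smem_limit {
            return Err(PlanError::SharedMemoryExceeded { needed, limit: smem_limit });
        }
        let plan = FusedPlan { smem_stages, output_stage: root.name, smem_total: needed };
        Ok(if pending.is_empty() {
            DispatchPlan::Fused(plan)
        } else {
            DispatchPlan::PartiallyFused { plan, pending_subtrees: pending }
        })
    }
}
```

In this sketch a caller would match on the three variants instead of unwrapping a nested `Result<Option<>>`, and the `Err` path returns before any subtree kernel is launched.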

codspeed-hq · 2026-03-27T20:18:03Z

Merging this PR will degrade performance by 15.51%

❌ 2 regressed benchmarks
✅ 1104 untouched benchmarks
⏩ 1522 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[1024]` | 477.2 ns | 535.6 ns | -10.89% |
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[128]` | 317.8 ns | 376.1 ns | -15.51% |

Comparing `ad/cuda-next` (b4adfcd) with `develop` (d0ed3fc)

¹ 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived on CodSpeed to remove them from the performance reports.

0ax1 · 2026-03-27T20:40:21Z

Right now we're loading all dict values and all runend ends & codes into GPU shared memory. I didn't change this in this PR, to keep its scope small. In a follow-up I want to change the logic to tile/window through dict and runend with a fixed shared memory budget. In the case of dict, this might be based on some LRU cache indirection or similar.
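The follow-up is only outlined here, so the following is a purely illustrative host-side sketch of the fixed-budget idea: splitting a buffer into windows that each fit a given shared memory budget. The function `tile_windows` and its parameters are hypothetical and are not part of this PR or of Vortex. Sequential windows like these suit data consumed in order, such as run ends. Dict values are looked up at arbitrary indices, so they would need a different scheme, such as the cache indirection mentioned above.

```rust
use std::ops::Range;

/// Hypothetical helper: split `len` elements of `elem_bytes` each into
/// consecutive index windows that each fit within `budget_bytes`.
/// Returns `None` if not even a single element fits in the budget.
fn tile_windows(len: usize, elem_bytes: usize, budget_bytes: usize) -> Option<Vec<Range<usize>>> {
    if elem_bytes == 0 {
        return None;
    }
    let per_tile = budget_bytes / elem_bytes;
    if per_tile == 0 {
        return None;
    }
    Some(
        (0..len)
            .step_by(per_tile)
            // The last window is clamped to the end of the buffer.
            .map(|start| start..(start + per_tile).min(len))
            .collect(),
    )
}
```

With 4-byte elements and a 16-byte budget, ten elements split into windows of at most four elements each, and the kernel would load one window at a time instead of the whole buffer.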


0ax1 added the `changelog/feature` label Mar 27, 2026

0ax1 requested review from a10y and robert3005 March 27, 2026 20:17

0ax1 enabled auto-merge (squash) March 27, 2026 20:17

0ax1 force-pushed the ad/cuda-next branch 2 times, most recently from 92fe0d8 to 4246f4e Compare March 27, 2026 20:26

0ax1 requested a review from AdamGS March 27, 2026 20:32

0ax1 force-pushed the ad/cuda-next branch from 4246f4e to 63c9bd2 Compare March 27, 2026 20:39

0ax1 force-pushed the ad/cuda-next branch 9 times, most recently from ee300a4 to 3dced6e Compare March 28, 2026 00:11

0ax1 added 3 commits March 28, 2026 19:03

chore: minor cleanup

30e87e7

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

docs: adjust comment

76445a0

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

0ax1 force-pushed the ad/cuda-next branch from 8cf8c85 to 76445a0 Compare March 28, 2026 19:04

0ax1 added 3 commits March 28, 2026 19:56

well actually

ecbec13

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

prefer as_ cast

2283116

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

docs

b4adfcd

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

0ax1 force-pushed the ad/cuda-next branch from ed6a886 to b4adfcd Compare March 29, 2026 08:17

0ax1 wants to merge 6 commits into `develop` from `ad/cuda-next`
