
Add SIMD Shuffle/Reduction Support to PTX Backend #813

Open
mikepapadim wants to merge 9 commits into `develop` from `feature/ptx-simd-shuffle`

Conversation

@mikepapadim (Member)

Summary

Addresses #812 for PTX

This PR adds PTX backend support for the SIMD-group intrinsics introduced for Metal in PR #796. The three KernelContext methods — simdShuffleDown, simdSum, and simdBroadcastFirst — are now lowered to CUDA's shfl.sync warp-shuffle instructions when targeting the PTX backend.

Mapping

| KernelContext API | PTX Instruction |
| --- | --- |
| `simdShuffleDown(val, d)` | `shfl.sync.down.b32 dest, val, d, 31, 0xFFFFFFFF;` |
| `simdBroadcastFirst(val)` | `shfl.sync.idx.b32 dest, val, 0, 31, 0xFFFFFFFF;` |
| `simdSum(val)` | Butterfly: 5× `shfl.sync.down` + 5× `add.f32` (deltas 16, 8, 4, 2, 1) |

The shfl.sync instruction requires PTX ISA 6.0+ / SM 3.0+, well within TornadoVM's minimum target (PTX 5.0+ / SM 6.2+).
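The mapping above can be made concrete with a single-threaded model of the two shuffle modes over one 32-lane warp. This is an illustrative sketch only — `WarpShuffleSim` is a hypothetical helper, not part of TornadoVM — but it follows the PTX ISA semantics: with a full mask and clamp 31, a `shfl.sync.down` lane whose computed source lane is out of range keeps its own value, and `shfl.sync.idx` with source lane 0 broadcasts lane 0 to every lane.

```java
// Single-threaded model of shfl.sync.down / shfl.sync.idx over one 32-lane warp.
// Hypothetical helper for illustration; not part of TornadoVM.
final class WarpShuffleSim {
    static final int WARP = 32;

    // shfl.sync.down.b32 d, a, delta, 31, 0xFFFFFFFF:
    // each lane reads from lane (laneId + delta); a lane whose source
    // would exceed the clamp keeps its own value.
    static float[] shuffleDown(float[] lanes, int delta) {
        float[] out = new float[WARP];
        for (int lane = 0; lane < WARP; lane++) {
            int src = lane + delta;
            out[lane] = (src < WARP) ? lanes[src] : lanes[lane];
        }
        return out;
    }

    // shfl.sync.idx.b32 d, a, 0, 31, 0xFFFFFFFF: every lane reads lane 0.
    static float[] broadcastFirst(float[] lanes) {
        float[] out = new float[WARP];
        java.util.Arrays.fill(out, lanes[0]);
        return out;
    }
}
```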

Changes

New files (3 Graal IR nodes + 1 example)

| File | Description |
| --- | --- |
| `PTXShuffleDownNode.java` | FixedWithNextNode → emits `ShuffleSyncStmt(DOWN, result, data, delta)` |
| `PTXSimdBroadcastFirstNode.java` | FixedWithNextNode → emits `ShuffleSyncStmt(IDX, result, value, 0)` |
| `PTXSimdSumNode.java` | FixedWithNextNode → emits 5-round butterfly (shuffle + add per round) |
| `SIMDReductionComparison.java` | Benchmark comparing threadgroup, simdSum, and simdShuffleDown reductions |

Modified files

| File | Change |
| --- | --- |
| `PTXLIRStmt.java` | Added `ShuffleSyncStmt` inner class with `Mode` enum (DOWN, IDX, UP, BFLY) |
| `PTXGraphBuilderPlugins.java` | Added `registerSIMDPlugins()` with 3 InvocationPlugins, called from `registerKernelContextPlugins()` |
| `TestSIMDGroupReductions.java` | Removed `assertNotBackend(PTX)` from all 5 tests; updated javadoc |
| `KernelContext.java` | Updated comments/javadoc to document PTX equivalents alongside Metal |

Design Notes

  • Convergent operations: All 3 nodes extend FixedWithNextNode (not FloatingNode) to prevent the Graal scheduler from hoisting them into conditional branches where only some warp lanes execute. This follows the Metal pattern.
  • Full warp mask: 0xFFFFFFFF assumes full 32-lane warp participation. The clamp value 31 is the maximum lane ID. This matches the current API contract.
  • b32 type compatibility: PTX .b32 (bit-size) is compatible with .f32 registers per PTX ISA type-compatibility rules.
  • Plugin-based interception: Following the Metal pattern, InvocationPlugins in PTXGraphBuilderPlugins intercept the KernelContext method calls during bytecode parsing and directly replace them with the new Graal IR nodes.
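The 5-round butterfly that `PTXSimdSumNode` emits can be sketched as a single-threaded model. `SimdSumSim` is a hypothetical name for illustration; the point is that under the out-of-range-keeps-own-value semantics of `shfl.sync.down` with clamp 31, lane 0 ends up holding the sum of all 32 lane values after the five shuffle + add rounds (deltas 16, 8, 4, 2, 1), even though the upper lanes accumulate partial garbage along the way.

```java
// Single-threaded model of the simdSum lowering: five rounds of
// shfl.sync.down followed by add.f32, with deltas 16, 8, 4, 2, 1.
// Illustrative sketch only; hypothetical helper, not TornadoVM code.
final class SimdSumSim {
    static final int WARP = 32;

    static float simdSum(float[] lanes) {
        float[] v = lanes.clone();
        for (int delta = 16; delta >= 1; delta >>= 1) {
            float[] next = new float[WARP];
            for (int lane = 0; lane < WARP; lane++) {
                int src = lane + delta;
                // out-of-range source keeps the lane's own value (clamp = 31)
                float shuffled = (src < WARP) ? v[src] : v[lane];
                next[lane] = v[lane] + shuffled;  // add.f32 after each shuffle
            }
            v = next;
        }
        return v[0];  // after 5 rounds, lane 0 holds the full warp sum
    }
}
```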

Test Results

All 5 tests pass on PTX (NVIDIA sm_86):

tornado-test --threadInfo --printKernel -V \
  uk.ac.manchester.tornado.unittests.kernelcontext.reductions.TestSIMDGroupReductions
Running test: testIrregularSizes_NotMultipleOf32 ................ [PASS]
Running test: testIrregularSizes_MultiplesOf32   ................ [PASS]
Running test: testSIMDSum                        ................ [PASS]
Running test: testSIMDShuffleDownReduction       ................ [PASS]
Running test: testSIMDBroadcastFirst             ................ [PASS]

Generated PTX contains the expected instructions:

shfl.sync.down.b32 rfi1, rfi0, 16, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi3, rfi2, 8, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi5, rfi4, 4, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi7, rfi6, 2, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi9, rfi8, 1, 31, 0xFFFFFFFF;
shfl.sync.idx.b32 rfi1, rfi0, 0, 31, 0xFFFFFFFF;

Benchmark Results

tornado -m tornado.examples/uk.ac.manchester.tornado.examples.compute.SIMDReductionComparison
Float Sum Reduction — SIMD vs Threadgroup Memory
=================================================
Input size : 25,000,000 floats (100.0 MB)
Local size : 32 (one SIMD group per work group)
Groups     : 781,250
Warmup     : 50  Iterations: 200

Sequential sum = 312500012500000

Running benchmarks...

Correctness
-----------
  Threadgroup memory              result=312500004194304  relErr=2.66e-08  ✓
  simd_sum                        result=312500004194304  relErr=2.66e-08  ✓
  simdShuffleDown                 result=312500004194304  relErr=2.66e-08  ✓

Performance (end-to-end dispatch + kernel + readback)
------------------------------------------------------
  Threadgroup memory              avg= 1.668 ms  min= 1.513 ms  max= 2.416 ms
  simd_sum                        avg= 1.258 ms  min= 1.182 ms  max= 2.157 ms
  simdShuffleDown                 avg= 1.228 ms  min= 1.181 ms  max= 2.258 ms

Speedup simd_sum        vs threadgroup: 1.33x
Speedup simdShuffleDown vs threadgroup: 1.36x

Irregular-size correctness (zero-padded to next multiple of 32)
-----------------------------------------------------------------------
  n           paddedSize    threadgroup     simd_sum        simdShuffle
  1           32            1 ✓             1 ✓             1 ✓
  31          32            496 ✓           496 ✓           496 ✓
  33          64            561 ✓           561 ✓           561 ✓
  63          64            2016 ✓          2016 ✓          2016 ✓
  65          96            2145 ✓          2145 ✓          2145 ✓
  100         128           5050 ✓          5050 ✓          5050 ✓
  1000        1024          500500 ✓        500500 ✓        500500 ✓
  1023        1024          523776 ✓        523776 ✓        523776 ✓
  1025        1056          525825 ✓        525825 ✓        525825 ✓
  96          96            4656 ✓          4656 ✓          4656 ✓
  160         160           12880 ✓         12880 ✓         12880 ✓
  65537       65568         2147581953 ✓    2147581953 ✓    2147581953 ✓
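The padded sizes in the table follow from rounding each `n` up to the next multiple of 32 and zero-filling the tail, so every warp has 32 live lanes and the padding lanes contribute nothing to the sum. A minimal sketch of that padding step (`PadToWarp` is a hypothetical helper, not the benchmark's actual code):

```java
// Zero-pad an input so its length is a multiple of the 32-lane warp size.
// Hypothetical helper for illustration; zeros leave a sum reduction unchanged.
final class PadToWarp {
    // Round n up to the next multiple of 32.
    static int paddedSize(int n) {
        return ((n + 31) / 32) * 32;
    }

    // copyOf zero-fills the tail beyond the original length.
    static float[] pad(float[] in) {
        return java.util.Arrays.copyOf(in, paddedSize(in.length));
    }
}
```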

The SIMD shuffle paths are ~1.3–1.4× faster than the shared-memory + barrier approach. The speedup comes from:

  1. No shared memory traffic — the threadgroup reduction writes all 32 lane values to shared float[32], then does 5 rounds of barrier-synchronized reads/writes. The SIMD paths stay entirely in registers.
  2. No barriers — the threadgroup path requires 5 barrier.sync calls; the SIMD paths have none.
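For contrast, the threadgroup path being replaced can be modeled the same way: a shared array of 32 lane values reduced in five halving rounds, with a `barrier.sync` after each round on real hardware. The sketch below is a single-threaded illustration (hypothetical `ThreadgroupReductionSim`, not the generated kernel); the barriers appear only as comments, marking exactly the synchronization the register-only shuffle path avoids.

```java
// Single-threaded model of the shared-memory tree reduction: five halving
// rounds over a 32-element shared array. On the GPU each round ends with
// barrier.sync; the shuffle-based paths need none. Illustrative sketch only.
final class ThreadgroupReductionSim {
    static float reduce(float[] shared) {
        float[] s = shared.clone();
        for (int stride = 16; stride >= 1; stride >>= 1) {
            for (int lane = 0; lane < stride; lane++) {
                s[lane] += s[lane + stride];  // each active lane adds its partner
            }
            // barrier.sync here on the GPU -- one of the five barriers
            // the SIMD shuffle paths eliminate
        }
        return s[0];
    }
}
```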

mikepapadim and others added 8 commits February 25, 2026 11:02
Post-release minor fixes for mvn deploy and README badges
Add TornadoVM developer skill (build, test, debug, Java 21+ idioms) for Claude
Implement PTX equivalents of the Metal SIMD-group intrinsics using
CUDA's shfl.sync warp-shuffle instructions (PTX ISA 6.0+, SM 3.0+).

New Graal IR nodes:
- PTXShuffleDownNode: shfl.sync.down.b32 for simdShuffleDown(float, int)
- PTXSimdBroadcastFirstNode: shfl.sync.idx.b32 lane 0 for simdBroadcastFirst(float)
- PTXSimdSumNode: butterfly reduction (5x shfl.sync.down + 5x add.f32) for simdSum(float)

New LIR statement:
- PTXLIRStmt.ShuffleSyncStmt with Mode enum (DOWN, IDX, UP, BFLY)

Plugin registration:
- registerSIMDPlugins() in PTXGraphBuilderPlugins intercepts KernelContext
  SIMD methods and replaces them with the new IR nodes during parsing.

Tests and examples:
- Enable PTX in TestSIMDGroupReductions (all 5 tests pass)
- Add SIMDReductionComparison benchmark example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment


Pull request overview

This PR extends TornadoVM’s PTX backend to lower KernelContext SIMD-group intrinsics (simdShuffleDown, simdSum, simdBroadcastFirst) to PTX shfl.sync.* warp shuffle instructions, aligning PTX behavior with the SIMD-group intrinsics introduced for Metal.

Changes:

  • Add PTX Graal IR nodes + PTX LIR emission support for shfl.sync-based shuffle/reduction operations.
  • Add a new unit test suite and an example benchmark exercising the SIMD intrinsics.
  • Update KernelContext documentation/comments and refresh some repo metadata (README badges, deploy workflow).

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tornadovm-skill.skill` | Adds a new artifact (currently appears to be a zipped payload). |
| `tornado-unittests/.../TestSIMDGroupReductions.java` | New unit tests for simdSum, simdShuffleDown, simdBroadcastFirst. |
| `tornado-examples/.../SIMDReductionComparison.java` | New benchmark comparing reduction strategies including SIMD shuffles. |
| `tornado-drivers/ptx/.../PTXShuffleDownNode.java` | New Graal node lowering simdShuffleDown to `shfl.sync.down`. |
| `tornado-drivers/ptx/.../PTXSimdBroadcastFirstNode.java` | New Graal node lowering simdBroadcastFirst to `shfl.sync.idx`. |
| `tornado-drivers/ptx/.../PTXSimdSumNode.java` | New Graal node implementing simdSum via shuffle-based reduction. |
| `tornado-drivers/ptx/.../PTXLIRStmt.java` | Adds `ShuffleSyncStmt` LIR to emit `shfl.sync.*` instructions. |
| `tornado-drivers/ptx/.../PTXGraphBuilderPlugins.java` | Registers PTX invocation plugins to replace KernelContext calls with the new nodes. |
| `tornado-assembly/src/bin/tornado-test` | Adds the new test class to the "test the world" list. |
| `tornado-api/.../KernelContext.java` | Adds CPU stubs + documentation for the SIMD intrinsics. |
| `README.md` | Updates CI badges. |
| `.github/workflows/deploy-maven-central-jdk21.yml` | Adjusts deploy workflow naming/tag patterns and checkout ref handling. |


…lanes and add comprehensive tests for simdSum and simdShuffleDown functionality
