
Add SIMD Shuffle/Reduction Support to PTX Backend #813

Open
mikepapadim wants to merge 9 commits into `develop` from `feature/ptx-simd-shuffle`

Conversation

@mikepapadim (Member)

Summary

Addresses #812 for PTX

This PR adds PTX backend support for the SIMD-group intrinsics introduced for Metal in PR #796. The three KernelContext methods — simdShuffleDown, simdSum, and simdBroadcastFirst — are now lowered to CUDA's shfl.sync warp-shuffle instructions when targeting the PTX backend.

Mapping

| KernelContext API | PTX Instruction |
| --- | --- |
| `simdShuffleDown(val, d)` | `shfl.sync.down.b32 dest, val, d, 31, 0xFFFFFFFF;` |
| `simdBroadcastFirst(val)` | `shfl.sync.idx.b32 dest, val, 0, 31, 0xFFFFFFFF;` |
| `simdSum(val)` | Butterfly: 5× `shfl.sync.down` + 5× `add.f32` (deltas 16, 8, 4, 2, 1) |

The shfl.sync instruction requires PTX ISA 6.0+ / SM 3.0+, well within TornadoVM's minimum target (PTX 5.0+ / SM 6.2+).
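The mapping above can be made concrete with a single-threaded model of the two shuffle modes over one 32-lane warp. This is an illustrative sketch only — `WarpShuffleSim` is a hypothetical helper, not part of TornadoVM — but it follows the PTX ISA semantics: with a full mask and clamp 31, a `shfl.sync.down` lane whose computed source lane is out of range keeps its own value, and `shfl.sync.idx` with source lane 0 broadcasts lane 0 to every lane.

```java
// Single-threaded model of shfl.sync.down / shfl.sync.idx over one 32-lane warp.
// Hypothetical helper for illustration; not part of TornadoVM.
final class WarpShuffleSim {
    static final int WARP = 32;

    // shfl.sync.down.b32 d, a, delta, 31, 0xFFFFFFFF:
    // each lane reads from lane (laneId + delta); a lane whose source
    // would exceed the clamp keeps its own value.
    static float[] shuffleDown(float[] lanes, int delta) {
        float[] out = new float[WARP];
        for (int lane = 0; lane < WARP; lane++) {
            int src = lane + delta;
            out[lane] = (src < WARP) ? lanes[src] : lanes[lane];
        }
        return out;
    }

    // shfl.sync.idx.b32 d, a, 0, 31, 0xFFFFFFFF: every lane reads lane 0.
    static float[] broadcastFirst(float[] lanes) {
        float[] out = new float[WARP];
        java.util.Arrays.fill(out, lanes[0]);
        return out;
    }
}
```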

Changes

New files (3 Graal IR nodes + 1 example)

| File | Description |
| --- | --- |
| `PTXShuffleDownNode.java` | FixedWithNextNode → emits `ShuffleSyncStmt(DOWN, result, data, delta)` |
| `PTXSimdBroadcastFirstNode.java` | FixedWithNextNode → emits `ShuffleSyncStmt(IDX, result, value, 0)` |
| `PTXSimdSumNode.java` | FixedWithNextNode → emits 5-round butterfly (shuffle + add per round) |
| `SIMDReductionComparison.java` | Benchmark comparing threadgroup, simdSum, and simdShuffleDown reductions |

Modified files

| File | Change |
| --- | --- |
| `PTXLIRStmt.java` | Added `ShuffleSyncStmt` inner class with `Mode` enum (DOWN, IDX, UP, BFLY) |
| `PTXGraphBuilderPlugins.java` | Added `registerSIMDPlugins()` with 3 InvocationPlugins, called from `registerKernelContextPlugins()` |
| `TestSIMDGroupReductions.java` | Removed `assertNotBackend(PTX)` from all 5 tests; updated javadoc |
| `KernelContext.java` | Updated comments/javadoc to document PTX equivalents alongside Metal |

Design Notes

  • Convergent operations: All 3 nodes extend FixedWithNextNode (not FloatingNode) to prevent the Graal scheduler from hoisting them into conditional branches where only some warp lanes execute. This follows the Metal pattern.
  • Full warp mask: 0xFFFFFFFF assumes full 32-lane warp participation. The clamp value 31 is the maximum lane ID. This matches the current API contract.
  • b32 type compatibility: PTX .b32 (bit-size) is compatible with .f32 registers per PTX ISA type-compatibility rules.
  • Plugin-based interception: Following the Metal pattern, InvocationPlugins in PTXGraphBuilderPlugins intercept the KernelContext method calls during bytecode parsing and directly replace them with the new Graal IR nodes.
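The 5-round butterfly that `PTXSimdSumNode` emits can be sketched as a single-threaded model. `SimdSumSim` is a hypothetical name for illustration; the point is that under the out-of-range-keeps-own-value semantics of `shfl.sync.down` with clamp 31, lane 0 ends up holding the sum of all 32 lane values after the five shuffle + add rounds (deltas 16, 8, 4, 2, 1), even though the upper lanes accumulate partial garbage along the way.

```java
// Single-threaded model of the simdSum lowering: five rounds of
// shfl.sync.down followed by add.f32, with deltas 16, 8, 4, 2, 1.
// Illustrative sketch only; hypothetical helper, not TornadoVM code.
final class SimdSumSim {
    static final int WARP = 32;

    static float simdSum(float[] lanes) {
        float[] v = lanes.clone();
        for (int delta = 16; delta >= 1; delta >>= 1) {
            float[] next = new float[WARP];
            for (int lane = 0; lane < WARP; lane++) {
                int src = lane + delta;
                // out-of-range source keeps the lane's own value (clamp = 31)
                float shuffled = (src < WARP) ? v[src] : v[lane];
                next[lane] = v[lane] + shuffled;  // add.f32 after each shuffle
            }
            v = next;
        }
        return v[0];  // after 5 rounds, lane 0 holds the full warp sum
    }
}
```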

Test Results

All 5 tests pass on PTX (NVIDIA sm_86):

tornado-test --threadInfo --printKernel -V \
  uk.ac.manchester.tornado.unittests.kernelcontext.reductions.TestSIMDGroupReductions
Running test: testIrregularSizes_NotMultipleOf32 ................ [PASS]
Running test: testIrregularSizes_MultiplesOf32   ................ [PASS]
Running test: testSIMDSum                        ................ [PASS]
Running test: testSIMDShuffleDownReduction       ................ [PASS]
Running test: testSIMDBroadcastFirst             ................ [PASS]

Generated PTX contains the expected instructions:

shfl.sync.down.b32 rfi1, rfi0, 16, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi3, rfi2, 8, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi5, rfi4, 4, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi7, rfi6, 2, 31, 0xFFFFFFFF;
shfl.sync.down.b32 rfi9, rfi8, 1, 31, 0xFFFFFFFF;
shfl.sync.idx.b32 rfi1, rfi0, 0, 31, 0xFFFFFFFF;

Benchmark Results

tornado -m tornado.examples/uk.ac.manchester.tornado.examples.compute.SIMDReductionComparison
Float Sum Reduction — SIMD vs Threadgroup Memory
=================================================
Input size : 25,000,000 floats (100.0 MB)
Local size : 32 (one SIMD group per work group)
Groups     : 781,250
Warmup     : 50  Iterations: 200

Sequential sum = 312500012500000

Running benchmarks...

Correctness
-----------
  Threadgroup memory              result=312500004194304  relErr=2.66e-08  ✓
  simd_sum                        result=312500004194304  relErr=2.66e-08  ✓
  simdShuffleDown                 result=312500004194304  relErr=2.66e-08  ✓

Performance (end-to-end dispatch + kernel + readback)
------------------------------------------------------
  Threadgroup memory              avg= 1.668 ms  min= 1.513 ms  max= 2.416 ms
  simd_sum                        avg= 1.258 ms  min= 1.182 ms  max= 2.157 ms
  simdShuffleDown                 avg= 1.228 ms  min= 1.181 ms  max= 2.258 ms

Speedup simd_sum        vs threadgroup: 1.33x
Speedup simdShuffleDown vs threadgroup: 1.36x

Irregular-size correctness (zero-padded to next multiple of 32)
-----------------------------------------------------------------------
  n           paddedSize    threadgroup     simd_sum        simdShuffle
  1           32            1 ✓             1 ✓             1 ✓
  31          32            496 ✓           496 ✓           496 ✓
  33          64            561 ✓           561 ✓           561 ✓
  63          64            2016 ✓          2016 ✓          2016 ✓
  65          96            2145 ✓          2145 ✓          2145 ✓
  100         128           5050 ✓          5050 ✓          5050 ✓
  1000        1024          500500 ✓        500500 ✓        500500 ✓
  1023        1024          523776 ✓        523776 ✓        523776 ✓
  1025        1056          525825 ✓        525825 ✓        525825 ✓
  96          96            4656 ✓          4656 ✓          4656 ✓
  160         160           12880 ✓         12880 ✓         12880 ✓
  65537       65568         2147581953 ✓    2147581953 ✓    2147581953 ✓
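The padded sizes in the table follow from rounding each `n` up to the next multiple of 32 and zero-filling the tail, so every warp has 32 live lanes and the padding lanes contribute nothing to the sum. A minimal sketch of that padding step (`PadToWarp` is a hypothetical helper, not the benchmark's actual code):

```java
// Zero-pad an input so its length is a multiple of the 32-lane warp size.
// Hypothetical helper for illustration; zeros leave a sum reduction unchanged.
final class PadToWarp {
    // Round n up to the next multiple of 32.
    static int paddedSize(int n) {
        return ((n + 31) / 32) * 32;
    }

    // copyOf zero-fills the tail beyond the original length.
    static float[] pad(float[] in) {
        return java.util.Arrays.copyOf(in, paddedSize(in.length));
    }
}
```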

The SIMD shuffle paths are ~1.3–1.4× faster than the shared-memory + barrier approach. The speedup comes from:

  1. No shared memory traffic — the threadgroup reduction writes all 32 lane values to shared float[32], then does 5 rounds of barrier-synchronized reads/writes. The SIMD paths stay entirely in registers.
  2. No barriers — the threadgroup path requires 5 barrier.sync calls; the SIMD paths have none.
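For contrast, the threadgroup path being replaced can be modeled the same way: a shared array of 32 lane values reduced in five halving rounds, with a `barrier.sync` after each round on real hardware. The sketch below is a single-threaded illustration (hypothetical `ThreadgroupReductionSim`, not the generated kernel); the barriers appear only as comments, marking exactly the synchronization the register-only shuffle path avoids.

```java
// Single-threaded model of the shared-memory tree reduction: five halving
// rounds over a 32-element shared array. On the GPU each round ends with
// barrier.sync; the shuffle-based paths need none. Illustrative sketch only.
final class ThreadgroupReductionSim {
    static float reduce(float[] shared) {
        float[] s = shared.clone();
        for (int stride = 16; stride >= 1; stride >>= 1) {
            for (int lane = 0; lane < stride; lane++) {
                s[lane] += s[lane + stride];  // each active lane adds its partner
            }
            // barrier.sync here on the GPU -- one of the five barriers
            // the SIMD shuffle paths eliminate
        }
        return s[0];
    }
}
```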

mikepapadim and others added 8 commits February 25, 2026 11:02
Post-release minor fixes for mvn deploy and README badges
Add TornadoVM developer skill (build, test, debug, Java 21+ idioms) for Claude
Implement PTX equivalents of the Metal SIMD-group intrinsics using
CUDA's shfl.sync warp-shuffle instructions (PTX ISA 6.0+, SM 3.0+).

New Graal IR nodes:
- PTXShuffleDownNode: shfl.sync.down.b32 for simdShuffleDown(float, int)
- PTXSimdBroadcastFirstNode: shfl.sync.idx.b32 lane 0 for simdBroadcastFirst(float)
- PTXSimdSumNode: butterfly reduction (5x shfl.sync.down + 5x add.f32) for simdSum(float)

New LIR statement:
- PTXLIRStmt.ShuffleSyncStmt with Mode enum (DOWN, IDX, UP, BFLY)

Plugin registration:
- registerSIMDPlugins() in PTXGraphBuilderPlugins intercepts KernelContext
  SIMD methods and replaces them with the new IR nodes during parsing.

Tests and examples:
- Enable PTX in TestSIMDGroupReductions (all 5 tests pass)
- Add SIMDReductionComparison benchmark example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment


Pull request overview

This PR extends TornadoVM’s PTX backend to lower KernelContext SIMD-group intrinsics (simdShuffleDown, simdSum, simdBroadcastFirst) to PTX shfl.sync.* warp shuffle instructions, aligning PTX behavior with the SIMD-group intrinsics introduced for Metal.

Changes:

  • Add PTX Graal IR nodes + PTX LIR emission support for shfl.sync-based shuffle/reduction operations.
  • Add a new unit test suite and an example benchmark exercising the SIMD intrinsics.
  • Update KernelContext documentation/comments and refresh some repo metadata (README badges, deploy workflow).

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `tornadovm-skill.skill` | Adds a new artifact (currently appears to be a zipped payload). |
| `tornado-unittests/.../TestSIMDGroupReductions.java` | New unit tests for simdSum, simdShuffleDown, simdBroadcastFirst. |
| `tornado-examples/.../SIMDReductionComparison.java` | New benchmark comparing reduction strategies including SIMD shuffles. |
| `tornado-drivers/ptx/.../PTXShuffleDownNode.java` | New Graal node lowering simdShuffleDown to `shfl.sync.down`. |
| `tornado-drivers/ptx/.../PTXSimdBroadcastFirstNode.java` | New Graal node lowering simdBroadcastFirst to `shfl.sync.idx`. |
| `tornado-drivers/ptx/.../PTXSimdSumNode.java` | New Graal node implementing simdSum via shuffle-based reduction. |
| `tornado-drivers/ptx/.../PTXLIRStmt.java` | Adds `ShuffleSyncStmt` LIR to emit `shfl.sync.*` instructions. |
| `tornado-drivers/ptx/.../PTXGraphBuilderPlugins.java` | Registers PTX invocation plugins to replace KernelContext calls with the new nodes. |
| `tornado-assembly/src/bin/tornado-test` | Adds the new test class to the "test the world" list. |
| `tornado-api/.../KernelContext.java` | Adds CPU stubs + documentation for the SIMD intrinsics. |
| `README.md` | Updates CI badges. |
| `.github/workflows/deploy-maven-central-jdk21.yml` | Adjusts deploy workflow naming/tag patterns and checkout ref handling. |


…lanes and add comprehensive tests for simdSum and simdShuffleDown functionality
