Add SIMD Shuffle/Reduction Support to PTX Backend #813
Open
mikepapadim wants to merge 9 commits into develop
Post release minor fixes for mvn deploy and README badges
Add TornadoVM developer skill (build, test, debug, Java 21+ idioms) for Claude
Implement PTX equivalents of the Metal SIMD-group intrinsics using CUDA's shfl.sync warp-shuffle instructions (PTX ISA 6.0+, SM 3.0+).

New Graal IR nodes:
- PTXShuffleDownNode: shfl.sync.down.b32 for simdShuffleDown(float, int)
- PTXSimdBroadcastFirstNode: shfl.sync.idx.b32 lane 0 for simdBroadcastFirst(float)
- PTXSimdSumNode: butterfly reduction (5x shfl.sync.down + 5x add.f32) for simdSum(float)

New LIR statement:
- PTXLIRStmt.ShuffleSyncStmt with Mode enum (DOWN, IDX, UP, BFLY)

Plugin registration:
- registerSIMDPlugins() in PTXGraphBuilderPlugins intercepts KernelContext SIMD methods and replaces them with the new IR nodes during parsing.

Tests and examples:
- Enable PTX in TestSIMDGroupReductions (all 5 tests pass)
- Add SIMDReductionComparison benchmark example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR extends TornadoVM’s PTX backend to lower KernelContext SIMD-group intrinsics (simdShuffleDown, simdSum, simdBroadcastFirst) to PTX shfl.sync.* warp shuffle instructions, aligning PTX behavior with the SIMD-group intrinsics introduced for Metal.
Changes:
- Add PTX Graal IR nodes + PTX LIR emission support for `shfl.sync`-based shuffle/reduction operations.
- Add a new unit test suite and an example benchmark exercising the SIMD intrinsics.
- Update `KernelContext` documentation/comments and refresh some repo metadata (README badges, deploy workflow).
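
As a rough illustration of the LIR emission listed above, the sketch below formats `shfl.sync` instruction text in the operand order given in the PR description (`dest, val, lane-or-delta, 31, 0xFFFFFFFF`). The class `ShuffleSyncSketch`, its `emit` helper, and the register names `%f1` and `%f2` are hypothetical; this is not the PR's `PTXLIRStmt.ShuffleSyncStmt` code, and it assumes the fixed clamp and mask that the PR describes for the `DOWN` and `IDX` modes only.

```java
import java.util.Locale;

class ShuffleSyncSketch {
    // Hypothetical stand-in for the Mode enum named in the PR (DOWN, IDX, UP, BFLY).
    enum Mode {
        DOWN("down"), IDX("idx"), UP("up"), BFLY("bfly");

        final String suffix;

        Mode(String suffix) {
            this.suffix = suffix;
        }
    }

    // Values taken from the PR description: full 32-lane mask and clamp 31.
    // Assumption: only DOWN and IDX are exercised here; other modes may need a different clamp.
    static final int FULL_WARP_MASK = 0xFFFFFFFF;
    static final int CLAMP = 31;

    // Builds one instruction line: shfl.sync.<mode>.b32 dest, src, laneOrDelta, clamp, mask;
    static String emit(Mode mode, String dest, String src, int laneOrDelta) {
        return String.format(Locale.ROOT, "shfl.sync.%s.b32 %s, %s, %d, %d, 0x%08X;",
                mode.suffix, dest, src, laneOrDelta, CLAMP, FULL_WARP_MASK);
    }

    public static void main(String[] args) {
        // simdShuffleDown(val, 16)
        System.out.println(emit(Mode.DOWN, "%f2", "%f1", 16));
        // simdBroadcastFirst(val): read from lane 0
        System.out.println(emit(Mode.IDX, "%f2", "%f1", 0));
    }
}
```

Keeping the mode as an enum with a textual suffix means a single statement class can cover every shuffle variant, which is one plausible reason for the `Mode` enum the PR introduces.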
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `tornadovm-skill.skill` | Adds a new artifact (currently appears to be a zipped payload). |
| `tornado-unittests/.../TestSIMDGroupReductions.java` | New unit tests for `simdSum`, `simdShuffleDown`, `simdBroadcastFirst`. |
| `tornado-examples/.../SIMDReductionComparison.java` | New benchmark comparing reduction strategies including SIMD shuffles. |
| `tornado-drivers/ptx/.../PTXShuffleDownNode.java` | New Graal node lowering `simdShuffleDown` to `shfl.sync.down`. |
| `tornado-drivers/ptx/.../PTXSimdBroadcastFirstNode.java` | New Graal node lowering `simdBroadcastFirst` to `shfl.sync.idx`. |
| `tornado-drivers/ptx/.../PTXSimdSumNode.java` | New Graal node implementing `simdSum` via shuffle-based reduction. |
| `tornado-drivers/ptx/.../PTXLIRStmt.java` | Adds `ShuffleSyncStmt` LIR to emit `shfl.sync.*` instructions. |
| `tornado-drivers/ptx/.../PTXGraphBuilderPlugins.java` | Registers PTX invocation plugins to replace `KernelContext` calls with the new nodes. |
| `tornado-assembly/src/bin/tornado-test` | Adds the new test class to the “test the world” list. |
| `tornado-api/.../KernelContext.java` | Adds CPU stubs + documentation for the SIMD intrinsics. |
| `README.md` | Updates CI badges. |
| `.github/workflows/deploy-maven-central-jdk21.yml` | Adjusts deploy workflow naming/tag patterns and checkout ref handling. |
...ivers/ptx/src/main/java/uk/ac/manchester/tornado/drivers/ptx/graal/nodes/PTXSimdSumNode.java
…lanes and add comprehensive tests for simdSum and simdShuffleDown functionality
Summary
Addresses #812 for PTX
This PR adds PTX backend support for the SIMD-group intrinsics introduced for Metal in PR #796. The three `KernelContext` methods (`simdShuffleDown`, `simdSum`, and `simdBroadcastFirst`) are now lowered to CUDA's `shfl.sync` warp-shuffle instructions when targeting the PTX backend.

Mapping

| KernelContext method | PTX lowering |
|---|---|
| `simdShuffleDown(val, d)` | `shfl.sync.down.b32 dest, val, d, 31, 0xFFFFFFFF;` |
| `simdBroadcastFirst(val)` | `shfl.sync.idx.b32 dest, val, 0, 31, 0xFFFFFFFF;` |
| `simdSum(val)` | 5× `shfl.sync.down` + 5× `add.f32` (deltas 16,8,4,2,1) |

The `shfl.sync` instruction requires PTX ISA 6.0+ / SM 3.0+, well within TornadoVM's minimum target (PTX 5.0+ / SM 6.2+).

Changes
New files (3 Graal IR nodes + 1 example)

| File | Description |
|---|---|
| `PTXShuffleDownNode.java` | `FixedWithNextNode` → emits `ShuffleSyncStmt(DOWN, result, data, delta)` |
| `PTXSimdBroadcastFirstNode.java` | `FixedWithNextNode` → emits `ShuffleSyncStmt(IDX, result, value, 0)` |
| `PTXSimdSumNode.java` | `FixedWithNextNode` → emits 5-round butterfly (shuffle + add per round) |
| `SIMDReductionComparison.java` | |

Modified files

| File | Change |
|---|---|
| `PTXLIRStmt.java` | `ShuffleSyncStmt` inner class with `Mode` enum (DOWN, IDX, UP, BFLY) |
| `PTXGraphBuilderPlugins.java` | `registerSIMDPlugins()` with 3 `InvocationPlugin`s, called from `registerKernelContextPlugins()` |
| `TestSIMDGroupReductions.java` | Removed `assertNotBackend(PTX)` from all 5 tests; updated javadoc |
| `KernelContext.java` | |

Design Notes
- The new nodes extend `FixedWithNextNode` (not `FloatingNode`) to prevent the Graal scheduler from hoisting them into conditional branches where only some warp lanes execute. This follows the Metal pattern.
- The member mask `0xFFFFFFFF` assumes full 32-lane warp participation. The clamp value `31` is the maximum lane ID. This matches the current API contract.
- `.b32` (bit-size) is compatible with `.f32` registers per PTX ISA type-compatibility rules.
- `InvocationPlugin`s in `PTXGraphBuilderPlugins` intercept the `KernelContext` method calls during bytecode parsing and directly replace them with the new Graal IR nodes.

Test Results
All 5 tests pass on PTX (NVIDIA sm_86):
Generated PTX contains the expected instructions:
Benchmark Results
The SIMD shuffle paths are ~1.3–1.4× faster than the shared-memory + barrier approach. The speedup comes from:
- The shared-memory path stages values in a `float[32]`, then does 5 rounds of barrier-synchronized reads/writes. The SIMD paths stay entirely in registers.
- The shared-memory path issues `barrier.sync` calls; the SIMD paths have none.
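
To make the two strategies concrete, here is a minimal CPU-side sketch that models one 32-lane warp in plain Java. It is not TornadoVM or PTX code: the class `WarpReductionSketch` and its methods are hypothetical, `shuffleDown` only imitates the `shfl.sync.down` behavior with clamp 31 (a lane whose source falls outside the warp keeps its own value), and the comments mark where a GPU implementation would need a barrier.

```java
class WarpReductionSketch {
    static final int WARP_SIZE = 32;

    // Models shfl.sync.down with clamp 31: lane i reads lane i + delta,
    // or keeps its own value when i + delta is outside the warp.
    static float[] shuffleDown(float[] lanes, int delta) {
        float[] out = new float[WARP_SIZE];
        for (int lane = 0; lane < WARP_SIZE; lane++) {
            int src = lane + delta;
            out[lane] = src < WARP_SIZE ? lanes[src] : lanes[lane];
        }
        return out;
    }

    // Register-only path: five shuffle + add rounds with deltas 16, 8, 4, 2, 1.
    // Only lane 0 is guaranteed to hold the full sum after the last round.
    static float shuffleSumLane0(float[] input) {
        float[] regs = input.clone();
        for (int delta = WARP_SIZE / 2; delta >= 1; delta /= 2) {
            float[] shifted = shuffleDown(regs, delta);
            for (int lane = 0; lane < WARP_SIZE; lane++) {
                regs[lane] += shifted[lane];
            }
        }
        return regs[0];
    }

    // Shared-memory path: values are staged in a float[32] tree reduction.
    static float sharedMemorySumLane0(float[] input) {
        float[] shared = input.clone(); // on a GPU: store to shared memory, then barrier
        for (int stride = WARP_SIZE / 2; stride >= 1; stride /= 2) {
            for (int lane = 0; lane < stride; lane++) {
                shared[lane] += shared[lane + stride];
            }
            // on a GPU: barrier before the next round reads these values
        }
        return shared[0];
    }

    public static void main(String[] args) {
        float[] input = new float[WARP_SIZE];
        for (int i = 0; i < WARP_SIZE; i++) {
            input[i] = i + 1; // 1..32, which sums to 528
        }
        System.out.println(shuffleSumLane0(input));      // prints 528.0
        System.out.println(sharedMemorySumLane0(input)); // prints 528.0
    }
}
```

Both paths run the same five halving rounds; the difference the benchmark measures is that the shuffle variant exchanges values lane to lane without the staging array or the synchronization points.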