Support loadAligned on CUDA backend. #10098

Open
csyonghe wants to merge 9 commits into shader-slang:master from csyonghe:loadaligned-cuda
@csyonghe
Collaborator

This change adds proper code generation for loadAligned calls when emitting CUDA code.

This is implemented by extending the existing lowerImmutableBufferLoadForCUDA pass into lowerImmutableOrAlignedBufferLoadForCUDA.

In the pass, when we see a load(ptr:T*, aligned(16)), we produce a wrapper type struct T_aligned16 { T value; } that carries an [Alignment(16)] decoration. We then rewrite the load to load(bit_cast<T_aligned16*>(ptr)).value. The CUDA backend is extended to recognize the Alignment decoration and emit it as an __align__(16) attribute in the resulting CUDA code.
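
For illustration, here is a minimal host-side C++ sketch of the wrapper-and-cast rewrite described above. It is not the code generated by this PR: `Vec4`, `Vec4_aligned16`, `loadAligned16`, and `demoAlignedLoad` are hypothetical names, and standard `alignas(16)` stands in for the CUDA `__align__(16)` attribute so that the sketch compiles with any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical payload type standing in for T; its natural alignment is 4.
struct Vec4
{
    float x, y, z, w;
};

// Host-side analog of the wrapper struct. The CUDA output would carry
// __align__(16); alignas(16) is the portable C++ equivalent used here.
struct alignas(16) Vec4_aligned16
{
    Vec4 value;
};

static_assert(alignof(Vec4) == 4, "payload keeps its natural alignment");
static_assert(alignof(Vec4_aligned16) == 16, "wrapper carries the stronger alignment");
static_assert(sizeof(Vec4_aligned16) == sizeof(Vec4), "wrapping adds no padding here");
static_assert(offsetof(Vec4_aligned16, value) == 0, "value sits at offset 0");

// Mirrors load(bit_cast<T_aligned16*>(ptr)).value.
// Assumed precondition: ptr points at the value member of a live
// Vec4_aligned16 object, so the address is 16-byte aligned.
inline Vec4 loadAligned16(const Vec4* ptr)
{
    assert(reinterpret_cast<std::uintptr_t>(ptr) % 16 == 0);
    const auto* wrapped = reinterpret_cast<const Vec4_aligned16*>(ptr);
    return wrapped->value;
}

// Small self-check: load through the wrapper and compare the fields.
inline bool demoAlignedLoad()
{
    Vec4_aligned16 storage{{1.0f, 2.0f, 3.0f, 4.0f}};
    const Vec4* ptr = &storage.value;
    Vec4 v = loadAligned16(ptr);
    return v.x == 1.0f && v.y == 2.0f && v.z == 3.0f && v.w == 4.0f;
}
```

The static assertions capture why the cast is layout-safe in this sketch: the wrapper has the same size as the payload and stores it at offset 0, so only the alignment promise changes.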

Copilot AI review requested due to automatic review settings February 20, 2026 01:09
@csyonghe csyonghe requested a review from a team as a code owner February 20, 2026 01:09
@csyonghe csyonghe requested review from bmillsNV and removed request for a team February 20, 2026 01:09
@coderabbitai

coderabbitai bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Aligned-load support was added across the IR, core API, CUDA lowering, and CUDA emitter. The IR gains an alignment decoration and builder helpers; the core loadAligned signatures now accept pointer types with an access parameter; the CUDA lowering pass wraps and unwraps types for aligned loads; and the CUDA emitter emits __align__(N) when the decoration is present.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core API: source/slang/core.meta.slang | Changed __load_aligned and loadAligned signatures to accept Ptr<T, access, AddressSpace.Device> with an Access template parameter instead of raw T*. |
| IR: decorations & stable names: source/slang/slang-ir-insts.lua, source/slang/slang-ir-insts-stable-names.lua | Added AlignmentDecoration (integer operand) to IR decorations and a stable-name entry for Decoration.AlignmentDecoration. |
| IR: APIs: source/slang/slang-ir-insts.h | Added IRLoad::getPtrOperand(), IRBuilder::getPtrType(..., oldPtrType) overload, and IRBuilder::addAlignmentDecoration(...) helper (duplicate insertion for availability). |
| CUDA lowering: source/slang/slang-ir-cuda-immutable-load.h, source/slang/slang-ir-cuda-immutable-load.cpp, source/slang/slang-emit.cpp | Renamed lowering to lowerImmutableOrAlignedBufferLoadForCUDA; added aligned-wrapper key/cache, getOrCreateAlignedWrapper, pointer bitcast to wrapper, load-through-wrapper lowering, and unwrap/extract-and-rewire behavior. |
| CUDA emitter: source/slang/slang-emit-cuda.h, source/slang/slang-emit-cuda.cpp | Declared/implemented emitPostKeywordTypeAttributesImpl(IRInst*), which emits __align__(N) when an AlignmentDecoration is present. |
| Tests: tests/spirv/aligned-load-store.slang | Added uniform ImmutablePtr<C> data3; and a loadAligned<16>(data3)/storeAligned<16>(...) sequence; updated expectations for additional PTX load/store paths. |
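
The aligned-wrapper key/cache mentioned in the CUDA lowering row can be pictured with a short C++ sketch. This is only an assumption about the general shape (one wrapper per inner type and alignment pair); `WrapperType` and `AlignedWrapperCache` are hypothetical stand-ins and do not reproduce Slang's IR types or the actual getOrCreateAlignedWrapper.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an IR struct type; not Slang's IRStructType.
struct WrapperType
{
    std::string name;
    int alignment;
};

// Caches one wrapper per (inner type name, alignment) pair so that repeated
// aligned loads of the same type reuse a single wrapper struct.
class AlignedWrapperCache
{
public:
    const WrapperType& getOrCreate(const std::string& innerType, int alignment)
    {
        assert(alignment > 0);
        auto key = std::make_pair(innerType, alignment);
        auto it = cache_.find(key);
        if (it == cache_.end())
        {
            WrapperType wrapper{innerType + "_aligned" + std::to_string(alignment), alignment};
            it = cache_.emplace(key, wrapper).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    // std::map keeps references to stored values stable across insertions.
    std::map<std::pair<std::string, int>, WrapperType> cache_;
};
```

Keying on both the type and the alignment means a 16-byte and an 8-byte aligned load of the same type get distinct wrappers, while repeated requests return the same one.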

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant CoreAPI as Core API
    participant IRBuilder as IR Builder
    participant LowerPass as CUDA Lowering Pass
    participant CUDAEmitter as CUDA Emitter

    User->>CoreAPI: call loadAligned<16>(ptr)
    CoreAPI->>IRBuilder: emit IRLoad + AlignmentDecoration
    IRBuilder->>LowerPass: provide IR with alignment metadata
    LowerPass->>LowerPass: getOrCreate aligned-wrapper type
    LowerPass->>LowerPass: bitcast ptr -> wrapped ptr and load wrapped struct
    LowerPass->>LowerPass: extract field, replace uses (unwrap)
    LowerPass->>CUDAEmitter: emit lowered IR
    CUDAEmitter->>CUDAEmitter: detect AlignmentDecoration and emit __align__(N)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰
I tuck a field in a cozy wrap,
hop, bitcast, then gently unwrap,
a tiny decoration points the way,
CUDA lines up bytes to play,
hooray — aligned hops for the day!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.52%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Support loadAligned on CUDA backend' accurately and concisely describes the main objective of the pull request, which is to add code generation for loadAligned calls in CUDA. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing a clear technical explanation of how loadAligned support is implemented in the CUDA backend through wrapper structs and alignment decorations. |


Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for generating CUDA code with proper alignment attributes for loadAligned calls. The implementation extends the existing immutable buffer load lowering pass to handle aligned loads by creating wrapper struct types with alignment decorations.

Changes:

  • Extends lowerImmutableBufferLoadForCUDA to lowerImmutableOrAlignedBufferLoadForCUDA to handle both immutable and aligned buffer loads for CUDA targets
  • Adds AlignmentDecoration to the IR instruction system to represent alignment requirements on struct types
  • Implements wrapper type creation that adds __align__ attributes in generated CUDA code
  • Updates the loadAligned signature to accept Ptr<T, access, AddressSpace.Device> instead of a raw T*, so that pointers with different access qualifiers can be passed
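
As a rough picture of the emitter-side change, the following C++ sketch prints a struct declaration and adds `__align__(N)` after the `struct` keyword only when an alignment is recorded. `StructDesc` and `emitStructDecl` are hypothetical; the real emitter works on Slang IR through emitPostKeywordTypeAttributesImpl, and its exact output formatting is not shown in this conversation.

```cpp
#include <cassert>
#include <string>

// Hypothetical simplified type description; Slang's IR and emitter differ.
struct StructDesc
{
    std::string name;
    std::string body;
    int alignment; // 0 means no alignment decoration is present
};

// Emits "struct [__align__(N)] Name { body };" as a single line of text.
inline std::string emitStructDecl(const StructDesc& s)
{
    // Alignments are expected to be powers of two when present.
    assert(s.alignment == 0 || (s.alignment & (s.alignment - 1)) == 0);

    std::string out = "struct ";
    if (s.alignment != 0)
        out += "__align__(" + std::to_string(s.alignment) + ") ";
    out += s.name + " { " + s.body + " };";
    return out;
}
```

Types without the decoration are emitted unchanged, which keeps the attribute confined to the wrapper structs created by the lowering pass.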

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| tests/spirv/aligned-load-store.slang | Adds PTX target test expectations for aligned loads on CUDA backend |
| source/slang/slang-ir-insts.lua | Defines new AlignmentDecoration IR instruction |
| source/slang/slang-ir-insts.h | Adds getPtrOperand helper method to IRLoad and getPtrType overload to IRBuilder |
| source/slang/slang-ir-insts-stable-names.lua | Assigns stable ID (728) to AlignmentDecoration |
| source/slang/slang-ir-cuda-immutable-load.h | Renames function to reflect expanded functionality |
| source/slang/slang-ir-cuda-immutable-load.cpp | Implements aligned wrapper type creation and load rewriting logic |
| source/slang/slang-emit.cpp | Updates pass invocation to use renamed function |
| source/slang/slang-emit-cuda.h | Declares emitPostKeywordTypeAttributesImpl override |
| source/slang/slang-emit-cuda.cpp | Implements __align__ attribute emission for types with AlignmentDecoration; includes minor formatting cleanup |
| source/slang/core.meta.slang | Updates __load_aligned and loadAligned signatures to support Ptr with access qualifiers |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@source/slang/slang-emit-cuda.cpp`:
- Around line 1275-1277: The three-line block assigning rowCount, colCount, and
matrixUse from coopMatType has formatting drift; reformat this block to match
the project's clang-format style (run the repo's formatting script or
clang-format) so spacing/indentation and casting align with surrounding code.
Locate the block that uses coopMatType and IRIntLit with getRowCount(),
getColumnCount(), and getMatrixUse() and re-run the formatter so the lines
assigning uint32_t rowCount, uint32_t colCount, and uint32_t matrixUse conform
to the project's style.

In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 349-404: The FieldExtract is currently emitted before the load
because builder.setInsertBefore(load) is used; when needUnwrap is true we must
emit the extract after the load has been produced to avoid use-before-def.
Change the insertion point right before calling emitFieldExtract (e.g., call
builder.setInsertAfter(loadedValue) or builder.setInsertAfter(load) once
loadedValue is available) so the builder emits the extract after the load, then
perform the replaceOperand loop and/or inst replacement/remove as before
(symbols: builder, load, needUnwrap, loadedValue, emitFieldExtract, inst,
replaceUsesWith, removeAndDeallocate).
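
The ordering concern in the second finding can be illustrated with a toy instruction list in C++. This is a sketch under stated assumptions, not Slang's IRBuilder: `OrderedInst`, `isDefBeforeUse`, and `insertRelativeTo` are hypothetical helpers that only model "insert before" versus "insert after" an anchor instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy instruction: a result name and at most one operand name.
struct OrderedInst
{
    std::string result;
    std::string operand; // empty if the instruction uses no prior result
};

// Returns true if every operand is defined by an earlier instruction.
inline bool isDefBeforeUse(const std::vector<OrderedInst>& block)
{
    for (std::size_t i = 0; i < block.size(); ++i)
    {
        if (block[i].operand.empty())
            continue;
        bool found = false;
        for (std::size_t j = 0; j < i; ++j)
            found = found || block[j].result == block[i].operand;
        if (!found)
            return false;
    }
    return true;
}

// Inserts `inst` directly before or directly after the instruction whose
// result is `anchor`, modelling setInsertBefore versus setInsertAfter.
inline void insertRelativeTo(std::vector<OrderedInst>& block,
                             const std::string& anchor,
                             const OrderedInst& inst,
                             bool insertAfter)
{
    for (std::size_t i = 0; i < block.size(); ++i)
    {
        if (block[i].result == anchor)
        {
            std::size_t pos = insertAfter ? i + 1 : i;
            block.insert(block.begin() + static_cast<std::ptrdiff_t>(pos), inst);
            return;
        }
    }
    assert(false && "anchor not found");
}
```

Inserting the field extract before the load places a use ahead of its definition, while inserting it after the load keeps the list well-ordered.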

@slangbot
Contributor

slangbot commented Feb 20, 2026

⚠️ IR Instruction Files Changed

This PR modifies IR instruction definition files. Please review if you need to update the following constants in source/slang/slang-ir.h:

  • k_minSupportedModuleVersion: Should be incremented if you're removing instructions or making breaking changes
  • k_maxSupportedModuleVersion: Should be incremented when adding new instructions

These version numbers help ensure compatibility between different versions of compiled modules.

@csyonghe
Collaborator Author

/format

@slangbot
Contributor

🌈 Formatted, please merge the changes from this PR

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 349-405: Cache the root address of the load's pointer before
inserting the alignment bitcast, so that the immutable-location check runs
against the original root. Specifically, compute
auto rootAddr = getRootAddr(load->getPtr()) before calling
getOrCreateAlignedWrapper or builder.replaceOperand, then call
isPointerToImmutableLocation(rootAddr) (and, if needed, pass rootAddr into the
immutable-load path) instead of calling getRootAddr(load->getPtr()) after the
bitcast. That way emitImmutableLoad and isPointerToImmutableLocation see the
unmodified root.

@csyonghe csyonghe added the pr: non-breaking PRs without breaking changes label Feb 20, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 82-85: The call to getNaturalSizeAndAlignment currently ignores
its Result return value causing naturalSizeAlignment to be potentially
uninitialized; modify the code around getNaturalSizeAndAlignment (and the local
IRSizeAndAlignment naturalSizeAlignment) to check the Result and handle failure:
if the call fails, do not use naturalSizeAlignment—either propagate the failure
up (return an error/Result from this function) or return a safe fallback (e.g.,
the original innerType) depending on the surrounding API contract, and adjust
callers if you choose to propagate the error.
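
A minimal C++ sketch of the "check the result, otherwise fall back" pattern suggested here. `Status`, `SizeAndAlignment`, `queryLayout`, and `chooseLoadType` are hypothetical stand-ins for Slang's Result, IRSizeAndAlignment, and getNaturalSizeAndAlignment. The shortcut that skips wrapping when the natural alignment already suffices is an assumption made for the example, not something confirmed in this conversation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; Slang's Result and IRSizeAndAlignment are richer.
enum class Status { Ok, Fail };

struct SizeAndAlignment
{
    int size;
    int alignment;
};

// Toy layout query: succeeds only for the type names it knows about.
inline Status queryLayout(const std::string& type, SizeAndAlignment* out)
{
    if (type == "float")
    {
        out->size = 4;
        out->alignment = 4;
        return Status::Ok;
    }
    if (type == "float4")
    {
        out->size = 16;
        out->alignment = 4;
        return Status::Ok;
    }
    return Status::Fail; // *out is left untouched on failure
}

// Returns the wrapper name to load through, or the original type when the
// layout query fails or (assumed) the natural alignment is already enough.
inline std::string chooseLoadType(const std::string& type, int requestedAlignment)
{
    assert(requestedAlignment > 0);
    SizeAndAlignment layout{};
    if (queryLayout(type, &layout) != Status::Ok)
        return type; // safe fallback: do not read layout, keep the inner type
    if (layout.alignment >= requestedAlignment)
        return type;
    return type + "_aligned" + std::to_string(requestedAlignment);
}
```

The point is that `layout` is read only after the status has been checked, so a failed query can never feed uninitialized values into the decision.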

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 382-405: The code in the needUnwrap branch replaces all recorded
uses with the extracted field value but never removes the original load
instruction (inst), leaving an orphaned IR node; after you call
builder.replaceOperand for the saved uses (inside the needUnwrap block where you
set loadedValue = builder.emitFieldExtract(...)), check whether loadedValue !=
inst and if so call inst->replaceUsesWith(loadedValue) or simply
inst->removeAndDeallocate() as appropriate to fully remove the original load;
update the needUnwrap branch to mirror the cleanup behavior of the else-if that
handles loadedValue != inst so inst is deallocated when the unwrap path produced
a new value.
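
The cleanup being requested, rewiring every use to the new value and then deleting the original instruction, can be sketched on a toy IR in C++. `ToyInst`, `replaceUsesWith`, `hasUses`, and `removeIfUnused` are hypothetical helpers that only borrow the names of the Slang operations mentioned above; they do not reflect Slang's actual data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy IR: each instruction has a name and a list of operand names.
struct ToyInst
{
    std::string name;
    std::vector<std::string> operands;
};

// Rewrites every operand equal to oldName so that it refers to newName.
inline void replaceUsesWith(std::vector<ToyInst>& block,
                            const std::string& oldName,
                            const std::string& newName)
{
    for (ToyInst& inst : block)
        std::replace(inst.operands.begin(), inst.operands.end(), oldName, newName);
}

// Returns true if any instruction still refers to `name`.
inline bool hasUses(const std::vector<ToyInst>& block, const std::string& name)
{
    for (const ToyInst& inst : block)
        if (std::find(inst.operands.begin(), inst.operands.end(), name) != inst.operands.end())
            return true;
    return false;
}

// Removes the instruction, asserting first that nothing refers to it.
inline void removeIfUnused(std::vector<ToyInst>& block, const std::string& name)
{
    assert(!hasUses(block, name) && "removal would leave dangling operands");
    block.erase(std::remove_if(block.begin(), block.end(),
                               [&](const ToyInst& inst) { return inst.name == name; }),
                block.end());
}
```

Skipping the final removal leaves the old load in the block with no users, which is the orphaned node the comment describes.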

@csyonghe
Collaborator Author

/format

@slangbot
Contributor

🌈 Formatted, please merge the changes from this PR

@coderabbitai coderabbitai bot left a comment

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 352-407: The unwrap branch currently replaces uses by iterating
uses and calling builder.replaceOperand but never removes the original inst,
leaving an orphaned IR node; update the needUnwrap branch after extracting the
field so that if the immutable lowering actually produced a different value
(i.e., loadedValue != inst) you call inst->replaceUsesWith(loadedValue) (or
ensure all uses are replaced) and then inst->removeAndDeallocate(); in short,
after the builder.emitFieldExtract and the loop that replaces uses, check
loadedValue != inst and call inst->replaceUsesWith(loadedValue) if needed and
then inst->removeAndDeallocate() so the original instruction is cleaned up
(referencing symbols: needUnwrap, emitImmutableLoad, loadedValue, inst,
builder.emitFieldExtract, builder.replaceOperand, replaceUsesWith,
removeAndDeallocate).

Labels

pr: non-breaking PRs without breaking changes

2 participants