Support loadAligned on CUDA backend.#10098
Support loadAligned on CUDA backend.#10098csyonghe wants to merge 9 commits intoshader-slang:masterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAligned-load support was added across IR, core API, CUDA lowering, and emitter: alignment decorations and builder helpers, core loadAligned signatures now accept pointer-with-access types, CUDA lowering wraps/unwraps types for aligned loads, and CUDA emitter emits align(N) when present. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User Code
participant CoreAPI as Core API
participant IRBuilder as IR Builder
participant LowerPass as CUDA Lowering Pass
participant CUDAEmitter as CUDA Emitter
User->>CoreAPI: call loadAligned<16>(ptr)
CoreAPI->>IRBuilder: emit IRLoad + AlignmentDecoration
IRBuilder->>LowerPass: provide IR with alignment metadata
LowerPass->>LowerPass: getOrCreate aligned-wrapper type
LowerPass->>LowerPass: bitcast ptr -> wrapped ptr and load wrapped struct
LowerPass->>LowerPass: extract field, replace uses (unwrap)
LowerPass->>CUDAEmitter: emit lowered IR
CUDAEmitter->>CUDAEmitter: detect AlignmentDecoration and emit __align__(N)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR adds support for generating CUDA code with proper alignment attributes for loadAligned calls. The implementation extends the existing immutable buffer load lowering pass to handle aligned loads by creating wrapper struct types with alignment decorations.
Changes:
- Extends
lowerImmutableBufferLoadForCUDAtolowerImmutableOrAlignedBufferLoadForCUDAto handle both immutable and aligned buffer loads for CUDA targets - Adds
AlignmentDecorationto the IR instruction system to represent alignment requirements on struct types - Implements wrapper type creation that adds
__align__attributes in generated CUDA code - Updates
loadAlignedsignature to acceptPtr<T, access, AddressSpace.Device>for better type flexibility
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/spirv/aligned-load-store.slang | Adds PTX target test expectations for aligned loads on CUDA backend |
| source/slang/slang-ir-insts.lua | Defines new AlignmentDecoration IR instruction |
| source/slang/slang-ir-insts.h | Adds getPtrOperand helper method to IRLoad and getPtrType overload to IRBuilder |
| source/slang/slang-ir-insts-stable-names.lua | Assigns stable ID (728) to AlignmentDecoration |
| source/slang/slang-ir-cuda-immutable-load.h | Renames function to reflect expanded functionality |
| source/slang/slang-ir-cuda-immutable-load.cpp | Implements aligned wrapper type creation and load rewriting logic |
| source/slang/slang-emit.cpp | Updates pass invocation to use renamed function |
| source/slang/slang-emit-cuda.h | Declares emitPostKeywordTypeAttributesImpl override |
| source/slang/slang-emit-cuda.cpp | Implements align attribute emission for types with AlignmentDecoration; includes minor formatting cleanup |
| source/slang/core.meta.slang | Updates __load_aligned and loadAligned signatures to support Ptr with access qualifiers |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/slang/slang-emit-cuda.cpp`:
- Around line 1275-1277: The three-line block assigning rowCount, colCount, and
matrixUse from coopMatType has formatting drift; reformat this block to match
the project's clang-format style (run the repo's formatting script or
clang-format) so spacing/indentation and casting align with surrounding code.
Locate the block that uses coopMatType and IRIntLit with getRowCount(),
getColumnCount(), and getMatrixUse() and re-run the formatter so the lines
assigning uint32_t rowCount, uint32_t colCount, and uint32_t matrixUse conform
to the project's style.
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 349-404: The FieldExtract is currently emitted before the load
because builder.setInsertBefore(load) is used; when needUnwrap is true we must
emit the extract after the load has been produced to avoid use-before-def.
Change the insertion point right before calling emitFieldExtract (e.g., call
builder.setInsertAfter(loadedValue) or builder.setInsertAfter(load) once
loadedValue is available) so the builder emits the extract after the load, then
perform the replaceOperand loop and/or inst replacement/remove as before
(symbols: builder, load, needUnwrap, loadedValue, emitFieldExtract, inst,
replaceUsesWith, removeAndDeallocate).
|
This PR modifies IR instruction definition files. Please review if you need to update the following constants in
These version numbers help ensure compatibility between different versions of compiled modules. |
|
/format |
|
🌈 Formatted, please merge the changes from this PR |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 349-405: Cache the root address of the load's pointer before you
insert the alignment bitcast so the immutable-location check runs against the
original root; specifically, compute auto rootAddr = getRootAddr(load->getPtr())
before calling getOrCreateAlignedWrapper/ builder.replaceOperand, then use
isPointerToImmutableLocation(rootAddr) (and, if needed, pass rootAddr into the
immutable-load path) instead of calling getRootAddr(load->getPtr()) after the
bitcast so emitImmutableLoad/ isPointerToImmutableLocation see the unmodified
root.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 82-85: The call to getNaturalSizeAndAlignment currently ignores
its Result return value causing naturalSizeAlignment to be potentially
uninitialized; modify the code around getNaturalSizeAndAlignment (and the local
IRSizeAndAlignment naturalSizeAlignment) to check the Result and handle failure:
if the call fails, do not use naturalSizeAlignment—either propagate the failure
up (return an error/Result from this function) or return a safe fallback (e.g.,
the original innerType) depending on the surrounding API contract, and adjust
callers if you choose to propagate the error.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 382-405: The code in the needUnwrap branch replaces all recorded
uses with the extracted field value but never removes the original load
instruction (inst), leaving an orphaned IR node; after you call
builder.replaceOperand for the saved uses (inside the needUnwrap block where you
set loadedValue = builder.emitFieldExtract(...)), check whether loadedValue !=
inst and if so call inst->replaceUsesWith(loadedValue) or simply
inst->removeAndDeallocate() as appropriate to fully remove the original load;
update the needUnwrap branch to mirror the cleanup behavior of the else-if that
handles loadedValue != inst so inst is deallocated when the unwrap path produced
a new value.
|
/format |
|
🌈 Formatted, please merge the changes from this PR |
Format code for PR shader-slang#10098
There was a problem hiding this comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@source/slang/slang-ir-cuda-immutable-load.cpp`:
- Around line 352-407: The unwrap branch currently replaces uses by iterating
uses and calling builder.replaceOperand but never removes the original inst,
leaving an orphaned IR node; update the needUnwrap branch after extracting the
field so that if the immutable lowering actually produced a different value
(i.e., loadedValue != inst) you call inst->replaceUsesWith(loadedValue) (or
ensure all uses are replaced) and then inst->removeAndDeallocate(); in short,
after the builder.emitFieldExtract and the loop that replaces uses, check
loadedValue != inst and call inst->replaceUsesWith(loadedValue) if needed and
then inst->removeAndDeallocate() so the original instruction is cleaned up
(referencing symbols: needUnwrap, emitImmutableLoad, loadedValue, inst,
builder.emitFieldExtract, builder.replaceOperand, replaceUsesWith,
removeAndDeallocate).
This change adds proper code generation for
loadAlignedcalls when emitting cuda code.This is implemented by extending the existing
lowerImmutableBufferLoadForCUDAtolowerImmutableOrAlignedBufferLoadForCUDA.In the pass, when we see a
load(ptr:T*, aligned(16)), we will produce astruct T_aligned16 { T value; }type that wraps aT, with a[Alignment(16)]decoration on the wrapper struct type. Then we rewrite the load toload(bit_cast<T_aligned16*>(ptr)).value. The cuda backend is extended to recognize the Alignment decoration and emit it as a__align(16)__attribute in the resulting cuda code.