remove all the usage of pointer to shared memory #10093

kaizhangNV wants to merge 3 commits into shader-slang:master
Conversation
📝 Walkthrough

Refactors group-shared memory from a pointer-based `SPtr` model to an offset-based `SharedMemoryPool` model; propagates a `ShMemPool` type through `MMAHelper` and the neural tests. Adds/updates HLSL metadata exposing CoopMat/CoopVec declarations and many cooperative-matrix/vector intrinsics. Test suites are updated to use `TestPool` offsets and `TestShMemSize`.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 3 passed.
Pull request overview
This PR refactors the neural module's shared memory management from pointer-based to offset-based access, enabling CooperativeMatrix operations to work correctly on Vulkan/SPIR-V targets where pointer manipulation in shared memory is restricted by the logical addressing model.
Changes:
- Removed pointer-based shared memory access (SPtr, __getAddress) and replaced with offset-based indexing
- Updated SharedMemoryPool to use extension pattern providing static groupshared data array
- Added CUDA support to CoopMat Load/Store operations for groupshared arrays with different element types
- Updated all MMAHelper functions to accept ShMemPool template parameter and use uint offsets instead of pointers
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| source/standard-modules/neural/shared-memory-pool.slang | Refactored SharedMemoryPool from pointer-returning to extension-based pattern with static groupshared data array |
| source/standard-modules/neural/accelerate-vector-coopmat.slang | Updated all MMAHelper functions to use offset-based access; changed signatures from SPtr to uint offsets; added ShMemPool template parameter |
| source/slang/hlsl.meta.slang | Added CUDA implementation to CoopMat Load/Store for groupshared arrays with different element types; removed "Vulkan only" comments; changed require attribute from cooperative_matrix_spirv to cooperative_matrix |
| tests/neural/*.slang (17 files) | Updated all test files to use SharedMemoryPool pattern with TestShMemSize struct, TestPool typealias, and offset constants instead of groupshared arrays and pointers |
Force-pushed from 8e3b32f to 525c03e.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/neural/mma-helper-test-single-warp-arbitrary-size.slang (1)

**103-134: Consider extracting `serialRead` to a common utility module.**

This debugging helper is duplicated across multiple test files with nearly identical implementations. Extracting it to `common.slang` would reduce maintenance burden.

🤖 Prompt for AI Agents:

Verify each finding against the current code and only fix it if needed. In `tests/neural/mma-helper-test-single-warp-arbitrary-size.slang` around lines 103-134, the `serialRead` debugging helper is duplicated; extract the function `serialRead(uint tid, uint sharedMemOffset, uint rowOrColumnCount)` into a shared module (e.g., `common.slang`) and replace the copies in the test files with an import of that module. Ensure the function signature and any referenced symbols (`TestPool`, `GroupMemoryBarrierWithGroupSync`, `printf`, `uint4`, `half`, `bit_cast`) remain unchanged, remove the duplicate definitions from each test (including this file), and update build/compile includes so all tests reference the new common utility.

tests/neural/mma-helper-test-transpose-single-warp.slang (1)

**121-152: Hardcoded loop iteration count may not match data type.**

The inner loop iterates 8 times regardless of `sizeof(T)`, which appears to assume `float` data (32 bytes per row). For `half` data (16 bytes per row), this would read beyond the intended data. Consider making the count dynamic, as in other `serialRead` implementations.

♻️ Suggested fix to compute iterations based on type size:

```diff
 void serialRead(uint tid, uint sharedMemOffset, uint rowOrColumnCount)
 {
     GroupMemoryBarrierWithGroupSync();
     if (tid > 0)
         return;
     int linearIndex = 0;
     for (int id = 0; id < rowOrColumnCount; id++)
     {
         printf("tid: %d\n", id);
-        for (int i = 0; i < 8; i++)
+        int numVectors = (sizeof(DType) == 2) ? 2 : 8; // 2 uint4 for half (16 bytes), 8 for float (32 bytes * 2)
+        for (int i = 0; i < numVectors; i++)
         {
             uint4 values = TestPool.data[sharedMemOffset + id * 8 + i];
```

```bash
#!/bin/bash
# Check how other serialRead implementations handle the iteration count
rg -n "for.*int i = 0.*i <" tests/neural/mma-helper-test-*.slang -A2 -B2 | head -60
```

🤖 Prompt for AI Agents:

Verify each finding against the current code and only fix it if needed. In `tests/neural/mma-helper-test-transpose-single-warp.slang` around lines 121-152, `serialRead` currently hardcodes the inner loop to iterate 8 times (`for i < 8`), which assumes 32-byte rows (e.g., `float` data) and will overread for 16-byte rows (`half` data). Compute the per-row `uint4` count dynamically from the element byte size instead of using 8: replace the fixed `for (int i = 0; i < 8; i++)` with a loop over an `elementsPerRow` count derived from `bytesPerRow / sizeof(uint4)` (where `bytesPerRow` depends on the element type, e.g., `sizeof(half)` vs `sizeof(float)` or a type-specific constant), and use that variable in the index arithmetic (`sharedMemOffset + id * 8 + i`) so both `half` and `float` layouts read the correct amount.
```
case cuda:
    if (matrixLayout == CoopMatMatrixLayout.RowMajor)
        __intrinsic_asm "$0.Store<Slang_CUDA_WMMA::RowMajor>(($1)->m_data + $2, $3)";
    else
        __intrinsic_asm "$0.Store<Slang_CUDA_WMMA::ColMajor>(($1)->m_data + $2, $3)";
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, verify the file and line numbers exist
echo "=== Checking file and line context ==="
sed -n '27403,27407p' source/slang/hlsl.meta.slang
echo -e "\n=== Checking surrounding context for Store method ==="
sed -n '27378,27410p' source/slang/hlsl.meta.slang
echo -e "\n=== Searching for Slang_CUDA_WMMA and WMMA usage ==="
rg -n "Slang_CUDA_WMMA" -g'*.slang' -g'*.cpp' -g'*.h' | head -30
echo -e "\n=== Searching for m_data field usage in groupshared context ==="
rg -n "m_data" -g'*.slang' -g'*.cpp' | head -40
echo -e "\n=== Searching for CooperativeMatrix or CoopMat ==="
rg -n "CooperativeMatrix|CoopMat" -g'*.slang' | head -50
```

Repository: shader-slang/slang
Length of output: 15187
🏁 Script executed:
```bash
#!/bin/bash
# Get the full Store method signatures to understand parameter types
echo "=== Store overload at lines 27269 (raw array) ==="
sed -n '27255,27275p' source/slang/hlsl.meta.slang
echo -e "\n=== Store overload at lines 27350 (wrapped type) ==="
sed -n '27330,27360p' source/slang/hlsl.meta.slang
echo -e "\n=== Store overload at lines 27405 (flagged) ==="
sed -n '27378,27410p' source/slang/hlsl.meta.slang
echo -e "\n=== Store overload at lines 27428 (another variant) ==="
sed -n '27410,27440p' source/slang/hlsl.meta.slang
echo -e "\n=== Check if there's a wrapper array type definition ==="
rg -n "struct.*Array|class.*Array" -A5 source/slang/hlsl.meta.slang | head -50
```

Repository: shader-slang/slang
Length of output: 7303
🏁 Script executed:
```bash
#!/bin/bash
# Check CUDA codegen for groupshared array handling and m_data wrapping
echo "=== Search CUDA codegen for groupshared/workgroup handling ==="
rg -n "groupshared|workgroup|AddressSpace::GroupShared" source/slang/slang-emit-cuda.cpp | head -30
echo -e "\n=== Search for array wrapping or m_data field generation in CUDA ==="
rg -n "m_data|wrapper.*array|Array.*wrapper" source/slang/slang-emit-cuda.cpp | head -30
echo -e "\n=== Check how Ptr types are lowered in CUDA ==="
rg -n "Ptr.*GroupShared|AddressSpace.*GroupShared" source/slang/slang-emit-cuda.cpp -A3 | head -50
echo -e "\n=== Check WMMA intrinsic signatures and expected parameter types ==="
sed -n '6595,6650p' prelude/slang-cuda-prelude.h
```

Repository: shader-slang/slang
Length of output: 1365
🏁 Script executed:
```bash
#!/bin/bash
# Get the WMMA Store/Load function signatures
echo "=== Search for Store function in WMMA prelude ==="
rg -n "\.Store\(|Store<" prelude/slang-cuda-prelude.h -A5 | head -60
echo -e "\n=== Search for Load function in WMMA prelude ==="
rg -n "\.Load\(|Load<" prelude/slang-cuda-prelude.h -A5 | head -60
echo -e "\n=== Check if there's a wrapper struct for groupshared arrays ==="
rg -n "struct.*GroupShared|class.*GroupShared" source/slang/ -g'*.slang' -g'*.cpp' -g'*.h' | head -20
echo -e "\n=== Check the Load overload context at line 27597 for comparison ==="
sed -n '27570,27610p' source/slang/hlsl.meta.slang
```

Repository: shader-slang/slang
Length of output: 3494
**CUDA path incorrectly assumes `m_data` field on raw groupshared arrays.**

The parameter `__ref groupshared U[V] data` at lines 27405–27407 is a raw array reference, but the CUDA intrinsic strings dereference `($1)->m_data`, which does not exist. The WMMA intrinsics expect a raw pointer, not a dereferenced field. This pattern appears systematically in multiple Store and Load overloads (lines 27350–27352, 27372–27374, 27405–27407 for Store; lines 27597–27599, 27653–27655 for Load). The SPIRV backend correctly treats these as raw arrays; the CUDA paths must align accordingly, or the intrinsic strings should be corrected to pass the raw array pointer directly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@source/slang/hlsl.meta.slang` around lines 27403 - 27407, The CUDA intrinsics
incorrectly dereference a non‑existent m_data field on the raw __ref groupshared
U[V] parameter (e.g. in the case cuda blocks for the Store and Load overloads
handling __ref groupshared U[V] data); update the intrinsic operand that
currently uses "($1)->m_data + $2" (and similar) to pass the raw pointer instead
(e.g. "($1) + $2" or "$1" as appropriate for the intrinsic) so the WMMA
"__intrinsic_asm" calls receive a raw pointer rather than accessing a
nonexistent m_data field; apply the same fix to all occurrences referenced in
the comment (Store overloads at the ranges around lines 27350–27352,
27372–27374, 27405–27407 and Load overloads around 27597–27599, 27653–27655).
```
    typealias MemSize = ShMemSize;
}

VISIBILITY_LEVEL extension<T : ISharedMemoryPool> T
```
why is this in an extension?
I want to access the shared memory array, but I can't define static groupshared inside an interface.
I'm not aware of another proper way to do this.
Because variable pointers are quite unstable on the driver side and could lead to unexpected behavior at run-time, we will remove the usage of pointers to shared memory in neural.slang.