remove all the usage of pointer to shared memory #10093

Open
kaizhangNV wants to merge 3 commits into shader-slang:master from kaizhangNV:avoid_ptr

Conversation

@kaizhangNV
Contributor

@kaizhangNV kaizhangNV commented Feb 19, 2026

Because variable pointers are quite unstable on the driver side and can lead to unexpected behavior at run time, we will remove all usage of pointers to shared memory in neural.slang.
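
For illustration, a minimal sketch of the access-pattern change this implies (the pool name and size here are hypothetical; the real code lives under source/standard-modules/neural/):

// Hypothetical pool backing the tiles.
static const uint kPoolSize = 1024;
groupshared uint4 gPool[kPoolSize];

// Before (removed): a pointer into groupshared memory is formed and passed
// around, which relies on variable-pointer support in the Vulkan driver.
//   void consumeTile(SPtr<uint4> tile) { uint4 v = tile[0]; /* ... */ }

// After: only a uint offset is passed and every access is a plain array
// index, so no pointer into Workgroup storage is ever materialized.
void consumeTile(uint tileOffset)
{
    uint4 v = gPool[tileOffset];
    // ... consume v ...
}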

@kaizhangNV kaizhangNV requested a review from a team as a code owner February 19, 2026 22:00
@kaizhangNV kaizhangNV requested review from bmillsNV and Copilot and removed request for a team February 19, 2026 22:00
@coderabbitai

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Refactors group-shared memory from pointer-based SPtr to an offset-based SharedMemoryPool model; propagates a ShMemPool type through MMAHelper and neural tests. Adds/updates HLSL metadata exposing CoopMat/CoopVec declarations and many cooperative-matrix/vector intrinsics. Test suites updated to use TestPool offsets and TestShMemSize.

Changes

• HLSL metadata (Cooperative matrices)
  File(s): source/slang/hlsl.meta.slang
  Summary: Added extensive CoopMat/CoopVec declarations, constructors, conversions, transpose/reduce/map ops, packing/unpacking helpers, descriptor-handle interfaces, atomics, synchronization, and many target-specific intrinsics and SPIR‑V assembly hooks.

• MMA core refactor (shared memory → offsets + pool)
  File(s): source/standard-modules/neural/accelerate-vector-coopmat.slang
  Summary: Extended MMAHelper to accept ShMemPool; replaced SPtr<uint4> shared-memory parameters with uint offsets; all internal loads/stores now index ShMemPool.data[offset + ...]; propagated offsets through load/store, accumulation, and MMA call sites.

• Shared memory pool interface
  File(s): source/standard-modules/neural/shared-memory-pool.slang
  Summary: Reworked ISharedMemoryPool to expose an associatedtype MemSize; removed getPointer() and moved groupshared uint4 data[...] into a public extension keyed by pool type; adjusted ISharedMemorySize.Bytes visibility.

• Neural tests — adopt pooled shared memory
  File(s): tests/neural/bias-sum-reduce.slang, tests/neural/outerproduct-accumulate-test*.slang, tests/neural/tiled-mma-load-test*.slang, tests/neural/mma-helper-test-*.slang, tests/neural/...
  Summary: Added TestShMemSize types and TestPool aliases; introduced OffsetA/OffsetB constants; removed per-test groupshared arrays and SPtr usage; updated serialRead, verification helpers, and all MMA instantiations to accept and use pool offsets and TestPool (see the sketch after this list).

• Test variants (single/multi/transpose/unaligned)
  File(s): tests/neural/mma-helper-test-single-warp*.slang, tests/neural/mma-helper-test-multi-warps*.slang, tests/neural/mma-helper-test-transpose-*.slang, tests/neural/tiled-mma-load-test-*.slang
  Summary: Applied pool/offset changes uniformly across many permutations: changed function signatures (SPtr → uint offset), replaced direct shared-memory indexing with TestPool.data[sharedMemOffset + ...], and updated MMA template instantiations to include TestPool.
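
The test-side pattern summarized above looks roughly like the following sketch (the numeric sizes, the SharedMemoryPool generic argument, and the exact Bytes member are assumptions based on this summary):

struct TestShMemSize : ISharedMemorySize
{
    static const uint Bytes = 16384; // total groupshared budget for the test (illustrative)
}

typealias TestPool = SharedMemoryPool<TestShMemSize>;

// Offsets (in uint4 elements) carving the pool into per-matrix regions.
static const uint OffsetA = 0;
static const uint OffsetB = OffsetA + 512;

// Accesses then index the pool directly instead of going through a pointer:
//   uint4 v = TestPool.data[OffsetA + i];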

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through tiles and swapped each pointer for a mark,

pooled the shared carrots, hid them in the dark,
CoopMat hums, vectors skip in rows of light,
offsets trace the rhythm of the warp‑wide night,
a rabbit claps — the kernels dance until the dawn.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
• Title check: Passed. The title 'remove all the usage of pointer to shared memory' accurately reflects the main objective and primary changes throughout the changeset: replacing pointer-based shared memory access with offset-based pooling.
• Description check: Passed. The description clearly relates to the changeset by explaining the rationale (driver instability with variable pointers) for removing pointer-to-shared-memory usage, which aligns with the extensive refactoring visible in all modified files.
• Docstring coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


@kaizhangNV kaizhangNV added the pr: non-breaking PRs without breaking changes label Feb 19, 2026
@kaizhangNV kaizhangNV changed the title from "remove all the usage of pointer on Vulkan" to "remove all the usage of pointer to shared memory" Feb 19, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the neural module's shared memory management from pointer-based to offset-based access, enabling CooperativeMatrix operations to work correctly on Vulkan/SPIR-V targets where pointer manipulation in shared memory is restricted by the logical addressing model.

Changes:

  • Removed pointer-based shared memory access (SPtr, __getAddress) and replaced it with offset-based indexing
  • Updated SharedMemoryPool to use an extension pattern providing a static groupshared data array
  • Added CUDA support to CoopMat Load/Store operations for groupshared arrays with different element types
  • Updated all MMAHelper functions to accept a ShMemPool template parameter and use uint offsets instead of pointers (sketched below)
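
A hedged sketch of what that signature change might look like (the function name is illustrative; the real helpers are in accelerate-vector-coopmat.slang):

// Before:  void loadTile(SPtr<uint4> sharedMem, ...);
// After:   the pool is a generic parameter and only an offset is passed.
void loadTile<ShMemPool : ISharedMemoryPool>(uint sharedOffset)
{
    // The groupshared array is supplied through the pool type, so the body
    // indexes it directly rather than dereferencing a pointer.
    uint4 packed = ShMemPool.data[sharedOffset];
    // ... unpack and feed the cooperative-matrix load ...
}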

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.

• source/standard-modules/neural/shared-memory-pool.slang: Refactored SharedMemoryPool from a pointer-returning design to an extension-based pattern with a static groupshared data array.
• source/standard-modules/neural/accelerate-vector-coopmat.slang: Updated all MMAHelper functions to use offset-based access; changed signatures from SPtr to uint offsets; added the ShMemPool template parameter.
• source/slang/hlsl.meta.slang: Added a CUDA implementation to CoopMat Load/Store for groupshared arrays with different element types; removed "Vulkan only" comments; changed the require attribute from cooperative_matrix_spirv to cooperative_matrix.
• tests/neural/*.slang (17 files): Updated all test files to use the SharedMemoryPool pattern with a TestShMemSize struct, a TestPool typealias, and offset constants instead of groupshared arrays and pointers.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/neural/mma-helper-test-single-warp-arbitrary-size.slang (1)

103-134: Consider extracting serialRead to a common utility module.

This debugging helper is duplicated across multiple test files with nearly identical implementations. Extracting it to common.slang would reduce maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/neural/mma-helper-test-single-warp-arbitrary-size.slang` around lines
103 - 134, The serialRead debugging helper is duplicated; extract the function
serialRead(uint tid, uint sharedMemOffset, uint rowOrColumnCount) into a shared
module (e.g., common.slang) and replace the copies in test files with an
import/include of that module; ensure the function signature and any referenced
symbols (TestPool, GroupMemoryBarrierWithGroupSync, printf, uint4, half,
bit_cast) remain unchanged, remove the duplicate definitions from each test
(including mma-helper-test-single-warp-arbitrary-size.slang), and update
build/compile includes so all tests reference the new common utility.
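
A sketch of what such an extracted helper in common.slang might look like; the generic pool parameter is an assumption so each test can pass its own TestPool, and raw uint4 components are printed to keep the sketch short (the per-test originals also decode half values via bit_cast):

void serialRead<Pool : ISharedMemoryPool>(uint tid, uint sharedMemOffset, uint rowOrColumnCount)
{
    GroupMemoryBarrierWithGroupSync();
    if (tid > 0)
        return;

    for (uint id = 0; id < rowOrColumnCount; id++)
    {
        printf("row/col: %u\n", id);
        // The fixed count of 8 uint4 per row mirrors the current tests;
        // see the sizing note in the next comment.
        for (uint i = 0; i < 8; i++)
        {
            uint4 v = Pool.data[sharedMemOffset + id * 8 + i];
            printf("%u %u %u %u\n", v.x, v.y, v.z, v.w);
        }
    }
}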
tests/neural/mma-helper-test-transpose-single-warp.slang (1)

121-152: Hardcoded loop iteration count may not match data type.

The inner loop iterates 8 times regardless of sizeof(T), which appears to assume float data (32 bytes per row). For half data (16 bytes per row), this would read beyond the intended data. Consider making this dynamic similar to other serialRead implementations.

♻️ Suggested fix to compute iterations based on type size
 void serialRead(uint tid, uint sharedMemOffset, uint rowOrColumnCount)
 {
     GroupMemoryBarrierWithGroupSync();

     if (tid > 0)
         return;

     int linearIndex = 0;
     for (int id = 0; id < rowOrColumnCount; id++)
     {
         printf("tid: %d\n", id);
-        for (int i = 0; i < 8; i++)
+        int numVectors = (sizeof(DType) == 2) ? 2 : 8; // 2 uint4 for half (16 bytes), 8 for float (32 bytes * 2)
+        for (int i = 0; i < numVectors; i++)
         {
             uint4 values = TestPool.data[sharedMemOffset + id * 8 + i];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/neural/mma-helper-test-transpose-single-warp.slang` around lines 121 -
152, serialRead currently hardcodes the inner loop to iterate 8 times (for i <
8), which assumes 32-byte rows (e.g., float data) and will overread for 16-byte
rows (half data); update serialRead to compute the per-row uint4 count
dynamically based on the element byte size instead of using 8. Locate the
serialRead function and replace the fixed "for (int i = 0; i < 8; i++)" logic
(the loop using i and accessing TestPool.data[sharedMemOffset + id * 8 + i])
with a computed elementsPerRow derived from bytesPerRow / sizeof(uint4) (where
bytesPerRow depends on the element type, e.g., sizeof(half) vs sizeof(float) or
a runtime/type-specific constant), and use that elementsPerRow in the loop so
both half and float layouts read the correct amount. Ensure any index arithmetic
(sharedMemOffset + id * 8 + i) uses the new elementsPerRow variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.


Duplicate comments:
In `@source/slang/hlsl.meta.slang`:
- Around line 27651-27655: The CUDA Load path incorrectly assumes a field named
m_data like the Store path; update the two intrinsics in the case cuda branch
(the __intrinsic_asm calls for Slang_CUDA_WMMA::RowMajor and ::ColMajor) to use
the same safe buffer accessor used in the Store-path fix (i.e., replace
"($0)->m_data" with the canonical accessor/helper used previously such as the
getter or underlying buffer pointer expression from the Store-path), and apply
the change to both RowMajor and ColMajor lines so the code no longer assumes a
concrete m_data field.


Comment on lines +27403 to +27407
case cuda:
if (matrixLayout == CoopMatMatrixLayout.RowMajor)
__intrinsic_asm "$0.Store<Slang_CUDA_WMMA::RowMajor>(($1)->m_data + $2, $3)";
else
__intrinsic_asm "$0.Store<Slang_CUDA_WMMA::ColMajor>(($1)->m_data + $2, $3)";

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, verify the file and line numbers exist
echo "=== Checking file and line context ==="
sed -n '27403,27407p' source/slang/hlsl.meta.slang

echo -e "\n=== Checking surrounding context for Store method ==="
sed -n '27378,27410p' source/slang/hlsl.meta.slang

echo -e "\n=== Searching for Slang_CUDA_WMMA and WMMA usage ==="
rg -n "Slang_CUDA_WMMA" -g'*.slang' -g'*.cpp' -g'*.h' | head -30

echo -e "\n=== Searching for m_data field usage in groupshared context ==="
rg -n "m_data" -g'*.slang' -g'*.cpp' | head -40

echo -e "\n=== Searching for CooperativeMatrix or CoopMat ==="
rg -n "CooperativeMatrix|CoopMat" -g'*.slang' | head -50

Repository: shader-slang/slang

Length of output: 15187


🏁 Script executed:

#!/bin/bash
# Get the full Store method signatures to understand parameter types
echo "=== Store overload at lines 27269 (raw array) ==="
sed -n '27255,27275p' source/slang/hlsl.meta.slang

echo -e "\n=== Store overload at lines 27350 (wrapped type) ==="
sed -n '27330,27360p' source/slang/hlsl.meta.slang

echo -e "\n=== Store overload at lines 27405 (flagged) ==="
sed -n '27378,27410p' source/slang/hlsl.meta.slang

echo -e "\n=== Store overload at lines 27428 (another variant) ==="
sed -n '27410,27440p' source/slang/hlsl.meta.slang

echo -e "\n=== Check if there's a wrapper array type definition ==="
rg -n "struct.*Array|class.*Array" -A5 source/slang/hlsl.meta.slang | head -50

Repository: shader-slang/slang

Length of output: 7303


🏁 Script executed:

#!/bin/bash
# Check CUDA codegen for groupshared array handling and m_data wrapping
echo "=== Search CUDA codegen for groupshared/workgroup handling ==="
rg -n "groupshared|workgroup|AddressSpace::GroupShared" source/slang/slang-emit-cuda.cpp | head -30

echo -e "\n=== Search for array wrapping or m_data field generation in CUDA ==="
rg -n "m_data|wrapper.*array|Array.*wrapper" source/slang/slang-emit-cuda.cpp | head -30

echo -e "\n=== Check how Ptr types are lowered in CUDA ==="
rg -n "Ptr.*GroupShared|AddressSpace.*GroupShared" source/slang/slang-emit-cuda.cpp -A3 | head -50

echo -e "\n=== Check WMMA intrinsic signatures and expected parameter types ==="
sed -n '6595,6650p' prelude/slang-cuda-prelude.h

Repository: shader-slang/slang

Length of output: 1365


🏁 Script executed:

#!/bin/bash
# Get the WMMA Store/Load function signatures
echo "=== Search for Store function in WMMA prelude ==="
rg -n "\.Store\(|Store<" prelude/slang-cuda-prelude.h -A5 | head -60

echo -e "\n=== Search for Load function in WMMA prelude ==="
rg -n "\.Load\(|Load<" prelude/slang-cuda-prelude.h -A5 | head -60

echo -e "\n=== Check if there's a wrapper struct for groupshared arrays ==="
rg -n "struct.*GroupShared|class.*GroupShared" source/slang/ -g'*.slang' -g'*.cpp' -g'*.h' | head -20

echo -e "\n=== Check the Load overload context at line 27597 for comparison ==="
sed -n '27570,27610p' source/slang/hlsl.meta.slang

Repository: shader-slang/slang

Length of output: 3494


CUDA path incorrectly assumes m_data field on raw groupshared arrays.

The parameter __ref groupshared U[V] data at line 27405–27407 is a raw array reference, but the CUDA intrinsic strings dereference ($1)->m_data, which does not exist. The WMMA intrinsics expect a raw pointer, not a dereferenced field. This pattern appears systematically in multiple Store and Load overloads (lines 27350–27352, 27372–27374, 27405–27407 for Store; lines 27597–27599, 27653–27655 for Load). The SPIRV backend correctly treats these as raw arrays; the CUDA paths must align accordingly or the intrinsic strings should be corrected to pass the raw array pointer directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/slang/hlsl.meta.slang` around lines 27403 - 27407, The CUDA intrinsics
incorrectly dereference a non‑existent m_data field on the raw __ref groupshared
U[V] parameter (e.g. in the case cuda blocks for the Store and Load overloads
handling __ref groupshared U[V] data); update the intrinsic operand that
currently uses "($1)->m_data + $2" (and similar) to pass the raw pointer instead
(e.g. "($1) + $2" or "$1" as appropriate for the intrinsic) so the WMMA
"__intrinsic_asm" calls receive a raw pointer rather than accessing a
nonexistent m_data field; apply the same fix to all occurrences referenced in
the comment (Store overloads at the ranges around lines 27350–27352,
27372–27374, 27405–27407 and Load overloads around 27597–27599, 27653–27655).
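
For reference, the correction being requested would look roughly like this; the exact operand spelling depends on how the CUDA emitter lowers the __ref groupshared U[V] parameter:

case cuda:
    if (matrixLayout == CoopMatMatrixLayout.RowMajor)
        __intrinsic_asm "$0.Store<Slang_CUDA_WMMA::RowMajor>(($1) + $2, $3)";
    else
        __intrinsic_asm "$0.Store<Slang_CUDA_WMMA::ColMajor>(($1) + $2, $3)";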

typealias MemSize = ShMemSize;
}

VISIBILITY_LEVEL extension<T : ISharedMemoryPool> T
Collaborator


why is this in an extension?

Contributor Author


I want to access the shared memory array, but I can't define static groupshared inside an interface.

Contributor Author


I'm not aware of another proper way to do this.
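
For readers following the thread, a rough sketch of the pattern being described (simplified; the actual declarations in shared-memory-pool.slang use a VISIBILITY_LEVEL macro and may differ in detail):

interface ISharedMemorySize
{
    static const uint Bytes;
}

interface ISharedMemoryPool
{
    associatedtype MemSize : ISharedMemorySize;
}

// A static groupshared array cannot be declared inside an interface, so the
// backing storage is supplied by an extension keyed on the pool type: each
// concrete pool type T gets its own array sized from MemSize.
extension<T : ISharedMemoryPool> T
{
    static groupshared uint4 data[T.MemSize.Bytes / 16]; // 16 bytes per uint4
}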
