remove all the usage of pointer to shared memory #10093

kaizhangNV wants to merge 3 commits into shader-slang:master
Conversation
📝 Walkthrough

Refactors group-shared memory from a pointer-based `SPtr` model to an offset-based `SharedMemoryPool` model; propagates a `ShMemPool` type through `MMAHelper` and the neural tests. Adds/updates HLSL metadata exposing CoopMat/CoopVec declarations and many cooperative-matrix/vector intrinsics. Test suites are updated to use `TestPool` offsets and `TestShMemSize`.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 3 passed.
Pull request overview
This PR refactors the neural module's shared memory management from pointer-based to offset-based access, enabling CooperativeMatrix operations to work correctly on Vulkan/SPIR-V targets where pointer manipulation in shared memory is restricted by the logical addressing model.
Changes:
- Removed pointer-based shared memory access (SPtr, __getAddress) and replaced with offset-based indexing
- Updated SharedMemoryPool to use extension pattern providing static groupshared data array
- Added CUDA support to CoopMat Load/Store operations for groupshared arrays with different element types
- Updated all MMAHelper functions to accept ShMemPool template parameter and use uint offsets instead of pointers
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| source/standard-modules/neural/shared-memory-pool.slang | Refactored SharedMemoryPool from pointer-returning to extension-based pattern with static groupshared data array |
| source/standard-modules/neural/accelerate-vector-coopmat.slang | Updated all MMAHelper functions to use offset-based access; changed signatures from SPtr to uint offsets; added ShMemPool template parameter |
| source/slang/hlsl.meta.slang | Added CUDA implementation to CoopMat Load/Store for groupshared arrays with different element types; removed "Vulkan only" comments; changed require attribute from cooperative_matrix_spirv to cooperative_matrix |
| tests/neural/*.slang (17 files) | Updated all test files to use SharedMemoryPool pattern with TestShMemSize struct, TestPool typealias, and offset constants instead of groupshared arrays and pointers |
Force-pushed from 8e3b32f to 525c03e.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/neural/mma-helper-test-single-warp-arbitrary-size.slang (1)

**103-134: Consider extracting `serialRead` to a common utility module.**

This debugging helper is duplicated across multiple test files with nearly identical implementations. Extracting it to `common.slang` would reduce maintenance burden.

🤖 Prompt for AI Agents:

Verify each finding against the current code and only fix it if needed. In `tests/neural/mma-helper-test-single-warp-arbitrary-size.slang` around lines 103-134, the `serialRead` debugging helper is duplicated; extract the function `serialRead(uint tid, uint sharedMemOffset, uint rowOrColumnCount)` into a shared module (e.g., `common.slang`) and replace the copies in the test files with an import of that module. Ensure the function signature and any referenced symbols (`TestPool`, `GroupMemoryBarrierWithGroupSync`, `printf`, `uint4`, `half`, `bit_cast`) remain unchanged, remove the duplicate definitions from each test (including this file), and update build/compile includes so all tests reference the new common utility.

tests/neural/mma-helper-test-transpose-single-warp.slang (1)

**121-152: Hardcoded loop iteration count may not match data type.**

The inner loop iterates 8 times regardless of `sizeof(T)`, which appears to assume `float` data (32 bytes per row). For `half` data (16 bytes per row), this would read beyond the intended data. Consider making the count dynamic, as in other `serialRead` implementations.

♻️ Suggested fix to compute iterations based on type size:

```diff
 void serialRead(uint tid, uint sharedMemOffset, uint rowOrColumnCount)
 {
     GroupMemoryBarrierWithGroupSync();
     if (tid > 0)
         return;
     int linearIndex = 0;
     for (int id = 0; id < rowOrColumnCount; id++)
     {
         printf("tid: %d\n", id);
-        for (int i = 0; i < 8; i++)
+        int numVectors = (sizeof(DType) == 2) ? 2 : 8; // 2 uint4 for half (16 bytes), 8 for float (32 bytes * 2)
+        for (int i = 0; i < numVectors; i++)
         {
             uint4 values = TestPool.data[sharedMemOffset + id * 8 + i];
```

```bash
#!/bin/bash
# Check how other serialRead implementations handle the iteration count
rg -n "for.*int i = 0.*i <" tests/neural/mma-helper-test-*.slang -A2 -B2 | head -60
```

🤖 Prompt for AI Agents:

Verify each finding against the current code and only fix it if needed. In `tests/neural/mma-helper-test-transpose-single-warp.slang` around lines 121-152, `serialRead` currently hardcodes the inner loop to iterate 8 times (`for i < 8`), which assumes 32-byte rows (e.g., `float` data) and will overread for 16-byte rows (`half` data). Compute the per-row `uint4` count dynamically from the element byte size instead of using 8: replace the fixed `for (int i = 0; i < 8; i++)` with a loop over an `elementsPerRow` count derived from `bytesPerRow / sizeof(uint4)` (where `bytesPerRow` depends on the element type, e.g., `sizeof(half)` vs `sizeof(float)` or a type-specific constant), and use that variable in the index arithmetic (`sharedMemOffset + id * 8 + i`) so both `half` and `float` layouts read the correct amount.
```
case cuda:
    if (matrixLayout == CoopMatMatrixLayout.RowMajor)
        __intrinsic_asm "$0.Store<Slang_CUDA_WMMA::RowMajor>(($1)->m_data + $2, $3)";
    else
        __intrinsic_asm "$0.Store<Slang_CUDA_WMMA::ColMajor>(($1)->m_data + $2, $3)";
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# First, verify the file and line numbers exist
echo "=== Checking file and line context ==="
sed -n '27403,27407p' source/slang/hlsl.meta.slang
echo -e "\n=== Checking surrounding context for Store method ==="
sed -n '27378,27410p' source/slang/hlsl.meta.slang
echo -e "\n=== Searching for Slang_CUDA_WMMA and WMMA usage ==="
rg -n "Slang_CUDA_WMMA" -g'*.slang' -g'*.cpp' -g'*.h' | head -30
echo -e "\n=== Searching for m_data field usage in groupshared context ==="
rg -n "m_data" -g'*.slang' -g'*.cpp' | head -40
echo -e "\n=== Searching for CooperativeMatrix or CoopMat ==="
rg -n "CooperativeMatrix|CoopMat" -g'*.slang' | head -50
```

Repository: shader-slang/slang
Length of output: 15187
🏁 Script executed:
```bash
#!/bin/bash
# Get the full Store method signatures to understand parameter types
echo "=== Store overload at lines 27269 (raw array) ==="
sed -n '27255,27275p' source/slang/hlsl.meta.slang
echo -e "\n=== Store overload at lines 27350 (wrapped type) ==="
sed -n '27330,27360p' source/slang/hlsl.meta.slang
echo -e "\n=== Store overload at lines 27405 (flagged) ==="
sed -n '27378,27410p' source/slang/hlsl.meta.slang
echo -e "\n=== Store overload at lines 27428 (another variant) ==="
sed -n '27410,27440p' source/slang/hlsl.meta.slang
echo -e "\n=== Check if there's a wrapper array type definition ==="
rg -n "struct.*Array|class.*Array" -A5 source/slang/hlsl.meta.slang | head -50
```

Repository: shader-slang/slang
Length of output: 7303
🏁 Script executed:
```bash
#!/bin/bash
# Check CUDA codegen for groupshared array handling and m_data wrapping
echo "=== Search CUDA codegen for groupshared/workgroup handling ==="
rg -n "groupshared|workgroup|AddressSpace::GroupShared" source/slang/slang-emit-cuda.cpp | head -30
echo -e "\n=== Search for array wrapping or m_data field generation in CUDA ==="
rg -n "m_data|wrapper.*array|Array.*wrapper" source/slang/slang-emit-cuda.cpp | head -30
echo -e "\n=== Check how Ptr types are lowered in CUDA ==="
rg -n "Ptr.*GroupShared|AddressSpace.*GroupShared" source/slang/slang-emit-cuda.cpp -A3 | head -50
echo -e "\n=== Check WMMA intrinsic signatures and expected parameter types ==="
sed -n '6595,6650p' prelude/slang-cuda-prelude.h
```

Repository: shader-slang/slang
Length of output: 1365
🏁 Script executed:
```bash
#!/bin/bash
# Get the WMMA Store/Load function signatures
echo "=== Search for Store function in WMMA prelude ==="
rg -n "\.Store\(|Store<" prelude/slang-cuda-prelude.h -A5 | head -60
echo -e "\n=== Search for Load function in WMMA prelude ==="
rg -n "\.Load\(|Load<" prelude/slang-cuda-prelude.h -A5 | head -60
echo -e "\n=== Check if there's a wrapper struct for groupshared arrays ==="
rg -n "struct.*GroupShared|class.*GroupShared" source/slang/ -g'*.slang' -g'*.cpp' -g'*.h' | head -20
echo -e "\n=== Check the Load overload context at line 27597 for comparison ==="
sed -n '27570,27610p' source/slang/hlsl.meta.slang
```

Repository: shader-slang/slang
Length of output: 3494
**CUDA path incorrectly assumes `m_data` field on raw groupshared arrays.**

The parameter `__ref groupshared U[V] data` at lines 27405–27407 is a raw array reference, but the CUDA intrinsic strings dereference `($1)->m_data`, which does not exist. The WMMA intrinsics expect a raw pointer, not a dereferenced field. This pattern appears systematically in multiple Store and Load overloads (lines 27350–27352, 27372–27374, 27405–27407 for Store; lines 27597–27599, 27653–27655 for Load). The SPIRV backend correctly treats these as raw arrays; the CUDA paths must align accordingly, or the intrinsic strings should be corrected to pass the raw array pointer directly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@source/slang/hlsl.meta.slang` around lines 27403 - 27407, The CUDA intrinsics
incorrectly dereference a non‑existent m_data field on the raw __ref groupshared
U[V] parameter (e.g. in the case cuda blocks for the Store and Load overloads
handling __ref groupshared U[V] data); update the intrinsic operand that
currently uses "($1)->m_data + $2" (and similar) to pass the raw pointer instead
(e.g. "($1) + $2" or "$1" as appropriate for the intrinsic) so the WMMA
"__intrinsic_asm" calls receive a raw pointer rather than accessing a
nonexistent m_data field; apply the same fix to all occurrences referenced in
the comment (Store overloads at the ranges around lines 27350–27352,
27372–27374, 27405–27407 and Load overloads around 27597–27599, 27653–27655).
```
    typealias MemSize = ShMemSize;
}

VISIBILITY_LEVEL extension<T : ISharedMemoryPool> T
```
why is this in an extension?
I want to access the shared memory array, but I can't define static groupshared inside an interface.
I'm not aware of another proper way to do this.
Because variable pointers are quite unstable on the driver side and could lead to unexpected behavior at run-time, we will remove the usage of pointers to shared memory in neural.slang.