Description
This is a general issue I wanted to file to encourage open discussion about future improvements to groupshared LDS/shmem usage and allocation.
Problem Discussion
This list is unlikely to be exhaustive, but here are the problems that users of groupshared memory are likely to run into (I will use groupshared, LDS, and shmem interchangeably).
- LDS allocation usually scales linearly with the number of waves in a thread group. However, because `WaveGetLaneCount` is not a compile-time constant, the wave size cannot readily be used in a `groupshared` declaration. The current workaround is either to make (bad) assumptions about the wave size, or to produce multiple specializations of each shader and dispatch the correct one with a matching `WaveSize` at runtime. Neither option is ideal: the former results in brittle hardware-specific code, and the latter in build and runtime complexity.
- Currently, `groupshared` data must be declared in the global declaration context. This impedes code composability. For example, if we include a header to use a function defined in that header, we may hurt occupancy by inadvertently dragging along `groupshared` declarations. This is a real footgun in larger (and sometimes smaller) codebases, and it is difficult to detect when it occurs (or at least, it takes some work to understand why occupancy is lower than expected).
- In the same vein of code composability, code that uses LDS memory must interact directly with the `groupshared` variable as declared, due to the lack of a user-accessible `ref` function parameter qualifier. A function that operates over some input data and exports output data cannot always rely on "copy-in" and "copy-out" semantics, because the `in` and `out` semantics do not permit usage of the various `Interlocked*` functions (which internally are modeled using `ref`-qualified parameters).
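To make the first problem concrete, here is a minimal sketch of the arithmetic that currently has to be baked into each shader specialization. It is written in Python purely as a model, not as HLSL. The thread group size, the one-slot-per-wave layout, and all function names are assumptions chosen for illustration.

```python
# Hypothetical model: each wave in the thread group owns one 4-byte LDS slot,
# as in a typical two-level reduction. None of these names come from HLSL.
THREADS_PER_GROUP = 256   # assumed thread group size
BYTES_PER_SLOT = 4        # assumed: one 32-bit partial result per wave


def waves_per_group(threads_per_group: int, wave_size: int) -> int:
    """Number of waves needed to cover the group (ceiling division)."""
    return -(-threads_per_group // wave_size)


def lds_bytes(threads_per_group: int, wave_size: int) -> int:
    """LDS bytes needed when every wave owns exactly one slot."""
    return waves_per_group(threads_per_group, wave_size) * BYTES_PER_SLOT


# A few example wave sizes; each one implies a different array length,
# so one fixed-size groupshared declaration cannot be exact for all of them.
requirements = {ws: lds_bytes(THREADS_PER_GROUP, ws) for ws in (16, 32, 64)}

# Sizing for the smallest wave size is safe everywhere but over-allocates
# on hardware that runs wider waves.
worst_case = max(requirements.values())
```

Under these assumptions the requirement ranges from 16 bytes (wave size 64) to 64 bytes (wave size 16), which is the gap that forces a choice between guessing and specializing.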
Possible Solutions
If possible, I think there are a few things that would immediately improve quality-of-life for compute shader authors. These suggestions are written from the perspective of an ISV (the shader writer), and it's understood that other solutions may end up being preferable due to practicality, performance, ease-of-implementation, or all of the above from the perspective of a hardware vendor or DXC compiler implementer.
- Permit the use of `WaveGetLaneCount` in the declaration type for `groupshared` storage.
- Permit local variables to be declared as `groupshared`.
- Add a `ref` keyword that would permit use of `Interlocked*` intrinsics for ref-qualified parameters.
The first item would allow developers to conceptually treat WaveGetLaneCount as a constexpr function, whose value is realized only when a PSO is actually created at runtime. This has implications beyond LDS allocation, but would be a very useful tool in the toolbox for other use cases.
For the second item, because functions are currently still fully inlined, the total storage needed per thread group for a given compute shader should still be statically known, although DXIL may require modifications to properly alias types allocated from the virtual shmem pool. The idea here is that a static analysis pass would determine the amount of LDS memory needed in the "middle swell" of the program (its point of peak usage), accounting for all possible branches taken where `groupshared` variables are declared.
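One way to picture such a pass is as a walk over the shader's scope tree after inlining. The following is a minimal Python sketch under stated assumptions, not a description of anything DXC does. The `Scope` type, its fields, and `peak_lds_bytes` are hypothetical. It assumes a local `groupshared` variable stays live for its entire enclosing scope, and that sibling scopes may alias only when they are never live at the same time (which in practice would depend on barrier placement and on whether threads of one group can diverge into different branches).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scope:
    """Hypothetical lexical scope in a fully inlined shader."""
    local_bytes: int = 0  # groupshared bytes declared directly in this scope
    children: List["Scope"] = field(default_factory=list)
    # True: child scopes are never live simultaneously, so they may alias.
    # False: threads of one group may occupy several children at once,
    # so their storage must be summed (the conservative choice).
    children_exclusive: bool = True


def peak_lds_bytes(scope: Scope) -> int:
    """Peak LDS bytes live at any point inside `scope`."""
    child_peaks = [peak_lds_bytes(child) for child in scope.children]
    if not child_peaks:
        nested = 0
    elif scope.children_exclusive:
        nested = max(child_peaks)   # siblings alias the same storage
    else:
        nested = sum(child_peaks)   # siblings may be live together
    return scope.local_bytes + nested


# Example: 64 bytes at the top level, then two sibling scopes, the second
# of which contains a further nested scope.
program = Scope(64, [Scope(128), Scope(32, [Scope(256)])])
peak = peak_lds_bytes(program)
```

The choice between `max` and `sum` is the crux: aliasing sibling scopes gives the smaller footprint, but it is only valid if the compiler can prove that no two of them hold live data at once.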
The counterargument to the second item is that relying on a statically known LDS footprint would constrain future HLSL code in a world where real (non-inlined) function calls are possible. At that point, one option would be to permit functions to allocate LDS (similar to `alloca`) using the same semantics as locally declared `groupshared` variables. The driver would need to be able to suspend thread groups if LDS isn't available, or possibly demote allocated LDS to slower VRAM (possibly from a fixed-size pool of reserved memory).
The last item addresses the ability to perform operations on memory in LDS, regardless of where or how that LDS memory was allocated.
All that said, my main goal is to encourage discussion, not to be overly prescriptive about the solutions. I think starting from a well-defined problem statement is likely step one.