groupshared improvement discussion #83

@jeremyong

Description

This is a general issue I wanted to file to encourage open discussion about future improvements to groupshared LDS/shmem usage and allocation.

Problem Discussion

This list is unlikely to be exhaustive, but here are the problems that are likely to crop up for users of groupshared memory (I will use groupshared, LDS, and shmem interchangeably).

  • LDS allocation usually scales linearly with the number of waves in a thread group. However, because WaveGetLaneCount is not a compile-time constant, the wave size cannot readily be used in a groupshared declaration. The current workaround is either to make (bad) assumptions about the wave size, or to produce multiple specializations of each shader and dispatch the one with a matching WaveSize at runtime. Neither option is ideal: the former results in brittle hardware-specific code, and the latter adds build and runtime complexity.
  • Currently, groupshared data must be declared in the global declaration context. This impedes code composability -- for example, if we include a header to use a single function defined in it, we may hurt occupancy by inadvertently dragging along that header's groupshared declarations. This is a real footgun in larger (and sometimes smaller) codebases, and it is difficult to detect when it occurs (or at least, it takes some work to understand why occupancy is lower than expected).
  • In the same vein of code composability, code that uses LDS must interact directly with the groupshared variable as declared, because there is no user-accessible ref function parameter qualifier. A function that operates on some input data and exports output data cannot always rely on "copy-in" and "copy-out" semantics, because the in and out qualifiers do not permit use of the various Interlocked* functions (which internally are modeled using ref-qualified parameters).
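
To make the first problem concrete, here is a minimal sketch of the "multiple specializations" workaround. The names WAVE_SIZE, GROUP_SIZE, g_wavePartials, g_input, and g_output are hypothetical and chosen for illustration; the sketch assumes Shader Model 6.6 (for the WaveSize attribute) and assumes that lanes in a 1D thread group are packed into waves in SV_GroupIndex order, which is not something the issue text itself states.

```hlsl
// Compile this file once per supported wave size, e.g. with
// -D WAVE_SIZE=32 and -D WAVE_SIZE=64, then pick the matching
// variant at runtime. WAVE_SIZE is a hypothetical macro name.
#ifndef WAVE_SIZE
#define WAVE_SIZE 32
#endif

#define GROUP_SIZE 256
#define WAVES_PER_GROUP (GROUP_SIZE / WAVE_SIZE)

// One slot per wave: the array length depends on the wave size,
// which is why it has to be baked in at compile time today.
groupshared uint g_wavePartials[WAVES_PER_GROUP];

StructuredBuffer<uint>   g_input  : register(t0);
RWStructuredBuffer<uint> g_output : register(u0);

[WaveSize(WAVE_SIZE)]
[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint  gi   : SV_GroupIndex,
            uint3 gid  : SV_GroupID)
{
    // Reduce within the wave first; no LDS needed for this step.
    uint waveSum = WaveActiveSum(g_input[dtid.x]);

    // Assumption: waves cover contiguous ranges of SV_GroupIndex.
    uint waveIndex = gi / WAVE_SIZE;
    if (WaveIsFirstLane())
    {
        g_wavePartials[waveIndex] = waveSum;
    }
    GroupMemoryBarrierWithGroupSync();

    // A single thread combines the per-wave partial sums.
    if (gi == 0)
    {
        uint total = 0;
        for (uint i = 0; i < WAVES_PER_GROUP; ++i)
        {
            total += g_wavePartials[i];
        }
        g_output[gid.x] = total;
    }
}
```

Every such shader needs one compiled variant per wave size plus runtime selection logic, which is the build and runtime complexity described above.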

Possible Solutions

If possible, I think there are a few things that would immediately improve quality-of-life for compute shader authors. These suggestions are written from the perspective of an ISV (the shader writer), and it's understood that other solutions may end up being preferable due to practicality, performance, ease-of-implementation, or all of the above from the perspective of a hardware vendor or DXC compiler implementer.

  1. Permit the use of WaveGetLaneCount in the declaration type for groupshared storage.
  2. Permit local variables to be declared as groupshared.
  3. Add a ref keyword that would permit use of Interlocked* intrinsics for ref-qualified parameters.

The first item would allow developers to conceptually treat WaveGetLaneCount as a constexpr function, whose value is realized only when a PSO is actually created at runtime. This has implications beyond LDS allocation, but would be a very useful tool in the toolbox for other use cases.

For the second item, because functions are currently still fully inlined, the total storage needed per thread group for a given compute shader should still be statically known, although DXIL may require modifications to properly alias types allocated from the virtual shmem pool. The idea here is that a static analysis pass would determine the amount of LDS memory needed in the "middle swell" of the program (its peak simultaneous usage), accounting for all possible branches taken where groupshared variables are declared.
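
As a rough illustration of what such a pass would compute, here is how the aliasing can be done by hand today: two phases that never need LDS at the same time share one global pool sized to the larger of the two, not their sum. This is a sketch under stated assumptions, not a proposal for the actual mechanism; PHASE_A_UINTS, PHASE_B_UINTS, g_pool, g_input, and g_output are hypothetical names.

```hlsl
#define GROUP_SIZE 64

// Hypothetical per-phase LDS requirements, in uints.
#define PHASE_A_UINTS 64
#define PHASE_B_UINTS 128

// The pool is sized for the peak (max), not the sum, because the
// two phases are never live at the same time.
#define POOL_UINTS (PHASE_A_UINTS > PHASE_B_UINTS ? PHASE_A_UINTS : PHASE_B_UINTS)

groupshared uint g_pool[POOL_UINTS];

StructuredBuffer<uint>   g_input  : register(t0);
RWStructuredBuffer<uint> g_output : register(u0);

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // Phase A: uses the first PHASE_A_UINTS entries of the pool.
    g_pool[gi] = g_input[dtid.x];
    GroupMemoryBarrierWithGroupSync();
    uint neighbor = g_pool[(gi + 1) % GROUP_SIZE];

    // All reads from phase A must finish before the pool is reused.
    GroupMemoryBarrierWithGroupSync();

    // Phase B: reuses the same storage with a different layout
    // (two entries per thread).
    g_pool[gi] = neighbor;
    g_pool[gi + GROUP_SIZE] = neighbor * 2;
    GroupMemoryBarrierWithGroupSync();

    uint mirrored = (GROUP_SIZE - 1) - gi;
    g_output[dtid.x] = g_pool[mirrored] + g_pool[mirrored + GROUP_SIZE];
}
```

With locally declared groupshared variables, the compiler could in principle derive POOL_UINTS and the aliasing itself, instead of the author maintaining both by hand.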

The counterargument to the second item is that requiring LDS usage to be statically known would constrain future HLSL in a world where true function calls are possible. At that point, one option would be to permit functions to allocate LDS (similar to alloca) using the same semantics as locally declared groupshared variables. The driver would need to be able to suspend thread groups if LDS isn't available, or possibly demote allocated LDS to slower VRAM (possibly from a fixed-size pool of reserved memory).

The last item addresses the ability to perform operations on memory in LDS, regardless of where or how that LDS memory was allocated.
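
A small sketch of the limitation this would lift, using hypothetical names (BIN_COUNT, g_histogram, AddToHistogram, g_keys, g_result). Today the helper has to name the global groupshared array directly, so it cannot be reused for a different LDS allocation.

```hlsl
#define BIN_COUNT 64

groupshared uint g_histogram[BIN_COUNT];

StructuredBuffer<uint>   g_keys   : register(t0);
RWStructuredBuffer<uint> g_result : register(u0);

// The helper is hard-wired to g_histogram. Passing the counter as an
// inout parameter instead would give copy-in/copy-out semantics, so the
// atomic would not target the LDS location itself (and, as far as I
// know, DXC rejects an atomic on such a parameter).
void AddToHistogram(uint bin)
{
    InterlockedAdd(g_histogram[bin], 1);
}

[numthreads(BIN_COUNT, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // One thread per bin clears the group-local histogram.
    g_histogram[gi] = 0;
    GroupMemoryBarrierWithGroupSync();

    AddToHistogram(g_keys[dtid.x] % BIN_COUNT);
    GroupMemoryBarrierWithGroupSync();

    // Merge the group-local histogram into the global one.
    InterlockedAdd(g_result[gi], g_histogram[gi]);
}
```

With a ref qualifier, AddToHistogram could instead accept the destination as a parameter and work with any groupshared array, wherever it was declared.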

All that said, my main goal is to encourage discussion, not to be overly prescriptive about the solutions. I think starting from a well-defined problem statement is likely step one.

    Status

    Triaged
