Commit cb0c994
Choke on the Docs
1 parent 1bab309 commit cb0c994
1 file changed: +76 −8

include/nbl/video/utilities/CScanner.h
@@ -23,6 +23,8 @@ Utility class to help you perform the equivalent of `std::inclusive_scan` and `s
The basic building block is a Blelloch-Scan, the `nbl_glsl_workgroup{Add/Mul/And/Xor/Or/Min/Max}{Exclusive/Inclusive}`:
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
https://classes.engineering.wustl.edu/cse231/core/index.php/Scan
+Also referred to as an "Upsweep-Downsweep Scan" because it computes Reductions hierarchically until only one block is left,
+then does Prefix Sums and propagates the results into more blocks until we're back at one block element per input element.

The workgroup scan is itself probably built out of Hillis-Steele subgroup scans; we use `KHR_shader_subgroup_arithmetic` whenever available,
but fall back to our own "software" emulation of subgroup arithmetic using Hillis-Steele and some scratch shared memory.
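For illustration only, a minimal sketch of what such a Hillis-Steele emulation over scratch shared memory could look like (`scratch`, `WORKGROUP_SIZE` and the function name are made-up identifiers, not the actual `nbl_glsl_` ones):

```glsl
#define WORKGROUP_SIZE 256
layout(local_size_x=WORKGROUP_SIZE) in;

shared uint scratch[WORKGROUP_SIZE];

uint workgroupInclusiveAddSketch(in uint value)
{
	const uint ix = gl_LocalInvocationIndex;
	scratch[ix] = value;
	barrier();
	// Hillis-Steele: on every pass each invocation adds the partial sum `stride` slots below itself
	for (uint stride=1u; stride<WORKGROUP_SIZE; stride<<=1u)
	{
		const uint sum = ix>=stride ? (scratch[ix]+scratch[ix-stride]):scratch[ix];
		barrier(); // everyone must have read before anyone writes
		scratch[ix] = sum;
		barrier();
	}
	return scratch[ix];
}
```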
@@ -35,7 +37,7 @@ The scheduling relies on two principles:
- Virtual and Persistent Workgroups
- Atomic Counters as Semaphores

-# Virtual Workgroups TODO: Move this Paragraph somewhere else.
+## Virtual Workgroups TODO: Move this Paragraph somewhere else.
Generally speaking, launching a new workgroup has non-trivial overhead.

Also most IHVs, especially AMD, have silly limits on the ranges of dispatches (like 64k workgroups), which also apply to 1D dispatches.
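The textbook remedy, sketched below with a hypothetical `processVirtualWorkgroup` and push constant: dispatch only as many physical workgroups as the GPU can keep resident, and let each one loop over many virtual workgroup indices.

```glsl
layout(push_constant) uniform PC { uint elementCount; } pc; // hypothetical push constant
void processVirtualWorkgroup(in uint virtualIx); // hypothetical per-block work

void main()
{
	const uint virtualWorkgroupCount = (pc.elementCount-1u)/WORKGROUP_SIZE+1u;
	// grid-stride loop: each physical workgroup services many virtual ones
	for (uint virtualIx=gl_WorkGroupID.x; virtualIx<virtualWorkgroupCount; virtualIx+=gl_NumWorkGroups.x)
		processVirtualWorkgroup(virtualIx);
}
```

Nabla's utilities assign the `virtualWorkgroupIndex` dynamically with an atomic instead, as explained next.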
@@ -70,7 +72,7 @@ atomicMax(nextWorkgroup,gl_GlobalInvocationID.x+1);
```
has the potential to deadlock and TDR your GPU.

-However if you use a global counter of dispatched workgroups in an SSBO and `atomicAdd` to assign the `virtualWorkgroupIndex`
+However if you use such an atomic to assign the `virtualWorkgroupIndex` in lieu of spinning
```glsl
uint virtualWorkgroupIndex;
while ((virtualWorkgroupIndex=atomicAdd(nextWorkgroup,1u))<virtualWorkgroupCount)
@@ -80,7 +82,66 @@ for ((virtualWorkgroupIndex=atomicAdd(nextWorkgroup,1u))<virtualWorkgroupCount)
```
the ordering of starting work is now enforced (still won't guarantee the order of completion).

-# Atomic Counters as Semaphores
+## Atomic Counters as Semaphores
+To scan arbitrarily large arrays, we already use Virtual Workgroups.
+
+For improved cache coherence and more bandwidth on the higher level reduction and scan blocks,
+it's best to use a temporary scratch buffer roughly of size `O(2 log_{WorkgroupSize}(n))`.
+
+We can however turn the BrainWorm(TM) up to 11 and do the whole scan in a single dispatch.
+
+First, we assign a Linear Index to every scan block (workgroup), using the trick outlined in the Virtual Workgroups section,
+such that if executed serially, **the lower index block would have finished before any higher index block.**
+https://developer.nvidia.com/sites/all/modules/custom/gpugems/books/GPUGems3/elementLinks/39fig06.jpg
+It would also be useful to keep a table that lets us map the Linear Index to its scan level.
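One way such a lookup could work, as a sketch: `cumulativeWorkgroupCount` mirrors the identically named member of `SchedulerParameters` further down in this header (which is `inclusive_scan`'d over the per-level workgroup counts); the binding and the linear search are illustrative.

```glsl
layout(push_constant) uniform PC { uint cumulativeWorkgroupCount[NBL_BUILTIN_MAX_SCAN_LEVELS]; } pc; // hypothetical binding

uint getLevelSketch(in uint linearWorkgroupIndex)
{
	uint level = 0u;
	// cumulativeWorkgroupCount[i] holds the total workgroup count over levels 0..i
	while (linearWorkgroupIndex>=pc.cumulativeWorkgroupCount[level])
		level++;
	return level;
}
```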
+Then we use a little bit more scratch for some atomic counters.
+
+A naive scheduler would have one atomic counter per upsweep/downsweep level, incremented AFTER a workgroup
+finishes its scan block and writes its outputs; this tells us how many workgroups have completed at that level so far.
+
+We could then figure out the scan level from a workgroup's Linear Index and spinwait until the atomic counter mapped to the
+previous scan level tells us all of its workgroups have completed.
+
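A hedged sketch of that naive scheme (the SSBO name and binding are made up for illustration; only the spin and the signal matter):

```glsl
layout(set=0,binding=0) coherent buffer Scheduler { uint finishedAtLevel[]; }; // hypothetical layout

void naiveWaitForLevel(in uint previousLevel, in uint workgroupsInPreviousLevel)
{
	// spinwait; atomicAdd of 0 is just an atomic read
	while (atomicAdd(finishedAtLevel[previousLevel],0u)<workgroupsInPreviousLevel) {}
}

// ...and AFTER a workgroup has written all of its outputs for `level`:
// memoryBarrierBuffer();
// atomicAdd(finishedAtLevel[level],1u);
```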
+HOWEVER, this naive approach is really no better than separate dispatches with pipeline barriers in the middle.
+
+Subsequently we turn the BrainWorm(TM) up to 12 and notice that we don't really need to wait on an entire previous level,
+just on the workgroups that produce the data the current one will process.
+
+So there's one atomic per workgroup above the second level while sweeping up (top block included, as it waits for reduction results);
+this atomic is incremented by the immediately lower level workgroups which provide the inputs to the current workgroup. The current
+workgroup has to spinwait on its atomic until it reaches WORKGROUP_SIZE (with the exception of the last workgroup in a level,
+where the value may differ when the number of workgroups in the previous level is not divisible by WORKGROUP_SIZE).
+
+In the downsweep phase (waiting for scan results), multiple lower level workgroups spin on the same atomic until it reaches 1, since
+a single input is needed by multiple outputs.
+
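A hedged sketch of those per-workgroup semaphores (`finishedFlags` is an illustrative name; in the real thing the flag indices would come from the `finishedFlagOffset` table of `SchedulerParameters` below):

```glsl
layout(set=0,binding=1) coherent buffer Semaphores { uint finishedFlags[]; }; // hypothetical layout

// upsweep: a consumer spins until all of its producers have signalled;
// expectedCount==WORKGROUP_SIZE except for the last workgroup of a level
void waitForProducers(in uint flagIx, in uint expectedCount)
{
	while (atomicAdd(finishedFlags[flagIx],0u)<expectedCount) {}
}

// a producer signals its consumer right after writing its outputs
void signal(in uint flagIx)
{
	memoryBarrierBuffer();
	atomicAdd(finishedFlags[flagIx],1u);
}

// downsweep flips the roles: many consumers call waitForProducers(flagIx,1u) on the same flag
```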
+
+## So what's an Indirect Scan?
+
+It is when you don't know the count of the elements to scan ahead of time, because, let's say, another GPU dispatch produces the
+variable-length list to scan, as in culling systems.
+
+Naturally, because of this you won't know (see the sketch after this list):
+- the number of workgroups to dispatch, so DispatchIndirect is needed
+- the number of upsweep/downsweep levels
+- the number of workgroups in each level
+- the size and offsets of the auxiliary output data array for each level
+- the size and offsets of the atomics for each level
+
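Everything in that list is derivable on the GPU from a single device-side element count, which is why the parameter structs below are mirrored in GLSL. A rough sketch of the derivation (`elementCount` and `WORKGROUP_SIZE` are placeholders):

```glsl
// per-level workgroup counts via repeated ceil-division, as in the upsweep
uint levels = 0u;
uint workgroupsPerLevel[NBL_BUILTIN_MAX_SCAN_LEVELS];
for (uint outputs=elementCount; outputs>1u; )
{
	outputs = (outputs-1u)/WORKGROUP_SIZE+1u; // reductions this level writes out
	workgroupsPerLevel[levels++] = outputs;
}
// the downsweep roughly mirrors the upsweep (minus the top level), and the sizes/offsets
// of the aux arrays and atomics follow from prefix sums over these counts
```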
+
+## Further Work
+
+We could reduce the auxiliary memory size some more by noting that only two levels ever need to access the same intermediate result,
+and that only workgroups from 3 immediately consecutive levels can ever run simultaneously due to our scheduler.
+
+Right now we allocate, and don't alias, the auxiliary memory used to store the intermediate workgroup results.
+
+# I hear you say Nabla is too complex...
+
+If you think that AAA Engines have similar but less complicated utilities, you're gravely mistaken; the AMD GPUs in the
+Playstation and Xbox have hardware workgroup-ordered dispatch and an `mbcnt` instruction which let you do a single dispatch
+prefix sum with subgroup sized workgroups at peak bandwidth efficiency in about 6 lines of HLSL.
+
+Console devs get to bring a gun to a knife fight...
**/
class NBL_API CScanner final : public core::IReferenceCounted
{
@@ -115,8 +176,8 @@ class NBL_API CScanner final : public core::IReferenceCounted
	EO_COUNT = _NBL_GLSL_SCAN_OP_COUNT_
};

-//
-struct Parameters : nbl_glsl_scan_Parameters_t
+// This struct is only for managing where to store intermediate results of the scans
+struct Parameters : nbl_glsl_scan_Parameters_t // this struct and its methods are also available in GLSL so you can launch indirect dispatches
{
	static inline constexpr uint32_t MaxScanLevels = NBL_BUILTIN_MAX_SCAN_LEVELS;

@@ -125,6 +186,7 @@ class NBL_API CScanner final : public core::IReferenceCounted
	std::fill_n(lastElement,MaxScanLevels/2+1,0u);
	std::fill_n(temporaryStorageOffset,MaxScanLevels/2,0u);
}
+// build the constant tables for each level given the number of elements to scan and workgroupSize
Parameters(const uint32_t _elementCount, const uint32_t workgroupSize) : Parameters()
{
	assert(_elementCount!=0u && "Input element count can't be 0!");
@@ -137,7 +199,7 @@ class NBL_API CScanner final : public core::IReferenceCounted

	std::exclusive_scan(temporaryStorageOffset,temporaryStorageOffset+sizeof(temporaryStorageOffset)/sizeof(uint32_t),temporaryStorageOffset,0u);
}
+// given already computed tables of lastElement indices per level, number of levels, and storage offsets, tell us the total auxiliary buffer size needed
inline uint32_t getScratchSize(uint32_t ssboAlignment=256u)
{
	uint32_t uint_count = 1u; // workgroup enumerator
@@ -146,13 +208,17 @@ class NBL_API CScanner final : public core::IReferenceCounted
	return core::roundUp<uint32_t>(uint_count*sizeof(uint32_t),ssboAlignment);
}
};
-struct SchedulerParameters : nbl_glsl_scan_DefaultSchedulerParameters_t
+// the default scheduler we provide works as described above in the big documentation block
+struct SchedulerParameters : nbl_glsl_scan_DefaultSchedulerParameters_t // this struct and its methods are also available in GLSL so you can launch indirect dispatches
{
	SchedulerParameters()
	{
		std::fill_n(finishedFlagOffset,Parameters::MaxScanLevels-1,0u);
		std::fill_n(cumulativeWorkgroupCount,Parameters::MaxScanLevels,0u);
	}
+	// given the number of elements and workgroup size, figure out how many atomics we need;
+	// also account for the fact that we will want to use the same scratch buffer both for the
+	// scheduler's atomics and the aux data storage
	SchedulerParameters(Parameters& outScanParams, const uint32_t _elementCount, const uint32_t workgroupSize) : SchedulerParameters()
	{
		outScanParams = Parameters(_elementCount,workgroupSize);
@@ -175,6 +241,7 @@ class NBL_API CScanner final : public core::IReferenceCounted
		std::inclusive_scan(cumulativeWorkgroupCount,cumulativeWorkgroupCount+Parameters::MaxScanLevels,cumulativeWorkgroupCount);
	}
};
+// push constants of the default direct scan pipeline provide both aux memory offset params and scheduling params
struct DefaultPushConstants
{
	Parameters scanParams;
@@ -185,9 +252,10 @@ class NBL_API CScanner final : public core::IReferenceCounted
	DispatchInfo() : wg_count(0u)
	{
	}
+	// in case we scan very few elements, you don't want to launch workgroups that won't do anything
	DispatchInfo(const IPhysicalDevice::SLimits& limits, const uint32_t elementCount, const uint32_t workgroupSize)
	{
-		constexpr auto workgroupSpinningProtection = 4u;
+		constexpr auto workgroupSpinningProtection = 4u; // to prevent first workgroup starving/idling on level 1 after finishing level 0 early
		wg_count = limits.computeOptimalPersistentWorkgroupDispatchSize(elementCount,workgroupSize,workgroupSpinningProtection);
	}
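To tie it together, a hedged host-side sketch of how the structs above appear to compose, based only on the constructors visible in this diff (`limits`, `elementCount` and `workgroupSize` are placeholders; buffer allocation and pipeline setup are omitted):

```cpp
CScanner::Parameters scanParams; // aux memory offsets per level
CScanner::SchedulerParameters schedParams(scanParams,elementCount,workgroupSize); // also fills out `scanParams`
const uint32_t scratchBytes = scanParams.getScratchSize(); // scratch for the atomics + intermediate results
CScanner::DispatchInfo dispatch(limits,elementCount,workgroupSize); // persistent workgroup count, clamped for tiny scans
```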
