@@ -23,6 +23,8 @@ Utility class to help you perform the equivalent of `std::inclusive_scan` and `s
The basic building block is a Blelloch-Scan, the `nbl_glsl_workgroup{Add/Mul/And/Xor/Or/Min/Max}{Exclusive/Inclusive}`:
https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
https://classes.engineering.wustl.edu/cse231/core/index.php/Scan
+ Also referred to as an "Upsweep-Downsweep Scan" due to the fact it computes Reductions hierarchically until there's only one block left,
+ then does Prefix Sums and propagates the results into more blocks until we're back at 1 element of a block for 1 element of the input.
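+ As a concrete picture of that upsweep/downsweep shape, here is a minimal illustrative Blelloch exclusive-add over shared memory
+ (assumes a power-of-two workgroup size; `sdata` and the function name are made up for the sketch, this is NOT the actual `nbl_glsl_workgroup*` implementation):
+ ```glsl
+ #version 450
+ #define WORKGROUP_SIZE 256
+ layout(local_size_x=WORKGROUP_SIZE) in;
+
+ shared uint sdata[WORKGROUP_SIZE];
+
+ uint sketch_workgroupExclusiveAdd(in uint value)
+ {
+     const uint tid = gl_LocalInvocationIndex;
+     sdata[tid] = value;
+     // upsweep: compute reductions hierarchically until only the block total remains at the top
+     for (uint stride=1u; stride<WORKGROUP_SIZE; stride<<=1u)
+     {
+         barrier();
+         const uint ix = (tid+1u)*(stride<<1u)-1u;
+         if (ix<WORKGROUP_SIZE)
+             sdata[ix] += sdata[ix-stride];
+     }
+     // downsweep: turn the reductions into prefix sums until there's one result per invocation again
+     if (tid==0u)
+         sdata[WORKGROUP_SIZE-1u] = 0u;
+     for (uint stride=WORKGROUP_SIZE>>1u; stride!=0u; stride>>=1u)
+     {
+         barrier();
+         const uint ix = (tid+1u)*(stride<<1u)-1u;
+         if (ix<WORKGROUP_SIZE)
+         {
+             const uint left = sdata[ix-stride];
+             sdata[ix-stride] = sdata[ix];
+             sdata[ix] += left;
+         }
+     }
+     barrier();
+     return sdata[tid];
+ }
+
+ void main() {sketch_workgroupExclusiveAdd(gl_LocalInvocationIndex);}
+ ```
+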
The workgroup scan is itself probably built out of Hillis-Steele subgroup scans; we use `KHR_shader_subgroup_arithmetic` whenever available,
but fall back to our own "software" emulation of subgroup arithmetic using Hillis-Steele and some scratch shared memory.
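
As a rough picture of the "software" fallback, here is an illustrative Hillis-Steele inclusive-add using zero-padded scratch shared memory
(placeholder names and sizes, not the actual emulation code, which carves the workgroup up into emulated subgroups):
```glsl
#version 450
#define EMULATED_SIZE 64
layout(local_size_x=EMULATED_SIZE) in;

shared uint scratch[2*EMULATED_SIZE];

uint sketch_inclusiveAdd(in uint value)
{
    const uint lane = gl_LocalInvocationIndex;
    scratch[lane] = 0u; // zero padding, so reads "below lane 0" are harmless and branchless
    scratch[lane+EMULATED_SIZE] = value;
    for (uint step=1u; step<EMULATED_SIZE; step<<=1u)
    {
        barrier();
        value += scratch[lane+EMULATED_SIZE-step];
        barrier();
        scratch[lane+EMULATED_SIZE] = value;
    }
    return value;
}

void main() {sketch_inclusiveAdd(1u);}
```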
@@ -35,7 +37,7 @@ The scheduling relies on two principles:
- Virtual and Persistent Workgroups
- Atomic Counters as Semaphores
- # Virtual Workgroups TODO: Move this Paragraph somewhere else.
+ ## Virtual Workgroups TODO: Move this Paragraph somewhere else.
Generally speaking, launching a new workgroup has non-trivial overhead.
Also most IHVs, especially AMD, have silly limits on the ranges of dispatches (like 64k workgroups), which also apply to 1D dispatches.
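
The simplest illustration of working around both issues is a plain stride loop over "virtual" workgroup indices
(placeholder bindings and sizes; the utility described further down hands the indices out with an atomic instead, to get ordering guarantees):
```glsl
#version 450
layout(local_size_x=256) in;

layout(push_constant) uniform PushConstants
{
    uint virtualWorkgroupCount; // can be far larger than any dispatch limit
};

void processVirtualWorkgroup(in uint virtualWorkgroupIndex)
{
    // one "real" workgroup's worth of work for the given virtual index
}

void main()
{
    // a small, launch-limit friendly dispatch persistently covers an arbitrarily large virtual range
    for (uint i=gl_WorkGroupID.x; i<virtualWorkgroupCount; i+=gl_NumWorkGroups.x)
        processVirtualWorkgroup(i);
}
```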
@@ -70,7 +72,7 @@ atomicMax(nextWorkgroup,gl_GlobalInvocationID.x+1);
```
has the potential to deadlock and TDR your GPU.
- However if you use a global counter of dispatched workgroups in an SSBO and `atomicAdd` to assign the `virtualWorkgroupIndex`
+ However if you use such an atomic to assign the `virtualWorkgroupIndex` in lieu of spinning
```glsl
uint virtualWorkgroupIndex;
while ((virtualWorkgroupIndex=atomicAdd(nextWorkgroup,1u))<virtualWorkgroupCount)
@@ -80,7 +82,66 @@ for ((virtualWorkgroupIndex=atomicAdd(nextWorkgroup,1u))<virtualWorkgroupCount)
```
the ordering of starting work is now enforced (it still won't guarantee the order of completion).
- # Atomic Counters as Semaphores
+ ## Atomic Counters as Semaphores
+ To scan arbitrarily large arrays, we already use Virtual Workgroups.
+
+ For improved cache coherence and more bandwidth on the higher level reduction and scan blocks,
+ it's best to use a temporary scratch buffer roughly of size `O(2 log_{WorkgroupSize}(n))`.
+
+ We can however turn the BrainWorm(TM) up to 11, and do the whole scan in a single dispatch.
+
+ First, we assign a Linear Index, using the trick outlined in the Virtual Workgroups section, to every scan block (workgroup)
+ such that if executed serially, **the lower index block would have finished before any higher index block.**
+ https://developer.nvidia.com/sites/all/modules/custom/gpugems/books/GPUGems3/elementLinks/39fig06.jpg
+ It would also be useful to keep some table that would let us map the Linear Index to the scan level.
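+
+ For illustration, with a running-total table like the `cumulativeWorkgroupCount` member of the scheduler parameters further down,
+ that mapping is just a lookup (placeholder constant and names):
+ ```glsl
+ #define MAX_SCAN_LEVELS 7 // stand-in for NBL_BUILTIN_MAX_SCAN_LEVELS
+
+ // find the level a virtual workgroup belongs to, given exclusive upper bounds per level
+ uint sketch_levelFromLinearIndex(in uint linearIndex, in uint cumulativeWorkgroupCount[MAX_SCAN_LEVELS])
+ {
+     uint level = 0u;
+     while (linearIndex>=cumulativeWorkgroupCount[level])
+         level++;
+     return level;
+ }
+ ```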
+
+ Then we use a little bit more scratch for some atomic counters.
+
+ A naive scheduler would have one atomic counter per upsweep-downsweep level, which would be incremented AFTER the workgroup
+ is finished with the scan and writes its outputs; this would tell us how many workgroups have completed at the level so far.
+
+ Then we could figure out the scan level given the workgroup's Linear Index, then spinwait until the atomic counter mapped to the
+ previous scan level tells us all workgroups have completed.
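+
+ A sketch of that naive scheme (illustrative names; a real version also needs `coherent` storage and careful memory barriers):
+ ```glsl
+ layout(set=0,binding=1) coherent buffer NaiveScheduler
+ {
+     uint workgroupsFinishedPerLevel[]; // zero initialized
+ };
+
+ void sketch_waitForPreviousLevel(in uint level, in uint workgroupsInPreviousLevel)
+ {
+     if (level!=0u && gl_LocalInvocationIndex==0u)
+     {
+         // spin until EVERY workgroup of the previous level has signalled
+         while (atomicAdd(workgroupsFinishedPerLevel[level-1u],0u)<workgroupsInPreviousLevel) {}
+     }
+     barrier();
+ }
+
+ void sketch_signalLevelDone(in uint level)
+ {
+     memoryBarrierBuffer();
+     barrier(); // every invocation must have written its outputs first
+     if (gl_LocalInvocationIndex==0u)
+         atomicAdd(workgroupsFinishedPerLevel[level],1u);
+ }
+ ```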
+
+ HOWEVER, this naive approach is really no better than separate dispatches with pipeline barriers in the middle.
+
+ Subsequently we turn up the BrainWorm(TM) to 12, and notice that we don't really need to wait on an entire previous level,
+ just the workgroups that will produce the data that the current one will process.
+
+ So there's one atomic per workgroup above the second level while sweeping up (top block included as it waits for reduction results);
+ this atomic is incremented by the immediately lower level workgroups which provide the inputs to the current workgroup. The current
+ workgroup will have to spinwait on its atomic until it reaches WORKGROUP_SIZE (with the exception of the last workgroup in the level,
+ where the value might be different when the number of workgroups in the previous level is not divisible by WORKGROUP_SIZE).
+
+ In the downsweep phase (waiting for scan results), multiple lower level workgroups spin on the same atomic until it reaches 1, since
+ a single input is needed by multiple outputs.
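+
+ A sketch of the per-dependency version (again illustrative; the real flag locations come from a table like the `finishedFlagOffset` member of the scheduler parameters below):
+ ```glsl
+ layout(set=0,binding=1) coherent buffer Semaphores
+ {
+     uint finishedFlag[]; // one per producing workgroup that others depend on, zero initialized
+ };
+
+ void sketch_spinUntil(in uint flagIx, in uint expected)
+ {
+     if (gl_LocalInvocationIndex==0u)
+         while (atomicAdd(finishedFlag[flagIx],0u)<expected) {}
+     barrier(); // the rest of the workgroup parks here until invocation 0 gets released
+ }
+
+ void sketch_signal(in uint flagIx)
+ {
+     memoryBarrierBuffer();
+     barrier(); // our outputs must be written out before we bump the semaphore
+     if (gl_LocalInvocationIndex==0u)
+         atomicAdd(finishedFlag[flagIx],1u);
+ }
+
+ // upsweep consumer:   sketch_spinUntil(ownFlag,WORKGROUP_SIZE); // less for the last, partially fed workgroup of a level
+ // downsweep consumer: sketch_spinUntil(parentFlag,1u);          // many consumers wait on the same single producer
+ ```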
+
+ ## So what's an Indirect Scan?
+
+ It is when you don't know the count of the elements to scan, because, let's say, another GPU dispatch produces the list to scan and it's of
+ variable length, as happens for example in culling systems (see the sketch after the list below).
+
+ Naturally because of this, you won't know:
+ - the number of workgroups to dispatch, so DispatchIndirect is needed
+ - the number of upsweep/downsweep levels
+ - the number of workgroups in each level
+ - the size and offsets of the auxiliary output data array for each level
+ - the size and offsets of the atomics for each level
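+
+ For illustration, the piece that feeds DispatchIndirect could be a tiny single-invocation "prepare" pass like the sketch below
+ (placeholder bindings and a made-up workgroup cap; the GLSL mirrors of the parameter structs further down would be filled in analogously):
+ ```glsl
+ #version 450
+ layout(local_size_x=1) in;
+
+ layout(set=0,binding=0) readonly buffer ProducerOutput
+ {
+     uint elementCount; // only known once the producing dispatch has run
+ };
+ layout(set=0,binding=1) writeonly buffer IndirectArgs
+ {
+     uvec3 dispatchWorkgroupCount;
+ };
+
+ void main()
+ {
+     const uint workgroupSize = 256u; // must match the scan pipeline
+     const uint virtualWorkgroups = (elementCount-1u)/workgroupSize+1u; // an element count of 0 isn't allowed anyway
+     // cap the real dispatch and let the scan run persistently over the virtual range,
+     // similar in spirit to what computeOptimalPersistentWorkgroupDispatchSize does on the CPU path
+     dispatchWorkgroupCount = uvec3(min(virtualWorkgroups,4096u),1u,1u);
+ }
+ ```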
+
+ ## Further Work
+
+ We could reduce auxiliary memory size some more by noting that only two levels need to access the same intermediate result and
+ only workgroups from 3 immediately consecutive levels can ever work simultaneously due to our scheduler.
+
+ Right now we allocate and don't alias the auxiliary memory used for storage of the intermediate workgroup results.
+
+ # I hear you say Nabla is too complex...
+
+ If you think that AAA Engines have similar and less complicated utilities, you're gravely mistaken; the AMD GPUs in the
+ PlayStation and Xbox have hardware workgroup ordered dispatch and a `mbcnt` instruction which allows you to do a single-dispatch
+ prefix sum with subgroup-sized workgroups at peak bandwidth efficiency in about 6 lines of HLSL.
+
+ Console devs get to bring a gun to a knife fight...
**/
class NBL_API CScanner final : public core::IReferenceCounted
{
@@ -115,8 +176,8 @@ class NBL_API CScanner final : public core::IReferenceCounted
EO_COUNT = _NBL_GLSL_SCAN_OP_COUNT_
};
- //
- struct Parameters : nbl_glsl_scan_Parameters_t
+ // This struct is only for managing where to store intermediate results of the scans
+ struct Parameters : nbl_glsl_scan_Parameters_t // this struct and its methods are also available in GLSL so you can launch indirect dispatches
{
static inline constexpr uint32_t MaxScanLevels = NBL_BUILTIN_MAX_SCAN_LEVELS;
@@ -125,6 +186,7 @@ class NBL_API CScanner final : public core::IReferenceCounted
std::fill_n(lastElement,MaxScanLevels/2+1,0u);
std::fill_n(temporaryStorageOffset,MaxScanLevels/2,0u);
}
+ // build the constant tables for each level given the number of elements to scan and workgroupSize
Parameters(const uint32_t _elementCount, const uint32_t workgroupSize) : Parameters()
{
assert(_elementCount!=0u && "Input element count can't be 0!");
@@ -137,7 +199,7 @@ class NBL_API CScanner final : public core::IReferenceCounted
std::exclusive_scan(temporaryStorageOffset,temporaryStorageOffset+sizeof(temporaryStorageOffset)/sizeof(uint32_t),temporaryStorageOffset,0u);
}
-
+ // given already computed tables of lastElement indices per level, number of levels, and storage offsets, tell us total auxiliary buffer size needed
inline uint32_t getScratchSize(uint32_t ssboAlignment=256u)
{
uint32_t uint_count = 1u; // workgroup enumerator
@@ -146,13 +208,17 @@ class NBL_API CScanner final : public core::IReferenceCounted
return core::roundUp<uint32_t>(uint_count*sizeof(uint32_t),ssboAlignment);
}
};
- struct SchedulerParameters : nbl_glsl_scan_DefaultSchedulerParameters_t
+ // the default scheduler we provide works as described above in the big documentation block
+ struct SchedulerParameters : nbl_glsl_scan_DefaultSchedulerParameters_t // this struct and its methods are also available in GLSL so you can launch indirect dispatches
{
SchedulerParameters()
{
std::fill_n(finishedFlagOffset,Parameters::MaxScanLevels-1,0u);
std::fill_n(cumulativeWorkgroupCount,Parameters::MaxScanLevels,0u);
}
+ // given the number of elements and workgroup size, figure out how many atomics we need
+ // also account for the fact that we will want to use the same scratch buffer both for the
+ // scheduler's atomics and the aux data storage
SchedulerParameters(Parameters& outScanParams, const uint32_t _elementCount, const uint32_t workgroupSize) : SchedulerParameters()
{
outScanParams = Parameters(_elementCount,workgroupSize);
@@ -175,6 +241,7 @@ class NBL_API CScanner final : public core::IReferenceCounted
std::inclusive_scan(cumulativeWorkgroupCount,cumulativeWorkgroupCount+Parameters::MaxScanLevels,cumulativeWorkgroupCount);
}
};
+ // push constants of the default direct scan pipeline provide both aux memory offset params and scheduling params
struct DefaultPushConstants
{
Parameters scanParams;
@@ -185,9 +252,10 @@ class NBL_API CScanner final : public core::IReferenceCounted
DispatchInfo() : wg_count(0u)
{
}
+ // in case we scan very few elements, you don't want to launch workgroups that won't do anything
DispatchInfo(const IPhysicalDevice::SLimits& limits, const uint32_t elementCount, const uint32_t workgroupSize)
{
- constexpr auto workgroupSpinningProtection = 4u ;
+ constexpr auto workgroupSpinningProtection = 4u; // to prevent first workgroup starving/idling on level 1 after finishing level 0 early
wg_count = limits.computeOptimalPersistentWorkgroupDispatchSize(elementCount,workgroupSize,workgroupSpinningProtection);
}