Commit 7a3a3ae

Merge pull request #828 from ska-sa/NGC-1264-f-dither-tidy
Use dithering in the F-engine
2 parents b5bc125 + 8f65d7b

17 files changed: +468 -288 lines changed

doc/engines.rst

Lines changed: 32 additions & 0 deletions

@@ -236,6 +236,38 @@ The above concepts are illustrated in the following figure:
 Common features
 ---------------
 
+.. _dithering:
+
+Dithering
+^^^^^^^^^
+To improve linearity, a random value in the interval (-0.5, 0.5) is added to
+each component (real and imaginary) before quantisation, both in the F-engine
+and in beamforming (it is not needed for correlation, because correlation
+takes place entirely in integer arithmetic with no loss of precision). These
+values are generated using `curand`_, with its underlying XORWOW generator.
+The generator is designed for parallel use, with each work-item having the
+same seed but a different `sequence` parameter to :cpp:func:`!curand_init`.
+This minimises correlation between sequences generated by different threads.
+The sequence numbers are also chosen to be distinct between the different
+engines, to avoid correlation between channels.
+
+Floating-point rounding issues make it tricky to get a perfectly zero-mean
+distribution. While it is probably inconsequential, simply using
+``curand_uniform(state) - 0.5f`` will not give zero mean. We solve this by
+mapping the :math:`2^{32}` possible return values of :cpp:func:`!curand` to
+the range :math:`(-2^{31}, 2^{31})`, with zero represented twice, before
+scaling to convert to a real value in :math:`(-0.5, 0.5)`. While this is
+still a deviation from uniformity, it does give a symmetric distribution.
+
+The :c:struct:`curandStateXORWOW_t` struct defined by curand is unnecessarily
+large for our purposes, because it retains state needed to generate Gaussian
+distributions (the Box-Muller transform). To reduce global memory traffic, we
+use a different type that we define (:c:struct:`randState_t`) to hold random
+states in global memory, together with helpers that save and restore this
+smaller state from a private :c:struct:`curandStateXORWOW_t` used within a
+kernel.
+
+.. _curand: https://docs.nvidia.com/cuda/curand/index.html
+
 .. _engines-shutdown-procedure:
 
 Shutdown procedures
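
As an illustration of the dither mapping described in the hunk above, here is a minimal numpy sketch. It is not the kernel code (the real implementation consumes the raw 32-bit outputs of :cpp:func:`!curand` on the GPU, and its exact arithmetic may differ) and the function name is made up; it only demonstrates a mapping with the stated properties: all 2**32 inputs are used, zero is represented twice, and the result is symmetric in (-0.5, 0.5).

import numpy as np

def dither_from_uint32(u):
    """Map raw 32-bit generator outputs to a symmetric dither in (-0.5, 0.5).

    Values in [0, 2**31) map to the integers 0 .. 2**31 - 1, while values in
    [2**31, 2**32) map to -(2**31 - 1) .. 0, so zero is produced twice and the
    signed values are symmetric about zero. Scaling by 2**-32 then gives a
    real value strictly inside (-0.5, 0.5).
    """
    u = np.asarray(u, dtype=np.int64)  # int64 avoids overflow in the subtraction
    signed = np.where(u < 2**31, u, u - (2**32 - 1))
    return signed * 2.0**-32

# The two extreme raw inputs both map to exactly 0.0, and the mid-range
# inputs map to values just inside +/-0.5.
print(dither_from_uint32([0, 2**32 - 1]))      # [0. 0.]
print(dither_from_uint32([2**31 - 1, 2**31]))  # values just inside +0.5 and -0.5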

doc/fgpu.design.rst

Lines changed: 43 additions & 8 deletions

@@ -72,15 +72,15 @@ The polyphase filter bank starts with a finite impulse response (FIR) filter,
 with some number of *taps* (e.g., 16), and a *step* size which is twice the
 number of output channels. This can be thought of as organising the samples as
 a 2D array, with *step* columns, and then applying a FIR down each column.
-Since the columns are independent, we map each column to a separate workitem,
+Since the columns are independent, we map each column to a separate work-item,
 which keeps a sliding window of samples in its registers. GPUs generally don't
 allow indirect indexing of registers, so loop unrolling (by the number of
 taps) is used to ensure that the indices are known at compile time.
 
 This might not give enough parallelism, particularly for small channel counts,
-so in fact each column in split into sections and a separate workitem is used
+so in fact each column is split into sections and a separate work-item is used
 for each section. There is a trade-off here as samples at the boundaries
-between sections need to be loaded by both workitems, leading to overheads.
+between sections need to be loaded by both work-items, leading to overheads.
 
 Registers are used to hold both the sliding window and the weights, which
 leads to significant register pressure. This reduces occupancy and leads to
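
For illustration, the data layout described above can be sketched in numpy: the samples are viewed as a 2D array with *step* columns and the same *taps*-long FIR is run independently down each column. This is a host-side reference under those assumptions, not the GPU kernel (which keeps the window and weights in registers and unrolls the tap loop), and the function name is made up.

import numpy as np

def pfb_fir_columns(samples, weights, step):
    """Run a FIR of len(weights) taps down each of `step` columns.

    samples: 1D array whose length is a multiple of `step`.
    weights: the FIR weights, shared by every column.
    Returns shape (rows - taps + 1, step): one output per column for each
    position of the sliding window.
    """
    taps = len(weights)
    cols = samples.reshape(-1, step)   # one row per group of `step` consecutive samples
    rows = cols.shape[0]
    out = np.zeros((rows - taps + 1, step), dtype=np.result_type(samples, weights))
    for t in range(taps):              # the kernel unrolls this loop over the taps
        out += weights[t] * cols[t : t + rows - taps + 1]
    return out

# Example: 16 taps and 8 output channels, i.e. a step of 16 columns.
rng = np.random.default_rng(1)
out = pfb_fir_columns(rng.normal(size=16 * 64), np.hanning(16), step=16)
print(out.shape)  # (49, 16)
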
@@ -219,7 +219,7 @@ We can also re-use some common expressions by computing :math:`X_{N-k}` at the s
 This raises the question: Why compute both :math:`X_{k}` and :math:`X_{N-k}`? After all,
 parameter :math:`k` should range over the full channel range initially stated (parameter :math:`N`). The answer:
 compute efficiency. It is costly to compute :math:`U_k` and :math:`V_k`, so if we can use them to
-compute two elements of :math:`X`` (:math:`X_{k}` and :math:`X_{N-k}`) at once it is better than producing
+compute two elements of :math:`X` (:math:`X_{k}` and :math:`X_{N-k}`) at once it is better than producing
 only one element of :math:`X`.
 
 Why is doing all this work more efficient than letting cuFFT handle the
@@ -386,10 +386,6 @@ operations are all straightforward. While C++ doesn't have a convert with
 saturation function, we can access the CUDA functionality through inline PTX
 assembly (OpenCL C has an equivalent function).
 
-Fine delays and the twiddle factor for the Cooley-Tukey transformation are
-computed using the ``sincospi`` function, which saves both a multiplication by
-:math:`\pi` and a range reduction.
-
 The gains, fine delays and phases need to be made available to the kernel. We
 found that transferring them through the usual CUDA copy mechanism leads to
 sub-optimal scheduling, because these (small) transfers could end up queued
@@ -398,6 +394,45 @@ to allow the CPU to write directly to the GPU buffers. The buffers are
 replicated per output item, so that it is possible for the CPU to be updating
 the values for one output item while the GPU is computing on another.

+Fast sin/cos
+~~~~~~~~~~~~
+CUDA GPUs have hardware units dedicated to computing transcendental functions.
+They are significantly faster than software computation, but have accuracy
+limitations. The larger the absolute value of the argument, the worse the
+accuracy. For angles in the interval :math:`[-\pi, \pi]`, the maximum
+absolute error in computing :math:`e^{jx}` is 4.21e-07. That's roughly 5×
+worse than using the more accurate function, but far smaller than the errors
+introduced by quantisation. Over larger ranges, the maximum error increases
+roughly linearly with the magnitude of the argument. The script
+:file:`scratch/fgpu/sincos_accuracy.py` can be used to measure this.
+
+It's therefore important to check the range of the angles we're using before
+blindly using the faster function. There are several places where we compute
+phase rotations:
+
+1. In implementing the real-to-complex transform, we compute
+   :math:`e^{\frac{-\pi j}{N}\cdot k}`, where
+   :math:`0 \le k \le \frac{N}{2}`. The angle is thus in the range
+   :math:`[-\frac{\pi}{2}, 0]`.
+
+2. In unzipping the FFT, we compute the twiddle factor
+   :math:`e^{\frac{-2\pi j}{mn}\cdot rs}`, where :math:`0 \le r < n` and
+   :math:`0 \le s \le \frac{m}{2}`. The angle is thus in the range
+   :math:`(-\pi, 0]`.
+
+3. We also do an order-:math:`n` FFT, but since we only consider small fixed
+   values of :math:`n`, we hard-code the roots of unity rather than computing
+   them at runtime.
+
+4. Fine delays and phase rotation are combined to produce a per-channel phase
+   rotation. For wideband, the fine delay is up to half a sample, which
+   translates to a maximum rotation of :math:`\frac{\pi}{4}`. For narrowband
+   the calculation is more complex, but it again yields a maximum rotation
+   of :math:`\frac{\pi}{4}`. The fixed phase rotation is limited to
+   :math:`[-\pi, \pi]`, so the total angle is in
+   :math:`[-\frac{5\pi}{4}, \frac{5\pi}{4}]`, for which the fast sincos
+   function has a maximum absolute error of 6.67e-07.
+
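
As a quick check of the angle range claimed in item 2, here is a small numpy sketch (a host-side reference with a made-up helper name, not the kernel code) that builds the unzip twiddle factors and asserts that every angle falls in (-pi, 0]:

import numpy as np

def unzip_twiddles(m, n):
    """Twiddle factors exp(-2j*pi*r*s/(m*n)) for 0 <= r < n, 0 <= s <= m/2."""
    r = np.arange(n)[:, np.newaxis]
    s = np.arange(m // 2 + 1)[np.newaxis, :]
    angle = -2.0 * np.pi * r * s / (m * n)
    # The most negative angle is -2*pi*(n-1)*(m/2)/(m*n) = -pi*(n-1)/n > -pi,
    # so every angle sits inside [-pi, pi], where the fast-path error bound holds.
    assert angle.min() > -np.pi and angle.max() <= 0.0
    return np.exp(1j * angle)

print(unzip_twiddles(m=1024, n=4).shape)  # (4, 513)
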
 Coarse delays
 ^^^^^^^^^^^^^
 One of the more challenging aspects of the processing design was the handling

doc/xbgpu.design.b.rst

Lines changed: 3 additions & 33 deletions

@@ -29,12 +29,12 @@ to channel :math:`c`, beam :math:`b`, antenna :math:`a` is
 where :math:`w_{ab}` and :math:`d_{ab}` are the weight and delay values passed
 to the kernel.
 
-Each workgroup of the kernel handles multiple spectra and all beams and
+Each work-group of the kernel handles multiple spectra and all beams and
 antennas, but only a single channel. Conceptually, the kernel first computes
 :math:`W_{abc}` for all antennas and beams and stores it to local memory, then
 applies it to all antennas and beams. Each input sample is loaded once before
 it is used for all beams. An accumulator is maintained for each beam. Since
-each coefficient is used many times (the number depends on the work group
+each coefficient is used many times (the number depends on the work-group
 size, which is a tuning parameter, but 64-256 is reasonable) after it is
 computed, the cost for computing coefficients is amortised.
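
The accumulation pattern described in this hunk can be sketched on the host as follows. This is plain numpy with made-up names, not the kernel code: the actual coefficient formula (in terms of :math:`w_{ab}` and :math:`d_{ab}`) is defined above this hunk, so the coefficients are simply taken as given here.

import numpy as np

def beamform_channel(samples, coeffs):
    """Accumulate the beams for a single channel.

    samples: complex array of shape (antennas, spectra) for this channel.
    coeffs:  complex array of shape (beams, antennas) -- the per-channel
             W_{abc}, computed once and then reused for every spectrum.
    Returns shape (beams, spectra): one accumulator per beam.
    """
    out = np.zeros((coeffs.shape[0], samples.shape[1]), dtype=np.complex64)
    for a in range(coeffs.shape[1]):
        # Each input sample is used for all beams once it has been loaded.
        out += coeffs[:, a, np.newaxis] * samples[a]
    return out

rng = np.random.default_rng(0)
x = (rng.normal(size=(4, 256)) + 1j * rng.normal(size=(4, 256))).astype(np.complex64)
w = (rng.normal(size=(2, 4)) + 1j * rng.normal(size=(2, 4))).astype(np.complex64)
print(beamform_channel(x, w).shape)  # (2, 256)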

@@ -49,7 +49,7 @@ sizes have two advantages:
    barriers.
 
 2. If the batch size is small, the number of coefficients to compute is also
-   small, and there is not enough work to keep all the work items busy, making
+   small, and there is not enough work to keep all the work-items busy, making
    the coefficient computation less efficient.
 
 Higher beam counts
@@ -66,36 +66,6 @@ This does mean that the inputs are loaded multiple times, but caches help
 significantly here, and the kernel tends to be more compute-bound in this
 domain.
 
-.. _dithering:
-
-Dithering
-^^^^^^^^^
-To improve linearity, a random value in the interval (-0.5, 0.5) is added to
-each component (real and imaginary) before quantisation. These values are
-generated using `curand`_, with its underlying XORWOW generator. It is
-designed for parallel use, with each thread having the same seed but a
-different `sequence` parameter to :cpp:func:`!curand_init`. This minimises
-correlation between sequences generated by different threads. The sequence
-numbers are also chosen to be distinct between the different engines, to avoid
-correlation between channels.
-
-Floating-point rounding issues make it tricky to get a perfectly zero-mean
-distribution. While it is probably inconsequential, simply using
-``curand_uniform(state) - 0.5f`` will not give zero mean. We solve this by
-mapping the :math:`2^{32}` possible return values of :cpp:func:`!curand` to
-the range :math:`(-2^{31}, 2^{31})` with zero represented twice, before
-scaling to convert to a real value in :math:`(-0.5, 0.5)`. While this is
-still a deviation from uniformity, it does give a symmetric distribution.
-
-The :c:struct:`curandStateXORWOW_t` struct defined by curand is unnecessarily large
-for our purposes, because it retains state needed to generate Gaussian
-distributions (Box-Muller transform). To reduce global memory traffic, we use
-a different type we define (:c:struct:`randState_t`) to hold random states in
-global memory, together with helpers that save and restore this smaller state
-from a private :c:struct:`curandStateXORWOW_t` used within a kernel.
-
-.. _curand: https://docs.nvidia.com/cuda/curand/index.html
-
 Data flow
 ---------
 The host side of the beamforming is simpler than for correlation because

qualification/baseline_correlation_products/test_consistency.py

Lines changed: 0 additions & 77 deletions
This file was deleted.

qualification/tied_array_channelised_voltage/test_delay.py

Lines changed: 6 additions & 3 deletions

@@ -45,8 +45,11 @@ async def test_delay_small(
     -------------------
     Verification by means of test. Set a delay on one input and form a beam
     from it with a compensating delay. Use a different input with no delay
-    to form a reference beam. Check that the results are consistent to within 2
-    ULP.
+    to form a reference beam. Check that the results are consistent to within 3
+    ULP. This tolerance allows for 1 ULP of F-engine dithering for each input,
+    and 1 ULP for B-engine dithering (the reference beam experiences no
+    dithering because its output simply equals the input, so no re-quantisation
+    occurs).
 
     This test is only valid for delays of less than half a sample. For larger
     delays, the F-engine delay is done partially in the time domain, while the
@@ -115,7 +118,7 @@ async def test_delay_small(
     data = data.astype(np.int16)
     max_error = np.max(np.abs(data[delay_beam] - data[ref_beam]))
     with check:
-        assert max_error <= 2
+        assert max_error <= 3
     pdf_report.detail(f"Maximum difference is {max_error} ULP")

scratch/fgpu/benchmarks/compute_bench.py

Lines changed: 2 additions & 0 deletions

@@ -88,6 +88,8 @@ def main(): # noqa: C901
         samples=args.recv_chunk_samples + extra_samples,
         spectra=out_spectra,
         spectra_per_heap=spectra_per_heap,
+        seed=123,
+        sequence_first=456,
     )
     fn.ensure_all_bound()

scratch/fgpu/sincos_accuracy.py

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
#!/usr/bin/env python3

################################################################################
# Copyright (c) 2024, National Research Foundation (SARAO)
#
# Licensed under the BSD 3-Clause License (the "License"); you may not use
# this file except in compliance with the License. You may obtain a copy
# of the License at
#
#   https://opensource.org/licenses/BSD-3-Clause
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

"""Measure the accuracy of CUDA's sincos implementations."""

import argparse

import numpy as np
import pycuda.autoinit  # noqa: F401  (importing it initialises the CUDA context)
import pycuda.driver
from pycuda.compiler import SourceModule


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--max", type=float, default=1.0, help="Maximum value to test, in units of pi")
    parser.add_argument("--func", choices=["__sincosf", "sincosf"], default="__sincosf")
    args = parser.parse_args()

    source = SourceModule(
        f"""
        #include <math.h>

        __global__ void sincos_kernel(float2 *out, const float *in)
        {{
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            {args.func}(in[idx], &out[idx].x, &out[idx].y);
        }}
        """
    )
    kernel = source.get_function("sincos_kernel")

    info = np.finfo(np.float32)
    block = 128
    n = 2**info.nmant
    sc = np.zeros((n, 2), np.float32)

    max_test = np.pi * args.max
    # Iterate through the exponent portion of the float32. For each value,
    # we populate angle with all possible mantissa bits. We exclude the
    # largest exponent since that is used to encode infinities and NaNs.
    max_sin_err = 0.0
    max_cos_err = 0.0
    max_tot_err = 0.0
    for raw_exp in range(0, 2**info.nexp - 1):
        angle = np.arange(raw_exp << info.nmant, (raw_exp + 1) << info.nmant, dtype=np.uint32).view(np.float32)
        if angle[0] > max_test:
            break
        cut = np.searchsorted(angle, max_test, side="right")

        kernel(pycuda.driver.Out(sc), pycuda.driver.In(angle), block=(block, 1, 1), grid=(n // block, 1, 1))
        # Only include angles no larger than max_test in the error statistics.
        sin_err = np.abs(sc[:cut, 0].astype(np.float64) - np.sin(angle[:cut].astype(np.float64)))
        cos_err = np.abs(sc[:cut, 1].astype(np.float64) - np.cos(angle[:cut].astype(np.float64)))
        tot_err = np.hypot(sin_err, cos_err)
        max_sin_err = max(max_sin_err, np.max(sin_err))
        max_cos_err = max(max_cos_err, np.max(cos_err))
        max_tot_err = max(max_tot_err, np.max(tot_err))

    print(f"Max sin err: {max_sin_err} (2**{np.log2(max_sin_err)})")
    print(f"Max cos err: {max_cos_err} (2**{np.log2(max_cos_err)})")
    print(f"Max tot err: {max_tot_err} (2**{np.log2(max_tot_err)})")


if __name__ == "__main__":
    main()
