Commit 7a3a3ae

Merge pull request #828 from ska-sa/NGC-1264-f-dither-tidy
Use dithering in the F-engine
2 parents b5bc125 + 8f65d7b

17 files changed: +468 -288 lines changed

doc/engines.rst

Lines changed: 32 additions & 0 deletions

@@ -236,6 +236,38 @@ The above concepts are illustrated in the following figure:
 Common features
 ---------------
 
+.. _dithering:
+
+Dithering
+^^^^^^^^^
+To improve linearity, a random value in the interval (-0.5, 0.5) is added to
+each component (real and imaginary) before quantisation, both in the F-engine
+and in beamforming (it is not needed for correlation, because correlation
+takes place entirely in integer arithmetic with no loss of precision). These
+values are generated using `curand`_, with its underlying XORWOW generator.
+The generator is designed for parallel use, with each work-item having the
+same seed but a different `sequence` parameter to :cpp:func:`!curand_init`.
+This minimises correlation between sequences generated by different threads.
+The sequence numbers are also chosen to be distinct between the different
+engines, to avoid correlation between channels.
+
+Floating-point rounding issues make it tricky to get a perfectly zero-mean
+distribution. While it is probably inconsequential, simply using
+``curand_uniform(state) - 0.5f`` will not give zero mean. We solve this by
+mapping the :math:`2^{32}` possible return values of :cpp:func:`!curand` to
+the range :math:`(-2^{31}, 2^{31})`, with zero represented twice, before
+scaling to convert to a real value in :math:`(-0.5, 0.5)`. While this is
+still a deviation from uniformity, it does give a symmetric distribution.
+
+The :c:struct:`curandStateXORWOW_t` struct defined by curand is unnecessarily
+large for our purposes, because it retains state needed to generate Gaussian
+distributions (the Box-Muller transform). To reduce global memory traffic, we
+use a different type that we define (:c:struct:`randState_t`) to hold random
+states in global memory, together with helpers that save and restore this
+smaller state from a private :c:struct:`curandStateXORWOW_t` used within a
+kernel.
+
+.. _curand: https://docs.nvidia.com/cuda/curand/index.html
+
 .. _engines-shutdown-procedure:
 
 Shutdown procedures
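
As an illustration of the dither mapping described in the hunk above, here is a minimal numpy sketch. It is not the kernel code (the real implementation consumes the raw 32-bit outputs of :cpp:func:`!curand` on the GPU, and its exact arithmetic may differ) and the function name is made up; it only demonstrates a mapping with the stated properties: all 2**32 inputs are used, zero is represented twice, and the result is symmetric in (-0.5, 0.5).

import numpy as np

def dither_from_uint32(u):
    """Map raw 32-bit generator outputs to a symmetric dither in (-0.5, 0.5).

    Values in [0, 2**31) map to the integers 0 .. 2**31 - 1, while values in
    [2**31, 2**32) map to -(2**31 - 1) .. 0, so zero is produced twice and the
    signed values are symmetric about zero. Scaling by 2**-32 then gives a
    real value strictly inside (-0.5, 0.5).
    """
    u = np.asarray(u, dtype=np.int64)  # int64 avoids overflow in the subtraction
    signed = np.where(u < 2**31, u, u - (2**32 - 1))
    return signed * 2.0**-32

# The two extreme raw inputs both map to exactly 0.0, and the mid-range
# inputs map to values just inside +/-0.5.
print(dither_from_uint32([0, 2**32 - 1]))      # [0. 0.]
print(dither_from_uint32([2**31 - 1, 2**31]))  # values just inside +0.5 and -0.5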

doc/fgpu.design.rst

Lines changed: 43 additions & 8 deletions

@@ -72,15 +72,15 @@ The polyphase filter bank starts with a finite impulse response (FIR) filter,
 with some number of *taps* (e.g., 16), and a *step* size which is twice the
 number of output channels. This can be thought of as organising the samples as
 a 2D array, with *step* columns, and then applying a FIR down each column.
-Since the columns are independent, we map each column to a separate workitem,
+Since the columns are independent, we map each column to a separate work-item,
 which keeps a sliding window of samples in its registers. GPUs generally don't
 allow indirect indexing of registers, so loop unrolling (by the number of
 taps) is used to ensure that the indices are known at compile time.
 
 This might not give enough parallelism, particularly for small channel counts,
-so in fact each column in split into sections and a separate workitem is used
+so in fact each column is split into sections and a separate work-item is used
 for each section. There is a trade-off here as samples at the boundaries
-between sections need to be loaded by both workitems, leading to overheads.
+between sections need to be loaded by both work-items, leading to overheads.
 
 Registers are used to hold both the sliding window and the weights, which
 leads to significant register pressure. This reduces occupancy and leads to
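
For illustration, the data layout described above can be sketched in numpy: the samples are viewed as a 2D array with *step* columns and the same *taps*-long FIR is run independently down each column. This is a host-side reference under those assumptions, not the GPU kernel (which keeps the window and weights in registers and unrolls the tap loop), and the function name is made up.

import numpy as np

def pfb_fir_columns(samples, weights, step):
    """Run a FIR of len(weights) taps down each of `step` columns.

    samples: 1D array whose length is a multiple of `step`.
    weights: the FIR weights, shared by every column.
    Returns shape (rows - taps + 1, step): one output per column for each
    position of the sliding window.
    """
    taps = len(weights)
    cols = samples.reshape(-1, step)   # one row per group of `step` consecutive samples
    rows = cols.shape[0]
    out = np.zeros((rows - taps + 1, step), dtype=np.result_type(samples, weights))
    for t in range(taps):              # the kernel unrolls this loop over the taps
        out += weights[t] * cols[t : t + rows - taps + 1]
    return out

# Example: 16 taps and 8 output channels, i.e. a step of 16 columns.
rng = np.random.default_rng(1)
out = pfb_fir_columns(rng.normal(size=16 * 64), np.hanning(16), step=16)
print(out.shape)  # (49, 16)
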
@@ -219,7 +219,7 @@ We can also re-use some common expressions by computing :math:`X_{N-k}` at the s
 This raises the question: Why compute both :math:`X_{k}` and :math:`X_{N-k}`? After all,
 parameter :math:`k` should range over the full channel range initially stated (parameter :math:`N`). The answer:
 compute efficiency. It is costly to compute :math:`U_k` and :math:`V_k`, so if we can use them to
-compute two elements of :math:`X`` (:math:`X_{k}` and :math:`X_{N-k}`) at once it is better than producing
+compute two elements of :math:`X` (:math:`X_{k}` and :math:`X_{N-k}`) at once it is better than producing
 only one element of :math:`X`.
 
 Why is doing all this work more efficient than letting cuFFT handle the
@@ -386,10 +386,6 @@ operations are all straightforward. While C++ doesn't have a convert with
 saturation function, we can access the CUDA functionality through inline PTX
 assembly (OpenCL C has an equivalent function).
 
-Fine delays and the twiddle factor for the Cooley-Tukey transformation are
-computed using the ``sincospi`` function, which saves both a multiplication by
-:math:`\pi` and a range reduction.
-
 The gains, fine delays and phases need to be made available to the kernel. We
 found that transferring them through the usual CUDA copy mechanism leads to
 sub-optimal scheduling, because these (small) transfers could end up queued
@@ -398,6 +394,45 @@ to allow the CPU to write directly to the GPU buffers. The buffers are
 replicated per output item, so that it is possible for the CPU to be updating
 the values for one output item while the GPU is computing on another.

+Fast sin/cos
+~~~~~~~~~~~~
+CUDA GPUs have hardware units dedicated to computing transcendental functions.
+They are significantly faster than software computation, but have accuracy
+limitations. The larger the absolute value of the argument, the worse the
+accuracy. For angles in the interval :math:`[-\pi, \pi]`, the maximum
+absolute error in computing :math:`e^{jx}` is 4.21e-07. That's roughly 5×
+worse than using the more accurate function, but far smaller than the errors
+introduced by quantisation. Over larger ranges, the maximum error increases
+roughly linearly with the magnitude of the argument. The script
+:file:`scratch/fgpu/sincos_accuracy.py` can be used to measure this.
+
+It's therefore important to check the range of the angles we're using before
+blindly using the faster function. There are several places where we compute
+phase rotations:
+
+1. In implementing the real-to-complex transform, we compute
+   :math:`e^{\frac{-\pi j}{N}\cdot k}`, where
+   :math:`0 \le k \le \frac{N}{2}`. The angle is thus in the range
+   :math:`[-\frac{\pi}{2}, 0]`.
+
+2. In unzipping the FFT, we compute the twiddle factor
+   :math:`e^{\frac{-2\pi j}{mn}\cdot rs}`, where :math:`0 \le r < n` and
+   :math:`0 \le s \le \frac{m}{2}`. The angle is thus in the range
+   :math:`(-\pi, 0]`.
+
+3. We also do an order-:math:`n` FFT, but since we only consider small fixed
+   values of :math:`n`, we hard-code the roots of unity rather than computing
+   them at runtime.
+
+4. Fine delays and phase rotation are combined to produce a per-channel phase
+   rotation. For wideband, the fine delay is up to half a sample, which
+   translates to a maximum rotation of :math:`\frac{\pi}{4}`. For narrowband
+   the calculation is more complex, but it again yields a maximum rotation
+   of :math:`\frac{\pi}{4}`. The fixed phase rotation is limited to
+   :math:`[-\pi, \pi]`, so the total angle is in
+   :math:`[-\frac{5\pi}{4}, \frac{5\pi}{4}]`, for which the fast sincos
+   function has a maximum absolute error of 6.67e-07.
+
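
As a quick check of the angle range claimed in item 2, here is a small numpy sketch (a host-side reference with a made-up helper name, not the kernel code) that builds the unzip twiddle factors and asserts that every angle falls in (-pi, 0]:

import numpy as np

def unzip_twiddles(m, n):
    """Twiddle factors exp(-2j*pi*r*s/(m*n)) for 0 <= r < n, 0 <= s <= m/2."""
    r = np.arange(n)[:, np.newaxis]
    s = np.arange(m // 2 + 1)[np.newaxis, :]
    angle = -2.0 * np.pi * r * s / (m * n)
    # The most negative angle is -2*pi*(n-1)*(m/2)/(m*n) = -pi*(n-1)/n > -pi,
    # so every angle sits inside [-pi, pi], where the fast-path error bound holds.
    assert angle.min() > -np.pi and angle.max() <= 0.0
    return np.exp(1j * angle)

print(unzip_twiddles(m=1024, n=4).shape)  # (4, 513)
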
 Coarse delays
 ^^^^^^^^^^^^^
 One of the more challenging aspects of the processing design was the handling

doc/xbgpu.design.b.rst

Lines changed: 3 additions & 33 deletions

@@ -29,12 +29,12 @@ to channel :math:`c`, beam :math:`b`, antenna :math:`a` is
 where :math:`w_{ab}` and :math:`d_{ab}` are the weight and delay values passed
 to the kernel.
 
-Each workgroup of the kernel handles multiple spectra and all beams and
+Each work-group of the kernel handles multiple spectra and all beams and
 antennas, but only a single channel. Conceptually, the kernel first computes
 :math:`W_{abc}` for all antennas and beams and stores it to local memory, then
 applies it to all antennas and beams. Each input sample is loaded once before
 it is used for all beams. An accumulator is maintained for each beam. Since
-each coefficient is used many times (the number depends on the work group
+each coefficient is used many times (the number depends on the work-group
 size, which is a tuning parameter, but 64-256 is reasonable) after it is
 computed, the cost for computing coefficients is amortised.
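
The accumulation pattern described in this hunk can be sketched on the host as follows. This is plain numpy with made-up names, not the kernel code: the actual coefficient formula (in terms of :math:`w_{ab}` and :math:`d_{ab}`) is defined above this hunk, so the coefficients are simply taken as given here.

import numpy as np

def beamform_channel(samples, coeffs):
    """Accumulate the beams for a single channel.

    samples: complex array of shape (antennas, spectra) for this channel.
    coeffs:  complex array of shape (beams, antennas) -- the per-channel
             W_{abc}, computed once and then reused for every spectrum.
    Returns shape (beams, spectra): one accumulator per beam.
    """
    out = np.zeros((coeffs.shape[0], samples.shape[1]), dtype=np.complex64)
    for a in range(coeffs.shape[1]):
        # Each input sample is used for all beams once it has been loaded.
        out += coeffs[:, a, np.newaxis] * samples[a]
    return out

rng = np.random.default_rng(0)
x = (rng.normal(size=(4, 256)) + 1j * rng.normal(size=(4, 256))).astype(np.complex64)
w = (rng.normal(size=(2, 4)) + 1j * rng.normal(size=(2, 4))).astype(np.complex64)
print(beamform_channel(x, w).shape)  # (2, 256)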

@@ -49,7 +49,7 @@ sizes have two advantages:
    barriers.
 
 2. If the batch size is small, the number of coefficients to compute is also
-   small, and there is not enough work to keep all the work items busy, making
+   small, and there is not enough work to keep all the work-items busy, making
    the coefficient computation less efficient.
 
 Higher beam counts
@@ -66,36 +66,6 @@ This does mean that the inputs are loaded multiple times, but caches help
 significantly here, and the kernel tends to be more compute-bound in this
 domain.
 
-.. _dithering:
-
-Dithering
-^^^^^^^^^
-To improve linearity, a random value in the interval (-0.5, 0.5) is added to
-each component (real and imaginary) before quantisation. These values are
-generated using `curand`_, with its underlying XORWOW generator. It is
-designed for parallel use, with each thread having the same seed but a
-different `sequence` parameter to :cpp:func:`!curand_init`. This minimises
-correlation between sequences generated by different threads. The sequence
-numbers are also chosen to be distinct between the different engines, to avoid
-correlation between channels.
-
-Floating-point rounding issues make it tricky to get a perfectly zero-mean
-distribution. While it is probably inconsequential, simply using
-``curand_uniform(state) - 0.5f`` will not give zero mean. We solve this by
-mapping the :math:`2^{32}` possible return values of :cpp:func:`!curand` to
-the range :math:`(-2^{31}, 2^{31})` with zero represented twice, before
-scaling to convert to a real value in :math:`(-0.5, 0.5)`. While this is
-still a deviation from uniformity, it does give a symmetric distribution.
-
-The :c:struct:`curandStateXORWOW_t` struct defined by curand is unnecessarily large
-for our purposes, because it retains state needed to generate Gaussian
-distributions (Box-Muller transform). To reduce global memory traffic, we use
-a different type we define (:c:struct:`randState_t`) to hold random states in
-global memory, together with helpers that save and restore this smaller state
-from a private :c:struct:`curandStateXORWOW_t` used within a kernel.
-
-.. _curand: https://docs.nvidia.com/cuda/curand/index.html
-
 Data flow
 ---------
 The host side of the beamforming is simpler than for correlation because

qualification/baseline_correlation_products/test_consistency.py

Lines changed: 0 additions & 77 deletions
This file was deleted.

qualification/tied_array_channelised_voltage/test_delay.py

Lines changed: 6 additions & 3 deletions

@@ -45,8 +45,11 @@ async def test_delay_small(
     -------------------
     Verification by means of test. Set a delay on one input and form a beam
     from it with a compensating delay. Use a different input with no delay
-    to form a reference beam. Check that the results are consistent to within 2
-    ULP.
+    to form a reference beam. Check that the results are consistent to within 3
+    ULP. This tolerance allows for 1 ULP of F-engine dithering for each input,
+    and 1 ULP for B-engine dithering (the reference beam experiences no
+    dithering because its output simply equals the input, so no re-quantisation
+    occurs).
 
     This test is only valid for delays of less than half a sample. For larger
     delays, the F-engine delay is done partially in the time domain, while the
@@ -115,7 +118,7 @@ async def test_delay_small(
     data = data.astype(np.int16)
     max_error = np.max(np.abs(data[delay_beam] - data[ref_beam]))
     with check:
-        assert max_error <= 2
+        assert max_error <= 3
     pdf_report.detail(f"Maximum difference is {max_error} ULP")

scratch/fgpu/benchmarks/compute_bench.py

Lines changed: 2 additions & 0 deletions

@@ -88,6 +88,8 @@ def main(): # noqa: C901
         samples=args.recv_chunk_samples + extra_samples,
         spectra=out_spectra,
         spectra_per_heap=spectra_per_heap,
+        seed=123,
+        sequence_first=456,
     )
     fn.ensure_all_bound()

scratch/fgpu/sincos_accuracy.py

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
#!/usr/bin/env python3

################################################################################
# Copyright (c) 2024, National Research Foundation (SARAO)
#
# Licensed under the BSD 3-Clause License (the "License"); you may not use
# this file except in compliance with the License. You may obtain a copy
# of the License at
#
#   https://opensource.org/licenses/BSD-3-Clause
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

"""Measure the accuracy of CUDA's sincos implementations."""

import argparse

import numpy as np
import pycuda.autoinit  # noqa: F401  (importing it initialises the CUDA context)
import pycuda.driver
from pycuda.compiler import SourceModule


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--max", type=float, default=1.0, help="Maximum value to test, in units of pi")
    parser.add_argument("--func", choices=["__sincosf", "sincosf"], default="__sincosf")
    args = parser.parse_args()

    source = SourceModule(
        f"""
        #include <math.h>

        __global__ void sincos_kernel(float2 *out, const float *in)
        {{
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            {args.func}(in[idx], &out[idx].x, &out[idx].y);
        }}
        """
    )
    kernel = source.get_function("sincos_kernel")

    info = np.finfo(np.float32)
    block = 128
    n = 2**info.nmant
    sc = np.zeros((n, 2), np.float32)

    max_test = np.pi * args.max
    # Iterate through the exponent portion of the float32. For each value,
    # we populate angle with all possible mantissa bits. We exclude the
    # largest exponent since that is used to encode infinities and NaNs.
    max_sin_err = 0.0
    max_cos_err = 0.0
    max_tot_err = 0.0
    for raw_exp in range(0, 2**info.nexp - 1):
        angle = np.arange(raw_exp << info.nmant, (raw_exp + 1) << info.nmant, dtype=np.uint32).view(np.float32)
        if angle[0] > max_test:
            break
        cut = np.searchsorted(angle, max_test, side="right")

        kernel(pycuda.driver.Out(sc), pycuda.driver.In(angle), block=(block, 1, 1), grid=(n // block, 1, 1))
        # Only include angles no larger than max_test in the error statistics.
        sin_err = np.abs(sc[:cut, 0].astype(np.float64) - np.sin(angle[:cut].astype(np.float64)))
        cos_err = np.abs(sc[:cut, 1].astype(np.float64) - np.cos(angle[:cut].astype(np.float64)))
        tot_err = np.hypot(sin_err, cos_err)
        max_sin_err = max(max_sin_err, np.max(sin_err))
        max_cos_err = max(max_cos_err, np.max(cos_err))
        max_tot_err = max(max_tot_err, np.max(tot_err))

    print(f"Max sin err: {max_sin_err} (2**{np.log2(max_sin_err)})")
    print(f"Max cos err: {max_cos_err} (2**{np.log2(max_cos_err)})")
    print(f"Max tot err: {max_tot_err} (2**{np.log2(max_tot_err)})")


if __name__ == "__main__":
    main()
