@@ -72,15 +72,15 @@ The polyphase filter bank starts with a finite impulse response (FIR) filter,
 with some number of *taps* (e.g., 16), and a *step* size which is twice the
 number of output channels. This can be thought of as organising the samples as
 a 2D array, with *step* columns, and then applying a FIR down each column.
-Since the columns are independent, we map each column to a separate workitem,
+Since the columns are independent, we map each column to a separate work-item,
 which keeps a sliding window of samples in its registers. GPUs generally don't
 allow indirect indexing of registers, so loop unrolling (by the number of
 taps) is used to ensure that the indices are known at compile time.
 
 This might not give enough parallelism, particularly for small channel counts,
-so in fact each column is split into sections and a separate workitem is used
+so in fact each column is split into sections and a separate work-item is used
 for each section. There is a trade-off here as samples at the boundaries
-between sections need to be loaded by both workitems, leading to overheads.
+between sections need to be loaded by both work-items, leading to overheads.
 
 Registers are used to hold both the sliding window and the weights, which
 leads to significant register pressure. This reduces occupancy and leads to
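The column-wise organisation described here can be sketched on the host with NumPy. This is an illustrative model only: the function name, the weight layout, and the tap ordering are assumptions for the sketch, not the kernel's actual code.

```python
import numpy as np

def polyphase_fir(samples, weights, step):
    """Host-side model of the polyphase FIR front-end (hypothetical helper).

    samples: 1-D array of real input samples
    weights: prototype filter of length taps * step
    step:    number of columns (twice the output channel count)
    """
    taps = len(weights) // step
    # Organise the samples as a 2D array with `step` columns.
    n_rows = len(samples) // step
    grid = samples[:n_rows * step].reshape(n_rows, step)
    w = weights.reshape(taps, step)
    # Apply the FIR down each column: each output row is a weighted sum
    # of `taps` consecutive input rows -- the sliding window that the
    # GPU kernel keeps in per-work-item registers.
    out = np.empty((n_rows - taps + 1, step))
    for i in range(out.shape[0]):
        out[i] = np.sum(grid[i:i + taps] * w, axis=0)
    return out
```

Because each column's computation touches only its own samples, the columns can be processed independently, which is what makes the one-work-item-per-column mapping natural.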
@@ -219,7 +219,7 @@ We can also re-use some common expressions by computing :math:`X_{N-k}` at the s
 This raises the question: Why compute both :math:`X_{k}` and :math:`X_{N-k}`? After all,
 parameter :math:`k` should range over the full channel range initially stated (parameter :math:`N`). The answer:
 compute efficiency. It is costly to compute :math:`U_k` and :math:`V_k`, so if we can use them to
-compute two elements of :math:`X`` (:math:`X_{k}` and :math:`X_{N-k}`) at once it is better than producing
+compute two elements of :math:`X` (:math:`X_{k}` and :math:`X_{N-k}`) at once it is better than producing
 only one element of :math:`X`.
 
 Why is doing all this work more efficient than letting cuFFT handle the
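The pairing can be checked numerically with the standard real-FFT trick: one half-size complex FFT yields the shared terms (called :math:`U_k` and :math:`V_k` above), from which both outputs follow, and for real input :math:`X_{N-k}` is simply the conjugate of :math:`X_k`. This sketch uses NumPy and hypothetical names; it is not the kernel's actual code.

```python
import numpy as np

def real_fft_pair(x, k):
    """Compute X_k and X_{N-k} of a real sequence x (len N, N even)
    from one complex FFT of length N/2, reusing the shared U_k, V_k."""
    N = len(x)
    Nh = N // 2
    # Pack even/odd samples into one complex sequence and FFT it once.
    Z = np.fft.fft(x[0::2] + 1j * x[1::2])
    Zk = Z[k % Nh]
    Zc = np.conj(Z[(Nh - k) % Nh])
    U = 0.5 * (Zk + Zc)        # FFT of the even samples
    V = (Zk - Zc) / 2j         # FFT of the odd samples
    Xk = U + np.exp(-2j * np.pi * k / N) * V
    # For real input X_{N-k} = conj(X_k), so the second output is
    # essentially free once U_k and V_k are in hand.
    return Xk, np.conj(Xk)
```

The expensive part is forming `U` and `V`; producing the second output from them costs only a conjugation, which is the efficiency argument made above.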
@@ -386,10 +386,6 @@ operations are all straightforward. While C++ doesn't have a convert with
 saturation function, we can access the CUDA functionality through inline PTX
 assembly (OpenCL C has an equivalent function).
 
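A software model of what the saturating conversion does may clarify the semantics: PTX's `cvt` with saturation rounds the float to an integer and clamps it to the destination type's range. The helper below is a hypothetical host-side stand-in, not the kernel's code (the kernel uses the PTX instruction directly).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Host-side model of a saturating float -> int8 conversion: round to
// nearest (ties to even under the default FP environment), then clamp
// to the representable range [-128, 127].
std::int8_t convert_sat(float x)
{
    float r = std::nearbyint(x);
    r = std::min(std::max(r, -128.0f), 127.0f);
    return static_cast<std::int8_t>(r);
}
```

Doing the clamp in hardware as part of the conversion avoids the extra min/max instructions this model needs.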
-Fine delays and the twiddle factor for the Cooley-Tukey transformation are
-computed using the ``sincospi`` function, which saves both a multiplication by
-:math:`\pi` and a range reduction.
-
 The gains, fine delays and phases need to be made available to the kernel. We
 found that transferring them through the usual CUDA copy mechanism leads to
 sub-optimal scheduling, because these (small) transfers could end up queued
@@ -398,6 +394,45 @@ to allow the CPU to write directly to the GPU buffers. The buffers are
 replicated per output item, so that it is possible for the CPU to be updating
 the values for one output item while the GPU is computing on another.
 
+Fast sin/cos
+~~~~~~~~~~~~
+CUDA GPUs have hardware units dedicated to computing transcendental functions.
+They are significantly faster than software computation, but have accuracy
+limitations. The larger the absolute value of the argument, the worse the
+accuracy is. For angles in the interval :math:`[-\pi, \pi]`, the maximum
+absolute error in computing :math:`e^{jx}` is 4.21e-07. That's roughly 5×
+worse than using the more accurate function, but far smaller than the errors
+introduced by quantisation. Over larger ranges, the maximum error increases
+roughly linearly with the magnitude. The script
+:file:`scratch/sincos_accuracy` can be used to measure this.
+
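The measurement methodology can be sketched in NumPy. Note the caveat: the GPU's special-function unit can only be measured on the device, so float32 arithmetic stands in here; the absolute error figures differ, but the growth of error with argument magnitude is the same phenomenon. The function name is made up for the sketch.

```python
import numpy as np

def max_abs_error(limit, samples=100000):
    """Maximum absolute error of e^{jx} over [-limit, limit], comparing
    reduced-precision (float32) sin/cos against a float64 reference."""
    x = np.linspace(-limit, limit, samples)
    x32 = x.astype(np.float32)
    approx = np.cos(x32) + 1j * np.sin(x32)   # reduced precision
    exact = np.exp(1j * x)                    # float64 reference
    return float(np.max(np.abs(approx - exact)))
```

Sweeping `limit` upward shows the error growing roughly linearly with the argument magnitude, mirroring the behaviour described above.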
+It's therefore important to check the range of the angles we're using before
+blindly using the faster function. There are several places where we compute
+phase rotations:
+
+1. In implementing the real-to-complex transform, we compute
+   :math:`e^{\frac{-\pi j}{N}\cdot k}`, where
+   :math:`0 \le k \le \frac{N}{2}`. The angle is thus in the range
+   :math:`[-\frac{\pi}{2}, 0]`.
+
+2. In unzipping the FFT, we compute the twiddle factor
+   :math:`e^{\frac{-2\pi j}{mn}\cdot rs}`, where :math:`0 \le r < n` and
+   :math:`0 \le s \le \frac{m}{2}`. The angle is thus in the range
+   :math:`(-\pi, 0]`.
+
+3. We also do an order-:math:`n` FFT, but since we only consider small fixed
+   values of :math:`n`, we hard-code the roots of unity rather than computing
+   them at runtime.
+
+4. Fine delays and phase rotation are combined to produce a per-channel phase
+   rotation. For wideband, the fine delay is up to half a sample, which
+   translates to a maximum rotation of :math:`\frac{\pi}{4}`. For narrowband
+   the calculation is more complex, but it again yields a maximum rotation
+   of :math:`\frac{\pi}{4}`. The fixed phase rotation is limited to
+   :math:`[-\pi, \pi]`, so the total angle is in
+   :math:`[-\frac{5\pi}{4}, \frac{5\pi}{4}]`, for which the fast sincos
+   function has a maximum absolute error of 6.67e-07.
+
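The angle ranges quoted for cases 1 and 2 can be spot-checked numerically. In this sketch `N`, `m` and `n` are arbitrary example values, not the kernel's configuration:

```python
import numpy as np

# Case 1: real-to-complex transform, angle = -pi*k/N for 0 <= k <= N/2.
N = 4096
k = np.arange(N // 2 + 1)
a1 = -np.pi * k / N
assert a1.min() >= -np.pi / 2 and a1.max() <= 0

# Case 2: FFT-unzip twiddle, angle = -2*pi*r*s/(m*n) for
# 0 <= r < n, 0 <= s <= m/2; the extreme is -pi*(n-1)/n, strictly > -pi.
m, n = 1024, 4
r, s = np.meshgrid(np.arange(n), np.arange(m // 2 + 1))
a2 = -2 * np.pi * r * s / (m * n)
assert a2.min() > -np.pi and a2.max() <= 0
```

Both ranges sit comfortably inside :math:`[-\pi, \pi]`, where the fast hardware sin/cos is at its most accurate.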
 Coarse delays
 ^^^^^^^^^^^^^
 One of the more challenging aspects of the processing design was the handling