Replies: 5 comments
-
Currently building a workbench with IIRs (SVF and biquad) alongside improved Goertzel (flat-top windowed, Q-tuned) filters. After I know what our filters can do, it will be time to tune them into a bank. I'm not sure how it will shake out, but if we can't get pitch, time, and amplitude resolution from one filter, then a combination of filters will be used to produce several output textures and some composites, so that drawing can be very fast-attack and still lock in on pitch and amplitude for any tones that exist for sufficient duration. Kind of excited to find out how these filters do. I know Goertzel has very fundamental limits. Already saw some stability limits for biquads at low frequency where the SVFs have no problems. Will publish my new workbench binary sooner rather than later.
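For reference, a minimal single-bin Goertzel in Rust (the plain recurrence only; the flat-top windowed, Q-tuned variant mentioned above layers a window on top of this core):

```rust
// Minimal single-bin Goertzel: the power of one target frequency
// over a block of samples, in O(N) with two state variables.
fn goertzel_power(samples: &[f64], sample_rate: f64, target_hz: f64) -> f64 {
    let w = 2.0 * std::f64::consts::PI * target_hz / sample_rate;
    let coeff = 2.0 * w.cos();
    let (mut s1, mut s2) = (0.0, 0.0);
    for &x in samples {
        let s0 = x + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    // Squared magnitude of the bin after the full block.
    s1 * s1 + s2 * s2 - coeff * s1 * s2
}

fn main() {
    let fs = 48_000.0;
    // 960 Hz is exactly 16 cycles in an 800-sample (60 fps) frame.
    let tone: Vec<f64> = (0..800)
        .map(|n| (2.0 * std::f64::consts::PI * 960.0 * n as f64 / fs).sin())
        .collect();
    let on = goertzel_power(&tone, fs, 960.0);
    let off = goertzel_power(&tone, fs, 3_000.0);
    // On-bin power is (N/2)^2 = 160000; a distant exact bin sees ~nothing.
    assert!(on > 1.0e5 && off < on / 1.0e3);
    println!("on-bin: {on:.1}, off-bin: {off:.6}");
}
```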
-
See the dsp and workbench crates for updates. Currently the Goertzel filters with Dolph-Chebyshev windows are the best and most engineering-friendly. The limitation of the Dolph-Chebyshev is its flat noise floor: the sidelobes do not decay at all. 100 dB of cymbals will show up as 20 dB bass tones over an 80 dB floor. We need various supplemental filters, including polyphase downsampling, to cut far-off noise without distortion before feeding the DFTs. Another potential problem is that gain correction by ISO 226 is non-linear. It might be difficult to cross-reference against fast-attack filters if every fast-attack filter sees about 10 dB of ISO 226 change from band start to band stop. ISO 226 is shaped the way it is because bass tones tend to be stronger, and that means lower tones leaking into higher bins that weren't low-cut soon enough.
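The flat-floor arithmetic is worth making explicit. A tiny sketch of the numbers above (nothing project-specific, just dB bookkeeping):

```rust
// dB bookkeeping for a flat sidelobe floor: every far-off bin sees
// the masker attenuated by the same amount, so a 100 dB cymbal
// behind an 80 dB Dolph-Chebyshev floor paints ~20 dB ghost tones.
fn db_to_amplitude(db: f64) -> f64 {
    10f64.powf(db / 20.0)
}

fn leaked_level_db(masker_db: f64, sidelobe_floor_db: f64) -> f64 {
    masker_db - sidelobe_floor_db
}

fn main() {
    let ghost = leaked_level_db(100.0, 80.0);
    assert_eq!(ghost, 20.0);
    // In linear amplitude the ghost is 10^(-80/20) = 1e-4 of the masker.
    let ratio = db_to_amplitude(ghost) / db_to_amplitude(100.0);
    assert!((ratio - 1.0e-4).abs() < 1.0e-12);
    println!("ghost level: {ghost} dB ({ratio:e} of the masker)");
}
```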
-
Conclusion for now is to combine all three: the IIRs are cheap and parallel, the FIRs are cheap and parallel, and the DFTs are cheap with decimation and still parallel. This is a good GPU programming combination. This week we will get this workbenched into a pre-baked filter bank that outputs to one image texture per window size. Using ~3 windows enables a mixture of time, pitch, and amplitude precision.
Windows with less side-lobe suppression can give us better pitch precision at the expense of lost dynamic range. They must look for peaks, since something like a 20 dB Dolph-Chebyshev window gives a narrow main lobe but only 20 dB of suppression. Adjusting the threshold can be done by watching the RMS over the bandpass: if there is a lot of RMS in the wide band (IIR), we have to raise the threshold on the narrow band (DFT). With Dolph-Chebyshev, our far-sidelobe suppression is only as good as the IIR. On the high-frequency end, the decimation and FIR will cut much of it. On the low-frequency end, the IIR is more important; without it, a big bass thud would light up the whole spectrogram. It's only about 10 dB per octave, so even a 1st-order IIR will keep it in check, and a 2nd-order will preserve some dynamic range.
All in all, the tradeoffs are very reminiscent of the CAP theorem, but with five variables: the IIRs trade a little bit of delay for better dynamic range, and the DFT window length and choice of Dolph-Chebyshev window tune most of the rest.
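The wide-band/narrow-band thresholding above can be sketched like this (names such as `narrow_bin_threshold` and the leakage fraction are hypothetical, for illustration only):

```rust
// Hypothetical sketch of wide-band/narrow-band cross-referencing:
// when the wide IIR band carries a lot of RMS, raise the detection
// threshold on the narrow DFT bins inside it to mask leakage.
fn rms(samples: &[f64]) -> f64 {
    (samples.iter().map(|x| x * x).sum::<f64>() / samples.len() as f64).sqrt()
}

// Linear-amplitude threshold: a fixed floor plus a fraction of the
// wide band's energy (`leakage_fraction` models the window's floor).
fn narrow_bin_threshold(base_floor: f64, wide_band: &[f64], leakage_fraction: f64) -> f64 {
    base_floor + leakage_fraction * rms(wide_band)
}

fn main() {
    let quiet = vec![0.001; 256];
    let loud: Vec<f64> = (0..256).map(|n| 0.5 * (n as f64 * 0.3).sin()).collect();
    let t_quiet = narrow_bin_threshold(0.01, &quiet, 1.0e-4);
    let t_loud = narrow_bin_threshold(0.01, &loud, 1.0e-4);
    // A loud wide band forces a higher threshold on its narrow bins.
    assert!(t_loud > t_quiet);
    println!("quiet: {t_quiet:.6}, loud: {t_loud:.6}");
}
```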
-
Just finished up the Parks-McClellan (Remez) adoption for the FIR lowpass. Most of the tools necessary are ready for slamming onto the GPU. The biggest places for advancement in the future are likely reconstruction techniques: watching nearby filter bins and wide-band IIRs to decide where to set dynamic thresholds, or to dampen neighbors where we can calculate a deterministic mask (allowing pure signal in neighbors through).
-
Realized how polyphase works and found a better way to implement it on the GPU. Not going to prototype this in Rust, since the existing FIRs have served their purpose by letting me evaluate window coefficients. Also, it's just time to do GPU things 🤠
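For the record, a CPU-side sketch of polyphase decimation (this is the textbook phase decomposition, not necessarily the GPU layout referred to above):

```rust
// Textbook polyphase decimation by M: phase p owns taps
// p, p+M, p+2M, …, and outputs are only computed at the decimated
// rate. The M phases are independent, which is what makes the
// structure attractive for parallel (GPU) execution.
fn polyphase_decimate(input: &[f64], taps: &[f64], m: usize) -> Vec<f64> {
    // First output index whose full tap history lies inside `input`.
    let first = (taps.len() - 1 + m - 1) / m;
    let n_out = input.len() / m;
    let mut out = vec![0.0; n_out.saturating_sub(first)];
    for (o, y) in out.iter_mut().enumerate() {
        let i = first + o;
        for p in 0..m {
            let mut k = 0;
            while k * m + p < taps.len() {
                *y += taps[k * m + p] * input[i * m - (k * m + p)];
                k += 1;
            }
        }
    }
    out
}

fn main() {
    // A 2-tap average decimated by 2 matches the full convolution
    // y[n] = 0.5 x[n] + 0.5 x[n-1] sampled at n = 2, 4.
    let out = polyphase_decimate(&[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], &[0.5, 0.5], 2);
    assert_eq!(out, vec![2.5, 4.5]);
    println!("{out:?}");
}
```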
-
Converting amplitude into the frequency domain is upstream of everything, procedural or learned. The program can only be as good as its spectrum analyzer.
Capability
How we can "see" sound in our program. At every step we are up against the uncertainty principle: time-versus-pitch resolution and precision tradeoffs. The ideal filter bin responds instantly and evenly to every pitch from its center until halfway to the next bin. These bins do not exist and must be approximated. There are tools we can use to get closer to the information horizon, and how they are used is tightly coupled with the implementation.
Perceptual
Perception is both an audio and visual problem.
Low frequencies need larger RMS before we perceive the same loudness. So far, the ISO 226 "correction" is actually just completely dropping out the bass tones. Either our math is wrong or we should not be attempting to use ISO 226 at all.
Questions ✋
Curves
Most visual output devices cover only a limited range with linear precision. The audio log is mostly undone by the visual log: we can easily see low-brightness pixels changing, and we expect to see something whenever we perceive any pitch, no matter how faint. Linearly drawing dB on screen mostly "just works". While we might want to compress noise near the auditory threshold with careful ramp design, once we get above the noise, sounds we can perceive may have some log base mapped onto the linear display, and our eyes and ears will agree.
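A sketch of the "careful ramp design" idea: map dB linearly onto brightness, with a smoothstep ramp (a hypothetical choice, not the project's actual curve) that eases in near the noise floor:

```rust
// Map a bin level in dB linearly onto [0, 1] brightness, with a
// smoothstep ramp (a hypothetical choice) that compresses values
// near the noise floor so faint noise doesn't flicker at full
// visibility while the midrange stays roughly linear.
fn db_to_brightness(level_db: f64, floor_db: f64, ceil_db: f64) -> f64 {
    let t = ((level_db - floor_db) / (ceil_db - floor_db)).clamp(0.0, 1.0);
    t * t * (3.0 - 2.0 * t) // smoothstep
}

fn main() {
    for db in [-100.0, -90.0, -45.0, 0.0] {
        println!("{db:>6} dB -> {:.3}", db_to_brightness(db, -90.0, 0.0));
    }
}
```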
Questions ✋
Computation
Providing good data at a high rate and low latency is coupled with the time versus pitch resolution & precision tradeoffs.
Keeping Visual Ahead of Audio
We want to finish our audio processing in roughly the low tens of microseconds and keep our frame times quite low to aid in real-time monitoring, visually rendering slightly before audio physically leaves the output device. The mind accepts audio delay much more than visual, so we take steps to avoid visual delay at all costs, including techniques like waiting until late in the latch window to stream audio and submit GPU commands, using `VK_KHR_present_wait` and others (this is not yet available on MoltenVK?). We can also use `VK_EXT_swapchain_maintenance1` with a fence. Setting the `node.latency` property when creating streams reduces the input all the way down to 128 samples, much lower than any frame time a human can perceive.
Variable Versus Fixed Render Rate
FreeSync and fast-refresh screens are increasingly available. Alternative frontends such as lasers and LEDs can also cycle quite a bit faster than 60 Hz and will benefit from VRR-compatible implementations. This makes the amount of audio per frame variable, which requires variable rates of advance in the visuals.
Computation Performance
The current CQT-style design so far has independent bins, which makes rolling sums and parallel calculation both very easy to implement. This design is the basis.
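One standard way to get an independent, rolling-sum bin is the sliding DFT; a sketch for illustration (not necessarily the exact formulation used here):

```rust
// One independent rolling-sum bin as a sliding DFT: each sample
// updates the bin in O(1) by adding the newest sample, removing the
// one leaving the window, and rotating the accumulated phasor.
// Bins share no state, so a bank of them parallelizes trivially.
struct SlidingBin {
    window: Vec<f64>, // ring buffer of the last N samples
    pos: usize,
    re: f64,
    im: f64,
    rot: (f64, f64), // e^{i 2π k / N}
}

impl SlidingBin {
    fn new(n: usize, k: usize) -> Self {
        let w = 2.0 * std::f64::consts::PI * k as f64 / n as f64;
        SlidingBin { window: vec![0.0; n], pos: 0, re: 0.0, im: 0.0, rot: (w.cos(), w.sin()) }
    }

    fn push(&mut self, x: f64) {
        let oldest = self.window[self.pos];
        self.window[self.pos] = x;
        self.pos = (self.pos + 1) % self.window.len();
        // S ← (S + x − oldest) · e^{i 2π k / N}
        let re = self.re + x - oldest;
        let im = self.im;
        self.re = re * self.rot.0 - im * self.rot.1;
        self.im = re * self.rot.1 + im * self.rot.0;
    }

    fn power(&self) -> f64 {
        self.re * self.re + self.im * self.im
    }
}

fn main() {
    // Feed one period of a bin-1 cosine into an 8-sample bin:
    // the power matches the DFT magnitude squared, (N/2)^2 = 16.
    let mut bin = SlidingBin::new(8, 1);
    for n in 0..8 {
        bin.push((2.0 * std::f64::consts::PI * n as f64 / 8.0).cos());
    }
    assert!((bin.power() - 16.0).abs() < 1e-9);
    println!("bin power: {:.3}", bin.power());
}
```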
Precision
Remember, the ideal filter bin responds instantly and evenly to every pitch from its center until halfway to the next bin. This is impossible. What is possible is to measure with uncertainty and represent it to the program.
Decimation
Low-pitch bins are fed far more samples than they need. We want to skip input samples and run the transform on a down-sampled input.
The first pass at decimation revealed strong leakage: the "folding" aliased noise into lower frequencies, so strongly that the noise at the coupled bins was larger than the signal observed by the bins measuring true pitch. We really must low-pass (anti-alias) before decimation.
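Decimation needs an anti-aliasing lowpass in front of it; a minimal demonstration of why (the 2-tap average is purely illustrative; the real filter is a designed lowpass):

```rust
// Why the anti-alias lowpass must precede decimation: a tone at
// Nyquist folds straight down to DC when samples are dropped, but
// a lowpass that nulls Nyquist first leaves nothing to fold.
fn fir(input: &[f64], taps: &[f64]) -> Vec<f64> {
    (taps.len() - 1..input.len())
        .map(|n| taps.iter().enumerate().map(|(k, h)| h * input[n - k]).sum())
        .collect()
}

fn decimate(input: &[f64], m: usize) -> Vec<f64> {
    input.iter().step_by(m).copied().collect()
}

fn main() {
    // A Nyquist-rate tone: +1, -1, +1, -1, …
    let nyquist: Vec<f64> = (0..16).map(|n| if n % 2 == 0 { 1.0 } else { -1.0 }).collect();
    // Naive 2:1 decimation turns it into pure DC — false bass energy.
    let naive = decimate(&nyquist, 2);
    assert!(naive.iter().all(|&x| x == 1.0));
    // Averaging adjacent samples nulls Nyquist, so nothing folds down.
    let clean = decimate(&fir(&nyquist, &[0.5, 0.5]), 2);
    assert!(clean.iter().all(|x| x.abs() < 1e-12));
    println!("naive: {naive:?}\nclean: {clean:?}");
}
```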
Widening High Pitch Bins
If a bin sums over a very long window, it will destructively cancel even very nearby frequencies. This makes our high-pitch bins too selective. The window length for 60 fps is 800 samples. This is much longer than the wavelength of our highest-pitch bins, and correspondingly they end up seeing almost nothing except the exact bin frequency, leaving large gaps between bins where no energy is measured. There are two solutions:
The second option is probably better: slide a shorter window, with an appropriately tolerant window function, over sub-frame lengths in order to "blur" the bin's pitch response to match the frequency range it needs to measure energy within. This is basically the STFT. We have to calculate this sum multiple times per audio frame, which is easier if we design for multiple clocks (for VRR etc.).
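The required window length can be estimated from the bandwidth a bin must cover: a window of N samples has a main lobe roughly c·fs/N wide, with c depending on the window shape (about 2 DFT bins null-to-null for rectangular, about 4 for Hann). A sketch with illustrative numbers:

```rust
// Estimate the longest usable window for a bin from the bandwidth
// it must cover: a window of N samples has a main lobe roughly
// lobe_bins * fs / N Hz wide (lobe_bins ≈ 2 for rectangular,
// ≈ 4 for Hann, null-to-null). Numbers here are illustrative.
fn window_len_for_bandwidth(sample_rate: f64, bandwidth_hz: f64, lobe_bins: f64) -> usize {
    (lobe_bins * sample_rate / bandwidth_hz).round() as usize
}

fn main() {
    let fs = 48_000.0;
    // A high bin responsible for ~600 Hz of spectrum with a
    // Hann-like lobe can use only ~320 samples — far below the
    // 800-sample 60 fps frame, hence sliding sub-frame windows.
    let n = window_len_for_bandwidth(fs, 600.0, 4.0);
    assert_eq!(n, 320);
    println!("window length: {n} samples");
}
```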
Reconciling All Trade-Offs
The idea of using multiple filter banks is ultimately the best solution, because we can draw the fast signal while expecting the slow signal to show up soon. We can then use corroboration with the slow-bin log to derive slightly better pitch and time resolution, albeit with a delay.
Unlike subatomic particles, audio signals do not change when we measure them; we can measure pitch and time precisely, just a little bit later, much as a human likely represents uncertainty in the initial moments of perceiving any sound.