Replies: 5 comments
-
Currently building a workbench with IIRs (SVF and biquad) alongside improved Goertzel (flat-top windowed, Q-tuned) filters. After I know what our filters can do, it will be time to tune them into a bank. I'm not sure how it will shake out, but if we can't get pitch, time, and amplitude resolution from one filter, then a combination of filters will be used to produce several output textures and some composites, so that drawing can be very fast-attack and still lock in on pitch and amplitude for any tones that exist for sufficient duration. Kind of excited to find out how these filters do. I know Goertzel has very fundamental limits. Already saw some stability limits for biquads at low frequency where the SVFs have no problems. Will publish my new workbench binary sooner rather than later.
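For reference, a minimal single-bin Goertzel in Rust (the plain recurrence only; the flat-top windowed, Q-tuned variant mentioned above layers a window on top of this core):

```rust
// Minimal single-bin Goertzel: the power of one target frequency
// over a block of samples, in O(N) with two state variables.
fn goertzel_power(samples: &[f64], sample_rate: f64, target_hz: f64) -> f64 {
    let w = 2.0 * std::f64::consts::PI * target_hz / sample_rate;
    let coeff = 2.0 * w.cos();
    let (mut s1, mut s2) = (0.0, 0.0);
    for &x in samples {
        let s0 = x + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    // Squared magnitude of the bin after the full block.
    s1 * s1 + s2 * s2 - coeff * s1 * s2
}

fn main() {
    let fs = 48_000.0;
    // 960 Hz is exactly 16 cycles in an 800-sample (60 fps) frame.
    let tone: Vec<f64> = (0..800)
        .map(|n| (2.0 * std::f64::consts::PI * 960.0 * n as f64 / fs).sin())
        .collect();
    let on = goertzel_power(&tone, fs, 960.0);
    let off = goertzel_power(&tone, fs, 3_000.0);
    // On-bin power is (N/2)^2 = 160000; a distant exact bin sees ~nothing.
    assert!(on > 1.0e5 && off < on / 1.0e3);
    println!("on-bin: {on:.1}, off-bin: {off:.6}");
}
```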
-
See the dsp and workbench crates for updates. Currently the Goertzel filters with Dolph-Chebyshev windows are the best and most engineering-friendly. The limitation of the Dolph-Chebyshev is its flat noise floor: the sidelobes do not decay at all. 100 dB of cymbals will show up as 20 dB bass tones over an 80 dB floor. We need various supplemental filters, including polyphase downsampling, to cut far-off noise without distortion before feeding the DFTs. Another potential problem is that gain correction by ISO 226 is non-linear. It might be difficult to cross-reference against fast-attack filters if every fast-attack filter sees about 10 dB of ISO 226 change from band start to band stop. ISO 226 is shaped the way it is because bass tones tend to be stronger, and that means lower tones leaking into higher bins that weren't low-cut soon enough.
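The flat-floor arithmetic is worth making explicit. A tiny sketch of the numbers above (nothing project-specific, just dB bookkeeping):

```rust
// dB bookkeeping for a flat sidelobe floor: every far-off bin sees
// the masker attenuated by the same amount, so a 100 dB cymbal
// behind an 80 dB Dolph-Chebyshev floor paints ~20 dB ghost tones.
fn db_to_amplitude(db: f64) -> f64 {
    10f64.powf(db / 20.0)
}

fn leaked_level_db(masker_db: f64, sidelobe_floor_db: f64) -> f64 {
    masker_db - sidelobe_floor_db
}

fn main() {
    let ghost = leaked_level_db(100.0, 80.0);
    assert_eq!(ghost, 20.0);
    // In linear amplitude the ghost is 10^(-80/20) = 1e-4 of the masker.
    let ratio = db_to_amplitude(ghost) / db_to_amplitude(100.0);
    assert!((ratio - 1.0e-4).abs() < 1.0e-12);
    println!("ghost level: {ghost} dB ({ratio:e} of the masker)");
}
```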
-
Conclusion for now is to combine all three: the IIRs are cheap and parallel, the FIRs are cheap and parallel, and the DFTs are cheap with decimation and still parallel. This is a good GPU programming combination. This week we will get this workbenched into a pre-baked filter bank that outputs to one image texture per window size. Using ~3 windows enables a mixture of time, pitch, and amplitude precision.
Windows with less side-lobe suppression can give us better pitch precision at the expense of lost dynamic range. They must look for peaks, since something like a 20 dB Dolph-Chebyshev window gives a narrow main lobe but only 20 dB of suppression. Adjusting the threshold can be done by watching the RMS over the bandpass: if there is a lot of RMS in the wide band (IIR), we have to raise the threshold on the narrow band (DFT). With Dolph-Chebyshev, our far-sidelobe suppression is only as good as the IIR. On the high-frequency end, the decimation and FIR will cut much of it. On the low-frequency end, the IIR is more important; without it, a big bass thud would light up the whole spectrogram. It's only about 10 dB per octave, so even a 1st-order IIR will keep it in check, and a 2nd-order will preserve some dynamic range.
All in all, the tradeoffs are very reminiscent of the CAP theorem, but with five variables: the IIRs trade a little bit of delay for better dynamic range, and the DFT window length and choice of Dolph-Chebyshev window tune most of the rest.
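The wide-band/narrow-band thresholding above can be sketched like this (names such as `narrow_bin_threshold` and the leakage fraction are hypothetical, for illustration only):

```rust
// Hypothetical sketch of wide-band/narrow-band cross-referencing:
// when the wide IIR band carries a lot of RMS, raise the detection
// threshold on the narrow DFT bins inside it to mask leakage.
fn rms(samples: &[f64]) -> f64 {
    (samples.iter().map(|x| x * x).sum::<f64>() / samples.len() as f64).sqrt()
}

// Linear-amplitude threshold: a fixed floor plus a fraction of the
// wide band's energy (`leakage_fraction` models the window's floor).
fn narrow_bin_threshold(base_floor: f64, wide_band: &[f64], leakage_fraction: f64) -> f64 {
    base_floor + leakage_fraction * rms(wide_band)
}

fn main() {
    let quiet = vec![0.001; 256];
    let loud: Vec<f64> = (0..256).map(|n| 0.5 * (n as f64 * 0.3).sin()).collect();
    let t_quiet = narrow_bin_threshold(0.01, &quiet, 1.0e-4);
    let t_loud = narrow_bin_threshold(0.01, &loud, 1.0e-4);
    // A loud wide band forces a higher threshold on its narrow bins.
    assert!(t_loud > t_quiet);
    println!("quiet: {t_quiet:.6}, loud: {t_loud:.6}");
}
```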
-
Just finished up the Parks-McClellan (Remez) adoption for the FIR lowpass. Most of the tools necessary are ready for slamming onto the GPU. The biggest places for advancement in the future are likely reconstruction techniques: watching nearby filter bins and wide-band IIRs to decide where to set dynamic thresholds, or to dampen neighbors where we can calculate a deterministic mask (allowing pure signal in neighbors through).
-
Realized how polyphase works and found a better way to implement it on the GPU. Not going to prototype this in Rust, since the existing FIRs have served their purpose by letting me evaluate window coefficients. Also, it's just time to do GPU things 🤠
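For the record, a CPU-side sketch of polyphase decimation (this is the textbook phase decomposition, not necessarily the GPU layout referred to above):

```rust
// Textbook polyphase decimation by M: phase p owns taps
// p, p+M, p+2M, …, and outputs are only computed at the decimated
// rate. The M phases are independent, which is what makes the
// structure attractive for parallel (GPU) execution.
fn polyphase_decimate(input: &[f64], taps: &[f64], m: usize) -> Vec<f64> {
    // First output index whose full tap history lies inside `input`.
    let first = (taps.len() - 1 + m - 1) / m;
    let n_out = input.len() / m;
    let mut out = vec![0.0; n_out.saturating_sub(first)];
    for (o, y) in out.iter_mut().enumerate() {
        let i = first + o;
        for p in 0..m {
            let mut k = 0;
            while k * m + p < taps.len() {
                *y += taps[k * m + p] * input[i * m - (k * m + p)];
                k += 1;
            }
        }
    }
    out
}

fn main() {
    // A 2-tap average decimated by 2 matches the full convolution
    // y[n] = 0.5 x[n] + 0.5 x[n-1] sampled at n = 2, 4.
    let out = polyphase_decimate(&[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], &[0.5, 0.5], 2);
    assert_eq!(out, vec![2.5, 4.5]);
    println!("{out:?}");
}
```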
-
Converting amplitude into the frequency domain is upstream of everything, procedural or learned. The program can only be as good as its spectrum analyzer.
Capability
How we can "see" sound in our program. At every step we are up against the uncertainty principle: time-versus-pitch resolution and precision tradeoffs. The ideal filter bin responds instantly and evenly to every pitch from its center until halfway to the next bin. These bins do not exist and must be approximated. There are tools we can use to get closer to the information horizon, and how they are used is tightly coupled with the implementation.
Perceptual
Perception is both an audio and visual problem.
Low frequencies need larger RMS before we perceive the same loudness. So far, the ISO 226 "correction" is actually just completely dropping out the bass tones. Either our math is wrong or we should not be attempting to use ISO 226 at all.
Questions ✋
Curves
Most visual output devices cover only a limited range with linear precision. The audio log is mostly undone by the visual log: we can easily see low-brightness pixels changing, and we expect to see something whenever we perceive any pitch, no matter how faint. Linearly drawing dB on screen mostly "just works". While we might want to compress noise near the auditory threshold with careful ramp design, once we get above the noise, sounds we can perceive may have some log base mapped onto the linear display, and our eyes and ears will agree.
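A sketch of the "careful ramp design" idea: map dB linearly onto brightness, with a smoothstep ramp (a hypothetical choice, not the project's actual curve) that eases in near the noise floor:

```rust
// Map a bin level in dB linearly onto [0, 1] brightness, with a
// smoothstep ramp (a hypothetical choice) that compresses values
// near the noise floor so faint noise doesn't flicker at full
// visibility while the midrange stays roughly linear.
fn db_to_brightness(level_db: f64, floor_db: f64, ceil_db: f64) -> f64 {
    let t = ((level_db - floor_db) / (ceil_db - floor_db)).clamp(0.0, 1.0);
    t * t * (3.0 - 2.0 * t) // smoothstep
}

fn main() {
    for db in [-100.0, -90.0, -45.0, 0.0] {
        println!("{db:>6} dB -> {:.3}", db_to_brightness(db, -90.0, 0.0));
    }
}
```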
Questions ✋
Computation
Providing good data at a high rate and low latency is coupled with the time versus pitch resolution & precision tradeoffs.
Keeping Visual Ahead of Audio
We want to finish our audio processing in roughly the low tens of microseconds and keep our frame times quite low to aid in real-time monitoring, visually rendering slightly before audio physically leaves the output device. The mind accepts audio delay much more than visual, so we take steps to avoid visual delay at all costs, including techniques like waiting until late in the latch window to stream audio and submit GPU commands, using `VK_KHR_present_wait` and others (this is not yet available on MoltenVK?). We can also use `VK_EXT_swapchain_maintenance1` with a fence. Setting the `node.latency` property when creating streams reduces the input all the way down to 128 samples, much lower than any frame time a human can perceive.
Variable Versus Fixed Render Rate
FreeSync and fast-refresh screens are increasingly available. Alternative frontends such as lasers and LEDs can also cycle quite a bit faster than 60 Hz and will benefit from VRR-compatible implementations. This makes the amount of audio per frame variable, which requires variable rates of advance in the visuals.
Computation Performance
The current CQT-style design so far has independent bins, which makes rolling sums and parallel calculation both very easy to implement. This design is the basis.
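One standard way to get an independent, rolling-sum bin is the sliding DFT; a sketch for illustration (not necessarily the exact formulation used here):

```rust
// One independent rolling-sum bin as a sliding DFT: each sample
// updates the bin in O(1) by adding the newest sample, removing the
// one leaving the window, and rotating the accumulated phasor.
// Bins share no state, so a bank of them parallelizes trivially.
struct SlidingBin {
    window: Vec<f64>, // ring buffer of the last N samples
    pos: usize,
    re: f64,
    im: f64,
    rot: (f64, f64), // e^{i 2π k / N}
}

impl SlidingBin {
    fn new(n: usize, k: usize) -> Self {
        let w = 2.0 * std::f64::consts::PI * k as f64 / n as f64;
        SlidingBin { window: vec![0.0; n], pos: 0, re: 0.0, im: 0.0, rot: (w.cos(), w.sin()) }
    }

    fn push(&mut self, x: f64) {
        let oldest = self.window[self.pos];
        self.window[self.pos] = x;
        self.pos = (self.pos + 1) % self.window.len();
        // S ← (S + x − oldest) · e^{i 2π k / N}
        let re = self.re + x - oldest;
        let im = self.im;
        self.re = re * self.rot.0 - im * self.rot.1;
        self.im = re * self.rot.1 + im * self.rot.0;
    }

    fn power(&self) -> f64 {
        self.re * self.re + self.im * self.im
    }
}

fn main() {
    // Feed one period of a bin-1 cosine into an 8-sample bin:
    // the power matches the DFT magnitude squared, (N/2)^2 = 16.
    let mut bin = SlidingBin::new(8, 1);
    for n in 0..8 {
        bin.push((2.0 * std::f64::consts::PI * n as f64 / 8.0).cos());
    }
    assert!((bin.power() - 16.0).abs() < 1e-9);
    println!("bin power: {:.3}", bin.power());
}
```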
Precision
Remember, the ideal filter bin responds instantly and evenly to every pitch from its center until halfway to the next bin. This is impossible. What is possible is to measure with uncertainty and represent it to the program.
Decimation
Low-pitch bins are fed far more samples than they need. We want to skip input samples and run the transform on a down-sampled input.
The first pass at decimation revealed strong leakage: the "folding" aliased noise into lower frequencies, so strongly that the noise at the coupled bins was larger than the signal observed by the bins measuring true pitch. We really must low-pass (anti-alias) before decimation.
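Decimation needs an anti-aliasing lowpass in front of it; a minimal demonstration of why (the 2-tap average is purely illustrative; the real filter is a designed lowpass):

```rust
// Why the anti-alias lowpass must precede decimation: a tone at
// Nyquist folds straight down to DC when samples are dropped, but
// a lowpass that nulls Nyquist first leaves nothing to fold.
fn fir(input: &[f64], taps: &[f64]) -> Vec<f64> {
    (taps.len() - 1..input.len())
        .map(|n| taps.iter().enumerate().map(|(k, h)| h * input[n - k]).sum())
        .collect()
}

fn decimate(input: &[f64], m: usize) -> Vec<f64> {
    input.iter().step_by(m).copied().collect()
}

fn main() {
    // A Nyquist-rate tone: +1, -1, +1, -1, …
    let nyquist: Vec<f64> = (0..16).map(|n| if n % 2 == 0 { 1.0 } else { -1.0 }).collect();
    // Naive 2:1 decimation turns it into pure DC — false bass energy.
    let naive = decimate(&nyquist, 2);
    assert!(naive.iter().all(|&x| x == 1.0));
    // Averaging adjacent samples nulls Nyquist, so nothing folds down.
    let clean = decimate(&fir(&nyquist, &[0.5, 0.5]), 2);
    assert!(clean.iter().all(|x| x.abs() < 1e-12));
    println!("naive: {naive:?}\nclean: {clean:?}");
}
```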
Widening High Pitch Bins
If a bin sums over a very long window, it will destructively cancel even very nearby frequencies. This makes our high-pitch bins too selective. The window length for 60 fps is 800 samples. This is much longer than the wavelength of our highest-pitch bins, and correspondingly they end up seeing almost nothing except the exact bin frequency, leaving large gaps between bins where no energy is measured. There are two solutions:
The second option is probably better: slide a shorter window, with an appropriately tolerant window function, over sub-frame lengths in order to "blur" the bin's pitch response to match the frequency range it needs to measure energy within. This is basically the STFT. We have to calculate this sum multiple times per audio frame, which is easier if we design for multiple clocks (for VRR etc.).
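The required window length can be estimated from the bandwidth a bin must cover: a window of N samples has a main lobe roughly c·fs/N wide, with c depending on the window shape (about 2 DFT bins null-to-null for rectangular, about 4 for Hann). A sketch with illustrative numbers:

```rust
// Estimate the longest usable window for a bin from the bandwidth
// it must cover: a window of N samples has a main lobe roughly
// lobe_bins * fs / N Hz wide (lobe_bins ≈ 2 for rectangular,
// ≈ 4 for Hann, null-to-null). Numbers here are illustrative.
fn window_len_for_bandwidth(sample_rate: f64, bandwidth_hz: f64, lobe_bins: f64) -> usize {
    (lobe_bins * sample_rate / bandwidth_hz).round() as usize
}

fn main() {
    let fs = 48_000.0;
    // A high bin responsible for ~600 Hz of spectrum with a
    // Hann-like lobe can use only ~320 samples — far below the
    // 800-sample 60 fps frame, hence sliding sub-frame windows.
    let n = window_len_for_bandwidth(fs, 600.0, 4.0);
    assert_eq!(n, 320);
    println!("window length: {n} samples");
}
```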
Reconciling All Trade-Offs
The idea of using multiple filter banks is ultimately the best solution, because we can draw the fast signal while expecting the slow signal to show up soon. We can then use corroboration with the slow-bin log to derive slightly better pitch and time resolution, albeit with a delay.
Unlike subatomic particles, audio signals do not change when we measure them; we can measure pitch and time precisely, just a little bit later, much as a human likely represents uncertainty in the initial moments of perceiving any sound.