[WIP] Improve .outputSeek() performance
#25
When starting sample playback, the `.outputSeek()` method computes a bunch of output up-front, so that the first call to `.process()` produces output for the beginning of the sample.

Previously, this was done by `.seek()`ing into the input, producing `.outputLatency()` samples using normal processing. The pre-roll output is "reflected back" (reversed and phase-inverted before adding back in, to avoid the click you'd get from truncation).
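As a minimal sketch of what that fold-back could look like (this is my reading of the description above, with made-up helper names, not the library's actual code):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not from the library): fold the pre-roll output back
// into the start of the real output. preRoll[0] is the earliest sample
// (t = -preRoll.size()); preRoll.back() is the sample just before t = 0.
void foldBackPreRoll(const std::vector<float> &preRoll, std::vector<float> &output) {
	size_t n = std::min(preRoll.size(), output.size());
	for (size_t i = 0; i < n; ++i) {
		// Reverse in time (mirror around t = 0) and phase-invert (negate),
		// then add into the real output so there's no hard truncation click.
		output[i] += -preRoll[preRoll.size() - 1 - i];
	}
}
```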
Here's a diagram of the analysis (top) and synthesis (bottom) windows for the first 10 output blocks, including those computed during `.outputSeek()`:

The input for `t < 0` is effectively zeroes, and the output for `t < 0` is reflected back. This is with 4x overlap, so there are two output windows which need to be computed up-front before the "actual" output is ready. There are also two analysis windows for each output block, because this is performing a time-stretch.
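To put a rough number on that "two windows up-front": the hop between output blocks is the block length divided by the overlap factor, and `.outputSeek()` has to synthesise roughly `.outputLatency()` samples' worth of blocks before real output starts. The snippet below only illustrates that arithmetic; the block size and latency value are assumed example numbers, not values taken from the library:

```cpp
#include <cstdio>

int main() {
	// Assumed example numbers, just to illustrate the "two windows up-front" point:
	int blockSamples = 4096;                     // STFT block length (assumption)
	int overlap = 4;                             // 4x overlap, as in the diagram above
	int hop = blockSamples / overlap;            // samples between successive output blocks
	int outputLatencySamples = blockSamples / 2; // assumed value for .outputLatency()

	// Output blocks that must be synthesised before the "actual" output is ready:
	int preRollBlocks = (outputLatencySamples + hop - 1) / hop;
	std::printf("pre-roll blocks: %d\n", preRollBlocks); // prints 2 with these numbers
	return 0;
}
```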
## Improvement 1: skip one analysis

The first change in this branch/PR is the internal flag `assumePreviousBlockZero`. This is actually true after a `.reset()`, but we pretend it's true directly after `.outputSeek()`. This means we avoid one input-block analysis when time-stretching:
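As a rough illustration of the flag's role (a hypothetical sketch, not the library's actual internals):

```cpp
// Hypothetical sketch: a flag set by reset()/outputSeek() lets the block
// processing skip analysing the "previous" input block, because that block
// is known to be all zeros.
struct BlockProcessor {
	bool assumePreviousBlockZero = true; // true after reset(); also set by outputSeek()

	void processBlock() {
		if (!assumePreviousBlockZero) {
			// When time-stretching there are normally two analysis windows per
			// output block; this is the one the first block can skip.
			analysePreviousInput();
		}
		analyseCurrentInput();
		synthesiseOutput();
		assumePreviousBlockZero = false; // only the first block gets the shortcut
	}

	void analysePreviousInput() { /* ... */ }
	void analyseCurrentInput() { /* ... */ }
	void synthesiseOutput() { /* ... */ }
};
```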
### Effect on the sound

The first output block has no "previous input" to use for a phase-vocoder prediction. This isn't a problem, since that phase-vocoder prediction is only really needed to stay phase-aligned with a previous block.

If anything, I would expect this first block to actually be clearer on initial transients, but that needs to be backed up by thorough listening tests.
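For context, the prediction in question is roughly the textbook phase-vocoder propagation sketched below (the standard formulation, not necessarily how Stretch implements it). With no previous block there is nothing to propagate from, so the first block can simply use its analysis phases directly:

```cpp
// Textbook-style phase propagation for one spectral bin (illustration only,
// and simplified: real implementations also unwrap the phase advance
// relative to the bin's expected advance per hop).
// prevOutPhase:  synthesis phase of this bin in the previous output block
// prevInPhase:   analysis phase of this bin in the previous input block
// inPhase:       analysis phase of this bin in the current input block
// stretchFactor: synthesis hop / analysis hop
double predictOutputPhase(double prevOutPhase, double prevInPhase,
                          double inPhase, double stretchFactor) {
	double phaseAdvance = inPhase - prevInPhase; // how the input phase moved
	return prevOutPhase + phaseAdvance * stretchFactor;
}
```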
## Improvement 2: different initial window shape

This change alters the window shape (and block centre) at the start of `.outputSeek()`. The previous window shapes are then restored for the next block:

Since the input is padded with zeros, and the output gets folded back, this first block's window doesn't need to extend very far before `t=0`. However, the analysis/synthesis "window offsets" (marked with a dot in these diagrams) are the reference time for the analysis/synthesis, so we also apply an additional phase-shift to compensate for the offsets changing when the original window shapes/offsets are restored.
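Shifting a spectral frame's reference time by some number of samples corresponds to multiplying each bin by a linear phase ramp. The sketch below is my own illustration of that general STFT identity, not the code added in this PR:

```cpp
#include <complex>
#include <vector>

// Apply the phase ramp corresponding to moving a frame's reference time by
// deltaSamples, for a real FFT of size fftSize (bins 0..fftSize/2).
// Generic STFT time-shift identity, shown only to illustrate the kind of
// phase adjustment described above (sign convention may differ in practice).
void shiftReferenceTime(std::vector<std::complex<float>> &bins,
                        int fftSize, float deltaSamples) {
	const float twoPi = 6.283185307179586f;
	for (size_t k = 0; k < bins.size(); ++k) {
		float phase = -twoPi * deltaSamples * float(k) / float(fftSize);
		bins[k] *= std::polar(1.0f, phase);
	}
}
```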
## Limits to the performance improvement

If `splitComputation` is turned off, then Stretch periodically does a big chunk of work as it computes the next output block. In an environment with large buffer sizes (or enough buffering to handle the uneven CPU use), this isn't a problem - here's the computation time if we compute chunks of 2048 samples at a time:

However, for 512 samples we can see that most blocks barely do any work:
### Split-computation
Split-computation mode spreads this work out more evenly (without any threading stuff), at the expense of some extra output latency:
This difference is even more dramatic for smaller buffer sizes. Here's the difference if we use 100-sample buffers:
However! On all of these plots I've added the time of the `.outputSeek()` call, scaled to CPU% as if it's a single 512/2048/... buffer. (The width shows the amount of pre-roll it actually has to compute.) Since this is up-front work to get the Stretch instance ready to produce actual output, split-computation doesn't help.
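For reference, that CPU% scaling works out to something like the snippet below (the sample rate and buffer size are assumed example values; the point is just that a one-off call gets compared against a single buffer's worth of real time):

```cpp
#include <chrono>
#include <cstdio>

int main() {
	double sampleRate = 48000;  // assumed example value
	int bufferSamples = 512;    // the buffer size the plot is scaled against

	auto start = std::chrono::steady_clock::now();
	// The .outputSeek() call would go here; its exact signature isn't shown in
	// this description, so this is just a placeholder for "the up-front work".
	auto end = std::chrono::steady_clock::now();

	double elapsedSeconds = std::chrono::duration<double>(end - start).count();
	double bufferSeconds = bufferSamples / sampleRate;

	// CPU% as if the whole .outputSeek() cost landed inside one buffer:
	double cpuPercent = 100.0 * elapsedSeconds / bufferSeconds;
	std::printf("outputSeek: %.1f%% of one %d-sample buffer\n", cpuPercent, bufferSamples);
	return 0;
}
```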
## Future alternatives

I think I've taken this particular approach as far as it can reasonably go.

The only option I can see for getting the CPU cost of `.outputSeek()` all the way down (to match the `smoothedComputation` case) is to use a much cheaper method to generate that initial output. We would then need to re-analyse that output so we can phase-match it from the next typical Stretch processing-block.

The most general approach would be to allow the user to generate some initial output themselves, and then tell Stretch to continue based on that output.