Skip to content

Conversation

@cwoffenden
Copy link
Contributor

@cwoffenden cwoffenden commented Oct 16, 2024

Short version: this improves the copy back from the audio worklet's heap to JS by 7-12x depending on the browser.

Since we pass in the stack for the worklet from the caller's heap, its address doesn't change. And since the render quantum size doesn't change after the audio worklet creation, the stack positions for the audio buffers do not change either. This optimisation adds one-time subarray views and replaces the float-by-float copy with a simple set() per channel (per output).

To prove this doesn't break anything, tests of the audio worklet API to compare before and after have already landed in #23394, merged here; they can be run using:

test/runner interactive.test_audio_worklet_stereo_io
test/runner interactive.test_audio_worklet_2x_stereo_io
test/runner interactive.test_audio_worklet_mono_io
test/runner interactive.test_audio_worklet_2x_hard_pan_io 

A benchmark of the extracted copy before and after is here:

https://wip.numfum.com/cw/2024-10-29/index.html

This PR fulfils the garbage-free requirement as well as having the performance boost. Multiple output buffers are created with addresses at the first entries in the AW's stack, and the number of ins and outs can change dynamically (using or not the predefined buffers). There are tests for buffer overflows as well as sanity.

(Edited to take the content from the conversation below and turn this from a proposal to a summary.)

@cwoffenden cwoffenden changed the title [AUDIO_WORKET] Optimise the copy back from wasm's heap to JS [AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS Oct 16, 2024
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from 38203f0 to 24a90e4 Compare October 17, 2024 13:01
@juj
Copy link
Collaborator

juj commented Oct 17, 2024

Nice idea! It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?

@cwoffenden
Copy link
Contributor Author

Nice idea!

A shower thought, as all good ideas are!

It looks like this PR carries the rename from the other PR. Rebase/merge should hide it?

Yes, this one was rebased off the earlier one so should have the same commit IDs.

@cwoffenden cwoffenden marked this pull request as draft October 18, 2024 06:35
@cwoffenden
Copy link
Contributor Author

cwoffenden commented Oct 21, 2024

EDIT: every test I tried suggests no, the stack is only used once we hit Wasm in the callback.

@juj or someone who knows more about the internals of this than me: on entering AudioWorkletProcessor.process() nothing else will have used the stack beforehand? I've verified stack top is always the same when we enter the AudioWorkletProcessor's contructor and process() and it's always the same as the the address (plus length) passed to emscripten_start_wasm_audio_worklet_thread_async().

To rephrase: unless the audio worklet explicitly uses the stack functions from JS, nothing external from Wasm will before callbackFunction() is called?

@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 3 times, most recently from dcdcea1 to 5b65dcf Compare October 26, 2024 10:06
@cwoffenden
Copy link
Contributor Author

Some notes: lots of experiments with the stack allocations, minimum sizes, various flags (STACK_OVERFLOW_CHECK and ASSERTIONS) and we have some very happy code! View allocations are all in the constructor, with sanity checks performed on the address.

Next is to benchmark it.

@cwoffenden
Copy link
Contributor Author

cwoffenden commented Oct 29, 2024

Benchmarks of the main part of the audio copy done:

https://wip.numfum.com/cw/2024-10-29/index.html

Testing on my M2 Mac Studio in Chrome and Safari this PR is around 15x faster on the float copy, e.g. the original being 0.625µs per process() and the new code 0.044µs. On my aging iPhone 11 it's 1.182µs before and 0.199µs after.

Firefox on Mac has the largest difference: 5.090µs before and 0.123µs after, a 41x speed-up. Updating Firefox fixes this and it's now 0.916µs before and 0.118µs after.

@juj if we're in agreement that the simplified standalone JavaScript test code is doing the right thing (it's short and a copy and paste), I can gather numbers from regular hardware (we have a wall of Chromebooks at work).

Next I'll need to create tests to show that this still works with various input and output configs.

EDIT: a 7-12x speed-up seems typical on x64 Windows or Linux.

@cwoffenden
Copy link
Contributor Author

cwoffenden commented Oct 30, 2024

Note to me: look at #22808

@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from 069f7a4 to ae0e8bf Compare October 31, 2024 22:49
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 6 times, most recently from f6153e9 to cccece4 Compare November 8, 2024 06:06
@cwoffenden cwoffenden force-pushed the cw-audio-tweaks-3 branch 2 times, most recently from b3dc2ef to ac37140 Compare November 14, 2024 19:14
@cwoffenden
Copy link
Contributor Author

I can split out the tests from the code changes

That would be awesome, thank you.

Tests moved to #23394 (since they don't require these changes).

sbc100 pushed a commit that referenced this pull request Jan 15, 2025
These are the audio worklet tests from #22753 extracted to a standalone
PR. The tests are:

- Multiple stereo inputs mixing in the processor to a single stereo
output
- Multiple stereo inputs copying in the processor to multiple stereo
outputs
- Multiple mono inputs mixing in the processor to a single mono output
- Multiple mono inputs copying in the processor to L+R stereo outputs

The tests use different stack sizes (from 2kB to 6kB depending on the
requirement).

The audio tracks were composed by Tim Wright especially for Emscripten
and released under a CC0 license.
@cwoffenden
Copy link
Contributor Author

cwoffenden commented Jan 16, 2025

Since this has had a lot of comments and changes since its first commit in October I've edited the initial comment to contain the findings and results, and merged the tests which landed in main. Besides the comment/error message in library_webaudio.js all of the changes are now in a single file.

@cwoffenden cwoffenden requested a review from sbc100 January 16, 2025 08:58
@cwoffenden cwoffenden closed this Jan 21, 2025
@cwoffenden cwoffenden deleted the cw-audio-tweaks-3 branch January 21, 2025 10:21
@cwoffenden
Copy link
Contributor Author

Odd, GH closed this when I renamed the branch to merge into a fork (it's residing in cw-audio-optimised-copy).

If there's any interest in this I can re-open another PR.

juj pushed a commit that referenced this pull request Aug 30, 2025
A reworking of #22753, which "improves the copy back from the audio
worklet's heap to JS by 7-12x depending on the browser." From the
previous description:

Since we pass in the stack for the worklet from the caller's heap, its
address doesn't change. And since the render quantum size doesn't change
after the audio worklet creation, the stack positions for the audio
buffers do not change either. This optimisation adds one-time subarray
views and replaces the float-by-float copy with a simple `set()` per
channel (per output).

The existing interactive tests (written for the original PR) can be run
for comparison:
```
test/runner interactive.test_audio_worklet_stereo_io
test/runner interactive.test_audio_worklet_2x_stereo_io
test/runner interactive.test_audio_worklet_mono_io
test/runner interactive.test_audio_worklet_2x_hard_pan_io
test/runner interactive.test_audio_worklet_params_mixing
test/runner interactive.test_audio_worklet_memory_growth
test/runner interactive.test_audio_worklet_hard_pans
```
These test various input/output arrangements as well as parameters
(parameters are interesting because, depending on the browser, the sizes
change as the params move from static to varying).

The original benchmark of the extracted copy is still valid:

https://wip.numfum.com/cw/2024-10-29/index.html

This is tested with 32- and 64-bit wasm (which required a reordering of
how structs and data were stored to avoid alignment issues).

Some explanations:
- Fixed-position output buffer views are created once in
the`WasmAudioWorkletProcessor` constructor
- Stack allocations for the `process()` call are split into aligned
struct data (see the comments) and audio/param data
- The struct writes are simplified by this splitting of data
- `ASSERTIONS` are used to ensure everything fits and correctly aligns
- The tests account for size changes in the params, which can vary from
a single float to 128 floats (a single float nicely showing up any
8-byte alignment issues for wasm64)

~~Future improvements: the output views are sequential, so instead of of
being individual views covering each channel the views could cover one
to however-many-views needed, with a single `set()` being enough for all
outputs.~~
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants