Transforming the AudioEncoder into a windowed autoregressive model for realtime use #788

gslaller · 2022-12-31T14:08:10Z

I am trying to make the model suitable for realtime inference over WebRTC.

I have two approaches to my problem:

- Take the dumb “window” approach, where both the AudioEncoder and the TextDecoder are rerun on every new frame.
- An actual realtime approach (my preferred one), where the previously computed context is reused instead of being recomputed on every frame. But I am running into difficulties here:
  1. The AudioEncoder uses absolute positional embeddings with a fixed context window, which makes reusing the already computed tokens not viable. Are there any masking hacks to turn the AudioEncoder into an autoregressive kind of model with a sliding-window context, without retraining?
  2. The TextDecoder uses a learned positional embedding with fixed parameters 🙃
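
To make the first (“window”) approach concrete, here is a rough sketch of what I mean. It is only an illustration under assumptions: `RollingWindow` and the `transcribe` callback are hypothetical names (the callback stands in for a full AudioEncoder + TextDecoder pass), and the 30-second window at 16 kHz is assumed rather than read from any config.

```python
from collections import deque
from typing import Callable, Deque, List

SAMPLE_RATE = 16_000   # assumed input sample rate in Hz
WINDOW_SECONDS = 30    # assumed fixed context length of the encoder
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


class RollingWindow:
    """Keeps the most recent samples and reruns the whole model on every push."""

    def __init__(
        self,
        transcribe: Callable[[List[float]], str],  # hypothetical full-model call
        window_samples: int = WINDOW_SAMPLES,
    ) -> None:
        self.transcribe = transcribe
        self.window_samples = window_samples
        # A bounded deque drops the oldest samples once the window is full.
        self.buffer: Deque[float] = deque(maxlen=window_samples)

    def push(self, frame: List[float]) -> str:
        """Append a new audio frame and recompute the transcript from scratch."""
        self.buffer.extend(frame)
        # Right-pad with silence so the model always sees a full-size window.
        padding = [0.0] * (self.window_samples - len(self.buffer))
        return self.transcribe(list(self.buffer) + padding)
```

Every `push` pays for a full pass over the whole window, and nothing computed for the previous window is reused, which is exactly the cost the second approach is meant to avoid.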

Any feedback is welcome.