Description
I'm trying to run Fast FullSubNet in a real-time audio streaming context.
I've successfully trained a model that seems to work reasonably well in a non-streaming context: https://github.com/fronx/FullSubNet/releases/tag/fast118
However, the latency of running it in that way is too high. I've tried turning down the hop length, but it just leads to choppy, unintelligible noise. So I looked around and apparently the structure of the code needs to be changed quite a bit for that to work?
I'm happy to execute the change and contribute it to this repo, but I might need a little bit more guidance so I don't go off track. I know how to program, but I'm still fairly new to audio ML.
Gathered instructions
For reference, to have everything in one place, here are instructions I gathered from older issues:
There are two things you need to do: change torch.nn.LSTM to torch.nn.LSTMCell and add a for-loop.
As you can see, for performance reasons the cumulative norm that I released is written in a compact style, i.e., it computes the statistical mean of all frames of an utterance in advance. You should rewrite this function in a frame-wise style. The point is to ensure that the current frame is normalized using the statistical mean of all previous frames.
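To check my understanding of the frame-wise cumulative norm, here is a minimal sketch in plain Python. It rests on assumptions of mine: that the norm divides each frame by the running mean over every frequency bin seen so far, and that the current frame counts toward that mean. The names StreamingCumulativeNorm and offline_cumulative_norm are hypothetical and not taken from this repo, whose actual function may differ (for example by also using a variance term).

```python
from itertools import accumulate


class StreamingCumulativeNorm:
    """Frame-wise norm: divide a frame by the mean of all bins seen so far.

    Assumption: the current frame is included in the running mean.
    """

    def __init__(self, eps=1e-10):
        self.eps = eps
        self.total = 0.0  # running sum over every bin of every frame seen
        self.count = 0    # number of bins seen so far

    def __call__(self, frame):
        # Update the running statistics with the new frame only.
        self.total += sum(frame)
        self.count += len(frame)
        mean = self.total / self.count
        return [x / (mean + self.eps) for x in frame]


def offline_cumulative_norm(frames, eps=1e-10):
    """Compact reference: compute every cumulative mean up front."""
    n_freqs = len(frames[0])
    cum_sums = list(accumulate(sum(f) for f in frames))
    return [
        [x / (cum_sums[t] / (n_freqs * (t + 1)) + eps) for x in frame]
        for t, frame in enumerate(frames)
    ]
```

The streaming object only keeps two numbers between calls, so in a real-time loop it would be created once per stream and called on each new magnitude frame. Under the assumptions above it should produce the same values as the offline version.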
You may use a for-loop like here:
    hx, cx = load(hidden_state)
    rnn = nn.LSTMCell(dims)
    output = []
    for samples in (all_samples, step=hop_len):
        frame = fft(samples)
        frame = cum_norm(frame)
        hx, cx = rnn(frame, (hx, cx))
        output.append(ifft(hx))
    overlapped_add(output)

You could check out here for the difference between the LSTMCell and LSTM.
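To make the LSTM to LSTMCell step concrete, here is a minimal PyTorch sketch of how I currently understand it. It assumes a single-layer, unidirectional nn.LSTM with hypothetical sizes; the sequence model in this repo may have more layers, in which case each layer would need its own cell. The sketch copies the weights of an nn.LSTM into an nn.LSTMCell and compares the offline output with a frame-by-frame loop.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
n_feats, n_hidden, n_frames = 8, 16, 5

# Offline model: consumes the whole sequence in one call.
lstm = nn.LSTM(input_size=n_feats, hidden_size=n_hidden, batch_first=True)

# Streaming model: one time step per call, reusing the same weights.
cell = nn.LSTMCell(input_size=n_feats, hidden_size=n_hidden)
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(1, n_frames, n_feats)  # [B, T, F]

with torch.no_grad():
    offline_out, _ = lstm(x)  # [B, T, H]

    # The state starts at zero and is carried from frame to frame.
    hx = torch.zeros(1, n_hidden)
    cx = torch.zeros(1, n_hidden)
    streamed = []
    for t in range(n_frames):
        hx, cx = cell(x[:, t, :], (hx, cx))
        streamed.append(hx)
    streaming_out = torch.stack(streamed, dim=1)  # [B, T, H]

# Largest element-wise difference between the two outputs.
max_diff = (offline_out - streaming_out).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

In this toy setup the two outputs should agree up to floating-point error, which suggests the loop itself does not change what the recurrent layer computes. Whether that carries over to the full model here is exactly what I am unsure about.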
Questions
- It looks like you suggest changing the model input from a magnitude spectrogram ([B, 1, F, T]) to an array of samples. Is that necessary? Wouldn't that require completely retraining the model from scratch?
- The pseudocode above doesn't mention MEL scaling, which is necessary for Fast FullSubNet. I assume that should also be applied per frame?
- Does the change from LSTM to looping over LSTMCell require retraining?
Thanks in advance for any hints you can provide. Would be nice if we could get this repo into usable shape for streaming inference in a way that's shareable with the world. 🤩