Real-time streaming Fast FullSubNet (LSTMCell)

I'm trying to run Fast FullSubNet in a real-time audio streaming context.

I've successfully trained a model that seems to work reasonably well in a non-streaming context: https://github.com/fronx/FullSubNet/releases/tag/fast118

However, the latency of running it that way is too high. I tried reducing the hop length, but the output turns into choppy, unintelligible noise. From what I found in older issues, the structure of the code apparently needs to change quite a bit for streaming to work.

I'm happy to execute the change and contribute it to this repo, but I might need a little bit more guidance so I don't go off track. I know how to program, but I'm still fairly new to audio ML.

### Gathered instructions
For reference, and to have everything in one place, here are the instructions I gathered from older issues:

> There are two things you need to do: change the `torch.nn.LSTM` to `torch.nn.LSTMCell` and add a for-loop.

> As you can see, for performance purposes, the [cumulative norm](https://github.com/haoxiangsnr/FullSubNet/blob/main/audio_zen/model/base_model.py#L203) that I released is written in a compact style, i.e., it computes the statistical mean of all frames of an utterance in advance. You should rewrite this function in a frame-wise style. The point is to ensure that the current frame is normalized using the statistical mean of all previous frames.
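
To check my understanding of the frame-wise cumulative norm, here is a minimal sketch. It assumes the normalization divides each frame by the mean of all magnitude bins seen so far, including the current frame; the function in the repo may differ in details. The names `cumulative_norm_batch`, `StreamingCumulativeNorm`, and `EPS` are hypothetical and not taken from the repo.

```python
import torch

EPS = 1e-10  # assumed small constant to avoid division by zero


def cumulative_norm_batch(mag):
    """Compact style: mag is [F, T]; all running means are computed at once."""
    num_freqs, num_frames = mag.shape
    cum_sum = torch.cumsum(mag.sum(dim=0), dim=0)  # [T], sum of all bins up to frame t
    count = num_freqs * torch.arange(1, num_frames + 1, dtype=mag.dtype)  # [T]
    return mag / (cum_sum / count + EPS)  # [T] broadcasts over the F axis


class StreamingCumulativeNorm:
    """Frame-wise style: keeps a running sum and count between calls."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def __call__(self, frame):
        """frame is a single magnitude frame of shape [F]."""
        self.total += frame.sum().item()
        self.count += frame.numel()
        return frame / (self.total / self.count + EPS)


# Compare both styles on random data (float64 to keep rounding differences tiny).
torch.manual_seed(0)
mag = torch.rand(5, 7, dtype=torch.float64)
norm = StreamingCumulativeNorm()
streamed = torch.stack([norm(mag[:, t]) for t in range(mag.shape[1])], dim=1)
```

If this reading is right, the frame-wise version only needs two scalars of state per stream, and the two styles should agree up to floating-point rounding.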

> You may use a for-loop like here:
> 
> ```python
> hx, cx = load(hidden_state)
> rnn = nn.LSTMCell(dims)
> 
> output = []
> for samples in (all_samples, step=hop_len):
>     frame = fft(samples)
>     frame = cum_norm(frame)
>     hx, cx = rnn(frame, (hx, cx))
>     output.append(ifft(hx))
> 
> overlapped_add(output)
> ```
> 
> You could check out [here](https://stackoverflow.com/questions/57048120/pytorch-lstm-vs-lstmcell) for the difference between the `LSTMCell` and `LSTM`.
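
To make the `LSTM` vs. `LSTMCell` point concrete, here is a small sketch that copies the weights of a single-layer `nn.LSTM` into an `nn.LSTMCell` and steps through the frames one at a time. The sizes are hypothetical and unrelated to the actual Fast FullSubNet configuration; a multi-layer model would need one cell per layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, num_frames = 8, 16, 10  # hypothetical sizes

lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)
cell = nn.LSTMCell(input_size, hidden_size)

# Layer 0 of nn.LSTM and nn.LSTMCell use the same parameter shapes and gate order.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(1, num_frames, input_size)  # [B, T, features]

with torch.no_grad():
    # Offline: the whole sequence in one call.
    offline_out, _ = lstm(x)

    # Streaming: one frame per step, carrying (hx, cx) between steps.
    hx = torch.zeros(1, hidden_size)
    cx = torch.zeros(1, hidden_size)
    streamed = []
    for t in range(num_frames):
        hx, cx = cell(x[:, t, :], (hx, cx))
        streamed.append(hx)
    streamed_out = torch.stack(streamed, dim=1)  # [B, T, hidden]

max_diff = (offline_out - streamed_out).abs().max().item()
```

`nn.LSTM` also accepts and returns the `(h, c)` state, so if the model stacks several layers, calling it on one frame at a time while carrying the state forward might be a simpler alternative to converting every layer.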

### Questions
1. It looks like you suggest changing the model input from a magnitude spectrogram (`[B, 1, F, T]`) to an array of samples. Is that necessary? Wouldn't that require completely retraining the model from scratch?
2. The pseudocode above doesn't mention mel scaling, which Fast FullSubNet requires. I assume it should also be applied per frame?
3. Does the change from `LSTM` to looping over `LSTMCell` require retraining?
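
To make questions 1 and 2 concrete, here is a small check of what I would expect to hold: computing one windowed FFT per hop should yield the same magnitude columns as an offline STFT, so the model input could stay a magnitude spectrogram that is simply built one frame at a time. This sketch assumes `center=False` framing and hypothetical `n_fft` and `hop` values; the repo's STFT settings may differ. The `mel_fb` matrix is a random stand-in, not a real mel filterbank; it only illustrates that a fixed filterbank matrix can be applied to each frame independently.

```python
import torch

torch.manual_seed(0)
n_fft, hop = 512, 256  # hypothetical values, not the repo's config
window = torch.hann_window(n_fft)
audio = torch.randn(n_fft + 9 * hop)  # random stand-in for an utterance

# Offline: magnitude spectrogram of the whole signal, shape [F, T].
offline_mag = torch.stft(
    audio, n_fft, hop_length=hop, win_length=n_fft,
    window=window, center=False, return_complex=True,
).abs()

# Streaming: one windowed FFT per hop, using only samples available so far.
frames = []
for start in range(0, audio.numel() - n_fft + 1, hop):
    chunk = audio[start:start + n_fft] * window
    frames.append(torch.fft.rfft(chunk).abs())  # [F]
streamed_mag = torch.stack(frames, dim=1)  # [F, T]

# Stand-in filterbank (NOT a real mel filterbank), shape [n_mels, F].
mel_fb = torch.rand(64, n_fft // 2 + 1)
offline_mel = mel_fb @ offline_mag  # applied to all frames at once
streamed_mel = torch.stack([mel_fb @ f for f in frames], dim=1)  # applied per frame
```

If the model was trained with centered frames or a different window, the streaming framing would have to match those settings for the existing weights to remain usable.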

Thanks in advance for any hints you can provide. Would be nice if we could get this repo into usable shape for streaming inference in a way that's shareable with the world. 🤩
