❓Help understanding inference input length for 16khz audio #751
Hi there. First, thanks for the great work done here, I really appreciate it. I have a question about the shape of the input for the latest ONNX model in the repo: I'm trying to understand why 640 is the limit. I believe the window size for 16 kHz is documented to be 512, and the context is always expected to be 64. Thanks!
Hi! The input chunk goes through the following steps:

The STFT uses a hop size of 128 and a window size of 256. Therefore, a base chunk of 640 samples is split into exactly 4 timestamps.

The VAD encoder has a downsampling factor of 4, compressing these 4 timestamps into a single state.

If window_size_samples is greater than 640, the encoder will produce 2 or more states, which causes an error during the decoding step. This mismatch in the expected number of states is the likely reason for the shape error you observed (e.g., {1,1,1,128,2}).

Ultimately, using values other than 512 (for 16 kHz) or 256 (for 8 kHz) is not recommended; quality may suffer.