❓Help understanding inference input length for 16khz audio #751
Hi there. First, thanks for the great work done here, I really appreciate it. I have a question about the shape of the input for the latest ONNX model in the repo: I'm trying to understand why 640 is the limit. I believe the window size for 16 kHz is documented to be 512, and the context is always expected to be 64. Thanks!
Hi! The input chunk goes through the following steps:

The STFT uses a hop size of 128 and a window size of 256. Therefore, a base chunk of 640 samples is split into exactly 4 timestamps.

The VAD encoder has a downsampling factor of 4, compressing these 4 timestamps into a single state.

If window_size_samples is greater than 640, the encoder will produce 2 or more states, which causes an error during the decoding step. This mismatch in the expected number of states is the likely reason for the shape error you observed (e.g., {1,1,1,128,2}).

Ultimately, using values other than 512 (for 16 kHz) or 256 (for 8 kHz) is not recommended; quality may suffer.