Hi! The input chunk goes through the following steps:

  • Right padding (64 samples).
  • Short-Time Fourier Transform (STFT).
  • VAD encoder.
  • Decoder.

The STFT uses a hop size of 128 samples and a window size of 256 samples. Therefore, a base chunk of 640 samples is split into exactly 4 time steps (STFT frames).
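
As a rough check of that arithmetic, here is a minimal sketch of the frame count. It assumes the STFT only keeps full windows and does no centering or extra padding of its own; the function name `stft_frame_count` is hypothetical and not part of the library.

```python
def stft_frame_count(num_samples: int, window: int = 256, hop: int = 128) -> int:
    """Count full STFT frames for a signal of num_samples samples.

    Assumption: no centering and no padding inside the STFT itself,
    so a frame is produced only where a whole window fits.
    """
    if num_samples < window:
        return 0
    return (num_samples - window) // hop + 1


# (640 - 256) // 128 + 1 = 3 + 1 = 4
print(stft_frame_count(640))  # 4
```

Under the same assumption the result is also 4 for 704 samples (640 plus the 64 samples of right padding), so the count does not depend on whether the 640 figure is read as before or after padding.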

The VAD encoder has a downsampling factor of 4, effectively compressing these 4 time steps into a single state.

If window_size_samples is greater than 640, the encoder will produce 2 or more states, which causes an error during the decoding step. This mismatch in the expected number of states is the likely reason for the shape error you observed (e.g., {1,1,1,128,2}).
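
The reasoning above can be sketched as a small pre-check that estimates how many states the encoder would produce for a given chunk length. This is only an approximation under stated assumptions: the STFT keeps full windows only, and the encoder's factor-4 downsampling is modeled as a ceiling division. The real threshold depends on the model's padding details, and the names `estimate_encoder_states` and `check_chunk_length` are hypothetical, not library functions.

```python
import math


def estimate_encoder_states(num_samples: int, window: int = 256,
                            hop: int = 128, downsample: int = 4) -> int:
    """Estimate the number of encoder states for a chunk.

    Assumptions: full-window STFT frames only, and a total
    downsampling factor of 4 modeled as ceil(frames / 4).
    """
    if num_samples < window:
        return 0
    frames = (num_samples - window) // hop + 1
    return math.ceil(frames / downsample)


def check_chunk_length(num_samples: int) -> int:
    """Raise early if the chunk would not map to exactly one state."""
    states = estimate_encoder_states(num_samples)
    if states != 1:
        raise ValueError(
            f"chunk of {num_samples} samples maps to {states} states; "
            "the decoder expects exactly 1"
        )
    return states


print(check_chunk_length(640))        # 1
print(estimate_encoder_states(1280))  # 3 (9 frames, more than one state)
```

A check like this fails with a readable message before the model is called, rather than surfacing later as a shape mismatch in the decoder.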

Ultimately, it is not recommende…

Answer selected by snakers4
Category
Q&A
Labels
help wanted Extra attention is needed

This discussion was converted from issue #750 on January 29, 2026 14:53.