Hi! I am using Silero VAD to segment my audio. Some of the detected speech segments are very short because the spoken utterances themselves are short. However, for my downstream processing, I would like every final speech segment to be at least 3 seconds long.

Initially, I thought that the parameter min_speech_duration_ms controlled the minimum length of each output segment. However, after reviewing the code and documentation, I realized that this parameter simply discards speech segments shorter than the specified duration. But I do not want to remove short speech segments. Instead, I would like to keep all detected speech but ensure that the final segments are no shorter than 3 seconds, possibly by merging adjacent segments when necessary.

My question is: is there any built-in way in Silero VAD to enforce a minimum output segment length (e.g., 3 seconds) without discarding short segments? Or is post-processing (manually merging adjacent segments) the recommended approach in this case?
One approach would be to try increasing this.
If your domain has fairly long utterances separated by long silences, it can achieve your goal.

Probably yes, post-processing is the way to go.
You see, if there is a short silence between two stretches of speech and we merge them, we lose information, hence we do not do it.
If we enforced a minimum speech length and there were no proper speech of that length, we would either be deleting information or introducing bias.
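Since merging is left to post-processing, a minimal sketch of the manual approach might look like the following. It assumes segments in the dict format returned by `get_speech_timestamps(..., return_seconds=True)` (keys `'start'` and `'end'` in seconds); the function name `merge_short_segments` and the merging policy are hypothetical, not part of Silero VAD:

```python
def merge_short_segments(segments, min_len_s=3.0):
    """Post-processing sketch: greedily absorb following segments into any
    segment that is still shorter than min_len_s, so no detected speech is
    discarded. Note this spans the silence gaps between merged segments.

    segments: list of {'start': float, 'end': float} in seconds,
              sorted by start time (as Silero VAD returns them).
    """
    merged = []
    for seg in segments:
        if merged and (merged[-1]['end'] - merged[-1]['start']) < min_len_s:
            # Previous output segment is still too short: extend it to
            # cover this segment instead of starting a new one.
            merged[-1]['end'] = seg['end']
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged


segments = [
    {'start': 0.0, 'end': 1.0},   # 1.0 s, too short alone
    {'start': 1.5, 'end': 2.0},   # still short after merging (2.0 s total)
    {'start': 4.0, 'end': 8.0},   # absorbed to reach the 3 s minimum
]
print(merge_short_segments(segments))
```

One caveat: the very last output segment can still end up shorter than the minimum if no further speech follows; depending on your downstream needs you could pad it with trailing audio or attach it to the preceding segment instead.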