Fix VAD tie-break to prefer latest max-silence candidate #1421
keloz04 wants to merge 1 commit into SYSTRAN:master from
Conversation
I fail to see how this is a problem. A problem for what?
Thanks for the question: this is a chunk-boundary quality issue when `use_max_poss_sil_at_max_speech=True` and several silence candidates share the same maximum duration. This matters most in batched transcription, where the VAD segments define the chunks, so splitting at an older silence can produce additional chunks, which directly increases the number of encoder/decoder calls.

Repro (synthetic probs, same VAD options):

Result:

Both keep the same primary criterion (maximum silence duration); this PR only changes tie-breaking among equal-duration candidates.
Why are you responding with an AI-generated answer? Was this "problem" found and the PR created by AI? As no actual explanation was provided, the question still remains: how is this a problem?
Thanks for the feedback. English isn’t my first language, so I used an LLM for help.

Why I considered this a practical issue (based on my understanding of the batched pipeline): in batched transcription, the chosen split points determine the chunks that each encoder/decoder call sees, so where a tie is resolved can change the output.

In addition, I noticed a potential issue in `faster_whisper/vad.py`:

```python
if (
    not use_max_poss_sil_at_max_speech
    and sil_dur_now > min_silence_samples_at_max_speech
):
    prev_end = temp_end
```

When `use_max_poss_sil_at_max_speech` is on, this branch never runs, so `prev_end` is not updated from `temp_end` here. So, it seems to me that the logic might be worth adjusting to update `prev_end` in that path as well. If I misunderstood and this does not cause a practical issue here, I’m happy to close the PR.
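For context, here is how that `prev_end` value appears to be consumed when the max-speech limit fires, reduced to a runnable sketch; the helper name and numbers are hypothetical, and the logic is modeled on the upstream silero-vad code this function derives from, not copied from faster_whisper/vad.py:

```python
def end_at_max_speech(cur_sample: int, prev_end: int) -> int:
    """Assumed fallback: prefer the last remembered silence start
    (prev_end) as the segment end; otherwise hard-cut at cur_sample."""
    return prev_end if prev_end else cur_sample

# If the branch quoted above never runs, prev_end stays 0 and the only
# option left when the limit fires is the hard cut at the current sample:
print(end_at_max_speech(cur_sample=16000, prev_end=0))      # 16000
print(end_at_max_speech(cur_sample=16000, prev_end=10240))  # 10240
```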
Did YOU notice it, or was it noticed by AI?
Thanks for the follow-up. To be transparent, I first noticed this while reviewing the code with the help of an LLM. Here’s the context.

I’m building a small GUI pipeline around faster-whisper to improve transcription quality for my own use. For this project, I’m adding a pre-transcription audio preprocessing stage, using BS-RoFormer models fine-tuned for my use case along with a few useful preprocessing steps from the audio separation community. Before updating, I reviewed the upstream changes carefully.

While using faster-whisper, I’ve been modifying several defaults and parameters to improve transcription quality for my audio, and through that process I’ve been gradually learning how the library behaves. For example, because the existing temperature fallback logic can reduce reproducibility (results can vary between runs and quality can shift each time I transcribe), I tried setting temperature to 0 to make outputs more consistent. However, once temperature fallback no longer triggered, I ran into cases where the model didn’t escape hallucinated/low-quality outputs. That’s what pushed me to enable VAD and start tuning VAD-related parameters more carefully.

While doing that, I learned that even relatively small differences in how silence is handled (on the order of ~1 second) can noticeably change the transcription result. So I wanted to understand what was actually happening in the pipeline: how VAD decides boundaries, when it splits, and what each parameter means in terms of control flow. I used an LLM to help me follow the code path, in particular the VAD-related logic, and then validated my understanding through repeated experiments on my own audio to see how parameter changes affect the output.

Through that iterative process, I became much more aware that VAD behavior and its parameters can have a meaningful impact on transcription quality for my use case. That’s why, before applying this VAD-related update, I reviewed the relevant code path carefully and decided to cautiously propose this PR. I’m still learning this codebase, so I might be wrong; if my understanding is off, I’ll happily revise or close the PR.
Can you swear to that? It looks to me like you're overstating your role. I think it's you who's helping the LLM by bringing its reviews here.
Before pulling in the VAD-related commit, I reviewed the diff and the relevant code path to understand what the change does in my pipeline. I used an LLM as a helper during that review, and the initial observation came up during that review via the LLM. If any of my earlier comments came across as “I noticed this first on my own without any LLM help,” I apologize; that wasn’t what I meant, and that wording was a translation/phrasing mistake. That said, I didn’t mean to overstate my role. I worked through the logic myself, thought through the likely impact, and decided what to propose in this PR. If my wording suggested otherwise, that wasn’t my intention. I’m just trying to give an honest description of how I reviewed the code and validated the behavior.
Why did you close it? We illiterate plebs were hoping for more of AI's divine hallucinations.
I closed it because it looks like earliest-on-tie may be the intended behavior here. Thanks for taking a look. For my use case, I’ll keep a local tweak for now.
Are you sure? How about the "it produces additional chunks, that directly increases the number of encoder/decoder calls" problem? And what about other problems with "possible_ends" when "use_max_poss_sil_at_max_speech" is off?
I think your AI is stuck in a loop. :)
From what I’ve been able to confirm, in some situations an unintended and unnecessary trailing silence can be included after a speech chunk, and I thought that extra silence could affect transcription quality. That said, I’m not claiming my interpretation is definitely correct. Given your reaction and your experience contributing across multiple repos, it seems possible this is intended behavior or that I’m missing some context, so I decided to close the PR.
Is this AI's hallucination again?
Because of your reaction and your experience contributing across multiple repos, it seems possible that this is intended behavior or that I’m missing some context, so I’m trying to be careful with how strongly I state my conclusion. |
How "trailing silence can be included after a speech chunk" can be intended behavior if the whole point of VAD is to remove silence... If you have confirmed such issue then present your case. On the original PR: it actually has the opposite effect, it creates the "problem" that it claims to solve. So, it's AI's hallucination. |
Here’s a minimal repro showing that trailing silence can be included at the end of a speech chunk in the current VAD logic (no PR changes required). Repro (run from repo root):

```python
import numpy as np

import faster_whisper.vad as vad

# 16kHz, window=512 => each frame is 512 samples (~32ms).
# 20 frames speech then 20 frames silence:
probs = [0.9] * 20 + [0.1] * 20

# Stub the VAD model so get_speech_timestamps() consumes our synthetic probs
vad.get_vad_model = lambda: (lambda audio: np.array(probs, dtype=np.float32))

audio = np.zeros(512 * len(probs), dtype=np.float32)
silence_start = 20 * 512  # first silent frame start

out_true = vad.get_speech_timestamps(
    audio,
    sampling_rate=16000,
    max_speech_duration_s=1.0,
    min_silence_duration_ms=2000,
    speech_pad_ms=0,
    use_max_poss_sil_at_max_speech=True,
)
out_false = vad.get_speech_timestamps(
    audio,
    sampling_rate=16000,
    max_speech_duration_s=1.0,
    min_silence_duration_ms=2000,
    speech_pad_ms=0,
    use_max_poss_sil_at_max_speech=False,
)

print("use_max_poss...=True :", out_true)
print("use_max_poss...=False:", out_false)
print("silence_start:", silence_start)
print("trailing_silence_samples (True):", out_true[0]["end"] - silence_start)
print("trailing_silence_seconds (True):", (out_true[0]["end"] - silence_start) / 16000.0)
```

Output on my run:

Looking at the results, you can see that when `use_max_poss_sil_at_max_speech=True`, the returned segment end lands inside the silent tail (15872) instead of at the silence start (10240), so 5632 samples (~0.352s) of non-speech are included at the end of the chunk.
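For reference, the arithmetic behind those trailing-silence figures is plain frame counting; a quick check (the 15872 end value is the one quoted in the discussion below):

```python
sr = 16000
window = 512                 # samples per VAD frame at 16 kHz
silence_start = 20 * window  # 10240: first non-speech frame in the repro
segment_end = 31 * window    # 15872: segment end when max duration fires
tail = segment_end - silence_start
print(tail, tail / sr)       # 5632 samples -> 0.352 s of non-speech tail
```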
So, you [your LLM] claim that the bug is in the `min_silence_duration_ms` handling? I would say that the issue is the opposite of "unnecessary trailing silence".

When you fully rely on an LLM, you're missing logical thinking. 😉
`min_silence_duration_ms=2000` doesn’t mean “<2s is not silence”. In the code, silence/non-speech frames are identified by `speech_prob < neg_threshold` while triggered. In this repro the tail is already below `neg_threshold` (non-speech); it’s just shorter than 2s, so the silence-based end condition never gets a chance to fire before the max-speech limit does.
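To make that frame classification concrete, here is a minimal, self-contained hysteresis sketch; the default values and the function itself are illustrative assumptions, and the real loop in vad.py tracks more state than this:

```python
def classify_frames(probs, threshold=0.5, neg_threshold=0.35):
    """Hysteresis sketch: once triggered by speech_prob >= threshold,
    a frame only counts as non-speech when speech_prob < neg_threshold.
    The defaults here are assumptions, not faster-whisper's values."""
    triggered = False
    labels = []
    for p in probs:
        if not triggered and p >= threshold:
            triggered = True
        elif triggered and p < neg_threshold:
            triggered = False
        labels.append("speech" if triggered else "non-speech")
    return labels

# The repro's tail probs (0.1) fall below neg_threshold, so those frames
# count as non-speech immediately, long before 2s of silence accumulates:
print(classify_frames([0.9] * 4 + [0.1] * 4))
```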
So?

Is the prompt too short for your LLM to respond? Is it still confused by layman's semantics?
So: per your earlier “present your case” request, the repro shows a concrete path where a segment end can land inside non-speech when max_speech_duration_s triggers during an ongoing short silence.

min_silence_duration_ms=2000 doesn’t mean “<2s is not silence”. It only means “don’t terminate because of silence unless it lasts >=2s”. In the repro, temp_end is set at 10240 (silence start), but when max_speech_duration_s fires before possible_ends is populated, the fallback end becomes cur_sample (15872), i.e. 15872-10240=5632 samples ≈ 0.352s of non-speech included at the end.

If we’re already in non-speech at that moment, I’d argue the more consistent boundary is temp_end (silence start) rather than cur_sample. A minimal fix would be: when max duration fires while temp_end is set, end at temp_end.
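A minimal sketch of that proposed clamp, using the variable names from this thread (assumed to mirror, not literally match, faster_whisper/vad.py):

```python
def end_with_temp_end_clamp(cur_sample: int, prev_end: int, temp_end: int) -> int:
    """Proposed: when the max-speech limit fires while we are already
    inside a short silence (temp_end set), end the segment at the
    silence start instead of including the non-speech tail."""
    if prev_end:       # a qualifying silence was already remembered
        return prev_end
    if temp_end:       # ongoing short silence: clamp to its start
        return temp_end
    return cur_sample  # genuinely mid-speech: hard cut at the limit

# Repro numbers: silence starts at 10240, the limit fires at 15872, and
# prev_end was never set because the silence is shorter than 2s.
print(end_with_temp_end_clamp(15872, prev_end=0, temp_end=10240))  # 10240
```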
I think you need to increase your temperature because you're stuck in a hallucination loop.
You wrote earlier: “How can trailing silence be included after a speech chunk if the whole point of VAD is to remove silence?” So: do you understand the code path now? Either this is intended behavior, or it’s a bug and should be fixed. Which is it?
Yes, I [not AI] wrote that. But somehow you [the LLM] forgot that it wrote that the "trailing silence" is "unnecessary".

No, every one of your [LLM] posts looks like sheer lunacy to me.
You’re right that my wording “unnecessary” was too strong and imprecise. Let me restate it in concrete terms.

Fact: when max_speech_duration_s triggers while we’re already in frames classified as non-speech (speech_prob < neg_threshold), the current fallback can set end = cur_sample, so the returned segment may include some non-speech at the tail (even with speech_pad_ms=0).

Question: is that tail inclusion intended as part of the design (e.g., to avoid cutting too early when short pauses occur), or should the fallback prefer temp_end (silence start) when temp_end is already set?

If the intended behavior is to keep trailing non-speech in that specific max-duration path, I’m fine with that; then it’s not a bug. If not, a minimal change would be to clamp to temp_end when temp_end is set at the moment max-duration fires. If you disagree, could you point to the exact rationale in the code/expected behavior? I’m happy to adjust my understanding.
I think you [LLM] missed the point while it was busy retranslating semantics. My advice: don't waste your time with an LLM and don't try to waste the time of others. Why are YOU even needed here? I can copy/paste my questions to an AI myself to get similar hallucinated responses. Anyone can paste vad.py and ask an AI to hallucinate various "problems". Do you think you are some kind of "ambassador" that we need to mediate between AI and humans?
I am not asking anyone to trust an LLM. I am simply pointing to the reproducible test case. The only technical question remaining is: when the max-duration limit fires during an ongoing short silence, should the segment end at cur_sample or at temp_end? If your stance is that "cur_sample is the intended behavior," then I will drop this discussion.

If my way of presenting the case or explaining the repro felt like it touched your pride, I apologize; that was never my intention. It’s likely a nuance lost in translation.

It is past 2:30 AM here, so I am heading to bed. I will check back for any further technical comments from you when I am available tomorrow.

P.S.
That behavior needs to be investigated; maybe next week I'll look at it. Posting AI-generated responses is not OK. Using a translator is OK.

IMO, currently it's not a proper app for such a task.
Summary

When `use_max_poss_sil_at_max_speech=True`, `get_speech_timestamps` currently picks the first candidate when multiple silence candidates have the same duration.

Problem

`max(possible_ends, key=lambda x: x[1])` breaks ties by first occurrence, which biases the split point toward older silence segments instead of the most recent one near the max-speech boundary.

Fix

Use recency as a tie-breaker: replace `key=lambda x: x[1]` with `key=lambda x: (x[1], x[0])`. This preserves the longest-silence behavior and, on ties, chooses the latest candidate.

Tests

Added `tests/test_vad.py::test_get_speech_timestamps_prefers_latest_silence_on_tie`. Ran `pytest -q tests/test_vad.py` (passed).
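A self-contained illustration of the tie-break change; the candidate tuples are hypothetical, assuming each `possible_ends` entry is an `(end_sample, silence_duration)` pair as the keys above suggest:

```python
# Two silences of equal (maximum) duration: one early, one near the
# max-speech boundary, plus a shorter one in between.
possible_ends = [(3200, 1024), (6400, 512), (8960, 1024)]

# Current key: Python's max() keeps the first maximal element, so an
# equal-duration tie resolves to the earliest candidate.
print(max(possible_ends, key=lambda x: x[1]))          # (3200, 1024)

# Proposed key: on equal duration, the larger end_sample wins, so the
# tie resolves to the latest candidate near the boundary.
print(max(possible_ends, key=lambda x: (x[1], x[0])))  # (8960, 1024)
```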