Can we have this as part of the pipe to address further hallucination issues #2521

saman-rahbar · 2025-02-10T19:01:19Z

Here is an example of confidence filtering based on token log-probabilities, plus a secondary verification pass, using Whisper to address the hallucination issues above. It seems to work fine with the current architecture.

import math

import whisper
from whisper.audio import SAMPLE_RATE  # Whisper works on 16 kHz audio


def transcribe_with_confidence_filter(
    model_name: str,
    audio_file_path: str,
    confidence_threshold: float = 0.2,
    secondary_pass: bool = True,
) -> str:
    model = whisper.load_model(model_name)

    # Load the audio once so low-confidence segments can be re-transcribed in isolation.
    audio = whisper.load_audio(audio_file_path)
    result = model.transcribe(audio, verbose=False, beam_size=5)

    filtered_texts = []

    for segment in result["segments"]:
        segment_text = segment["text"].strip()
        avg_logprob = segment.get("avg_logprob")  # mean token log-probability of the segment

        if avg_logprob is not None:
            # exp(mean log-prob) is the geometric mean of the token probabilities
            confidence_score = math.exp(avg_logprob)
        else:
            confidence_score = 1.0  # assume confident if we don't have logprobs

        if confidence_score >= confidence_threshold:
            filtered_texts.append(segment_text)
        elif secondary_pass:
            # Second pass on this segment's audio only, with more constrained decoding.
            # Re-running on the whole file would append the full transcript again.
            start = int(segment["start"] * SAMPLE_RATE)
            end = int(segment["end"] * SAMPLE_RATE)
            second_pass_result = model.transcribe(
                audio[start:end],
                verbose=False,
                temperature=0.0,  # single temperature, so no fallback to sampling
                beam_size=1,      # narrower beam
                # No initial_prompt: seeding the decoder with the suspect text
                # would bias it toward repeating that text.
            )
            filtered_texts.append(second_pass_result["text"].strip())
        else:
            filtered_texts.append("[LOW_CONFIDENCE]")

    return " ".join(filtered_texts)


if __name__ == "__main__":
    # Example usage
    audio_path = "sample_audio.wav"
    transcription = transcribe_with_confidence_filter(
        model_name="medium",
        audio_file_path=audio_path,
        confidence_threshold=0.25,
        secondary_pass=True,
    )
    print("Final transcription:", transcription)
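The function above scores whole segments. For a finer-grained filter, one option is Whisper's `word_timestamps=True` setting, which attaches a `words` list with a per-word `probability` to each segment. The sketch below only post-processes that output shape; the helper name `mask_low_confidence_words`, the `[?]` placeholder, and the 0.5 cutoff are illustrative assumptions rather than anything Whisper provides, and the sample data is hand-written, not real model output.

```python
from typing import Dict, List


def mask_low_confidence_words(
    segments: List[Dict],
    threshold: float = 0.5,     # hypothetical cutoff, tune per domain
    placeholder: str = "[?]",   # hypothetical marker for masked words
) -> str:
    """Rebuild the transcript, masking words whose probability is below threshold."""
    out = []
    for segment in segments:
        for word in segment.get("words", []):
            text = word["word"].strip()  # Whisper words carry a leading space
            # Words without a probability are treated as confident.
            if word.get("probability", 1.0) >= threshold:
                out.append(text)
            else:
                out.append(placeholder)
    return " ".join(out)


# Hand-written sample mimicking result["segments"] from
# model.transcribe(path, word_timestamps=True).
sample_segments = [
    {
        "words": [
            {"word": " Thanks", "probability": 0.97},
            {"word": " for", "probability": 0.95},
            {"word": " watching", "probability": 0.12},
        ]
    },
]

masked = mask_low_confidence_words(sample_segments)
print(masked)
```

Masking individual words keeps the confident parts of a segment instead of discarding or re-decoding all of it, at the cost of the extra alignment work that word timestamps require.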

Advantages

Reduced Hallucinations
By filtering out or re-checking segments that fall below a confidence threshold, we can lower the chance that hallucinated text ends up in the final transcript.
Minimal Architectural Changes
This proposal does not require retraining Whisper; it relies on existing capabilities (log probabilities and temperature/beam size settings).
Users can easily adjust the confidence_threshold, decoding parameters, and other heuristics based on their specific domain or application.
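As a rough illustration of the "other heuristics" mentioned above, each Whisper segment also carries `no_speech_prob` and `compression_ratio` alongside `avg_logprob`, and these can be combined into one check. The sketch below is a minimal version under assumptions: the function name `is_suspect_segment` is hypothetical, the 0.6 and 2.4 cutoffs are chosen to mirror the defaults of `transcribe`'s `no_speech_threshold` and `compression_ratio_threshold` arguments, and the sample segments are hand-written. It also shows that a confidence threshold of 0.25 corresponds to an average log-probability of ln(0.25), roughly -1.386.

```python
import math
from typing import Dict


def is_suspect_segment(
    segment: Dict,
    confidence_threshold: float = 0.25,   # compared against exp(avg_logprob)
    max_no_speech_prob: float = 0.6,      # assumed cutoff
    max_compression_ratio: float = 2.4,   # assumed cutoff
) -> bool:
    """Return True if any heuristic marks the segment as a likely hallucination."""
    # exp(avg_logprob) >= threshold is equivalent to avg_logprob >= ln(threshold).
    min_avg_logprob = math.log(confidence_threshold)
    if segment.get("avg_logprob", 0.0) < min_avg_logprob:
        return True  # decoder was unsure about its own tokens
    if segment.get("no_speech_prob", 0.0) > max_no_speech_prob:
        return True  # text was produced over probable silence
    if segment.get("compression_ratio", 0.0) > max_compression_ratio:
        return True  # highly repetitive text compresses unusually well
    return False


# Hand-written segments, not real model output.
clean = {"avg_logprob": -0.2, "no_speech_prob": 0.02, "compression_ratio": 1.4}
low_confidence = {"avg_logprob": -1.8, "no_speech_prob": 0.02, "compression_ratio": 1.4}
silence = {"avg_logprob": -0.5, "no_speech_prob": 0.9, "compression_ratio": 1.4}
repetitive = {"avg_logprob": -0.3, "no_speech_prob": 0.02, "compression_ratio": 3.1}

flags = [is_suspect_segment(s) for s in (clean, low_confidence, silence, repetitive)]
print(flags)
```

Segments flagged this way could then go through the same secondary pass or `[LOW_CONFIDENCE]` marking as in the function above.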
