Intelligibility prediction? #211
Replies: 2 comments
-
I've modified the code slightly. It seems to work, but honestly I feel like I might be doing something stupid with the logits (since I barely understand what's going on):
import argparse
import os
import warnings
from typing import List, Optional, Tuple, Union, TYPE_CHECKING
import numpy as np
import torch
import tqdm
from whisper.audio import SAMPLE_RATE, N_FRAMES, HOP_LENGTH, pad_or_trim, log_mel_spectrogram, load_audio
from whisper.decoding import DecodingOptions, DecodingResult, DecodingTask
from whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE, get_tokenizer
from whisper.utils import exact_div, format_timestamp, optional_int, optional_float, str2bool, write_txt, write_vtt, write_srt
def convert_text_to_tokens(text, model, language, task='transcribe'):
    """
    Given some text, encode it to tokens by the model
    Parameters
    ----------
    text: str of words
    """
    tokenizer = get_tokenizer(model.is_multilingual, language=language, task=task)
    return tokenizer.encode(text)
def get_probability_of_correctness(audio_path, text, model, language, task):
    """
    Given some text, and an audio segment, return
    the probability that the text is correct
    Parameters
    ----------
    audio: np.ndarray
    The audio segment
    text: str
    The text to be transcribed
    """
    SAMPLE_RATE = 16000
    N_FFT = 400
    N_MELS = 80
    HOP_LENGTH = 160
    CHUNK_LENGTH = 30
    N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000: number of samples in a chunk
    N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH)  # 3000: number of frames in a mel spectrogram input
    tokens = convert_text_to_tokens(text, model, language, task=task)
    audio = load_audio(audio_path, sr=SAMPLE_RATE)
    mel = log_mel_spectrogram(audio)
    dtype = torch.float32
    segment = pad_or_trim(mel, N_FRAMES).to(model.device).to(dtype)
    mel = segment
    print(mel.shape)
    #'''
    single = mel.ndim == 2
    if single:
        mel = mel.unsqueeze(0)
    # skip encoder forward pass if already-encoded audio features were given
    if mel.shape[-2:] != (model.dims.n_audio_ctx, model.dims.n_audio_state):
        mel = model.encoder(mel)
    #'''
    tokens = torch.tensor([tokens])
    logits = model.logits(tokens.to(model.device), mel)
    sum_logprobs = logits.softmax(dim=-1)
    avg_logprobs = [lp / (len(t) + 1) for t, lp in zip(tokens, sum_logprobs)]
    return logits, sum_logprobs, avg_logprobs
And then:
logits, sum_logprobs, avg_logprobs = get_probability_of_correctness('/content/test.wav', 'Unintelligible', model, 'en', 'transcribe')
first_half = sum(avg_logprobs[0][0])/len(avg_logprobs[0][0])
second_half = sum(avg_logprobs[0][1])/len(avg_logprobs[0][1])
final_score = sum([first_half, second_half])/2
final_score.item()
What I'm trying to do is get a total logprob metric for "How probable is it that this text is unintelligible" by passing it the text "Unintelligible", tokenizing it, then getting logprobs for it. However, I find I'm getting logits that are of size 10K (in 2 arrays inside a tensor). It's unclear to me from your code how you guys are getting a single number for the
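As a hedged sketch of how a single number could come out of logits of that shape: take a log-softmax over the vocabulary at each position, pick out the log-probability of the token that actually comes next, then sum (or average) over the text tokens. The helper names below (`log_softmax`, `sequence_logprob`) and the `n_prefix` parameter are made up for illustration and are not part of whisper. The inputs are assumed to be plain Python lists (for example from `logits[0].tolist()`), and row `i` of the logits is assumed to predict token `i + 1`, as in a standard autoregressive decoder.

```python
import math


def log_softmax(row):
    """Turn one row of raw logits into log-probabilities (numerically stable)."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum_exp for x in row]


def sequence_logprob(logits, tokens, n_prefix=1):
    """Return (sum, average) log-probability of tokens[n_prefix:].

    logits   -- one row of vocabulary scores per input position
                (assumed layout: row i is the prediction for tokens[i + 1])
    tokens   -- the full token list that was fed to the decoder
    n_prefix -- number of leading tokens that are context only and not scored
    """
    if n_prefix < 1:
        raise ValueError("need at least one prefix token to predict from")
    total = 0.0
    for i in range(n_prefix, len(tokens)):
        # The row *before* position i holds the prediction for tokens[i].
        logprobs = log_softmax(logits[i - 1])
        total += logprobs[tokens[i]]
    n_scored = len(tokens) - n_prefix
    return total, total / max(n_scored, 1)


# Toy example with a 3-token vocabulary (made-up numbers, not model output).
toy_logits = [
    [0.0, 0.0, 0.0],            # uniform: each token has probability 1/3
    [0.0, 0.0, math.log(2.0)],  # probabilities 1/4, 1/4, 1/2
    [1.0, 2.0, 3.0],            # prediction after the last token, unused
]
toy_tokens = [0, 1, 2]
total, average = sequence_logprob(toy_logits, toy_tokens, n_prefix=1)
print(total, average, math.exp(average))
```

With the real model, the prefix would presumably be the tokenizer's start-of-transcript tokens placed before the encoded text, so that the first text token is scored too; that step is an assumption here and is not shown. Averaging a softmax output directly, as in the snippet above, mixes probabilities of every vocabulary entry rather than scoring the target tokens.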
-
Someone demonstrated a confidence coloring output in the .srt thread, I believe.
-
I noticed when trying to transcribe a very low quality audio file that the model started repeating itself saying "Inaudible". This gave me the idea that perhaps this model could be used to predict intelligibility (as per: https://arxiv.org/abs/2204.04288 - they found that ASR uncertainty metrics correlated strongly with human ratings of intelligibility)
My idea is to create a function that:
I'm not sure if this will work because I barely understand the model. But my guess is if a word like "Inaudible" passed the logprob threshold test of whether or not this transcription is a failure, then my hunch that the logprobs measure a kind of "what's the probability that the text will be this given this audio" might be right.
In that case I can just write something like "Unintelligible" and give it a piece of audio and have it measure the probability that that is unintelligible.
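A different angle on the same idea, sketched under assumptions: rather than scoring a fixed word, the per-segment `avg_logprob` values that whisper's `transcribe()` returns in `result["segments"]` could be aggregated into one uncertainty number for the whole file. The function name `intelligibility_proxy` is hypothetical, the segment values below are made up, and whether this number tracks human intelligibility ratings is untested here.

```python
def intelligibility_proxy(segments, logprob_threshold=-1.0):
    """Summarise per-segment confidence as a rough intelligibility proxy.

    segments -- list of dicts with "start", "end" and "avg_logprob" keys
    Returns (duration-weighted mean of avg_logprob,
             fraction of audio time spent in segments below the threshold).
    """
    total_time = 0.0
    weighted_sum = 0.0
    low_time = 0.0
    for seg in segments:
        duration = max(seg["end"] - seg["start"], 0.0)
        total_time += duration
        weighted_sum += duration * seg["avg_logprob"]
        if seg["avg_logprob"] < logprob_threshold:
            low_time += duration
    if total_time == 0.0:
        # No timed segments: there is nothing to average.
        return float("nan"), float("nan")
    return weighted_sum / total_time, low_time / total_time


# Made-up segments in the shape of result["segments"]; not real model output.
fake_segments = [
    {"start": 0.0, "end": 2.0, "avg_logprob": -0.2},
    {"start": 2.0, "end": 3.0, "avg_logprob": -1.1},
]
mean_logprob, low_fraction = intelligibility_proxy(fake_segments)
print(mean_logprob, low_fraction)
```

The default of -1.0 mirrors the `logprob_threshold` default in `transcribe()`; a lower mean or a higher low-confidence fraction would suggest audio the model finds harder to transcribe.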
I tried to make something with the help of Codex, but I'm still getting errors:
(I've placed this inside transcribe.py.)
It is giving me this error:
So I probably just need to tokenize in the right way. Would love any help whatsoever!