Intelligibility prediction? #211

youssefabdelm · 2022-09-30T20:59:54Z

When trying to transcribe a very low-quality audio file, I noticed that the model started repeating itself, saying "Inaudible". This gave me the idea that the model could perhaps be used to predict intelligibility (see https://arxiv.org/abs/2204.04288, which found that ASR uncertainty metrics correlated strongly with human ratings of intelligibility).
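
One way to make "ASR uncertainty" concrete is the entropy of the decoder's per-token probability distribution, averaged over the decoding steps. The sketch below is only an illustration under that assumption: it is plain Python rather than Whisper code, the helper names and the toy logits are hypothetical, and the linked paper may define its metrics differently.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; 0 means the distribution is fully certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_logits):
    """Average entropy over decoding steps; higher means a less certain decoder."""
    return sum(entropy(softmax(row)) for row in step_logits) / len(step_logits)


# Hypothetical logits over a 4-token vocabulary, two decoding steps each.
confident = [[9.0, 0.0, 0.0, 0.0], [0.0, 8.0, 0.0, 0.0]]
uncertain = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.4, 0.6, 0.5]]

print(mean_token_entropy(confident) < mean_token_entropy(uncertain))
```

A flat distribution (high entropy) means the decoder is unsure which token comes next. With a real model, the rows would come from the decoder's logits at each step.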

My idea is to create a function that:

Takes in audio and text
Returns a probability that the text matches the audio (basically the logprobs you guys use)

I'm not sure if this will work because I barely understand the model. But my guess is that if a word like "Inaudible" passed the logprob threshold test for whether a transcription is a failure, then my hunch that the logprobs measure something like "the probability that the text will be this, given this audio" might be right.

In that case, I can just write something like "Unintelligible", give it a piece of audio, and have it measure the probability that the audio is unintelligible.

I tried to make something with the help of Codex, but I'm still getting errors:

(I've placed this inside transcribe.py)

from .audio import load_audio

def convert_text_to_tokens(text, model, language, task='transcribe'):
    """
    Given some text, encode it to tokens by the model
    
    Parameters
    ----------
    text: str of words
    """
    tokenizer = get_tokenizer(model.is_multilingual, language=language, task=task)
    return tokenizer.encode(text)



def get_probability_of_correctness(audio_path, text, model, language, task):
    """
    Given some text and the path to an audio file, return
    the probability that the text is correct
    
    Parameters
    ----------
    audio_path: str
        Path to the audio file
    text: str
        The candidate transcription to score
    """
    SAMPLE_RATE = 16000
    N_FFT = 400
    N_MELS = 80
    HOP_LENGTH = 160
    CHUNK_LENGTH = 30
    N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE  # 480000: number of samples in a chunk
    N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH)  # 3000: number of frames in a mel spectrogram input


    tokens = convert_text_to_tokens(text, model, language, task=task)
    audio = load_audio(audio_path, sr=SAMPLE_RATE)
    mel = log_mel_spectrogram(audio)
    
    dtype = torch.float32
    
    segment = pad_or_trim(mel, N_FRAMES).to(model.device).to(dtype)
    tokens = torch.tensor(tokens)
    logits = model.logits(tokens.to(model.device), segment)
    return logits.softmax(dim=-1)

from whisper.transcribe import convert_text_to_tokens, get_probability_of_correctness
get_probability_of_correctness('/content/test.wav', 'Unintelligible', model, 'en', 'transcribe')

This gives me the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-f539f9eae14e> in <module>
      1 from whisper.transcribe import convert_text_to_tokens, get_probability_of_correctness
----> 2 get_probability_of_correctness('/content/test.wav', 'Unintelligible', model, 'en', 'transcribe')

7 frames
/usr/local/lib/python3.7/dist-packages/whisper/transcribe.py in get_probability_of_correctness(audio_path, text, model, language, task)
    352     segment = pad_or_trim(mel, N_FRAMES).to(model.device).to(dtype)
    353     tokens = torch.tensor(tokens)
--> 354     logits = model.logits(tokens.to(model.device), segment)
    355     return logits.softmax(dim=-1)
    356 

/usr/local/lib/python3.7/dist-packages/whisper/model.py in logits(self, tokens, audio_features)
    218 
    219     def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor):
--> 220         return self.decoder.forward(tokens, audio_features)
    221 
    222     def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> Dict[str, torch.Tensor]:

/usr/local/lib/python3.7/dist-packages/whisper/model.py in forward(self, x, xa, kv_cache)
    187 
    188         for block in self.blocks:
--> 189             x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
    190 
    191         x = self.ln(x)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/whisper/model.py in forward(self, x, xa, mask, kv_cache)
    122         kv_cache: Optional[dict] = None,
    123     ):
--> 124         x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)
    125         if self.cross_attn:
    126             x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
   1131         # Do not call functions when jit is used
   1132         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.7/dist-packages/whisper/model.py in forward(self, x, xa, mask, kv_cache)
     83             v = kv_cache.get(self.value, self.value(xa))
     84 
---> 85         wv = self.qkv_attention(q, k, v, mask)
     86         return self.out(wv)
     87 

/usr/local/lib/python3.7/dist-packages/whisper/model.py in qkv_attention(self, q, k, v, mask)
     87 
     88     def qkv_attention(self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None):
---> 89         n_batch, n_ctx, n_state = q.shape
     90         scale = (n_state // self.n_head) ** -0.25
     91         q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3) * scale

ValueError: not enough values to unpack (expected 3, got 2)

So I probably just need to tokenize in the right way. Would love any help whatsoever!
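
The ValueError at the bottom of the traceback comes from unpacking a 2-D shape into three names. That is what happens when the token tensor has no batch dimension: a 1-D tensor of n token ids embeds to (n, n_state) instead of (1, n, n_state). The sketch below reproduces this with nested Python lists standing in for tensors; `embed`, `shape`, and the tiny embedding table are hypothetical stand-ins, not Whisper functions.

```python
def embed(tokens, table):
    """Look up one embedding vector per token id, keeping the nesting of `tokens`."""
    if isinstance(tokens, int):
        return table[tokens]
    return [embed(t, table) for t in tokens]


def shape(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Hypothetical 3-token vocabulary with 4-dimensional embeddings.
table = [[0.0] * 4 for _ in range(3)]

flat = [2, 0]      # like the output of tokenizer.encode: shape (n_ctx,)
batched = [flat]   # with a leading batch axis: shape (1, n_ctx)

try:
    # Mirrors `n_batch, n_ctx, n_state = q.shape` with unbatched tokens.
    n_batch, n_ctx, n_state = shape(embed(flat, table))
except ValueError as err:
    message = str(err)  # same wording as the error in the traceback

# With the batch axis present, the unpacking succeeds.
n_batch, n_ctx, n_state = shape(embed(batched, table))
```

In PyTorch terms, the equivalent change would be building the tokens as `torch.tensor([tokens])` or calling `tokens.unsqueeze(0)`. Separately, the second parameter of `model.logits` is named `audio_features`, so it likely expects the encoder output rather than the raw mel spectrogram.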

youssefabdelm (Author) · 2022-10-01T10:51:39Z

I've modified the code slightly. It seems to work, but honestly I feel like I might be doing something stupid with the logits (since I barely understand what's going on):

import torch

from whisper.audio import SAMPLE_RATE, N_FRAMES, pad_or_trim, log_mel_spectrogram, load_audio
from whisper.tokenizer import get_tokenizer


def convert_text_to_tokens(text, model, language, task='transcribe'):
    """
    Encode text into token ids using the model's tokenizer.

    Parameters
    ----------
    text: str
        The text to encode
    """
    tokenizer = get_tokenizer(model.is_multilingual, language=language, task=task)
    return tokenizer.encode(text)


def get_probability_of_correctness(audio_path, text, model, language, task):
    """
    Given some text and the path to an audio file, return the logits,
    their softmax, and the scaled values computed from that softmax.

    Parameters
    ----------
    audio_path: str
        Path to the audio file
    text: str
        The candidate transcription to score
    """
    tokens = convert_text_to_tokens(text, model, language, task=task)
    audio = load_audio(audio_path, sr=SAMPLE_RATE)
    mel = log_mel_spectrogram(audio)

    # pad or trim to one 30-second chunk (N_FRAMES = 3000 mel frames)
    mel = pad_or_trim(mel, N_FRAMES).to(model.device).to(torch.float32)
    print(mel.shape)

    # add a batch dimension if a single spectrogram was given
    if mel.ndim == 2:
        mel = mel.unsqueeze(0)

    # skip encoder forward pass if already-encoded audio features were given
    if mel.shape[-2:] != (model.dims.n_audio_ctx, model.dims.n_audio_state):
        mel = model.encoder(mel)

    # wrap the token list so the tensor has shape (1, n_tokens)
    tokens = torch.tensor([tokens])

    logits = model.logits(tokens.to(model.device), mel)
    # note: softmax returns probabilities over the vocabulary, not log-probabilities
    sum_logprobs = logits.softmax(dim=-1)

    avg_logprobs = [lp / (len(t) + 1) for t, lp in zip(tokens, sum_logprobs)]
    return logits, sum_logprobs, avg_logprobs

And then:

logits, sum_logprobs, avg_logprobs = get_probability_of_correctness('/content/test.wav', 'Unintelligible', model, 'en', 'transcribe')
first_half = sum(avg_logprobs[0][0])/len(avg_logprobs[0][0])
second_half = sum(avg_logprobs[0][1])/len(avg_logprobs[0][1])
final_score = sum([first_half, second_half])/2
final_score.item()

What I'm trying to do is get a total logprob metric for "How probable is it that this text is unintelligible" by passing in the text "Unintelligible", tokenizing it, and then getting logprobs for it. However, I find that I'm getting logits of size 10K (in 2 arrays inside a tensor).

It's unclear to me from your code how you guys are getting a single number for the avg_logprob.
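
A single score per text is usually obtained in three steps. Apply log-softmax (not softmax) to each position's logits, pick out the log-probability of the token that actually appears at that position, and sum those values. Dividing the sum by the token count then gives an average. The sketch below shows that reduction in plain Python under stated assumptions: `score_forced_text` and the toy logits are hypothetical, each row of `step_logits` is assumed to be the decoder's prediction for the corresponding target token (so any prompt tokens and the one-position shift of an autoregressive decoder have already been handled), and the normalization divides by the number of scored tokens, whereas the `len(t) + 1` in the code above uses one more.

```python
import math


def log_softmax(logits):
    """Log-probabilities for one row of logits (numerically stable)."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def score_forced_text(step_logits, target_tokens):
    """
    Score a fixed token sequence against per-position logits.

    step_logits[i] is the row of logits predicting target_tokens[i].
    Returns (sum_logprob, avg_logprob).
    """
    if len(step_logits) != len(target_tokens):
        raise ValueError("need exactly one row of logits per target token")
    token_logprobs = [
        log_softmax(row)[tok] for row, tok in zip(step_logits, target_tokens)
    ]
    sum_logprob = sum(token_logprobs)
    return sum_logprob, sum_logprob / len(target_tokens)


# Hypothetical 3-token vocabulary and a 2-token target text.
step_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
total, average = score_forced_text(step_logits, [0, 1])
probability = math.exp(total)  # probability of the whole sequence
```

Here `math.exp(total)` converts the summed log-probability back into a probability. That value tends to shrink as the text gets longer, which is why the average is generally the more comparable number across texts of different lengths.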

turnkit · 2022-11-01T08:04:54Z

I believe someone demonstrated confidence-colored output in the .srt thread.
