Restricting possible range of language predicted. #2247

ngcheeyuan · 2024-06-26T13:23:48Z

Hi, as per the title: can we force the language prediction to be X if the predicted language is not within a given list?

phineas-pta · 2024-06-30T14:10:14Z

with lang option

1 reply

ngcheeyuan Jul 2, 2024
Author

Sorry, what do you mean? I can't seem to find it in any of the classes.

ryanheise · 2024-07-02T01:30:04Z

There is no command-line option for this, but if you're writing Python code it should be possible.

Take a look at this function in decoding.py:

@torch.no_grad()
def detect_language(
    model: "Whisper", mel: Tensor, tokenizer: Tokenizer = None
) -> Tuple[Tensor, List[dict]]:
    """
    Detect the spoken language in the audio, and return them as list of strings, along with the ids
    of the most probable language tokens and the probability distribution over all language tokens.
    This is performed outside the main decode loop in order to not interfere with kv-caching.

    Returns
    -------
    language_tokens : Tensor, shape = (n_audio,)
        ids of the most probable language tokens, which appears after the startoftranscript token.
    language_probs : List[Dict[str, float]], length = n_audio
        list of dictionaries containing the probability distribution over all languages.
    """
    if tokenizer is None:
        tokenizer = get_tokenizer(
            model.is_multilingual, num_languages=model.num_languages
        )
    if (
        tokenizer.language is None
        or tokenizer.language_token not in tokenizer.sot_sequence
    ):
        raise ValueError(
            "This model doesn't have language tokens so it can't perform lang id"
        )

    single = mel.ndim == 2
    if single:
        mel = mel.unsqueeze(0)

    # skip encoder forward pass if already-encoded audio features were given
    if mel.shape[-2:] != (model.dims.n_audio_ctx, model.dims.n_audio_state):
        mel = model.encoder(mel)

    # forward pass using a single token, startoftranscript
    n_audio = mel.shape[0]
    x = torch.tensor([[tokenizer.sot]] * n_audio).to(mel.device)  # [n_audio, 1]
    logits = model.logits(x, mel)[:, 0]

    # collect detected languages; suppress all non-language tokens
    mask = torch.ones(logits.shape[-1], dtype=torch.bool)
    mask[list(tokenizer.all_language_tokens)] = False
    logits[:, mask] = -np.inf
    language_tokens = logits.argmax(dim=-1)
    language_token_probs = logits.softmax(dim=-1).cpu()
    language_probs = [
        {
            c: language_token_probs[i, j].item()
            for j, c in zip(tokenizer.all_language_tokens, tokenizer.all_language_codes)
        }
        for i in range(n_audio)
    ]

    if single:
        language_tokens = language_tokens[0]
        language_probs = language_probs[0]

    return language_tokens, language_probs

If you call this function within your program and take the returned language_probs, that will give you the probabilities of all languages, allowing you to pull out the subset of languages that you want to consider.
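
As a rough illustration of that idea, here is a minimal sketch that works on the language_probs dictionary for a single audio input (language code mapped to probability). The helper names choose_language and best_allowed_language are hypothetical and not part of Whisper; the commented-out usage at the bottom assumes the loading and detection calls shown in the Whisper README and has not been tested against every model size.

```python
from typing import Dict, Iterable


def choose_language(
    language_probs: Dict[str, float], allowed: Iterable[str], fallback: str
) -> str:
    """Keep the top prediction if it is in `allowed`, otherwise return `fallback`.

    Hypothetical helper: this matches the original question (force language X
    when the detected language is outside a given list).
    """
    allowed_set = set(allowed)
    # Language code with the highest probability over all languages.
    top = max(language_probs, key=language_probs.get)
    return top if top in allowed_set else fallback


def best_allowed_language(
    language_probs: Dict[str, float], allowed: Iterable[str]
) -> str:
    """Return the most probable language among `allowed` only.

    Hypothetical helper: an alternative that ignores every language
    outside the list instead of using a fixed fallback.
    """
    candidates = {c: p for c, p in language_probs.items() if c in set(allowed)}
    if not candidates:
        raise ValueError("none of the allowed language codes are in language_probs")
    return max(candidates, key=candidates.get)


# Possible usage with Whisper (assumed, following the README):
#
#   import whisper
#   model = whisper.load_model("base")
#   audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
#   mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
#   _, probs = model.detect_language(mel)
#   lang = choose_language(probs, allowed=["en", "ms", "zh"], fallback="en")
#   result = model.transcribe("audio.mp3", language=lang)
```

The first helper forces a fixed fallback whenever the top prediction falls outside the list; the second picks whichever listed language scored highest, which may be preferable when the audio is known to be in one of the listed languages.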
