Is there a way to work with a limited vocabulary? #843

theabhinavdas · 2023-01-14T19:59:33Z

theabhinavdas
Jan 14, 2023

EDIT: Improving this question.

After further research, I found that what I'm looking for is called ASR using limited vocabulary. I'm wondering if that is possible with Whisper and how I may go about doing that.

DeepSpeech has something like this: https://github.com/mozilla/DeepSpeech/blob/master/data/lm/generate_lm.py

Answered by jongwook

Jan 17, 2023


jongwook · 2023-01-17T08:00:48Z

jongwook
Jan 17, 2023
Maintainer

It's not trivial, but it would be simpler if every word in your list can be represented in a single token, like:

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=False)

tokenizer.encode(" cat")  # [3797]
tokenizer.encode(" dog")  # [3290]
tokenizer.encode(" mouse")  # [10211]
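Before relying on the single-token case, it may help to check which words in a list actually encode to one token. The sketch below is only an illustration: `split_by_token_count` and `fake_encode` are hypothetical names, and the stand-in encoder exists so the snippet runs without Whisper installed. With Whisper available, `get_tokenizer(multilingual=False).encode` could be passed in instead.

```python
from typing import Callable, Dict, List, Tuple


def split_by_token_count(
    encode: Callable[[str], List[int]], words: List[str]
) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Separate words that encode to one token from those that need several."""
    single: Dict[str, int] = {}
    multi: Dict[str, List[int]] = {}
    for word in words:
        ids = encode(" " + word)  # leading space, as in the snippet above
        if len(ids) == 1:
            single[word] = ids[0]
        else:
            multi[word] = ids
    return single, multi


# Stand-in encoder for illustration only (an assumption, not Whisper's vocabulary
# beyond the three IDs quoted above).
_FAKE_VOCAB = {" cat": [3797], " dog": [3290], " mouse": [10211]}


def fake_encode(text: str) -> List[int]:
    # Unknown strings fall back to one made-up ID per character.
    return _FAKE_VOCAB.get(text, [1000 + ord(c) for c in text])


single, multi = split_by_token_count(fake_encode, ["cat", "dog", "axolotl"])
print(single)       # words usable as single-token classes
print(list(multi))  # words that would need sequence-level filtering
```

If the second group comes back empty, every target word is a single token.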

At this point it becomes just a multi-class classification problem. Otherwise, during decoding, you could add another instance of LogitFilter that blocks all token sequences except the ones you allow:

whisper/whisper/decoding.py

Lines 367 to 380 in 0f39c89

class LogitFilter:
    def apply(self, logits: Tensor, tokens: Tensor) -> None:
        """Apply any filtering or masking to logits in-place

        Parameters
        ----------
        logits : Tensor, shape = (n_batch, vocab_size)
            per-token logits of the probability distribution at the current step

        tokens : Tensor, shape = (n_batch, current_sequence_length)
            all tokens in the context so far, including the prefix and sot_sequence tokens

        """
        raise NotImplementedError
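For words that span several tokens, the bookkeeping such a filter would need can be sketched without touching Whisper at all. The following is a minimal sketch under stated assumptions, not Whisper code: each utterance is assumed to contain exactly one entry from the allowed list, and `build_trie`, `allowed_next`, and the `END` sentinel are hypothetical names. A real `LogitFilter.apply` would then set the logits of every token outside the returned set to negative infinity, after stripping the prefix and sot_sequence tokens that `tokens` includes.

```python
from typing import Sequence, Set

END = -1  # sentinel key marking "an allowed sequence ends here" (hypothetical)


def build_trie(allowed: Sequence[Sequence[int]]) -> dict:
    """Nested-dict prefix tree over the allowed token sequences."""
    root: dict = {}
    for seq in allowed:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def allowed_next(trie: dict, generated: Sequence[int], eot: int) -> Set[int]:
    """Token IDs that keep `generated` on one of the allowed sequences.

    `generated` holds only the tokens sampled so far; the caller is assumed
    to have removed any prefix and sot_sequence tokens.
    """
    node = trie
    for tok in generated:
        node = node.get(tok)
        if node is None:
            return {eot}  # already off every allowed path: only allow stopping
    nxt = {tok for tok in node if tok != END}
    if END in node:
        nxt.add(eot)  # a complete entry has been produced, so stopping is valid
    return nxt


# Made-up token IDs: two entries share the first token 10, one is the single token 20.
trie = build_trie([[10, 11], [10, 12], [20]])
print(allowed_next(trie, [], eot=99))
print(allowed_next(trie, [10], eot=99))
print(allowed_next(trie, [20], eot=99))
```

With single-token words only, the trie has depth one and this reduces to the plain allow-list case.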

5 replies

pranabenator Jan 17, 2023

@jongwook What about homophones?

tilsen Sep 27, 2023

@jongwook Can you elaborate on how to accomplish the limited vocabulary with the tokenizer approach? I can't figure out what to do with the tokenizer after the commands you have above.

Thanks

jongwook Sep 27, 2023
Maintainer

Make whisper transcribe numbers in the actual spoken words #1041 (comment)

This is an example that blocks all tokens representing numbers to make Whisper transcribe numbers into words rather than digits. It'd be a more extreme version of the above where you'd block all tokens except the limited set of words that you want to allow.

tilsen Sep 27, 2023

Thanks for pointing me to this example. I wonder if you have any insight into why this doesn't seem to work with my "yes","no" tokens:

tokenizer = get_tokenizer(multilingual=False)

targets = ["yes","no"]

keep_tokens=list()
for i in range(0,len(targets)):
    keep_tokens.append(tokenizer.encode(" "+targets[i]))


block_tokens=list()
for i in range(tokenizer.eot):
    if (i not in keep_tokens):
        block_tokens.append(i)

model = whisper.load_model("medium",device="cuda")

result = model.transcribe(files[0],suppress_tokens=[-1] + block_tokens, word_timestamps=True)
print(result)

result = model.transcribe(files[0], word_timestamps=True)
print(result)

The first result (with only the "yes","no" tokens) doesn't transcribe a "yes":

{'text': '', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 2.0, 'text': '', 'tokens': [], 'temperature': 1.0, 'avg_logprob': -1.7079352140426636, 'compression_ratio': 0.42857142857142855, 'no_speech_prob': 0.1031215488910675, 'words': []}], 'language': 'en'}

But without suppression the "yes" is correctly transcribed:

{'text': ' Yes.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 1.54, 'text': ' Yes.', 'tokens': [50364, 1079, 13, 50464], 'temperature': 0.0, 'avg_logprob': -0.8751304626464844, 'compression_ratio': 0.3333333333333333, 'no_speech_prob': 0.1031215488910675, 'words': [{'word': ' Yes.', 'start': 0.0, 'end': 1.54, 'probability': 0.7959637641906738}]}], 'language': 'en'}

LeoArtaza Oct 6, 2023

@tilsen I was just trying this out, and what I did to solve it was to include not only "yes" but also "Yes", "YES", " Yes.", etc.
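A sketch of how the variant idea could be combined with the block list. Two things stand out in the original snippet: `tokenizer.encode` returns a list, so `keep_tokens` ends up as a list of lists and the test `i not in keep_tokens` is true for every integer, which blocks everything; and the "medium" checkpoint is multilingual while `get_tokenizer(multilingual=False)` returns the English-only tokenizer, so the IDs may not line up either. The helper names `surface_variants` and `build_block_list` are hypothetical, and the stand-in encoder is there only so the example runs without Whisper.

```python
from typing import Callable, Iterable, List, Set


def surface_variants(word: str) -> List[str]:
    """Casing, leading-space, and trailing-punctuation variants of one word."""
    variants: List[str] = []
    for base in (word.lower(), word.capitalize(), word.upper()):
        for prefix in ("", " "):
            for suffix in ("", ".", ",", "!", "?"):
                candidate = prefix + base + suffix
                if candidate not in variants:
                    variants.append(candidate)
    return variants


def build_block_list(
    encode: Callable[[str], List[int]], words: Iterable[str], eot: int
) -> List[int]:
    """Every token ID below `eot` that no variant of the target words uses."""
    keep: Set[int] = set()
    for word in words:
        for variant in surface_variants(word):
            keep.update(encode(variant))  # update() flattens the returned ID list
    return [i for i in range(eot) if i not in keep]


# Stand-in encoder (assumption): maps a string to a single ID equal to its length.
block = build_block_list(lambda s: [len(s)], ["yes"], eot=8)
print(block)
```

With a real tokenizer, `encode` would be `tokenizer.encode` and `eot` would be `tokenizer.eot`. Note that keeping every sub-token of every variant also lets the model combine those sub-tokens in ways that are not in the word list.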

pranabenator · 2023-01-17T17:12:35Z

pranabenator
Jan 17, 2023

IMO, a solid way to attack the problem would be in three steps: (1) recognition; (2) distance calculation, both morphological (e.g., vanilla Levenshtein) and phonetic (e.g., double metaphone), with weights attached to each such extracted "feature" by batch (supervised) learning; and finally (3) a suitable RL technique (e.g., Q-learning) to learn weights corresponding to the idiosyncratic hyperarticulations of an individual or a group of individuals.

Further, the token-based approach may not be applicable dynamically, which limits its usefulness in this use case.
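The morphological half of step (2) can be sketched in a few lines: transcribe without restrictions, then snap the result to the closest vocabulary entry by edit distance. This is only an illustration under stated assumptions; `snap_to_vocabulary` is a hypothetical helper, and the phonetic distance, the learned feature weights, and the RL step are not shown.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def snap_to_vocabulary(heard: str, vocabulary: List[str]) -> Tuple[str, int]:
    """Return the vocabulary word nearest to `heard` and its distance."""
    cleaned = heard.strip().strip(".,!?").lower()
    scored = ((word, levenshtein(cleaned, word.lower())) for word in vocabulary)
    return min(scored, key=lambda pair: pair[1])  # ties go to the earlier word


print(snap_to_vocabulary(" Yes.", ["yes", "no"]))
print(snap_to_vocabulary("know", ["yes", "no"]))
```

The "know" case hints at the homophone problem raised earlier in the thread: spelling distance alone may not pick the right word, which is where a phonetic measure would come in.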

0 replies

SegfaultCreator · 2024-08-30T08:37:42Z

SegfaultCreator
Aug 30, 2024

Just in case somebody is struggling to implement the solution suggested by @jongwook: this was my approach to creating the suppress-token list, which works as desired.

import whisper

def createSuppressTokenList(words_to_remain):
    # multilingual=False selects the English-only tokenizer
    tokenizer = whisper.tokenizer.get_tokenizer(multilingual=False, language="en", task="transcribe")

    # Collect the token IDs of every word to keep. Each word is encoded with a
    # leading space and may map to more than one token, so the IDs go into one flat set.
    keep_tokens = set()
    for word in words_to_remain:
        keep_tokens.update(tokenizer.encode(" " + word))

    # Iterate over all token IDs below EOT and block every one that is not kept
    return [i for i in range(tokenizer.eot) if i not in keep_tokens]

Usage

detection_targets = ["foo", "bar"]
block_tokens = createSuppressTokenList(detection_targets)
result = model.transcribe(files[0],suppress_tokens=[-1] + block_tokens)

0 replies
