What special tokens to use for transcription, and why is <|startoflm|> so highly probable? #2578

belson17 · 2025-04-19T00:56:10Z

Hi all, I'm trying to compute the probability of a sequence of "target tokens" given an audio recording of speech. To do so, I pass this token sequence and the encoded Mel spectrogram to the decoder. My problem is that I'm unsure which special tokens to prepend to the sequence to tell the model to transcribe in English. In particular, for the "turbo" and "large-v3" models, the decoder assigns a very high probability to the token "<|startoflm|>", so the model seems to expect this token, but I don't know how to use it. I don't see this token predicted with high probability by other models, e.g., "base".

What prefix special tokens should I use for the turbo and large models? Below are snippets of code to show what I'm doing. If there is documentation, please point me there. Thanks!

import numpy as np
import torch
import torch.nn.functional as F
import whisper

model = whisper.load_model("turbo", device="cuda")

# tokenizer, mel, and target_token_ids are defined elsewhere (not shown).
# Special-token prefix: IDs and the tokens they are intended to represent.
prefix_token_ids = [50258, 50259, 50359, 50363]
prefix_tokens = ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
prefix_tensor = torch.tensor(prefix_token_ids).unsqueeze(0)

# Teacher forcing: the decoder sees tokens [0..n-1] and is scored on tokens [1..n].
full_sequence_token_ids = torch.cat([prefix_tensor, target_token_ids], dim=1)
decoder_input_token_ids = full_sequence_token_ids[:, :-1]
target_token_ids_to_eval = full_sequence_token_ids[:, 1:]

logits = model.decoder(decoder_input_token_ids, model.encoder(mel))
log_probs = F.log_softmax(logits, dim=-1)

# Print the top-5 predicted tokens and the target token at each position.
for i in range(logits.shape[1]):
    pred_top = logits[0, i].topk(5).indices
    pred_top_probs = np.exp(log_probs[0, i, pred_top].detach().cpu().numpy())
    print(f"Time step {i}: top tokens = {[tokenizer.decode([t]) for t in pred_top]}, probs={pred_top_probs}")
    print(f"Target token: {tokenizer.decode([target_token_ids_to_eval[0, i]])}")

Outputs:

Time step 0: top tokens = ['<|en|>', '<|notimestamps|>', '<|la|>', '<|de|>', '<|fr|>'], probs=[9.9240476e-01 4.9215057e-03 4.9104204e-04 2.9613255e-04 2.5190943e-04]
Target token: <|en|>
Time step 1: top tokens = ['<|startoflm|>', '<|transcribe|>', '', '<|nospeech|>', '<|endoftext|>'], probs=[9.9999833e-01 1.6491989e-06 2.2530656e-08 1.3922099e-08 4.4615808e-10]
Target token: <|transcribe|>
Time step 2: top tokens = ['', '', '', '', ''], probs=[5.7725841e-01 4.1457579e-01 5.2851834e-04 5.1052548e-04 4.3363243e-04]
Target token: <|notimestamps|>
Time step 3: top tokens = ['<|endoftext|>', ',', ' and', ' -', ' ...'], probs=[1.0000000e+00 3.1554283e-11 2.2344248e-11 1.2954151e-11 1.1298531e-11]
Target token:  They
Time step 4: top tokens = [' regain', ' reg', ' began', ' re', ' begin'], probs=[0.6244507  0.309043   0.00685263 0.00403515 0.00258522]
Target token:  regain
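
A possible explanation, not confirmed here: the special-token IDs may differ between checkpoints. As far as can be told from whisper/tokenizer.py, the control tokens are numbered consecutively after the language tokens, and large-v3 (and turbo, which derives from it) has 100 language tokens where earlier multilingual models have 99. If so, every control token shifts by one, and the ID that a 99-language tokenizer decodes as <|startoflm|> would mean <|transcribe|> to these models. The sketch below only reproduces that arithmetic. The helper `special_token_ids` is hypothetical, and the layout and the <|endoftext|> ID are assumptions, not documented behavior.

```python
# Assumed layout of Whisper's multilingual special tokens (an assumption, not from docs):
# <|endoftext|>, <|startoftranscript|>, one token per language, then the control tokens.
END_OF_TEXT_ID = 50257  # assumed ID of <|endoftext|> in the multilingual vocabulary

CONTROL_TOKENS = [
    "<|translate|>", "<|transcribe|>", "<|startoflm|>",
    "<|startofprev|>", "<|nospeech|>", "<|notimestamps|>",
]


def special_token_ids(num_languages: int) -> dict[str, int]:
    """Hypothetical helper: map control tokens to IDs for a given language count."""
    start_of_transcript = END_OF_TEXT_ID + 1
    # Language tokens occupy the IDs right after <|startoftranscript|>.
    first_control = start_of_transcript + 1 + num_languages
    ids = {"<|startoftranscript|>": start_of_transcript}
    for offset, name in enumerate(CONTROL_TOKENS):
        ids[name] = first_control + offset
    return ids


older = special_token_ids(99)   # assumed: multilingual models before large-v3
newer = special_token_ids(100)  # assumed: large-v3 and turbo

for name in CONTROL_TOKENS:
    print(f"{name:18s} 99 languages: {older[name]}   100 languages: {newer[name]}")

# What the hard-coded IDs 50359 and 50363 would mean under the 100-language layout.
by_id = {token_id: name for name, token_id in newer.items()}
print(by_id[50359], by_id[50363])
```

Under these assumptions, the hard-coded prefix IDs 50359 and 50363 would be read by turbo and large-v3 as <|translate|> and <|nospeech|>. In the library itself, `whisper.tokenizer.get_tokenizer` takes a `num_languages` argument and the loaded model exposes `model.num_languages`, so building the tokenizer from the model instead of hard-coding IDs may avoid the mismatch.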

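For reference, the quantity being computed is the sum of per-token log-probabilities of the target tokens under teacher forcing. The following is a minimal NumPy sketch on toy logits, independent of Whisper. The function `sequence_log_prob` and its `skip` parameter are hypothetical names introduced for illustration.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def sequence_log_prob(logits, targets, skip=0):
    """Sum of log P(targets[i] | decoder inputs up to i) over positions i >= skip.

    logits:  (seq_len, vocab_size) decoder outputs for one sequence
    targets: (seq_len,) token IDs, shifted one step ahead of the decoder input
    skip:    number of leading positions to ignore (e.g. special-token prefix)
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    log_probs = log_softmax(logits)
    # Pick the log-probability of the target token at each position.
    per_token = log_probs[np.arange(len(targets)), targets]
    return float(per_token[skip:].sum())


# Toy example: 3 positions, vocabulary of 4 tokens, uniform logits.
toy_logits = np.zeros((3, 4))
toy_targets = np.array([1, 0, 3])
print(sequence_log_prob(toy_logits, toy_targets))          # all three positions
print(sequence_log_prob(toy_logits, toy_targets, skip=1))  # ignore the first position
```

With the four-token prefix used in the snippet, the first three target positions are themselves prefix tokens, so `skip=3` would score only the text tokens.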