Get (relative) timestamps for whisper.decode (no transcribe call) #1141

SinanAkkoyun · 2023-03-23T03:25:39Z

SinanAkkoyun
Mar 23, 2023

Hello!
I need to get timestamps with the following code:

audio = whisper.load_audio(input_audio)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    options = whisper.DecodingOptions(prompt=prompt, max_initial_timestamp=None, without_timestamps=False)
    result = whisper.decode(model, mel, options)

How can I extract the timestamps there without calling "model.transcribe()"?
I tried to look at the timing.py but still don't know how...

Thank you very much!

SinanAkkoyun · 2023-03-24T18:11:42Z

SinanAkkoyun
Mar 24, 2023
Author

I did it myself!

here is my totally bad code, but it works:

def transcribe_test(input_audio, prompt=""):
    prepend_punctuations: str = "\"'“¿([{-",
    append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
    global tokenizer
    audio = whisper.load_audio(input_audio)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    options = whisper.DecodingOptions(prompt=prompt, max_initial_timestamp=None, without_timestamps=False)
    result = whisper.decode(model, mel, options)

    
    if tokenizer is None:
      tokenizer = get_tokenizer(multilingual=model.is_multilingual, language='en', task=options.task)

    text_tokens = [tokenizer.decode([t]) for t in result.tokens]
    colored_text = get_colored_text(text_tokens, result.token_probs, tokenizer, prompt)

    starttime=time.time()
    segments = [{"seek": 0, "start": 0, "end": len(audio) / SAMPLE_RATE, "tokens": result.tokens}]
    add_word_timestamps(
        segments=segments,
        model=model,
        tokenizer=tokenizer,
        mel=mel,
        num_frames=mel.shape[-1],
        prepend_punctuations=prepend_punctuations,
        append_punctuations=append_punctuations,
    )
    word_timestamps = segments[0]["words"]
    print(f"time: {time.time() - starttime}")

    return result.text, colored_text, text_tokens, result.token_probs, word_timestamps

1 reply

aybakana Jul 8, 2023

I guess fp16 must be set to False in options otherwise I get this error -> RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

I also got get_colored_text is not defined error. I guess you defined it by yourself.

I also needed to add the following before defining the function;

from whisper.tokenizer import get_tokenizer
from whisper.timing import add_word_timestamps
tokenizer = None
SAMPLE_RATE = 16000

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Get (relative) timestamps for whisper.decode (no transcribe call) #1141

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

Get (relative) timestamps for whisper.decode (no transcribe call) #1141

Uh oh!

SinanAkkoyun Mar 23, 2023

Replies: 1 comment · 1 reply

Uh oh!

SinanAkkoyun Mar 24, 2023 Author

Uh oh!

Uh oh!

aybakana Jul 8, 2023

SinanAkkoyun
Mar 23, 2023

Replies: 1 comment 1 reply

SinanAkkoyun
Mar 24, 2023
Author