Getting chunk level output with start and end timestamps with Whisper #2275

meg261995 · 2024-07-24T13:07:33Z

I am using the Whisper large-v3 model to transcribe several audio files. However, the output I get from model.generate is a tensor of token IDs. I would like to obtain text chunks with their corresponding start and end timestamps instead. Can someone please help me get this output with the approach shown below? I do get the desired output if I use pipeline with the "AutoModelForSpeechSeq2Seq" class, but not when I use "WhisperForConditionalGeneration" directly, as in the code below.

import librosa
from transformers import WhisperForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# audio_path is the path to the audio file; load it resampled to 16 kHz
audio, sr = librosa.load(audio_path, sr=16000)

# Keep the full, untruncated audio and request an attention mask
inputs = processor(audio, return_tensors="pt", truncation=False, padding="longest", return_attention_mask=True, sampling_rate=sr)

# result is a tensor of token IDs; decoded is a list of plain-text strings
result = model.generate(**inputs)
decoded = processor.batch_decode(result, skip_special_tokens=True)
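
One possible way to get chunk-level timestamps while still loading the model through WhisperForConditionalGeneration is to hand the loaded model and the processor's tokenizer and feature extractor to the automatic-speech-recognition pipeline with return_timestamps=True. The sketch below is untested against this exact setup and rests on a few assumptions: that the pipeline result carries a "chunks" list of {"timestamp": (start, end), "text": ...} entries, and that chunk_length_s=30 is a suitable setting for long audio. The helper names format_chunks and transcribe_with_timestamps are made up for illustration.

```python
def format_chunks(chunks):
    """Turn pipeline-style chunks into '[start - end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # The end of the last chunk may be None when the audio is cut off mid-segment.
        end_str = f"{end:.2f}s" if end is not None else "end"
        lines.append(f"[{start:.2f}s - {end_str}] {chunk['text'].strip()}")
    return lines


def transcribe_with_timestamps(audio_path, model_id="openai/whisper-large-v3"):
    """Hypothetical helper: returns a list of chunks with timestamps."""
    # Local imports so format_chunks works without the heavy dependencies.
    import librosa
    from transformers import AutoProcessor, WhisperForConditionalGeneration, pipeline

    processor = AutoProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,        # assumption: 30 s windows for long audio
        return_timestamps=True,   # ask for segment-level timestamps
    )

    # Load at 16 kHz, as in the original snippet, and pass the raw samples.
    audio, sr = librosa.load(audio_path, sr=16000)
    result = asr({"raw": audio, "sampling_rate": sr})
    return result["chunks"]
```

Calling format_chunks(transcribe_with_timestamps("sample.wav")), where "sample.wav" is a placeholder path, would then give one line per chunk. Whether this matches the output of the AutoModelForSpeechSeq2Seq pipeline exactly may depend on the installed transformers version.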

Replies: 0 comments