import whisper
model = whisper.load_model("turbo")
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
I am using the example above to detect the language of an audio file. If the audio begins with more than 30 seconds of silence, detection performs poorly and the reported confidence is low, because only the first 30 seconds are analyzed. I wonder whether adding an offset option to whisper.pad_or_trim would help. With it, we could skip the silent portion whenever the detection confidence is low.
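To make the idea concrete, here is a rough sketch of the behaviour I have in mind, done outside the library by slicing the audio before it reaches whisper.pad_or_trim. The names window_at_offset, detect_language_with_offset, min_confidence, and step_seconds are hypothetical and not part of Whisper. The sketch assumes the 16 kHz sample rate that whisper.load_audio resamples to and the default 30-second window.

```python
SAMPLE_RATE = 16000   # samples per second produced by whisper.load_audio
CHUNK_SECONDS = 30    # default window length used by whisper.pad_or_trim


def window_at_offset(audio, offset_seconds, sample_rate=SAMPLE_RATE,
                     chunk_seconds=CHUNK_SECONDS):
    """Return up to chunk_seconds of audio starting at offset_seconds.

    Hypothetical helper: it only slices, padding is left to pad_or_trim.
    """
    start = int(offset_seconds * sample_rate)
    stop = start + chunk_seconds * sample_rate
    return audio[start:stop]


def detect_language_with_offset(audio, detect, min_confidence=0.5,
                                step_seconds=CHUNK_SECONDS,
                                sample_rate=SAMPLE_RATE):
    """Try successive windows until detection is confident enough.

    `detect` is a caller-supplied function that maps one audio window to a
    {language: probability} dict. Returns (language, probability,
    offset_seconds) for the most confident window seen.
    """
    best = (None, 0.0, 0)
    offset = 0
    while offset * sample_rate < len(audio):
        probs = detect(window_at_offset(audio, offset, sample_rate))
        language = max(probs, key=probs.get)
        if probs[language] > best[1]:
            best = (language, probs[language], offset)
        if probs[language] >= min_confidence:
            break  # confident enough, stop scanning
        offset += step_seconds
    return best


# Possible wiring with the snippet above (model and audio as loaded there):
# def detect(window):
#     window = whisper.pad_or_trim(window)
#     mel = whisper.log_mel_spectrogram(window, n_mels=model.dims.n_mels)
#     _, probs = model.detect_language(mel.to(model.device))
#     return probs
#
# language, prob, offset = detect_language_with_offset(audio, detect)
```

The 0.5 threshold is an arbitrary placeholder and would need tuning. A built-in offset option would make this wrapper unnecessary.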