How to directly transcribe an in-memory audio file? #908
-
I'm using it on Colab:

```python
current_size = 'small'
audio = AudioSegment.from_file(filee, format="mp3")
result = model.transcribe(data)
```

It throws this error at me:

```
2 frames
/usr/local/lib/python3.8/dist-packages/whisper/transcribe.py in transcribe(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, **decode_options)
/usr/local/lib/python3.8/dist-packages/whisper/audio.py in log_mel_spectrogram(audio, n_mels)
/usr/local/lib/python3.8/dist-packages/torch/functional.py in stft(input, n_fft, hop_length, win_length, window, center, pad_mode, normalized, onesided, return_complex)
RuntimeError: "reflection_pad1d" not implemented for 'Short'
```
-
It appears that `audio` is in int16 dtype, whereas Whisper expects float32 or float16. You may try converting it to a float32 array and dividing it by 32768, similar to what's done in audio.py (whisper/whisper/audio.py, line 49 at 5c1a8c1).
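For illustration, here's a minimal sketch of that conversion, assuming the audio was decoded with pydub as in the question ("input.mp3" is a placeholder path, and 16-bit samples are assumed since that's what ffmpeg-backed MP3 decoding typically produces):

```python
import numpy as np
import whisper
from pydub import AudioSegment

model = whisper.load_model("small")

# "input.mp3" is a placeholder path
seg = AudioSegment.from_file("input.mp3", format="mp3")
seg = seg.set_frame_rate(16000).set_channels(1)  # Whisper expects 16 kHz mono

# get_array_of_samples() yields int16 values for 16-bit audio;
# dividing by 32768 scales them to float32 in [-1.0, 1.0)
samples = np.array(seg.get_array_of_samples())
audio = samples.astype(np.float32) / 32768.0

result = model.transcribe(audio)
```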
-
I was having the same issue! I ended up making a modified version of whisper's `load_audio`:

```python
import ffmpeg
import numpy as np


def load_audio(file_bytes: bytes, sr: int = 16_000) -> np.ndarray:
    """
    Use file's bytes and transform to mono waveform, resampling as necessary

    Parameters
    ----------
    file_bytes: bytes
        The bytes of the audio file

    sr: int
        The sample rate to resample the audio if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
    try:
        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
        out, _ = (
            ffmpeg.input('pipe:', threads=0)
            .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run_async(pipe_stdin=True, pipe_stdout=True)
        ).communicate(input=file_bytes)
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    # Decode the little-endian 16-bit PCM stream and scale to float32 in [-1.0, 1.0)
    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0


with open('input.mp3', 'rb') as f:
    file_bytes = f.read()

audio = load_audio(file_bytes)
result = model.transcribe(audio)
```
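To tie this back to the original question, here's a hedged sketch of feeding a fully in-memory pydub `AudioSegment` through the same function; `seg`, the buffer, and "input.mp3" are illustrative names, and `model` is assumed to be loaded via `whisper.load_model(...)`:

```python
import io

from pydub import AudioSegment

# "input.mp3" is a placeholder; in practice the segment could come
# from any in-memory source (upload, microphone capture, etc.)
seg = AudioSegment.from_file("input.mp3", format="mp3")

# Serialize the in-memory segment to WAV bytes without touching disk,
# then decode those bytes with the load_audio() defined above
buf = io.BytesIO()
seg.export(buf, format="wav")
audio = load_audio(buf.getvalue())

result = model.transcribe(audio)
```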