How to directly transcribe an in-memory audio file? #908
-
I'm using it on Colab:

```python
current_size = 'small'
audio = AudioSegment.from_file(filee, format="mp3")
result = model.transcribe(data)
```

It throws this error at me:

```
2 frames
/usr/local/lib/python3.8/dist-packages/whisper/transcribe.py in transcribe(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, **decode_options)
/usr/local/lib/python3.8/dist-packages/whisper/audio.py in log_mel_spectrogram(audio, n_mels)
/usr/local/lib/python3.8/dist-packages/torch/functional.py in stft(input, n_fft, hop_length, win_length, window, center, pad_mode, normalized, onesided, return_complex)
RuntimeError: "reflection_pad1d" not implemented for 'Short'
```
-
It appears that `audio` is in int16 dtype, whereas Whisper expects float32 or float16. You may try converting it to a float32 array and dividing it by 32768, similar to what's done in audio.py (whisper/whisper/audio.py, line 49 at 5c1a8c1).
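For illustration, here's a minimal sketch of that conversion, assuming the audio was decoded with pydub as in the question ("input.mp3" is a placeholder path, and 16-bit samples are assumed since that's what ffmpeg-backed MP3 decoding typically produces):

```python
import numpy as np
import whisper
from pydub import AudioSegment

model = whisper.load_model("small")

# "input.mp3" is a placeholder path
seg = AudioSegment.from_file("input.mp3", format="mp3")
seg = seg.set_frame_rate(16000).set_channels(1)  # Whisper expects 16 kHz mono

# get_array_of_samples() yields int16 values for 16-bit audio;
# dividing by 32768 scales them to float32 in [-1.0, 1.0)
samples = np.array(seg.get_array_of_samples())
audio = samples.astype(np.float32) / 32768.0

result = model.transcribe(audio)
```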
-
I was having the same issue! I ended up making a modified version of whisper's `load_audio`:

```python
import ffmpeg
import numpy as np


def load_audio(file_bytes: bytes, sr: int = 16_000) -> np.ndarray:
    """
    Use file's bytes and transform to mono waveform, resampling as necessary

    Parameters
    ----------
    file_bytes: bytes
        The bytes of the audio file

    sr: int
        The sample rate to resample the audio if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
    try:
        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
        out, _ = (
            ffmpeg.input('pipe:', threads=0)
            .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run_async(pipe_stdin=True, pipe_stdout=True)
        ).communicate(input=file_bytes)
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    # Decode the little-endian 16-bit PCM stream and scale to float32 in [-1.0, 1.0)
    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0


with open('input.mp3', 'rb') as f:
    file_bytes = f.read()

audio = load_audio(file_bytes)
result = model.transcribe(audio)
```
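To tie this back to the original question, here's a hedged sketch of feeding a fully in-memory pydub `AudioSegment` through the same function; `seg`, the buffer, and "input.mp3" are illustrative names, and `model` is assumed to be loaded via `whisper.load_model(...)`:

```python
import io

from pydub import AudioSegment

# "input.mp3" is a placeholder; in practice the segment could come
# from any in-memory source (upload, microphone capture, etc.)
seg = AudioSegment.from_file("input.mp3", format="mp3")

# Serialize the in-memory segment to WAV bytes without touching disk,
# then decode those bytes with the load_audio() defined above
buf = io.BytesIO()
seg.export(buf, format="wav")
audio = load_audio(buf.getvalue())

result = model.transcribe(audio)
```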