How to send audio to Whisper in a numpy array ? #450

elpidiovaldez · 2022-11-02T05:45:40Z

I want to send speech to Whisper as a numpy array. The documentation says this is possible, but I do not get correct transcriptions. I probably misunderstand the required format. I read PCM audio in 5-second chunks and convert each chunk to a 32-bit float numpy array, which should then be padded to 30 seconds and passed to 'transcribe'. Any help would be appreciated. Here is my test code:

import numpy as np
import pyaudio
import whisper

RATE = 16000
CHUNK_SIZE = 16000*5
FORMAT = pyaudio.paInt16
FORMATOUT = pyaudio.paInt16

def test():
    model = whisper.load_model("base")
    
    p = pyaudio.PyAudio()
    
    streamIn = p.open(
        format=FORMAT, channels=1, rate=RATE,
        input=True, output=True,
        frames_per_buffer=CHUNK_SIZE
    )
    while True:
        data = streamIn.read(CHUNK_SIZE)
        audio = np.frombuffer(data, np.int16).astype(np.float32)

        # load audio and pad/trim it to fit 30 seconds
        #audio = whisper.load_audio("/home/paul/temp.wav")
        audio = whisper.pad_or_trim(audio)

        # make log-Mel spectrogram and move to the same device as the model
        mel = whisper.log_mel_spectrogram(audio).to(model.device)

        # detect the spoken language
        _, probs = model.detect_language(mel)
        print(f"Detected language: {max(probs, key=probs.get)}")

        # decode the audio
        options = whisper.DecodingOptions(fp16=False)
        result = whisper.decode(model, mel, options)

        # print the recognized text
        print(result.text)

Answered by jianfch

jianfch · 2022-11-02T17:21:02Z

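It looks like you're missing a / 32768.0. Dividing by 32768 (2^15) scales the int16 samples to floats in the range [-1.0, 1.0). Also make sure audio has only one dimension. For reference, this is how Whisper itself converts the decoded bytes:

whisper/whisper/audio.py

Line 49 in 9f70a35

return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

Note that the results are expected to be disjointed, and may have missing words, because of the 5-second chunking.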
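One possible way to soften the chunking problem mentioned above is to keep a rolling window of recent audio and transcribe the window rather than each isolated 5-second chunk. The sketch below is not from this thread: the class name `RollingAudioBuffer` and its parameters are hypothetical, and it assumes the chunks are already 1-D float32 arrays at 16 kHz. Consecutive windows overlap, so the transcriptions will repeat text that the caller has to reconcile.

```python
import numpy as np

SAMPLE_RATE = 16000      # Whisper works on 16 kHz mono audio
WINDOW_SECONDS = 30      # length that pad_or_trim pads or trims to
MAX_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS


class RollingAudioBuffer:
    """Hypothetical helper: keeps only the most recent max_samples samples."""

    def __init__(self, max_samples: int = MAX_SAMPLES):
        self.max_samples = max_samples
        self.samples = np.zeros(0, dtype=np.float32)

    def append(self, chunk: np.ndarray) -> np.ndarray:
        """Add a 1-D chunk and return the current window as float32."""
        self.samples = np.concatenate([self.samples, chunk.astype(np.float32)])
        # Drop the oldest samples once the window is longer than allowed.
        self.samples = self.samples[-self.max_samples:]
        return self.samples
```

In the loop from the question, each converted chunk would go through `append`, and the returned window would be passed to `whisper.pad_or_trim` in place of the single chunk.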

4 replies

Ca-ressemble-a-du-fake Nov 3, 2022

Can you elaborate on where this 2^15 comes from?

Ca-ressemble-a-du-fake Nov 3, 2022

Sorry, I did not see that it was in the code! And an explanation is here

elpidiovaldez Nov 4, 2022
Author

Thanks for the answer, you have saved me a lot of time. It was indeed the scaling that I was missing. In my case 'flatten' was not necessary, but I have seen it used in other examples, and it does no harm. The working line in my test function is:
audio = np.frombuffer(data, np.int16).astype(np.float32) * (1 / 32768.0)
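The fix above can be wrapped in a small helper so the scaling is not forgotten again. This is a minimal sketch: the function name `pcm16_to_float32` is hypothetical, and it assumes the bytes are signed 16-bit PCM in the machine's native byte order, which is what `np.frombuffer(..., np.int16)` reads.

```python
import numpy as np


def pcm16_to_float32(data: bytes) -> np.ndarray:
    """Convert raw signed 16-bit PCM bytes to a 1-D float32 array in [-1.0, 1.0)."""
    samples = np.frombuffer(data, dtype=np.int16)
    # int16 spans -32768..32767, so dividing by 2**15 maps it into [-1.0, 1.0).
    return samples.astype(np.float32) / 32768.0
```

With this in place, the line in the loop would become `audio = pcm16_to_float32(data)`, and the result can go to `whisper.pad_or_trim` as before.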

MarkoMilos Aug 22, 2023

In what case will audio have 2 dimensions, so that flatten() is required? As far as I understand, if the audio has 2 channels it will still be a 1-dimensional array with interleaved channel values... right?
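On the question above: `np.frombuffer` always returns a 1-D array, so for a raw byte buffer `flatten()` changes nothing, and a 2-channel buffer does come back as one interleaved 1-D array. The sketch below shows one way to turn such a buffer into mono. The function name `interleaved_to_mono` is hypothetical, and averaging the channels is an assumption about the desired downmix, not something stated in this thread.

```python
import numpy as np


def interleaved_to_mono(data: bytes, channels: int) -> np.ndarray:
    """Downmix interleaved signed 16-bit PCM to a 1-D float32 mono array."""
    samples = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
    # Interleaved layout is L0, R0, L1, R1, ... so each row becomes one frame.
    frames = samples.reshape(-1, channels)
    # Average across channels to get one value per frame.
    return frames.mean(axis=1)
```

Passing the interleaved array straight to Whisper would treat left and right samples as consecutive points in time, so a downmix step like this is needed whenever the stream is opened with more than one channel.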
