VAD #96

vanheerdena · 2022-09-24T12:02:18Z

I'm looking to use Whisper for voice activity detection (VAD) only. Anyone able to point me in the right direction as to how I detect presence or absence of speech in an audio clip using this model?

Answered by jongwook

IpsumDominum · 2022-09-26T08:22:15Z

I don't think VAD is built into the code base. One way to do it (what I'd do) is to check whether the output is an empty string. For this purpose you can turn off the beam search parameters so that decoding is greedy, which speeds things up.
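A minimal sketch of the empty-string check described above. `has_speech` is a hypothetical helper, not part of Whisper; it takes the transcribe function as an argument so it can be tried without loading a model. Passing `temperature=0.0` with `beam_size` and `best_of` unset is assumed here to give a single greedy decoding pass. Treat the result as a rough signal only, since a model can still emit text for non-speech audio.

```python
from typing import Any, Callable, Dict


def has_speech(audio_path: str, transcribe: Callable[..., Dict[str, Any]]) -> bool:
    """Return True if the transcription of audio_path contains any text.

    `transcribe` is expected to behave like whisper's model.transcribe:
    it takes an audio path plus keyword options and returns a dict
    with a "text" key.
    """
    # A single temperature and no beam search: one greedy decoding pass.
    result = transcribe(audio_path, temperature=0.0, beam_size=None, best_of=None)
    # Whitespace-only output counts as "no speech".
    return bool(result["text"].strip())


# Usage with the real library (assumes the openai-whisper package is installed):
#   import whisper
#   model = whisper.load_model("base")
#   print(has_speech("clip.wav", model.transcribe))
```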

2 replies

creatorrr Sep 26, 2022

@IpsumDominum Would it be advisable to use a separate VAD to split the audio before feeding it to the Whisper models? In real-time scenarios it is useful to segment audio cheaply based on activity before running inference.

IpsumDominum Sep 26, 2022

I think it depends. If your external VAD model is better than Whisper's <|nospeech|> detection, then go for it.
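To illustrate the pre-segmentation idea from the replies above, here is a toy energy-threshold detector in plain Python. This is a stand-in for illustration, not one of the VAD tools anyone in this thread is using, and a trained VAD will generally do much better. The function name, frame size, and threshold are all assumptions.

```python
from typing import List, Sequence, Tuple


def energy_vad(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,
    threshold: float = 0.01,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) regions whose frame RMS is >= threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    regions: List[Tuple[float, float]] = []
    start = None  # sample index where the current active region began
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= threshold and start is None:
            start = i  # region opens at this frame
        elif rms < threshold and start is not None:
            regions.append((start / sample_rate, i / sample_rate))
            start = None  # region closed by a quiet frame
    if start is not None:
        # Audio ended while still inside an active region.
        regions.append((start / sample_rate, len(samples) / sample_rate))
    return regions


# Synthetic signal: 0.3 s of silence, 0.3 s of constant amplitude, 0.3 s of silence.
signal = [0.0] * 300 + [0.5] * 300 + [0.0] * 300
print(energy_vad(signal, sample_rate=1000))
```

Only the regions returned here would then be passed on for transcription, which is where the savings in a real-time setup would come from.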

jongwook · 2022-09-26T08:47:48Z

Maintainer

In the ["segments"] field of the dictionary returned by transcribe(), each item holds segment-level details, including no_speech_prob, the probability of the <|nospeech|> token. Combined with the log probability threshold and the compression ratio threshold, this performs a crude VAD inside transcribe(), but you might get better results by pairing it with a separate, more accurate VAD tool.
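A small sketch of reading no_speech_prob per segment, assuming a result dictionary shaped like the one transcribe() returns. `flag_segments` is a hypothetical helper, the 0.6 cutoff is an assumption you should tune, and the example values are made up for illustration.

```python
from typing import Any, Dict, List


def flag_segments(result: Dict[str, Any], threshold: float = 0.6) -> List[Dict[str, Any]]:
    """Label each segment of a transcribe()-style result as speech or not."""
    report = []
    for seg in result["segments"]:
        report.append({
            "start": seg["start"],
            "end": seg["end"],
            "no_speech_prob": seg["no_speech_prob"],
            # Low probability of <|nospeech|> is read as "speech present".
            "is_speech": seg["no_speech_prob"] < threshold,
        })
    return report


# Made-up values, for illustration only.
example = {"segments": [
    {"start": 0.0, "end": 2.0, "no_speech_prob": 0.02},
    {"start": 2.0, "end": 5.0, "no_speech_prob": 0.91},
]}
for row in flag_segments(example):
    print(row)

# With the real library (assumes the openai-whisper package is installed):
#   import whisper
#   result = whisper.load_model("base").transcribe("clip.wav")
#   print(flag_segments(result))
```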

2 replies

madroidmaq Mar 14, 2023

Here are some test results from my end, to help others see what these fields look like.

Curl:

curl https://api.openai.com/v1/audio/translations \
    -X POST \
    -H 'Authorization: Bearer TOKEN' \
    -H 'Content-Type: multipart/form-data' \
    -F file=@AUDIO_FILE \
    -F model=whisper-1 \
    -F response_format=verbose_json

Response:

{
    "task": "translate",
    "language": "english",
    "duration": 6.3,
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 3.2800000000000002,
            "text": " Starting in 3, 2, 1.",
            "tokens": [
                16217,
                294,
                805,
                11,
                568,
                11,
                502,
                13
            ],
            "temperature": 0.0,
            "avg_logprob": -0.5764781550357216,
            "compression_ratio": 0.925,
            "no_speech_prob": 0.025908704847097397,
            "transient": false
        },
        {
            "id": 1,
            "seek": 328,
            "start": 3.28,
            "end": 29.28,
            "text": " 3 plus 7 plus 1.",
            "tokens": [
                805,
                1804,
                1614,
                1804,
                502,
                13
            ],
            "temperature": 0.0,
            "avg_logprob": -0.4856548309326172,
            "compression_ratio": 0.8421052631578947,
            "no_speech_prob": 0.0017739988397806883,
            "transient": false
        }
    ],
    "text": "Starting in 3, 2, 1. 3 plus 7 plus 1."
}

LukasNel Dec 24, 2023

How would you use the logprob together with the no-speech prob? Would you subtract the two, subtract the log of the no-speech prob from the logprob, or take the ratio?
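As far as I understand the open-source transcribe(), the two values are not combined arithmetically. Each is compared against its own threshold: a window is treated as silent when no_speech_prob exceeds no_speech_threshold, unless avg_logprob is above logprob_threshold, in which case the text is kept. Below is a minimal sketch of that rule. `is_silent` is a hypothetical helper, and the defaults of 0.6 and -1.0 are what I believe transcribe() uses, so check them against your installed version. The segment values are copied from the response above.

```python
from typing import Any, Dict, Optional


def is_silent(
    segment: Dict[str, Any],
    no_speech_threshold: Optional[float] = 0.6,
    logprob_threshold: Optional[float] = -1.0,
) -> bool:
    """Apply the two thresholds independently; no subtraction or ratio."""
    if no_speech_threshold is None:
        return False  # silence detection disabled
    silent = segment["no_speech_prob"] > no_speech_threshold
    # A confident decoding overrides the no-speech signal.
    if logprob_threshold is not None and segment["avg_logprob"] > logprob_threshold:
        silent = False
    return silent


segments = [
    {"avg_logprob": -0.5764781550357216, "no_speech_prob": 0.025908704847097397},
    {"avg_logprob": -0.4856548309326172, "no_speech_prob": 0.0017739988397806883},
]
print([is_silent(s) for s in segments])  # [False, False]
```

Both segments in that response are kept as speech: their no_speech_prob is far below the cutoff, so the log-probability check never comes into play.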
