Server crashes or returns part of previous result when using VAD and no voice activity detected in the sample #3595

@j4-c4

There are two variants of this bug: the server either crashes or returns part of the previous request's result.

Server command:

.\whisper-server.exe --model ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto

Request:

PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>>   file = Get-Item '.\no_speech.wav'
>>   response_format = 'verbose_json'
>> }
Invoke-RestMethod: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
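For anyone reproducing this without PowerShell, the request above can be approximated in Python with only the standard library. This is a minimal sketch, not a tested client: the URL and the form fields (file, response_format) are taken from the command above, while the helper names build_multipart and transcribe, the audio/wav content type, and the assumption that the server replies with JSON are illustrative choices.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, file_name, file_bytes, boundary):
    """Encode text fields plus one file as a multipart/form-data body."""
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode()
        )
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{file_name}"\r\n'.encode()
    )
    # Assumption: the sample files are WAV, so audio/wav is used here.
    parts.append(b"Content-Type: audio/wav\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts)


def transcribe(path, url="http://127.0.0.1:8080/inference"):
    """POST one audio file to the server and return the decoded JSON reply."""
    boundary = uuid.uuid4().hex
    with open(path, "rb") as f:
        body = build_multipart(
            {"response_format": "verbose_json"},
            "file",
            os.path.basename(path),
            f.read(),
            boundary,
        )
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling transcribe("no_speech.wav") against a freshly started server should hit the crash path, in which case urlopen is expected to raise a connection error rather than return a result.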

The server crashes immediately after logging "whisper_vad_segments_from_probs: Final speech segments after filtering: 0". Full log below:

Log
PS E:\Libraries\Downloads\whisper\whisper-cublas-12.4.0-bin-x64\Release> .\whisper-server.exe --model .\ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 980, compute capability 5.2, VMM: yes
whisper_init_from_file_with_params_no_state: loading model from '.\ggml-large-v3-turbo.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 2
whisper_init_with_params_no_state: backends   = 2
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:        CUDA0 total size =  1623.92 MB
whisper_model_load: model size    = 1623.92 MB
whisper_backend_init_gpu: using CUDA0 backend
whisper_init_state: kv self size  =   10.49 MB
whisper_init_state: kv cross size =   31.46 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: compute buffer (conv)   =   37.69 MB
whisper_init_state: compute buffer (encode) =   55.35 MB
whisper_init_state: compute buffer (cross)  =    9.27 MB
whisper_init_state: compute buffer (decode) =  100.04 MB

whisper server listening at http://127.0.0.1:8080

Received request: no_speech.wav
Successfully loaded no_speech.wav

system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 520 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |

operator (): processing 'no_speech.wav' (488789 samples, 30.5 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...

Running whisper.cpp inference on no_speech.wav
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params: loading VAD model from '.\ggml-silero-v6.2.0.bin'
whisper_vad_init_with_params: model type: silero-16k
whisper_vad_init_with_params: model version: 6.2.0
whisper_vad_init_with_params: n_encoder_layers = 4
whisper_vad_init_with_params: encoder_in_channels[0] = 129
whisper_vad_init_with_params: encoder_in_channels[1] = 128
whisper_vad_init_with_params: encoder_in_channels[2] = 64
whisper_vad_init_with_params: encoder_in_channels[3] = 64
whisper_vad_init_with_params: encoder_out_channels[0] = 128
whisper_vad_init_with_params: encoder_out_channels[1] = 64
whisper_vad_init_with_params: encoder_out_channels[2] = 64
whisper_vad_init_with_params: encoder_out_channels[3] = 128
whisper_vad_init_with_params: lstm_input_size = 128
whisper_vad_init_with_params: lstm_hidden_size = 128
whisper_vad_init_with_params: final_conv_in = 128
whisper_vad_init_with_params: final_conv_out = 1
whisper_vad_init_with_params:          CPU total size =     0.88 MB
whisper_vad_init_with_params: model size    =    0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_init_context: compute buffer (VAD)   =    1.60 MB
whisper_vad_segments_from_samples: detecting speech timestamps in 488789 samples
whisper_vad_detect_speech: detecting speech in 488789 samples
whisper_vad_detect_speech: n_chunks: 955
whisper_vad_detect_speech: props size: 955
whisper_vad_detect_speech: chunk_len: 341 < n_window: 512
whisper_vad_detect_speech: vad time = 164.73 ms processing 488789 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 955 probabilities
whisper_vad_segments_from_probs: Final speech segments after filtering: 0
<crash>

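If you send audio with speech first and then audio without speech, the server does not crash, but the second response repeats the language fields (language, detected_language, detected_language_probability, language_probabilities) from the first result.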

Requests:

PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>>   file = Get-Item '.\has_speech.wav'
>>   response_format = 'verbose_json'
>> }

task                          : transcribe
language                      : korean
duration                      : 179.989318847656
text                          :  <snip>

segments                      : <snip>
detected_language             : korean
detected_language_probability : 0.992537379264832
language_probabilities        : @{en=0.00533512607216835; ko=0.992537379264832}

PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>>   file = Get-Item '.\no_speech.wav'
>>   response_format = 'verbose_json'
>> }

task                          : transcribe
language                      : korean
duration                      : 30.5493125915527
text                          :
segments                      : {}
detected_language             : korean
detected_language_probability : 0.992537379264832
language_probabilities        : @{en=0.00533512607216835; ko=0.992537379264832}

This might be expected because I hadn't passed --no-context, so I tried that next.

Server command:

.\whisper-server.exe --model .\ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto --no-context

Requests+responses:

PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>>   file = Get-Item '.\has_speech.wav'
>>   response_format = 'verbose_json'
>> }

task                          : transcribe
language                      : korean
duration                      : 179.989318847656
text                          :  <snip>

segments                      : <snip>
detected_language             : korean
detected_language_probability : 0.992537379264832
language_probabilities        : @{en=0.00533512607216835; ko=0.992537379264832}

PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>>   file = Get-Item '.\no_speech.wav'
>>   response_format = 'verbose_json'
>> }

task                          : transcribe
language                      : korean
duration                      : 30.5493125915527
text                          :
segments                      : {}
detected_language             : korean
detected_language_probability : 0.992537379264832
language_probabilities        : @{en=0.00533512607216835; ko=0.992537379264832}

It still does it, which suggests (to me) that context is actually being kept despite --no-context?
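The carry-over can be checked mechanically by comparing the two responses. The sketch below is only an illustration: the dictionaries are hand-copied from the PowerShell output above, the assumption is that the underlying JSON uses a list for segments (PowerShell prints the empty one as {}), and stale_language_fields is a hypothetical helper name, not part of whisper.cpp.

```python
LANGUAGE_KEYS = (
    "language",
    "detected_language",
    "detected_language_probability",
    "language_probabilities",
)


def stale_language_fields(previous, current):
    """List language fields that a segment-less response shares with the previous one."""
    if current.get("segments"):
        # Speech was transcribed, so matching language fields are not suspicious.
        return []
    return [
        key
        for key in LANGUAGE_KEYS
        if key in previous and previous[key] == current.get(key)
    ]


# Values copied from the two responses in this report ("<snip>" kept as-is).
has_speech = {
    "task": "transcribe",
    "language": "korean",
    "duration": 179.989318847656,
    "text": "<snip>",
    "segments": ["<snip>"],
    "detected_language": "korean",
    "detected_language_probability": 0.992537379264832,
    "language_probabilities": {"en": 0.00533512607216835, "ko": 0.992537379264832},
}
no_speech = {
    "task": "transcribe",
    "language": "korean",
    "duration": 30.5493125915527,
    "text": "",
    "segments": [],
    "detected_language": "korean",
    "detected_language_probability": 0.992537379264832,
    "language_probabilities": {"en": 0.00533512607216835, "ko": 0.992537379264832},
}

print(stale_language_fields(has_speech, no_speech))
```

With the values above, all four language fields are reported, while text, segments and duration differ. A detection probability that matches to every printed digit across two different audio files is hard to explain unless the value was reused.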

Long log probably isn't useful in this case, but here it is anyway:

Log
PS E:\Libraries\Downloads\whisper\whisper-cublas-12.4.0-bin-x64\Release> .\whisper-server.exe --model .\ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto --no-context
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 980, compute capability 5.2, VMM: yes
whisper_init_from_file_with_params_no_state: loading model from '.\ggml-large-v3-turbo.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
whisper_init_with_params_no_state: devices    = 2
whisper_init_with_params_no_state: backends   = 2
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:        CUDA0 total size =  1623.92 MB
whisper_model_load: model size    = 1623.92 MB
whisper_backend_init_gpu: using CUDA0 backend
whisper_init_state: kv self size  =   10.49 MB
whisper_init_state: kv cross size =   31.46 MB
whisper_init_state: kv pad  size  =    7.86 MB
whisper_init_state: compute buffer (conv)   =   37.69 MB
whisper_init_state: compute buffer (encode) =   55.35 MB
whisper_init_state: compute buffer (cross)  =    9.27 MB
whisper_init_state: compute buffer (decode) =  100.04 MB

whisper server listening at http://127.0.0.1:8080

Received request: has_speech.wav
Successfully loaded has_speech.wav

system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 520 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |

operator (): processing 'has_speech.wav' (2879829 samples, 180.0 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...

Running whisper.cpp inference on has_speech.wav
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params: loading VAD model from '.\ggml-silero-v6.2.0.bin'
whisper_vad_init_with_params: model type: silero-16k
whisper_vad_init_with_params: model version: 6.2.0
whisper_vad_init_with_params: n_encoder_layers = 4
whisper_vad_init_with_params: encoder_in_channels[0] = 129
whisper_vad_init_with_params: encoder_in_channels[1] = 128
whisper_vad_init_with_params: encoder_in_channels[2] = 64
whisper_vad_init_with_params: encoder_in_channels[3] = 64
whisper_vad_init_with_params: encoder_out_channels[0] = 128
whisper_vad_init_with_params: encoder_out_channels[1] = 64
whisper_vad_init_with_params: encoder_out_channels[2] = 64
whisper_vad_init_with_params: encoder_out_channels[3] = 128
whisper_vad_init_with_params: lstm_input_size = 128
whisper_vad_init_with_params: lstm_hidden_size = 128
whisper_vad_init_with_params: final_conv_in = 128
whisper_vad_init_with_params: final_conv_out = 1
whisper_vad_init_with_params:          CPU total size =     0.88 MB
whisper_vad_init_with_params: model size    =    0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_init_context: compute buffer (VAD)   =    1.60 MB
whisper_vad_segments_from_samples: detecting speech timestamps in 2879829 samples
whisper_vad_detect_speech: detecting speech in 2879829 samples
whisper_vad_detect_speech: n_chunks: 5625
whisper_vad_detect_speech: props size: 5625
whisper_vad_detect_speech: chunk_len: 341 < n_window: 512
whisper_vad_detect_speech: vad time = 840.26 ms processing 2879829 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 5625 probabilities
whisper_vad_segments_from_probs: Merged 2 adjacent segments, now have 81 segments
whisper_vad_segments_from_probs: Final speech segments after filtering: 81
whisper_vad_segments_from_probs: VAD segment 0: start = 0.19, end = 2.30 (duration: 2.11)
whisper_vad_segments_from_probs: VAD segment 1: start = 2.79, end = 4.51 (duration: 1.72)
whisper_vad_segments_from_probs: VAD segment 2: start = 4.93, end = 6.43 (duration: 1.50)
whisper_vad_segments_from_probs: VAD segment 3: start = 6.98, end = 8.19 (duration: 1.21)
whisper_vad_segments_from_probs: VAD segment 4: start = 8.51, end = 9.95 (duration: 1.44)
whisper_vad_segments_from_probs: VAD segment 5: start = 10.40, end = 12.99 (duration: 2.59)
whisper_vad_segments_from_probs: VAD segment 6: start = 13.89, end = 15.39 (duration: 1.50)
whisper_vad_segments_from_probs: VAD segment 7: start = 15.78, end = 16.83 (duration: 1.05)
whisper_vad_segments_from_probs: VAD segment 8: start = 17.25, end = 19.13 (duration: 1.88)
whisper_vad_segments_from_probs: VAD segment 9: start = 19.59, end = 21.47 (duration: 1.88)
whisper_vad_segments_from_probs: VAD segment 10: start = 22.05, end = 23.13 (duration: 1.08)
whisper_vad_segments_from_probs: VAD segment 11: start = 23.39, end = 24.38 (duration: 0.99)
whisper_vad_segments_from_probs: VAD segment 12: start = 25.47, end = 27.97 (duration: 2.50)
whisper_vad_segments_from_probs: VAD segment 13: start = 28.16, end = 32.93 (duration: 4.77)
whisper_vad_segments_from_probs: VAD segment 14: start = 33.63, end = 34.88 (duration: 1.25)
whisper_vad_segments_from_probs: VAD segment 15: start = 35.39, end = 36.73 (duration: 1.34)
whisper_vad_segments_from_probs: VAD segment 16: start = 36.99, end = 39.13 (duration: 2.14)
whisper_vad_segments_from_probs: VAD segment 17: start = 39.30, end = 41.79 (duration: 2.49)
whisper_vad_segments_from_probs: VAD segment 18: start = 42.24, end = 44.45 (duration: 2.21)
whisper_vad_segments_from_probs: VAD segment 19: start = 45.31, end = 46.49 (duration: 1.18)
whisper_vad_segments_from_probs: VAD segment 20: start = 47.75, end = 48.06 (duration: 0.31)
whisper_vad_segments_from_probs: VAD segment 21: start = 48.67, end = 49.82 (duration: 1.15)
whisper_vad_segments_from_probs: VAD segment 22: start = 50.72, end = 53.25 (duration: 2.53)
whisper_vad_segments_from_probs: VAD segment 23: start = 54.11, end = 56.16 (duration: 2.05)
whisper_vad_segments_from_probs: VAD segment 24: start = 56.39, end = 56.93 (duration: 0.54)
whisper_vad_segments_from_probs: VAD segment 25: start = 57.19, end = 57.89 (duration: 0.70)
whisper_vad_segments_from_probs: VAD segment 26: start = 58.15, end = 58.78 (duration: 0.63)
whisper_vad_segments_from_probs: VAD segment 27: start = 59.04, end = 59.52 (duration: 0.48)
whisper_vad_segments_from_probs: VAD segment 28: start = 60.61, end = 61.37 (duration: 0.76)
whisper_vad_segments_from_probs: VAD segment 29: start = 63.91, end = 66.01 (duration: 2.10)
whisper_vad_segments_from_probs: VAD segment 30: start = 66.59, end = 70.27 (duration: 3.68)
whisper_vad_segments_from_probs: VAD segment 31: start = 70.47, end = 72.09 (duration: 1.62)
whisper_vad_segments_from_probs: VAD segment 32: start = 72.58, end = 73.12 (duration: 0.54)
whisper_vad_segments_from_probs: VAD segment 33: start = 73.47, end = 75.65 (duration: 2.18)
whisper_vad_segments_from_probs: VAD segment 34: start = 75.81, end = 78.46 (duration: 2.65)
whisper_vad_segments_from_probs: VAD segment 35: start = 78.82, end = 82.01 (duration: 3.19)
whisper_vad_segments_from_probs: VAD segment 36: start = 82.31, end = 83.90 (duration: 1.59)
whisper_vad_segments_from_probs: VAD segment 37: start = 84.64, end = 90.30 (duration: 5.66)
whisper_vad_segments_from_probs: VAD segment 38: start = 90.98, end = 94.33 (duration: 3.35)
whisper_vad_segments_from_probs: VAD segment 39: start = 94.66, end = 95.68 (duration: 1.02)
whisper_vad_segments_from_probs: VAD segment 40: start = 96.03, end = 96.93 (duration: 0.90)
whisper_vad_segments_from_probs: VAD segment 41: start = 97.73, end = 98.91 (duration: 1.18)
whisper_vad_segments_from_probs: VAD segment 42: start = 100.64, end = 102.65 (duration: 2.01)
whisper_vad_segments_from_probs: VAD segment 43: start = 103.91, end = 105.60 (duration: 1.69)
whisper_vad_segments_from_probs: VAD segment 44: start = 106.31, end = 108.80 (duration: 2.49)
whisper_vad_segments_from_probs: VAD segment 45: start = 109.28, end = 110.81 (duration: 1.53)
whisper_vad_segments_from_probs: VAD segment 46: start = 112.13, end = 113.41 (duration: 1.28)
whisper_vad_segments_from_probs: VAD segment 47: start = 114.69, end = 115.71 (duration: 1.02)
whisper_vad_segments_from_probs: VAD segment 48: start = 116.10, end = 117.92 (duration: 1.82)
whisper_vad_segments_from_probs: VAD segment 49: start = 118.11, end = 119.39 (duration: 1.28)
whisper_vad_segments_from_probs: VAD segment 50: start = 120.19, end = 120.77 (duration: 0.58)
whisper_vad_segments_from_probs: VAD segment 51: start = 121.19, end = 123.04 (duration: 1.85)
whisper_vad_segments_from_probs: VAD segment 52: start = 123.55, end = 123.97 (duration: 0.42)
whisper_vad_segments_from_probs: VAD segment 53: start = 124.87, end = 125.98 (duration: 1.11)
whisper_vad_segments_from_probs: VAD segment 54: start = 126.34, end = 128.32 (duration: 1.98)
whisper_vad_segments_from_probs: VAD segment 55: start = 128.64, end = 130.46 (duration: 1.82)
whisper_vad_segments_from_probs: VAD segment 56: start = 130.91, end = 132.25 (duration: 1.34)
whisper_vad_segments_from_probs: VAD segment 57: start = 132.58, end = 133.69 (duration: 1.11)
whisper_vad_segments_from_probs: VAD segment 58: start = 134.27, end = 138.59 (duration: 4.32)
whisper_vad_segments_from_probs: VAD segment 59: start = 139.97, end = 140.51 (duration: 0.54)
whisper_vad_segments_from_probs: VAD segment 60: start = 141.09, end = 141.85 (duration: 0.76)
whisper_vad_segments_from_probs: VAD segment 61: start = 142.15, end = 142.46 (duration: 0.31)
whisper_vad_segments_from_probs: VAD segment 62: start = 143.30, end = 145.25 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 63: start = 146.15, end = 147.55 (duration: 1.40)
whisper_vad_segments_from_probs: VAD segment 64: start = 147.97, end = 149.34 (duration: 1.37)
whisper_vad_segments_from_probs: VAD segment 65: start = 150.24, end = 151.61 (duration: 1.37)
whisper_vad_segments_from_probs: VAD segment 66: start = 152.96, end = 153.69 (duration: 0.73)
whisper_vad_segments_from_probs: VAD segment 67: start = 154.18, end = 155.49 (duration: 1.31)
whisper_vad_segments_from_probs: VAD segment 68: start = 155.81, end = 156.73 (duration: 0.92)
whisper_vad_segments_from_probs: VAD segment 69: start = 157.12, end = 160.22 (duration: 3.10)
whisper_vad_segments_from_probs: VAD segment 70: start = 160.39, end = 162.33 (duration: 1.94)
whisper_vad_segments_from_probs: VAD segment 71: start = 162.59, end = 163.20 (duration: 0.61)
whisper_vad_segments_from_probs: VAD segment 72: start = 163.62, end = 164.13 (duration: 0.51)
whisper_vad_segments_from_probs: VAD segment 73: start = 164.32, end = 165.18 (duration: 0.86)
whisper_vad_segments_from_probs: VAD segment 74: start = 165.89, end = 166.27 (duration: 0.38)
whisper_vad_segments_from_probs: VAD segment 75: start = 166.95, end = 168.64 (duration: 1.69)
whisper_vad_segments_from_probs: VAD segment 76: start = 169.25, end = 170.27 (duration: 1.02)
whisper_vad_segments_from_probs: VAD segment 77: start = 170.79, end = 173.05 (duration: 2.26)
whisper_vad_segments_from_probs: VAD segment 78: start = 173.70, end = 175.29 (duration: 1.59)
whisper_vad_segments_from_probs: VAD segment 79: start = 175.97, end = 176.93 (duration: 0.96)
whisper_vad_segments_from_probs: VAD segment 80: start = 178.24, end = 178.85 (duration: 0.61)
whisper_vad: detected 81 speech segments
whisper_vad: Including segment 0: 0.19 - 2.40 (duration: 2.21)
whisper_vad: Including segment 1: 2.79 - 4.61 (duration: 1.82)
whisper_vad: Including segment 2: 4.93 - 6.53 (duration: 1.60)
whisper_vad: Including segment 3: 6.98 - 8.29 (duration: 1.31)
whisper_vad: Including segment 4: 8.51 - 10.05 (duration: 1.54)
whisper_vad: Including segment 5: 10.40 - 13.09 (duration: 2.69)
whisper_vad: Including segment 6: 13.89 - 15.49 (duration: 1.60)
whisper_vad: Including segment 7: 15.78 - 16.93 (duration: 1.15)
whisper_vad: Including segment 8: 17.25 - 19.23 (duration: 1.98)
whisper_vad: Including segment 9: 19.59 - 21.57 (duration: 1.98)
whisper_vad: Including segment 10: 22.05 - 23.23 (duration: 1.18)
whisper_vad: Including segment 11: 23.39 - 24.48 (duration: 1.09)
whisper_vad: Including segment 12: 25.47 - 28.07 (duration: 2.60)
whisper_vad: Including segment 13: 28.16 - 33.03 (duration: 4.87)
whisper_vad: Including segment 14: 33.63 - 34.98 (duration: 1.35)
whisper_vad: Including segment 15: 35.39 - 36.83 (duration: 1.44)
whisper_vad: Including segment 16: 36.99 - 39.23 (duration: 2.24)
whisper_vad: Including segment 17: 39.30 - 41.89 (duration: 2.59)
whisper_vad: Including segment 18: 42.24 - 44.55 (duration: 2.31)
whisper_vad: Including segment 19: 45.31 - 46.59 (duration: 1.28)
whisper_vad: Including segment 20: 47.75 - 48.16 (duration: 0.41)
whisper_vad: Including segment 21: 48.67 - 49.92 (duration: 1.25)
whisper_vad: Including segment 22: 50.72 - 53.35 (duration: 2.63)
whisper_vad: Including segment 23: 54.11 - 56.26 (duration: 2.15)
whisper_vad: Including segment 24: 56.39 - 57.03 (duration: 0.64)
whisper_vad: Including segment 25: 57.19 - 57.99 (duration: 0.80)
whisper_vad: Including segment 26: 58.15 - 58.88 (duration: 0.73)
whisper_vad: Including segment 27: 59.04 - 59.62 (duration: 0.58)
whisper_vad: Including segment 28: 60.61 - 61.47 (duration: 0.86)
whisper_vad: Including segment 29: 63.91 - 66.11 (duration: 2.20)
whisper_vad: Including segment 30: 66.59 - 70.37 (duration: 3.78)
whisper_vad: Including segment 31: 70.47 - 72.19 (duration: 1.72)
whisper_vad: Including segment 32: 72.58 - 73.22 (duration: 0.64)
whisper_vad: Including segment 33: 73.47 - 75.75 (duration: 2.28)
whisper_vad: Including segment 34: 75.81 - 78.56 (duration: 2.75)
whisper_vad: Including segment 35: 78.82 - 82.11 (duration: 3.29)
whisper_vad: Including segment 36: 82.31 - 84.00 (duration: 1.69)
whisper_vad: Including segment 37: 84.64 - 90.40 (duration: 5.76)
whisper_vad: Including segment 38: 90.98 - 94.43 (duration: 3.45)
whisper_vad: Including segment 39: 94.66 - 95.78 (duration: 1.12)
whisper_vad: Including segment 40: 96.03 - 97.03 (duration: 1.00)
whisper_vad: Including segment 41: 97.73 - 99.01 (duration: 1.28)
whisper_vad: Including segment 42: 100.64 - 102.75 (duration: 2.11)
whisper_vad: Including segment 43: 103.91 - 105.70 (duration: 1.79)
whisper_vad: Including segment 44: 106.31 - 108.90 (duration: 2.59)
whisper_vad: Including segment 45: 109.28 - 110.91 (duration: 1.63)
whisper_vad: Including segment 46: 112.13 - 113.51 (duration: 1.38)
whisper_vad: Including segment 47: 114.69 - 115.81 (duration: 1.12)
whisper_vad: Including segment 48: 116.10 - 118.02 (duration: 1.92)
whisper_vad: Including segment 49: 118.11 - 119.49 (duration: 1.38)
whisper_vad: Including segment 50: 120.19 - 120.87 (duration: 0.68)
whisper_vad: Including segment 51: 121.19 - 123.14 (duration: 1.95)
whisper_vad: Including segment 52: 123.55 - 124.07 (duration: 0.52)
whisper_vad: Including segment 53: 124.87 - 126.08 (duration: 1.21)
whisper_vad: Including segment 54: 126.34 - 128.42 (duration: 2.08)
whisper_vad: Including segment 55: 128.64 - 130.56 (duration: 1.92)
whisper_vad: Including segment 56: 130.91 - 132.35 (duration: 1.44)
whisper_vad: Including segment 57: 132.58 - 133.79 (duration: 1.21)
whisper_vad: Including segment 58: 134.27 - 138.69 (duration: 4.42)
whisper_vad: Including segment 59: 139.97 - 140.61 (duration: 0.64)
whisper_vad: Including segment 60: 141.09 - 141.95 (duration: 0.86)
whisper_vad: Including segment 61: 142.15 - 142.56 (duration: 0.41)
whisper_vad: Including segment 62: 143.30 - 145.35 (duration: 2.05)
whisper_vad: Including segment 63: 146.15 - 147.65 (duration: 1.50)
whisper_vad: Including segment 64: 147.97 - 149.44 (duration: 1.47)
whisper_vad: Including segment 65: 150.24 - 151.71 (duration: 1.47)
whisper_vad: Including segment 66: 152.96 - 153.79 (duration: 0.83)
whisper_vad: Including segment 67: 154.18 - 155.59 (duration: 1.41)
whisper_vad: Including segment 68: 155.81 - 156.83 (duration: 1.02)
whisper_vad: Including segment 69: 157.12 - 160.32 (duration: 3.20)
whisper_vad: Including segment 70: 160.39 - 162.43 (duration: 2.04)
whisper_vad: Including segment 71: 162.59 - 163.30 (duration: 0.71)
whisper_vad: Including segment 72: 163.62 - 164.23 (duration: 0.61)
whisper_vad: Including segment 73: 164.32 - 165.28 (duration: 0.96)
whisper_vad: Including segment 74: 165.89 - 166.37 (duration: 0.48)
whisper_vad: Including segment 75: 166.95 - 168.74 (duration: 1.79)
whisper_vad: Including segment 76: 169.25 - 170.37 (duration: 1.12)
whisper_vad: Including segment 77: 170.79 - 173.15 (duration: 2.36)
whisper_vad: Including segment 78: 173.70 - 175.39 (duration: 1.69)
whisper_vad: Including segment 79: 175.97 - 177.03 (duration: 1.06)
whisper_vad: Including segment 80: 178.24 - 178.85 (duration: 0.61)
whisper_vad: total duration of speech segments: 137.48 seconds
whisper_vad: vad_segment_info: orig_start: 0.19, orig_end: 2.30, vad_start: 0.00, vad_end: 2.21
whisper_vad: vad_segment_info: orig_start: 2.79, orig_end: 4.51, vad_start: 2.31, vad_end: 4.13
whisper_vad: vad_segment_info: orig_start: 4.93, orig_end: 6.43, vad_start: 4.23, vad_end: 5.83
whisper_vad: vad_segment_info: orig_start: 6.98, orig_end: 8.19, vad_start: 5.93, vad_end: 7.24
whisper_vad: vad_segment_info: orig_start: 8.51, orig_end: 9.95, vad_start: 7.34, vad_end: 8.88
whisper_vad: vad_segment_info: orig_start: 10.40, orig_end: 12.99, vad_start: 8.98, vad_end: 11.67
whisper_vad: vad_segment_info: orig_start: 13.89, orig_end: 15.39, vad_start: 11.77, vad_end: 13.37
whisper_vad: vad_segment_info: orig_start: 15.78, orig_end: 16.83, vad_start: 13.47, vad_end: 14.62
whisper_vad: vad_segment_info: orig_start: 17.25, orig_end: 19.13, vad_start: 14.72, vad_end: 16.70
whisper_vad: vad_segment_info: orig_start: 19.59, orig_end: 21.47, vad_start: 16.80, vad_end: 18.78
whisper_vad: vad_segment_info: orig_start: 22.05, orig_end: 23.13, vad_start: 18.88, vad_end: 20.06
whisper_vad: vad_segment_info: orig_start: 23.39, orig_end: 24.38, vad_start: 20.16, vad_end: 21.25
whisper_vad: vad_segment_info: orig_start: 25.47, orig_end: 27.97, vad_start: 21.35, vad_end: 23.95
whisper_vad: vad_segment_info: orig_start: 28.16, orig_end: 32.93, vad_start: 24.05, vad_end: 28.92
whisper_vad: vad_segment_info: orig_start: 33.63, orig_end: 34.88, vad_start: 29.02, vad_end: 30.37
whisper_vad: vad_segment_info: orig_start: 35.39, orig_end: 36.73, vad_start: 30.47, vad_end: 31.91
whisper_vad: vad_segment_info: orig_start: 36.99, orig_end: 39.13, vad_start: 32.01, vad_end: 34.25
whisper_vad: vad_segment_info: orig_start: 39.30, orig_end: 41.79, vad_start: 34.35, vad_end: 36.94
whisper_vad: vad_segment_info: orig_start: 42.24, orig_end: 44.45, vad_start: 37.04, vad_end: 39.35
whisper_vad: vad_segment_info: orig_start: 45.31, orig_end: 46.49, vad_start: 39.45, vad_end: 40.73
whisper_vad: vad_segment_info: orig_start: 47.75, orig_end: 48.06, vad_start: 40.83, vad_end: 41.24
whisper_vad: vad_segment_info: orig_start: 48.67, orig_end: 49.82, vad_start: 41.34, vad_end: 42.59
whisper_vad: vad_segment_info: orig_start: 50.72, orig_end: 53.25, vad_start: 42.69, vad_end: 45.32
whisper_vad: vad_segment_info: orig_start: 54.11, orig_end: 56.16, vad_start: 45.42, vad_end: 47.57
whisper_vad: vad_segment_info: orig_start: 56.39, orig_end: 56.93, vad_start: 47.67, vad_end: 48.31
whisper_vad: vad_segment_info: orig_start: 57.19, orig_end: 57.89, vad_start: 48.41, vad_end: 49.21
whisper_vad: vad_segment_info: orig_start: 58.15, orig_end: 58.78, vad_start: 49.31, vad_end: 50.04
whisper_vad: vad_segment_info: orig_start: 59.04, orig_end: 59.52, vad_start: 50.14, vad_end: 50.72
whisper_vad: vad_segment_info: orig_start: 60.61, orig_end: 61.37, vad_start: 50.82, vad_end: 51.68
whisper_vad: vad_segment_info: orig_start: 63.91, orig_end: 66.01, vad_start: 51.78, vad_end: 53.98
whisper_vad: vad_segment_info: orig_start: 66.59, orig_end: 70.27, vad_start: 54.08, vad_end: 57.86
whisper_vad: vad_segment_info: orig_start: 70.47, orig_end: 72.09, vad_start: 57.96, vad_end: 59.68
whisper_vad: vad_segment_info: orig_start: 72.58, orig_end: 73.12, vad_start: 59.78, vad_end: 60.42
whisper_vad: vad_segment_info: orig_start: 73.47, orig_end: 75.65, vad_start: 60.52, vad_end: 62.80
whisper_vad: vad_segment_info: orig_start: 75.81, orig_end: 78.46, vad_start: 62.90, vad_end: 65.65
whisper_vad: vad_segment_info: orig_start: 78.82, orig_end: 82.01, vad_start: 65.75, vad_end: 69.04
whisper_vad: vad_segment_info: orig_start: 82.31, orig_end: 83.90, vad_start: 69.14, vad_end: 70.83
whisper_vad: vad_segment_info: orig_start: 84.64, orig_end: 90.30, vad_start: 70.93, vad_end: 76.69
whisper_vad: vad_segment_info: orig_start: 90.98, orig_end: 94.33, vad_start: 76.79, vad_end: 80.24
whisper_vad: vad_segment_info: orig_start: 94.66, orig_end: 95.68, vad_start: 80.34, vad_end: 81.46
whisper_vad: vad_segment_info: orig_start: 96.03, orig_end: 96.93, vad_start: 81.56, vad_end: 82.56
whisper_vad: vad_segment_info: orig_start: 97.73, orig_end: 98.91, vad_start: 82.66, vad_end: 83.94
whisper_vad: vad_segment_info: orig_start: 100.64, orig_end: 102.65, vad_start: 84.04, vad_end: 86.15
whisper_vad: vad_segment_info: orig_start: 103.91, orig_end: 105.60, vad_start: 86.25, vad_end: 88.04
whisper_vad: vad_segment_info: orig_start: 106.31, orig_end: 108.80, vad_start: 88.14, vad_end: 90.73
whisper_vad: vad_segment_info: orig_start: 109.28, orig_end: 110.81, vad_start: 90.83, vad_end: 92.46
whisper_vad: vad_segment_info: orig_start: 112.13, orig_end: 113.41, vad_start: 92.56, vad_end: 93.94
whisper_vad: vad_segment_info: orig_start: 114.69, orig_end: 115.71, vad_start: 94.04, vad_end: 95.16
whisper_vad: vad_segment_info: orig_start: 116.10, orig_end: 117.92, vad_start: 95.26, vad_end: 97.18
whisper_vad: vad_segment_info: orig_start: 118.11, orig_end: 119.39, vad_start: 97.28, vad_end: 98.66
whisper_vad: vad_segment_info: orig_start: 120.19, orig_end: 120.77, vad_start: 98.76, vad_end: 99.44
whisper_vad: vad_segment_info: orig_start: 121.19, orig_end: 123.04, vad_start: 99.54, vad_end: 101.49
whisper_vad: vad_segment_info: orig_start: 123.55, orig_end: 123.97, vad_start: 101.59, vad_end: 102.11
whisper_vad: vad_segment_info: orig_start: 124.87, orig_end: 125.98, vad_start: 102.21, vad_end: 103.42
whisper_vad: vad_segment_info: orig_start: 126.34, orig_end: 128.32, vad_start: 103.52, vad_end: 105.60
whisper_vad: vad_segment_info: orig_start: 128.64, orig_end: 130.46, vad_start: 105.70, vad_end: 107.62
whisper_vad: vad_segment_info: orig_start: 130.91, orig_end: 132.25, vad_start: 107.72, vad_end: 109.16
whisper_vad: vad_segment_info: orig_start: 132.58, orig_end: 133.69, vad_start: 109.26, vad_end: 110.47
whisper_vad: vad_segment_info: orig_start: 134.27, orig_end: 138.59, vad_start: 110.57, vad_end: 114.99
whisper_vad: vad_segment_info: orig_start: 139.97, orig_end: 140.51, vad_start: 115.09, vad_end: 115.73
whisper_vad: vad_segment_info: orig_start: 141.09, orig_end: 141.85, vad_start: 115.83, vad_end: 116.69
whisper_vad: vad_segment_info: orig_start: 142.15, orig_end: 142.46, vad_start: 116.79, vad_end: 117.20
whisper_vad: vad_segment_info: orig_start: 143.30, orig_end: 145.25, vad_start: 117.30, vad_end: 119.35
whisper_vad: vad_segment_info: orig_start: 146.15, orig_end: 147.55, vad_start: 119.45, vad_end: 120.95
whisper_vad: vad_segment_info: orig_start: 147.97, orig_end: 149.34, vad_start: 121.05, vad_end: 122.52
whisper_vad: vad_segment_info: orig_start: 150.24, orig_end: 151.61, vad_start: 122.62, vad_end: 124.09
whisper_vad: vad_segment_info: orig_start: 152.96, orig_end: 153.69, vad_start: 124.19, vad_end: 125.02
whisper_vad: vad_segment_info: orig_start: 154.18, orig_end: 155.49, vad_start: 125.12, vad_end: 126.53
whisper_vad: vad_segment_info: orig_start: 155.81, orig_end: 156.73, vad_start: 126.63, vad_end: 127.65
whisper_vad: vad_segment_info: orig_start: 157.12, orig_end: 160.22, vad_start: 127.75, vad_end: 130.95
whisper_vad: vad_segment_info: orig_start: 160.39, orig_end: 162.33, vad_start: 131.05, vad_end: 133.09
whisper_vad: vad_segment_info: orig_start: 162.59, orig_end: 163.20, vad_start: 133.19, vad_end: 133.90
whisper_vad: vad_segment_info: orig_start: 163.62, orig_end: 164.13, vad_start: 134.00, vad_end: 134.61
whisper_vad: vad_segment_info: orig_start: 164.32, orig_end: 165.18, vad_start: 134.71, vad_end: 135.67
whisper_vad: vad_segment_info: orig_start: 165.89, orig_end: 166.27, vad_start: 135.77, vad_end: 136.25
whisper_vad: vad_segment_info: orig_start: 166.95, orig_end: 168.64, vad_start: 136.35, vad_end: 138.14
whisper_vad: vad_segment_info: orig_start: 169.25, orig_end: 170.27, vad_start: 138.24, vad_end: 139.36
whisper_vad: vad_segment_info: orig_start: 170.79, orig_end: 173.05, vad_start: 139.46, vad_end: 141.82
whisper_vad: vad_segment_info: orig_start: 173.70, orig_end: 175.29, vad_start: 141.92, vad_end: 143.61
whisper_vad: vad_segment_info: orig_start: 175.97, orig_end: 176.93, vad_start: 143.71, vad_end: 144.77
whisper_vad: vad_segment_info: orig_start: 178.24, orig_end: 178.85, vad_start: 144.87, vad_end: 145.48
whisper_vad: Created time mapping table with 694 points
whisper_vad: Reduced audio from 2879829 to 2327680 samples (19.2% reduction)
whisper_full_with_state: auto-detected language: ko (p = 0.992537)
Received request: no_speech.wav
Successfully loaded no_speech.wav

system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 520 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |

operator (): processing 'no_speech.wav' (488789 samples, 30.5 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...

Running whisper.cpp inference on no_speech.wav
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_segments_from_samples: detecting speech timestamps in 488789 samples
whisper_vad_detect_speech: detecting speech in 488789 samples
whisper_vad_detect_speech: n_chunks: 955
whisper_vad_detect_speech: props size: 955
whisper_vad_detect_speech: chunk_len: 341 < n_window: 512
whisper_vad_detect_speech: vad time = 996.63 ms processing 488789 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 955 probabilities
whisper_vad_segments_from_probs: Final speech segments after filtering: 0
<no crash>
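One note on reading these logs: the "chunk_len: 341 < n_window: 512" line appears for both files, including has_speech.wav, which transcribed fine. The arithmetic below reproduces the logged values under two assumptions: the VAD splits the input into fixed 512-sample windows (the n_window value in the log), and the audio is sampled at 16 kHz (suggested by the silero-16k model type). The function name vad_chunking is hypothetical.

```python
def vad_chunking(n_samples, n_window=512):
    """Return (number of windows, length of the final window) for a sample count."""
    n_chunks = (n_samples + n_window - 1) // n_window  # integer ceiling division
    last_len = n_samples - (n_chunks - 1) * n_window
    return n_chunks, last_len


SAMPLE_RATE = 16000  # assumption based on the "silero-16k" model type

for name, n_samples in (("no_speech.wav", 488789), ("has_speech.wav", 2879829)):
    n_chunks, last_len = vad_chunking(n_samples)
    print(name, n_chunks, last_len, n_samples / SAMPLE_RATE)
```

This gives 955 chunks for no_speech.wav and 5625 for has_speech.wav, each with a final chunk of 341 samples, matching the n_chunks and chunk_len lines. Since the short final chunk shows up in the successful run too, it is probably not what triggers the crash on its own; the zero-segment result is the more likely candidate.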
