Open
Description
There are two variants of this problem: the server either crashes, or it returns data left over from the previous request.
Server command:
.\whisper-server.exe --model ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto
Request:
PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>> file = Get-Item '.\no_speech.wav'
>> response_format = 'verbose_json'
>> }
Invoke-RestMethod: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
This will crash once it hits whisper_vad_segments_from_probs: Final speech segments after filtering: 0. Full log below:
Log
PS E:\Libraries\Downloads\whisper\whisper-cublas-12.4.0-bin-x64\Release> .\whisper-server.exe --model .\ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 980, compute capability 5.2, VMM: yes
whisper_init_from_file_with_params_no_state: loading model from '.\ggml-large-v3-turbo.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_init_with_params_no_state: devices = 2
whisper_init_with_params_no_state: backends = 2
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_model_load: CUDA0 total size = 1623.92 MB
whisper_model_load: model size = 1623.92 MB
whisper_backend_init_gpu: using CUDA0 backend
whisper_init_state: kv self size = 10.49 MB
whisper_init_state: kv cross size = 31.46 MB
whisper_init_state: kv pad size = 7.86 MB
whisper_init_state: compute buffer (conv) = 37.69 MB
whisper_init_state: compute buffer (encode) = 55.35 MB
whisper_init_state: compute buffer (cross) = 9.27 MB
whisper_init_state: compute buffer (decode) = 100.04 MB
whisper server listening at http://127.0.0.1:8080
Received request: no_speech.wav
Successfully loaded no_speech.wav
system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 520 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |
operator (): processing 'no_speech.wav' (488789 samples, 30.5 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...
Running whisper.cpp inference on no_speech.wav
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params: loading VAD model from '.\ggml-silero-v6.2.0.bin'
whisper_vad_init_with_params: model type: silero-16k
whisper_vad_init_with_params: model version: 6.2.0
whisper_vad_init_with_params: n_encoder_layers = 4
whisper_vad_init_with_params: encoder_in_channels[0] = 129
whisper_vad_init_with_params: encoder_in_channels[1] = 128
whisper_vad_init_with_params: encoder_in_channels[2] = 64
whisper_vad_init_with_params: encoder_in_channels[3] = 64
whisper_vad_init_with_params: encoder_out_channels[0] = 128
whisper_vad_init_with_params: encoder_out_channels[1] = 64
whisper_vad_init_with_params: encoder_out_channels[2] = 64
whisper_vad_init_with_params: encoder_out_channels[3] = 128
whisper_vad_init_with_params: lstm_input_size = 128
whisper_vad_init_with_params: lstm_hidden_size = 128
whisper_vad_init_with_params: final_conv_in = 128
whisper_vad_init_with_params: final_conv_out = 1
whisper_vad_init_with_params: CPU total size = 0.88 MB
whisper_vad_init_with_params: model size = 0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_init_context: compute buffer (VAD) = 1.60 MB
whisper_vad_segments_from_samples: detecting speech timestamps in 488789 samples
whisper_vad_detect_speech: detecting speech in 488789 samples
whisper_vad_detect_speech: n_chunks: 955
whisper_vad_detect_speech: props size: 955
whisper_vad_detect_speech: chunk_len: 341 < n_window: 512
whisper_vad_detect_speech: vad time = 164.73 ms processing 488789 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 955 probabilities
whisper_vad_segments_from_probs: Final speech segments after filtering: 0
<crash>
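As a sanity check on the log above, the chunk numbers reported by whisper_vad_detect_speech can be reproduced with simple arithmetic. This is a minimal sketch inferred only from the log lines (n_chunks, chunk_len, n_window), not from the whisper.cpp source; the function name `vad_chunks` is hypothetical. It suggests the silent file was chunked normally, so the failure begins after the segment filtering step reports 0 segments.

```python
N_WINDOW = 512  # taken from the log line "chunk_len: 341 < n_window: 512"


def vad_chunks(n_samples: int, n_window: int = N_WINDOW) -> tuple[int, int]:
    """Return (number of chunks, length of the last chunk) for a sample count.

    Assumption: the audio is split into fixed windows of n_window samples
    and the final window may be shorter, as the log output suggests.
    """
    n_chunks = (n_samples + n_window - 1) // n_window  # integer ceiling division
    last_chunk_len = n_samples - (n_chunks - 1) * n_window
    return n_chunks, last_chunk_len


# Sample count copied from the no_speech.wav log
print(vad_chunks(488789))
```

Both values agree with the log (n_chunks: 955, chunk_len: 341).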
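If you pass it audio with speech first and then audio without speech, it will not crash, but the response carries over the previous request's language detection (language, detected_language_probability, language_probabilities) alongside empty text and segments.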
Requests:
PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>> file = Get-Item '.\has_speech.wav'
>> response_format = 'verbose_json'
>> }
task : transcribe
language : korean
duration : 179.989318847656
text : <snip>
segments : <snip>
detected_language : korean
detected_language_probability : 0.992537379264832
language_probabilities : @{en=0.00533512607216835; ko=0.992537379264832}
PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>> file = Get-Item '.\no_speech.wav'
>> response_format = 'verbose_json'
>> }
task : transcribe
language : korean
duration : 30.5493125915527
text :
segments : {}
detected_language : korean
detected_language_probability : 0.992537379264832
language_probabilities : @{en=0.00533512607216835; ko=0.992537379264832}
This might be expected because I haven't passed --no-context... so I tried that.
Server command:
.\whisper-server.exe --model .\ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto --no-context
Requests+responses:
PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>> file = Get-Item '.\has_speech.wav'
>> response_format = 'verbose_json'
>> }
task : transcribe
language : korean
duration : 179.989318847656
text : <snip>
segments : <snip>
detected_language : korean
detected_language_probability : 0.992537379264832
language_probabilities : @{en=0.00533512607216835; ko=0.992537379264832}
PS F:\whisper\audio> Invoke-RestMethod -Uri 'http://127.0.0.1:8080/inference' -Method Post -ContentType 'multipart/form-data' -Form @{
>> file = Get-Item '.\no_speech.wav'
>> response_format = 'verbose_json'
>> }
task : transcribe
language : korean
duration : 30.5493125915527
text :
segments : {}
detected_language : korean
detected_language_probability : 0.992537379264832
language_probabilities : @{en=0.00533512607216835; ko=0.992537379264832}
Still does it. Seems to indicate (to me) that it is actually keeping the context?
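Until this is fixed, a client could flag the stale response itself. The following is a rough heuristic sketch, assuming the verbose_json responses have been parsed into dictionaries with the field names shown above; the helper name `looks_stale` is hypothetical and not part of whisper.cpp. The idea is that two different audio files are very unlikely to yield bit-identical language probabilities, so an empty result that repeats them is suspicious.

```python
def looks_stale(prev: dict, curr: dict) -> bool:
    """Heuristic: curr has no segments but repeats prev's language statistics."""
    no_segments = not curr.get("segments")
    same_prob = (
        curr.get("detected_language_probability")
        == prev.get("detected_language_probability")
    )
    same_dist = curr.get("language_probabilities") == prev.get("language_probabilities")
    return no_segments and same_prob and same_dist


# Values copied from the two responses above (text and segments were snipped
# in the report, so a placeholder segment stands in for the first response).
first = {
    "duration": 179.989318847656,
    "segments": [{"placeholder": True}],
    "detected_language_probability": 0.992537379264832,
    "language_probabilities": {"en": 0.00533512607216835, "ko": 0.992537379264832},
}
second = {
    "duration": 30.5493125915527,
    "segments": [],
    "detected_language_probability": 0.992537379264832,
    "language_probabilities": {"en": 0.00533512607216835, "ko": 0.992537379264832},
}

print(looks_stale(first, second))
```

This only detects the symptom on the client side; it does not address why the server keeps the old values.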
Long log probably isn't useful in this case, but here it is anyway:
Log
PS E:\Libraries\Downloads\whisper\whisper-cublas-12.4.0-bin-x64\Release> .\whisper-server.exe --model .\ggml-large-v3-turbo.bin --vad --vad-model .\ggml-silero-v6.2.0.bin --language auto --no-context
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 980, compute capability 5.2, VMM: yes
whisper_init_from_file_with_params_no_state: loading model from '.\ggml-large-v3-turbo.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_init_with_params_no_state: devices = 2
whisper_init_with_params_no_state: backends = 2
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_model_load: CUDA0 total size = 1623.92 MB
whisper_model_load: model size = 1623.92 MB
whisper_backend_init_gpu: using CUDA0 backend
whisper_init_state: kv self size = 10.49 MB
whisper_init_state: kv cross size = 31.46 MB
whisper_init_state: kv pad size = 7.86 MB
whisper_init_state: compute buffer (conv) = 37.69 MB
whisper_init_state: compute buffer (encode) = 55.35 MB
whisper_init_state: compute buffer (cross) = 9.27 MB
whisper_init_state: compute buffer (decode) = 100.04 MB
whisper server listening at http://127.0.0.1:8080
Received request: has_speech.wav
Successfully loaded has_speech.wav
system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 520 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |
operator (): processing 'has_speech.wav' (2879829 samples, 180.0 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...
Running whisper.cpp inference on has_speech.wav
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params: loading VAD model from '.\ggml-silero-v6.2.0.bin'
whisper_vad_init_with_params: model type: silero-16k
whisper_vad_init_with_params: model version: 6.2.0
whisper_vad_init_with_params: n_encoder_layers = 4
whisper_vad_init_with_params: encoder_in_channels[0] = 129
whisper_vad_init_with_params: encoder_in_channels[1] = 128
whisper_vad_init_with_params: encoder_in_channels[2] = 64
whisper_vad_init_with_params: encoder_in_channels[3] = 64
whisper_vad_init_with_params: encoder_out_channels[0] = 128
whisper_vad_init_with_params: encoder_out_channels[1] = 64
whisper_vad_init_with_params: encoder_out_channels[2] = 64
whisper_vad_init_with_params: encoder_out_channels[3] = 128
whisper_vad_init_with_params: lstm_input_size = 128
whisper_vad_init_with_params: lstm_hidden_size = 128
whisper_vad_init_with_params: final_conv_in = 128
whisper_vad_init_with_params: final_conv_out = 1
whisper_vad_init_with_params: CPU total size = 0.88 MB
whisper_vad_init_with_params: model size = 0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_vad_init_context: compute buffer (VAD) = 1.60 MB
whisper_vad_segments_from_samples: detecting speech timestamps in 2879829 samples
whisper_vad_detect_speech: detecting speech in 2879829 samples
whisper_vad_detect_speech: n_chunks: 5625
whisper_vad_detect_speech: props size: 5625
whisper_vad_detect_speech: chunk_len: 341 < n_window: 512
whisper_vad_detect_speech: vad time = 840.26 ms processing 2879829 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 5625 probabilities
whisper_vad_segments_from_probs: Merged 2 adjacent segments, now have 81 segments
whisper_vad_segments_from_probs: Final speech segments after filtering: 81
whisper_vad_segments_from_probs: VAD segment 0: start = 0.19, end = 2.30 (duration: 2.11)
whisper_vad_segments_from_probs: VAD segment 1: start = 2.79, end = 4.51 (duration: 1.72)
whisper_vad_segments_from_probs: VAD segment 2: start = 4.93, end = 6.43 (duration: 1.50)
whisper_vad_segments_from_probs: VAD segment 3: start = 6.98, end = 8.19 (duration: 1.21)
whisper_vad_segments_from_probs: VAD segment 4: start = 8.51, end = 9.95 (duration: 1.44)
whisper_vad_segments_from_probs: VAD segment 5: start = 10.40, end = 12.99 (duration: 2.59)
whisper_vad_segments_from_probs: VAD segment 6: start = 13.89, end = 15.39 (duration: 1.50)
whisper_vad_segments_from_probs: VAD segment 7: start = 15.78, end = 16.83 (duration: 1.05)
whisper_vad_segments_from_probs: VAD segment 8: start = 17.25, end = 19.13 (duration: 1.88)
whisper_vad_segments_from_probs: VAD segment 9: start = 19.59, end = 21.47 (duration: 1.88)
whisper_vad_segments_from_probs: VAD segment 10: start = 22.05, end = 23.13 (duration: 1.08)
whisper_vad_segments_from_probs: VAD segment 11: start = 23.39, end = 24.38 (duration: 0.99)
whisper_vad_segments_from_probs: VAD segment 12: start = 25.47, end = 27.97 (duration: 2.50)
whisper_vad_segments_from_probs: VAD segment 13: start = 28.16, end = 32.93 (duration: 4.77)
whisper_vad_segments_from_probs: VAD segment 14: start = 33.63, end = 34.88 (duration: 1.25)
whisper_vad_segments_from_probs: VAD segment 15: start = 35.39, end = 36.73 (duration: 1.34)
whisper_vad_segments_from_probs: VAD segment 16: start = 36.99, end = 39.13 (duration: 2.14)
whisper_vad_segments_from_probs: VAD segment 17: start = 39.30, end = 41.79 (duration: 2.49)
whisper_vad_segments_from_probs: VAD segment 18: start = 42.24, end = 44.45 (duration: 2.21)
whisper_vad_segments_from_probs: VAD segment 19: start = 45.31, end = 46.49 (duration: 1.18)
whisper_vad_segments_from_probs: VAD segment 20: start = 47.75, end = 48.06 (duration: 0.31)
whisper_vad_segments_from_probs: VAD segment 21: start = 48.67, end = 49.82 (duration: 1.15)
whisper_vad_segments_from_probs: VAD segment 22: start = 50.72, end = 53.25 (duration: 2.53)
whisper_vad_segments_from_probs: VAD segment 23: start = 54.11, end = 56.16 (duration: 2.05)
whisper_vad_segments_from_probs: VAD segment 24: start = 56.39, end = 56.93 (duration: 0.54)
whisper_vad_segments_from_probs: VAD segment 25: start = 57.19, end = 57.89 (duration: 0.70)
whisper_vad_segments_from_probs: VAD segment 26: start = 58.15, end = 58.78 (duration: 0.63)
whisper_vad_segments_from_probs: VAD segment 27: start = 59.04, end = 59.52 (duration: 0.48)
whisper_vad_segments_from_probs: VAD segment 28: start = 60.61, end = 61.37 (duration: 0.76)
whisper_vad_segments_from_probs: VAD segment 29: start = 63.91, end = 66.01 (duration: 2.10)
whisper_vad_segments_from_probs: VAD segment 30: start = 66.59, end = 70.27 (duration: 3.68)
whisper_vad_segments_from_probs: VAD segment 31: start = 70.47, end = 72.09 (duration: 1.62)
whisper_vad_segments_from_probs: VAD segment 32: start = 72.58, end = 73.12 (duration: 0.54)
whisper_vad_segments_from_probs: VAD segment 33: start = 73.47, end = 75.65 (duration: 2.18)
whisper_vad_segments_from_probs: VAD segment 34: start = 75.81, end = 78.46 (duration: 2.65)
whisper_vad_segments_from_probs: VAD segment 35: start = 78.82, end = 82.01 (duration: 3.19)
whisper_vad_segments_from_probs: VAD segment 36: start = 82.31, end = 83.90 (duration: 1.59)
whisper_vad_segments_from_probs: VAD segment 37: start = 84.64, end = 90.30 (duration: 5.66)
whisper_vad_segments_from_probs: VAD segment 38: start = 90.98, end = 94.33 (duration: 3.35)
whisper_vad_segments_from_probs: VAD segment 39: start = 94.66, end = 95.68 (duration: 1.02)
whisper_vad_segments_from_probs: VAD segment 40: start = 96.03, end = 96.93 (duration: 0.90)
whisper_vad_segments_from_probs: VAD segment 41: start = 97.73, end = 98.91 (duration: 1.18)
whisper_vad_segments_from_probs: VAD segment 42: start = 100.64, end = 102.65 (duration: 2.01)
whisper_vad_segments_from_probs: VAD segment 43: start = 103.91, end = 105.60 (duration: 1.69)
whisper_vad_segments_from_probs: VAD segment 44: start = 106.31, end = 108.80 (duration: 2.49)
whisper_vad_segments_from_probs: VAD segment 45: start = 109.28, end = 110.81 (duration: 1.53)
whisper_vad_segments_from_probs: VAD segment 46: start = 112.13, end = 113.41 (duration: 1.28)
whisper_vad_segments_from_probs: VAD segment 47: start = 114.69, end = 115.71 (duration: 1.02)
whisper_vad_segments_from_probs: VAD segment 48: start = 116.10, end = 117.92 (duration: 1.82)
whisper_vad_segments_from_probs: VAD segment 49: start = 118.11, end = 119.39 (duration: 1.28)
whisper_vad_segments_from_probs: VAD segment 50: start = 120.19, end = 120.77 (duration: 0.58)
whisper_vad_segments_from_probs: VAD segment 51: start = 121.19, end = 123.04 (duration: 1.85)
whisper_vad_segments_from_probs: VAD segment 52: start = 123.55, end = 123.97 (duration: 0.42)
whisper_vad_segments_from_probs: VAD segment 53: start = 124.87, end = 125.98 (duration: 1.11)
whisper_vad_segments_from_probs: VAD segment 54: start = 126.34, end = 128.32 (duration: 1.98)
whisper_vad_segments_from_probs: VAD segment 55: start = 128.64, end = 130.46 (duration: 1.82)
whisper_vad_segments_from_probs: VAD segment 56: start = 130.91, end = 132.25 (duration: 1.34)
whisper_vad_segments_from_probs: VAD segment 57: start = 132.58, end = 133.69 (duration: 1.11)
whisper_vad_segments_from_probs: VAD segment 58: start = 134.27, end = 138.59 (duration: 4.32)
whisper_vad_segments_from_probs: VAD segment 59: start = 139.97, end = 140.51 (duration: 0.54)
whisper_vad_segments_from_probs: VAD segment 60: start = 141.09, end = 141.85 (duration: 0.76)
whisper_vad_segments_from_probs: VAD segment 61: start = 142.15, end = 142.46 (duration: 0.31)
whisper_vad_segments_from_probs: VAD segment 62: start = 143.30, end = 145.25 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 63: start = 146.15, end = 147.55 (duration: 1.40)
whisper_vad_segments_from_probs: VAD segment 64: start = 147.97, end = 149.34 (duration: 1.37)
whisper_vad_segments_from_probs: VAD segment 65: start = 150.24, end = 151.61 (duration: 1.37)
whisper_vad_segments_from_probs: VAD segment 66: start = 152.96, end = 153.69 (duration: 0.73)
whisper_vad_segments_from_probs: VAD segment 67: start = 154.18, end = 155.49 (duration: 1.31)
whisper_vad_segments_from_probs: VAD segment 68: start = 155.81, end = 156.73 (duration: 0.92)
whisper_vad_segments_from_probs: VAD segment 69: start = 157.12, end = 160.22 (duration: 3.10)
whisper_vad_segments_from_probs: VAD segment 70: start = 160.39, end = 162.33 (duration: 1.94)
whisper_vad_segments_from_probs: VAD segment 71: start = 162.59, end = 163.20 (duration: 0.61)
whisper_vad_segments_from_probs: VAD segment 72: start = 163.62, end = 164.13 (duration: 0.51)
whisper_vad_segments_from_probs: VAD segment 73: start = 164.32, end = 165.18 (duration: 0.86)
whisper_vad_segments_from_probs: VAD segment 74: start = 165.89, end = 166.27 (duration: 0.38)
whisper_vad_segments_from_probs: VAD segment 75: start = 166.95, end = 168.64 (duration: 1.69)
whisper_vad_segments_from_probs: VAD segment 76: start = 169.25, end = 170.27 (duration: 1.02)
whisper_vad_segments_from_probs: VAD segment 77: start = 170.79, end = 173.05 (duration: 2.26)
whisper_vad_segments_from_probs: VAD segment 78: start = 173.70, end = 175.29 (duration: 1.59)
whisper_vad_segments_from_probs: VAD segment 79: start = 175.97, end = 176.93 (duration: 0.96)
whisper_vad_segments_from_probs: VAD segment 80: start = 178.24, end = 178.85 (duration: 0.61)
whisper_vad: detected 81 speech segments
whisper_vad: Including segment 0: 0.19 - 2.40 (duration: 2.21)
whisper_vad: Including segment 1: 2.79 - 4.61 (duration: 1.82)
whisper_vad: Including segment 2: 4.93 - 6.53 (duration: 1.60)
whisper_vad: Including segment 3: 6.98 - 8.29 (duration: 1.31)
whisper_vad: Including segment 4: 8.51 - 10.05 (duration: 1.54)
whisper_vad: Including segment 5: 10.40 - 13.09 (duration: 2.69)
whisper_vad: Including segment 6: 13.89 - 15.49 (duration: 1.60)
whisper_vad: Including segment 7: 15.78 - 16.93 (duration: 1.15)
whisper_vad: Including segment 8: 17.25 - 19.23 (duration: 1.98)
whisper_vad: Including segment 9: 19.59 - 21.57 (duration: 1.98)
whisper_vad: Including segment 10: 22.05 - 23.23 (duration: 1.18)
whisper_vad: Including segment 11: 23.39 - 24.48 (duration: 1.09)
whisper_vad: Including segment 12: 25.47 - 28.07 (duration: 2.60)
whisper_vad: Including segment 13: 28.16 - 33.03 (duration: 4.87)
whisper_vad: Including segment 14: 33.63 - 34.98 (duration: 1.35)
whisper_vad: Including segment 15: 35.39 - 36.83 (duration: 1.44)
whisper_vad: Including segment 16: 36.99 - 39.23 (duration: 2.24)
whisper_vad: Including segment 17: 39.30 - 41.89 (duration: 2.59)
whisper_vad: Including segment 18: 42.24 - 44.55 (duration: 2.31)
whisper_vad: Including segment 19: 45.31 - 46.59 (duration: 1.28)
whisper_vad: Including segment 20: 47.75 - 48.16 (duration: 0.41)
whisper_vad: Including segment 21: 48.67 - 49.92 (duration: 1.25)
whisper_vad: Including segment 22: 50.72 - 53.35 (duration: 2.63)
whisper_vad: Including segment 23: 54.11 - 56.26 (duration: 2.15)
whisper_vad: Including segment 24: 56.39 - 57.03 (duration: 0.64)
whisper_vad: Including segment 25: 57.19 - 57.99 (duration: 0.80)
whisper_vad: Including segment 26: 58.15 - 58.88 (duration: 0.73)
whisper_vad: Including segment 27: 59.04 - 59.62 (duration: 0.58)
whisper_vad: Including segment 28: 60.61 - 61.47 (duration: 0.86)
whisper_vad: Including segment 29: 63.91 - 66.11 (duration: 2.20)
whisper_vad: Including segment 30: 66.59 - 70.37 (duration: 3.78)
whisper_vad: Including segment 31: 70.47 - 72.19 (duration: 1.72)
whisper_vad: Including segment 32: 72.58 - 73.22 (duration: 0.64)
whisper_vad: Including segment 33: 73.47 - 75.75 (duration: 2.28)
whisper_vad: Including segment 34: 75.81 - 78.56 (duration: 2.75)
whisper_vad: Including segment 35: 78.82 - 82.11 (duration: 3.29)
whisper_vad: Including segment 36: 82.31 - 84.00 (duration: 1.69)
whisper_vad: Including segment 37: 84.64 - 90.40 (duration: 5.76)
whisper_vad: Including segment 38: 90.98 - 94.43 (duration: 3.45)
whisper_vad: Including segment 39: 94.66 - 95.78 (duration: 1.12)
whisper_vad: Including segment 40: 96.03 - 97.03 (duration: 1.00)
whisper_vad: Including segment 41: 97.73 - 99.01 (duration: 1.28)
whisper_vad: Including segment 42: 100.64 - 102.75 (duration: 2.11)
whisper_vad: Including segment 43: 103.91 - 105.70 (duration: 1.79)
whisper_vad: Including segment 44: 106.31 - 108.90 (duration: 2.59)
whisper_vad: Including segment 45: 109.28 - 110.91 (duration: 1.63)
whisper_vad: Including segment 46: 112.13 - 113.51 (duration: 1.38)
whisper_vad: Including segment 47: 114.69 - 115.81 (duration: 1.12)
whisper_vad: Including segment 48: 116.10 - 118.02 (duration: 1.92)
whisper_vad: Including segment 49: 118.11 - 119.49 (duration: 1.38)
whisper_vad: Including segment 50: 120.19 - 120.87 (duration: 0.68)
whisper_vad: Including segment 51: 121.19 - 123.14 (duration: 1.95)
whisper_vad: Including segment 52: 123.55 - 124.07 (duration: 0.52)
whisper_vad: Including segment 53: 124.87 - 126.08 (duration: 1.21)
whisper_vad: Including segment 54: 126.34 - 128.42 (duration: 2.08)
whisper_vad: Including segment 55: 128.64 - 130.56 (duration: 1.92)
whisper_vad: Including segment 56: 130.91 - 132.35 (duration: 1.44)
whisper_vad: Including segment 57: 132.58 - 133.79 (duration: 1.21)
whisper_vad: Including segment 58: 134.27 - 138.69 (duration: 4.42)
whisper_vad: Including segment 59: 139.97 - 140.61 (duration: 0.64)
whisper_vad: Including segment 60: 141.09 - 141.95 (duration: 0.86)
whisper_vad: Including segment 61: 142.15 - 142.56 (duration: 0.41)
whisper_vad: Including segment 62: 143.30 - 145.35 (duration: 2.05)
whisper_vad: Including segment 63: 146.15 - 147.65 (duration: 1.50)
whisper_vad: Including segment 64: 147.97 - 149.44 (duration: 1.47)
whisper_vad: Including segment 65: 150.24 - 151.71 (duration: 1.47)
whisper_vad: Including segment 66: 152.96 - 153.79 (duration: 0.83)
whisper_vad: Including segment 67: 154.18 - 155.59 (duration: 1.41)
whisper_vad: Including segment 68: 155.81 - 156.83 (duration: 1.02)
whisper_vad: Including segment 69: 157.12 - 160.32 (duration: 3.20)
whisper_vad: Including segment 70: 160.39 - 162.43 (duration: 2.04)
whisper_vad: Including segment 71: 162.59 - 163.30 (duration: 0.71)
whisper_vad: Including segment 72: 163.62 - 164.23 (duration: 0.61)
whisper_vad: Including segment 73: 164.32 - 165.28 (duration: 0.96)
whisper_vad: Including segment 74: 165.89 - 166.37 (duration: 0.48)
whisper_vad: Including segment 75: 166.95 - 168.74 (duration: 1.79)
whisper_vad: Including segment 76: 169.25 - 170.37 (duration: 1.12)
whisper_vad: Including segment 77: 170.79 - 173.15 (duration: 2.36)
whisper_vad: Including segment 78: 173.70 - 175.39 (duration: 1.69)
whisper_vad: Including segment 79: 175.97 - 177.03 (duration: 1.06)
whisper_vad: Including segment 80: 178.24 - 178.85 (duration: 0.61)
whisper_vad: total duration of speech segments: 137.48 seconds
whisper_vad: vad_segment_info: orig_start: 0.19, orig_end: 2.30, vad_start: 0.00, vad_end: 2.21
whisper_vad: vad_segment_info: orig_start: 2.79, orig_end: 4.51, vad_start: 2.31, vad_end: 4.13
whisper_vad: vad_segment_info: orig_start: 4.93, orig_end: 6.43, vad_start: 4.23, vad_end: 5.83
whisper_vad: vad_segment_info: orig_start: 6.98, orig_end: 8.19, vad_start: 5.93, vad_end: 7.24
whisper_vad: vad_segment_info: orig_start: 8.51, orig_end: 9.95, vad_start: 7.34, vad_end: 8.88
whisper_vad: vad_segment_info: orig_start: 10.40, orig_end: 12.99, vad_start: 8.98, vad_end: 11.67
whisper_vad: vad_segment_info: orig_start: 13.89, orig_end: 15.39, vad_start: 11.77, vad_end: 13.37
whisper_vad: vad_segment_info: orig_start: 15.78, orig_end: 16.83, vad_start: 13.47, vad_end: 14.62
whisper_vad: vad_segment_info: orig_start: 17.25, orig_end: 19.13, vad_start: 14.72, vad_end: 16.70
whisper_vad: vad_segment_info: orig_start: 19.59, orig_end: 21.47, vad_start: 16.80, vad_end: 18.78
whisper_vad: vad_segment_info: orig_start: 22.05, orig_end: 23.13, vad_start: 18.88, vad_end: 20.06
whisper_vad: vad_segment_info: orig_start: 23.39, orig_end: 24.38, vad_start: 20.16, vad_end: 21.25
whisper_vad: vad_segment_info: orig_start: 25.47, orig_end: 27.97, vad_start: 21.35, vad_end: 23.95
whisper_vad: vad_segment_info: orig_start: 28.16, orig_end: 32.93, vad_start: 24.05, vad_end: 28.92
whisper_vad: vad_segment_info: orig_start: 33.63, orig_end: 34.88, vad_start: 29.02, vad_end: 30.37
whisper_vad: vad_segment_info: orig_start: 35.39, orig_end: 36.73, vad_start: 30.47, vad_end: 31.91
whisper_vad: vad_segment_info: orig_start: 36.99, orig_end: 39.13, vad_start: 32.01, vad_end: 34.25
whisper_vad: vad_segment_info: orig_start: 39.30, orig_end: 41.79, vad_start: 34.35, vad_end: 36.94
whisper_vad: vad_segment_info: orig_start: 42.24, orig_end: 44.45, vad_start: 37.04, vad_end: 39.35
whisper_vad: vad_segment_info: orig_start: 45.31, orig_end: 46.49, vad_start: 39.45, vad_end: 40.73
whisper_vad: vad_segment_info: orig_start: 47.75, orig_end: 48.06, vad_start: 40.83, vad_end: 41.24
whisper_vad: vad_segment_info: orig_start: 48.67, orig_end: 49.82, vad_start: 41.34, vad_end: 42.59
whisper_vad: vad_segment_info: orig_start: 50.72, orig_end: 53.25, vad_start: 42.69, vad_end: 45.32
whisper_vad: vad_segment_info: orig_start: 54.11, orig_end: 56.16, vad_start: 45.42, vad_end: 47.57
whisper_vad: vad_segment_info: orig_start: 56.39, orig_end: 56.93, vad_start: 47.67, vad_end: 48.31
whisper_vad: vad_segment_info: orig_start: 57.19, orig_end: 57.89, vad_start: 48.41, vad_end: 49.21
whisper_vad: vad_segment_info: orig_start: 58.15, orig_end: 58.78, vad_start: 49.31, vad_end: 50.04
whisper_vad: vad_segment_info: orig_start: 59.04, orig_end: 59.52, vad_start: 50.14, vad_end: 50.72
whisper_vad: vad_segment_info: orig_start: 60.61, orig_end: 61.37, vad_start: 50.82, vad_end: 51.68
whisper_vad: vad_segment_info: orig_start: 63.91, orig_end: 66.01, vad_start: 51.78, vad_end: 53.98
whisper_vad: vad_segment_info: orig_start: 66.59, orig_end: 70.27, vad_start: 54.08, vad_end: 57.86
whisper_vad: vad_segment_info: orig_start: 70.47, orig_end: 72.09, vad_start: 57.96, vad_end: 59.68
whisper_vad: vad_segment_info: orig_start: 72.58, orig_end: 73.12, vad_start: 59.78, vad_end: 60.42
whisper_vad: vad_segment_info: orig_start: 73.47, orig_end: 75.65, vad_start: 60.52, vad_end: 62.80
whisper_vad: vad_segment_info: orig_start: 75.81, orig_end: 78.46, vad_start: 62.90, vad_end: 65.65
whisper_vad: vad_segment_info: orig_start: 78.82, orig_end: 82.01, vad_start: 65.75, vad_end: 69.04
whisper_vad: vad_segment_info: orig_start: 82.31, orig_end: 83.90, vad_start: 69.14, vad_end: 70.83
whisper_vad: vad_segment_info: orig_start: 84.64, orig_end: 90.30, vad_start: 70.93, vad_end: 76.69
whisper_vad: vad_segment_info: orig_start: 90.98, orig_end: 94.33, vad_start: 76.79, vad_end: 80.24
whisper_vad: vad_segment_info: orig_start: 94.66, orig_end: 95.68, vad_start: 80.34, vad_end: 81.46
whisper_vad: vad_segment_info: orig_start: 96.03, orig_end: 96.93, vad_start: 81.56, vad_end: 82.56
whisper_vad: vad_segment_info: orig_start: 97.73, orig_end: 98.91, vad_start: 82.66, vad_end: 83.94
whisper_vad: vad_segment_info: orig_start: 100.64, orig_end: 102.65, vad_start: 84.04, vad_end: 86.15
whisper_vad: vad_segment_info: orig_start: 103.91, orig_end: 105.60, vad_start: 86.25, vad_end: 88.04
whisper_vad: vad_segment_info: orig_start: 106.31, orig_end: 108.80, vad_start: 88.14, vad_end: 90.73
whisper_vad: vad_segment_info: orig_start: 109.28, orig_end: 110.81, vad_start: 90.83, vad_end: 92.46
whisper_vad: vad_segment_info: orig_start: 112.13, orig_end: 113.41, vad_start: 92.56, vad_end: 93.94
whisper_vad: vad_segment_info: orig_start: 114.69, orig_end: 115.71, vad_start: 94.04, vad_end: 95.16
whisper_vad: vad_segment_info: orig_start: 116.10, orig_end: 117.92, vad_start: 95.26, vad_end: 97.18
whisper_vad: vad_segment_info: orig_start: 118.11, orig_end: 119.39, vad_start: 97.28, vad_end: 98.66
whisper_vad: vad_segment_info: orig_start: 120.19, orig_end: 120.77, vad_start: 98.76, vad_end: 99.44
whisper_vad: vad_segment_info: orig_start: 121.19, orig_end: 123.04, vad_start: 99.54, vad_end: 101.49
whisper_vad: vad_segment_info: orig_start: 123.55, orig_end: 123.97, vad_start: 101.59, vad_end: 102.11
whisper_vad: vad_segment_info: orig_start: 124.87, orig_end: 125.98, vad_start: 102.21, vad_end: 103.42
whisper_vad: vad_segment_info: orig_start: 126.34, orig_end: 128.32, vad_start: 103.52, vad_end: 105.60
whisper_vad: vad_segment_info: orig_start: 128.64, orig_end: 130.46, vad_start: 105.70, vad_end: 107.62
whisper_vad: vad_segment_info: orig_start: 130.91, orig_end: 132.25, vad_start: 107.72, vad_end: 109.16
whisper_vad: vad_segment_info: orig_start: 132.58, orig_end: 133.69, vad_start: 109.26, vad_end: 110.47
whisper_vad: vad_segment_info: orig_start: 134.27, orig_end: 138.59, vad_start: 110.57, vad_end: 114.99
whisper_vad: vad_segment_info: orig_start: 139.97, orig_end: 140.51, vad_start: 115.09, vad_end: 115.73
whisper_vad: vad_segment_info: orig_start: 141.09, orig_end: 141.85, vad_start: 115.83, vad_end: 116.69
whisper_vad: vad_segment_info: orig_start: 142.15, orig_end: 142.46, vad_start: 116.79, vad_end: 117.20
whisper_vad: vad_segment_info: orig_start: 143.30, orig_end: 145.25, vad_start: 117.30, vad_end: 119.35
whisper_vad: vad_segment_info: orig_start: 146.15, orig_end: 147.55, vad_start: 119.45, vad_end: 120.95
whisper_vad: vad_segment_info: orig_start: 147.97, orig_end: 149.34, vad_start: 121.05, vad_end: 122.52
whisper_vad: vad_segment_info: orig_start: 150.24, orig_end: 151.61, vad_start: 122.62, vad_end: 124.09
whisper_vad: vad_segment_info: orig_start: 152.96, orig_end: 153.69, vad_start: 124.19, vad_end: 125.02
whisper_vad: vad_segment_info: orig_start: 154.18, orig_end: 155.49, vad_start: 125.12, vad_end: 126.53
whisper_vad: vad_segment_info: orig_start: 155.81, orig_end: 156.73, vad_start: 126.63, vad_end: 127.65
whisper_vad: vad_segment_info: orig_start: 157.12, orig_end: 160.22, vad_start: 127.75, vad_end: 130.95
whisper_vad: vad_segment_info: orig_start: 160.39, orig_end: 162.33, vad_start: 131.05, vad_end: 133.09
whisper_vad: vad_segment_info: orig_start: 162.59, orig_end: 163.20, vad_start: 133.19, vad_end: 133.90
whisper_vad: vad_segment_info: orig_start: 163.62, orig_end: 164.13, vad_start: 134.00, vad_end: 134.61
whisper_vad: vad_segment_info: orig_start: 164.32, orig_end: 165.18, vad_start: 134.71, vad_end: 135.67
whisper_vad: vad_segment_info: orig_start: 165.89, orig_end: 166.27, vad_start: 135.77, vad_end: 136.25
whisper_vad: vad_segment_info: orig_start: 166.95, orig_end: 168.64, vad_start: 136.35, vad_end: 138.14
whisper_vad: vad_segment_info: orig_start: 169.25, orig_end: 170.27, vad_start: 138.24, vad_end: 139.36
whisper_vad: vad_segment_info: orig_start: 170.79, orig_end: 173.05, vad_start: 139.46, vad_end: 141.82
whisper_vad: vad_segment_info: orig_start: 173.70, orig_end: 175.29, vad_start: 141.92, vad_end: 143.61
whisper_vad: vad_segment_info: orig_start: 175.97, orig_end: 176.93, vad_start: 143.71, vad_end: 144.77
whisper_vad: vad_segment_info: orig_start: 178.24, orig_end: 178.85, vad_start: 144.87, vad_end: 145.48
whisper_vad: Created time mapping table with 694 points
whisper_vad: Reduced audio from 2879829 to 2327680 samples (19.2% reduction)
whisper_full_with_state: auto-detected language: ko (p = 0.992537)
Received request: no_speech.wav
Successfully loaded no_speech.wav
system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | CUDA : ARCHS = 520 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | OPENMP = 1 | REPACK = 1 |
operator (): processing 'no_speech.wav' (488789 samples, 30.5 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...
Running whisper.cpp inference on no_speech.wav
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_segments_from_samples: detecting speech timestamps in 488789 samples
whisper_vad_detect_speech: detecting speech in 488789 samples
whisper_vad_detect_speech: n_chunks: 955
whisper_vad_detect_speech: props size: 955
whisper_vad_detect_speech: chunk_len: 341 < n_window: 512
whisper_vad_detect_speech: vad time = 996.63 ms processing 488789 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 955 probabilities
whisper_vad_segments_from_probs: Final speech segments after filtering: 0
<no crash>
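For reference, the VAD summary numbers in the has_speech.wav part of this log are internally consistent, which suggests the first request was processed normally. The sketch below only re-derives figures already printed in the log; the 16 kHz sample rate is an assumption based on the "silero-16k" model type and on 488789 samples being reported as 30.5 sec, and the variable names are made up for illustration.

```python
SAMPLE_RATE = 16_000  # assumption: 16 kHz, implied by the log (see note above)

# Figures copied from the log
original_samples = 2_879_829   # "Reduced audio from 2879829 ..."
reduced_samples = 2_327_680    # "... to 2327680 samples"
speech_seconds = 137.48        # "total duration of speech segments"
n_segments = 81                # "detected 81 speech segments"

# Share of the audio removed by VAD, in percent
reduction_pct = 100.0 * (original_samples - reduced_samples) / original_samples

# Length of the audio that is actually passed on for transcription
reduced_seconds = reduced_samples / SAMPLE_RATE

# Whatever is not speech must be the padding placed between adjacent segments
gap_seconds = (reduced_seconds - speech_seconds) / (n_segments - 1)

print(round(reduction_pct, 1), round(reduced_seconds, 2), round(gap_seconds, 2))
```

The derived values line up with the "19.2% reduction" line, the final vad_end of 145.48, and the 0.10 s step between each vad_end and the next vad_start.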