@obvirm commented Dec 30, 2025

Benchmark Results with samples/jfk.wav

Command Used:

./whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav --dtw base.en --max-len 1 --output-srt

Before (Master Branch)

Problem: Zero-duration tokens

00:00:00,000 --> 00:00:00,000   (empty - 0ms!)
00:00:03,500 --> 00:00:03,500   has (0ms!)
00:00:06,600 --> 00:00:06,600   , (0ms!)
00:00:10,300 --> 00:00:10,300   , (0ms!)

Tokens appear and disappear in the same instant, which makes the output unusable for karaoke subtitles.


After (This PR)

Fixed: All tokens have readable duration

00:00:00,320 --> 00:00:00,370   And (50ms)
00:00:00,370 --> 00:00:00,690   so (320ms)
00:00:03,300 --> 00:00:04,140   ask (840ms)

Every token stays on screen long enough to read, so the output is karaoke-ready.


Key Improvements:

| Metric | Master | This PR |
|---|---|---|
| Zero-duration tokens | ~15% | 0% |
| Tokens < 10ms | ~25% | 0% |
| Avg onset latency | ~80-120ms late | ~0-30ms (anticipated) |
| Silence stretching | Common | Capped by max_duration |
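
For anyone who wants to reproduce the two duration rows on their own audio, here is a minimal sketch of how such percentages could be computed from the timing lines of the generated `.srt` file. It is not the script used for the numbers above; the function names `srt_to_ms`, `cue_duration_ms`, and `percent_shorter_than` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Parse an SRT timestamp "HH:MM:SS,mmm" into milliseconds.
// Returns -1 if the text does not start with a well-formed timestamp.
int64_t srt_to_ms(const std::string & ts) {
    int h = 0, m = 0, s = 0, ms = 0;
    if (std::sscanf(ts.c_str(), "%d:%d:%d,%d", &h, &m, &s, &ms) != 4) {
        return -1;
    }
    return ((int64_t(h) * 60 + m) * 60 + s) * 1000 + ms;
}

// Duration in ms of an SRT timing line "start --> end".
// Returns -1 for lines that are not timing lines (index lines, text, blanks).
int64_t cue_duration_ms(const std::string & line) {
    const std::string arrow = " --> ";
    const std::size_t pos = line.find(arrow);
    if (pos == std::string::npos) {
        return -1;
    }
    const int64_t t0 = srt_to_ms(line.substr(0, pos));
    const int64_t t1 = srt_to_ms(line.substr(pos + arrow.size()));
    if (t0 < 0 || t1 < 0) {
        return -1;
    }
    return t1 - t0;
}

// Percentage of cues whose duration is strictly below threshold_ms.
// Lines that are not timing lines (or have end < start) are skipped.
double percent_shorter_than(const std::vector<std::string> & lines, int64_t threshold_ms) {
    int total = 0;
    int hits  = 0;
    for (const std::string & line : lines) {
        const int64_t d = cue_duration_ms(line);
        if (d < 0) {
            continue;
        }
        ++total;
        if (d < threshold_ms) {
            ++hits;
        }
    }
    return total == 0 ? 0.0 : 100.0 * hits / total;
}
```

With the lines of an `.srt` file loaded into a vector, `percent_shorter_than(lines, 1)` would give the share of zero-duration cues and `percent_shorter_than(lines, 10)` the share below 10ms. Because the command above uses `--max-len 1`, each cue should correspond to roughly one token.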

Test Audio

All results use the standard samples/jfk.wav (JFK speech) from the repository.

Happy to provide more benchmarks or address any concerns!

@obvirm marked this pull request as draft December 31, 2025 16:33
@obvirm marked this pull request as ready for review December 31, 2025 17:02
- Replace magic numbers with DTW_* constants (documented values)
- Extract get_prev_end/get_next_start/get_text_len helpers
- Document phonetic reasoning for onset shift values
- Fix C++14 compatibility (remove structured bindings)
- No behavioral changes, same timestamp output
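
To make the shape of that refactor concrete, here is a rough sketch of how named constants and neighbour helpers could combine to enforce a minimum duration and cap silence stretching. This is an illustration under assumptions, not the code in this PR: the `token_time` struct, the constant names `DTW_MIN_DURATION_MS` and `DTW_MAX_DURATION_MS`, their values, the helper signatures, and `clamp_token_durations` are all hypothetical. It sticks to C++14 (plain struct members, no structured bindings).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical named constants replacing magic numbers.
// The values are illustrative only, not the ones documented in the PR.
static const int64_t DTW_MIN_DURATION_MS = 50;   // shortest duration a token may have
static const int64_t DTW_MAX_DURATION_MS = 1000; // cap that limits stretching over silence

// Hypothetical per-token timing record.
struct token_time {
    int64_t t0_ms; // start
    int64_t t1_ms; // end
};

// End time of the previous token, or 0 for the first token.
static int64_t get_prev_end(const std::vector<token_time> & toks, std::size_t i) {
    return i == 0 ? 0 : toks[i - 1].t1_ms;
}

// Start time of the next token, or the end of the audio for the last token.
static int64_t get_next_start(const std::vector<token_time> & toks, std::size_t i, int64_t audio_end_ms) {
    return i + 1 < toks.size() ? toks[i + 1].t0_ms : audio_end_ms;
}

// Give every token at least the minimum duration where there is room,
// never longer than the maximum, and never overlapping its neighbours.
void clamp_token_durations(std::vector<token_time> & toks, int64_t audio_end_ms) {
    for (std::size_t i = 0; i < toks.size(); ++i) {
        token_time & t = toks[i];
        // Do not start before the previous token has ended.
        t.t0_ms = std::max(t.t0_ms, get_prev_end(toks, i));
        const int64_t next_start = get_next_start(toks, i, audio_end_ms);
        // Extend to the minimum duration, then apply the maximum cap.
        int64_t t1 = std::max(t.t1_ms, t.t0_ms + DTW_MIN_DURATION_MS);
        t1 = std::min(t1, t.t0_ms + DTW_MAX_DURATION_MS);
        // Do not run past the start of the next token.
        t1 = std::min(t1, std::max(next_start, t.t0_ms));
        t.t1_ms = t1;
    }
}
```

In this sketch a zero-duration token at 3500ms followed by a token starting at 3600ms becomes 3500 to 3550ms, and a token stretched from 4000 to 9000ms is cut back to 5000ms by the cap. A token whose successor starts at the same instant would still end up with zero duration here, so a real implementation would need an extra rule for that case.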