How to achieve lyrics and speech detection? #1839

SetoKaiba · 2023-11-25T04:56:59Z

I see that video sites generate lyrics with ♪ around them, for example, "♪Jingle bells♪, ♪jingle bells♪, ♪jingle all the way♪".
And there is no ♪ when it is speech only, for example, "Merry Christmas".
How can I achieve this? Or is there a way to achieve it in combination with another model? Thank you.

NielsMayer · 2023-11-25T06:23:24Z

Try the large or medium model? For example, whisper.cpp outputs the following (see ggml-org/whisper.cpp#1240):

(source: https://www.bbc.co.uk/programmes/m001sdb3)

[00:05:18.720 --> 00:05:24.720]   ♪ ♪
[00:05:24.720 --> 00:05:28.720]   ♪ If I love you ♪
[00:05:28.720 --> 00:05:32.720]   ♪ Every single bright day ♪
[00:05:32.720 --> 00:05:36.720]   ♪ If I love you ♪
[00:05:36.720 --> 00:05:40.720]   ♪ I can make you feel my way ♪
[00:05:40.720 --> 00:05:44.720]   ♪ If I love you ♪
[00:05:44.720 --> 00:05:48.720]   ♪ Every single bright day ♪
[00:05:48.720 --> 00:05:52.720]   ♪ If I love you ♪
[00:05:52.720 --> 00:05:57.720]   ♪ I can make you feel my way ♪
[00:05:57.720 --> 00:06:01.720]   ♪ Hurts me every time ♪
[00:06:01.720 --> 00:06:06.720]   ♪ ♪
[00:06:06.720 --> 00:06:11.720]   ♪ Every day to do something ♪
[00:06:11.720 --> 00:06:16.720]   ♪ ♪
[00:06:16.720 --> 00:06:21.720]   ♪ As she draws my attention ♪
[00:06:21.720 --> 00:06:25.720]   ♪ ♪
[00:06:25.720 --> 00:06:29.720]   ♪ Moving with the dance ♪
[00:06:29.720 --> 00:06:35.720]   ♪ ♪
[00:06:35.720 --> 00:06:39.720]   ♪ If I find a brighter day ♪

I like this one the best, though:

[01:55:24.520 --> 01:55:34.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:55:35.520 --> 01:55:40.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:55:57.520 --> 01:56:02.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:56:03.520 --> 01:56:07.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:56:07.520 --> 01:56:23.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:56:24.520 --> 01:56:28.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:56:51.520 --> 01:56:55.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:56:56.520 --> 01:57:00.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:01.520 --> 01:57:05.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:06.520 --> 01:57:10.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:10.520 --> 01:57:14.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:14.520 --> 01:57:18.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:18.520 --> 01:57:22.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:22.520 --> 01:57:26.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:26.520 --> 01:57:30.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:30.520 --> 01:57:34.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:34.520 --> 01:57:38.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:38.520 --> 01:57:45.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:45.520 --> 01:57:54.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:54.520 --> 01:57:58.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:57:58.520 --> 01:58:12.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:58:12.520 --> 01:58:16.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:58:16.520 --> 01:58:30.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:58:30.520 --> 01:58:34.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:58:34.520 --> 01:58:47.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:58:47.520 --> 01:58:51.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:58:51.520 --> 01:59:08.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:59:09.520 --> 01:59:13.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:59:13.520 --> 01:59:29.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:59:30.520 --> 01:59:34.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:59:34.520 --> 01:59:48.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:59:48.520 --> 01:59:50.040]   (dramatic music)

The above is with whisper.cpp and large-v2, which often outputs more "fun" transcriptions for music than regular whisper...

Regular whisper (or rather faster-whisper in this case) doesn't seem to output as much "fun" stuff, but it is useful nonetheless:

https://rumble.com/v2nvr1w-trainspodder-helping-decode-reggae-lyrics.html

If anybody can explain why different levels of "fun" show up in different whisper variants, please chime in!
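
For what it's worth, output in the whisper.cpp format above can be split into sung and spoken segments with a small post-processing step. Here is a minimal sketch, assuming that text wrapped in ♪ is lyrics, a bare "♪ ♪" is instrumental music, and lines starting with # or wrapped in parentheses are sound descriptions. `Segment`, `classify`, and `parse_transcript` are hypothetical names for illustration, not part of whisper or whisper.cpp.

```python
import re
from typing import List, NamedTuple, Tuple

# Matches lines such as "[00:05:24.720 --> 00:05:28.720]   ♪ If I love you ♪".
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


class Segment(NamedTuple):
    start: str
    end: str
    kind: str  # "lyrics", "music", "description", or "speech"
    text: str


def classify(text: str) -> Tuple[str, str]:
    """Label one transcript line by its markers (heuristic, assumed conventions)."""
    stripped = text.strip()
    if "♪" in stripped:
        inner = stripped.replace("♪", "").strip()
        # Notes with words between them are lyrics; notes alone are music.
        return ("lyrics", inner) if inner else ("music", "")
    if stripped.startswith("#"):
        return ("description", stripped.lstrip("#").strip())
    if len(stripped) >= 2 and stripped[0] == "(" and stripped[-1] == ")":
        return ("description", stripped[1:-1].strip())
    return ("speech", stripped)


def parse_transcript(raw: str) -> List[Segment]:
    """Parse timestamped lines and skip anything that is not a segment line."""
    segments = []
    for line in raw.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        start, end, text = match.groups()
        kind, cleaned = classify(text)
        segments.append(Segment(start, end, kind, cleaned))
    return segments


if __name__ == "__main__":
    sample = """\
[00:05:18.720 --> 00:05:24.720]   ♪ ♪
[00:05:24.720 --> 00:05:28.720]   ♪ If I love you ♪
[01:59:34.520 --> 01:59:48.520]   # DRAMATIC ORCHESTRAL MUSIC
[01:59:48.520 --> 01:59:50.040]   (dramatic music)
"""
    for seg in parse_transcript(sample):
        print(seg.start, seg.end, seg.kind, repr(seg.text))
```

This only relabels what the model already emitted, so it cannot help when the model produces no ♪ markers at all.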

SetoKaiba (Author) · 2023-11-25T06:55:21Z

I did see some fun output with the large model in English. What if the language is not English? Do those sites fine-tune the model themselves?

EtienneAb3d · 2023-11-25T07:15:29Z

Whisper explicitly filters such characters/tokens out. See here:

whisper/whisper/tokenizer.py, line 247 in e58f288:

- ♪♪♪

You can change the list of suppressed characters/tokens with the suppress_tokens option:

whisper/whisper/transcribe.py, line 413 in e58f288:

    parser.add_argument("--suppress_tokens", type=str, default="-1", help="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations")
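
From Python, the same option can be passed to `transcribe()` as a decoding option. A minimal sketch, assuming the `openai-whisper` package is installed and that an empty list switches off the default non-speech suppression (that is how I read the decoding code, so please verify it against your installed version). `build_decode_options`, `transcribe_keep_markers`, and the audio path are hypothetical names for illustration.

```python
from typing import Any, Dict


def build_decode_options(keep_music_markers: bool) -> Dict[str, Any]:
    """Return keyword arguments for transcribe().

    "-1" is the documented default ("suppress most special characters").
    An empty list is assumed here to disable that default suppression.
    """
    return {"suppress_tokens": [] if keep_music_markers else "-1"}


def transcribe_keep_markers(audio_path: str, model_name: str = "large-v2") -> dict:
    """Transcribe a file without suppressing symbols such as ♪."""
    import whisper  # imported lazily; requires the openai-whisper package

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, **build_decode_options(True))


if __name__ == "__main__":
    # "song.mp3" is a placeholder path, not a file from this thread.
    result = transcribe_keep_markers("song.mp3")
    for segment in result["segments"]:
        print(segment["start"], segment["end"], segment["text"])
```

Lifting the suppression only allows the model to emit ♪; whether it actually does depends on the model and the audio.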

SetoKaiba (Author) · 2023-11-25T09:47:13Z

@EtienneAb3d I already tested the whisper CLI without suppress_tokens, and it does transcribe some ♪ in English. But it doesn't work with Chinese audio that mixes singing and speech: everything is just transcribed as normal characters. So I'd like to know whether the sites that support Chinese subtitles for lyrics fine-tuned the model themselves.
