Distinguish between spoken audio and song lyrics #853
Replies: 3 comments 2 replies
-
Yes! I'm working on live Icecast transcription, using the BBC World Service as a test stream. It is highly usable and stable too: I ran it on CPU for an hour with no gaps between chunks and very few misses in the transcription, and it was entirely feasible to mute the audio and follow the programme in real time by reading. You can give any audio or video source to ffmpeg, send it on to Icecast, and then delay the listening device as needed to sync up with the transcription. Chunking, and the timing of it, was the most challenging part.
VAD capability would be highly useful to stop Whisper from attempting to transcribe a non-speech chunk at all. The only approach I can think of is to run each chunk of audio through a VAD and score it as containing speech or not ... however, sung lyrics also score as speech. To prevent hallucination I would need a boolean returned from something like isSpeech() or isMusic() (a minimal sketch of that gate follows this comment). If a chunk had music at the start and real speech at the end, it would only cost a few seconds of speech. The flow in threads:
Hallucinations: "I'm not broke but you can see the cracks. You can make me perfect again. All because of you. All because of you. All because of you. All because of you. All because of you." And then, after that, on any song it can't understand I get:
Use cases
I actually started out by transcribing NOAA Weather Radio live off an RTL-SDR. There are a couple of use cases there, including translating Paul into other languages and playing the result out via gTTS or something like that. With the transcribed text you could also, in theory, build a barometer from the spoken pressure readings and pull out other weather data too.
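Here is a minimal sketch of that isSpeech() gate, assuming Silero VAD loaded via torch.hub and 16 kHz mono chunk files on disk; as noted above, sung vocals will still register as speech, so this only filters out silence and purely instrumental chunks:

```python
import torch
import whisper

# Load Silero VAD once; torch.hub returns the model plus helper functions.
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = vad_utils

asr_model = whisper.load_model("small")

def is_speech(chunk_path: str, sampling_rate: int = 16000) -> bool:
    """True if the VAD finds any speech-like activity in the chunk."""
    wav = read_audio(chunk_path, sampling_rate=sampling_rate)
    return len(get_speech_timestamps(wav, vad_model, sampling_rate=sampling_rate)) > 0

def transcribe_chunk(chunk_path: str) -> str:
    # Skip chunks the VAD scores as silence/instrumental. Sung vocals still
    # pass this gate, so it does not by itself solve the lyrics problem.
    if not is_speech(chunk_path):
        return ""
    return asr_model.transcribe(chunk_path)["text"]
```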
-
Why not simply extract the vocals from the music with a tool like Demucs?
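A rough sketch of that approach, assuming the demucs CLI is installed and using its two-stem (vocals / no_vocals) mode; the output path below assumes the default htdemucs model, so adjust it if your install uses a different model:

```python
import subprocess
from pathlib import Path

import whisper

def transcribe_vocals(audio_path: str, out_dir: str = "separated") -> str:
    # Run Demucs in two-stem mode so it only produces vocals / no_vocals stems.
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", out_dir, audio_path],
        check=True,
    )
    # Demucs writes to <out_dir>/<model_name>/<track_name>/vocals.wav;
    # "htdemucs" here is an assumption about the default model.
    vocals = Path(out_dir) / "htdemucs" / Path(audio_path).stem / "vocals.wav"

    model = whisper.load_model("small")
    return model.transcribe(str(vocals))["text"]
```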
-
Prompt engineering: "You are listening to a radio station and may encounter music; do not make up the words to the songs." Icecast metadata plus an API lookup could be a workaround, and might even form the basis for training Whisper how to sing. Most Icecast stations will give now-playing information:
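For the now-playing lookup, Icecast 2.4+ exposes a status-json.xsl endpoint; a small sketch (the base URL, mount name, and exact metadata fields are assumptions about a typical setup):

```python
import requests

def now_playing(base_url: str = "http://localhost:8000", mount: str = "/stream") -> str | None:
    """Return the current 'title' metadata for a given Icecast mount, if any."""
    stats = requests.get(f"{base_url}/status-json.xsl", timeout=5).json()
    sources = stats["icestats"].get("source", [])
    if isinstance(sources, dict):  # a single mount comes back as a dict, not a list
        sources = [sources]
    for src in sources:
        if src.get("listenurl", "").endswith(mount):
            return src.get("title")  # typically "Artist - Track" when the source sets metadata
    return None
```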
Then, with that information, look up the lyrics in plain text without any ASR: https://pypi.org/project/lyricsgenius/. Sign up for an account and generate an API token:
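A minimal sketch with lyricsgenius, assuming the now-playing string is in the common "Artist - Track" form (that split is a guess; real station metadata varies):

```python
import lyricsgenius

# Token comes from a Genius API client: https://genius.com/api-clients
genius = lyricsgenius.Genius("YOUR_GENIUS_API_TOKEN")

def lyrics_for(now_playing_title: str) -> str | None:
    # Assume "Artist - Track"; adjust the split for your station's metadata format.
    artist, _, track = now_playing_title.partition(" - ")
    song = genius.search_song(track.strip(), artist.strip())
    return song.lyrics if song else None
```

The looked-up lyrics (or the radio-station prompt above) could then be passed to Whisper through transcribe()'s initial_prompt argument, which biases decoding toward that text.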
-
I've been trying Whisper out on radio broadcasts and the transcripts are pretty accurate, certainly good enough for real-world use with the small or medium model. The major stumbling block in building a useful application on top of this is trying to distinguish, in the Whisper output, between when a radio DJ is speaking and when a song is being played.
I've had a few ideas but all of them seem either a bit flawed or impractical.
On the pre-processing side:
On the post-processing side:
I was wondering if anyone had tried doing anything similar to the above or had any feedback/ideas on the best way to do something like this with Whisper. Is there something I could try by interacting with Whisper at a lower level that might be a better approach?
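On the lower-level / post-processing side, one thing worth trying is the per-segment statistics transcribe() already returns: each segment carries no_speech_prob, avg_logprob, and compression_ratio, and segments covering music or hallucinated lyrics often show a high no_speech_prob, a low avg_logprob, or a high compression_ratio. A rough sketch (the thresholds below just mirror Whisper's own decoding-fallback defaults and would need tuning on real broadcast audio):

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("broadcast_chunk.wav")

for seg in result["segments"]:
    # Thresholds are only starting points (Whisper's own fallback defaults);
    # tune them on labelled DJ-speech vs. music segments from your station.
    suspect = (
        seg["no_speech_prob"] > 0.6
        or seg["avg_logprob"] < -1.0
        or seg["compression_ratio"] > 2.4
    )
    tag = "MUSIC?" if suspect else "SPEECH"
    print(f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {tag}: {seg["text"].strip()}')
```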