Distinguish between spoken audio and song lyrics #853
Replies: 3 comments 2 replies
-
Yes! I'm working on live Icecast transcription, using the BBC World Service as a test stream. It is highly usable and stable too: I ran it on CPU for an hour with no gaps between chunks and very few misses in the transcription, and it was entirely feasible to mute the audio and follow the programme in real time by reading. You can give any audio or video source to ffmpeg, send it on to Icecast, and then delay the listening device as needed to sync up with the transcription. Chunking, and the timing of it, was the most challenging part.
VAD capability would be highly useful to stop Whisper from attempting to transcribe a non-speech chunk at all. The only approach I can think of is to run each chunk of audio through a VAD and score it as containing speech or not ... however, sung lyrics also score as speech. To prevent hallucination I would need a boolean returned from something like isSpeech() or isMusic() (a minimal sketch of that gate follows this comment). If a chunk had music at the start and real speech at the end, it would only cost a few seconds of speech. The flow in threads:
Hallucinations: "I'm not broke but you can see the cracks. You can make me perfect again. All because of you. All because of you. All because of you. All because of you. All because of you." And then, after that, on any song it can't understand I get:
Use cases
I actually started out by transcribing NOAA Weather Radio live off an RTL-SDR. There are a couple of use cases there, including translating Paul into other languages and playing the result out via gTTS or something like that. With the transcribed text you could also, in theory, build a barometer from the spoken pressure readings and pull out other weather data too.
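Here is a minimal sketch of that isSpeech() gate, assuming Silero VAD loaded via torch.hub and 16 kHz mono chunk files on disk; as noted above, sung vocals will still register as speech, so this only filters out silence and purely instrumental chunks:

```python
import torch
import whisper

# Load Silero VAD once; torch.hub returns the model plus helper functions.
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = vad_utils

asr_model = whisper.load_model("small")

def is_speech(chunk_path: str, sampling_rate: int = 16000) -> bool:
    """True if the VAD finds any speech-like activity in the chunk."""
    wav = read_audio(chunk_path, sampling_rate=sampling_rate)
    return len(get_speech_timestamps(wav, vad_model, sampling_rate=sampling_rate)) > 0

def transcribe_chunk(chunk_path: str) -> str:
    # Skip chunks the VAD scores as silence/instrumental. Sung vocals still
    # pass this gate, so it does not by itself solve the lyrics problem.
    if not is_speech(chunk_path):
        return ""
    return asr_model.transcribe(chunk_path)["text"]
```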
-
Why not simply extract the vocals from the music with a tool like Demucs?
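A rough sketch of that approach, assuming the demucs CLI is installed and using its two-stem (vocals / no_vocals) mode; the output path below assumes the default htdemucs model, so adjust it if your install uses a different model:

```python
import subprocess
from pathlib import Path

import whisper

def transcribe_vocals(audio_path: str, out_dir: str = "separated") -> str:
    # Run Demucs in two-stem mode so it only produces vocals / no_vocals stems.
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", out_dir, audio_path],
        check=True,
    )
    # Demucs writes to <out_dir>/<model_name>/<track_name>/vocals.wav;
    # "htdemucs" here is an assumption about the default model.
    vocals = Path(out_dir) / "htdemucs" / Path(audio_path).stem / "vocals.wav"

    model = whisper.load_model("small")
    return model.transcribe(str(vocals))["text"]
```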
-
Prompt engineering: "You are listening to a radio station and may encounter music; do not make up the words to the songs." Icecast metadata plus an API lookup could be a workaround, and might even form the basis for training Whisper how to sing. Most Icecast stations will give now-playing information:
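For the now-playing lookup, Icecast 2.4+ exposes a status-json.xsl endpoint; a small sketch (the base URL, mount name, and exact metadata fields are assumptions about a typical setup):

```python
import requests

def now_playing(base_url: str = "http://localhost:8000", mount: str = "/stream") -> str | None:
    """Return the current 'title' metadata for a given Icecast mount, if any."""
    stats = requests.get(f"{base_url}/status-json.xsl", timeout=5).json()
    sources = stats["icestats"].get("source", [])
    if isinstance(sources, dict):  # a single mount comes back as a dict, not a list
        sources = [sources]
    for src in sources:
        if src.get("listenurl", "").endswith(mount):
            return src.get("title")  # typically "Artist - Track" when the source sets metadata
    return None
```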
Then, with that information, look up the lyrics in plain text without any ASR: https://pypi.org/project/lyricsgenius/. Sign up for an account and generate an API token:
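A minimal sketch with lyricsgenius, assuming the now-playing string is in the common "Artist - Track" form (that split is a guess; real station metadata varies):

```python
import lyricsgenius

# Token comes from a Genius API client: https://genius.com/api-clients
genius = lyricsgenius.Genius("YOUR_GENIUS_API_TOKEN")

def lyrics_for(now_playing_title: str) -> str | None:
    # Assume "Artist - Track"; adjust the split for your station's metadata format.
    artist, _, track = now_playing_title.partition(" - ")
    song = genius.search_song(track.strip(), artist.strip())
    return song.lyrics if song else None
```

The looked-up lyrics (or the radio-station prompt above) could then be passed to Whisper through transcribe()'s initial_prompt argument, which biases decoding toward that text.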
-
I've been trying Whisper out on radio broadcasts and the transcripts are pretty accurate, certainly good enough for real-world use with the small or medium model. The major stumbling block in building a useful application on top of this is trying to distinguish, in the Whisper output, between when a radio DJ is speaking and when a song is being played.
I've had a few ideas but all of them seem either a bit flawed or impractical.
On the pre-processing side:
On the post-processing side:
I was wondering if anyone had tried doing anything similar to the above or had any feedback/ideas on the best way to do something like this with Whisper. Is there something I could try by interacting with Whisper at a lower level that might be a better approach?
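On the lower-level / post-processing side, one thing worth trying is the per-segment statistics transcribe() already returns: each segment carries no_speech_prob, avg_logprob, and compression_ratio, and segments covering music or hallucinated lyrics often show a high no_speech_prob, a low avg_logprob, or a high compression_ratio. A rough sketch (the thresholds below just mirror Whisper's own decoding-fallback defaults and would need tuning on real broadcast audio):

```python
import whisper

model = whisper.load_model("small")
result = model.transcribe("broadcast_chunk.wav")

for seg in result["segments"]:
    # Thresholds are only starting points (Whisper's own fallback defaults);
    # tune them on labelled DJ-speech vs. music segments from your station.
    suspect = (
        seg["no_speech_prob"] > 0.6
        or seg["avg_logprob"] < -1.0
        or seg["compression_ratio"] > 2.4
    )
    tag = "MUSIC?" if suspect else "SPEECH"
    print(f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {tag}: {seg["text"].strip()}')
```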