Replies: 1 comment
-
they seem to have recorded it with a different camera/microphone (I meant to say)
I remember once reading an article on The Intercept (theintercept.com) about how, as part of their "metadata", those Alexa and Siri things have entirely flawless speaker diarisation and also accent detection features, which the NSA can tap into. It was only about the kinds of things the NSA does to "justify funds", not about the programming implementation or the math behind it. I don't think those people have any kind of superpowers, so how exactly are they doing that?
Say you download this YouTube video and extract its audio (the speech, the music, and the ambient and disruptive noises, all on the same sound track):
// __ Once Upon a Time in America (1984) Official Trailer #1 - Robert De Niro, James Woods Gangster Drama
https://www.youtube.com/watch?v=LcpCRyNo8T8
"Noodles" (N): I'm not interested in your friends in high places and I don't trust politicians.
"Max" (M): Now, if we listen to you, we would still be rolling drunks for (a?) living.
N: You broke?
M: You'd carry that stink of the streets with you the rest of your life.
N: I like the stink of the streets. It makes me feel good. I like to smell it, it opens up my lungs.
~
This is the corresponding extract from the LcpCRyNo8T8.en.vtt file generated by YouTube's automatic subtitles:
I'm not interested in your friends in
high places and I don't trust
politicians now if we listen to you
would still be rolling drunks for living
you broke to carry that stink of the
streets with you the rest of your life
I like this thing of the streets it
makes me feel good I like to smell it it
opens up my lungs
~
Notice the mistake in the last sentence by "Noodles" ("this thing" instead of "the stink"). There is no speaker diarisation and there are no punctuation marks whatsoever. The last line stands alone as a one-line paragraph because of the pause at the end of the conversation.
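To put a number on mistakes like that one, here is a minimal sketch of a word error rate (WER) calculation. It is not what YouTube or anyone else uses internally; it is just the usual word-level edit distance divided by the length of the reference. The function names `normalize` and `word_error_rate` are my own, and the reference text is my manual transcription of the last sentence, so it is only as good as my ears.

```python
import re


def normalize(text):
    """Lowercase the text and keep only the words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


# Manual transcription of the last sentence (assumed to be correct).
reference = ("I like the stink of the streets it makes me feel good "
             "I like to smell it, it opens up my lungs.")
# The same sentence as it appears in the automatic subtitles.
hypothesis = ("I like this thing of the streets it makes me feel good "
              "I like to smell it it opens up my lungs")

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

Punctuation is thrown away before comparing, so this only measures the words, not the missing full stops and question marks.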
Then I extracted the audio out of the video feed:
$ ffmpeg -i LcpCRyNo8T8.mp4 LcpCRyNo8T8_audio.mp3
$ file LcpCRyNo8T8_audio.mp3
LcpCRyNo8T8_audio.mp3: Audio file with ID3 version 2.4.0, contains:MPEG ADTS, layer III, v1, 64 kbps, 44.1 kHz, Stereo
After trying https://cloud.google.com/speech-to-text and dropping it because they demand a credit card for a "free" trial, I tried
https://www.happyscribe.com/convert-mp3-to-text
which only asked for my email and showed me the smiling face of their CEO before producing a docx whose first lines looked like this:
[00:00:01.530] - Speaker 1: I'm not interested in your friends in high places. And I don't trust politicians.
[00:00:05.180] - Speaker 2: Now, if we listen to you, we'd still be rolling drunks for a living.
[00:00:07.780] - Speaker 3: You broke.
[00:00:08.430] - Speaker 2: He'll carry that stink of the streets with you the rest of your life.
[00:00:10.800] - Speaker 1: I like the stink of the streets. It makes me feel good. I like to smell it and opens up my lungs.
The STT is much better than Google's/YouTube's automatic subtitles. I still wonder how Speaker 1 turned into Speaker 3, and how exactly they could fail to notice that the third line was a question. They make you upload an mp3 file, and the voices of Speaker 1 and "3" don't sound exactly the same because they seem to have recorded it with a different camera. Google/YouTube, however, also have the video and could have exploited heuristics such as the synchronisation between picture and sound to notice that it was the first actor mouthing the first and third sentences. The same multimodal cues could be used to notice when a sentence has likely finished (changes of posture in the actors).
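To make that kind of diarisation slip easier to spot, here is a small sketch that parses lines in the format shown above and tallies the turns per speaker label. The regular expression and the `parse_line` helper are my own guess at the format based on those five lines only, not anything documented by Happy Scribe.

```python
import re
from collections import Counter

# Matches lines such as "[00:00:07.780] - Speaker 3: You broke."
LINE_RE = re.compile(
    r"^\[(\d+):(\d+):(\d+\.\d+)\] - (Speaker \d+): (.*)$"
)


def parse_line(line):
    """Return a dict with start time (seconds), speaker and text, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hours, minutes, seconds, speaker, text = match.groups()
    start = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return {"start": start, "speaker": speaker, "text": text}


transcript = """\
[00:00:01.530] - Speaker 1: I'm not interested in your friends in high places. And I don't trust politicians.
[00:00:05.180] - Speaker 2: Now, if we listen to you, we'd still be rolling drunks for a living.
[00:00:07.780] - Speaker 3: You broke.
[00:00:08.430] - Speaker 2: He'll carry that stink of the streets with you the rest of your life.
[00:00:10.800] - Speaker 1: I like the stink of the streets. It makes me feel good. I like to smell it and opens up my lungs.
"""

turns = [t for t in map(parse_line, transcript.splitlines()) if t is not None]
counts = Counter(turn["speaker"] for turn in turns)

for speaker, n in sorted(counts.items()):
    print(f"{speaker}: {n} turn(s)")
```

A label that shows up for a single two-word turn, as "Speaker 3" does here, is at least a candidate for a second look. Whether it really is a third person can't be decided from the text alone.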
Prosody, the extra musicality in speech, shouldn't be that hard to transcribe as part of the text. Sheet music has been transcribed since voice was written in Sumerian script millennia ago. Also, there are all kinds of lip-syncing apps out there.
All of those aspects make me wonder why STT is not near perfect, and measurably so, in such a way that doubtful segments of text could be marked up for closer scrutiny and turked.
The pieces of those technologies are all out there and/or shouldn't be that hard to implement. Why is that not happening?
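As a rough sketch of what I mean by marking up doubtful segments: assuming the STT engine exposes a per-word confidence score between 0 and 1 (I have not checked what any particular service returns), flagging segments for human review could be as simple as the following. The data structure, the `flag_for_review` name, the threshold and all the confidence values are made up for illustration.

```python
def flag_for_review(segments, threshold=0.6):
    """Return the segments whose weakest word falls below the threshold."""
    flagged = []
    for segment in segments:
        words = segment["words"]
        if not words:
            continue
        weakest = min(confidence for _, confidence in words)
        if weakest < threshold:
            text = " ".join(word for word, _ in words)
            flagged.append({"start": segment["start"],
                            "weakest": weakest,
                            "text": text})
    return flagged


# Hypothetical engine output: (word, confidence) pairs per segment.
# The confidence values are invented, not measured.
segments = [
    {"start": 7.78, "words": [("you", 0.97), ("broke", 0.91)]},
    {"start": 10.80, "words": [("I", 0.98), ("like", 0.96), ("this", 0.41),
                               ("thing", 0.38), ("of", 0.95), ("the", 0.97),
                               ("streets", 0.93)]},
]

flagged = flag_for_review(segments)
for item in flagged:
    print(f'{item["start"]:.2f}s (weakest {item["weakest"]:.2f}): {item["text"]}')
```

Using the weakest word rather than the average is a deliberate choice in this sketch: a single misheard word in a long, otherwise confident sentence would be averaged away.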