Trys - Embedded Interjections #1315

jh-modjeski · 2023-05-05T03:01:26Z

I've just implemented a new feature for my Trys project that uses Whisper's word_timestamps to tag interjections and embed them in the text of the primary speaker's transcript. If an interjection is shorter than the pause_len used for non-silence detection, it is embedded in the primary speaker's line. If it is longer than pause_len, it is treated as crosstalk and placed on a new line.
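To make the rule concrete, here is a minimal sketch of how the embed-or-new-line decision could look. It is not the actual Trys code: the `Word` and `Interjection` classes and the `render_line` function are hypothetical names invented for illustration, the word timings are assumed to come from Whisper's word_timestamps, and treating an interjection exactly equal to pause_len as crosstalk is my own assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One word of the primary speaker (hypothetical container)."""
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Interjection:
    """A short utterance by another speaker (hypothetical container)."""
    speaker: str
    text: str
    start: float  # seconds
    end: float    # seconds


def render_line(speaker, words, interjections, pause_len):
    """Return (line_text, crosstalk) for one primary-speaker line.

    Interjections shorter than pause_len are embedded inline at the point
    where they begin; the rest are returned as crosstalk so the caller can
    put them on their own lines.
    """
    embedded, crosstalk = [], []
    for item in interjections:
        if item.end - item.start < pause_len:
            embedded.append(item)
        else:
            # Assumption: a duration equal to pause_len counts as crosstalk.
            crosstalk.append(item)

    embedded.sort(key=lambda item: item.start)
    parts = []
    pending = 0
    for word in words:
        # Emit every embedded interjection that began before this word.
        while pending < len(embedded) and embedded[pending].start < word.start:
            item = embedded[pending]
            parts.append(f"({item.speaker}: {item.text})")
            pending += 1
        parts.append(word.text)
    # Interjections that begin after the last word go at the end of the line.
    for item in embedded[pending:]:
        parts.append(f"({item.speaker}: {item.text})")

    return f"({speaker}): " + " ".join(parts), crosstalk
```

With pause_len set to 1.0, a 0.25-second "Yeah." lands inside the primary speaker's line, while a 2.4-second remark comes back in the crosstalk list.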

Example Output (Lex Fridman podcast with Andrej Karpathy)

00:00:03.136 - 00:00:12.576 (Lex Fridman):  Is there cool small projects like Archive Sanity and so on that you're thinking about that the world, the ML world can anticipate?
00:00:13.981 - 00:00:24.841 (Andrej Karpathy):  There's always some fun side projects. Archive Sanity is one. Basically, there's way too many archive papers. How can I organize it and recommend papers and so on? I transcribed all of your podcasts.
00:00:25.679 - 00:00:39.319 (Lex Fridman):  What did you learn from that experience? From transcribing the process of, like you like consuming audio books and podcasts and so on. And here's a process that achieves closer to human level performance on annotation.
00:00:40.109 - 00:01:20.639 (Andrej Karpathy):  Yeah, well, I definitely was surprised that transcription with OpenAI's Whisper was working so well compared to what I'm familiar with from Siri and a few other systems, I guess. It works so well, and that's what gave me some energy to try it out, and I thought it could be fun to run on podcasts. It's kind of not obvious to me why Whisper is so much better compared to anything else, because I feel like there should be a lot of incentive for a lot of companies to produce transcription systems, and that they've done so over a long time. Whisper is not a super exotic model. It's a transformer. It takes MEL spectrograms and just outputs tokens of text. It's not crazy. The model and everything has been around for a long time. I'm not actually 100% sure why this is.
00:01:20.867 - 00:01:30.207 (Lex Fridman):  It's not obvious to me either. It makes me feel like I'm missing something. (Andrej Karpathy:  I'm missing something.) Yeah, because there is a huge, even at Google and so on, YouTube transcription.
00:01:31.231 - 00:01:31.231 (Andrej Karpathy):  Yeah.
00:01:31.957 - 00:01:35.647 (Lex Fridman):  Yeah, it's unclear, but some of it is also integrating into a bigger system.
00:01:36.609 - 00:01:36.609 (Andrej Karpathy):  Yeah.
00:01:37.567 - 00:01:53.837 (Lex Fridman):  That, so the user interface, how it's deployed and all that kind of stuff. Maybe running it as an independent thing is much easier, like an order of magnitude easier than deploying to a large integrated system like YouTube transcription or anything like meetings. Like Zoom has transcription.
00:01:55.507 - 00:02:25.397 (Lex Fridman):  That's kind of crappy, but creating a interface where it detects the different individual speakers, it's able to display it in compelling ways, run it in real time, all that kind of stuff. Maybe that's difficult. But that's the only explanation I have because I'm currently paying quite a bit for human transcription, human caption, (Andrej Karpathy:  Right.) annotation. And it seems like there's a huge incentive to automate that. It's very confusing.
00:02:25.395 - 00:02:29.415 (Andrej Karpathy):  Yeah. And I think, I mean, I don't know if you looked at some of the whisper transcripts, but they're quite good.
00:02:29.867 - 00:02:51.817 (Lex Fridman):  They're good, and especially in tricky cases. (Andrej Karpathy:  Yeah.) I've seen Whisper's performance on super tricky cases, and it does incredibly well. So I don't know. A podcast is pretty simple. It's high-quality audio, and you're speaking usually pretty clearly. (Andrej Karpathy:  Yeah.) So I don't know. I don't know what OpenAI's plans are either.
00:02:52.191 - 00:03:18.911 (Andrej Karpathy):  Yeah, but there's always like fun projects basically. And stable diffusion also is opening up a huge amount of experimentation, I would say in the visual realm and generating images and videos and movies. (Lex Fridman:  Yeah, videos now.) And so that's going to be pretty crazy. That's going to almost certainly work and it's going to be really interesting when the cost of content creation is going to fall to zero. You used to need a painter for a few months to paint a thing, and now it's going to be speak to your phone to get your video.
00:03:20.030 - 00:03:22.190 (Lex Fridman):  So Hollywood will start using that to generate scenes,
00:03:24.820 - 00:03:32.270 (Lex Fridman):  which completely opens up, yeah, so you can make a movie like Avatar, eventually, for under a million dollars.
00:03:33.224 - 00:03:36.524 (Andrej Karpathy):  much less maybe just by talking to your phone. I mean, I know it sounds kind of crazy.

Trys uses non-silence detection to determine which sections of audio Whisper should transcribe. Each line represents a continuous audible section in which the speaker was never quiet for longer than pause_len. An interjection does not interrupt the primary speaker for longer than pause_len, so it does not get a new line of its own and the transcribed line stays intact. I believe this is far more readable, because we follow the primary speaker's chain of thought to the end.
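As a rough illustration of how pause_len shapes the lines, the sketch below merges one speaker's non-silent intervals whenever the gap between them is no longer than pause_len. How Trys actually detects non-silence is not shown in this post, so the `merge_into_lines` function and its input format (a list of `(start, end)` tuples in seconds) are assumptions made for the example.

```python
def merge_into_lines(intervals, pause_len):
    """Merge (start, end) non-silent intervals into transcript lines.

    Two intervals belong to the same line when the silence between them
    is no longer than pause_len (all values in seconds).
    """
    lines = []
    for start, end in sorted(intervals):
        if lines and start - lines[-1][1] <= pause_len:
            # Short pause: extend the current line.
            lines[-1][1] = max(lines[-1][1], end)
        else:
            # Long pause (or first interval): start a new line.
            lines.append([start, end])
    return [tuple(line) for line in lines]
```

For example, with pause_len of 1.0, the intervals (0.0, 1.0), (1.5, 2.0) and (4.0, 5.0) become two lines: the 0.5-second gap is absorbed, and the 2.0-second gap starts a new line.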
