Curious about the source of the training data? #21

wooters · 2022-09-21T23:11:56Z

Thank you for this amazing project!

I was wondering about the 680k hours of audio data. After reading through the paper and blog post, I don't think I saw any mention of where the data came from (other than the "from the web" phrase in the blog post). Are you able to say more about this?

I don't mean to look a gift horse in the mouth; I'm just super curious about this. 😄

bryanlyon · 2022-09-24T09:55:32Z

It's pretty clear that at least part of the data is web videos from a major site. In moments of extended silence I often see "Thanks for watching" or "Please like and subscribe" in the output. This is likely because a lot of videos on sites like that include those lines in their subtitles at the very end, even when nobody actually says the words, so the model has learned that in silence it sometimes just needs to toss them in.

1 reply

wooters Sep 26, 2022
Author

Ah, interesting! Thanks!

RicardoAHS · 2022-09-26T20:54:39Z

"Amara.org" has shown up for me in the transcript during some moments of silence.

ephemer · 2023-11-06T04:08:43Z

This makes sense. But how do we stop this from happening?

I'd really like a way to find all of these "silence" statements and stop them from appearing in the transcriptions.

Even ChatGPT (paid version) is affected by this bug since it uses Whisper under the hood.
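One possible workaround, as a rough sketch rather than an official fix: post-filter the transcription segments and drop any that both match a phrase reported in this thread and look like silence. The sketch assumes each segment is a dict with `text` and `no_speech_prob` keys, which is the shape I'd expect from the `segments` list in the result of `whisper`'s `transcribe()`; check that against your version. The function name `drop_silence_filler`, the phrase list, and the 0.5 threshold are all hypothetical choices to tune on your own audio.

```python
import re

# Phrases reported in this thread; extend the list for your own data.
KNOWN_FILLER = (
    "thanks for watching",
    "please like and subscribe",
    "amara.org",
)


def _normalize(text):
    """Lowercase and strip punctuation so "Thanks for watching!" still matches."""
    return re.sub(r"[^a-z0-9. ]+", "", text.lower()).strip()


def drop_silence_filler(segments, no_speech_threshold=0.5):
    """Return segments minus those that look like filler emitted during silence.

    A segment is dropped only when BOTH conditions hold:
      * its text contains one of the KNOWN_FILLER phrases, and
      * its no_speech_prob is at or above the (assumed) threshold.
    Requiring both avoids deleting a real, spoken "thanks for watching".
    """
    kept = []
    for seg in segments:
        text = _normalize(seg["text"])
        looks_like_filler = any(phrase in text for phrase in KNOWN_FILLER)
        likely_silence = seg.get("no_speech_prob", 0.0) >= no_speech_threshold
        if looks_like_filler and likely_silence:
            continue
        kept.append(seg)
    return kept


# Hand-written example segments (not real model output).
example = [
    {"text": " Hello there.", "no_speech_prob": 0.02},
    {"text": " Thanks for watching!", "no_speech_prob": 0.91},
    {"text": " Thanks for watching, said the host.", "no_speech_prob": 0.05},
]
cleaned = drop_silence_filler(example)
```

In this example only the second segment is removed: the third contains the same phrase but has a low `no_speech_prob`, so it is treated as real speech. This only hides known phrases after the fact; it does not stop the model from producing them, and new filler phrases have to be added to the list by hand.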
