Training data information #1807

marisbasha · 2023-11-14T13:04:28Z

Hello @jongwook, could you tell me whether Whisper large-v2 was trained on the AMI speech corpus or TIMIT? I am writing a paper on using Whisper embeddings as a learned similarity, and I need to know this because pretraining on those datasets would make it hard to make assumptions.

jongwook · 2023-11-14T17:31:48Z

Maintainer

Hi, we didn't specifically include those two datasets in training, but some samples from them may have been mixed into the data given the scale of the dataset. I'd still consider them "out-of-distribution".

1 reply

marisbasha Nov 14, 2023
Author

Thanks a lot!
