Hello, I want to take audio from my workplace and turn it into transcriptions; however, base Whisper doesn't handle it well. So I have been wanting to create my own tokenizer that understands my jargon (things like acronyms) and outputs it correctly. Below I have shown my steps.

I run this with the Hugging Face Trainer, using the generate option. Is it my data size? I have scoured online to try to find some sort of solution, but everything I find just says it works. I am at my wits' end and would appreciate any help getting this tokenizer to learn my jargon.

Thank you in advance :)
Creating the tokenizer
len(tokenizer) == 193
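Roughly, the tokenizer-creation step looks like the following sketch, assuming a small BPE model trained with the `tokenizers` library; the corpus file name and special tokens are placeholders, not the exact setup:

```python
# Sketch: build a ~193-token custom tokenizer with the `tokenizers` library.
# The transcript file and special-token names here are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

tok = Tokenizer(models.BPE(unk_token="<|unk|>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=193,  # matches len(tokenizer) == 193 above
    special_tokens=["<|startoftranscript|>", "<|endoftext|>", "<|unk|>", "<|pad|>"],
)
tok.train(files=["workplace_transcripts.txt"], trainer=trainer)  # hypothetical file

# Wrap it so it exposes the interface the Trainer expects.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    bos_token="<|startoftranscript|>",
    eos_token="<|endoftext|>",
    unk_token="<|unk|>",
    pad_token="<|pad|>",
)
print(len(tokenizer))  # 193
```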
Preprocessing steps
len(train_dataset) == 4000
len(test_dataset) == 1000
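The preprocessing follows the usual Whisper pattern; a sketch, assuming a `datasets` dataset with `audio` and `text` columns at 16 kHz (the column names and checkpoint are assumptions):

```python
# Sketch: turn raw audio/text pairs into input_features and label ids.
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

def prepare_example(batch):
    audio = batch["audio"]
    # Log-Mel spectrogram features for the encoder (Whisper expects 16 kHz).
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Label ids come from the custom tokenizer built above.
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

train_dataset = train_dataset.map(prepare_example, remove_columns=train_dataset.column_names)
test_dataset = test_dataset.map(prepare_example, remove_columns=test_dataset.column_names)
```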
Model Config
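When the vocabulary changes, the model's embeddings and the generation token ids have to be updated to match it, since Whisper's stock config refers to ids from its original ~51k-token vocabulary. A sketch of that wiring, assuming the tokenizer above (the base checkpoint is an assumption):

```python
# Sketch: align whisper-base with the new 193-token vocabulary.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

# Resize the decoder embeddings / output projection to the new vocab size.
model.resize_token_embeddings(len(tokenizer))

# Point the config and generation config at the new special-token ids.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.generation_config.decoder_start_token_id = tokenizer.bos_token_id
model.generation_config.pad_token_id = tokenizer.pad_token_id
model.generation_config.eos_token_id = tokenizer.eos_token_id

# Whisper's forced/suppressed token lists reference the old vocab's ids.
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = None
```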
Hugging Face Trainer
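The training setup, sketched with `Seq2SeqTrainer` and `predict_with_generate` (the "generate option"); the hyperparameters, output directory, and collator here are placeholders:

```python
# Sketch: Seq2SeqTrainer with generation-based evaluation.
import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

class SpeechCollator:
    """Pads log-Mel features and labels separately (standard Whisper pattern)."""
    def __call__(self, features):
        input_features = torch.tensor([f["input_features"] for f in features])
        labels = torch.nn.utils.rnn.pad_sequence(
            [torch.tensor(f["labels"]) for f in features],
            batch_first=True,
            padding_value=-100,  # -100 is ignored by the cross-entropy loss
        )
        return {"input_features": input_features, "labels": labels}

args = Seq2SeqTrainingArguments(
    output_dir="whisper-jargon",  # placeholder
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    num_train_epochs=10,
    eval_strategy="epoch",
    predict_with_generate=True,  # evaluate with model.generate()
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=SpeechCollator(),
    tokenizer=tokenizer,
)
trainer.train()
```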
For a sanity check, I made train and test the same 30 examples to see if the model would completely overfit, but even with train and test set to be identical, it is not overfitting at all. A sketch of that check is below.
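Concretely, the check looks like this, reusing the trainer setup above (a correctly wired model/tokenizer pair should memorize 30 fixed examples almost verbatim):

```python
# Sketch: train and evaluate on the same 30 examples as an overfitting test.
tiny = train_dataset.select(range(30))

overfit_trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tiny,
    eval_dataset=tiny,
    data_collator=SpeechCollator(),
    tokenizer=tokenizer,
)
overfit_trainer.train()
```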
Outputs after second epoch
Replies: 1 comment

any update?