Making better wakeword datasets #2429
StuartIanNaylor started this conversation in Show and tell
https://github.com/StuartIanNaylor/wake_word_capture
Some back history: my own experience with various open-source wakewords is that they are not particularly accurate and are often prone to false positives as well as false negatives.
I have made an improvement on the above, as nearly all of them seem to provide little more than a simple classification into Wakeword, Unknown words and Noise.
This has several problems: with so few classes there is little cross-entropy to work with, so the model overfits on training, and in terms of features there is a huge class imbalance.
I have fixed this to a certain extent by adding further classes; the above example is 'Computer' via a CRNN from https://github.com/google-research/google-research/tree/master/kws_streaming
What I do is quite simple: first create a language database ('English') with syllable and phoneme tables and counts.
'Unknown' then becomes all words with the same syllable count as the wakeword, as an approximation of the key spectra in MFCC.
With a single-word wakeword there are two extra classes, 'LikeKW1' and 'LikeKW2', which use a phoneme selection to create 'sounds like' matches on the first and last syllables.
These words are excluded from Unknown because of the way softmax works, but they make the training work harder to find distinguishing features.
Then there is a '1syl' class of one-syllable words, to try to force overall edge/texture detection, and one I call 'Phon', which fills the complete KW duration with words concatenated and trimmed in a similar manner to noise, again forcing overall edge/texture detection.
So we end up with Kw, LikeKw1, LikeKw2, 1syl, Phon, NotKw (unknown) and Noise as classifications.
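To make the class construction concrete, here is a minimal sketch of the selection logic, assuming a CMUdict-style lexicon file; the file name, helper names and the three-phone 'sounds like' window are illustrative assumptions, not the repo's actual code.

```python
# Sketch of the class-selection idea over a CMUdict-style lexicon
# ("WORD  P1 P2 ..." per line). Names here are placeholders.
def load_lexicon(path="cmudict.dict"):
    lex = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(";;;"):
                continue
            word, *phones = line.split()
            lex[word.lower()] = phones
    return lex

def syllable_count(phones):
    # CMUdict marks vowels with a stress digit (AH0, IY1, ...), so
    # counting vowel phones approximates the syllable count.
    return sum(1 for p in phones if p[-1].isdigit())

lex = load_lexicon()
kw = "computer"
kw_phones = lex[kw]
kw_syls = syllable_count(kw_phones)

# 'Unknown': every word with the same syllable count as the wakeword.
unknown = [w for w, p in lex.items()
           if w != kw and syllable_count(p) == kw_syls]

# 'Sounds like' classes: share the leading or trailing phones of the KW.
like_kw1 = [w for w in unknown if lex[w][:3] == kw_phones[:3]]
like_kw2 = [w for w in unknown if lex[w][-3:] == kw_phones[-3:]]

# One-syllable words for the '1syl' edge/texture class.
one_syl = [w for w, p in lex.items() if syllable_count(p) == 1]

# LikeKW words are removed from Unknown so softmax treats them as
# separate targets rather than diluted negatives.
unknown = [w for w in unknown if w not in set(like_kw1) | set(like_kw2)]
```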
Now I use TTS and a toy dataset of voices from the following (a generation sketch follows the list):
- Coqui ⓍTTS v2: 870 voices (https://github.com/coqui-ai/TTS)
- EmotiVoice: 1932 voices (https://github.com/netease-youdao/EmotiVoice)
- Piper: 904 voices (https://github.com/k2-fsa/sherpa-onnx)
- Kokoro v1: 53 voices (https://github.com/k2-fsa/sherpa-onnx)
- Kokoro v1.1: 103 voices (https://github.com/k2-fsa/sherpa-onnx)
- Kokoro_en: 11 voices (https://github.com/k2-fsa/sherpa-onnx)
- VCTK: 109 voices (https://github.com/k2-fsa/sherpa-onnx)
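As an illustration of how the clean keyword samples could be batch-generated, here is a hedged sketch using the Piper CLI; the model file and the speaker count are placeholders for whichever multi-speaker voice you have downloaded.

```python
# Sketch: one clean "Computer" clip per Piper speaker id.
import subprocess

MODEL = "en_US-libritts-high.onnx"  # assumed multi-speaker Piper voice
for sid in range(904):
    subprocess.run(
        ["piper", "--model", MODEL, "--speaker", str(sid),
         "--output_file", f"kw/computer_{sid:04d}.wav"],
        input="Computer".encode(),  # piper reads the text from stdin
        check=True,
    )
```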
I say 'toy' as it is just shy of 4,000 wakeword samples, with the 14,000 samples of NotKw setting the class sample size that all classes are augmented up to.
TTS is great because it is key to get clean samples, so you can be accurate when augmenting with noise and reverberation (even if I don't bother with reverb).
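Clean sources make SNR-exact mixing possible; a minimal sketch of that augmentation step, with placeholder file paths, might look like this:

```python
# Mix a clean TTS clip with noise at an exact target SNR.
import numpy as np
import soundfile as sf

def mix_at_snr(clean, noise, snr_db):
    # Tile or trim the noise to the clip length.
    if len(noise) < len(clean):
        noise = np.tile(noise, len(clean) // len(noise) + 1)
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Solve SNR = 10*log10(p_clean / (gain^2 * p_noise)) for gain.
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

clean, sr = sf.read("kw/computer_0000.wav")
noise, _ = sf.read("noise/cafe.wav")
sf.write("aug/computer_0000_snr10.wav", mix_at_snr(clean, noise, 10), sr)
```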
From the phonetic columns, various SQLite GROUP BY clauses try to create an even distribution of phones and augmentation levels, and ensure that each wakeword voice exists across the classifications, so that the edges of the spectra become more important than any textures that different recordings and different datasets might provide.
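As a rough illustration of that balancing idea (the table and column names here are assumptions, not the actual schema in the repo):

```python
# Sample the same number of words from every leading-phone group so no
# single onset dominates a class.
import sqlite3

con = sqlite3.connect("english.db")
rows = con.execute("""
    SELECT first_phone, COUNT(*) AS n
    FROM words
    WHERE syllables = 3
    GROUP BY first_phone
""").fetchall()

per_group = min(n for _, n in rows)  # size of the smallest phone group
balanced = []
for phone, _ in rows:
    balanced += con.execute(
        "SELECT word FROM words WHERE first_phone = ? "
        "ORDER BY RANDOM() LIMIT ?", (phone, per_group)).fetchall()
```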
I have added the resultant TFLite models (streaming and non-streaming versions, plus a quantised version) along with the training logs. I started with the basic Kw, NotKw, Noise, then added the two LikeKw classes, then 1syl and finally Phon, all in separate training runs, and included the logs so you can see the training curves this provides.
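For anyone wanting to try the exported models, a minimal scoring sketch for the non-streaming TFLite file might look like the following; the file name, input shape and label order are assumptions.

```python
import numpy as np
import tensorflow as tf

interp = tf.lite.Interpreter(model_path="crnn_nonstream.tflite")
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]

# One keyword-length window of audio (silence here, just to show the call).
audio = np.zeros((1, inp["shape"][1]), dtype=np.float32)
interp.set_tensor(inp["index"], audio)
interp.invoke()
probs = interp.get_tensor(out["index"])[0]

labels = ["Kw", "LikeKw1", "LikeKw2", "1syl", "Phon", "NotKw", "Noise"]
print(dict(zip(labels, probs.round(3))))
```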
Keyword models work better with more classifications, and your dataset design should try to force an equal distribution of features. More classes also mean there is less chance of softmax triggering not because the keyword has a high feature hit, but simply because hits are extremely low in all the other classes.
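A toy numeric illustration of that softmax point: with the same weak keyword evidence, more classes spread the probability, so a trigger needs a genuinely strong KW hit rather than merely weak hits everywhere else.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

weak = 1.0  # the same weak keyword logit in both cases
print(softmax(np.array([weak, 0.0, 0.0])))            # 3 classes: KW ~0.58
print(softmax(np.array([weak, 0, 0, 0, 0, 0, 0.0])))  # 7 classes: KW ~0.31
```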
It works well, but I stopped at just shy of 4,000 voices, which, given all the prosody that spoken English can have, is obviously hugely overfitted. The problem is that modern TTS, like much of open-source speech tech, has language covered but falls completely short when it comes to dialects, accents and the varied prosody we meet in the wild.
It actually works pretty damn well, but even using cloning TTS, trying to find a source of dialect/accent datasets with good metadata and a good prosody range seems to be a huge struggle.
Otherwise I would have continued and maybe moved on to providing transfer learning so that I can create other wakewords with smaller datasets.
I did start off on this quest, but the lack of usable clean recordings that are not incorrectly labelled (due to problems with forced alignment) has stopped me using the likes of Common Voice and MLCommons, which also lack the metadata to filter and create balanced datasets.
I am pleasantly surprised by the results, as I was expecting less from what is purely an example toy dataset. It seems far more resistant to false positives, accurate to the KW, and works out to at least 3 metres without a microphone array or noise filter.
Classification models just work better with more classes to spread and balance features, and further classes can likely be added: KWs with a similar syllable count but unique phonetics would also give a choice of wakeword in use, with no increase in model size and, from what I can see, no observable compute increase, at least so far.
I have a basic understanding of ML and classification wakeword models, where, with a simple model, results are very dependent on the dataset.
I am wondering, for those of you more knowledgeable out there: are there better methods?
Also, has anyone got any hints on how to get more voices that are not just 'neutral' English, with more representation of the many dialects and nations that use English as a second language (and, as some of you might think, the same goes for those in England :) )?
Does anyone know of a large dialect/prosody dataset of short phonetic-pangram sentences to use with one of the latest cloning TTS systems?
For a true consumer-grade wakeword not to overfit a small number of voices against the wide range of prosody in the wild, many more voices are required. The resultant model, in conjunction with transfer learning, could use much smaller datasets to create alternative wakewords whilst inheriting the base wakeword's accuracy, and this type of dataset can be used with any model.
I wondered, as you guys are far more knowledgeable than I am, if you know of better ways to create clean sample words for wakewords that contain wider prosody and dialects.
Also, maybe you might be interested in a wakeword-creation framework based on transfer learning from a base model? A sketch of what that step could look like is below.
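For what it's worth, here is a hedged Keras sketch of that transfer-learning step: reuse the trained CRNN's feature layers and retrain only a new head on a small dataset for a different wakeword. The file name, layer choice and class count are placeholders.

```python
import tensorflow as tf

# Load the trained base model (file name is a placeholder).
base = tf.keras.models.load_model("crnn_computer.h5")
base.trainable = False  # freeze the learned spectral feature layers

# Reuse the penultimate layer as an embedding and train a fresh head
# for the new wakeword's class set.
features = base.layers[-2].output
new_head = tf.keras.layers.Dense(7, activation="softmax",
                                 name="new_kw_head")(features)
model = tf.keras.Model(base.input, new_head)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(small_new_kw_dataset, epochs=...)  # far fewer samples needed
```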
Coqui likely adds the most valuable voices, as you can force non-native accents by feeding English input words through another language's model, getting the accents of non-native speakers; maybe that is an example of how you could create a prosody-based TTS?
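A minimal sketch of that Coqui trick with ⓍTTS v2, which is multilingual, so English text with a non-English language code (plus a reference voice) yields accented English; the reference clip and language list are placeholders.

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
for lang in ["de", "fr", "es", "pl", "hi"]:
    tts.tts_to_file(text="Computer",
                    speaker_wav="ref_voice.wav",  # any clean reference clip
                    language=lang,
                    file_path=f"kw/computer_accent_{lang}.wav")
```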