Can whisper be fine-tuned for emoji-encoded emotion recognition? #2196
Unanswered · silasalves asked this question in Q&A
I will skip the story of how symbols became letters only to become symbols again, repeating the cycle. The point is that I know of at least two extensions that digital technology brought to written language:
I am wondering whether it would be possible to use emojis or kaomojis for emotion recognition with Whisper. Take the following conversation as an example:
I looked into emotion datasets and found The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto emotional speech set (TESS), but I don't think they are useful for fine-tuning because they repeat the same phrases with different emotions. For example, RAVDESS recorded 24 actors (12 male, 12 female) speaking "kids are talking by the door" and "dogs are sitting by the door" while modulating their vocal tone to express calm, happy, sad, angry, fearful, surprised, and disgusted. If RAVDESS were used, the database would be made of samples like:
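To make the idea concrete, here is a minimal sketch of how RAVDESS samples could be turned into emoji-augmented target transcripts for a hypothetical Whisper fine-tune. The filename fields (modality-channel-emotion-intensity-statement-repetition-actor) and code tables follow the RAVDESS naming convention; the particular emoji mapping is my own assumption, not part of the dataset.

```python
# Map RAVDESS emotion codes to emojis (mapping is an illustrative choice).
EMOTION_EMOJI = {
    "01": "😐",  # neutral
    "02": "😌",  # calm
    "03": "😄",  # happy
    "04": "😢",  # sad
    "05": "😠",  # angry
    "06": "😨",  # fearful
    "07": "🤢",  # disgust
    "08": "😲",  # surprised
}

# The two spoken statements in RAVDESS, keyed by their filename code.
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}

def emoji_transcript(filename: str) -> str:
    """Build a "<phrase> <emoji>" target transcript from a RAVDESS filename.

    Filenames look like "03-01-06-01-02-01-12.wav", where field 3 is the
    emotion code and field 5 is the statement code.
    """
    parts = filename.removesuffix(".wav").split("-")
    emotion, statement = parts[2], parts[4]
    return f"{STATEMENT[statement]} {EMOTION_EMOJI[emotion]}"

print(emoji_transcript("03-01-06-01-02-01-12.wav"))
# → Dogs are sitting by the door 😨
```

Pairing each audio file with a transcript like this would let the emoji ride along as an ordinary output token during fine-tuning, provided the emojis are in (or added to) the tokenizer's vocabulary.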
I am afraid the regularity of the samples' non-emoji tokens would overtrain the system on those two RAVDESS phrases, for example. There is also the size of these databases, which I believe is nowhere near what a deep-learning approach would need. That being said, I wonder if these databases could be used for a "toy experiment"? I do think that dissociating the vocal emotional content from the linguistic emotional content (i.e., speaking the same phrase with varying emotion) would be desirable for testing how well vocal intonation generalizes; however, I am not so sure about its value for training on vocal intonation.
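One mitigation for the two-phrase regularity (a sketch only, not a tested recipe) would be to mix the small emotion corpus with a larger generic ASR corpus during fine-tuning, so the model keeps seeing varied text while still encountering emoji-labeled samples. The sampling fraction below is an arbitrary assumption:

```python
import random

def mixed_batches(emotion_ds, generic_ds, emotion_frac=0.2, batch_size=8, seed=0):
    """Yield batches where roughly emotion_frac of the samples come from
    the small emotion corpus and the rest from a larger generic ASR corpus.

    emotion_ds / generic_ds are plain sequences of training samples; the
    datasets and ratio here are illustrative placeholders.
    """
    rng = random.Random(seed)
    while True:
        batch = [
            rng.choice(emotion_ds) if rng.random() < emotion_frac
            else rng.choice(generic_ds)
            for _ in range(batch_size)
        ]
        yield batch

# Toy usage with placeholder samples:
emotion = ["kids_happy.wav", "dogs_sad.wav"]
generic = ["librispeech_001.wav", "librispeech_002.wav", "librispeech_003.wav"]
batch = next(mixed_batches(emotion, generic))
```

Whether this actually prevents the model from collapsing onto the two phrases would need to be verified empirically, which is part of why the toy-experiment question above interests me.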
I found some papers on Google Scholar discussing the use of emojis to encode emotions in text -- not necessarily for vocal emotion recognition -- and I just tried insanely-fast-whisper and am impressed by its performance. Now I am looking for a method for human vocal emotion recognition, and I realized it would be great if Whisper could handle both tasks, just as it does with diarization.
Thanks for reading up to the end, and have a great rest of your day and/or night. ( ´ ω ` )ノ゙