Can whisper be fine-tuned for emoji-encoded emotion recognition? #2196
Unanswered · silasalves asked this question in Q&A
I will skip the story of how symbols became letters only to become symbols again, repeating the cycle. The point is that I know of at least two extensions that digital technology brought to written language:
I am wondering whether it would be possible to use emojis or kaomojis for emotion recognition with Whisper. Take the following conversation as an example:
I looked into emotion datasets and found The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto emotional speech set (TESS), but I don't think they are useful for fine-tuning because they repeat the same phrases with different emotions. For example, RAVDESS recorded 24 actors (12 male, 12 female) speaking "kids are talking by the door" and "dogs are sitting by the door" while modulating their vocal tone to express calm, happy, sad, angry, fearful, surprised, and disgusted. If RAVDESS were used, the database would be made of samples like:
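To make the idea concrete, here is a minimal sketch of how RAVDESS samples could be turned into emoji-augmented target transcripts for a hypothetical Whisper fine-tune. The filename fields (modality-channel-emotion-intensity-statement-repetition-actor) and code tables follow the RAVDESS naming convention; the particular emoji mapping is my own assumption, not part of the dataset.

```python
# Map RAVDESS emotion codes to emojis (mapping is an illustrative choice).
EMOTION_EMOJI = {
    "01": "😐",  # neutral
    "02": "😌",  # calm
    "03": "😄",  # happy
    "04": "😢",  # sad
    "05": "😠",  # angry
    "06": "😨",  # fearful
    "07": "🤢",  # disgust
    "08": "😲",  # surprised
}

# The two spoken statements in RAVDESS, keyed by their filename code.
STATEMENT = {
    "01": "Kids are talking by the door",
    "02": "Dogs are sitting by the door",
}

def emoji_transcript(filename: str) -> str:
    """Build a "<phrase> <emoji>" target transcript from a RAVDESS filename.

    Filenames look like "03-01-06-01-02-01-12.wav", where field 3 is the
    emotion code and field 5 is the statement code.
    """
    parts = filename.removesuffix(".wav").split("-")
    emotion, statement = parts[2], parts[4]
    return f"{STATEMENT[statement]} {EMOTION_EMOJI[emotion]}"

print(emoji_transcript("03-01-06-01-02-01-12.wav"))
# → Dogs are sitting by the door 😨
```

Pairing each audio file with a transcript like this would let the emoji ride along as an ordinary output token during fine-tuning, provided the emojis are in (or added to) the tokenizer's vocabulary.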
I am afraid the regularity of the samples' non-emoji tokens would overtrain the system on those two RAVDESS phrases, for example. There is also the size of these databases, which I believe is nowhere near what a deep-learning approach would need. That being said, I wonder if these databases could be used for a "toy experiment"? I do think that dissociating the vocal emotional content from the linguistic emotional content (i.e., speaking the same phrase with varying emotion) would be desirable for testing how well vocal intonation generalizes; however, I am not so sure about its value for training on vocal intonation.
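One mitigation for the two-phrase regularity (a sketch only, not a tested recipe) would be to mix the small emotion corpus with a larger generic ASR corpus during fine-tuning, so the model keeps seeing varied text while still encountering emoji-labeled samples. The sampling fraction below is an arbitrary assumption:

```python
import random

def mixed_batches(emotion_ds, generic_ds, emotion_frac=0.2, batch_size=8, seed=0):
    """Yield batches where roughly emotion_frac of the samples come from
    the small emotion corpus and the rest from a larger generic ASR corpus.

    emotion_ds / generic_ds are plain sequences of training samples; the
    datasets and ratio here are illustrative placeholders.
    """
    rng = random.Random(seed)
    while True:
        batch = [
            rng.choice(emotion_ds) if rng.random() < emotion_frac
            else rng.choice(generic_ds)
            for _ in range(batch_size)
        ]
        yield batch

# Toy usage with placeholder samples:
emotion = ["kids_happy.wav", "dogs_sad.wav"]
generic = ["librispeech_001.wav", "librispeech_002.wav", "librispeech_003.wav"]
batch = next(mixed_batches(emotion, generic))
```

Whether this actually prevents the model from collapsing onto the two phrases would need to be verified empirically, which is part of why the toy-experiment question above interests me.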
I found some papers on Google Scholar discussing the use of emojis to encode emotions in text -- not necessarily for vocal emotion recognition -- and I just tried insanely-fast-whisper and am impressed by its performance. Now I am looking for a method for human vocal emotion recognition, and I realized it would be great if Whisper could handle both tasks, just as it does with diarization.
Thanks for reading up to the end, and have a great rest of your day and/or night. ( ´ ω ` )ノ゙