Hey, we ran our state-of-the-art emotion captioning and emotion estimation models over all of Emilia. You can find the resulting scores here, distributed across these five repositories:
https://huggingface.co/datasets/laion/Emilia-with-Emotion-Annotations
https://huggingface.co/datasets/laion/Emilia-with-Emotion-Annotations2
https://huggingface.co/datasets/laion/Emilia-with-Emotion-Annotations3
https://huggingface.co/datasets/laion/Emilia-with-Emotion-Annotations4
https://huggingface.co/datasets/laion/Emilia-with-Emotion-Annotations5
The speaker embeddings mentioned below come from this model: https://huggingface.co/Orange/Speaker-wavLM-tbr
I would suggest a fine-tuning setup that is also conditioned on these speaker embeddings, because they capture the time-independent attributes of a voice that make up a speaker's identity without conflating them with emotion and arousal. This could eventually enable voice cloning without needing a paired reference audio: just take the embedding of the target speaker as conditioning, together with the emotion scores.
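To make the conditioning idea concrete, here is a minimal sketch of how a combined conditioning vector could be built. The embedding dimension, the number of emotion scores, and the `build_conditioning` helper are all illustrative assumptions, not properties of the released models or datasets:

```python
import numpy as np

# Illustrative assumptions -- not taken from the released models:
EMB_DIM = 256      # assumed dimensionality of the speaker embedding
N_EMOTIONS = 8     # assumed number of per-clip emotion scores

def build_conditioning(speaker_emb: np.ndarray, emotion_scores: np.ndarray) -> np.ndarray:
    """Concatenate an L2-normalized speaker embedding with emotion scores.

    The speaker embedding carries the time-independent identity of the
    voice; the emotion scores carry the expressive state. Feeding both
    to the model lets it vary one without disturbing the other.
    """
    emb = speaker_emb / (np.linalg.norm(speaker_emb) + 1e-8)
    return np.concatenate([emb, emotion_scores.astype(np.float64)])

# Usage with random placeholders standing in for real embeddings/scores:
rng = np.random.default_rng(0)
cond = build_conditioning(rng.standard_normal(EMB_DIM), rng.random(N_EMOTIONS))
```

At fine-tuning time, `cond` would replace the paired reference audio as the conditioning input; at inference, swapping in a new target speaker's embedding while keeping the emotion scores fixed is what makes pair-free cloning possible.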
Have fun! :)