# Voice-over Analysis
We use a cache-aware FastConformer (around 114M parameters) trained on large-scale English speech for streaming ASR with a look-ahead of 1040 ms, which makes it suitable for medium-latency streaming applications. The model achieves word error rates of 2.3% and 5.5% on the LibriSpeech test-clean and test-other sets, respectively.
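For reference, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal stand-alone sketch, not part of the pipeline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```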
KeyBERT is used to extract keywords from the speech transcriptions.
We use the model audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, a Wav2Vec2-Large-Robust model fine-tuned on MSP-Podcast, to return arousal/valence/dominance scores.
| | IEMOCAP | MSP-Podcast |
|---|---|---|
| Arousal | 0.66 | 0.74 |
| Valence | 0.44 | 0.63 |
| Dominance | 0.52 | 0.65 |
Concordance Correlation Coefficient (CCC) is reported for these models as it assesses how well the predicted dimension values agree with the ground truth.
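CCC combines the correlation between predictions and ground truth with a penalty for differences in their means and variances. A minimal implementation for illustration:

```python
def ccc(x, y):
    """Concordance Correlation Coefficient between two equal-length series.
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    Equals 1 for perfect agreement, 0 for no agreement."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```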
The connection between emotion dimensions and 8 emotion classes is depicted below (based on Russell’s circumplex):
| | Low Valence | Neutral Valence | High Valence |
|---|---|---|---|
| Low Arousal | sad | calm | |
| Neutral Arousal | disgusted | neutral | |
| High Arousal | angry (high dominance), fearful (low dominance) | surprised | happy |
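The table above can be read as a lookup over binned arousal and valence, with dominance splitting the angry/fearful cell. A minimal sketch; the bin edges (±0.33) and the "neutral" fallback for the table's empty cells are illustrative assumptions, not part of the model:

```python
def to_emotion(arousal, valence, dominance, lo=-0.33, hi=0.33):
    """Map continuous dimension scores (assumed in [-1, 1]) to one of the
    emotion classes per the circumplex table. Bin edges are hypothetical."""
    def bin3(v):
        return "low" if v < lo else ("high" if v > hi else "neutral")

    a, v = bin3(arousal), bin3(valence)
    if (a, v) == ("high", "low"):
        # High arousal + low valence: dominance disambiguates angry vs. fearful.
        return "angry" if dominance > 0 else "fearful"
    table = {
        ("low", "low"): "sad",
        ("low", "neutral"): "calm",
        ("neutral", "low"): "disgusted",
        ("neutral", "neutral"): "neutral",
        ("high", "neutral"): "surprised",
        ("high", "high"): "happy",
    }
    # Cells left empty in the table fall back to "neutral" (an assumption).
    return table.get((a, v), "neutral")
```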
After uploading the voice-over file (1), hit the preprocess button to start the speech recognition process (2). Then, for a refined view of the speech segments, you can change the granularity of the speech intervals by merging those closer than a max distance (3). When speech segments are large and each carries a long text, you can extract any number of keywords to summarize them (3). To apply the refinements, hit the analyze button (4) and you will see the changes in the plots and text (5).
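The merging in step (3) can be pictured as joining adjacent segments whose silence gap is below the chosen max distance. A minimal sketch under that assumption (the `(start, end)` representation is hypothetical):

```python
def merge_segments(segments, max_gap):
    """Merge (start, end) speech segments, in seconds, whose gap to the
    previous merged segment is at most max_gap."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small enough: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Raising `max_gap` yields fewer, longer segments, which is when keyword extraction over each segment's text becomes most useful.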