Voice-over Analysis

Soroush Omranpour edited this page Oct 6, 2024 · 2 revisions

Speech Recognition

We use a cache-aware FastConformer (around 114M parameters) trained on large-scale English speech for streaming ASR with a look-ahead of 1040 ms, which makes it suitable for medium-latency streaming applications. The model achieves word error rates of 2.3% and 5.5% on the LibriSpeech test-clean and test-other subsets, respectively.
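Word error rate is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal, dependency-free sketch of the computation (not the evaluation code used for the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution (or match)
        prev = cur
    return prev[len(hyp)] / len(ref)
```

For example, `wer("hello world", "world")` is 0.5: one deletion over a two-word reference.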

Keyword Extraction

KeyBERT is used to extract keywords from the speech transcriptions.
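KeyBERT scores candidate n-grams by the cosine similarity of their embeddings to the document embedding and returns the top-scoring ones. As a dependency-free illustration of the candidate-ranking idea only (relative term frequency stands in for the embedding-similarity score, and the stop-word list is an assumption), a sketch might look like:

```python
import re
from collections import Counter

# A tiny stop-word list standing in for a full English one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "are", "for", "we", "with", "into"}

def extract_keywords(text: str, top_n: int = 5) -> list[tuple[str, float]]:
    """Rank unigram candidates by relative frequency (a stand-in for
    KeyBERT's embedding-similarity score)."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(top_n)]
```

With KeyBERT itself, the equivalent call is `KeyBERT().extract_keywords(text, top_n=5)`, which likewise returns `(keyword, score)` pairs.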

Speech Emotion Recognition

We use the model audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, a Wav2Vec2-Large-Robust model fine-tuned on MSP-Podcast, which returns arousal/valence/dominance scores.

| Dimension | IEMOCAP | MSP-Podcast |
| --------- | ------- | ----------- |
| Arousal   | 0.66    | 0.74        |
| Valence   | 0.44    | 0.63        |
| Dominance | 0.52    | 0.65        |

Concordance Correlation Coefficient (CCC) is reported for these models as it assesses how well the predicted dimension values agree with the ground truth.
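Unlike plain Pearson correlation, CCC also penalizes systematic differences in mean and scale between predictions and ground truth. A minimal implementation using population statistics, as is standard for CCC:

```python
import statistics

def ccc(x: list[float], y: list[float]) -> float:
    """Concordance correlation coefficient:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    vx = sum((a - mx) ** 2 for a in x) / len(x)
    vy = sum((b - my) ** 2 for b in y) / len(y)
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Perfect agreement gives 1; a constant offset between predictions and targets lowers CCC even when Pearson correlation stays at 1.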

The connection between emotion dimensions and 8 emotion classes is depicted below (based on Russell’s circumplex):

|                 | Low Valence | Neutral Valence | High Valence |
| --------------- | ----------- | --------------- | ------------ |
| Low Arousal     | sad         |                 | calm         |
| Neutral Arousal | disgusted   | neutral         |              |
| High Arousal    | angry (high dominance), fearful (low dominance) | surprised | happy |
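The table above can be read as a lookup from binned dimension values to a class label, with dominance breaking the tie in the high-arousal/low-valence cell. A sketch of that mapping; the low/neutral/high cut-offs (0.33 and 0.66 on a 0-1 scale) are illustrative assumptions, not values from the model card:

```python
def bin3(v: float, lo: float = 0.33, hi: float = 0.66) -> str:
    """Bin a 0-1 dimension score into low / neutral / high.
    The 0.33 / 0.66 cut-offs are illustrative assumptions."""
    return "low" if v < lo else "high" if v > hi else "neutral"

def emotion_class(arousal: float, valence: float, dominance: float) -> str:
    """Map binned arousal/valence (plus dominance for the angry/fearful
    split) to one of the 8 classes in the circumplex table above."""
    grid = {
        ("low", "low"): "sad",
        ("low", "high"): "calm",
        ("neutral", "low"): "disgusted",
        ("neutral", "neutral"): "neutral",
        ("high", "neutral"): "surprised",
        ("high", "high"): "happy",
    }
    a, v = bin3(arousal), bin3(valence)
    if (a, v) == ("high", "low"):
        return "angry" if bin3(dominance) == "high" else "fearful"
    # Fall back to neutral for cells the table leaves empty.
    return grid.get((a, v), "neutral")
```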

Demo

After uploading the voice-over file (1), hit the preprocess button to start the speech recognition process (2). Then, for a more refined view of speech segments, you can change the granularity of the speech intervals by merging those closer than a maximum distance (3). When speech segments are long and each carries a lot of text, you can extract any number of keywords to summarize them (3). To apply the refinements, hit the analyze button (4), and you will see the changes reflected in the plots and text (5).
