Voice-over Analysis

Soroush Omranpour edited this page Oct 6, 2024 · 2 revisions

Speech Recognition

We use a cache-aware FastConformer (around 114M parameters) trained on large-scale English speech for streaming ASR with a look-ahead of 1040 ms, which makes it suitable for medium-latency streaming applications. The model achieves word error rates of 2.3% and 5.5% on the LibriSpeech test-clean and test-other subsets, respectively.
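Word error rate is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal, dependency-free sketch of the computation (not the evaluation code used for the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution (or match)
        prev = cur
    return prev[len(hyp)] / len(ref)
```

For example, `wer("hello world", "world")` is 0.5: one deletion over a two-word reference.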

Keyword Extraction

KeyBERT is used to extract keywords from the speech transcriptions.
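KeyBERT scores candidate n-grams by the cosine similarity of their embeddings to the document embedding and returns the top-scoring ones. As a dependency-free illustration of the candidate-ranking idea only (relative term frequency stands in for the embedding-similarity score, and the stop-word list is an assumption), a sketch might look like:

```python
import re
from collections import Counter

# A tiny stop-word list standing in for a full English one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "are", "for", "we", "with", "into"}

def extract_keywords(text: str, top_n: int = 5) -> list[tuple[str, float]]:
    """Rank unigram candidates by relative frequency (a stand-in for
    KeyBERT's embedding-similarity score)."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(top_n)]
```

With KeyBERT itself, the equivalent call is `KeyBERT().extract_keywords(text, top_n=5)`, which likewise returns `(keyword, score)` pairs.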

Speech Emotion Recognition

We use the model audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim, a Wav2Vec2-Large-Robust model fine-tuned on MSP-Podcast, which returns arousal/valence/dominance scores.

| Dimension | IEMOCAP | MSP-Podcast |
| --------- | ------- | ----------- |
| Arousal   | 0.66    | 0.74        |
| Valence   | 0.44    | 0.63        |
| Dominance | 0.52    | 0.65        |

Concordance Correlation Coefficient (CCC) is reported for these models as it assesses how well the predicted dimension values agree with the ground truth.
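Unlike plain Pearson correlation, CCC also penalizes systematic differences in mean and scale between predictions and ground truth. A minimal implementation using population statistics, as is standard for CCC:

```python
import statistics

def ccc(x: list[float], y: list[float]) -> float:
    """Concordance correlation coefficient:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    vx = sum((a - mx) ** 2 for a in x) / len(x)
    vy = sum((b - my) ** 2 for b in y) / len(y)
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Perfect agreement gives 1; a constant offset between predictions and targets lowers CCC even when Pearson correlation stays at 1.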

The connection between emotion dimensions and 8 emotion classes is depicted below (based on Russell’s circumplex):

|                 | Low Valence | Neutral Valence | High Valence |
| --------------- | ----------- | --------------- | ------------ |
| Low Arousal     | sad         |                 | calm         |
| Neutral Arousal | disgusted   | neutral         |              |
| High Arousal    | angry (high dominance), fearful (low dominance) | surprised | happy |
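The table above can be read as a lookup from binned dimension values to a class label, with dominance breaking the tie in the high-arousal/low-valence cell. A sketch of that mapping; the low/neutral/high cut-offs (0.33 and 0.66 on a 0-1 scale) are illustrative assumptions, not values from the model card:

```python
def bin3(v: float, lo: float = 0.33, hi: float = 0.66) -> str:
    """Bin a 0-1 dimension score into low / neutral / high.
    The 0.33 / 0.66 cut-offs are illustrative assumptions."""
    return "low" if v < lo else "high" if v > hi else "neutral"

def emotion_class(arousal: float, valence: float, dominance: float) -> str:
    """Map binned arousal/valence (plus dominance for the angry/fearful
    split) to one of the 8 classes in the circumplex table above."""
    grid = {
        ("low", "low"): "sad",
        ("low", "high"): "calm",
        ("neutral", "low"): "disgusted",
        ("neutral", "neutral"): "neutral",
        ("high", "neutral"): "surprised",
        ("high", "high"): "happy",
    }
    a, v = bin3(arousal), bin3(valence)
    if (a, v) == ("high", "low"):
        return "angry" if bin3(dominance) == "high" else "fearful"
    # Fall back to neutral for cells the table leaves empty.
    return grid.get((a, v), "neutral")
```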

Demo

After uploading the voice-over file (1), hit the preprocess button to start the speech recognition process (2). Then, for a more refined view of speech segments, you can change the granularity of the speech intervals by merging those closer than a maximum distance (3). When speech segments are long and each carries a lot of text, you can extract any number of keywords to summarize them (3). To apply the refinements, hit the analyze button (4), and you will see the changes reflected in the plots and text (5).
