A deep learning pipeline for classifying emotions from speech audio using the Wav2Vec2 model with PyTorch and Hugging Face Transformers.
The objective of this project is to classify human emotions from audio signals using pre-trained speech representations. It leverages Wav2Vec2 to extract deep features from raw waveforms and fine-tunes the model for emotion classification on a labeled dataset of emotional speech.
- Emotions Covered: Happy, Sad, Fear, Disgust, Neutral, Angry, PS (pleasant surprise)
- Number of Samples: ~2800 audio files
- Format: `.wav`
- Labels: extracted from the file naming convention.
Example:
- `OAF_happy.wav` → Label: happy
- `YAF_sad.wav` → Label: sad
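The filename-to-label mapping can be sketched with a small helper; the function name is illustrative, not the notebook's exact code:

```python
from pathlib import Path

def label_from_filename(path: str) -> str:
    """Extract the emotion label from a TESS-style filename.

    'OAF_happy.wav' -> 'happy', 'YAF_sad.wav' -> 'sad'.
    """
    # The label is the last underscore-separated token of the file stem.
    return Path(path).stem.split("_")[-1].lower()
```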
| Technology | Purpose |
|---|---|
| Python | Core Programming Language |
| PyTorch | Deep Learning Framework |
| Hugging Face Transformers | Wav2Vec2 for Speech Feature Extraction & Classification |
| Librosa | Audio Processing |
| Matplotlib / Seaborn | Data Visualization |
| Scikit-learn | Metrics & Data Splitting |
- Walk through directory structure.
- Extract labels from filenames.
- Store paths and labels in a Pandas DataFrame.
- Count plot for class distribution.
- Visualization of waveform and spectrograms.
- Labels are mapped to integers.
- Custom `Dataset` class for PyTorch defined.
- Audio processed using Librosa and the Hugging Face `Wav2Vec2Processor`.
- Pre-trained Wav2Vec2 (facebook/wav2vec2-base) loaded.
- Final classification head adjusted for emotion classes.
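Loading the model can be sketched as below; the seven-class label ordering is an assumption (the notebook's integer mapping may differ):

```python
# Assumed alphabetical ordering of the seven emotion labels.
EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "ps", "sad"]
label2id = {label: i for i, label in enumerate(EMOTIONS)}
id2label = {i: label for label, i in label2id.items()}

def build_model(checkpoint: str = "facebook/wav2vec2-base"):
    """Load Wav2Vec2 with a freshly initialised 7-way classification head."""
    # Imported lazily so the label maps above are usable without transformers installed.
    from transformers import Wav2Vec2ForSequenceClassification
    return Wav2Vec2ForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(EMOTIONS),
        label2id=label2id,
        id2label=id2label,
    )
```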
- Hugging Face `Trainer` API used.
- Evaluation metrics: Accuracy, Precision, Recall, F1-score.
- Training arguments configured: epochs, batch size, learning rate.
- Evaluate on test set.
- Compute weighted metrics for imbalanced classes.
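For example, with hypothetical predictions on an imbalanced test split, the weighted average weights each class's score by its support:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical integer-encoded labels: class 2 is over-represented on purpose.
y_true = np.array([0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 0, 2, 2, 1])

report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(f"accuracy: {report['accuracy']:.3f}")
print(f"weighted F1: {report['weighted avg']['f1-score']:.3f}")
```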
- Random test audio sample predicted.
- Outputs both original and predicted emotion.
```
Original Label: happy
Predicted Label: happy
```
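Single-file inference can be sketched as below; `predict_emotion` and its arguments are illustrative (the trained `model`, `processor`, and `id2label` map come from the earlier steps):

```python
import numpy as np

def logits_to_label(logits: np.ndarray, id2label: dict) -> str:
    """Map a logit vector to its emotion name via argmax."""
    return id2label[int(np.argmax(logits))]

def predict_emotion(path: str, model, processor, id2label: dict, sr: int = 16000) -> str:
    # Imported lazily so logits_to_label stays usable on its own.
    import librosa
    import torch
    waveform, _ = librosa.load(path, sr=sr)
    inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_values=inputs["input_values"]).logits
    return logits_to_label(logits[0].numpy(), id2label)
```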
```bash
git clone https://github.com/sabale-37/Speech-Emotion-Recognition.git
cd Speech-Emotion-Recognition
python -m venv venv
source venv/bin/activate    # For Linux/macOS
venv\Scripts\activate       # For Windows
pip install -r requirements.txt
jupyter notebook
```

`requirements.txt`:

```
torch
transformers
datasets
scikit-learn
librosa
matplotlib
seaborn
pandas
numpy
ipython
```
- Hyperparameter tuning via Optuna or Grid Search.
- Augment dataset with noise-robust training.
- Compare transformer-based approaches with CNN-LSTM baselines.
- Real-time inference via Streamlit or Gradio interface.
Feel free to submit issues or PRs. Contributions are welcome!
Narayan Sabale
narayansabale026@gmail.com