This README is based on the requirements of the “International Competition of the Military Technical College” (MTC-AIC) AI competition; see the official competition links for more information.
The provided dataset comprises 100 hours of clean and noisy audio recordings in the Egyptian dialect. The goal is to minimize the Word Error Rate (WER) using a suitable model. Our team, Hear to Win, achieved a WER of 26.17% using a small Jasper model from the NeMo toolkit, ranking 12th out of 150+ registered teams.
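For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between a hypothesis and the reference transcript, divided by the reference length. A minimal self-contained sketch in plain Python (libraries such as jiwer, or NeMo's built-in metric, can be used instead):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sad"))  # 1 substitution / 3 words = 0.333...
```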
Figure 1: Diagram of the Conformer Architecture
The Conformer model has several key components:
- Preprocessing: Audio data is first preprocessed by converting raw waveform inputs into spectrogram features, specifically Mel spectrograms. This transformation provides a frequency-based representation that is more effective for ASR tasks (see the sketch after this list).
- Augmentation: Data augmentation techniques, such as SpecAugment, are applied to make the model more robust to variations. SpecAugment randomly masks portions of the spectrogram along the time and frequency dimensions to improve generalization.
- Encoder: The Conformer encoder combines convolutional layers, which capture local dependencies, with self-attention layers, which capture global dependencies. Each encoder layer is structured as feed-forward, self-attention, and convolution blocks.
- CTC/Transducer Decoding: Depending on the model type, the decoder uses either Connectionist Temporal Classification (CTC) or a Transducer setup:
  - CTC aligns input sequences with output labels via frame-based prediction (a minimal greedy-decoding sketch follows the table below).
  - Transducer models (used in the Transducer and Hybrid architectures) combine the encoder output with a prediction-network (decoder) output through a joint network for sequence prediction.
- Output Layer: The final layer maps the encoded features to character or sub-word labels, depending on the configuration.
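As a minimal sketch of the preprocessing and augmentation steps above, the snippet below computes a Mel spectrogram and applies SpecAugment-style masking with torchaudio (the file name, 80-mel resolution, and mask sizes are illustrative assumptions; NeMo applies these steps internally):

```python
import torchaudio

# Load a waveform and convert it to an 80-bin Mel spectrogram.
waveform, sample_rate = torchaudio.load("sample.wav")  # hypothetical file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)

# SpecAugment-style masking: zero out random bands of frequency bins and time steps.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)
augmented = time_mask(freq_mask(mel))

print(mel.shape, augmented.shape)  # (channels, n_mels, frames) for both
```

The table below summarizes the models we trained.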
| Model Name | Type | Label Type | Key Components | Tokens | WER % (epochs) | Notes |
|---|---|---|---|---|---|---|
| Conformer-Hybrid-Transducer-CTC | Hybrid | Character | Encoder: Conformer | 675,790 | Not converged (100) | Combines Transducer & CTC |
| Conformer-CTC-Char | CTC | Character | Encoder: Conformer | 675,790 | 13.2 (40) | Configured from scratch |
| Conformer-Transducer-Char | Transducer | Character | Encoder: Conformer, Decoder | 675,790 | Not converged (80) | RNNTDecoder & Joint reduction |
| Jasper | CTC | Character | Deep CNN Layers | N/A | 24.4 (30) | Deep CNN for ASR |
Note: Some of these models are in the MTC-AIC phase 2 repo.
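To make the CTC rows above concrete, here is a minimal greedy CTC decoding sketch in plain PyTorch (the vocabulary and blank index are illustrative assumptions; NeMo performs this step internally):

```python
import torch

# Hypothetical character vocabulary; index 0 is the CTC blank symbol.
vocab = ["<blank>", "a", "b", "c", " "]

def greedy_ctc_decode(log_probs: torch.Tensor) -> str:
    """Collapse per-frame argmax predictions: merge repeats, then drop blanks."""
    ids = log_probs.argmax(dim=-1).tolist()                      # best label per frame
    collapsed = [i for i, prev in zip(ids, [None] + ids) if i != prev]
    return "".join(vocab[i] for i in collapsed if i != 0)

# Fake (frames x vocab) log-probabilities for demonstration.
frames = torch.log_softmax(torch.randn(50, len(vocab)), dim=-1)
print(greedy_ctc_decode(frames))
```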
To re-test the submitted model using the provided notebook, follow these steps:
- Download the Notebook:
  - Go to inference-script.ipynb.
  - Click on the "Raw" button to download the notebook file.
- Upload to Kaggle:
  - Sign in to your Kaggle account.
  - Navigate to "Kernels" and create a new notebook.
  - Upload the downloaded notebook file.
- Run the Notebook:
  - Execute all cells to produce the transcriptions (a sketch of typical NeMo inference follows this list).
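For orientation, restoring a trained NeMo checkpoint and transcribing audio typically looks like the sketch below (the checkpoint path and audio file names are assumptions; the actual notebook may differ in detail):

```python
import nemo.collections.asr as nemo_asr

# Paths are hypothetical placeholders.
audio_files = ["audio/sample_001.wav", "audio/sample_002.wav"]

# Restore a trained CTC model from a .nemo checkpoint.
model = nemo_asr.models.EncDecCTCModel.restore_from("checkpoints/conformer_ctc_char.nemo")

# Transcribe a batch of audio files; returns one hypothesis per file.
for path, text in zip(audio_files, model.transcribe(audio_files)):
    print(path, "->", text)
```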
- Determining the Training Platform
  - Google Colab: efficient GPUs, but access time is limited.
  - AzureML compute instances and clusters: offer students $100 of free credit for different compute types. However, using a custom virtual environment with Python==3.10 resulted in the notebook reverting to Python 3.9.4, which is incompatible with the NeMo toolkit models requiring Python>=3.10.
  - Kaggle: given the available time (30 hours per user × 3 team members) and total control of the environment, Kaggle was selected as our training platform.
- Data Exploration and Cleaning
  - The dataset contained some null values and HTML tags, which were cleaned (see the first sketch after this list).
- Formatting the Data
  - The data was formatted into the JSON-lines manifest format required by NeMo, with an 80% training, 10% testing, and 10% validation split (a manifest-writing sketch follows this list).
- Model Training
  - Ran a Conformer Transducer model with simple configurations; the setup can be found here (a hedged training sketch also follows this list).
- Training Various Models
  - Trained different models, including CTC, Transducer, Jasper, and Hybrid-Transducer-CTC, with various model dimensions, encoders, decoders, and tokenizers.
  - Adjusted configurations to achieve the best WER; the configurations for each model are documented in their respective notebooks.
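A minimal sketch of the cleaning and manifest-writing steps above (the CSV input and its column names are assumptions about the raw data; `audio_filepath`, `duration`, and `text` are the standard NeMo manifest keys):

```python
import json

import pandas as pd

# Hypothetical raw metadata: one row per clip with "audio", "duration", "transcript" columns.
df = pd.read_csv("train.csv")

# Drop null transcripts and strip leftover HTML tags from the text.
df = df.dropna(subset=["transcript"])
df["transcript"] = df["transcript"].str.replace(r"<[^>]+>", "", regex=True).str.strip()

# Write a NeMo manifest: one JSON object per line with the standard keys.
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for row in df.itertuples():
        entry = {
            "audio_filepath": row.audio,
            "duration": float(row.duration),
            "text": row.transcript,
        }
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```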
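And a hedged sketch of how a model can be trained from a NeMo YAML config (the config path, manifest names, and hyperparameter values are illustrative, not the exact competition settings):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr

# Load a Conformer-CTC character config (path is illustrative).
cfg = OmegaConf.load("conf/conformer_ctc_char.yaml")

# Point the config at our manifests and adjust the encoder size.
cfg.model.train_ds.manifest_filepath = "train_manifest.json"
cfg.model.validation_ds.manifest_filepath = "val_manifest.json"
cfg.model.encoder.d_model = 176  # "small" encoder dimension

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=40)
model = nemo_asr.models.EncDecCTCModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)

model.save_to("conformer_ctc_char.nemo")  # export for later inference
```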
This research provides a comparative analysis of various ASR models on an Egyptian dialect dataset, examining how model architecture and configurations impact performance, specifically in terms of Word Error Rate (WER). Our experiments yielded the following insights and answers to key research questions:
- Impact of Model Dimensions on WER: Models with larger encoder dimensions (512 or higher) demonstrated worse performance and often produced empty predictions for substantial portions of the test data. This issue persisted regardless of the number of epochs, suggesting that large dimensions may lead to overfitting or ineffective learning when trained on datasets of limited size or diversity.
  - Recommendation: Small (176) or medium (256) encoder dimensions are optimal for balancing accuracy and computational efficiency, especially for dialectal ASR tasks.
- Effect of Training Epochs on Model Convergence: Despite extensive training (up to 100 epochs for some models), both the Conformer-Hybrid and Transducer models failed to converge. This may be due to a mismatch between these models' complexity and the size of the dataset, which limits their ability to generalize effectively.
  - Recommendation: For datasets of similar size, simpler architectures like CTC or Jasper, which achieved WERs of 13.2% and 24.4% respectively, are recommended. Complex architectures may require more data or additional regularization techniques to reach convergence.
- Benefits of SpecAugment for Handling Noise: Applying SpecAugment during training improved model robustness to noisy audio tracks, as observed in the models that incorporated this augmentation. By masking portions of the spectrogram, SpecAugment helps the model generalize to varied audio conditions, which is essential for real-world ASR applications with background noise.
  - Recommendation: Incorporate SpecAugment in ASR models aimed at dialectal or noisy datasets to enhance robustness and improve WER.
- Best Performing Architecture: The CTC-based Conformer model, with a character label type and an encoder dimension of 176, achieved the best WER of 13.2% after 40 epochs. This result indicates that in limited-data scenarios, simpler architectures can outperform hybrid or transducer-based models, particularly when optimized with data augmentation and an appropriately sized encoder.
  - Conclusion: The CTC model's simplicity, coupled with an appropriate encoder size and SpecAugment, makes it the most suitable choice for Egyptian dialect ASR.
For future ASR research on dialectal datasets, the findings suggest focusing on small to medium models with effective data augmentation techniques. Investigating other regularization methods or advanced training strategies, such as curriculum learning or semi-supervised training, may further improve convergence in complex architectures. Additionally, increasing the dataset size could enable larger models to fully utilize their capacity, potentially closing the gap between simple and complex model performance.
In conclusion, by identifying the optimal balance between model complexity and dataset constraints, this study provides actionable insights into developing efficient and robust ASR models for dialectal variations.
