- Overview
- Key Features
- Architecture
- Performance Metrics
- Installation
- Dataset Preparation
- Usage
- Training Pipeline
- Results
- Model Details
This project fine-tunes OpenAI's Whisper speech-to-text model for pronunciation-learning applications. The system specializes in accurately transcribing broken words and fragmented speech segments, making it well suited to language-learning scenarios where learners produce partial pronunciations.
Language learners often produce fragmented or partially articulated words during practice. Traditional ASR systems struggle with these scenarios, but our fine-tuned model bridges this gap by:
- Recognizing incomplete word segments with high accuracy
- Supporting pronunciation assessment for language education
- Providing real-time feedback for learners
- Achieving 95% accuracy on fragmented speech data
- ✅ 95% accuracy on broken English word recognition
- ✅ Transformer-based architecture leveraging OpenAI Whisper-Base
- ✅ Transfer learning optimization with custom fine-tuning pipeline
- ✅ Real-world integration with language learning applications
- ✅ Comprehensive evaluation using Word Error Rate (WER) metrics
| Feature | Description |
|---|---|
| Fragmented Speech Recognition | Accurately transcribes partially uttered words and broken speech segments |
| Transfer Learning | Leverages pre-trained Whisper-Base model with custom fine-tuning |
| Low WER | Achieves near-optimal Word Error Rate for pronunciation learning |
| GPU-Accelerated | CUDA-optimized training and inference pipeline |
| Educational Integration | Designed for seamless integration into language learning apps |
- Base Model: OpenAI Whisper-Base (English)
- Architecture: Transformer Encoder-Decoder
- Input: 16kHz audio samples (30s max)
- Output: Text transcription with confidence scores
- Training Strategy: Supervised fine-tuning with cross-entropy loss
The Automatic Speech Recognition (ASR) pipeline consists of three main components:
```
┌──────────────────────────────────────────────┐
│                 ASR PIPELINE                 │
├──────────────────────────────────────────────┤
│                                              │
│  1. Feature Extractor                        │
│     └── Raw Audio → Log-Mel Spectrogram      │
│                                              │
│  2. Whisper Model (Encoder-Decoder)          │
│     ├── Encoder: Spectrogram → Hidden States │
│     └── Decoder: Hidden States → Text Tokens │
│                                              │
│  3. Tokenizer                                │
│     └── Text Tokens → Human-Readable Text    │
│                                              │
└──────────────────────────────────────────────┘
```
Whisper is a Transformer-based encoder-decoder model that performs sequence-to-sequence mapping:
- Feature Extraction: Converts raw audio to log-Mel spectrogram (80 channels, 30s window)
- Encoder: Processes spectrograms to generate hidden state representations
- Decoder: Autoregressively predicts text tokens using encoder states and previous tokens
- Internal Language Model: the decoder doubles as an implicit language model, enabling context-aware transcription
- Loss Function: Cross-entropy objective
- Optimization: AdamW optimizer with learning rate scheduling
- Regularization: Gradient accumulation and warmup steps
| Metric | Value | Notes |
|---|---|---|
| Accuracy | 0% | Base Whisper model on fragmented speech |
| WER | ~100% | Unable to recognize broken words |
| Use Case | ❌ Not suitable | Standard model fails on partial pronunciations |
| Metric | Value | Notes |
|---|---|---|
| Accuracy | 95% | Significant improvement on test set |
| WER | ~5% | Near-optimal word error rate |
| Use Case | ✅ Production-ready | Suitable for educational applications |
| Similarity Threshold | 90% | FuzzyWuzzy matching for evaluation |
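The 90% similarity threshold is applied with FuzzyWuzzy during evaluation. A rough stdlib approximation of that check, using `difflib.SequenceMatcher` (whose ratio behaves similarly to `fuzz.ratio`; the function name here is illustrative):

```python
from difflib import SequenceMatcher

def is_correct(prediction: str, reference: str, threshold: int = 90) -> bool:
    """Count a transcription as correct when its similarity to the reference
    reaches the threshold. The project itself uses fuzzywuzzy.fuzz.ratio;
    difflib's ratio is a close stdlib stand-in."""
    similarity = 100 * SequenceMatcher(None, prediction.lower(), reference.lower()).ratio()
    return similarity >= threshold

print(is_correct("app-le", "app-le"))   # True (exact match)
print(is_correct("apple", "app-le"))    # True (~90.9% similar)
print(is_correct("orange", "app-le"))   # False
```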
```
Training Hyperparameters:
├── Batch Size: 4 (per device)
├── Gradient Accumulation: 4 steps
├── Learning Rate: 1e-5
├── Warmup Steps: 250
├── Max Steps: 1000
├── Optimizer: AdamW
└── Evaluation Strategy: Every 10 steps
```
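With these values, the Hugging Face Trainer's default schedule gives linear warmup followed by linear decay. A minimal sketch of the resulting learning-rate curve (the function name is illustrative, not part of the project):

```python
def lr_at_step(step, base_lr=1e-5, warmup_steps=250, max_steps=1000):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0
    at max_steps -- the HF Trainer default with the hyperparameters above."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

print(lr_at_step(0))     # 0.0
print(lr_at_step(250))   # 1e-05 (peak, end of warmup)
print(lr_at_step(625))   # 5e-06 (halfway through decay)
```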
- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- 16GB+ RAM
- Git
```bash
git clone https://github.com/bilalhameed248/Whisper-Fine-Tuning-For-Pronunciation-Learning.git
cd Whisper-Fine-Tuning-For-Pronunciation-Learning
```
```bash
# Using conda
conda create -n whisper-finetune python=3.8
conda activate whisper-finetune

# Or using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets librosa soundfile
pip install evaluate jiwer tensorboard
pip install fuzzywuzzy python-Levenshtein
pip install pynvml numba
pip install jupyter notebook
```
Verify the installation:

```python
import torch

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
print(f"CUDA Version: {torch.version.cuda}")
```
Organize your audio files in the following structure:
```
data/
├── train/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── metadata.csv
├── validation/
│   ├── audio1.wav
│   └── metadata.csv
└── test/
    ├── audio1.wav
    └── metadata.csv
```
Your metadata.csv should contain:
```csv
file_name,sentence
audio1.wav,broken word example
audio2.wav,partial pronunciation
```
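One way to generate this file programmatically (a sketch using the stdlib `csv` module; the file names and transcripts are the example ones above):

```python
import csv
import os

# Example (file_name, transcript) pairs matching the format above
rows = [
    ("audio1.wav", "broken word example"),
    ("audio2.wav", "partial pronunciation"),
]

os.makedirs("data/train", exist_ok=True)
with open("data/train/metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "sentence"])  # header row
    writer.writerows(rows)
```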
- Format: WAV, MP3, or FLAC
- Sampling Rate: 16kHz (automatically resampled)
- Duration: Up to 30 seconds per clip
- Quality: Clear pronunciation recordings
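A quick sanity check of clips against these requirements can be done with the stdlib `wave` module (a sketch; it only handles WAV files, and Whisper's feature extractor resamples automatically anyway):

```python
import wave

def check_wav(path, expected_rate=16_000, max_seconds=30.0):
    """Return a list of problems with a WAV clip (empty list means OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if rate != expected_rate:
            problems.append(f"sample rate {rate} Hz (will be resampled to {expected_rate})")
        if duration > max_seconds:
            problems.append(f"duration {duration:.1f}s exceeds {max_seconds}s")
    return problems
```

Running this over every file under `data/` before training catches clips that would be silently resampled or truncated.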
```python
from transformers import pipeline, WhisperTokenizer

# Load fine-tuned model
tokenizer = WhisperTokenizer.from_pretrained('./tokenizer/', language="english", task="transcribe")

pipe = pipeline(
    "automatic-speech-recognition",
    model="./whisper-base-languagelab5/checkpoint-500/",
    tokenizer=tokenizer,
    device=0  # GPU device
)

# Transcribe audio
result = pipe("path/to/audio.wav")
print(f"Transcribed Text: {result['text']}")
```
Open the Jupyter notebook:

```bash
jupyter notebook Whisper-small-fine-tuning.ipynb
```
Follow the notebook cells sequentially:
- Setup & Configuration: GPU setup and library imports
- Data Loading: Load and prepare your dataset
- Feature Extraction: Configure Whisper processor
- Model Training: Fine-tune with custom parameters
- Evaluation: Test on validation/test sets
- Inference: Use the trained model
```python
from datasets import Audio

# Resample audio to 16kHz (the sampling rate Whisper expects)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

# Extract log-Mel input features and tokenize the target sentence
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
```
```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
model.generation_config.language = "english"
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
```
```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-languagelab5",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=1e-5,
    warmup_steps=250,
    max_steps=1000,
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=100,
    predict_with_generate=True,      # generate text at eval time so WER can be computed
    metric_for_best_model="wer",
    greater_is_better=False,         # lower WER is better
    load_best_model_at_end=True,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["validate"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()
```
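The `data_collator` passed to the trainer is not shown in this snippet; in the standard Hugging Face Whisper fine-tuning recipe it pads input features and label sequences separately, masking label padding with -100 so the cross-entropy loss ignores those positions. The masking idea in isolation, as a framework-free sketch (the helper name is illustrative):

```python
def pad_labels(label_lists, ignore_index=-100):
    """Pad tokenized label sequences to a common length, using -100 so the
    padded positions are ignored by the cross-entropy loss."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [ignore_index] * (max_len - len(labels)) for labels in label_lists]

print(pad_labels([[50258, 440, 702], [50258, 911]]))
# [[50258, 440, 702], [50258, 911, -100]]
```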
| Evaluation Metric | Before Fine-Tuning | After Fine-Tuning | Improvement |
|---|---|---|---|
| Accuracy | 0% | 95% | +95% |
| Word Error Rate | 100% | 5% | -95% |
| True Predictions | 0/test_size | 95/100 | Significant |
| Similarity Score | <50% | >90% | +40 points |
- ✅ Recognition of Fragmented Words: Successfully transcribes broken pronunciations
- ✅ Context Understanding: Maintains semantic meaning despite incomplete words
- ✅ Low Latency: Real-time transcription capability
- ✅ Robustness: Handles various accents and speech patterns
| Original Word | Base Model Output | Fine-Tuned Output | Match |
|---|---|---|---|
| "app-le" (broken) | "apple" | "app-le" | β |
| "be-au-ti-ful" | "beautiful" | "be-au-ti-ful" | β |
| "pro-nun-ci-a-tion" | "pronunciation" | "pro-nun-ci-a-tion" | β |
- Model Name: Whisper-Base (Fine-Tuned)
- Parameters: ~74M
- Encoder Layers: 6
- Decoder Layers: 6
- Attention Heads: 8
- Embedding Dimension: 512
- Vocabulary Size: 51,865 tokens
- Hardware: NVIDIA GPU (CUDA 11.8+)
- Framework: PyTorch 2.0+
- Training Time: ~2-4 hours (depending on dataset size)
- Memory Requirements: 16GB GPU RAM
```python
import evaluate

metric = evaluate.load("wer")

# Word Error Rate (WER) calculation
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace the -100 loss-mask value with the pad token before decoding
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
```