# Speech Recognition with PyTorch and Kubeflow

A complete example demonstrating **PyTorch Distributed Data Parallel (DDP)** training for speech recognition using Google's [Speech Commands](https://huggingface.co/datasets/google/speech_commands) dataset. This project showcases both local development and distributed training on Kubernetes using **Kubeflow Trainer**.

## 🎯 Overview

This repository implements a **Transformer-based neural network** for classifying single-word spoken commands (35 classes) from the Speech Commands v0.02 dataset. The main focus is the comprehensive **`example.ipynb`** notebook, which walks you through:

- Local training and development
- Container setup with Docker
- Distributed training on Kubernetes using Kubeflow
- Prediction with the trained model

## 📋 Quick Start

### 1. Local Environment Setup

**Note**: If you encounter torch installation issues, install PyTorch first:

```bash
pip install torch==2.8
pip install -r requirements.txt
```

### 2. Run the Complete Example

File: `example.ipynb`

This notebook contains everything you need, including:

- Data download and preparation
- Local training examples
- Docker container setup
- Kubernetes cluster creation with Kind
- Kubeflow distributed training
- Prediction with the trained model

## 🏗️ Architecture

### Model Architecture

- **Input**: Mel spectrograms (128 mel bins, 81 time frames)
- **Model**: Transformer encoder (4 layers, 4 attention heads, 128 d_model)
- **Output**: 35-class classification (Speech Commands)
- **Training**: PyTorch DDP with automatic mixed precision
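
For orientation, here is a minimal sketch of the kind of model those numbers describe. The authoritative definition lives in `train_model.py`; the class name and layer wiring below are illustrative:

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Illustrative sketch matching the sizes above; see train_model.py for the real model."""
    def __init__(self, n_mels=128, d_model=128, n_heads=4, n_layers=4, n_classes=35):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)  # project mel bins to the model dimension
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                # x: (batch, time=81, n_mels=128)
        h = self.encoder(self.proj(x))   # (batch, time, d_model)
        return self.head(h.mean(dim=1))  # mean-pool over time -> (batch, 35)
```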

### Dataset

- **Source**: Google Speech Commands Dataset v0.02
- **Size**: 105,829 audio files (2.3GB)
- **Classes**: 35 words including "yes", "no", digits 0-9, directions, etc.
- **Format**: 1-second WAV files at 16kHz
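
If you want to poke at the data directly, `torchaudio` ships a loader for this exact dataset, and its default `MelSpectrogram` settings (`n_fft=400`, `hop_length=200`) yield the 128 x 81 input described above. A quick sketch (the project's own downloader is `prepare-data.py`):

```python
import torchaudio

# Download Speech Commands v0.02 (~2.3 GB) under /data and inspect one clip.
dataset = torchaudio.datasets.SPEECHCOMMANDS("/data", download=True)
waveform, sample_rate, label, *_ = dataset[0]  # 1-second clip at 16 kHz

to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
mel = to_mel(waveform)  # shape (1, 128, 81): 128 mel bins x 81 time frames
print(mel.shape, label)
```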

## 📁 Project Structure

### Core Files

- **`example.ipynb`** - 📓 **Main notebook with complete workflow**
- **`train_model.py`** - 🚂 Standalone training script
- **`predict.py`** - 🔮 Random audio prediction script
- **`prepare-data.py`** - 📥 Dataset download utility

### Infrastructure Files

- **`Dockerfile`** - 🐳 Container setup (PyTorch 2.8.0 + CUDA 12.8)
- **`kind-config.yaml`** - ☸️ Local Kubernetes cluster configuration
- **`kubeflow-runtime-example.yaml`** - 🎛️ Kubeflow runtime definition
- **`requirements.txt`** - 📦 Python dependencies (219 packages)

## 🚀 Usage Examples

### Data Preparation

```bash
python prepare-data.py
```

### Local Training (Single or Multi-GPU)

```bash
# Run with a single GPU
torchrun --nproc-per-node 1 train_model.py

# Run with multiple GPUs
torchrun --nproc-per-node 2 train_model.py
```
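
Under the hood, `torchrun` launches one worker process per GPU and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each. A minimal sketch of the DDP boilerplate a script like `train_model.py` relies on (illustrative, not the actual script; `SpeechTransformer` is the hypothetical model from the sketch above):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

model = SpeechTransformer().to(device)  # hypothetical class from the earlier sketch
model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
# ... training loop with a DistributedSampler goes here ...
dist.destroy_process_group()
```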

### Random Audio Prediction

```bash
python predict.py
```

Sample output:

```
[ 1/10] ✓ File: /data/SpeechCommands/speech_commands_v0.02/left/ae71797c_nohash_0.wav
         True: 'left' | Predicted: 'left' | Confidence: 95.23%

[ 2/10] ✗ File: /data/SpeechCommands/speech_commands_v0.02/yes/ab123cd4_nohash_1.wav
         True: 'yes' | Predicted: 'no' | Confidence: 78.45%
```
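
The confidence column is presumably the top-class softmax probability. An illustrative inference step (not the actual `predict.py`; `model` and `mel` continue the hypothetical sketches above):

```python
import torch

model.eval()
with torch.no_grad():
    x = mel.squeeze(0).transpose(0, 1).unsqueeze(0)  # (1, 128, 81) -> (1, 81, 128), time-major
    probs = torch.softmax(model(x), dim=-1)
    confidence, prediction = probs.max(dim=-1)       # top-class probability and class index
```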

## 🐳 Docker & Kubernetes Setup

### Build Docker Image

```bash
docker build -t speech-recognition-image:0.1 .
```

### Create Local Kubernetes Cluster

```bash
# Create Kind cluster with data volume mounting
kind create cluster --name ml --config kind-config.yaml

# Load Docker image to cluster
kind load docker-image speech-recognition-image:0.1 --name ml
```
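
The repo's `kind-config.yaml` is the source of truth; the key idea is mounting the host's `/data` directory into the Kind node so training pods can reach the dataset. A sketch of what such a config looks like:

```yaml
# Sketch only -- see the repo's kind-config.yaml for the actual configuration.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /data       # on the host machine
        containerPath: /data  # inside the Kind node
```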

### Deploy Kubeflow Runtime

```bash
# Install Kubeflow Trainer operator
export VERSION=v2.0.0
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=${VERSION}"

# Apply custom runtime
kubectl apply -f kubeflow-runtime-example.yaml
```
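
Before submitting jobs, it is worth checking that the operator came up and the runtime is registered. The namespace below is assumed from the upstream Kubeflow Trainer manifests; verify against your install:

```bash
# Operator pods (namespace per the upstream manifests)
kubectl get pods -n kubeflow-system

# Registered training runtimes
kubectl get clustertrainingruntimes
```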

## 📊 Distributed Training with Kubeflow

The **`example.ipynb`** notebook demonstrates distributed training:

```python
from kubeflow.trainer import CustomTrainer, TrainerClient

client = TrainerClient()

# Start distributed training job
job_name = client.train(
    trainer=CustomTrainer(
        func=train_model,
        num_nodes=2,  # Multi-node training
        resources_per_node={
            "cpu": 5,
            "memory": "50Gi",
            # "nvidia.com/gpu": 1,  # Uncomment for GPU
        },
    ),
    runtime=torch_runtime,
)
```
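
Once submitted, the same client can follow the job from the notebook. The method name below is assumed from the Kubeflow Trainer SDK; double-check it against your installed version:

```python
# Stream worker logs until the job finishes (API assumed from the SDK docs).
for line in client.get_job_logs(job_name, follow=True):
    print(line)
```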

## 🔧 Configuration

### Key Parameters

- **Batch Size**: 256 (64 in debug mode)
- **Learning Rate**: 0.001 with linear scaling for distributed training (sketched after this list)
- **Epochs**: 30 (10 in debug mode)
- **Data Split**: 95% train, 3% validation, 2% test
- **Debug Mode**: Set `debug = True` in scripts for faster iteration
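
The linear scaling rule, sketched in its standard form (the exact scaling used is defined in `train_model.py`):

```python
import torch.distributed as dist

# Assumes dist.init_process_group() has already run (see the DDP sketch above).
# The effective batch size is 256 x world_size, so the base LR scales linearly.
base_lr = 0.001
lr = base_lr * dist.get_world_size()  # e.g. 2 workers -> 0.002
```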

### Data Paths

- **Dataset**: `/data/SpeechCommands/speech_commands_v0.02/`
- **Experiments**: `/data/speech-recognition/runs/exp-{timestamp}/`
- **Models**: Saved as `.pth` files with best validation accuracy
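
A sketch of the best-checkpoint logic implied by the last item (names and paths illustrative; the real logic is in `train_model.py`):

```python
import torch

def save_if_best(model, val_acc, best_acc, exp_dir):
    """Persist weights only when validation accuracy improves; returns the new best."""
    if val_acc > best_acc:
        torch.save(model.state_dict(), f"{exp_dir}/best_model.pth")
        return val_acc
    return best_acc
```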

## 📈 Monitoring

### TensorBoard

```bash
tensorboard --logdir=/data/speech-recognition/runs
```

### Kubernetes Logs

```bash
# Get pods
kubectl get pods

# View training logs
kubectl logs <pod-name> -f
```

## 🛠️ Development Workflow

1. **Start with `example.ipynb`** - Complete guided walkthrough
2. **Local development** - Use `train_model.py` for quick iterations
3. **Test predictions** - Run `predict.py` to validate model performance
4. **Scale up** - Deploy to Kubernetes for distributed training

## 🧪 Tested Environments

### Software Requirements

- **Python**: 3.12
- **PyTorch**: 2.8
- **Operating System**: Linux x86

### Hardware Tested

**Kubernetes Environment:**

- **Kind**: v0.30.0 with Kubernetes Server v1.34.0
- Local development cluster for testing

**Production Environments:**

- **AWS**: 2x g4dn.12xlarge instances (4x Tesla T4 GPUs each) with Driver Version 570.172.08, CUDA Version 12.8
- **NVIDIA A6000**: Single card with Driver 535.230.02, CUDA 12.2

### Performance Expectations

- **Accuracy**: ~80% on validation set
- **Loss**: <0.6 after training completion
- **Training Time**: Varies by hardware (use `debug=True` for faster testing on CPU)

### Testing & Validation

- Play WAV files in `example.ipynb` for quick audio verification
- Or use `predict.py` to test random audio samples

## 📝 Notes

- **Data Volume**: The setup uses the `/data` directory, mounted across all containers
- **GPU Support**: Works with both CPU and GPU training (set `debug=True` for CPU-only testing)
- **Reproducibility**: Fixed random seed (41) for consistent results (see the seeding sketch after this list)
- **Production Ready**: Includes model checkpointing, logging, and monitoring
- **Recommended**: Always use `torchrun` to launch `train_model.py`
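
A minimal seeding sketch matching the reproducibility note above (the exact seeding code lives in the training script):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 41) -> None:
    """Fix all relevant RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds CUDA devices when available
```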

## 🤝 Contributing

This is a complete example project demonstrating PyTorch DDP and Kubeflow integration. Feel free to adapt the patterns for your own speech recognition or distributed training projects.

---

**💡 Tip**: Start with the `example.ipynb` notebook - it contains the complete workflow and explains each step in detail!