Speech Recognition (also known as Automatic Speech Recognition or ASR) is a subfield of artificial intelligence and computational linguistics concerned with converting spoken language into text. ASR systems learn to map acoustic signals to linguistic representations, typically from paired audio and transcripts, and more recently with the help of self-supervised pre-training on unlabeled audio.
The core idea is simple:
- Audio is captured and preprocessed into acoustic features
- A model analyzes these features to identify phonemes, words, or characters
- The output is transcribed text representing what was spoken
ASR is particularly well-suited for problems involving human-computer interaction, accessibility, and content transcription.
- Acoustic Model: Maps audio features to phonetic units. Estimates the probability of observing acoustic features given a sequence of phonemes or words.
- Language Model: Provides linguistic context and grammar. Estimates the probability of word sequences, helping disambiguate similar-sounding words.
- Phonemes: The smallest units of sound in a language. English has approximately 44 phonemes.
- Feature Extraction: Converting raw audio waveforms into meaningful representations like MFCCs or spectrograms.
- Word Error Rate (WER): The standard metric for ASR accuracy. Counts insertions, deletions, and substitutions relative to the ground-truth transcript, divided by the number of words in that transcript.
- Decoder: Combines acoustic and language models to find the most likely text sequence given the audio input.
- Lexicon (Pronunciation Dictionary): Maps words to their phonetic pronunciations.
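Word error rate (WER), the accuracy metric described above, is a word-level edit distance normalized by the reference length. The following is a minimal sketch, assuming whitespace tokenization and no text normalization (casing and punctuation are compared as-is); the function name `word_error_rate` is illustrative and not taken from any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("turn on the lights", "turn of the light")
print(f"WER: {wer:.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.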
- Audio Capture: Record audio signal (typically 16kHz sampling rate)
- Preprocessing: Noise reduction, normalization
- Feature Extraction: Convert to MFCCs, filter banks, or spectrograms
- Acoustic Modeling: HMM-GMM or DNN predicts phoneme probabilities
- Language Modeling: Apply n-gram or neural language model
- Decoding: Search for optimal word sequence (Viterbi algorithm)
- Post-processing: Capitalization, punctuation, formatting
This pipeline dominated ASR from the 1980s through early 2010s.
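The decoding step in this pipeline can be illustrated with a toy Viterbi search. This is a sketch under strong simplifying assumptions: two hypothetical states (`sil`, `speech`) instead of phoneme states, quantized `low`/`high` energy observations instead of acoustic features, and made-up probabilities. A real decoder searches over a far larger graph that also includes the lexicon and language model.

```python
import math


def _log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a non-empty observation list."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: _log(start_p[s]) + _log(emit_p[s][observations[0]]) for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + _log(trans_p[p][s]))
            scores[s] = best[-1][prev] + _log(trans_p[prev][s]) + _log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: all numbers are illustrative.
STATES = ["sil", "speech"]
START = {"sil": 0.8, "speech": 0.2}
TRANS = {"sil": {"sil": 0.7, "speech": 0.3}, "speech": {"sil": 0.2, "speech": 0.8}}
EMIT = {"sil": {"low": 0.9, "high": 0.1}, "speech": {"low": 0.2, "high": 0.8}}

print(viterbi(["low", "low", "high", "high", "low"], STATES, START, TRANS, EMIT))
```

Working in the log domain avoids numerical underflow on long utterances, which is why the probabilities are converted before being summed.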
- Speaker-Dependent: Trained for a specific person (higher accuracy, limited use)
- Speaker-Independent: Works for any speaker (most modern systems)
- Isolated Word: Recognizes single words spoken with pauses
- Continuous Speech: Handles natural flowing speech (more complex)
- On-Device: Processing happens locally (privacy, latency)
- Cloud-Based: Uses server resources (better accuracy, requires internet)
- Command-and-Control: Limited vocabulary (e.g., voice commands)
- Large Vocabulary: Open-domain transcription (dictation, meetings)
Wake Word Detection:
- Audio: continuous microphone input
- Task: detect "Hey Assistant" in real-time
- Features: mel-spectrogram of rolling 2-second window
- Model: small CNN predicts "wake word" vs "background"
The system continuously processes audio. When the model's confidence exceeds a threshold, the device activates.
Key insight: The model learns acoustic patterns without explicit phonetic decomposition—an end-to-end approach.
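The thresholding logic in this example can be sketched independently of the model. The snippet below assumes the CNN has already produced one confidence score per rolling window; the function name, the default threshold, and the moving-average smoothing are illustrative additions, not part of the example above. Smoothing is one common way to keep a single noisy window from activating the device.

```python
from collections import deque


def detect_wake_word(scores, threshold=0.8, smooth_frames=3):
    """Yield window indices at which the smoothed score first crosses the threshold.

    `scores` is an iterable of per-window confidences in [0, 1], assumed to come
    from a wake-word classifier (not implemented here).
    """
    recent = deque(maxlen=smooth_frames)
    active = False
    for index, score in enumerate(scores):
        recent.append(score)
        smoothed = sum(recent) / len(recent)
        if smoothed >= threshold and not active:
            active = True   # fire once per activation, not on every window
            yield index
        elif smoothed < threshold:
            active = False  # re-arm after the score drops again


stream = [0.1, 0.2, 0.9, 0.95, 0.97, 0.3, 0.1, 0.1]
print(list(detect_wake_word(stream)))
```

Raising `threshold` or `smooth_frames` trades missed activations for fewer false alarms; the right balance depends on the device and environment.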
- Waveform: Raw audio signal, represented as amplitude over time. High-dimensional and noisy.
- Spectrogram: Frequency content over time, obtained via the Short-Time Fourier Transform (STFT). The visual representation shows energy across frequencies.
- Mel-Spectrogram: Spectrogram with a mel-scale frequency axis, which mimics human hearing. Compresses high frequencies and expands low frequencies.
- MFCCs (Mel-Frequency Cepstral Coefficients): The most common traditional feature. Represents the "shape" of the spectral envelope. Typically 13-39 coefficients per frame.
Modern end-to-end models (Wav2Vec 2.0) can learn directly from raw audio, eliminating hand-crafted features.
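To make the mel-scale compression concrete, the sketch below converts between Hz and mel and computes band edges that are equally spaced in mel. It assumes the widely used HTK-style formula, mel = 2595 * log10(1 + f / 700); other variants exist, and audio libraries may default to a different one. The function names are illustrative.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mel (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min, f_max):
    """Return n_bands + 2 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


edges = mel_band_edges(n_bands=10, f_min=0.0, f_max=8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(w) for w in widths])  # widths in Hz grow toward high frequencies
```

The growing band widths are the compression of high frequencies in practice: each mel band covers a wider range of Hz as frequency increases.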
- Hidden Markov Models (HMMs): Model the temporal sequence of phonemes. Each phoneme is represented as a sequence of states.
- Gaussian Mixture Models (GMMs): Used within HMM states to model acoustic feature distributions.
- Dynamic Time Warping (DTW): Early template-matching approach. Compares input to stored reference templates while allowing for differences in speaking rate.
- N-gram Language Models: Statistical models estimating word sequence probabilities (bigrams, trigrams, 5-grams).
- HMM-GMM Systems: The dominant approach from the 1990s until around 2010. Combined HMMs for temporal modeling with GMMs for acoustic modeling.
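The statistical language models mentioned above can be illustrated with a tiny bigram model. This sketch assumes whitespace tokenization and add-one (Laplace) smoothing, which is chosen here for brevity; production n-gram models typically use more refined smoothing. The corpus and function names are made up for illustration.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over lowercased, whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


corpus = ["turn the lights on", "write a letter", "turn right here", "the right answer"]
uni, bi = train_bigram_lm(corpus)
for candidate in ("turn right here", "turn write here"):
    print(candidate, round(log_prob(candidate, uni, bi), 3))
```

The two candidates sound alike, so an acoustic model alone may score them similarly; the language model prefers the word sequence it has seen in text.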
Deep Learning revolutionized ASR starting in the early 2010s.
Key innovations:
- DNNs replace GMMs: Deep Neural Networks model acoustic features more effectively
- RNNs / LSTMs: Capture temporal dependencies better than HMMs
- CTC (Connectionist Temporal Classification): Enables end-to-end training without phoneme alignment
- Attention Mechanisms: Allow models to focus on relevant audio segments
- Transformers: Parallelizable architecture for sequence modeling
Neural networks learn:
- Acoustic feature representations
- Phoneme and word probabilities
- Implicit language models
This enables ASR to scale to diverse accents, languages, and acoustic conditions.
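CTC, listed among the innovations above, lets a network emit one label distribution per audio frame, including a special blank label, without a frame-level alignment. At inference time the simplest decoding rule is greedy: take the best label per frame, collapse consecutive repeats, then drop blanks. The sketch below shows only that decoding rule, with a hypothetical four-symbol alphabet and hand-written frame probabilities; it does not implement the CTC training loss.

```python
from itertools import groupby


def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, remove blanks."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    collapsed = [index for index, _ in groupby(best_path)]
    return "".join(alphabet[index] for index in collapsed if index != blank)


# Index 0 is the blank symbol; the remaining entries are output characters.
ALPHABET = ["-", "c", "a", "t"]
FRAMES = [
    [0.1, 0.8, 0.05, 0.05],  # c
    [0.1, 0.7, 0.1, 0.1],    # c (repeat, collapsed)
    [0.9, 0.05, 0.03, 0.02], # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.6, 0.1, 0.2, 0.1],    # blank
    [0.1, 0.1, 0.1, 0.7],    # t
]
print(ctc_greedy_decode(FRAMES, ALPHABET))
```

The blank is what allows genuine double letters: the frame path `l, blank, l` decodes to `ll`, while `l, l` collapses to a single `l`.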
- Listen, Attend and Spell (LAS): Encoder-decoder with attention. Directly outputs characters or words from audio.
- Deep Speech: Simple architecture: CNNs + RNNs + CTC. Demonstrated the power of large-scale data.
- Wav2Vec 2.0: Self-supervised pre-training on unlabeled audio. Fine-tunes with small labeled datasets. Particularly effective for low-resource languages.
- Whisper: Transformer trained on 680,000 hours of multilingual data. Robust to accents, noise, and technical language. Zero-shot capabilities.
- Conformer: Combines convolution and attention to model both local and global dependencies. Used in Google's production systems.
- RNN-Transducer (RNN-T): Streaming-capable end-to-end models. Popular for real-time applications.
- Virtual Assistants: Siri, Alexa, Google Assistant
- Transcription Services: Meeting notes, subtitles, medical dictation
- Accessibility: Voice control for users with disabilities
- Call Centers: Automated customer service and analytics
- Language Learning: Pronunciation feedback
- Voice Search: Hands-free web and app queries
- Smart Home Control: Voice-activated devices
- Automotive Systems: In-car voice commands
ASR is especially powerful where hands-free interaction or real-time transcription is needed.
- Speech Recognition by Machine (acoustic-phonetic approach, 1950s)
- A Maximum Likelihood Approach to Continuous Speech Recognition (HMMs in ASR, 1983)
- Acoustic Modeling using Deep Belief Networks (Mohamed et al., 2012)
- Deep Neural Networks for Acoustic Modeling in Speech Recognition (Hinton et al., 2012)
- Connectionist Temporal Classification (Graves et al., 2006)
- Listen, Attend and Spell (Chan et al., 2016)
- Deep Speech: Scaling up end-to-end speech recognition (Hannun et al., 2014)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020)
- Robust Speech Recognition via Large-Scale Weak Supervision (Whisper paper, Radford et al., 2022)
- Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
- Stanford CS224S: Spoken Language Processing
- CMU 11-751: Speech Recognition and Understanding
- Coursera: Sequence Models (includes speech content)
- Speech and Language Processing, Jurafsky & Martin (chapter on ASR)
- Hugging Face ASR Course
- Fast.ai Practical Deep Learning (includes audio processing)
- Whisper (OpenAI): State-of-the-art pre-trained model
- Wav2Vec 2.0 (Hugging Face Transformers): Fine-tunable pre-trained models
- Kaldi: Traditional ASR toolkit (HMM-based)
- ESPnet: End-to-end speech processing toolkit
- SpeechBrain: PyTorch-based modern ASR toolkit
- DeepSpeech (Mozilla): Open-source end-to-end ASR
- pyAudioAnalysis: Feature extraction and audio processing
- Start with audio visualization before modeling
- Understand MFCCs and spectrograms deeply
- Use pre-trained models (Whisper, Wav2Vec) before training from scratch
- Experiment with different audio preprocessing techniques
- Measure WER, not just training loss
- Test on diverse speakers, accents, and noise conditions
- Learn about audio sampling rates and their impact
- Study the difference between streaming and batch processing
- Ignoring audio quality: Noise and sampling rate greatly affect performance
- Overfitting to clean speech: Real-world audio is messy
- Neglecting language model: Acoustic model alone is insufficient
- Wrong evaluation metrics: WER on test set may not reflect real-world performance
- Not handling silence and background noise: Model must distinguish speech from non-speech
- Ignoring real-time constraints: Streaming vs batch has different requirements
- Accent and dialect bias: Models trained on limited demographics perform poorly on others
- Multimodal Models: Speech + vision + text (e.g., GPT-4 with voice)
- Voice Assistants + LLMs: ASR feeds transcription to language models for understanding
- Speech-to-Speech Translation: ASR → MT → TTS pipeline
- Emotion and Sentiment Analysis: Beyond words, analyzing tone and prosody
- Self-Supervised Learning: Wav2Vec 2.0 approach influencing broader AI
- Continual Learning: Adapting ASR to new speakers and domains without forgetting
ASR is a critical component of natural human-AI interaction and increasingly integrated with large language models.
Each step is intentionally small and self-contained. These can each live in their own folder or repository.
Goal: Build intuition for audio signals and their representations.
- Load a .wav file (use librosa)
- Plot waveform, spectrogram, and mel-spectrogram
- Visualize the difference between speech and silence
- Output: saved plots comparing different representations
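Before reaching for librosa and a plotting library, the speech-versus-silence contrast in this project can be checked numerically. The sketch below is dependency-free and makes several assumptions: it uses the standard-library `wave` module instead of librosa, handles only 16-bit mono PCM, and substitutes a synthetic 440 Hz tone for real speech. All function names are illustrative.

```python
import math
import os
import struct
import tempfile
import wave


def write_test_wav(path, sample_rate=16000):
    """Write 0.5 s of silence followed by 0.5 s of a 440 Hz tone (16-bit mono)."""
    n = sample_rate // 2
    silence = [0] * n
    tone = [int(12000 * math.sin(2 * math.pi * 440 * t / sample_rate)) for t in range(n)]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{2 * n}h", *(silence + tone)))


def read_wav(path):
    """Return (samples scaled to [-1, 1), sample_rate) for a 16-bit mono WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono audio")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples], sample_rate


def frame_rms(samples, frame_length, hop_length):
    """Root-mean-square energy of each full frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_length]) / frame_length)
        for i in range(0, len(samples) - frame_length + 1, hop_length)
    ]


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_test_wav(demo_path)
samples, sample_rate = read_wav(demo_path)
energies = frame_rms(samples, frame_length=400, hop_length=160)
print(f"first frame RMS: {energies[0]:.3f}, last frame RMS: {energies[-1]:.3f}")
```

Plotting `energies` over time, alongside the waveform and spectrogram, makes the silent and active regions easy to compare.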
Goal: Understand traditional ASR features.
- Implement MFCC extraction (or use librosa)
- Compare features from different speakers
- Visualize MFCC coefficients over time
- Experiment with different frame sizes and hop lengths
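When experimenting with frame sizes and hop lengths, it helps to see what each choice implies before extracting features. The sketch below assumes no padding at the signal edges (libraries that center and pad frames will report a slightly different frame count) and an FFT size equal to the frame length. The function name and returned keys are illustrative.

```python
def framing_summary(n_samples, sample_rate, frame_ms, hop_ms):
    """Summarize the effect of a frame size and hop length, given in milliseconds."""
    frame_length = round(sample_rate * frame_ms / 1000)
    hop_length = round(sample_rate * hop_ms / 1000)
    if n_samples < frame_length:
        n_frames = 0
    else:
        n_frames = 1 + (n_samples - frame_length) // hop_length
    return {
        "frame_length": frame_length,  # samples per analysis frame
        "hop_length": hop_length,      # samples between frame starts
        "n_frames": n_frames,          # full frames that fit without padding
        "bin_spacing_hz": sample_rate / frame_length,  # FFT bin spacing
    }


# One second of 16 kHz audio with two different frame sizes.
for frame_ms in (25, 50):
    print(frame_ms, framing_summary(16000, 16000, frame_ms=frame_ms, hop_ms=10))
```

Longer frames give finer frequency resolution but blur short events in time, which is the trade-off this project asks you to explore.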
Goal: Build a simple keyword spotter.
- Collect 50 samples of a target word (e.g., "start")
- Extract mel-spectrograms
- Train a small CNN classifier (target word vs background)
- Test on live microphone input
Goal: Use state-of-the-art pre-trained ASR.
- Install Whisper (OpenAI)
- Build a CLI tool to transcribe audio files
- Compare different model sizes (tiny, base, medium)
- Measure WER on a small test set
- Add timestamp output for subtitles
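For the subtitle step, the transcription output needs to be rendered in a subtitle format such as SRT. The sketch below assumes each segment is a dict with `start` and `end` times in seconds and a `text` string, which is the shape of segment data the Whisper Python package is understood to return; verify the field names against the version you install. The function names are illustrative.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render a list of {'start', 'end', 'text'} dicts as SRT subtitle text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


example_segments = [
    {"start": 0.0, "end": 1.5, "text": " Hello there."},
    {"start": 1.5, "end": 3.25, "text": " Welcome to the meeting."},
]
print(segments_to_srt(example_segments))
```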
Goal: Adapt a pre-trained model to new data.
- Use Hugging Face datasets (Common Voice or LibriSpeech subset)
- Fine-tune a small Wav2Vec 2.0 model
- Compare performance before and after fine-tuning
- Experiment with different learning rates and data sizes
Goal: Understand model limitations and robustness.
- Test Whisper on different accents (collect samples or use datasets)
- Add synthetic noise (white noise, background music)
- Measure WER degradation
- Visualize which phonemes/words are most affected
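To measure WER degradation in a controlled way, noise is usually added at a chosen signal-to-noise ratio (SNR) rather than at an arbitrary volume. The sketch below assumes audio as a list of float samples and Gaussian white noise; the function name and the fixed default seed are illustrative choices made so that runs are repeatable.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with Gaussian white noise at the target SNR in dB."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    signal_power = sum(x * x for x in samples) / len(samples)
    # SNR_dB = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


# Synthetic stand-in for speech: one second of a 440 Hz tone at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
for snr_db in (20, 10, 0):
    noisy = add_white_noise(clean, snr_db)
    print(snr_db, round(max(abs(x) for x in noisy), 2))
```

Sweeping the SNR from high to low and transcribing each version gives a WER-versus-SNR curve, which is easier to interpret than a single noisy test.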
Goal: End-to-end speech application.
- Define 10 commands (e.g., "turn on lights", "play music")
- Use Whisper for transcription
- Parse commands with simple keyword matching or small NLU model
- Trigger mock actions
- Extend with LLM for natural language understanding
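The keyword-matching step can start very small. The sketch below assumes the transcript arrives as plain text from the ASR step and matches it against a hypothetical command table; the command names and keyword sets are made up for illustration, and only three of the ten commands are shown.

```python
import re

# Hypothetical command table: command name -> keywords that must all appear.
COMMANDS = {
    "lights_on": ("turn", "on", "lights"),
    "lights_off": ("turn", "off", "lights"),
    "play_music": ("play", "music"),
}


def parse_command(transcript):
    """Return the first command whose keywords all occur in the transcript, else None."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    for name, keywords in COMMANDS.items():
        if all(keyword in words for keyword in keywords):
            return name
    return None


def run_command(transcript):
    """Trigger a mock action by describing what would happen."""
    command = parse_command(transcript)
    return f"action: {command}" if command else "action: none (not understood)"


print(run_command("Please turn on the lights."))
print(run_command("What's the weather like?"))
```

Matching on a bag of words ignores word order, so a phrase like "turn off the lights on the porch" would match the wrong command. That limitation is a natural motivation for the NLU or LLM extension listed above.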
Goal: Learn by replication.
- Pick one paper (CTC, Listen Attend and Spell, or Deep Speech)
- Reproduce a simplified version on a toy dataset
- Write a short report on what broke and why
- Document differences between paper and implementation
Deep ASR understanding comes from hearing, visualizing, and decoding real audio signals.
- Generated with: GitHub Copilot
- Model family: GPT-4
- Generation role: Educational documentation
- Prompt style: Structured, following reinforcement_learning.md template
- Human edits: None
- Date generated: 1-8-2026
Note: This document follows the structure and style of the existing reinforcement_learning.md documentation to maintain consistency across the AI projects documentation set.
