Prerequisite: 02_Feature_Engineering.md.
Before deep learning became mainstream, these classical models were the core tools of NLP. Their ideas continue to influence modern architectures (e.g., CRF layers are still appended to BERT for NER). Understanding them reveals the arc of NLP's technical evolution.
- 1. Naive Bayes
- 2. Hidden Markov Model (HMM)
- 3. Conditional Random Field (CRF)
- 4. Comparison & Legacy
1. Naive Bayes

The simplest probabilistic classifier: based on Bayes' theorem, it assumes all features (words) are mutually independent.
Bayes' theorem with the independence assumption gives:

$$P(c \mid w_1, \dots, w_n) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

- $P(c)$: Class prior (e.g., 30% of emails are spam)
- $P(w_i \mid c)$: Probability of word $w_i$ given class $c$
- The "naive" assumption: words are conditionally independent given the class (clearly unrealistic, but works surprisingly well in practice)
Input: "Free iPhone giveaway prize"
P(spam | text) ∝ P(spam) × P(Free|spam) × P(iPhone|spam) × P(giveaway|spam) × P(prize|spam)
P(legit | text) ∝ P(legit) × P(Free|legit) × P(iPhone|legit) × P(giveaway|legit) × P(prize|legit)
→ Compare probabilities, pick the larger one
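The comparison above can be sketched in a few lines of Python. This is a minimal illustration with a made-up toy corpus and Laplace smoothing, computed in log space to avoid underflow; it is not a production classifier:

```python
import math
from collections import Counter

# Toy training data (hypothetical examples for illustration): (text, label)
train = [
    ("free prize giveaway", "spam"),
    ("free iphone offer", "spam"),
    ("meeting agenda attached", "legit"),
    ("project status update", "legit"),
]

# Class priors and per-class word frequencies
class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def score(text, c, alpha=1.0):
    """log P(c) + sum of log P(w|c), with Laplace (add-alpha) smoothing."""
    total = sum(word_counts[c].values())
    logp = math.log(class_counts[c] / len(train))
    for w in text.split():
        logp += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
    return logp

def classify(text):
    return max(class_counts, key=lambda c: score(text, c))

print(classify("free prize"))  # → spam
```

Because only word counts are needed, training is a single pass over the data, which is why Naive Bayes is so fast.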
- Spam filtering: The classic Naive Bayes application
- Sentiment analysis: Positive / negative binary classification
- Document classification: News categorization, topic labeling
- Pros: Extremely fast training, requires little data, highly interpretable
- Cons: Independence assumption is too strong; cannot capture word order or word interactions
2. Hidden Markov Model (HMM)
A Hidden Markov Model is a sequence labeling model — given an observed sequence (words), infer the hidden state sequence (labels).
Hidden states (Tags):  DET  NOUN  VERB  ADP  DET  NOUN
                        ↓    ↓     ↓    ↓    ↓    ↓
Observations (Words):  The  cat   sat   on  the  mat
The model is defined by two sets of probabilities:

- Transition probabilities: probability of state-to-state transitions, e.g. $P(\text{NOUN} \mid \text{DET}) = 0.8$ (a determiner is very likely followed by a noun)
- Emission probabilities: probability of a state generating an observation, e.g. $P(\text{"cat"} \mid \text{NOUN}) = 0.02$
| Problem | Description | Algorithm |
|---|---|---|
| Evaluation | Given a model, compute the probability of an observation sequence | Forward Algorithm |
| Decoding | Given observations, find the most likely state sequence | Viterbi Algorithm |
| Learning | Learn model parameters from data | Baum-Welch (EM) |
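The Evaluation problem can be solved with the forward algorithm, which sums over all hidden paths in $O(T \cdot N^2)$ time instead of enumerating the $N^T$ paths. A minimal sketch, using a hypothetical two-state weather model (not from this text) for concreteness:

```python
def forward_prob(obs, states, start_p, trans_p, emit_p):
    """Total probability of an observation sequence under an HMM,
    summing over all possible hidden state paths."""
    # alpha[s] = P(obs[0..t], state at time t = s)
    alpha = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s].get(o, 0.0)
                 for s in states}
    return sum(alpha.values())

# Hypothetical toy model for illustration
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "umbrella": 0.9},
          "Sunny": {"walk": 0.6, "umbrella": 0.4}}

print(forward_prob(["umbrella", "walk"], states, start_p, trans_p, emit_p))  # ≈ 0.199
```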
The Viterbi algorithm finds the globally optimal label path by dynamic programming, rather than greedily choosing the best tag at each position.
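A minimal Viterbi sketch in Python, working in log space for numerical stability. The toy POS-tagging model below is an illustrative assumption, not real estimated probabilities:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Decoding problem: most likely hidden state path (dynamic programming)."""
    def logp(x):
        return math.log(x) if x > 0 else float("-inf")
    # V[t][s] = best log-probability of any path ending in state s at time t
    V = [{s: logp(start_p[s]) + logp(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + logp(trans_p[p][s]))
            V[t][s] = V[t - 1][prev] + logp(trans_p[prev][s]) + logp(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev  # remember the best predecessor
    # Backtrack from the best final state
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy tagging model for illustration
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.8, "NOUN": 0.15, "VERB": 0.05}
trans_p = {
    "DET":  {"DET": 0.01, "NOUN": 0.90, "VERB": 0.09},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.60, "NOUN": 0.30, "VERB": 0.10},
}
emit_p = {
    "DET":  {"the": 0.9},
    "NOUN": {"cat": 0.4, "mat": 0.4},
    "VERB": {"sat": 0.8},
}

print(viterbi(["the", "cat", "sat"], states, start_p, trans_p, emit_p))
# → ['DET', 'NOUN', 'VERB']
```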
- POS Tagging: The classic method for part-of-speech tagging
- Speech Recognition: Sound signal → phonemes → words (HMM dominated ASR for decades)
- Chinese Word Segmentation: jieba's underlying engine uses HMM
- Markov assumption: the current state depends only on the previous state, so long-range dependencies cannot be modeled
- Generative model: models the joint $P(\text{observations}, \text{states})$, not directly $P(\text{states} \mid \text{observations})$
- Limited features: can only use the current observed word as a feature
3. Conditional Random Field (CRF)

CRF directly addresses HMM's limitations: it is a discriminative sequence labeling model that directly models the conditional probability $P(\mathbf{y} \mid \mathbf{x})$ of the label sequence given the observations.
| Aspect | HMM | CRF |
|---|---|---|
| Model type | Generative | Discriminative |
| Features | Current word only | Arbitrary feature functions (surrounding words, affixes, capitalization, dictionary matches, etc.) |
| Independence assumption | Observations are independent | No independence assumption required |
A linear-chain CRF scores label sequences as:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)$$

- $f_k$: Feature functions (e.g., "current word is capitalized AND label is PERSON")
- $\lambda_k$: Feature weights (learned during training)
- $Z(\mathbf{x})$: Normalization constant (the partition function, summing over all possible label sequences)
```python
# CRF can use rich hand-crafted features
features = {
    "current_word": word,
    "lowercase": word.lower(),
    "prefix_3": word[:3],
    "suffix_3": word[-3:],
    "is_capitalized": word[0].isupper(),
    "is_all_caps": word.isupper(),
    "previous_word": prev_word,
    "next_word": next_word,
    "pos_tag": pos_tag,
}
```

- Named Entity Recognition (NER): CRF was the state-of-the-art algorithm for NER
- Chinese Word Segmentation: Character-level CRF segmenters
- Modern usage: BERT + CRF: BERT provides contextual representations, the CRF layer ensures global label consistency (e.g., preventing "B-PER" from being followed directly by "I-LOC")
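The label-consistency constraint mentioned above can be made concrete with a small check. This is a sketch of the BIO-scheme rule that a CRF transition matrix learns (or is constrained) to enforce; `valid_transition` is a hypothetical helper, not a library API:

```python
def valid_transition(prev_tag: str, tag: str) -> bool:
    """BIO constraint: an I- tag must continue an entity of the same type.
    A CRF layer assigns (near-)zero weight to transitions that violate this."""
    if tag.startswith("I-"):
        entity = tag[2:]
        return prev_tag in (f"B-{entity}", f"I-{entity}")
    return True  # O and B-* tags may follow anything

print(valid_transition("B-PER", "I-PER"))  # True
print(valid_transition("B-PER", "I-LOC"))  # False: the inconsistency a CRF layer rules out
```

A per-token softmax over labels cannot express this constraint, because it scores each position independently; the CRF's transition scores couple adjacent labels.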
4. Comparison & Legacy

| Model | Type | Task | Core Strength | Core Limitation |
|---|---|---|---|---|
| Naive Bayes | Classification | Text classification | Simple and fast | Independence assumption |
| HMM | Sequence labeling | POS / Segmentation | Unsupervised training | Limited features |
| CRF | Sequence labeling | NER / Segmentation | Rich features | Requires feature engineering |
Though no longer mainstream, their ideas profoundly shaped modern architectures:
- Naive Bayes' Bayesian thinking → Bayesian Deep Learning
- HMM's sequence modeling → The hidden state concept in RNNs
- CRF's global sequence optimization → BERT-CRF remains a strong NER baseline; Transformer self-attention can be viewed as a "soft" global dependency model
- Viterbi decoding → Precursor to Beam Search
Traditional NLP established the "feature engineering + statistical model" paradigm. Next, we enter the deep learning era — replacing hand-crafted features with continuous vectors, fundamentally changing the rules of the game.
Next: Word Embeddings