Prerequisite: 01_Text_Preprocessing.md.
Before deep learning, the core challenge of NLP was: how to convert text into fixed-length numerical vectors that can be fed into traditional machine learning models. The methods from this era are based on statistical counting — simple but still widely used in lightweight scenarios today.
Bag of Words (BoW) is the simplest text representation: treat a document as a collection of words, ignoring word order, and count frequencies.
Document 1: "the cat sat on the mat"
Document 2: "the dog sat on the log"
Vocabulary: [the, cat, sat, on, mat, dog, log]
Document 1 vector: [2, 1, 1, 1, 1, 0, 0]
Document 2 vector: [2, 0, 1, 1, 0, 1, 1]
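The example above can be sketched in a few lines of pure Python (the helper name `bag_of_words` is just for illustration; it builds the vocabulary in first-seen order to match the vectors shown):

```python
from collections import Counter

def bag_of_words(docs):
    """Build a shared vocabulary and a count vector for each document."""
    # Vocabulary in first-seen order, matching the example above
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    # One count vector per document, aligned to the vocabulary
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vecs = bag_of_words(["the cat sat on the mat",
                            "the dog sat on the log"])
print(vocab)    # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']
print(vecs[0])  # [2, 1, 1, 1, 1, 0, 0]
print(vecs[1])  # [2, 0, 1, 1, 0, 1, 1]
```

In practice, libraries such as scikit-learn's `CountVectorizer` do the same thing with tokenization and sparse storage built in.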
- Pros: Simple implementation, fast computation, highly interpretable
- Cons:
- Loses word order ("dog bites man" and "man bites dog" have the same representation)
- High-dimensional and sparse (vocabulary can reach tens of thousands, mostly zeros)
- No semantic capture ("happy" and "glad" have completely orthogonal vectors)
TF-IDF (Term Frequency - Inverse Document Frequency) adds a "uniqueness" weight on top of BoW — a word that appears frequently in a specific document but rarely across the entire corpus has high discriminative power for that document.
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Where:
- $\text{TF}(t, d)$: the frequency of term $t$ in document $d$
- $\text{IDF}(t) = \log \frac{N}{\text{DF}(t)}$: $N$ is the total number of documents, and $\text{DF}(t)$ is the number of documents containing $t$
| Scenario | TF | IDF | TF-IDF | Explanation |
|---|---|---|---|---|
| "the" in any document | High | Low | Low | Appears everywhere, no discriminative power |
| "Transformer" in an AI paper | High | Medium | High | Domain-specific keyword |
| "quantum entanglement" in a physics paper | Medium | High | High | Highly specialized |
- Search engines: Core ranking algorithm in traditional search (Elasticsearch's default scoring)
- Keyword extraction: Words with the highest TF-IDF are the document's keywords
- Feature input for classifiers: Paired with SVM, Logistic Regression, etc.
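A minimal TF-IDF sketch, assuming the plain $\log(N/\text{DF})$ IDF variant (production libraries such as scikit-learn's `TfidfVectorizer` add smoothing and normalization; the function name `tf_idf` is illustrative):

```python
import math

def tf_idf(docs):
    """Score each term in each document by TF(t, d) * log(N / DF(t))."""
    tokenized = [d.split() for d in docs]
    n = len(docs)
    # DF(t): number of documents containing term t
    df = {}
    for doc in tokenized:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    scores = []
    for doc in tokenized:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

scores = tf_idf(["the cat sat on the mat", "the dog sat on the log"])
# Terms shared by both documents ("the", "sat", "on") get IDF = log(1) = 0
print(round(scores[0]["the"], 3))  # 0.0
print(round(scores[0]["cat"], 3))  # 0.116: "cat" is unique to document 1
```

Note how the stop word "the", despite its high raw count, scores zero, while the document-specific "cat" is promoted, which is exactly the behavior the table above describes.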
N-Gram models are the ancestor of language models: they model language by counting the probability of the next word given the previous $N-1$ words.

The core objective of a language model is to estimate the joint probability of a word sequence:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$

N-Gram simplifies this via the Markov Assumption: the current word depends only on the previous $N-1$ words:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})$$
| Model | Condition | Example |
|---|---|---|
| Unigram ($N=1$) | $P(w_i)$ | Completely ignores context |
| Bigram ($N=2$) | $P(w_i \mid w_{i-1})$ | After "I", likely "am", "have", "think" |
| Trigram ($N=3$) | $P(w_i \mid w_{i-2}, w_{i-1})$ | After "I am", likely "a", "not", "going" |
Using Maximum Likelihood Estimation (MLE), the bigram case is:

$$P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}$$

Where $\text{Count}(\cdot)$ is the number of occurrences in the training corpus.
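The MLE count-and-divide estimate can be sketched directly with `collections.Counter` (the `bigram_mle` helper and the toy corpus are illustrative; `<s>`/`</s>` mark sentence boundaries):

```python
from collections import Counter

def bigram_mle(corpus):
    """Estimate P(w | prev) = Count(prev, w) / Count(prev) from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token used as a context
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    def prob(prev, w):
        return bigrams[(prev, w)] / unigrams[prev]
    return prob

p = bigram_mle(["I am happy", "I am here", "I think"])
print(p("I", "am"))      # 2/3: "I" occurs 3 times, followed by "am" twice
print(p("am", "happy"))  # 1/2
print(p("I", "dog"))     # 0.0: an unseen bigram gets zero probability
```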
N-Grams unseen in the training corpus receive zero probability (the zero-probability problem). Solutions:
- Laplace Smoothing: Add 1 to every count
- Kneser-Ney Smoothing: Discount-based with backoff — the most effective method
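Laplace smoothing is the easiest of these to show concretely: add 1 to every bigram count and add the vocabulary size $V$ to the denominator so the probabilities still sum to 1. A sketch (the `laplace_bigram` helper is illustrative, and for simplicity it skips the `<UNK>` handling a real system would add for out-of-vocabulary words):

```python
from collections import Counter

def laplace_bigram(corpus):
    """Add-one smoothed bigram: (Count(prev, w) + 1) / (Count(prev) + V)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    v = len(vocab)  # vocabulary size, including the boundary markers
    def prob(prev, w):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + v)
    return prob

p = laplace_bigram(["I am happy", "I am here"])
print(p("am", "happy"))  # seen bigram: (1 + 1) / (2 + 6) = 0.25
print(p("happy", "I"))   # unseen bigram: (0 + 1) / (1 + 6), nonzero
```

The price of add-one smoothing is that it shifts a large probability mass onto unseen events, which is why discount-based methods like Kneser-Ney dominate in practice.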
- Fixed window: Can only see the previous $N-1$ words, unable to capture long-range dependencies
- Data sparsity: As $N$ grows, possible N-Gram combinations grow exponentially, with most counts being zero
- No semantic understanding: Purely based on statistical co-occurrence, with no comprehension of meaning
These limitations drove the paradigm shift from statistical models to neural language models — replacing discrete count tables with continuous vector spaces.
Perplexity (PPL) is the standard evaluation metric for language models, measuring how "perplexed" the model is by test data:

$$\text{PPL}(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$$
- Low perplexity: Model is confident and accurate (sharp probability distribution)
- High perplexity: Model is confused and uncertain (flat probability distribution)
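The exponentiated-average-log-probability form is easy to verify numerically. A sketch (the `perplexity` helper is illustrative; it takes the per-token probabilities a model assigned to a test sequence):

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum of log p_i) over per-token probabilities."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

confident = perplexity([0.9, 0.8, 0.95])  # sharp distribution: PPL near 1
uncertain = perplexity([0.1, 0.2, 0.05])  # flat distribution: PPL = 10.0
print(confident, uncertain)
```

Perplexity is the inverse of the geometric mean of the token probabilities, so the second case comes out to exactly 10: the model is, on average, as uncertain as if it were choosing uniformly among 10 words at each step.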
Next: Statistical Models — Classical machine learning approaches to NLP tasks.

