Prerequisite: ../01_Linguistics/01_Linguistic_Foundations.md.
Text preprocessing is the first step in any NLP pipeline — transforming raw, unstructured text into normalized input that algorithms can process. The quality of preprocessing directly determines the upper bound of downstream task performance.
- 1. The Classic NLP Pipeline
- 2. Tokenization
- 3. Text Normalization
- 4. Stop Words & Filtering
- 5. Regular Expressions in NLP
Traditional NLP uses a stage-by-stage pipeline architecture, where each step provides annotations for subsequent steps:
Raw Text → Tokenization → Normalization → Stop Words Removal → Feature Extraction → Model
In the deep learning era, many pipeline steps have been replaced by end-to-end models, but tokenization and text cleaning remain indispensable — even GPT needs a Tokenizer.
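The stage-by-stage architecture above can be sketched as a chain of small functions. This is a minimal illustration, not a production pipeline; the function names and the tiny stop-word set are placeholders:

```python
import re

# Tiny stand-in stop-word list; real lists have hundreds of entries
STOP_WORDS = {"the", "is", "a", "of"}

def tokenize(text):
    # Naive English tokenization: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # Case folding as the simplest normalization step
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(text):
    # Raw Text → Tokenization → Normalization → Stop Words Removal
    return remove_stop_words(normalize(tokenize(text)))

preprocess("The quality of preprocessing matters.")
# → ['quality', 'preprocessing', 'matters', '.']
```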
Tokenization is the process of splitting continuous text into discrete units (tokens). Different languages and different eras have adopted different strategies:
- English: Split by whitespace and punctuation (relatively straightforward)
"I don't know." → ["I", "do", "n't", "know", "."]
- Chinese: No natural delimiters — requires dedicated segmentation algorithms
  - Forward maximum matching, reverse maximum matching
  - Statistical segmentation: jieba uses HMM + Viterbi algorithm
"自然语言处理" → ["自然语言", "处理"] or ["自然", "语言", "处理"]
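The English splitting convention shown above (separating the clitic "n't", as in Penn Treebank tokenization) can be approximated with two regular expressions. This is a simplified sketch, not a full Treebank tokenizer; Chinese segmentation requires a dedicated library such as jieba and is not reproducible with a one-line rule:

```python
import re

def tokenize_en(text):
    # Split off the clitic "n't" (Penn Treebank convention)...
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # ...then separate remaining words and punctuation
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokenize_en("I don't know.")
# → ['I', 'do', "n't", 'know', '.']
```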
The mainstream approach in modern LLMs — striking a balance between word-level and character-level:
| Algorithm | Strategy | Representative Models |
|---|---|---|
| BPE (Byte Pair Encoding) | Start from characters, iteratively merge the most frequent adjacent pairs | GPT family |
| WordPiece | Similar to BPE, but selects merges based on likelihood gain | BERT |
| Unigram | Start from a large vocabulary, progressively prune low-probability subwords | T5, LLaMA |
For detailed subword tokenization internals, see 02_Scientist/01_Architecture/04_Tokenizer.md.
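The core BPE loop from the table — count adjacent pairs, merge the most frequent — fits in a few lines. This is a toy sketch on a three-word corpus, not a real tokenizer (production BPE also tracks a vocabulary and merge rules, and GPT-style BPE operates on bytes):

```python
from collections import Counter

def most_frequent_pair(corpus):
    # corpus: list of token sequences, starting at character level
    pairs = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with its concatenation
    merged = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(2):  # two merge steps: ('l','o') then ('lo','w')
    corpus = merge_pair(corpus, most_frequent_pair(corpus))

# After two merges: 'lower' → ['low', 'e', 'r'], 'low' → ['low']
```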
Unifying text into a standard form to reduce meaningless variation:
"Natural Language Processing" → "natural language processing"
Note: Case information should be preserved for tasks like Named Entity Recognition.
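In Python, case folding is a one-liner; `str.casefold()` is a more aggressive variant of `str.lower()` that also normalizes special casings such as German "ß":

```python
print("Natural Language Processing".lower())  # 'natural language processing'

# casefold() handles cases lower() misses, e.g. German 'ß' → 'ss'
print("Straße".casefold())  # 'strasse'
```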
Rule-based suffix truncation — crude but fast:
Porter Stemmer:
"running" → "run"
"studies" → "studi" ← may produce non-words
"university" → "univers"
Reduction to the canonical dictionary form (lemma), based on dictionary lookup and morphological analysis:
WordNet Lemmatizer:
"running" → "run"
"better" → "good"
"studies" → "study"
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast (pure rules) | Slow (requires dictionary lookup) |
| Accuracy | Low (may produce non-words) | High (guarantees valid words) |
| Use case | Information retrieval, search engines | Text analysis, semantic tasks |
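The contrast in the table can be made concrete with a toy implementation. This is not the real Porter algorithm (use NLTK's `PorterStemmer` and `WordNetLemmatizer` in practice); it only mimics the behavior shown above, with a two-entry dictionary standing in for WordNet:

```python
def toy_stem(word):
    # Crude rule-based suffix stripping, Porter-style in spirit only
    if word.endswith("ies"):
        return word[:-3] + "i"      # "studies" → "studi" (a non-word)
    if word.endswith("ing"):
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]        # undouble: "runn" → "run"
        return stem
    return word

# Lemmatization requires a dictionary; a tiny stand-in for WordNet here
LEMMA_DICT = {"running": "run", "better": "good", "studies": "study"}

def toy_lemmatize(word):
    return LEMMA_DICT.get(word, word)

toy_stem("studies")       # → 'studi'  (fast, but a non-word)
toy_lemmatize("studies")  # → 'study'  (valid word, needs the lookup)
```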
High-frequency but low-information words — "the", "is", "a", "of", etc.
- Traditional approach: Remove them to reduce noise and feature dimensionality
- Modern approach: LLMs do not remove stop words — the attention mechanism automatically learns to downweight low-information tokens
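Because stop words are by definition the highest-frequency tokens, candidates can be derived directly from corpus counts — though real lists (e.g. the ones NLTK ships per language) are hand-curated. A minimal sketch:

```python
from collections import Counter

def top_k_stopword_candidates(tokens, k=3):
    # The k most frequent tokens are stop-word candidates
    return [w for w, _ in Counter(tokens).most_common(k)]

tokens = "the cat sat on the mat near the door".split()
top_k_stopword_candidates(tokens, k=1)  # → ['the']
```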
- HTML tag removal
- URL / email address handling
- Number normalization (replace all numbers with `<NUM>`)
- Special character and emoji processing
- Repeated character compression ("soooo goood" → "so good")
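The cleaning steps above can be chained into one function. This is a sketch, not a robust HTML parser (regexes cannot handle arbitrary HTML); note that the repeated-character rule here compresses runs to two characters ("soooo" → "soo"), a common compromise since collapsing to one would turn "goood" into "god":

```python
import re

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)               # strip HTML tags
    text = re.sub(r"https?://\S+", " <URL> ", text)    # URLs
    text = re.sub(r"\S+@\S+\.\S+", " <EMAIL> ", text)  # email addresses
    text = re.sub(r"\d+", "<NUM>", text)               # number normalization
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)         # "soooo" → "soo"
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

clean_text("Visit <b>now</b> at https://example.com or call 123!")
# → 'Visit now at <URL> or call <NUM>!'
```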
Regular expressions are the "Swiss army knife" of text preprocessing:
```python
import re

# Email extraction
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)

# URL extraction
urls = re.findall(r'https?://\S+', text)

# Chinese character matching (CJK Unified Ideographs block)
chinese = re.findall(r'[\u4e00-\u9fff]+', text)

# Remove excess whitespace
clean = re.sub(r'\s+', ' ', text).strip()
```

Regular expressions are typically used for:
- The first pass of data cleaning (noise removal)
- Rule-based simple NER (e.g., extracting phone numbers, date formats)
- The preprocessing stage of tokenizers
Next: Feature Engineering — How to convert text into numerical features that machines can process.
