Prerequisite: ../01_FT/01_Theory/01_Introduction.md (fine-tuning basics). See Also: ../../07_Paper_Tracking/04_Alignment_Frontiers.md (latest alignment research), ../../../04_Solutions/06_Finetuning_Playbook.md (DPO/PPO business implementation).
Modern LLM development follows a four-stage pipeline, in which each stage builds on the output of the previous one:
```
┌──────────────────────────────────────────────────────────┐
│ Stage 1: Pre-training                                    │
│ Goal: Build world knowledge and language understanding   │
│ Data: Trillions of tokens from the web                   │
│ Output: A powerful but unaligned base model              │
└────────────────────────────┬─────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│ Stage 2: Instruction Fine-Tuning (SFT)                   │
│ Goal: Teach the model to follow instructions             │
│ Data: High-quality (prompt, response) pairs              │
│ Output: A model that can hold conversations              │
│ Note: SFT quality sets the model's capability ceiling    │
└────────────────────────────┬─────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│ Stage 3: Preference Alignment                            │
│ Goal: Align behavior with human values and safety        │
│ Signal: Subjective human/AI preference comparisons       │
│ Methods: PPO, DPO, KTO, RLAIF, Constitutional AI         │
│ Question answered: "How should the model behave?"        │
└────────────────────────────┬─────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│ Stage 4: Reasoning Alignment (RLVR)                      │
│ Goal: Develop deep reasoning and self-correction ability │
│ Signal: Objective verifiable correctness (math, code)    │
│ Methods: GRPO + Verifier, Process Reward Models          │
│ Question answered: "How should the model think?"         │
└──────────────────────────────────────────────────────────┘
```
| Dimension | Preference Alignment | Reasoning Alignment |
|---|---|---|
| Core Question | How should the model behave? | How should the model think? |
| Reward Source | Human/AI preference labels | Rule-based verifier |
| Signal Type | Subjective (better/worse) | Objective (correct/incorrect) |
| Applicable Tasks | Open-ended (style, safety, tone) | Closed-ended (math, code, logic) |
| Key Risk | Reward hacking on learned RM | Sparse reward signal |
| Representative Methods | PPO, DPO, KTO | GRPO + RLVR |
| Industrial Example | InstructGPT, Claude | DeepSeek-R1 |
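The "Reward Source" distinction in the table can be made concrete with a minimal sketch. Both functions below are illustrative stand-ins (the names, scoring heuristic, and `Answer:` extraction convention are assumptions, not any specific system's API): a learned reward model emits a subjective scalar score, while a rule-based verifier emits objective correct/incorrect.

```python
def preference_reward(response: str) -> float:
    """Stand-in for a learned reward model: a subjective scalar score.

    A real RM is a trained network over (prompt, response); here we fake a
    score from a surface feature purely to illustrate the signal type.
    """
    return 0.5 + 0.1 * ("please" in response.lower())


def verifiable_reward(response: str, gold_answer: str) -> float:
    """Rule-based verifier: 1.0 iff the extracted answer matches the gold one.

    Assumes responses end with an "Answer: ..." line (an illustrative
    convention, not a standard).
    """
    extracted = response.split("Answer:")[-1].strip()
    return 1.0 if extracted == gold_answer else 0.0


# The subjective signal is a graded score; the objective signal is binary.
print(preference_reward("Sure, please find the proof below."))
print(verifiable_reward("Step 1... Step 2... Answer: 42", "42"))
print(verifiable_reward("Step 1... Step 2... Answer: 41", "42"))
```

The binary nature of the verifier's output is exactly the "sparse reward signal" risk the table lists for reasoning alignment: a near-miss derivation scores the same 0.0 as nonsense.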
Techniques for aligning model behavior with human values and preferences.
| File | Topic |
|---|---|
| 01_Overview.md | RLHF pipeline, reward modeling, method comparison |
| 02_PPO.md | Proximal Policy Optimization |
| 03_DPO.md | Direct Preference Optimization |
| 04_KTO.md | Kahneman-Tversky Optimization |
| 05_RLAIF.md | Reinforcement Learning from AI Feedback |
| 06_Constitutional_AI.md | Rule-based self-critique |
| 07_Safety_Fine_Tuning.md | Safety-specific alignment |
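The methods indexed above all learn from preference pairs; DPO (03_DPO.md) is the most compact formulation. As a minimal per-pair sketch (pure-Python log-probability inputs assumed; a real implementation operates on batched sequence log-probabilities from the policy and a frozen reference model):

```python
import math


def dpo_loss(logp_w_policy: float, logp_l_policy: float,
             logp_w_ref: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - ref margin)).

    logp_w_* / logp_l_* are log-probabilities of the chosen ("winner") and
    rejected ("loser") responses under the policy and reference models.
    """
    margin = (logp_w_policy - logp_l_policy) - (logp_w_ref - logp_l_ref)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy still matches the reference, the margin is 0 and the loss is ln 2; widening the policy's chosen-over-rejected gap relative to the reference drives the loss below that, which is the whole training signal — no separate reward model is needed.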
Techniques for developing verifiable reasoning capabilities.
| File | Topic |
|---|---|
| 01_RLVR.md | Reinforcement Learning with Verifiable Rewards |
| 02_GRPO.md | Group Relative Policy Optimization |
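The core trick in GRPO (02_GRPO.md) is replacing a learned value function with group-relative advantages: sample several responses per prompt, score each with the verifier, and standardize the rewards within the group. A minimal sketch of that advantage computation (function name and `eps` guard are illustrative):

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero when all
    rewards in the group are identical).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# With binary verifier rewards, correct responses in a mixed group get
# positive advantages and incorrect ones get negative advantages.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Note that a group where every response scores the same (all correct or all wrong) yields all-zero advantages and contributes no gradient — one concrete face of the sparse-reward risk noted in the comparison table.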
Cross-cutting techniques applicable to both paradigms.
| File | Topic |
|---|---|
| 01_Rejection_Sampling.md | Best-of-N sampling strategies |
| 02_Iterative_Training.md | Online and iterative alignment |
| 03_Inference_Time_Compute.md | Scaling compute at inference |
| 04_Model_Merging.md | Merging aligned models |
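Best-of-N (01_Rejection_Sampling.md) is the simplest of these cross-cutting techniques, and also the basic unit of inference-time compute scaling (03_Inference_Time_Compute.md): draw N candidates and keep the highest-scoring one. A hedged sketch, with `sample_fn` and `score_fn` as caller-supplied stand-ins for a sampling LLM and a reward model or verifier:

```python
from typing import Callable


def best_of_n(prompt: str,
              sample_fn: Callable[[str], str],
              score_fn: Callable[[str], float],
              n: int = 8) -> str:
    """Best-of-N: draw n candidate responses and return the highest-scoring.

    sample_fn stands in for stochastic generation from a policy model;
    score_fn stands in for a reward model or rule-based verifier.
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=score_fn)
```

In rejection-sampling fine-tuning, the selected winners are fed back as SFT data; used at inference time, the same loop trades extra compute for quality without touching the weights.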