Prerequisite: ../04_Post_Training/02_Alignment/. Continuously updated. Last update: 2025-06
- Institution: DeepSeek
- Link: arXiv:2501.12948
- One-line Summary: Trains strong reasoning via GRPO + rule-based rewards (no learned RM), observing emergent behaviors like the "Aha Moment."
- Core Innovation: Proves pure RL (without SFT reasoning data) can elicit self-reflection and long-chain reasoning from scratch; Cold Start SFT + large-scale GRPO two-stage pipeline.
- Key Results: R1 achieves 79.8% on AIME 2024 and 97.3% on MATH-500, matching OpenAI o1-1217 (R1-Zero is comparable to o1-0912).
- Implications: Detailed in 04_Post_Training/02_Alignment/03_Reasoning_Alignment/02_GRPO.md.
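The combination of a rule-based reward and group-relative advantage normalization can be sketched in a few lines (illustrative names and a toy string-match reward; not DeepSeek's actual implementation):

```python
# Sketch of GRPO's group-relative advantage with a rule-based reward
# (illustrative; real rewards parse e.g. \boxed{} answers and check format).
import statistics

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Verifiable reward: 1.0 if the gold answer appears, else 0.0."""
    return 1.0 if gold_answer in completion else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each sampled completion's reward against its own
    group's mean and std -- no learned value model needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# 4 sampled completions for one prompt; two get the answer right.
rewards = [rule_based_reward(c, "42") for c in
           ["... so the answer is 42", "answer: 17", "thus 42", "no idea"]]
advantages = group_relative_advantages(rewards)
```

Correct completions get positive advantage, incorrect ones negative, purely from within-group comparison.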
- Institution: OpenAI
- Link: openai.com/index/learning-to-reason-with-llms
- One-line Summary: First commercial model built around large-scale inference-time compute scaling; significantly improves reasoning via internal chain-of-thought search.
- Core Innovation: "Thinking time" as a scalable dimension — more reasoning tokens = better results.
- Key Results: Ranks among the top 500 US students on AIME 2024; 89th percentile on Codeforces.
- Limitations: Architecture undisclosed; cannot confirm whether MCTS or other search algorithms are used.
- Institution: Alibaba Qwen
- Link: qwenlm.github.io/blog/qwq-32b-preview
- One-line Summary: 32B open-source reasoning model trained via RLVR for long-chain reasoning.
- Relation to Existing Work: Contrasts with R1-Distill-32B — QwQ uses direct RLVR training while R1-Distill uses distillation; the latter achieves better results.
- Institution: University of Virginia
- Link: arXiv:2405.14734
- One-line Summary: Removes the reference model from DPO, using length-normalized sequence log-probability as implicit reward.
- Core Innovation: $r(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y|x)$; length-normalized log-prob directly as the reward, no $\pi_{ref}$ needed.
- Key Results: Outperforms DPO and ORPO on AlpacaEval 2.0 LC.
- Implications: Further simplification direction for the DPO family.
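The SimPO objective is simple enough to sketch end-to-end with toy scalars (the target-margin term $\gamma$ and all numeric values here are illustrative):

```python
# Sketch of SimPO: length-normalized log-prob as implicit reward,
# Bradley-Terry loss on the reward margin. No reference model anywhere.
import math

def simpo_reward(logprob_sum: float, length: int, beta: float = 2.0) -> float:
    """r = (beta / |y|) * log pi_theta(y|x)."""
    return beta * logprob_sum / length

def simpo_loss(lp_chosen: float, len_chosen: int,
               lp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """-log sigmoid of the reward margin; gamma is SimPO's target margin."""
    margin = (simpo_reward(lp_chosen, len_chosen, beta)
              - simpo_reward(lp_rejected, len_rejected, beta) - gamma)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy values: chosen is 20 tokens at avg logprob -1.0, rejected 40 at -2.0.
loss = simpo_loss(-20.0, 20, -80.0, 40)
```

Note how length normalization keeps the longer rejected response from being penalized merely for accumulating more log-prob mass.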
- Institution: KAIST
- Link: arXiv:2403.07691
- One-line Summary: Merges SFT and preference alignment into a single training stage without a reference model.
- Core Innovation: Adds an odds ratio penalty to the standard NLL loss, penalizing the generation probability of rejected responses.
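A minimal sketch of the odds-ratio penalty (toy scalars; `lam` and the length-normalized probabilities are illustrative, not the paper's training setup):

```python
# Sketch of ORPO: standard SFT NLL on the chosen response plus an
# odds-ratio penalty pushing down the rejected response.
import math

def odds(p: float) -> float:
    """Odds of a response whose (length-normalized) probability is p."""
    return p / (1.0 - p)

def orpo_loss(nll_chosen: float, p_chosen: float, p_rejected: float,
              lam: float = 0.1) -> float:
    """Single-stage objective: L = L_SFT + lam * L_OR, where L_OR is
    -log sigmoid of the log odds ratio between chosen and rejected."""
    log_or = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_or)))
    return nll_chosen + lam * l_or

loss = orpo_loss(nll_chosen=1.2, p_chosen=0.30, p_rejected=0.10)
```

Because the penalty rides on top of the NLL term, one training pass replaces the usual SFT-then-DPO pipeline.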
- Institution: Center for AI Safety
- Link: arXiv:2310.01405
- One-line Summary: Controls model behavior by directly manipulating internal representations (rather than fine-tuning).
- Core Innovation: Identifies direction vectors encoding specific concepts (e.g., "honesty") in the model's representation space; adds/subtracts these vectors to steer outputs.
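The core steering operation can be sketched with NumPy (toy vectors; in practice this is applied to a transformer layer's residual stream via a forward hook, and the names here are illustrative):

```python
# Sketch of representation steering: shift a hidden state along a
# unit "concept" direction (e.g. an "honesty" vector extracted by
# contrasting activations on paired prompts).
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray,
          alpha: float = 4.0) -> np.ndarray:
    """Add alpha times the unit concept vector to the activation."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.normal(size=64)   # stand-in for one token's hidden state
v = rng.normal(size=64)   # stand-in for an extracted concept vector
h_steered = steer(h, v, alpha=4.0)
# The projection onto the concept direction increases by exactly alpha.
```

Subtracting instead of adding (negative `alpha`) suppresses the concept, which is what makes the same machinery usable for both eliciting and ablating behaviors.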
- Institution: Center for AI Safety / Gray Swan AI
- Link: arXiv:2406.04313
- One-line Summary: Installs "circuit breakers" at the representation level that interrupt generation when harmful output patterns are detected.
- Core Innovation: Bypasses behavioral-level alignment (RLHF/DPO) to directly block harmful generation paths in representation space.
- Key Results: >90% defense success rate against GCG, AutoDAN, and similar attacks while maintaining normal task performance.
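The detection idea can be caricatured as a similarity check against a harmful direction (heavily simplified: the paper actually retrains representations with a rerouting loss rather than thresholding at inference; all names and values here are illustrative):

```python
# Toy sketch of a representation-level "circuit breaker": trip when the
# hidden state aligns too strongly with a known harmful-output direction.
import numpy as np

def should_break(hidden: np.ndarray, harm_dir: np.ndarray,
                 threshold: float = 0.6) -> bool:
    """Cosine similarity between the current hidden state and the
    harmful direction; interrupt generation above the threshold."""
    cos = float(np.dot(hidden, harm_dir) /
                (np.linalg.norm(hidden) * np.linalg.norm(harm_dir)))
    return cos > threshold

harm = np.ones(8)                                   # harmful direction
benign = np.array([1, -1, 1, -1, 1, -1, 1, -1.0])   # orthogonal state
```

Operating in representation space is what lets the defense generalize across attack strings: GCG and AutoDAN produce different prompts but must still drive the model toward the same harmful internal states.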
- Institution: OpenAI
- Link: openai.com/index/introducing-o3-and-o4-mini
- One-line Summary: Next-generation reasoning models with tool use (web search, code execution, image analysis) integrated into the thinking loop.
- Core Innovation: "Thinking with tools" — the model can invoke tools mid-reasoning, not just as final actions; significantly extends the effective reasoning depth.
- Key Results: o3 achieves 96.7% on AIME 2024 and 87.7% on GPQA Diamond; o4-mini matches o3 on most benchmarks at roughly 1/10 the cost.
- Institution: DeepSeek
- Link: huggingface.co/deepseek-ai/DeepSeek-R1-0528
- One-line Summary: Major R1 update doubling reasoning depth (23K avg tokens vs 12K), matching o3 and Gemini 2.5 Pro.
- Core Innovation: Enhanced post-training algorithms with increased compute; deeper chain-of-thought reasoning with sustained 10+ minute thinking on hard problems.
- Key Results: Matches o3 on AIME 2024 and MATH-500; open-weight under MIT license.
- Institution: Anthropic
- Link: anthropic.com/news/claude-4
- One-line Summary: 200K-token context with multi-hour "Extended Thinking" and hybrid reasoning + tool use in a single loop.
- Core Innovation: Extended Thinking allows sustained reasoning over hours; agentic coding capabilities (7-hour autonomous refactoring demonstrated).
- Key Results: #1 on SWE-bench at launch; Opus 4 sets new SOTA on agentic coding tasks.
- Institution: Alibaba Qwen
- One-line Summary: Single model supports toggling between "thinking" (deep CoT) and "non-thinking" (fast) modes via a chat-template flag or in-prompt soft switch.
- Core Innovation: Two-phase training — standard SFT first, then RL with verifiable rewards to train the thinking mode; no separate reasoning model needed.
- Implications: Unifies fast chat and deep reasoning into one model, reducing deployment complexity.
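The toggle mechanism can be sketched as a prompt-level switch (a toy illustration; Qwen3's actual switch is an `enable_thinking` chat-template flag plus `/think` and `/no_think` soft switches, and the routing happens inside one set of weights, not in prompt-building code):

```python
# Toy sketch of a thinking-mode toggle: the same model either emits a
# <think> block before answering or answers directly, selected per turn.
def build_prompt(user_msg: str, thinking: bool) -> str:
    """Append the soft-switch tag that selects the reasoning mode."""
    mode_tag = "/think" if thinking else "/no_think"
    return f"{user_msg} {mode_tag}"

hard = build_prompt("Prove that sqrt(2) is irrational.", thinking=True)
quick = build_prompt("What's the capital of France?", thinking=False)
```

The deployment win is that latency-sensitive chat and slow deep reasoning share one served model instead of two.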