|
40 | 40 | <script type="text/json"> |
41 | 41 | { |
42 | 42 | "title": "Lecture 25", |
43 | | - "description": "Alignment, explainability, and open directions in LLM research", |
44 | | - "published": "December 8, 2025", |
| 43 | + "description": "Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.", |
| 44 | + "published": "December 1, 2025", |
45 | 45 | "lecturers": [ |
46 | 46 |
|
47 | 47 | { |
|
53 | 53 | "authors": [ |
54 | 54 |
|
55 | 55 | { |
56 | | - "author": "Reid Chen", |
57 | | - "authorURL": "https://www.deepneural.network" |
| 56 | + "author": "Rishit Malpani" |
| 57 | + }, |
| 58 | + |
| 59 | + { |
| 60 | + "author": "Reid Chen" |
58 | 61 | }, |
59 | 62 |
|
60 | 63 | { |
|
157 | 160 | <div class="page-content"> |
158 | 161 | <d-title> |
159 | 162 | <h1>Lecture 25</h1> |
160 | | - <p>Alignment, explainability, and open directions in LLM research</p> |
| 163 | + <p>Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.</p> |
161 | 164 | </d-title> |
162 | 165 |
|
163 | 166 | <d-byline></d-byline> |
164 | 167 |
|
165 | | - <d-article> <h2 id="phases-of-model-training">Phases of Model Training</h2> |
| 168 | + <d-article> <h2 id="key-takeaways">Key Takeaways</h2> |
| 169 | + |
| 170 | +<ul> |
| 171 | + <li>Modern AI research is shifting from raw performance to <strong>alignment, interpretability, and system-level reliability</strong>.</li> |
| 172 | + <li>Post-hoc explainability tools are widely used but have serious <strong>fidelity and robustness limitations</strong>.</li> |
| 173 | + <li>Scaling laws explain why larger models work better, but they do <strong>not guarantee safety or alignment</strong>.</li> |
| 174 | + <li>Interpretability benefits not only users, but also <strong>system designers</strong>, by improving measurement, modularity, and value alignment.</li> |
| 175 | + <li>Many core challenges (alignment, reasoning, data limits, economic impact) remain <strong>open research problems</strong>.</li> |
| 176 | +</ul> |
166 | 177 |
|
167 | | -<p>The training pipeline for modern Large Language Models (LLMs) generally follows a progression from broad pattern matching to specific task alignment.</p> |
| 178 | +<h2 id="logistics">Logistics</h2> |
168 | 179 |
|
169 | 180 | <ul> |
170 | | - <li><strong>Random Model</strong>: The starting point of the architecture.</li> |
171 | | - <li><strong>Pre-training</strong>: The model is trained unsupervised on massive datasets (e.g., Common Crawl) to learn general patterns.</li> |
172 | | - <li><strong>Fine-tuning</strong>: The pre-trained model is refined using In-Domain Data via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).</li> |
173 | | - <li><strong>In-context learning</strong>: During usage, prompts and examples in the input guide the model to produce outputs adapted to user intent without updating weights.</li> |
| 181 | + <li><strong>Project Final Report:</strong> Due Friday, December 12th. Submit via Canvas.</li> |
| 182 | + <li><strong>Final Exam:</strong> December 17th, 5:05–7:05 PM in Science 180. A study guide has been released.</li> |
174 | 183 | </ul> |
175 | 184 |
|
| 185 | +<hr /> |
| 186 | + |
| 187 | +<h2 id="learning-goals">Learning Goals</h2> |
| 188 | + |
| 189 | +<p>By the end of this lecture, you should be able to:</p> |
| 190 | + |
| 191 | +<ul> |
| 192 | + <li>Explain why <strong>alignment</strong> and <strong>explainability</strong> are central problems in modern AI.</li> |
| 193 | + <li>Distinguish between <strong>post-hoc</strong>, <strong>transparent</strong>, and <strong>mechanistic</strong> interpretability.</li> |
| 194 | + <li>Describe the difference between <strong>outer alignment</strong> and <strong>inner alignment</strong>.</li> |
| 195 | + <li>Understand how <strong>system design</strong> interacts with interpretability.</li> |
| 196 | + <li>Identify major <strong>open research problems</strong> in alignment and interpretability.</li> |
| 197 | +</ul> |
| 198 | + |
| 199 | +<hr /> |
| 200 | + |
| 201 | +<h2 id="the-llm-training-and-usage-pipeline">The LLM Training and Usage Pipeline</h2> |
| 202 | + |
| 203 | +<p>Modern Large Language Models (LLMs) progress through distinct stages, from broad pattern learning to task-specific adaptation:</p> |
| 204 | + |
| 205 | +<ol> |
| 206 | + <li><strong>Random Model</strong>: The initialized architecture before training.</li> |
| 207 | + <li><strong>Pre-Training</strong>: Unsupervised training on massive datasets (e.g., Common Crawl) to learn general patterns.</li> |
| 208 | + <li><strong>Fine-Tuning</strong>: Alignment using in-domain data via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).</li> |
| 209 | + <li><strong>In-Context Learning</strong>: At inference time, prompts and examples guide behavior without updating weights.</li> |
| 210 | +</ol> |
| 211 | + |
| 212 | +<p><strong>Key Observation:</strong><br /> |
| 213 | +The same trained model can behave very differently depending on context: pre-training and fine-tuning update the model's parameters, while in-context learning changes only <strong>how the model is used</strong> at inference time.</p> |
| 214 | + |
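| | +<p>As a rough illustration of the last stage, the sketch below builds a few-shot prompt by hand (the toy task and labels are chosen here for illustration, not taken from the lecture). The adaptation lives entirely in the prompt string that is sent to an already-trained model; no weights change.</p> |
| | + |
| | +<pre><code class="language-python"># Illustrative in-context learning: adapt behavior through the prompt alone. |
| | +examples = [ |
| | +    ("The movie was wonderful.", "positive"), |
| | +    ("I want my money back.", "negative"), |
| | +] |
| | +query = "The acting felt flat and lifeless." |
| | + |
| | +prompt = "Classify the sentiment of each review.\n\n" |
| | +for text, label in examples: |
| | +    prompt += f"Review: {text}\nSentiment: {label}\n\n" |
| | +prompt += f"Review: {query}\nSentiment:" |
| | + |
| | +print(prompt)  # this string, not a gradient update, is what steers the model |
| | +</code></pre> |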
| 215 | +<hr /> |
| 216 | + |
176 | 217 | <h2 id="why-explainability-matters">Why Explainability Matters</h2> |
177 | 218 |
|
178 | | -<p>Models trained on large data are rarely naturally interpretable to humans. Historically, the field has moved through several phases:</p> |
| 219 | +<p>Models trained on large-scale data are rarely naturally interpretable to humans. Explainability is critical for:</p> |
| 220 | + |
179 | 221 | <ul> |
180 | | - <li><strong>2016</strong>: Interpretability is invoked when metrics (like accuracy) are imperfect proxies for the true objective.</li> |
181 | | - <li><strong>2017</strong>: Doshi-Velez & Kim defined three modes of evaluation: application-grounded, human-grounded, and functionally-grounded.</li> |
182 | | - <li><strong>2017-2020</strong>: Approaches fragmented into Post-Hoc (industry standard), Transparency (niche), and Mechanistic (technically deep).</li> |
| 222 | + <li>Safety and trust</li> |
| 223 | + <li>Debugging and model validation</li> |
| 224 | + <li>Regulatory and ethical compliance</li> |
| 225 | + <li>Understanding system-level behavior beyond accuracy</li> |
183 | 226 | </ul> |
184 | 227 |
|
185 | | -<h2 id="fairness--sensitive-features">Fairness & Sensitive Features</h2> |
| 228 | +<h3 id="common-confusions">Common Confusions</h3> |
| 229 | + |
| 230 | +<ul> |
| 231 | + <li><strong>Explainability ≠ Accuracy</strong>: A highly accurate model can still be unsafe or untrustworthy.</li> |
| 232 | + <li><strong>Post-hoc explanations ≠ true understanding</strong>: Plausible explanations may not reflect the model’s actual computation.</li> |
| 233 | + <li><strong>Dropping sensitive features ≠ fairness</strong>: Bias can persist through correlated variables.</li> |
| 234 | +</ul> |
186 | 235 |
|
187 | | -<p>Merely dropping sensitive features like “race” from training data does <strong>not</strong> ensure the model is invariant to them, as biases can be encoded via correlated variables.</p> |
| 236 | +<hr /> |
| 237 | + |
| 238 | +<h2 id="fairness-and-sensitive-features">Fairness and Sensitive Features</h2> |
| 239 | + |
| 240 | +<p>Removing sensitive attributes like race or gender from training data does <strong>not</strong> ensure invariance.</p> |
188 | 241 |
|
189 | 242 | <p><strong>Strategies for Invariance:</strong></p> |
| 243 | + |
190 | 244 | <ol> |
191 | 245 | <li><strong>Remove the feature</strong>: Often insufficient due to correlations.</li> |
192 | | - <li><strong>Train then clean</strong>: Train on all features, then attempt to remove the learned component associated with the sensitive feature.</li> |
| 246 | + <li><strong>Train then clean</strong>: Train with all features, then remove learned components post-hoc.</li> |
193 | 247 | <li><strong>Test-time blinding</strong>: Drop the feature only during inference.</li> |
194 | | - <li><strong>Modified Loss</strong>: Train with a loss function specifically designed to encourage invariant predictions.</li> |
| 248 | + <li><strong>Modified loss functions</strong>: Penalize prediction dependence on sensitive attributes (see the sketch after this list).</li> |
195 | 249 | </ol> |
196 | 250 |
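| | +<p>A minimal sketch of strategy 4, assuming a PyTorch-style training loop: the penalty below discourages correlation between predictions and a sensitive attribute. The function name, the covariance-based penalty, and its weight are illustrative choices, not the specific loss used in the lecture.</p> |
| | + |
| | +<pre><code class="language-python">import torch |
| | + |
| | +def invariance_penalty(preds, sensitive, lam=1.0): |
| | +    # Shrink the covariance between predictions and the sensitive attribute. |
| | +    # Zero covariance only removes linear dependence, so this is a weak proxy |
| | +    # for full invariance, but it illustrates the "modified loss" idea. |
| | +    preds = preds.flatten().float() |
| | +    sensitive = sensitive.flatten().float() |
| | +    cov = ((preds - preds.mean()) * (sensitive - sensitive.mean())).mean() |
| | +    return lam * cov.abs() |
| | + |
| | +# Inside a training step (base_loss is the usual task loss, e.g. cross-entropy): |
| | +#   loss = base_loss + invariance_penalty(scores, sensitive_attr) |
| | +#   loss.backward() |
| | +</code></pre> |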
|
| 251 | +<hr /> |
| 252 | + |
| 253 | +<h2 id="the-history-of-interpretability">The History of Interpretability</h2> |
| 254 | + |
| 255 | +<h3 id="interpretability-categories">Interpretability Categories</h3> |
| 256 | + |
| 257 | +<table> |
| 258 | + <thead> |
| 259 | + <tr> |
| 260 | + <th>Type</th> |
| 261 | + <th>Core Idea</th> |
| 262 | + <th>Main Limitation</th> |
| 263 | + </tr> |
| 264 | + </thead> |
| 265 | + <tbody> |
| 266 | + <tr> |
| 267 | + <td>Post-hoc</td> |
| 268 | + <td>Explain predictions after training</td> |
| 269 | + <td>Often lacks fidelity</td> |
| 270 | + </tr> |
| 271 | + <tr> |
| 272 | + <td>Transparent</td> |
| 273 | + <td>Interpretable by design</td> |
| 274 | + <td>Limited flexibility</td> |
| 275 | + </tr> |
| 276 | + <tr> |
| 277 | + <td>Mechanistic</td> |
| 278 | + <td>Reverse-engineer internals</td> |
| 279 | + <td>Hard to scale</td> |
| 280 | + </tr> |
| 281 | + </tbody> |
| 282 | +</table> |
| 283 | + |
| 284 | +<h3 id="2016-setting-the-stage">2016: Setting the Stage</h3> |
| 285 | + |
| 286 | +<ul> |
| 287 | + <li><strong>The Mythos</strong>: Interpretability is invoked when metrics are imperfect proxies for the true objective (Lipton, 2016).</li> |
| 288 | + <li><strong>Evaluation Modes</strong>: Application-grounded, human-grounded, and functionally-grounded (Doshi-Velez & Kim, 2017).</li> |
| 289 | +</ul> |
| 290 | + |
| 291 | +<h3 id="20172020-fragmentation">2017–2020: Fragmentation</h3> |
| 292 | + |
| 293 | +<table> |
| 294 | + <thead> |
| 295 | + <tr> |
| 296 | + <th>Methodology</th> |
| 297 | + <th>Examples</th> |
| 298 | + <th>Description</th> |
| 299 | + </tr> |
| 300 | + </thead> |
| 301 | + <tbody> |
| 302 | + <tr> |
| 303 | + <td>Post-hoc</td> |
| 304 | + <td>LIME, SHAP, Integrated Gradients</td> |
| 305 | + <td>Industry standard; explain after training</td> |
| 306 | + </tr> |
| 307 | + <tr> |
| 308 | + <td>Transparency</td> |
| 309 | + <td>GAMs, Monotonic Nets</td> |
| 310 | + <td>Niche, common in healthcare/tabular data</td> |
| 311 | + </tr> |
| 312 | + <tr> |
| 313 | + <td>Mechanistic</td> |
| 314 | + <td>Circuits, probing</td> |
| 315 | + <td>Technically deep, rarely user-facing</td> |
| 316 | + </tr> |
| 317 | + </tbody> |
| 318 | +</table> |
| 319 | + |
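| | +<p>For concreteness, a minimal post-hoc attribution sketch using the <code>shap</code> package (the dataset and model below are illustrative choices, not ones used in the lecture):</p> |
| | + |
| | +<pre><code class="language-python">import shap |
| | +from sklearn.ensemble import RandomForestClassifier |
| | + |
| | +# Fit a small model on a tabular dataset bundled with the shap package. |
| | +X, y = shap.datasets.adult() |
| | +model = RandomForestClassifier(n_estimators=50, n_jobs=-1).fit(X, y) |
| | + |
| | +# Post-hoc, per-feature attributions for the first 100 rows. |
| | +explainer = shap.TreeExplainer(model) |
| | +shap_values = explainer.shap_values(X.iloc[:100]) |
| | +</code></pre> |
| | + |
| | +<p>These attributions are produced after training and only describe the fitted model's behavior, which is exactly why the fidelity concerns in the next subsection matter.</p> |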
| 320 | +<h3 id="cracks-in-post-hoc-explanations">Cracks in Post-Hoc Explanations</h3> |
| 321 | + |
| 322 | +<ul> |
| 323 | + <li><strong>Insensitivity</strong>: Saliency maps may remain unchanged under weight randomization (Adebayo et al., 2018); see the sanity-check sketch after this list.</li> |
| 324 | + <li><strong>Vulnerability</strong>: LIME and SHAP can be easily fooled (Slack et al., 2020).</li> |
| 325 | + <li><strong>Plausibility vs. Faithfulness</strong>: Explanations may look reasonable but misrepresent computation (Jacovi & Goldberg, 2020).</li> |
| 326 | + <li><strong>High-Stakes Critique</strong>: In safety-critical settings, post-hoc methods may be insufficient (Rudin, 2019).</li> |
| 327 | +</ul> |
| 328 | + |
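| | +<p>A compact sketch of the weight-randomization sanity check behind the first bullet, assuming a PyTorch classifier; the simple gradient saliency used here is an illustrative stand-in for whatever attribution method is being audited.</p> |
| | + |
| | +<pre><code class="language-python">import copy |
| | +import torch |
| | +import torch.nn.functional as F |
| | + |
| | +def saliency(model, x): |
| | +    # Simplest gradient saliency: |d(max logit)/dx| for each input element. |
| | +    x = x.clone().requires_grad_(True) |
| | +    model(x).max(dim=1).values.sum().backward() |
| | +    return x.grad.abs() |
| | + |
| | +def randomize_weights(model): |
| | +    # Re-initialize every parameter of a copy of the model. |
| | +    model = copy.deepcopy(model) |
| | +    with torch.no_grad(): |
| | +        for p in model.parameters(): |
| | +            p.normal_() |
| | +    return model |
| | + |
| | +# Usage (model: any torch.nn.Module classifier, x: a batch of inputs): |
| | +#   before = saliency(model, x).flatten() |
| | +#   after = saliency(randomize_weights(model), x).flatten() |
| | +#   print(F.cosine_similarity(before, after, dim=0))  # high similarity is a red flag |
| | +</code></pre> |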
| 329 | +<hr /> |
| 330 | + |
197 | 331 | <h2 id="interpretability-approaches">Interpretability Approaches</h2> |
198 | 332 |
|
199 | 333 | <h3 id="1-post-hoc-explanations">1. Post-hoc Explanations</h3> |
@@ -289,11 +423,14 @@ <h2 id="scaling-laws-vs-interpretability">Scaling Laws vs. Interpretability</h2> |
289 | 423 |
|
290 | 424 | <p>Empirical performance follows a power-law relationship in each scale factor (compute, dataset size, model size): $L(x) = (x/x_0)^{-\alpha}$, provided performance is not bottlenecked by the other two factors. However, as models scale, they become less interpretable.</p> |
291 | 425 |
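| | +<p>As a quick illustrative calculation (the value of $\alpha$ is chosen here for illustration, not taken from the lecture): with $\alpha = 0.05$, scaling $x$ up by a factor of 10 multiplies the loss by $10^{-0.05} \approx 0.89$, so each additional order of magnitude of scale buys a predictable but modest reduction in loss, and says nothing about how interpretable or aligned the resulting model is.</p> |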
|
292 | | -<h2 id="system-design-view-of-interpretability">System Design View of Interpretability</h2> |
| 426 | +<hr /> |
293 | 427 |
|
294 | | -<p>Interpretability is not just about debugging; it is a system design feature. It allows us to move from <strong>Individual Stats</strong> (like a player’s points per game) to <strong>System Stats</strong> (like a lineup’s net rating), which correlates better with winning.</p> |
| 428 | +<h2 id="a-system-design-view-of-interpretability">A System Design View of Interpretability</h2> |
| 429 | + |
| 430 | +<p>Interpretability is a system-level property, not just a debugging tool. Like moving from individual player stats to lineup net rating, interpretability helps optimize the <strong>human–AI system</strong>.</p> |
| 431 | + |
| 432 | +<p>The three main benefits are:</p> |
295 | 433 |
|
296 | | -<p>The three main system design benefits are:</p> |
297 | 434 | <ol> |
298 | 435 | <li><strong>Information Acquisition</strong></li> |
299 | 436 | <li><strong>Value Alignment</strong></li> |
@@ -345,6 +482,11 @@ <h2 id="open-challenges--takeaways">Open Challenges & Takeaways</h2> |
345 | 482 | <li><strong>Verifiable Rewards</strong>: Scaling RL requires rewards that can be verified at scale.</li> |
346 | 483 | <li><strong>Symbolic Reasoning</strong>: Combining LLMs with symbolic reasoning and graphical models remains an open problem.</li> |
347 | 484 | </ul> |
| 485 | + |
| 486 | +<hr /> |
| 487 | + |
| 488 | +<p><strong>Final Takeaway:</strong><br /> |
| 489 | +Scaling delivers performance, but interpretability, alignment, and system-level thinking determine whether AI systems are safe, useful, and beneficial in the real world.</p> |
348 | 490 | </d-article> |
349 | 491 |
|
350 | 492 | <d-appendix> |
@@ -391,7 +533,7 @@ <h2 class="footer-heading">Introduction to Deep Learning and Generative Models</ |
391 | 533 | </body> |
392 | 534 |
|
393 | 535 | <d-bibliography |
394 | | - src="/dgm-fall-2025/assets/bibliography/2025-12-08-lecture-25.bib" |
| 536 | + src="/dgm-fall-2025/assets/bibliography/2025-12-01-lecture-25.bib" |
395 | 537 | > |
396 | 538 | </d-bibliography> |
397 | 539 |
|
|