
Commit c4648a9

🚀 Deploy updated DGM site (2025-12-17 10:10)
Parent: 1e853f9 · Commit: c4648a9

4 files changed (+189 -57 lines)

dgm-fall-2025/notes/index.html

Lines changed: 13 additions & 13 deletions
@@ -93,24 +93,14 @@ <h2>The notes written by students and edited by instructors</h2>
 
 <ul class="post-list">
 
-<li >
-<p class="post-meta">December 8, 2025</p>
-<h2>
-<a class="post-title" href="/dgm-fall-2025/notes/lecture-25/"
->Lecture 25</a
->
-</h2>
-<p>Alignment, explainability, and open directions in LLM research</p>
-</li>
-
 <li >
 <p class="post-meta">December 1, 2025</p>
 <h2>
 <a class="post-title" href="/dgm-fall-2025/notes/lecture-25/"
->Lecture 25: Alignment, Explainability, and Open Directions</a
+>Lecture 25</a
 >
 </h2>
-<p>A summary of the lecture covering the history of interpretability, scaling laws, system design perspectives on AI, and current open problems in deep learning.</p>
+<p>Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.</p>
 </li>
 
 <li >
@@ -183,7 +173,7 @@ <h2>
 <p>Generative Adversarial Networks (GANs)</p>
 </li>
 
-<li style="border-bottom: none;" >
+<li >
 <p class="post-meta">October 29, 2025</p>
 <h2>
 <a class="post-title" href="/dgm-fall-2025/notes/lecture-16/"
@@ -193,6 +183,16 @@ <h2>
 <p>Detailed lecture notes exploring Autoencoders, their variants, and Variational Autoencoders in Deep Generative Models.</p>
 </li>
 
+<li style="border-bottom: none;" >
+<p class="post-meta">October 27, 2025</p>
+<h2>
+<a class="post-title" href="/dgm-fall-2025/notes/lecture-15/"
+>Lecture 15</a
+>
+</h2>
+<p>A Linear Intro to Generative Models</p>
+</li>
+
 </ul>
dgm-fall-2025/notes/lecture-25/index.html

Lines changed: 165 additions & 23 deletions
@@ -40,8 +40,8 @@
 <script type="text/json">
 {
 "title": "Lecture 25",
-"description": "Alignment, explainability, and open directions in LLM research",
-"published": "December 8, 2025",
+"description": "Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.",
+"published": "December 1, 2025",
 "lecturers": [
 
 {
@@ -53,8 +53,11 @@
 "authors": [
 
 {
-"author": "Reid Chen",
-"authorURL": "https://www.deepneural.network"
+"author": "Rishit Malpani"
+},
+
+{
+"author": "Reid Chen"
 },
 
 {
@@ -157,43 +160,174 @@
 <div class="page-content">
 <d-title>
 <h1>Lecture 25</h1>
-<p>Alignment, explainability, and open directions in LLM research</p>
+<p>Alignment, explainability, and open research directions in modern machine learning, with a focus on large language models and system-level reliability.</p>
 </d-title>
 
 <d-byline></d-byline>
 
-<d-article> <h2 id="phases-of-model-training">Phases of Model Training</h2>
+<d-article> <h2 id="key-takeaways">Key Takeaways</h2>
+
+<ul>
+<li>Modern AI research is shifting from raw performance to <strong>alignment, interpretability, and system-level reliability</strong>.</li>
+<li>Post-hoc explainability tools are widely used but have serious <strong>fidelity and robustness limitations</strong>.</li>
+<li>Scaling laws explain why larger models work better, but they do <strong>not guarantee safety or alignment</strong>.</li>
+<li>Interpretability benefits not only users, but also <strong>system designers</strong>, by improving measurement, modularity, and value alignment.</li>
+<li>Many core challenges (alignment, reasoning, data limits, economic impact) remain <strong>open research problems</strong>.</li>
+</ul>
 
-<p>The training pipeline for modern Large Language Models (LLMs) generally follows a progression from broad pattern matching to specific task alignment.</p>
+<h2 id="logistics">Logistics</h2>
 
 <ul>
-<li><strong>Random Model</strong>: The starting point of the architecture.</li>
-<li><strong>Pre-training</strong>: The model is trained unsupervised on massive datasets (e.g., Common Crawl) to learn general patterns.</li>
-<li><strong>Fine-tuning</strong>: The pre-trained model is refined using In-Domain Data via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).</li>
-<li><strong>In-context learning</strong>: During usage, prompts and examples in the input guide the model to produce outputs adapted to user intent without updating weights.</li>
+<li><strong>Project Final Report:</strong> Due Friday, December 12th. Submit via Canvas.</li>
+<li><strong>Final Exam:</strong> December 17th, 5:05–7:05 PM in Science 180. A study guide has been released.</li>
 </ul>
 
+<hr />
+
+<h2 id="learning-goals">Learning Goals</h2>
+
+<p>By the end of this lecture, you should be able to:</p>
+
+<ul>
+<li>Explain why <strong>alignment</strong> and <strong>explainability</strong> are central problems in modern AI.</li>
+<li>Distinguish between <strong>post-hoc</strong>, <strong>transparent</strong>, and <strong>mechanistic</strong> interpretability.</li>
+<li>Describe the difference between <strong>outer alignment</strong> and <strong>inner alignment</strong>.</li>
+<li>Understand how <strong>system design</strong> interacts with interpretability.</li>
+<li>Identify major <strong>open research problems</strong> in alignment and interpretability.</li>
+</ul>
+
+<hr />
+
+<h2 id="the-llm-training-and-usage-pipeline">The LLM Training and Usage Pipeline</h2>
+
+<p>Modern Large Language Models (LLMs) progress through distinct stages, from broad pattern learning to task-specific adaptation:</p>
+
+<ol>
+<li><strong>Random Model</strong>: The initialized architecture before training.</li>
+<li><strong>Pre-Training</strong>: Unsupervised training on massive datasets (e.g., Common Crawl) to learn general patterns.</li>
+<li><strong>Fine-Tuning</strong>: Alignment using in-domain data via Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF).</li>
+<li><strong>In-Context Learning</strong>: At inference time, prompts and examples guide behavior without updating weights.</li>
+</ol>
+
+<p><strong>Key Observation:</strong><br />
+The same trained model can behave very differently depending on context. Pre-training, fine-tuning, and in-context learning primarily change <strong>how the model is used</strong>, not just its parameters.</p>
+
+<hr />
+
 <h2 id="why-explainability-matters">Why Explainability Matters</h2>
 
-<p>Models trained on large data are rarely naturally interpretable to humans. Historically, the field has moved through several phases:</p>
+<p>Models trained on large-scale data are rarely naturally interpretable to humans. Explainability is critical for:</p>
+
 <ul>
-<li><strong>2016</strong>: Interpretability is invoked when metrics (like accuracy) are imperfect proxies for the true objective.</li>
-<li><strong>2017</strong>: Doshi-Velez &amp; Kim defined three modes of evaluation: application-grounded, human-grounded, and functionally-grounded.</li>
-<li><strong>2017-2020</strong>: Approaches fragmented into Post-Hoc (industry standard), Transparency (niche), and Mechanistic (technically deep).</li>
+<li>Safety and trust</li>
+<li>Debugging and model validation</li>
+<li>Regulatory and ethical compliance</li>
+<li>Understanding system-level behavior beyond accuracy</li>
 </ul>
 
-<h2 id="fairness--sensitive-features">Fairness &amp; Sensitive Features</h2>
+<h3 id="common-confusions">Common Confusions</h3>
+
+<ul>
+<li><strong>Explainability ≠ Accuracy</strong>: A highly accurate model can still be unsafe or untrustworthy.</li>
+<li><strong>Post-hoc explanations ≠ true understanding</strong>: Plausible explanations may not reflect the model’s actual computation.</li>
+<li><strong>Dropping sensitive features ≠ fairness</strong>: Bias can persist through correlated variables.</li>
+</ul>
 
-<p>Merely dropping sensitive features like “race” from training data does <strong>not</strong> ensure the model is invariant to them, as biases can be encoded via correlated variables.</p>
+<hr />
+
+<h2 id="fairness-and-sensitive-features">Fairness and Sensitive Features</h2>
+
+<p>Removing sensitive attributes like race or gender from training data does <strong>not</strong> ensure invariance.</p>
 
 <p><strong>Strategies for Invariance:</strong></p>
+
 <ol>
 <li><strong>Remove the feature</strong>: Often insufficient due to correlations.</li>
-<li><strong>Train then clean</strong>: Train on all features, then attempt to remove the learned component associated with the sensitive feature.</li>
+<li><strong>Train then clean</strong>: Train with all features, then remove learned components post-hoc.</li>
 <li><strong>Test-time blinding</strong>: Drop the feature only during inference.</li>
-<li><strong>Modified Loss</strong>: Train with a loss function specifically designed to encourage invariant predictions.</li>
+<li><strong>Modified loss functions</strong>: Penalize prediction dependence on sensitive attributes.</li>
 </ol>
 
+<hr />
+
+<h2 id="the-history-of-interpretability">The History of Interpretability</h2>
+
+<h3 id="interpretability-categories">Interpretability Categories</h3>
+
+<table>
+<thead>
+<tr>
+<th>Type</th>
+<th>Core Idea</th>
+<th>Main Limitation</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Post-hoc</td>
+<td>Explain predictions after training</td>
+<td>Often lacks fidelity</td>
+</tr>
+<tr>
+<td>Transparent</td>
+<td>Interpretable by design</td>
+<td>Limited flexibility</td>
+</tr>
+<tr>
+<td>Mechanistic</td>
+<td>Reverse-engineer internals</td>
+<td>Hard to scale</td>
+</tr>
+</tbody>
+</table>
+
+<h3 id="2016-setting-the-stage">2016: Setting the Stage</h3>
+
+<ul>
+<li><strong>The Mythos</strong>: Interpretability invoked when metrics are imperfect proxies for objectives (Lipton, 2016).</li>
+<li><strong>Evaluation Modes</strong>: Application-grounded, human-grounded, and functionally-grounded (Doshi-Velez &amp; Kim, 2017).</li>
+</ul>
+
+<h3 id="20172020-fragmentation">2017–2020: Fragmentation</h3>
+
+<table>
+<thead>
+<tr>
+<th>Methodology</th>
+<th>Examples</th>
+<th>Description</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Post-hoc</td>
+<td>LIME, SHAP, Integrated Gradients</td>
+<td>Industry standard; explain after training</td>
+</tr>
+<tr>
+<td>Transparency</td>
+<td>GAMs, Monotonic Nets</td>
+<td>Niche, common in healthcare/tabular data</td>
+</tr>
+<tr>
+<td>Mechanistic</td>
+<td>Circuits, probing</td>
+<td>Technically deep, rarely user-facing</td>
+</tr>
+</tbody>
+</table>
+
+<h3 id="cracks-in-post-hoc-explanations">Cracks in Post-Hoc Explanations</h3>
+
+<ul>
+<li><strong>Insensitivity</strong>: Saliency maps may remain unchanged under weight randomization (Adebayo et al., 2018).</li>
+<li><strong>Vulnerability</strong>: LIME and SHAP can be easily fooled (Slack et al., 2020).</li>
+<li><strong>Plausibility vs. Faithfulness</strong>: Explanations may look reasonable but misrepresent computation (Jacovi &amp; Goldberg, 2020).</li>
+<li><strong>High-Stakes Critique</strong>: In safety-critical settings, post-hoc methods may be insufficient (Rudin, 2019).</li>
+</ul>
+
+<hr />
+
 <h2 id="interpretability-approaches">Interpretability Approaches</h2>
 
 <h3 id="1-post-hoc-explanations">1. Post-hoc Explanations</h3>
@@ -289,11 +423,14 @@ <h2 id="scaling-laws-vs-interpretability">Scaling Laws vs. Interpretability</h2>
 
 <p>Empirical performance follows a power-law relationship: $L(x) = (x/x_0)^{-\alpha}$ provided it is not bottlenecked by the other two factors. However, as models scale, they become less interpretable.</p>
 
-<h2 id="system-design-view-of-interpretability">System Design View of Interpretability</h2>
+<hr />
 
-<p>Interpretability is not just about debugging; it is a system design feature. It allows us to move from <strong>Individual Stats</strong> (like a player’s points per game) to <strong>System Stats</strong> (like a lineup’s net rating), which correlates better with winning.</p>
+<h2 id="a-system-design-view-of-interpretability">A System Design View of Interpretability</h2>
+
+<p>Interpretability is a system-level property, not just a debugging tool. Like moving from individual player stats to lineup net rating, interpretability helps optimize the <strong>human–AI system</strong>.</p>
+
+<p>The three main benefits are:</p>
 
-<p>The three main system design benefits are:</p>
 <ol>
 <li><strong>Information Acquisition</strong></li>
 <li><strong>Value Alignment</strong></li>
@@ -345,6 +482,11 @@ <h2 id="open-challenges--takeaways">Open Challenges &amp; Takeaways</h2>
 <li><strong>Verifiable Rewards</strong>: Scaling RL requires rewards that can be verified at scale.</li>
 <li><strong>Symbolic Reasoning</strong>: Combining LLMs with symbolic reasoning and graphical models remains an open problem.</li>
 </ul>
+
+<hr />
+
+<p><strong>Final Takeaway:</strong><br />
+Scaling delivers performance, but interpretability, alignment, and system-level thinking determine whether AI systems are safe, useful, and beneficial in the real world.</p>
 </d-article>
 
 <d-appendix>
@@ -391,7 +533,7 @@ <h2 class="footer-heading">Introduction to Deep Learning and Generative Models</
 </body>
 
 <d-bibliography
-src="/dgm-fall-2025/assets/bibliography/2025-12-08-lecture-25.bib"
+src="/dgm-fall-2025/assets/bibliography/2025-12-01-lecture-25.bib"
 >
 </d-bibliography>
dgm-fall-2025/notes/page2/index.html

Lines changed: 11 additions & 11 deletions
@@ -93,16 +93,6 @@ <h2>The notes written by students and edited by instructors</h2>
 
 <ul class="post-list">
 
-<li >
-<p class="post-meta">October 27, 2025</p>
-<h2>
-<a class="post-title" href="/dgm-fall-2025/notes/lecture-15/"
->Lecture 15</a
->
-</h2>
-<p>A Linear Intro to Generative Models</p>
-</li>
-
 <li >
 <p class="post-meta">October 15, 2025</p>
 <h2>
@@ -183,7 +173,7 @@ <h2>
 <p>Automatic Differentiation with PyTorch</p>
 </li>
 
-<li style="border-bottom: none;" >
+<li >
 <p class="post-meta">September 17, 2025</p>
 <h2>
 <a class="post-title" href="/dgm-fall-2025/notes/lecture-05/"
@@ -193,6 +183,16 @@ <h2>
 <p>Fitting Neurons with Gradient Descent</p>
 </li>
 
+<li style="border-bottom: none;" >
+<p class="post-meta">September 15, 2025</p>
+<h2>
+<a class="post-title" href="/dgm-fall-2025/notes/lecture-04/"
+>Lecture 04</a
+>
+</h2>
+<p>Single-layer networks</p>
+</li>
+
 </ul>
dgm-fall-2025/notes/page3/index.html

Lines changed: 0 additions & 10 deletions
@@ -93,16 +93,6 @@ <h2>The notes written by students and edited by instructors</h2>
 
 <ul class="post-list">
 
-<li >
-<p class="post-meta">September 15, 2025</p>
-<h2>
-<a class="post-title" href="/dgm-fall-2025/notes/lecture-04/"
->Lecture 04</a
->
-</h2>
-<p>Single-layer networks</p>
-</li>
-
 <li >
 <p class="post-meta">September 10, 2025</p>
 <h2>
