
Commit e6bdf6f

🚀 Deploy updated DGM site (2025-09-17 07:09)
1 parent 594a321 commit e6bdf6f

3 files changed: +124 −142 lines changed
Binary file (8.84 MB) not shown.

dgm-fall-2025/lectures/index.html

Lines changed: 1 addition & 1 deletion
@@ -240,7 +240,7 @@ <h2 class="post-description"></h2>
<br />
[

-<a href="" target="_blank">slides</a>
+<a href="/dgm-fall-2025/assets/lectures/Lecture_05_gradient_descent.pdf" target="_blank">slides</a>



dgm-fall-2025/notes/lecture-03/index.html

Lines changed: 123 additions & 141 deletions
@@ -199,9 +199,8 @@ <h2 id="3-vectors-matrices-and-broadcasting">3. Vectors, Matrices, and Broadcast
<li>A vector is an <strong>n-by-1</strong> matrix, where <strong>n</strong> is the number of <strong>row</strong>, and <strong>1</strong> is the number of <strong>column</strong>.</li>
<li>In deep learning, we often represent vectors as <strong>column vectors</strong>.</li>
<li>The linear combination (pre-activation value) is written: $z = \mathbf{w}^\top \mathbf{x} + b$, where</li>
-</ul>
-<d-math block="">
-\mathbf{x} =
+<li>
+\[\mathbf{x} =
\begin{bmatrix}
x_1 \\
x_2 \\
@@ -219,47 +218,46 @@ <h2 id="3-vectors-matrices-and-broadcasting">3. Vectors, Matrices, and Broadcast
\end{bmatrix}
\in \mathbb{R}^{m \times 1},
\quad
-z \in \mathbb{R}
-</d-math>
+z \in \mathbb{R}\]
+</li>
+</ul>
</li>
<li><strong>Matrices:</strong>
<ul>
<li>A matrix is an <strong>n-by-m arrays</strong> of numbers, where <strong>n</strong> is the number of <strong>row</strong>, and <strong>m</strong> is the number of <strong>column</strong>.</li>
<li>The linear combination is written: $\mathbf{z} = \mathbf{Xw} + b$, where</li>
</ul>
-
-<d-math block="">
-\mathbf{X} =
-\begin{bmatrix}
-x^{[1]}_1 &amp; x^{[1]}_2 &amp; \cdots &amp; x^{[1]}_m \\
-x^{[2]}_1 &amp; x^{[2]}_2 &amp; \cdots &amp; x^{[2]}_m \\
-\vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
-x^{[n]}_1 &amp; x^{[n]}_2 &amp; \cdots &amp; x^{[n]}_m
-\end{bmatrix}
-\in \mathbb{R}^{n \times m},
-\quad
-\mathbf{w} =
-\begin{bmatrix}
-w_1 \\
-w_2 \\
-\vdots \\
-w_m
-\end{bmatrix}
-\in \mathbb{R}^{m \times 1},
-\quad
-\mathbf{z} =
-\begin{bmatrix}
-z^{[1]} \\
-z^{[2]} \\
-\vdots \\
-z^{[n]}
-\end{bmatrix}
-\in \mathbb{R}^{n \times 1}
-</d-math>
-<ul>
-<li>Time Complexity of N-by-N matrices multiplication by naive algorithms: $O(n^3)$.</li>
-</ul>
</li>
+</ul>
+
+<p>\(\mathbf{X} =
+\begin{bmatrix}
+x^{[1]}_1 &amp; x^{[1]}_2 &amp; \cdots &amp; x^{[1]}_m \\
+x^{[2]}_1 &amp; x^{[2]}_2 &amp; \cdots &amp; x^{[2]}_m \\
+\vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
+x^{[n]}_1 &amp; x^{[n]}_2 &amp; \cdots &amp; x^{[n]}_m
+\end{bmatrix}
+\in \mathbb{R}^{n \times m},
+\quad
+\mathbf{w} =
+\begin{bmatrix}
+w_1 \\
+w_2 \\
+\vdots \\
+w_m
+\end{bmatrix}
+\in \mathbb{R}^{m \times 1},
+\quad
+\mathbf{z} =
+\begin{bmatrix}
+z^{[1]} \\
+z^{[2]} \\
+\vdots \\
+z^{[n]}
+\end{bmatrix}
+\in \mathbb{R}^{n \times 1}\)</p>
+<ul>
+<li>Time Complexity of N-by-N matrices multiplication by naive algorithms: $O(n^3)$.</li>
<li><strong>Broadcasting:</strong>
<ul>
<li>The rigorous math formula of linear combination is: $\mathbf{z} = \mathbf{Xw} + \mathbf{1}_n b$.</li>
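The hunk above reflows the notes' batched linear combination $\mathbf{z} = \mathbf{Xw} + b$ and its broadcasting shorthand $\mathbf{z} = \mathbf{Xw} + \mathbf{1}_n b$. A minimal NumPy sketch of that computation (the shapes n = 4, m = 3 and the values are illustrative assumptions, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # n = 4 examples, m = 3 features
w = rng.normal(size=(3, 1))   # column vector of weights
b = 0.5                       # scalar bias

# z = Xw + b: the scalar b is broadcast over all n rows,
# i.e. the shorthand for z = Xw + 1_n * b in the notes.
z = X @ w + b
print(z.shape)                # (4, 1)

# Equivalent explicit form with the ones vector 1_n.
z_explicit = X @ w + np.ones((4, 1)) * b
assert np.allclose(z, z_explicit)
```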
@@ -311,7 +309,7 @@ <h2 id="4-probability-basics">4. Probability Basics</h2>
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
<ul>
<li>Called <strong>“Normal”</strong> because of the <strong>Central Limit Theorem</strong>.</li>
-<li><strong>Standard Normal:</strong> when $ \mu (mean) = 0, \sigma (standard deviation) = 1 $.</li>
+<li><strong>Standard Normal:</strong> when $\mu= 0, \sigma= 1$.</li>
</ul>
</li>
</ul>
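For a quick numeric check of the density in this hunk, a small sketch that evaluates the Normal pdf directly from the formula (purely illustrative):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1 / sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Standard Normal (mu = 0, sigma = 1) at x = 0: 1/sqrt(2*pi) ≈ 0.3989.
print(normal_pdf(0.0))
```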
@@ -320,8 +318,8 @@ <h2 id="4-probability-basics">4. Probability Basics</h2>
</li>
<li><strong>Central Limit Theorem (CLT):</strong>
<ul>
-<li>Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with mean $\mu$ and variance $\sigma^2$.</li>
-<li>Define the <strong>sample mean</strong>:$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$</li>
+<li>Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with mean $\mu$ and finite variance $\sigma^2$.</li>
+<li>Define the <strong>sample mean</strong>: $\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$</li>
<li>Then we have: $\frac{\bar{X}_n - \mu}{\frac{\sigma}{\sqrt{n}}} \to N(0,1)$ as $n \to \infty$</li>
</ul>
</li>
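A short simulation of the standardized sample mean from the CLT bullet above; the choice of Uniform(0, 1) draws and the sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1_000, 10_000

# Uniform(0, 1) has mean 1/2 and variance 1/12.
mu, sigma = 0.5, np.sqrt(1 / 12)

# Standardized sample mean: (X_bar - mu) / (sigma / sqrt(n)).
samples = rng.uniform(size=(trials, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# By the CLT these should be close to 0 and 1 for large n.
print(z.mean(), z.std())
```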
@@ -391,8 +389,11 @@ <h2 id="4-probability-basics">4. Probability Basics</h2>
</li>
<li><strong>Bayes’Rule:</strong>
<ul>
-<li>$ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} $</li>
-<li>Example: Medical Test:
+<li>
+<p>$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$</p>
+</li>
+<li>
+<p>Example: Medical Test:</p>
<ul>
<li>$P(\text{disease} \mid \text{positive test}) = \frac{P(\text{positive test} \mid \text{disease}) \, P(\text{disease})} {P(\text{positive test})} $</li>
</ul>
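The medical-test example in this hunk is easier to read with numbers plugged in; the prevalence, sensitivity, and false-positive rate below are made-up illustrative values, not figures from the notes:

```python
# Bayes' rule for P(disease | positive test), with hypothetical rates.
p_disease = 0.01             # prior P(disease)
p_pos_given_disease = 0.95   # sensitivity, P(positive | disease)
p_pos_given_healthy = 0.05   # false-positive rate, P(positive | no disease)

# Law of total probability for the denominator P(positive).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))   # ≈ 0.161: a rare condition stays unlikely even after a positive test
```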
@@ -418,44 +419,37 @@ <h4 id="maximum-likelihood-estimation-mle">Maximum Likelihood Estimation (MLE)</
<ul>
<li><strong>Definition:</strong></li>
</ul>
+</li>
+</ul>

-<d-math block="">
-
-\hat{\theta}_{\text{MAP}}
-= \arg\max{\theta} P(\theta \mid \text{data})
-= \arg\max_{\theta} P(\text{data} \mid \theta),P(\theta).
-</d-math>
+\[\hat{\theta}_{\text{MAP}}
+= \arg\max_{\theta} P(\theta \mid \text{data})
+= \arg\max_{\theta} \big[ P(\text{data} \mid \theta)\, P(\theta) \big].\]

-<ul>
-<li><strong>Interpretation:</strong><br />
+<ul>
+<li><strong>Interpretation:</strong><br />
MLE chooses $\theta$ that makes the observed data most “likely.”</li>
-<li><strong>Log-likelihood:</strong></li>
-</ul>
+<li><strong>Log-likelihood:</strong></li>
+</ul>

-<d-math block="">
-
-\ell(\theta)
-= \log L(\theta)
-= \sum_i \Big[ x_i \log \theta + (1-x_i)\log(1-\theta) \Big].
-</d-math>
+\[\ell(\theta)
+= \log L(\theta)
+= \sum_i \Big[ x_i \log \theta + (1-x_i)\log(1-\theta) \Big].
+&lt;/d-math&gt;\]

-<ul>
-<li><strong>Example (Bernoulli):</strong><br />
+<ul>
+<li><strong>Example (Bernoulli):</strong><br />
Suppose we observe $k$ successes in $n$ Bernoulli trials. Then</li>
-</ul>
+</ul>

-<d-math block="">
-\hat{\theta}_{\text{MLE}} = \frac{k}{n}.
-</d-math>
+\[\hat{\theta}_{\text{MLE}} = \frac{k}{n}.\]

+<ul>
+<li><strong>Notes:</strong>
<ul>
-<li><strong>Notes:</strong>
-<ul>
-<li>MLE does not always exist.</li>
-<li>MLE may not be unique.</li>
-<li>MLE may not always be admissible.</li>
-</ul>
-</li>
+<li>MLE does not always exist.</li>
+<li>MLE may not be unique.</li>
+<li>MLE may not always be admissible.</li>
</ul>
</li>
<li>
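A small sketch of the Bernoulli MLE example in this hunk: the closed form $k/n$ agrees with a brute-force maximization of the log-likelihood (the data below are illustrative):

```python
import numpy as np

# Illustrative data: k successes in n Bernoulli trials.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])
n, k = len(x), int(x.sum())

def log_likelihood(theta):
    # ell(theta) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_mle = k / n
print(theta_mle)  # 0.625

# Grid check: the log-likelihood peaks at k/n.
grid = np.linspace(0.001, 0.999, 999)
print(grid[np.argmax([log_likelihood(t) for t in grid])])  # ≈ 0.625
```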
@@ -464,18 +458,17 @@ <h4 id="maximum-a-posteriori-map">Maximum A Posteriori (MAP)</h4>
<ul>
<li><strong>Definition:</strong></li>
</ul>
+</li>
+</ul>

-<d-math block="">
-
-\hat{\theta}{\text{MAP}}
-= \arg\max{\theta} P(\theta \mid \text{data})
-= \arg\max_{\theta} P(\text{data} \mid \theta),P(\theta).
-</d-math>
+\[\hat{\theta}{\text{MAP}}
+= \arg\max{\theta} P(\theta \mid \text{data})
+= \arg\max_{\theta} P(\text{data} \mid \theta),P(\theta).\]

-<ul>
-<li>MAP incorporates a <strong>prior distribution</strong> $P(\theta)$.</li>
-<li>MLE ignores the prior.</li>
-</ul>
+<ul>
+<li>MAP incorporates a <strong>prior distribution</strong> $P(\theta)$.</li>
+<li>
+<p>MLE ignores the prior.</p>
</li>
<li>
<h4 id="regularization-as-map">Regularization as MAP</h4>
@@ -492,22 +485,18 @@ <h4 id="regularization-as-map">Regularization as MAP</h4>

<p>Formally:</p>

-<d-math block="">
-\hat{\theta}_{\text{reg}}
+\[\hat{\theta}_{\text{reg}}
= \arg\max_{\theta} \Big[ \log L(\theta) - \lambda R(\theta) \Big]
\quad\Longleftrightarrow\quad
-\hat{\theta}_{\text{MAP}}
-</d-math>
+\hat{\theta}_{\text{MAP}}\]

<h2 id="6-linear-regression">6. Linear Regression</h2>
<p>Linear regression models the relationship between inputs (features) and outputs (responses).</p>

<ul>
<li>
<h4 id="model-definition">Model Definition</h4>
-<d-math block="">
-y = X\beta + \epsilon
-</d-math>
+<p>\(y = X\beta + \epsilon\)</p>

<ul>
<li>$y$: response variable (dependent variable).</li>
@@ -524,83 +513,76 @@ <h4 id="evaluation-metrics">Evaluation Metrics</h4>
<ul>
<li><strong>Coefficient of Determination ($R^2$):</strong></li>
</ul>
+</li>
+</ul>

-<d-math block="">
-R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
-</d-math>
+\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}\]

-<ul>
-<li>
-<p>Measures the proportion of variance in $y$ explained by the model.</p>
-</li>
-<li>
-<p><strong>Mean Squared Error (MSE):</strong></p>
-</li>
-</ul>
+<ul>
+<li>
+<p>Measures the proportion of variance in $y$ explained by the model.</p>
+</li>
+<li>
+<p><strong>Mean Squared Error (MSE):</strong></p>
+</li>
+</ul>

-<d-math block="">
-MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2
-</d-math>
+\[MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2\]

-<ul>
-<li><strong>Mean Absolute Error (MAE):</strong></li>
-</ul>
+<ul>
+<li><strong>Mean Absolute Error (MAE):</strong></li>
+</ul>

-<d-math block="">
-MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|
-</d-math>
-</li>
+\[MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|\]
+
+<ul>
<li>
<h4 id="ordinary-least-squares-ols">Ordinary Least Squares (OLS)</h4>

<ul>
<li><strong>Objective:</strong></li>
</ul>
+</li>
+</ul>

-<d-math block="">
-\hat{\beta}_{\text{OLS}}
-= \arg\min_{\beta} \|y - X\beta\|^2
-</d-math>
-
-<ul>
-<li>
-<p><strong>Residuals:</strong> $e_i = y_i - \hat{y}_i$.</p>
-</li>
-<li>
-<p><strong>Closed-form solution:</strong></p>
-</li>
-</ul>
+\[\hat{\beta}_{\text{OLS}}
+= \arg\min_{\beta} \|y - X\beta\|^2\]

-<d-math block="">
-\hat{\beta}_{\text{OLS}} = (X^TX)^{-1}X^Ty
-</d-math>
+<ul>
+<li>
+<p><strong>Residuals:</strong> $e_i = y_i - \hat{y}_i$.</p>
</li>
+<li>
+<p><strong>Closed-form solution:</strong></p>
+</li>
+</ul>
+
+\[\hat{\beta}_{\text{OLS}} = (X^TX)^{-1}X^Ty\]
+
+<ul>
<li>
<h4 id="regularization-in-linear-regression">Regularization in Linear Regression</h4>

<ul>
<li><strong>Ridge Regression (L2):</strong></li>
</ul>
+</li>
+</ul>

-<d-math block="">
-\hat{\beta}_{\text{ridge}}
-= \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2
-</d-math>
-
-<p>Equivalent MAP interpretation: Gaussian prior $\beta \sim N(0, \sigma^2I)$.</p>
+\[\hat{\beta}_{\text{ridge}}
+= \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2\]

-<ul>
-<li><strong>Lasso Regression (L1):</strong></li>
-</ul>
+<p>Equivalent MAP interpretation: Gaussian prior $\beta \sim N(0, \sigma^2I)$.</p>

-<d-math block="">
-\hat{\beta}_{\text{lasso}}
-= \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_1
-</d-math>
-<p>Equivalent MAP interpretation: Laplace prior $\beta \sim \text{Laplace}(0, b)$.<br />
-Encourages <strong>sparsity</strong> (many coefficients shrink to 0).</p>
-</li>
+<ul>
+<li><strong>Lasso Regression (L1):</strong></li>
</ul>
+
+\[\hat{\beta}_{\text{lasso}}
+= \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_1\]
+
+<p>Equivalent MAP interpretation: Laplace prior $\beta \sim \text{Laplace}(0, b)$.<br />
+Encourages <strong>sparsity</strong> (many coefficients shrink to 0).</p>
</d-article>

<d-appendix>
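A compact sketch tying together the OLS closed form, ridge regression, and the evaluation metrics from this file's final hunks, on synthetic data (the dimensions, noise level, and λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)           # y = X beta + eps

# OLS closed form: beta = (X^T X)^{-1} X^T y (np.linalg.solve is more stable than inverting).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (L2): adds lambda * I to the normal equations (intercept penalized too, for brevity).
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Evaluation metrics from the notes: MSE, MAE, and R^2.
y_hat = X @ beta_ols
mse = np.mean((y - y_hat) ** 2)
mae = np.mean(np.abs(y - y_hat))
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(beta_ols.round(2), round(mse, 3), round(mae, 3), round(r2, 3))
```

Lasso has no closed form; it is typically fit iteratively (e.g., by coordinate descent), which is why only ridge is shown in closed form here.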
