`src/content/post/bayesian-regression.mdx` (48 additions, 26 deletions)
Linear Regression could be intuitively interpreted from several points of view, e.g. ...
## Linear Regression: A Refresher
Recall that in Linear Regression we want to map our inputs to real numbers, i.e. $f: \mathbb{R}^N \to \mathbb{R}$. For example, given some features, e.g. hours of studying, number of subjects taken, and a student's IQ, we want to predict his or her GPA.
There are several types of Linear Regression, depending on the cost function and the regularizer. In this post, we will focus on Linear Regression with an $\ell_2$ cost and $\ell_2$ regularization. In statistics, this kind of regression is called Ridge Regression.
Formally, the objective is as follows:
$$
L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \frac{\lambda}{2} \Vert W \Vert^2_2
$$
where $\hat{y}$ is the ground truth value, and $y$ is given by:
$$
y = W^Tx
$$
which is a linear combination of the feature vector and the weight matrix. The additional $\frac{1}{2}$ in both terms is just for mathematical convenience when taking the derivative.
The idea is then to minimize this objective function with respect to $W$. That is, we want to find the weight matrix $W$ that minimizes the regularized squared error.
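As a quick illustration (a minimal NumPy sketch, not part of the original post; `lam` stands in for $\lambda$ and the toy data is made up), this objective has the well-known closed-form minimizer $W = (X^TX + \lambda I)^{-1} X^T \hat{y}$:

```python
import numpy as np

def ridge_fit(X, y_true, lam=0.1):
    # Closed-form minimizer of 0.5 * ||y_true - X W||^2 + 0.5 * lam * ||W||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y_true)

def ridge_loss(W, X, y_true, lam=0.1):
    resid = y_true - X @ W
    return 0.5 * resid @ resid + 0.5 * lam * W @ W

# Toy data: 3 features (hours studied, subjects taken, IQ) and a GPA-like target
X = np.random.randn(100, 3)
y_true = X @ np.array([0.3, 0.1, 0.5]) + 0.1 * np.random.randn(100)
W = ridge_fit(X, y_true)
```

Setting `lam` to zero gives the unregularized least-squares solution discussed next (provided $X^TX$ is invertible).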
Of course, we could ignore the regularization term. What we end up with, then, is vanilla Linear Regression:
$$
L = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2
$$
Minimizing this objective is the definition of the Linear Least Squares problem.
[...]
or equivalently, we could say that the error is:
$$
\epsilon = \hat{y} - y
$$
Now, let's say we model the regression target as a Gaussian random variable, i.e. $y \sim N(\mu, \sigma^2)$, with $\mu = y = W^Tx$, the prediction of our model. Formally:
$$
P(y \vert x, W) = N(y \vert W^Tx, \sigma^2)
$$

Then, to find the optimum $W$, we could use Maximum Likelihood Estimation (MLE). As the above model is a likelihood, i.e. it describes our data $y$ under the parameter $W$, we will do MLE on it.
For simplicity, let's say $\sigma^2 = 1$; then:
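The algebra is elided in this excerpt; as a sketch (a reconstruction rather than the post's original derivation, evaluating the Gaussian at the observed targets $\hat{y}_i$ and absorbing the normalization constants into $\text{const}$):

$$
\begin{align}
\sum_i \log N(\hat{y}_i \vert W^Tx_i, 1) &= -\frac{1}{2} \sum_i (\hat{y}_i - W^Tx_i)^2 + \text{const} \\
&= -\frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \text{const},
\end{align}
$$

so maximizing this log-likelihood in $W$ is exactly minimizing $\frac{1}{2} \Vert \hat{y} - y \Vert^2_2$, the vanilla Linear Regression objective above.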
So we see, doing MLE on a Gaussian likelihood is equal to Linear Regression!
But what if we want to go Bayesian, i.e. introduce a prior and work with the posterior instead? Well, then we are doing MAP estimation! The posterior is proportional to the likelihood times the prior:

$$
P(W \vert y, x) \propto P(y \vert x, W) \, P(W)
$$
Since we already know the likelihood, we now ask: what should the prior be? If we set it to be uniform, we will be back to MLE estimation. So, for a non-trivial example, let's use a Gaussian prior for the weight $W$:
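The concrete prior and the intermediate algebra are elided here; as a sketch under one illustrative parametrization (taking $W \sim N(0, \lambda^{-1}I)$ and keeping $\sigma^2 = 1$, so the constants match the Ridge objective above):

$$
-\log P(W \vert y, x) = \frac{1}{2} \Vert \hat{y} - y \Vert^2_2 + \frac{\lambda}{2} \Vert W \Vert^2_2 + \text{const}
$$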
That is, the (negative) log posterior of a Gaussian likelihood with a Gaussian prior is the same as the objective function for Ridge Regression! Hence, a Gaussian prior is equivalent to $\ell_2$ regularization!
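The post then turns to the fully Bayesian treatment: instead of a single point estimate, prediction for a new input averages the likelihood over the whole posterior. As a sketch:

$$
P(y' \vert x', y, x) = \int P(y' \vert x', W) \, P(W \vert y, x) \, dW = \mathbb{E}_{W \sim P(W \vert y, x)} \left[ P(y' \vert x', W) \right]
$$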
that is, for a new data point $(x', y')$, we compute its likelihood and weigh it by the posterior.
Intuitively, given all possible values of $W$ under the posterior, we try those values one by one to predict the new data point. The results are then averaged in proportion to the probability of those values; hence, we are taking an expectation.
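To make the averaging concrete, here is a toy NumPy sketch (purely illustrative and not from the post; the posterior samples and their weights are made-up stand-ins):

```python
import numpy as np

def predict_bayesian(x_new, W_samples, probs):
    # One prediction W^T x' per candidate W, then a posterior-weighted
    # average, i.e. an expectation of the prediction under the posterior.
    preds = W_samples @ x_new
    return np.sum(probs * preds)

# Hypothetical "posterior": 1000 candidate weight vectors, equally weighted
W_samples = np.random.randn(1000, 3)
probs = np.full(1000, 1.0 / 1000)
y_new = predict_bayesian(np.array([1.0, 2.0, 0.5]), W_samples, probs)
```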
And of course, that is the reason why we use a shortcut in the form of MAP. For illustration, if each component of $W$ were binary, i.e. had two possible values, and there were $K$ components in $W$, we would already be talking about $2^K$ possible assignments for $W$, which is exponential! In the real world, each component of $W$ is a real number, which makes enumerating all possible values of $W$ intractable!
Of course, we could use approximate methods like Variational Bayes or MCMC, but they are still more costly than MAP. As MAP and MLE are guaranteed to find one of the modes (a local maximum), it is good enough.
`src/content/post/gan-pytorch.mdx` (26 additions, 26 deletions)
Those two libraries are different from the existing libraries like TensorFlow and ...
Enter Pytorch. It is a Torch port for Python. The programming style of Pytorch is imperative, meaning that if we are already familiar with using Numpy to code our algorithms up, then jumping to Pytorch should be a breeze. One does not need to learn symbolic mathematical computation, as in TensorFlow and Theano.
With that being said, let's try Pytorch by implementing Generative Adversarial Networks (GAN).
Let's start by importing stuff:
```python
# ... (the imports and most of the setup are not shown in this excerpt)
h_dim = 128
lr = 1e-3
```
Now let's construct our Generative Network $G(z)$:
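The network definition itself is not shown in this excerpt. A plausible sketch of what it looks like (not the post's verbatim code; it reuses `xavier_init`, `z_dim`, `h_dim`, and `X_dim` from the post's setup, plus the `Wzh, bzh, Whx, bhx` names the next paragraph refers to):

```python
import torch
import torch.nn.functional as F
from torch.autograd import Variable

# Generator parameters: z -> hidden -> data space
Wzh = xavier_init(size=[z_dim, h_dim])
bzh = Variable(torch.zeros(h_dim), requires_grad=True)
Whx = xavier_init(size=[h_dim, X_dim])
bhx = Variable(torch.zeros(X_dim), requires_grad=True)

def G(z):
    # Single hidden layer with ReLU; sigmoid output so pixels land in [0, 1]
    h = F.relu(z.mm(Wzh) + bzh.repeat(z.size(0), 1))
    return torch.sigmoid(h.mm(Whx) + bhx.repeat(h.size(0), 1))
```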
It is awfully similar to the TensorFlow version; what is the difference, then? It is subtle without more hints, but basically those variables `Wzh, bzh, Whx, bhx` are real tensors/ndarrays, just like in Numpy. That means that if we evaluate one with `print(Wzh)`, the value is immediately shown. Also, the function `G(z)` is a real function, in the sense that if we input a tensor, we immediately get the return value back. Try doing those things in TensorFlow or Theano.
Next is the Discriminator Network $D(X)$:
```python
Wxh = xavier_init(size=[X_dim, h_dim])
# ... (the rest of D's parameters and the D(X) definition are not shown in this excerpt)
```

Elsewhere in the post, the Numpy data is wrapped into a `Variable` before being fed to the networks:

```python
X = Variable(torch.from_numpy(X))
```
So first, let's define the "forward-loss-backward-update" step for $D(X)$. First, the forward step:
```python
# D(X) forward and loss
G_sample = G(z)
# ... (the loss computation and the backward/update calls are cut off in this excerpt)
```
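A sketch of how that step plausibly continues, using the `D_solver` and `reset_grad` that appear below (illustrative, not the post's verbatim code; `mb_size` is an assumed minibatch-size variable from the setup):

```python
z = Variable(torch.randn(mb_size, z_dim))  # fresh noise for the generator

G_sample = G(z)
D_real = D(X)        # X: a minibatch already wrapped as a Variable (see above)
D_fake = D(G_sample)

# Discriminator objective: assign 1 to real data, 0 to generated samples
D_loss = -torch.mean(torch.log(D_real) + torch.log(1.0 - D_fake))

D_loss.backward()    # compute gradients of D_loss w.r.t. D's parameters
D_solver.step()      # Adam update on D's parameters
reset_grad()         # clear gradients so they don't leak into G's update
```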
Of course, we could code up our own optimizer, but Pytorch has built-in optimizers ...
As we have two different optimizers, we need to clear the computed gradients in our computational graph, as we do not need them anymore. This is also necessary so that gradients won't get mixed up in the subsequent call of `backward()`, as `D_solver` shares some subgraphs with `G_solver`.
```python
def reset_grad():
    for p in params:
        p.grad.data.zero_()
```
We do similar things to implement the "forward-loss-backward-update" for $G(z)$:
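That block is elided here as well; by symmetry with the discriminator step, a sketch of the generator's step (again illustrative rather than verbatim):

```python
z = Variable(torch.randn(mb_size, z_dim))

G_sample = G(z)
D_fake = D(G_sample)

# Generator objective: make D label its samples as real
G_loss = -torch.mean(torch.log(D_fake))

G_loss.backward()
G_solver.step()
reset_grad()
```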
But we might ask: why do all of those things matter? Why not just use TensorFlow ...
In contrast, in imperative computation, we could just use the `print()` function basically anywhere and anytime we want, and it will immediately display the value. Other "non-trivial" operations like loops and conditionals also become much easier in Pytorch, just like in good old Python. Hence, one could argue that this way of programming is more "natural".
The full code is available in my Github repo: https://github.com/wiseodd/generative-models.
`src/content/post/kl-mle.mdx` (9 additions, 8 deletions)
publishDate: 2017-01-26 03:53
tags: [machine learning, probability]
When reading Kevin Murphy's book, I came across this statement:
> ... maximizing likelihood is equivalent to minimizing $D_{KL}[P(. \vert \theta^*) \, \Vert \, P(. \vert \theta)]$, where $P(. \vert \theta^*)$ is the true distribution and $P(. \vert \theta)$ is our estimate ...
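So here is an attempt to prove that.

The expansion of the KL divergence is elided in this excerpt; written out (a sketch in the post's notation), it is

$$
D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] = \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta^*) \right] - \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right]
$$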
If it looks familiar, the left term is (up to sign) the entropy of $P(x \vert \theta^*)$. However, it does not depend on the estimated parameter $\theta$, so we will ignore it.
Suppose we sample $N$ of these $x \sim P(x \vert \theta^*)$. Then, the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) says that as $N$ goes to infinity:
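The displayed limit is elided here; as a sketch, it states that the sample average converges to the expectation:

$$
\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \vert \theta) \longrightarrow \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right]
$$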
which is the right term of the above KL-Divergence. Notice that:
$$
\begin{align}
D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)] &= c - \mathbb{E}_{x \sim P(x \vert \theta^*)} \left[ \log P(x \vert \theta) \right] \\
&\approx c - \frac{1}{N} \sum_{i=1}^{N} \log P(x_i \vert \theta) \\
&= c + \frac{1}{N} \, \text{NLL}
\end{align}
$$
where NLL is the negative log-likelihood and $c$ is a constant.
Then, if we minimize $D_{KL}[P(x \vert \theta^*) \, \Vert \, P(x \vert \theta)]$, it is equivalent to minimizing the NLL. In other words, it is equivalent to maximizing the log-likelihood.
Why does this matter, though? Because it gives MLE a nice interpretation: maximizing the likelihood of the data under our estimate is equivalent to minimizing the difference between our estimate and the real data distribution. We can see MLE as a proxy for fitting our estimate to the real distribution, which cannot be done directly, as the real distribution is unknown to us.