<h2>Part 2 – Implementing the UNet from scratch</h2>
<h4>Limitations on pure noise</h4>
Although the model is decent at removing noise from images, our goal is to generate digits from pure noise. This proves to be an issue because, with MSE loss, the model learns to predict the image that minimizes the sum of squared distances to the training images. To illustrate this issue, we will feed the model a pure noise sample <code>z</code> ~ 𝒩(0, 𝐈); because <code>z</code> contains no information about any training input <code>x</code>, the result is roughly an average of all digits in the training set.
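The following is a minimal sketch of this check, assuming the trained one-step denoiser is a PyTorch module named <code>unet</code> operating on 1×28×28 MNIST tensors (the names and shapes are assumptions for illustration, not part of the write-up above):
<pre><code>import torch

# Sketch: feed pure noise into the trained one-step denoiser.
# `unet` is assumed to be the trained Part 2 model; MNIST images are 1x28x28.
unet.eval()
with torch.no_grad():
    z = torch.randn(16, 1, 28, 28)   # pure noise, z ~ N(0, I)
    pred = unet(z)                   # input carries no information about any digit
# Each prediction ends up looking like a blurry "average digit" rather than a specific one.
</code></pre>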
As a result, while the training loss curve shows nothing suspicious:
<h4>The Flow Matching Model</h4>
Instead of trying to denoise the image in a single step, we aim to iteratively denoise the image, similar to how we do so in the sampling loops with DeepFloyd's noise coefficients. To do this, we will start by defining how intermediate noisy samples are constructed. The simplest approach is linear interpolation: let the intermediate sample be <code>x<sub>t</sub></code> = (1 - <code>t</code>)<code>x<sub>0</sub></code> + <code>t</code><code>x<sub>1</sub></code> for a given <code>t</code> ∈ [0, 1], where <code>x<sub>0</sub></code> is the noise and <code>x<sub>1</sub></code> is the clean image.<br>
<br>Now that we have an equation relating a clean image to any pure noise sample, we can train our model to learn the <strong>flow</strong>, i.e. the change of <code>x<sub>t</sub></code> with respect to <code>t</code>. This produces a vector field over images, where the velocity at each point is d/dt <code>x<sub>t</sub></code> = <code>x<sub>1</sub></code> - <code>x<sub>0</sub></code>. Therefore, if we can predict <code>x<sub>1</sub></code> - <code>x<sub>0</sub></code> for any given <code>t</code> and <code>x<sub>t</sub></code>, we can follow the path traced out by the vector field and arrive somewhere near the manifold of clean images. This technique is known as a <strong>flow matching model</strong>, and with the model trained, we can numerically integrate a random noise sample <code>x<sub>0</sub></code> over a set number of iterations using Euler's method and obtain a clean image <code>x<sub>1</sub></code>.
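To make the training pairs concrete, here is a minimal sketch of how an intermediate sample and its velocity target can be constructed in PyTorch (the batch shape <code>(B, 1, 28, 28)</code> and the helper name are assumptions for illustration):
<pre><code>import torch

def flow_matching_pair(x1):
    """Build one training tuple (x_t, t, target) for flow matching.

    x1: batch of clean images, shape (B, 1, 28, 28).
    """
    x0 = torch.randn_like(x1)                               # pure noise, x0 ~ N(0, I)
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)  # t ~ U([0, 1]), broadcastable
    xt = (1 - t) * x0 + t * x1                               # linear interpolation
    target = x1 - x0                                         # d/dt x_t, the velocity to predict
    return xt, t, target
</code></pre>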
<h4>Training a Time-Conditioned UNet</h4>
To add time conditioning to our UNet, we will make the following changes to our model architecture:
</div>
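The exact modifications are those listed above; purely as an illustration, one common way to inject the scalar <code>t</code> is a small fully connected block whose output is broadcast over a feature map. The sketch below is an assumption about what such a block could look like, not necessarily the architecture used here:
<pre><code>import torch
import torch.nn as nn

class FCBlock(nn.Module):
    """Illustrative sketch: embed the scalar t into a hidden_dim vector."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, t):
        # t: tensor of shape (B,) with values in [0, 1]
        return self.net(t.view(-1, 1))

# Inside the UNet's forward pass, the embedding could modulate a feature map, e.g.:
#   t_emb = self.t_block(t)                          # (B, hidden_dim)
#   feat  = feat + t_emb.view(-1, hidden_dim, 1, 1)  # broadcast over spatial dims
</code></pre>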
<h4>Flow Matching Hyperparameters</h4>
For the hyperparameters, we will be using a batch size of 64, the Adam optimizer with a learning rate of <code>1e-2</code>, a hidden dimension of 64, an exponential learning rate decay scheduler with γ = 0.1<sup>(1.0 / <code>num_epochs</code>)</sup>, a sampling iteration count of <code>T</code> = 50, and a training time of 10 epochs. To advance the scheduler, we will call <code>scheduler.step()</code> at the end of each training epoch.
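In code, this setup looks roughly as follows (a sketch; <code>unet</code> and <code>train_loader</code> are assumed to be defined elsewhere, and the loop body is elided):
<pre><code>import torch

num_epochs = 10
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-2)
# Decays the learning rate by a total factor of 0.1 over the whole run.
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.1 ** (1.0 / num_epochs)
)

for epoch in range(num_epochs):
    for x1, _ in train_loader:   # batch size 64; labels unused for the time-conditioned model
        ...                      # forward pass, loss, optimizer step
    scheduler.step()             # advance the scheduler once per epoch
</code></pre>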
<h4>Forward and Sampling Operations</h4>
To train our model, for each clean image <code>x<sub>1</sub></code> we will generate <code>x<sub>0</sub></code> ~ 𝒩(0, 𝐈) and <code>t</code> ~ U([0, 1]), where U is the uniform distribution. After computing <code>x<sub>t</sub></code> = (1 - <code>t</code>)<code>x<sub>0</sub></code> + <code>t</code><code>x<sub>1</sub></code>, we will feed <code>x<sub>t</sub></code> and <code>t</code> into our UNet and compute the loss between u<sub>θ</sub>(<code>x<sub>t</sub></code>, <code>t</code>) and <code>x<sub>1</sub></code> - <code>x<sub>0</sub></code>. Below is the new model's training loss curve:
When sampling from the model, we will simply generate a random <code>x<sub>0</sub></code> ~ 𝒩(0, 𝐈), and for every iteration <code>i</code> from 1 to <code>T</code>, we will update the sample via <code>x</code> = <code>x</code> + (1 / <code>T</code>)u<sub>θ</sub>(<code>x</code>, <code>t</code>), where <code>t</code> = <code>i</code> / <code>T</code> and <code>x</code> is initialized to <code>x<sub>0</sub></code>. The following are the results after the 1st, 5th, and 10th epochs:
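A minimal sketch of this sampling loop, assuming the trained time-conditioned UNet is a module <code>unet(x, t)</code> that takes an image batch and a batch of time values (the function name and signature are assumptions):
<pre><code>import torch

@torch.no_grad()
def sample(unet, num_samples=10, T=50):
    """Euler integration from pure noise (t = 0) toward clean images (t = 1)."""
    x = torch.randn(num_samples, 1, 28, 28)    # x_0 ~ N(0, I)
    for i in range(1, T + 1):
        t = torch.full((num_samples,), i / T)  # current time, t = i / T
        x = x + (1.0 / T) * unet(x, t)         # Euler step along the predicted flow
    return x                                   # approximately clean digits x_1
</code></pre>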