project-5/proj5.html: 31 additions & 7 deletions
@@ -930,7 +930,7 @@ <h3>Training an Unconditioned UNet</h3>
The most basic denoiser is a one-step denoiser. Formally, given a noisy image <code>z</code>, we aim to train a denoiser D<sub>θ</sub>(z) that can map it to a clean image <code>x</code>. To do this, we can minimize the L<sup>2</sup> loss E<sub>z,x</sub>||D<sub>θ</sub>(z) − x||<sup>2</sup> during training.<br>
- <br>To create a noisy image, we can use the process z = x + σε where σ ∈ [0, 1] and ε ~ 𝒩(0, 1). Here, 𝒩 is the standard normal distribution. To visualize the kind of images this process will result in below is an example of an MNIST digit with progressively more noise as σ gradually increases from 0 to 1:
+ <br>To create a noisy image, we can use the process z = x + σε where σ ∈ [0, 1] and ε ~ 𝒩(0, 𝐈). Here, 𝒩 is the standard normal distribution. To visualize the kind of images this process will result in, below is an example of an MNIST digit with progressively more noise as σ gradually increases from 0 to 1:
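A rough sketch of this noising process and the one-step training objective in PyTorch is shown below. The <code>unet</code> name and the <code>add_noise</code>/<code>one_step_loss</code> helpers are illustrative assumptions, not the project's actual code.
<pre><code>
import torch
import torch.nn.functional as F

def add_noise(x, sigma):
    # z = x + sigma * eps, with eps drawn from a standard normal
    eps = torch.randn_like(x)
    return x + sigma * eps

def one_step_loss(unet, x, sigma=0.5):
    # One-step denoising objective: L2 loss between D_theta(z) and the clean x.
    # `unet` stands in for the denoiser D_theta.
    z = add_noise(x, sigma)
    return F.mse_loss(unet(z), x)
</code></pre>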
<div class="image-row">
<figure>
@@ -974,7 +974,7 @@ <h3>Training an Unconditioned UNet</h3>
where <code>D</code> is the number of hidden dimensions.
<h4>Training hyperparameters</h4>
- For the hyperparameters, we will be using a batch size of 256, a learning rate of 1e-4, a hidden dimension of 128, the Adam optimizer with the given learning rate, and a training time of 5 epochs. A fixed noise level of σ = 0.5 will be used to noise the training images.
+ For the hyperparameters, we will be using a batch size of 256, a learning rate of <code>1e-4</code>, a hidden dimension of 128, the Adam optimizer with the given learning rate, and a training time of 5 epochs. A fixed noise level of σ = 0.5 will be used to noise the training images.
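A minimal training-loop sketch with these hyperparameters might look like the following, assuming an MNIST <code>train_loader</code> with batch size 256 and the <code>one_step_loss</code> helper sketched above; it is illustrative rather than the project's exact code.
<pre><code>
import torch

# Assumed setup: `unet` is the unconditioned UNet with hidden dimension 128,
# and `train_loader` yields MNIST batches of size 256.
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in train_loader:
        loss = one_step_loss(unet, x, sigma=0.5)  # fixed noise level sigma = 0.5
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code></pre>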
<h4>Evaluation results</h4>
After the model is trained, below is the training loss curve, where the loss of the model is plotted for every batch processed:
@@ -991,29 +991,53 @@ <h4>Evaluation results</h4>
</figure>
</div>
- We can see that the model performs decently well. To illustrate its effectiveness on images noised with different levels of σ below is the model after the 5th epoch denoising the same image with different levels of noise for σ ∈ [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
+ We can see that the model performs decently on different digits. To illustrate its effectiveness on images noised with different levels of σ, below is the model after the 5th epoch denoising the same image with different levels of noise for σ ∈ [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
Although the model works well for images with small amounts of noise, the more noise the image has, the lower the quality of the model's prediction.
<h4>Limitations on pure noise</h4>
- Although the model is decent at removing noise from images, our goal is to generate digits from pure noise. This proves to be an issue because with MSE loss, the model will learn to predict the image that minimizes the sum of its squared distance to all other training images. Because pure noise is the input to the model for any given training image, the result is an average of all digits in the training set. This is illustrated in the following inputs and the output of the model after the 1st and 5th epoch:
+ Although the model is decent at removing noise from images, our goal is to generate digits from pure noise. This proves to be an issue because with MSE loss, the model will learn to predict the image that minimizes the sum of squared distances to all training images. To illustrate this issue, we will feed the model a pure noise sample <code>z</code> ~ 𝒩(0, 𝐈) for all training inputs <code>x</code>, and because <code>z</code> contains no information about <code>x</code>, the result is an average of all digits in the training set.
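To see this failure mode directly, one can push pure noise through the trained denoiser, e.g. as in the short sketch below (assuming the trained <code>unet</code> from the earlier sketches).
<pre><code>
import torch

# Pure noise carries no information about any particular digit,
# so the MSE-optimal prediction collapses toward the dataset mean.
z = torch.randn(16, 1, 28, 28)  # pure noise samples, MNIST-shaped
with torch.no_grad():
    pred = unet(z)              # predictions look like an averaged digit
</code></pre>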
+ As a result, while the training loss curve does not show much of a difference:
To generate plausible-looking digits, we need a different approach than one-step denoising.
+ <h4>The Flow Matching Model</h4>
+ Instead of trying to denoise the image in a single step, we aim to iteratively denoise the image, similar to how we do so in the sampling loops using DeepFloyd's noise coefficients. To do this, we will start by defining how intermediate noisy samples are constructed. The simplest approach is to use linear interpolation, namely letting the intermediate sample be <code>x<sub>t</sub></code> = (1 - <code>t</code>)<code>x<sub>0</sub></code> + <code>t</code><code>x<sub>1</sub></code> for a given <code>t</code> ∈ [0, 1], where <code>x<sub>0</sub></code> is the noise and <code>x<sub>1</sub></code> is the clean image.<br>
+ <br>Now that we have an equation relating a clean image with any pure noise sample, we can train our model to learn the <strong>flow</strong>, or the change with respect to <code>t</code> for any given <code>x<sub>t</sub></code>. This produces a vector field over images, where the velocity at each point is d/dt <code>x<sub>t</sub></code> = <code>x<sub>1</sub></code> - <code>x<sub>0</sub></code>. Therefore, if we can predict <code>x<sub>1</sub></code> - <code>x<sub>0</sub></code> for any given <code>t</code> and <code>x<sub>t</sub></code>, we can follow the path traced out by the vector field and arrive somewhere near the manifold of clean images. This technique is known as a <strong>flow matching model</strong>, and with the model trained, we can numerically integrate a random noise sample <code>x<sub>0</sub></code> over a set number of iterations and get our clean image <code>x<sub>1</sub></code>.
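With a trained flow model, sampling amounts to simple (Euler) numerical integration of the predicted velocity. Below is a sketch of that loop; <code>flow_unet</code> is a hypothetical time-conditioned model predicting x<sub>1</sub> - x<sub>0</sub>, and how it accepts the time input is an assumption.
<pre><code>
import torch

def sample(flow_unet, num_steps=50, shape=(1, 1, 28, 28)):
    # Start from pure noise x_0 and follow the learned vector field.
    x = torch.randn(shape)
    dt = 1.0 / num_steps
    t = 0.0
    with torch.no_grad():
        for _ in range(num_steps):
            v = flow_unet(x, torch.full((shape[0],), t))  # predicted x_1 - x_0
            x = x + dt * v                                # Euler step along the flow
            t = t + dt
    return x  # approximately a clean image x_1
</code></pre>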
+ <h4>Training a Time-Conditioned UNet</h4>
+ To add time conditioning to our UNet, we will make the following changes to our model architecture:
For the hyperparameters, we will be using a batch size of 64, a learning rate of <code>1e-2</code>, a hidden dimension of 64, the Adam optimizer with the given learning rate, an exponential learning rate decay scheduler with γ = 0.1<sup>(1.0 / <code>num_epochs</code>)</sup>, and a training time of 10 epochs. To advance the scheduler, we will call <code>scheduler.step()</code> at the end of each training epoch.
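A sketch of this optimizer and scheduler setup is shown below, assuming a time-conditioned <code>flow_unet</code> and a batch-size-64 <code>train_loader</code>; it is illustrative, not the project's exact code.
<pre><code>
import torch

num_epochs = 10
optimizer = torch.optim.Adam(flow_unet.parameters(), lr=1e-2)
# Exponential decay with gamma = 0.1 ** (1.0 / num_epochs), stepped once per epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.1 ** (1.0 / num_epochs))

for epoch in range(num_epochs):
    for x, _ in train_loader:
        # per-batch flow-matching step (see the sketch in the next subsection)
        pass
    scheduler.step()  # advance the learning-rate schedule at the end of each epoch
</code></pre>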
+ <h4>Forward and Sampling Operations</h4>
+ To train our model, for each clean image <code>x<sub>1</sub></code> we will sample <code>x<sub>0</sub></code> ~ 𝒩(0, 𝐈) and <code>t</code> ~ U([0, 1]), where U is the uniform distribution. After computing <code>x<sub>t</sub></code> = (1 - <code>t</code>)<code>x<sub>0</sub></code> + <code>t</code><code>x<sub>1</sub></code>, we will feed <code>x<sub>t</sub></code> and <code>t</code> into our UNet and compute the loss between unet(<code>x<sub>t</sub></code>, <code>t</code>) and <code>x<sub>1</sub></code> - <code>x<sub>0</sub></code>. Below is its loss curve:
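A sketch of this per-batch step is shown below, under the same assumptions as the earlier sketches; in particular, passing <code>t</code> to <code>flow_unet</code> as a flat vector is an assumed interface.
<pre><code>
import torch
import torch.nn.functional as F

def flow_matching_loss(flow_unet, x1):
    # x1: a batch of clean images; x0: pure noise; t: uniform in [0, 1]
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1, 1, 1)
    xt = (1.0 - t) * x0 + t * x1       # linear interpolation between noise and image
    pred = flow_unet(xt, t.flatten())  # predicted flow at (xt, t)
    return F.mse_loss(pred, x1 - x0)   # target velocity is x1 - x0
</code></pre>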