We recommend using GPUs from Colab to finish this
project!
Overview
In part A you will play around with diffusion models, implement diffusion
sampling loops, and use them for other tasks such as inpainting and
creating optical illusions.
Instructions can be found below and in the provided
notebook.
Because part A is simply to get your feet wet with pre-trained diffusion
models, all deliverables should be completed in the notebook. You will
still submit a webpage with your results.
START EARLY!
This project, in many ways, will be the most
difficult project this semester.
Part 0: Setup
Gaining Access to DeepFloyd
We are going to use the DeepFloyd
IF diffusion model.
DeepFloyd is a two-stage model trained by Stability AI.
The first stage produces images of size $64 \times 64$ and the second
stage takes the outputs of the first stage and generates images of size
$256 \times 256$. We provide upsampling code at the very end of the
notebook, though this is not required in your submission.
Before using DeepFloyd, you must accept its usage conditions. To do so:
Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0.
Accepting the license on the stage I model card will auto accept for the
other IF models.
DeepFloyd was trained as a text-to-image model, which takes text prompts
as input and outputs images that are aligned with the text. Throughout
this notebook, you will see that we ask you to generate with the prompt
"a high quality photo". We want you to think of this as a "null" prompt
that doesn't have any specific meaning, and is simply a way for the model
to do unconditional generation. You can view this as using the diffusion
model to "force" a noisy image onto the "manifold" of real images.
In the later sections, we will guide this project with a more detailed
text prompt.
Downloading Precomputed Text Embeddings
Because the text encoder is very large, and barely fits on a free tier
Colab GPU, we have precomputed a couple of text embeddings for you to
try.
You can download the .pth file here.
This should hopefully save some headaches from GPU out of memory errors.
At the end of part A of the project, we provide you code if you want to
try your own text prompts. If you'd like, you can pay $10 for Colab Pro
and avoid needing to load the two models on different sessions.
In the notebook, we instantiate DeepFloyd's stage_1 and
stage_2 objects used for generation, as well as several
text
prompts for sample generation.
Deliverables
For the 3 text prompts that we provide, display the caption and the
output of the model. Reflect on the quality of the outputs and their
relationships to the text prompts. Make sure to try at least 2
different num_inference_steps values.
Report the random seed that you're using here. You should use the
same seed in all subsequent parts.
Begin Solution Part 0
IMPORTANT NOTE: When a label of "noise = " appears on any of the figures displayed, it refers to
the "i_start" value, the index of the timestep at which iterative denoising and its related functions
begin their denoising loop; a larger "i_start" corresponds to a less noisy input image.
DESCRIPTION: For each size category, the top row of images corresponds to 20 inference steps and the bottom row
corresponds to 30 inference steps. The differences are subtle but noticeable, such as the details on the
walls of the cabins or the facial features of the man.
DESCRIPTION: Below is the chosen seed value for all operations that involve randomness. The value is 150.
End Solution Part 0
Part 1: Sampling Loops
In this part of the problem set, you will write your own "sampling loops"
that use the pretrained DeepFloyd denoisers. These should produce high
quality images such as the ones generated above.
You will then modify these sampling loops to solve different tasks such
as inpainting or producing optical illusions.
Diffusion Models Primer
Starting with a clean image, $x_0$, we can iteratively add noise to an
image, obtaining progressively more and more noisy versions of the
image, $x_t$, until we're left with basically pure noise at timestep
$t = T$. When $t = 0$, we have a clean image, and for larger $t$ more noise
is in the image.
A diffusion model tries to reverse this process by denoising the image.
By giving a diffusion model a noisy $x_t$ and the timestep $t$, the
model predicts the noise in the image. With the predicted noise, we can
either completely remove the noise from the image, to obtain an estimate
of $x_0$, or we can remove just a portion of the noise, obtaining an
estimate of $x_{t-1}$, with slightly less noise.
To generate images from the diffusion model (sampling), we start with
pure noise at timestep $T$ sampled from a Gaussian distribution, which
we denote $x_T$. We can then predict and remove part of the noise,
giving us $x_{T-1}$. Repeating this process until we arrive at $x_0$
gives us a clean image.
For the DeepFloyd models, $T = 1000$.
The exact amount of noise added at each step is dictated by noise
coefficients, $\bar\alpha_t$, which were chosen by the people who
trained DeepFloyd.
1.1 Implementing the Forward Process
A key part of diffusion is the forward process, which takes a clean
image and adds noise to it. In this part, we will write a function to
implement this. The forward process is defined by:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t ;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\mathbf{I}\big) \tag{A.1}$$

which is equivalent to computing

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \tag{A.2}$$

That is, given a clean image $x_0$, we get a noisy image $x_t$ at
timestep $t$ by sampling from a Gaussian with mean $\sqrt{\bar\alpha_t}\, x_0$ and variance $(1 - \bar\alpha_t)$.
Note that the forward process is not just adding noise -- we also
scale
the image.
You will need to use the alphas_cumprod variable, which
contains the $\bar\alpha_t$ for all $t \in [0, 999]$.
Remember that $t = 0$ corresponds to a clean image, and larger $t$
corresponds to more noise.
Thus, $\bar\alpha_t$ is close to 1 for small $t$, and close to 0 for
large $t$. The test image of the Campanile can be downloaded here, which you should then
resize to 64x64.
Run the forward process on the test image with $t \in \{250, 500, 750\}$
and display the results. You should get progressively more noisy
images.
Deliverables
Implement the noisy_im = forward(im, t) function
Show the test image at noise level [250, 500, 750].
Hints
The torch.randn_like function is helpful for
computing $\epsilon$.
Use the alphas_cumprod variable, which contains an
array of the $\bar\alpha_t$ hyperparameters, with alphas_cumprod[t]
corresponding to $\bar\alpha_t$.
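Putting these hints together, forward could look roughly like the sketch below. This is a sketch of equation A.2, not the reference solution, and it assumes alphas_cumprod is the tensor provided in the notebook, indexed by integer timestep and on the same device as the image.

import torch

def forward(im, t):
    # Equation A.2: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    abar_t = alphas_cumprod[t]          # \bar\alpha_t for this timestep
    eps = torch.randn_like(im)          # eps ~ N(0, I)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps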
Berkeley Campanile
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Begin Solution Part 1.1, 1.2 and 1.3
DESCRIPTION: The noisy images are the output of the forward process at various t (timestep) values,
with t = 1000 resulting in a completely noisy image and t = 0 being the original image. Gaussian
blur at least allows the viewer to perceive the lower frequencies and get the general shape of the
image. One-step denoising is the most effective, although it relies on generation to fill in the
higher frequencies as the noise level increases, since data loss is unavoidable with noise.
End Solution Part 1.1, 1.2 and 1.3
1.2 Classical Denoising
Let's try to denoise these images using classical methods.
Again, take noisy images for timesteps [250, 500, 750], but use
Gaussian blur filtering to try to remove the noise.
Getting good results should be quite difficult, if not impossible.
Deliverables
For each of the 3 noisy test images from the previous part, show
your best Gaussian-denoised version side by side.
Hint:
torchvision.transforms.functional.gaussian_blur is
useful. Here is the documentation.
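For reference, the classical baseline can be a single call; the kernel size and sigma below are arbitrary starting points to tune, not prescribed values.

import torchvision.transforms.functional as TF

# Try several kernel sizes / sigmas and keep the best-looking result.
denoised = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)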
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
Gaussian Blur Denoising at t=250
Gaussian Blur Denoising at t=500
Gaussian Blur Denoising at t=750
1.3 One-Step Denoising
Now, we'll use a pretrained diffusion model to denoise. The actual
denoiser can be found at stage_1.unet.
This is a UNet that has already been trained on a very, very
large dataset of $(x_0, x_t)$ pairs of images.
We can use it to estimate the Gaussian noise in the image. Then, we can
remove this noise to recover (something close to) the original image.
Note: this UNet is conditioned on the amount of Gaussian noise by
taking the timestep $t$ as additional input.
Because this diffusion model was trained with text conditioning, we
also need a text prompt embedding. We provide the embedding for the
prompt "a high quality photo" for you to use. Later on, you can
use your own text prompts.
Deliverables
For the 3 noisy images from 1.2 (t = [250, 500, 750]):
Using the UNet, denoise the image by estimating the
noise.
Estimate the noise in the new noisy image, by passing it
through
stage_1.unet
Remove the noise from the noisy image to obtain an estimate of
the original image.
Visualize the original image, the noisy image, and the
estimate
of the original image
Hints
When removing the noise, you can't simply subtract the noise
estimate. Recall that in equation A.2 we need to scale the noise. Look
at equation A.2 to figure out how we predict $x_0$ from $x_t$ and
$\epsilon$.
You will probably have to wrangle tensors to the correct device
and
into the correct data types. The functions .to(device)
and .half() will be useful. The denoiser is loaded as
half precision (to save memory).
The signature for the unet is stage_1.unet(image, t,
encoder_hidden_states=prompt_embeds, return_dict=False).
You
need to pass in the noisy image, the timestep, and the prompt
embeddings. The return_dict argument just makes the
output nicer.
The unet will output a tensor of shape (1, 6, 64, 64). This is
because DeepFloyd was trained to predict the noise as well as the
variance of the noise. The first 3 channels are the noise estimate, which you
will use. The last 3 channels are the variance estimate, which you may
ignore.
To save GPU memory, you should wrap all of your code in a
with
torch.no_grad(): context. This tells torch not to do
automatic differentiation, and saves a considerable amount of
memory.
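Combining these hints, one-step denoising might look like the sketch below, assuming noisy_im, t, prompt_embeds, alphas_cumprod, and device are set up as in the notebook.

import torch

with torch.no_grad():
    # Predict the noise; the first 3 channels are the noise estimate.
    model_out = stage_1.unet(noisy_im.half().to(device), t,
                             encoder_hidden_states=prompt_embeds,
                             return_dict=False)[0]
    noise_est = model_out[:, :3]

    # Invert equation A.2 to estimate the clean image x_0.
    abar_t = alphas_cumprod[t]
    clean_est = (noisy_im - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()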
Noisy Campanile at t=250
Noisy Campanile at t=500
Noisy Campanile at t=750
One-Step Denoised Campanile at t=250
One-Step Denoised Campanile at t=500
One-Step Denoised Campanile at t=750
1.4 Iterative Denoising
In part 1.3, you should see that the denoising UNet does a much better
job of projecting the image onto the natural image manifold, but it
does
get worse as you add more noise. This makes sense, as the problem is
much harder with more noise!
But diffusion models are designed to denoise iteratively.
In this part we will implement this.
In theory, we could start with noise at timestep $T = 1000$,
denoise for one step to get an estimate of $x_{999}$, and carry on
until
we get $x_0$. But this would require running the diffusion model 1000
times, which is quite slow (and costs $$$).
It turns out, we can actually speed things up by skipping steps. The
rationale for why this is possible is due to a connection with
differential equations. It's a tad complicated, and not within scope
for
this course, but if you're interested you can check out this excellent
article.
To skip steps we can create a new list of timesteps that we'll call
strided_timesteps, which does just this.
strided_timesteps[0] will
correspond to the noisiest image (and thus the largest $t$) and
strided_timesteps[-1] will correspond to a clean image.
One
simple way of constructing this list is by introducing a regular
stride
step (e.g. stride of 30 works well).
On the ith denoising step we are at strided_timesteps[i], and want to get to strided_timesteps[i+1] (from more noisy to less noisy).
To
actually do this, we have the following formula:

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma$$

where:
$x_t$ is your image at timestep $t$
$x_{t'}$ is your noisy image at timestep $t'$ where $t' < t$ (less
noisy)
$\bar\alpha_t$ is defined by alphas_cumprod, as
explained above.
$\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$
$\beta_t = 1 - \alpha_t$
$x_0$ is our current estimate of the clean image using equation A.2
just like in section 1.3
The $v_\sigma$ is random noise, which in the case of DeepFloyd is also
predicted.
The process to compute this is not very important, so we supply a
function, add_variance, to do this for you.
You can think of this as a linear interpolation between the signal and
noise.
See equations 6 and 7 of the DDPM paper for more
information. Be careful about bars above the alpha! Some have them and some do not.
First, create the list strided_timesteps. You should
start at timestep 990 and take steps of size 30 until you arrive at
0. After completing the problem set, feel free to try different
"schedules" of timesteps.
Also implement the function iterative_denoise(image,
i_start), which takes a noisy image image, as well
as a starting index i_start. The function should denoise
the image starting at timestep strided_timesteps[i_start], applying
the above formula to obtain an image at timestep t' =
strided_timesteps[i_start + 1], and repeat iteratively until we arrive at
a clean image.
Add noise to the test image im to timestep
strided_timesteps[10] and display this image. Then run the
iterative_denoise function on the noisy image, with
i_start = 10, to obtain a clean image and display it. Please
display every 5th image of the denoising loop. Compare this to the
"one-step" denoising method from the previous section, and to gaussian
blurring.
Deliverables
Using i_start = 10:
Create strided_timesteps: a list of monotonically
decreasing timesteps, starting at 990, with a stride of 30, eventually
reaching 0. Also initialize the timesteps using the function
stage_1.scheduler.set_timesteps(timesteps=strided_timesteps)
Show the noisy image every 5th loop of denoising (it should
gradually
become less noisy)
Show the final predicted clean image, using iterative denoising
Show the predicted clean image using only a single denoising step,
as
was done in the previous part. This should look much worse.
Show the predicted clean image using gaussian blurring, as was done
in
part 1.2.
Complete the iterative_denoise function
Hints
Remember, the unet will output a tensor of shape (1, 6, 64, 64).
This is because DeepFloyd was trained to predict the noise as well as the
variance of the noise. The first 3 channels are the noise estimate,
which you will use here.
The last 3 channels are the variance estimate, which you will pass to
the add_variance function.
Read the documentation for the add_variance function to
figure out how to use it to add the predicted variance to the image.
Depending on if your final images are torch tensors or numpy arrays,
you may need to modify the `show_images` call a bit.
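One possible shape for iterative_denoise is sketched below. The add_variance call signature here is an assumption (check the provided docstring), and strided_timesteps, stage_1, prompt_embeds, alphas_cumprod, and device are assumed to exist as in the notebook.

import torch

def iterative_denoise(image, i_start):
    x_t = image.half().to(device)
    with torch.no_grad():
        for i in range(i_start, len(strided_timesteps) - 1):
            t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
            out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds,
                               return_dict=False)[0]
            noise_est, var_est = out[:, :3], out[:, 3:]

            abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
            alpha_t = abar_t / abar_prev
            beta_t = 1 - alpha_t

            # Current clean-image estimate via equation A.2.
            x0_est = (x_t - (1 - abar_t).sqrt() * noise_est) / abar_t.sqrt()

            # Posterior mean from the formula above (DDPM eqs. 6 and 7),
            # then add the predicted variance with the provided helper.
            mean = (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_est \
                 + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t
            x_t = add_variance(mean, var_est, t)  # signature is an assumption
            x_t = x_t.half()                      # keep dtypes consistent for the UNet
    return x_t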
Noisy Campanile at t=90
Noisy Campanile at t=240
Noisy Campanile at t=390
Noisy Campanile at t=540
Noisy Campanile at t=690
Original
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blurred Campanile
Begin Solution Part 1.4
DESCRIPTION: The input image is fed into the various denoising functions at a noise level of 10 after being
run through the forward process. The outputs for every fifth loop are displayed and show the gradual
denoising process. Overall, the output shape is formed pretty early in the series of loops, and it is
the higher frequencies that are never fully recovered. When compared, the output of iterative denoising is
"cleaner" than the output of one-step denoising, but it lacks the higher frequencies, which have at least
some presence in the one-step output, such as the leaf density of the trees behind the Campanile. The Gaussian blur
is effective in revealing the lower frequencies and gauging the overall shape of the image, but not much else.
End Solution Part 1.4
1.5 Diffusion Model Sampling
In part 1.4, we use the diffusion model to denoise an image. Another
thing
we can do with the iterative_denoise function is to
generate
images from scratch. We can do this by setting i_start = 0
and passing in random noise. This effectively denoises pure noise.
Please
do this, and show 5 results of "a high quality photo".
Deliverables
Show 5 sampled images.
Hints
Use torch.randn to make the noise.
Make sure you move the tensor to the correct device and correct data
type by calling .half() and
.to(device).
The quality of the images will not be spectacular, but should be
reasonable images.
We will fix this in the next section with CFG.
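Sampling from scratch is then only a couple of lines; a sketch using the iterative_denoise function from part 1.4:

import torch

for _ in range(5):
    noise = torch.randn(1, 3, 64, 64).half().to(device)   # pure noise
    sample = iterative_denoise(noise, i_start=0)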
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Begin Solution Part 1.5
DESCRIPTION: The iterative denoise function essentially eliminates the random, static-like noise from
the random input images. However, the produced images do not have defined features, especially in the
higher frequencies where fine-grained detail would be found, since the randomness the images were generated
from offers no guidance, and, with only a generic prompt conditioning the model, outputs are still largely noisy.
End Solution Part 1.5
1.6 Classifier-Free Guidance (CFG)
You may have noticed that the generated images in the prior section are
not very good, and some are completely nonsensical.
In order to greatly improve image quality (at the expense of image
diversity), we can use a technique called Classifier-Free Guidance.
In CFG, we compute both a conditional and an unconditional noise
estimate. We denote these $\epsilon_c$ and $\epsilon_u$.
Then, we let our new noise estimate be:

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$$

where $\gamma$ controls the strength of CFG. Notice that for $\gamma = 0$,
we get an unconditional noise estimate, and for $\gamma = 1$ we get the
conditional noise estimate.
The magic happens when $\gamma > 1$. In this case, we get much higher
quality images. Why this happens is still up to vigorous debate.
For more information on CFG, you can check out this blog post.
Please implement the iterative_denoise_cfg function,
identical to the iterative_denoise function but using
classifier-free guidance.
To get an unconditional noise estimate, we can just pass an empty prompt
embedding to the diffusion model (the model was trained to predict an
unconditional noise estimate when given an empty text prompt).
Disclaimer
Before, we used "a high quality photo" as a "null"
condition.
Now, we will use the actual "" null prompt for
unconditional
guidance for CFG. In the later part, you should always use
""
null prompt for unconditional guidance and use "a high quality
photo" for unconditional generation.
Deliverables
Implement the iterative_denoise_cfg function
Show 5 images of "a high quality photo" with a CFG
scale of $\gamma = 7$.
Hints
You will need to run the UNet twice, once for the conditional prompt
embedding, and once for the unconditional
The UNet will predict both a conditional and an unconditional
variance. Just use the conditional variance with the
add_variance function.
The resulting images should be much better than those in the prior
section.
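Relative to iterative_denoise, the core change inside the loop might look like the sketch below; cond_embeds and uncond_embeds are my names for the "a high quality photo" and "" embeddings, and gamma = 7.

with torch.no_grad():
    cond_out = stage_1.unet(x_t, t, encoder_hidden_states=cond_embeds,
                            return_dict=False)[0]
    uncond_out = stage_1.unet(x_t, t, encoder_hidden_states=uncond_embeds,
                              return_dict=False)[0]
    eps_c, var_c = cond_out[:, :3], cond_out[:, 3:]
    eps_u = uncond_out[:, :3]
    # Classifier-free guidance: eps = eps_u + gamma * (eps_c - eps_u)
    eps = eps_u + gamma * (eps_c - eps_u)
    # ...then proceed exactly as in iterative_denoise, using eps as the
    # noise estimate and var_c (the conditional variance) with add_variance.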
Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG
Begin Solution Part 1.6
DESCRIPTION: By using classifier-free guidance, a null prompt, and a scale of 7, the output images now
resemble photographs instead of the noisy, grainy, texture-like outputs of the previous function. The
types of images shown also reflect the kind of training data the model has seen: high resolution
photographs usually happen to be taken of something noteworthy as opposed to something mundane.
End Solution Part 1.6
1.7 Image-to-image Translation
In part 1.4, we take a real image, add noise to it, and then denoise.
This
effectively allows us to make edits to existing images. The more noise
we
add, the larger the edit will be. This works because in order to denoise
an image, the diffusion model must to some extent "hallucinate" new
things
-- the model has to be "creative." Another way to think about it is that
the denoising process "forces" a noisy image back onto the manifold of
natural images.
Here, we're going to take the original test image, noise it a little,
and
force it back onto the image manifold without any conditioning.
Effectively, we're going to get an image that is similar to the test
image
(with a low-enough noise level). This follows the SDEdit algorithm.
To start, please run the forward process to get a noisy test image, and
then run the iterative_denoise_cfg function using a
starting
index of [1, 3, 5, 7, 10, 20] steps and show the results, labeled with
the
starting index. You should see a series of "edits" to the original
image,
gradually matching the original image closer and closer.
Deliverables
Edits of the test image, using the given prompt at noise levels [1,
3, 5, 7, 10, 20] with text prompt
"a high quality photo"
Edits of 2 of your own test images, using the same procedure.
Hints
You should have a range of images, gradually looking more like the
original image
Note: You should use CFG from this point forward.
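The whole experiment then reduces to a short loop; a sketch, where forward and iterative_denoise_cfg are the functions from the earlier parts:

for i_start in [1, 3, 5, 7, 10, 20]:
    noisy = forward(test_im, strided_timesteps[i_start])
    edit = iterative_denoise_cfg(noisy, i_start)
    # display `edit`, labeled with i_start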
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Campanile
Begin Solution Part 1.7
DESCRIPTION: The forward process is used to generate various levels of noise on a given input image;
then, the noisy image is run through the denoiser to see how much of the original image can be recovered
by the multi-step denoising algorithm at different stages. At the highest noise levels (smallest i_start),
the output image bears little resemblance to the clean, unaltered input image.
End Solution Part 1.7
1.7.1 Editing Hand-Drawn and Web Images
This procedure works particularly well if we start with a nonrealistic
image (e.g. painting, a sketch, some scribbles) and project it onto the
natural image manifold.
Please experiment by starting with hand-drawn or other non-realistic
images and see how you can get them onto the natural image manifold in
fun ways.
We provide you with 2 ways to provide inputs to the model:
Download images from the web
Draw your own images
Please find an image from the internet and apply edits exactly as above.
And also draw your own images, and apply edits exactly as above. Feel
free to copy the prior cell here. For drawing inspiration, you can check
out the examples on this
project page.
Deliverables
1 image from the web of your choice, edited using the above method
for noise levels [1, 3, 5, 7, 10, 20] (and whatever additional noise
levels you want)
2 hand drawn images, edited using the above method for noise
levels [1, 3, 5, 7, 10, 20] (and whatever additional noise levels
you want)
Hints
We provide you with preprocessing code to convert web images to the format expected by DeepFloyd
Unfortunately, the drawing interface is hardcoded to be 300x600
pixels, but we need a square image. The code will center crop, so
just draw in the middle of the canvas.
Bear at i_start=1
Bear at i_start=3
Bear at i_start=5
Bear at i_start=7
Bear at i_start=10
Bear at i_start=20
Bear
House at i_start=1
House at i_start=3
House at i_start=5
House at i_start=7
House at i_start=10
House at i_start=20
Original House Sketch
Begin Solution Part 1.7.1
DESCRIPTION: As an experiment, I took an input image from the web that contained only fine details and, as
expected, it did not take many timesteps to completely overwrite the content in the white square due to its
fine nature, which meant the information the image conveyed (people, house, background) quickly disappeared.
DESCRIPTION: For my first drawing, I attempted a sailboat to see if the denoising algorithm was able to recover
it, but there seems to be a bias towards generating images of humans from noise.
DESCRIPTION: This time, I attempted to draw random movements with my cursor to see if there was a bias toward
generating images of humans more than other objects, and the denoising algorithm interpreted the drawing as
a person viewed at a distance, with some resemblance in shape at noise level 10.
End Solution Part 1.7.1
1.7.2 Inpainting
We can use the same procedure to implement inpainting (following the RePaint paper). That is,
given an image $x_{orig}$, and a binary mask $\mathbf{m}$, we can create a
new image that has the same content where $\mathbf{m}$ is 0, but new content
wherever $\mathbf{m}$ is 1.
To do this, we can run the diffusion denoising loop. But at every step,
after obtaining $x_t$, we "force" $x_t$ to have the same pixels as
$x_{orig}$ where $\mathbf{m}$ is 0, i.e.:

$$x_t \leftarrow \mathbf{m}\, x_t + (1 - \mathbf{m})\, \text{forward}(x_{orig}, t)$$

Essentially, we leave everything inside the edit mask alone, but we
replace everything outside the edit mask with our original image -- with
the correct amount of noise added for timestep $t$.
Please implement this below, and edit the picture to inpaint the top of
the Campanile.
Deliverables
A properly implemented inpaint function
The test image inpainted (feel free to use your own mask)
2 of your own images edited (come up with your own mask)
look at the results from this
paper for inspiration
Hints
Reuse the forward function you implemented earlier to
implement inpainting
Because we are using the diffusion model for tasks it was not
trained for, you may have to run the sampling process a few times
before you get a nice result.
You can copy and paste your iterative_denoise_cfg function. To get
inpainting to work should only require (roughly) 1-2 additional lines
and a few small changes.
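Concretely, the extra step inside the CFG denoising loop can be a single line like the one below, assuming mask is a binary tensor broadcastable against the image (1 inside the region to fill, 0 elsewhere):

# After computing x_t for the current timestep t:
x_t = mask * x_t + (1 - mask) * forward(orig_im, t)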
Campanile
Mask
Hole to Fill
Campanile Inpainted
Begin Solution Part 1.7.2
DESCRIPTION: Shown below are the three input images and their corresponding masks. It is the job of
the denoiser to fill the white space on each mask while the part of the image corresponding to the
dark part of the mask will be cut in from the unaltered input image at each step of the denoising
process.
DESCRIPTION: For my two input images, I chose masks corresponding to their level of detail. For the image
of Yosemite, I chose to exclude more of the image since the landscape seemed pretty predictable for a
generative denoiser. For the image of the grizzly bear, I chose to exclude a small but detailed section
of the image. For both images, there are instances where mask boundaries are clearly visible, which is not
the case for the image of the Campanile due to the small gap and the low-detail region of the mask.
End Solution Part 1.7.2
1.7.3 Text-Conditional Image-to-image Translation
Now, we will do the same thing as the previous section, but guide the
projection with a text prompt. This is no longer pure
"projection to the natural image manifold" but also adds control using
language. This is simply a matter of changing the prompt from
"a high quality photo" to any of the precomputed prompts we
provide you (if you want to use your own prompts, see appendix).
Deliverables
Edits of the test image, using the given prompt at noise levels [1,
3, 5, 7, 10, 20]
Edits of 2 of your own test images, using the same procedure
Hints
The images should gradually look more like the original image, but also
look like the text prompt.
Rocket Ship at noise level 1
Rocket Ship at noise level 3
Rocket Ship at noise level 5
Rocket Ship at noise level 7
Rocket Ship at noise level 10
Rocket Ship at noise level 20
Campanile
Begin Solution Part 1.7.3
DESCRIPTION: The results use the classifier-free guidance denoiser, with the inputs
being the Campanile, Yosemite, and a grizzly bear at various noise levels. Since I chose to keep the
same "rocket ship" prompt for all input images, there were some odd and unexpected results at lower noise levels.
End Solution Part 1.7.3
1.8 Visual Anagrams
In this part, we are finally ready to implement Visual
Anagrams and create optical illusions with diffusion models. In
this part, we will create an image that looks like "an oil
painting of people around a campfire", but when flipped upside
down will reveal "an oil painting of an old man".
To do this, we will denoise an image $x_t$ at step $t$ normally with the
prompt
"an oil painting of an old man", to obtain noise estimate
$\epsilon_1$. But at the same time, we will flip $x_t$ upside down, and
denoise with the prompt
"an oil painting of people around a campfire", to get noise
estimate $\epsilon_2$. We can flip $\epsilon_2$ back, to make it
right-side up, and average the two noise estimates. We can then perform a
reverse/denoising diffusion step with the averaged noise estimate.
The full algorithm will be:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big)$$
$$\epsilon = (\epsilon_1 + \epsilon_2) / 2$$

where UNet is the diffusion model UNet from before, $\text{flip}(\cdot)$
is a function that flips the image, and $p_1$ and $p_2$ are two different
text prompt embeddings. And our final noise estimate is $\epsilon$. Please
implement the above algorithm and show an example of an illusion.
Deliverables
Correctly implemented visual_anagrams function
A visual anagram where on one orientation
"an oil painting of people around a campfire" is displayed
and, when flipped, "an oil painting of an old man" is
displayed.
2 more illusions of your choice that change appearance when you flip
it upside down.
Hints
- You may have to run multiple times to get a really good result for
the same reasons as above.
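The heart of visual_anagrams might look like the following sketch; embeds_old_man and embeds_campfire are my names for the two prompt embeddings, and CFG can be applied to each estimate just as before.

import torch

with torch.no_grad():
    out1 = stage_1.unet(x_t, t, encoder_hidden_states=embeds_old_man,
                        return_dict=False)[0]
    out2 = stage_1.unet(torch.flip(x_t, dims=[2]), t,
                        encoder_hidden_states=embeds_campfire,
                        return_dict=False)[0]
    eps1 = out1[:, :3]
    eps2 = torch.flip(out2[:, :3], dims=[2])  # flip the estimate back upright
    eps = (eps1 + eps2) / 2                   # averaged noise estimate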
An Oil Painting of an Old Man
An Oil Painting of People around a Campfire
Begin Solution Part 1.8
DESCRIPTION: To implement this function, the number of calls to UNet was doubled in comparison
to earlier functions to get two noise estimates and variance values. Ultimately, the variance was
ignored rather than averaged, since averaging had no measurable effect on the results. The second noise estimate
was flipped back upright after being computed from the flipped x_t value. The given prompts generate
clean, reliable results, while my prompts required several attempts to get a successful result. As shown,
not all results were successful.
End Solution Part 1.8
1.9 Hybrid Images
In this part we'll implement Factorized
Diffusion and create hybrid images just like in project 2.
In order to create hybrid images with a diffusion model we can use a
similar technique as above. We will create a composite noise estimate
$\epsilon$, by estimating the noise with two different text prompts, and
then combining low frequencies from one noise estimate with high
frequencies of the other. The algorithm is:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$
$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$

where UNet is the diffusion model UNet, $f_{\text{lowpass}}$ is a low pass
function, $f_{\text{highpass}}$ is a high pass function, and $p_1$ and $p_2$
are two different text prompt embeddings. Our final noise estimate is
$\epsilon$. Please show an example of a hybrid image using this technique
(you may have to run multiple times to get a really good result for the
same reasons as above). We recommend that you use a gaussian blur of
kernel size 33 and sigma 2.
Deliverables
Correctly implemented make_hybrids function
An image that looks like a skull from far away but a
waterfall from close up
2 more hybrid images of your choosing.
Hints
use torchvision.transforms.functional.gaussian_blur. The
documentation can be found here.
You may have to run multiple times to get a really good result for the
same reasons as above
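A sketch of the composite noise estimate inside make_hybrids; embeds_skull and embeds_waterfall are my names for the two prompt embeddings.

import torchvision.transforms.functional as TF

with torch.no_grad():
    eps1 = stage_1.unet(x_t, t, encoder_hidden_states=embeds_skull,
                        return_dict=False)[0][:, :3]
    eps2 = stage_1.unet(x_t, t, encoder_hidden_states=embeds_waterfall,
                        return_dict=False)[0][:, :3]
    low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2)          # low-pass
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2)  # high-pass
    eps = low + high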
Hybrid image of a skull and a waterfall
Begin Solution Part 1.9
DESCRIPTION: For these results, the recommended Gaussian kernel size and sigma value were used.
To observe the effects described for each image, stand close to see the detailed features and stand
far away to see the low-frequency features. The third result differs from the first two because the
"face" of the image is in the high frequencies, in contrast to the "face" being present in the low
frequencies of the earlier images. This is because, while both Trump and a cheeseburger have similar color
schemes and palettes, only the cheeseburger is recognizable in the low frequencies (at a distance).
End Solution Part 1.9
Part 2: Bells & Whistles
Up to 0.5 cookie points from Part A and part B each, for a
maximum of 1 total cookie point.
Create something cool with what you learned in this project (0.1
Cookie Points)
Design a course logo (0.2 Cookie Points)
Your own ideas (N cookie points): Be creative!
Using your own Prompts and Upsampling Generations
We provide you with code in the notebook to use your own prompts and upsample your generations!
We recommend using GPUs from Colab to finish this
project!
Overview
In part B you will train your own diffusion model on MNIST. Starter code can
be found in the provided
notebook.
START EARLY!
This project, in many ways, will be the most
difficult project this semester.
Note: this is an updated, clearer version of the part B instructions. For the old version, please see here.
Part 1: Training a Single-Step Denoising UNet
Let's warm up by building a simple one-step denoiser. Given a noisy image
$z$, we
aim to train a denoiser $D_\theta$ such that it maps $z$ to a clean
image $x$. To do so, we can optimize over an L2 loss:

$$L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2 \tag{B.1}$$
1.1 Implementing the UNet
In this project, we implement the denoiser as a UNet. It consists of a
few downsampling and upsampling blocks with skip connections.
Figure 1: Unconditional UNet
The diagram above uses a number of standard tensor operations defined as
follows:
Figure 2: Standard UNet Operations
where:
Conv2d(kernel_size, stride, padding) is
nn.Conv2d()
BN is nn.BatchNorm2d()
GELU is nn.GELU()
ConvTranspose2d(kernel_size, stride, padding) is
nn.ConvTranspose2d()
AvgPool(kernel_size) is
nn.AvgPool2d()
At a high level, the blocks do the following:
(1) Conv is a convolutional layer that doesn't
change the image resolution, only the channel dimension.
(2) DownConv is a convolutional layer that
downsamples the tensor by 2.
(3) UpConv is a convolutional layer that upsamples
the tensor by 2.
(4) Flatten is an average pooling layer that
flattens a 7x7 tensor into a 1x1 tensor. 7 is the resulting height and
width after the downsampling operations.
(5) Unflatten is a convolutional layer that
unflattens/upsamples a 1x1 tensor into a 7x7 tensor.
(6) Concat is a channel-wise concatenation between
tensors with the same 2D shape. This is simply
torch.cat.
D is the number of hidden channels and is a
hyperparameter that we will set ourselves.
We define composed operations using our simple operations in order to
make our network deeper. This doesn't change the tensor's height, width,
or number of channels, but simply adds more learnable parameters.
(7) ConvBlock is similar to Conv
but includes an additional Conv. Note that it has
the same input and output shape as (1) Conv.
(8) DownBlock is similar to
DownConv but includes an additional
ConvBlock. Note that it has the same input and
output shape as (2) DownConv.
(9) UpBlock is similar to UpConv
but includes an additional ConvBlock. Note that it
has the same input and output shape as (3)
UpConv.
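As a concrete reference, the simple and composed ops can be built from nn modules. The sketch below assumes a 3x3, stride 1, padding 1 convolution for Conv (consistent with "doesn't change the image resolution"); check the exact parameters against Figure 2.

import torch.nn as nn

class Conv(nn.Module):
    # (1) Conv: keeps H and W, changes only the channel dimension.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

class ConvBlock(nn.Module):
    # (7) ConvBlock: same input/output shape as Conv, just deeper.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(Conv(in_ch, out_ch), Conv(out_ch, out_ch))
    def forward(self, x):
        return self.net(x)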
1.2 Using the UNet to Train a Denoiser
Recall from equation B.1 that we aim to solve the following denoising
problem:
Given a noisy image $z$, we
aim to train a denoiser $D_\theta$ such that it maps $z$ to a clean
image $x$. To do so, we can optimize over an L2 loss.
To train our denoiser, we need to generate training data pairs of ($z$,
$x$), where each $x$ is a clean MNIST digit. For each training batch, we
can generate $z$ from $x$ using the following noising process:

$$z = x + \sigma\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{B.2}$$

Visualize the different noising processes over a range of $\sigma$ values, assuming normalized $x \in [0, 1]$.
It should be similar to the following plot:
Figure 3: Varying levels of noise on MNIST digits
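The noising step itself is one line; a minimal sketch (add_noise is my own helper name), assuming x is a batch of normalized MNIST digits:

import torch

def add_noise(x, sigma):
    # Equation B.2: z = x + sigma * eps, with eps ~ N(0, I)
    return x + sigma * torch.randn_like(x)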
1.2.1 Training
Now, we will train the model to perform denoising.
Objective: Train a denoiser to denoise a noisy image $z$ with
$\sigma = 0.5$ applied to a clean image $x$.
Dataset and dataloader: Use the MNIST dataset via
torchvision.datasets.MNIST with flags to access training
and test sets. Train only on the training set. Shuffle the dataset
before creating the dataloader. Recommended batch size: 256. We'll
train over our dataset for 5 epochs.
You should only noise the image batches when fetched from the
dataloader so that in every epoch the network will see new noised
images, improving generalization.
Model: Use the UNet architecture defined in section 1.1 with
recommended hidden dimension D = 128.
Optimizer: Use Adam optimizer with learning rate of
1e-4.
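Putting the pieces together, the training loop could look roughly like the sketch below; UnconditionalUNet is a placeholder name for the model from section 1.1.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UnconditionalUNet(hidden_dim=128).to(device)  # placeholder constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

sigma = 0.5
for epoch in range(5):
    for x, _ in loader:
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)   # noise freshly on each fetch
        loss = F.mse_loss(model(z), x)
        opt.zero_grad()
        loss.backward()
        opt.step()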
Figure 4: Training Loss Curve
You should visualize denoised results on the test set at the end of
training. Display sample results after the 1st and 5th epoch.
They should look something like these:
Figure 5: Results on digits from the test set after 1
epoch of training
Figure 6: Results on digits from the test set after 5
epochs of training
1.2.2 Out-of-Distribution Testing
Our denoiser was trained on MNIST digits noised with $\sigma = 0.5$. Let's
see how the denoiser performs on different $\sigma$'s that it wasn't
trained for.
Visualize the denoiser results on test set digits with varying levels of
noise $\sigma$.
Figure 7: Results on digits from the test set with varying
noise levels.
Deliverables
A visualization of the noising process using a range of $\sigma$ values. (figure 3)
A training loss curve plot every few iterations during the whole
training process (figure 4).
Sample results on the test set after the first and the 5-th epoch
(staff solution takes ~7 minutes for 5 epochs on a Colab T4 GPU).
(figure 5, 6)
Sample results on the test set with out-of-distribution noise levels
after the model is trained. Keep the same image and
vary $\sigma$. (figure 7)
Hint
Since training can take a while, we strongly recommend that you
checkpoint your model every epoch onto your personal Google
Drive.
This is because Colab notebooks aren't persistent, so if you are
idle for a while, you will lose connection and your training progress.
This consists of:
Google Drive mounting.
Epoch-wise model & optimizer checkpointing.
Model & optimizer resuming from checkpoints.
Begin Solution Part 1
IMPORTANT NOTE: This function will take a few minutes to run, so it's best to find its stored output in the notebook (.ipynb)
to save time.
DESCRIPTION: Below is the noising process for the 28 x 28 images. This is similar to the forward process used in part A
and is what will be fed into the inputs for denoising. Shown are various levels of sigma.
DESCRIPTION: Below is an example of the denoising process executed after one epoch of training. This epoch is
a great step towards a full denoiser, but there is still some noise present.
DESCRIPTION: Below is an example of the denoising process executed after five epochs of training. This shows
minor changes over the previous epoch but is nonetheless noticeably clearer, with more epochs of training and a
loss that approaches zero.
DESCRIPTION: Below is the training loss curve for the 5 epochs used for the unconditional denoising function.
The loss quickly drops and begins approaching zero very early in the process, as reflected by the earlier images,
which showed a sudden increase in performance at the start of training, with gradual improvements afterwards.
DESCRIPTION: Below, the fully trained denoiser is fed inputs with various levels of sigma to test its ability to denoise images
and recover meaningful information. For all images, the denoised outputs have less contrast and clarity, but artifacts do not
appear until very high noise levels, confirming the success indicated by the loss curve.
End Solution Part 1
Part 2: Training a Diffusion Model
Now, we are ready for diffusion, where we will train a UNet model that can iteratively denoise an image.
We will implement DDPM in
this part.
Let's revisit the problem we solved in equation B.1:

$$L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$$

We will first introduce one small difference: we can change our UNet to predict
the added noise $\epsilon$ instead of the clean image $x$ (like in part A
of the project).
Mathematically, these are equivalent since $z = x + \sigma\epsilon$ (equation B.2).
Therefore, we can turn equation B.1 into the following:

$$L = \mathbb{E}_{\epsilon,z}\,\|\epsilon_\theta(z) - \epsilon\|^2$$

where $\epsilon_\theta$ is a UNet trained to predict noise.
For diffusion, we eventually want to sample a pure noise image and generate a realistic image from the noise.
However, we saw in part A that one-step denoising does not
yield good results. Instead, we need to iteratively denoise the
image for better results.
Recall in part A that we used equation A.2 to generate noisy images
$x_t$ from $x_0$ for some timestep $t \in \{0, 1, \ldots, T\}$:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

Intuitively, when $t = 0$ we want $x_t$ to be the clean image $x_0$, when $t = T$ we want $x_t$ to be pure noise $\epsilon$, and for $t \in (0, T)$, $x_t$ should be some linear combination of the two.
The precise derivation of $\bar\alpha_t$ is beyond the scope of this project
(see the DDPM paper for more
details). Here, we provide you with the DDPM recipe to build a list $\{\bar\alpha_t\}$ utilizing lists $\{\alpha_t\}$ and $\{\beta_t\}$:
Create a list $\{\beta_t\}$ of length $T$ such that $\beta_1 = 0.0001$ and $\beta_T = 0.02$, and all other elements $\beta_t$ for $t \in (1, T)$ are evenly spaced between the two.
$\alpha_t = 1 - \beta_t$.
$\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ is a cumulative product of $\alpha_s$ for $s \in \{1, \ldots, t\}$.
Because we are working with simple MNIST digits, we can afford to have a smaller $T$ of 300 instead of the 1000 used in part A. Observe how $\bar\alpha_t$ is close to 1 for small $t$ and close to 0 for $t$ near $T$. $\{\beta_t\}$ is known as the variance schedule; it controls the amount of noise added at each timestep.
Now, to denoise image $x_t$, we could simply apply our UNet on $x_t$ and get the noise $\epsilon$. However, this won't work very well because the UNet is expecting the noisy image to have a particular noise variance for best results, but the variance of $x_t$ varies with $t$.
One could train $T$ separate UNets, but it is much
easier to simply condition a single UNet with timestep $t$, giving us our final objective:

$$L = \mathbb{E}_{\epsilon, x_0, t}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2$$
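In code, the recipe is only a few lines; the endpoint values 0.0001 and 0.02 follow the standard DDPM schedule and should be checked against the notebook.

import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T)          # variance schedule {beta_t}
alphas = 1.0 - betas                           # {alpha_t}
alphas_cumprod = torch.cumprod(alphas, dim=0)  # {\bar\alpha_t}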
2.1 Adding Time Conditioning to UNet
We need a way to inject scalar $t$ into our UNet model to condition it. There are many ways to do this. Here is what we suggest:
Figure 8: Time-Conditioned UNet
This uses a new operator called
FCBlock (fully-connected block) which we use to inject the conditioning signal into the UNet:
Figure 9: FCBlock for conditioning
Here Linear(F_in, F_out) is a linear layer with
F_in input features and F_out output
features. You can implement it using nn.Linear.
Since our conditioning signal is a scalar, F_in should be of size 1. We also recommend that you normalize $t$ to be in the range [0, 1] before embedding it, i.e. pass in $t/T$.
You can embed $t$ by following this pseudo code:
fc1_t = FCBlock(...)
fc2_t = FCBlock(...)
# the t passed in here should be normalized to be in the range [0, 1]
t1 = fc1_t(t)
t2 = fc2_t(t)
# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = up1 + t2
# Follow diagram to get the output.
...
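A sketch of FCBlock and how the modulation tensors get broadcast; the Linear-GELU-Linear layout is my reading of Figure 9, so verify it against the diagram.

import torch.nn as nn

class FCBlock(nn.Module):
    def __init__(self, f_in, f_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(f_in, f_out),
            nn.GELU(),
            nn.Linear(f_out, f_out),
        )
    def forward(self, x):
        return self.net(x)

# For time conditioning F_in = 1, and F_out matches the channel count of the
# tensor being modulated; reshape the output to (B, C, 1, 1) so it broadcasts
# over the spatial dimensions, e.g.
#   t1 = fc1_t(t).view(-1, c_unflatten, 1, 1)   # c_unflatten: channels of unflatten
#   unflatten = unflatten + t1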
2.2 Training the UNet
Training our time-conditioned UNet is now pretty easy. Basically, we pick a random image from the training set, a random $t$, and train the denoiser to predict the noise in $x_t$. We repeat this for different images and different $t$ values until the model converges and we are happy.
Algorithm B.1. Training time-conditioned UNet
Objective: Train a time-conditioned UNet $\epsilon_\theta(x_t, t)$ to predict the noise in $x_t$ given a noisy image $x_t$ and a timestep $t$.
Dataset and dataloader: Use the MNIST dataset via
torchvision.datasets.MNIST with flags to access training
and test sets. Train only on the training set. Shuffle the dataset
before creating the dataloader. Recommended batch size: 128. We'll
train over our dataset for 20 epochs since this task is more difficult than part A.
As shown in algorithm B.1, you should only noise the image batches when fetched from the
dataloader.
Model: Use the time-conditioned UNet architecture defined in section 2.1 with
recommended hidden dimension D = 64. Follow the diagram and pseudocode for how to inject the conditioning signal into the UNet. Remember to normalize $t$ before embedding it.
Optimizer: Use the Adam optimizer with an initial learning rate of
1e-3. We will be using an exponential learning rate decay scheduler with a gamma of $0.1^{1/\text{num\_epochs}}$. This can be implemented using scheduler = torch.optim.lr_scheduler.ExponentialLR(...). You should call scheduler.step() after every epoch.
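A sketch of this training loop; TimeConditionedUNet is a placeholder name, loader is the MNIST DataLoader with batch size 128, and betas / alphas_cumprod come from the recipe in the previous section.

import torch
import torch.nn.functional as F

num_epochs, T = 20, 300
model = TimeConditionedUNet(hidden_dim=64).to(device)   # placeholder constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1.0 / num_epochs))

for epoch in range(num_epochs):
    for x0, _ in loader:
        x0 = x0.to(device)
        t = torch.randint(0, T, (x0.shape[0],), device=device)
        eps = torch.randn_like(x0)
        abar = alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
        xt = abar.sqrt() * x0 + (1 - abar).sqrt() * eps   # equation A.2
        t_norm = (t.float() / T).view(-1, 1)              # normalize t to [0, 1]
        loss = F.mse_loss(model(xt, t_norm), eps)
        opt.zero_grad()
        loss.backward()
        opt.step()
    scheduler.step()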
Figure 10: Time-Conditioned UNet training loss curve
2.3 Sampling from the UNet
The sampling process is very similar to part A, except we don't need to predict the variance like in the DeepFloyd model. Instead, we can use our list $\{\beta_t\}$.
Algorithm B.2. Sampling from time-conditioned UNet
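A sketch of what algorithm B.2 could look like, following the standard DDPM sampling step; this is my reconstruction, so compare it against the algorithm box in the notebook.

import torch

@torch.no_grad()
def sample(model, n=10):
    x = torch.randn(n, 1, 28, 28, device=device)          # pure noise at t = T
    for t in range(T - 1, 0, -1):
        a = alphas[t].item(); abar = alphas_cumprod[t].item(); b = betas[t].item()
        eps = model(x, torch.full((n, 1), t / T, device=device))
        mean = (x - (1 - a) / ((1 - abar) ** 0.5) * eps) / (a ** 0.5)
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + (b ** 0.5) * z                          # sigma_t^2 = beta_t
    return x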
Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20
Deliverables
A training loss curve plot for the time-conditioned UNet over the whole
training process (figure 10).
Sampling results for the time-conditioned UNet for 5 and 20 epochs.
Note: providing a gif is optional and can be done as a bells and whistles below.
Begin Solution Part 2.1, 2.2, 2.3
IMPORTANT NOTE: This function takes a while to run (over 10 minutes), but its saved output can be found in the notebook (.ipynb)
when you scroll through the code cells and see the various outputs.
DESCRIPTION: Below are samples taken at various epochs of the training of the time-conditioned denoising model with
random values as its input. Once again, the images quickly become somewhat legible digits early into training;
however, the more gradual changes seen after the first few samples would require closer analysis.
DESCRIPTION: As reflected in the images above, the loss follows a sharp downward trajectory early into
training and then asymptotically approaches zero. This is a sign that the training is effective and the resulting model
can successfully handle data similar to what it was trained on; however, the lack of a test set means substantially different data
will significantly reduce the accuracy of the model.
End Solution Part 2.1, 2.2, 2.3
2.4 Adding Class-Conditioning to UNet
To make the results better and give us more control for image generation, we can also optionally condition our UNet on the class of the digit 0-9. This will require adding 2 more FCBlocks to our UNet, but we suggest that for the class-conditioning vector $c$, you make it a one-hot vector instead of a single scalar.
Because we still want our UNet to work without it being conditioned on the class, we implement dropout, where 10% of the time ($p_{\text{uncond}} = 0.1$) we drop the class conditioning vector by setting it to 0.
Here is one way to condition our UNet on both time and class :
fc1_t = FCBlock(...)
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)
t1 = fc1_t(t)
c1 = fc1_c(c)
t2 = fc2_t(t)
c2 = fc2_c(c)
# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
# Follow diagram to get the output.
...
Training for this section will be the same as time-only, with the only difference being the conditioning vector and doing unconditional generation periodically.
Algorithm B.3. Training class-conditioned UNet
Figure 11: Class-conditioned UNet training loss curve
2.5 Sampling from the Class-Conditioned UNet
The sampling process is the same as part A, where we saw that conditional results aren't good unless we use classifier-free guidance. Use classifier-free guidance with $\gamma = 5.0$ for this part.
Algorithm B.4. Sampling from class-conditioned UNet
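The class-conditioned sampling step differs from algorithm B.2 only in how the noise estimate is formed; a sketch, where c is a one-hot class vector, the model signature is an assumption, and gamma = 5.0:

# Inside the sampling loop, replace the single noise estimate with CFG:
eps_c = model(x, t_norm, c)                      # conditioned on one-hot class c
eps_u = model(x, t_norm, torch.zeros_like(c))    # class vector dropped (all zeros)
eps = eps_u + gamma * (eps_c - eps_u)            # gamma = 5.0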
Epoch 1
Epoch 5
Epoch 10
Epoch 15
Epoch 20
Deliverables
A training loss curve plot for the class-conditioned UNet over the whole training process.
Sampling results for the class-conditioned UNet for 5 and 20 epochs. Generate 4 instances of each digit as shown above.
Note: providing a gif is optional and can be done as a bells and whistles below.
Begin Solution Part 2.4, 2.5
IMPORTANT NOTE: I was unfortunately unable to obtain samples of the denoised, class-conditioned digits
because of an unresolved kernel error, which means I am only able to show the loss curve.
DESCRIPTION: The loss curve shows successful training, but I was unable to get sampling running.