Programming Project #5 (proj5)
CS180: Intro to Computer Vision and Computational Photography

Example results: Hole Filling (Original Campanile / Campanile with Hole Filled), "Make it Real" (Original Dog / Edited Dog), Man Wearing Hat, A Lithograph of a Waterfall, A Lithograph of a Skull, Bear Dancing, An Oil Painting of an Old Man, An Oil Painting of People Around a Fire.

Part A (and B!): The Power of Diffusion Models!

The first part of a larger project.

Due: 11/07/24 11:59pm

We recommend using GPUs from Colab to finish this project!

Overview

In part A you will play around with diffusion models, implement diffusion sampling loops, and use them for other tasks such as inpainting and creating optical illusions. Instructions can be found below and in the provided notebook.

Because part A is simply to get your feet wet with pre-trained diffusion models, all deliverables should be completed in the notebook. You will still submit a webpage with your results.

START EARLY!

This project, in many ways, will be the most difficult project this semester.

Part 0: Setup

Gaining Access to DeepFloyd

We are going to use the DeepFloyd IF diffusion model. DeepFloyd is a two-stage model trained by Stability AI. The first stage produces images of size 64×64 and the second stage takes the outputs of the first stage and generates images of size 256×256. We provide upsampling code at the very end of the notebook, though it is not required in your submission. Before using DeepFloyd, you must accept its usage conditions. To do so:

  1. Make a Hugging Face account and log in.
  2. Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0. Accepting the license on the stage I model card will auto accept for the other IF models.
  3. Log in locally by entering your Hugging Face Hub access token below. You should be able to find and create tokens here.

Disclaimer about Text Embeddings

DeepFloyd was trained as a text-to-image model, which takes text prompts as input and outputs images that are aligned with the text. Throughout this notebook, you will see that we ask you to generate with the prompt "a high quality photo". We want you to think of this as a "null" prompt that doesn't have any specific meaning, and is simply a way for the model to do unconditional generation. You can view this as using the diffusion model to "force" a noisy image onto the "manifold" of real images.

In the later sections, we will guide this project with a more detailed text prompt.

Downloading Precomputed Text Embeddings

Because the text encoder is very large, and barely fits on a free-tier Colab GPU, we have precomputed a couple of text embeddings for you to try. You can download the .pth file here. This should hopefully save some headaches from GPU out-of-memory errors. At the end of part A of the project, we provide you with code if you want to try your own text prompts. If you'd like, you can pay $10 for Colab Pro and avoid needing to load the two models in different sessions.

In the notebook, we instantiate DeepFloyd's stage_1 and stage_2 objects used for generation, as well as several text prompts for sample generation.

Deliverables

Begin Solution Part 0


IMPORTANT NOTE: When a label of "noise = " appears on any of the displayed figures, it refers to the "i_start" value: the timestep index at which iterative denoising and its related functions start their denoising loop. It is inversely proportional to the amount of noise in the input image.


DESCRIPTION: For each size category, the top row of images corresponds to 20 inference steps and the bottom row corresponds to 30 inference steps. The differences are subtle but noticeable, such as the details on the walls of the cabins or the facial features of the man.



DESCRIPTION: Below is an image generated with the seed value chosen for all operations that use randomness. The seed is 150.



End Solution Part 0

Part 1: Sampling Loops

In this part of the problem set, you will write your own "sampling loops" that use the pretrained DeepFloyd denoisers. These should produce high quality images such as the ones generated above.

You will then modify these sampling loops to solve different tasks such as inpainting or producing optical illusions.

Diffusion Models Primer

Starting with a clean image, $x_0$, we can iteratively add noise to the image, obtaining progressively noisier versions of the image, $x_t$, until we're left with basically pure noise at timestep $t = T$. When $t = 0$, we have a clean image, and for larger $t$ more noise is in the image.

A diffusion model tries to reverse this process by denoising the image. By giving a diffusion model a noisy $x_t$ and the timestep $t$, the model predicts the noise in the image. With the predicted noise, we can either completely remove the noise from the image, to obtain an estimate of $x_0$, or we can remove just a portion of the noise, obtaining an estimate of $x_{t-1}$, with slightly less noise.

To generate images from the diffusion model (sampling), we start with pure noise at timestep $T$ sampled from a Gaussian distribution, which we denote $x_T$. We can then predict and remove part of the noise, giving us $x_{T-1}$. Repeating this process until we arrive at $x_0$ gives us a clean image.

For the DeepFloyd models, $T = 1000$.

The exact amount of noise added at each step is dictated by noise coefficients, $\bar\alpha_t$, which were chosen by the people who trained DeepFloyd.

1.1 Implementing the Forward Process

A key part of diffusion is the forward process, which takes a clean image and adds noise to it. In this part, we will write a function to implement this. The forward process is defined by:

(A.1) $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\mathbf{I}\right)$

which is equivalent to computing (A.2) $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. That is, given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$ by sampling from a Gaussian with mean $\sqrt{\bar\alpha_t}\, x_0$ and variance $(1 - \bar\alpha_t)$. Note that the forward process is not just adding noise -- we also scale the image.

You will need to use the alphas_cumprod variable, which contains $\bar\alpha_t$ for all $t \in [0, 999]$. Remember that $t = 0$ corresponds to a clean image, and larger $t$ corresponds to more noise. Thus, $\bar\alpha_t$ is close to 1 for small $t$, and close to 0 for large $t$. The test image of the Campanile can be downloaded here; you should then resize it to 64x64. Run the forward process on the test image with $t \in [250, 500, 750]$ and display the results. You should get progressively noisier images.
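A minimal sketch of this forward process might look like the following, assuming alphas_cumprod is a 1-D tensor of the $\bar\alpha_t$ values from the notebook and im is an image tensor scaled to [0, 1] on the same device:

import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t via eq. A.2 (sketch)."""
    alpha_bar = alphas_cumprod[t]
    epsilon = torch.randn_like(im)  # epsilon ~ N(0, I)
    return torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * epsilon

# e.g. noisy_750 = forward(im, 750, alphas_cumprod)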

Deliverables

Hints

Berkeley Campanile

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

Begin Solution Part 1.1, 1.2 and 1.3


DESCRIPTION: The noisy images are the output of the forward process at various t (timestep) values, with t = 1000 resulting in a completely noisy image and t = 0 being the original image. Gaussian blur manages to at least allow the viewer to perceive the lower frequencies and get the general shape of the image. One-step denoising is the most effective, although it relies on generation to fill in the higher frequencies as the noise level increases, since data loss is unavoidable with noise.



End Solution Part 1.1, 1.2 and 1.3

1.2 Classical Denoising

Let's try to denoise these images using classical methods. Again, take noisy images for timesteps [250, 500, 750], but use Gaussian blur filtering to try to remove the noise. Getting good results should be quite difficult, if not impossible.
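One hedged way to do this uses torchvision's Gaussian blur; the kernel size and sigma below are placeholders to tune, not values prescribed by the assignment:

import torchvision.transforms.functional as TF

# noisy[t] are the noisy images from part 1.1; kernel_size and sigma are guesses to tune.
blurred = {t: TF.gaussian_blur(noisy[t], kernel_size=7, sigma=2.0)
           for t in [250, 500, 750]}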

Deliverables

Hints

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

Gaussian Blur Denoising at t=250

Gaussian Blur Denoising at t=500

Gaussian Blur Denoising at t=750

1.3 One-Step Denoising

Now, we'll use a pretrained diffusion model to denoise. The actual denoiser can be found at stage_1.unet. This is a UNet that has already been trained on a very, very large dataset of (x0,xt) pairs of images. We can use it to recover Gaussian noise from the image. Then, we can remove this noise to recover (something close to) the original image. Note: this UNet is conditioned on the amount of Gaussian noise by taking timestep t as additional input.

Because this diffusion model was trained with text conditioning, we also need a text prompt embedding. We provide the embedding for the prompt "a high quality photo" for you to use. Later on, you can use your own text prompts.
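A rough sketch of one-step denoising follows. It assumes the stage_1 pipeline, the alphas_cumprod tensor, and a prompt_embeds tensor for "a high quality photo" from the notebook, and it assumes the IF UNet's output carries the noise estimate in its first three channels (the remaining channels hold the predicted variance):

import torch

@torch.no_grad()
def one_step_denoise(x_t, t, prompt_embeds):
    """Predict the noise in x_t, then invert eq. A.2 to estimate the clean image (sketch)."""
    t_tensor = torch.tensor([t], device=x_t.device)
    unet_out = stage_1.unet(x_t, t_tensor,
                            encoder_hidden_states=prompt_embeds,
                            return_dict=False)[0]
    noise_est = unet_out[:, :3]  # assumed layout: noise estimate first, variance after
    alpha_bar = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - alpha_bar) * noise_est) / torch.sqrt(alpha_bar)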

Deliverables

Hints

Noisy Campanile at t=250

Noisy Campanile at t=500

Noisy Campanile at t=750

One-Step Denoised Campanile at t=250

One-Step Denoised Campanile at t=500

One-Step Denoised Campanile at t=750

1.4 Iterative Denoising

In part 1.3, you should see that the denoising UNet does a much better job of projecting the image onto the natural image manifold, but it does get worse as you add more noise. This makes sense, as the problem is much harder with more noise!

But diffusion models are designed to denoise iteratively. In this part we will implement this.

In theory, we could start with noise $x_{1000}$ at timestep $T = 1000$, denoise for one step to get an estimate of $x_{999}$, and carry on until we get $x_0$. But this would require running the diffusion model 1000 times, which is quite slow (and costs $$$).

It turns out, we can actually speed things up by skipping steps. The rationale for why this is possible is due to a connection with differential equations. It's a tad complicated, and not within scope for this course, but if you're interested you can check out this excellent article.

To skip steps we can create a new list of timesteps that we'll call strided_timesteps, which does just this. strided_timesteps[0] will correspond to the noisiest image (and thus the largest t) and strided_timesteps[-1] will correspond to a clean image. One simple way of constructing this list is with a regular stride (e.g. a stride of 30 works well).

On the ith denoising step we are at t = strided_timesteps[i], and want to get to t' = strided_timesteps[i+1] (from more noisy to less noisy). To actually do this, we have the following formula:

(A.3) $x_{t'} = \dfrac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\, x_0 + \dfrac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\, x_t + v_\sigma$

where $x_t$ is your image at timestep $t$; $x_{t'}$ is your noisy image at timestep $t'$, where $t' < t$ (less noisy); $\bar\alpha_t$ is defined by alphas_cumprod, as explained above; $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$; $\beta_t = 1 - \alpha_t$; and $x_0$ is our current estimate of the clean image using equation A.2, just like in section 1.3.

The $v_\sigma$ term is random noise, which in the case of DeepFloyd is also predicted. The process to compute this is not very important, so we supply a function, add_variance, to do this for you.

You can think of this as a linear interpolation between the signal and noise:
Interpolation

See equations 6 and 7 of the DDPM paper for more information. Be careful about bars above the alpha! Some have them and some do not.

First, create the list strided_timesteps. You should start at timestep 990, and take steps of size 30 until you arrive at 0. After completing the problem set, feel free to try different "schedules" of timesteps.

Also implement the function iterative_denoise(image, i_start), which takes a noisy image image, as well as a starting index i_start. The function should denoise the image starting at timestep strided_timesteps[i_start], applying the above formula to obtain an image at timestep t' = strided_timesteps[i_start + 1], and repeat iteratively until we arrive at a clean image.

Add noise to the test image im to timestep strided_timesteps[10] and display this image. Then run the iterative_denoise function on the noisy image, with i_start = 10, to obtain a clean image and display it. Please display every 5th image of the denoising loop. Compare this to the "one-step" denoising method from the previous section, and to Gaussian blurring.
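Putting the pieces together, here is a hedged sketch of strided_timesteps and iterative_denoise. It reuses the assumed stage_1.unet call and channel layout from the one-step sketch above, and the add_variance call is schematic since its real signature is defined in the notebook:

import torch

strided_timesteps = list(range(990, -1, -30))  # 990, 960, ..., 30, 0

@torch.no_grad()
def iterative_denoise(image, i_start, prompt_embeds):
    x_t = image.clone()
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prime = strided_timesteps[i], strided_timesteps[i + 1]

        # Predict the noise at the current timestep.
        t_tensor = torch.tensor([t], device=x_t.device)
        unet_out = stage_1.unet(x_t, t_tensor,
                                encoder_hidden_states=prompt_embeds,
                                return_dict=False)[0]
        noise_est = unet_out[:, :3]

        # Current clean-image estimate (eq. A.2 inverted).
        a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        x0_est = (x_t - torch.sqrt(1 - a_bar_t) * noise_est) / torch.sqrt(a_bar_t)

        # Eq. A.3: interpolate between the clean estimate and the current noisy image.
        alpha_t = a_bar_t / a_bar_tp
        beta_t = 1 - alpha_t
        x_t = (torch.sqrt(a_bar_tp) * beta_t / (1 - a_bar_t)) * x0_est \
            + (torch.sqrt(alpha_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
        x_t = add_variance(unet_out, t, x_t)  # schematic call to the provided helper
    return x_t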

Deliverables

Using i_start = 10:

Hints

Noisy Campanile at t=90

Noisy Campanile at t=240

Noisy Campanile at t=390

Noisy Campanile at t=540

Noisy Campanile at t=690

Original

Iteratively Denoised Campanile

One-Step Denoised Campanile

Gaussian Blurred Campanile

Begin Solution Part 1.4


DESCRIPTION: The input image is fed into the various denoising functions at i_start = 10 after being run through the forward process. The outputs of every fifth loop iteration are displayed and show the gradual denoising process. Overall, the output's shape is formed fairly early in the series of loops, and it is the higher frequencies that are never fully recovered. When compared, the output of iterative denoising is "cleaner" than the output of one-step denoising, but it significantly lacks the higher frequencies, which have at least some presence in the one-step result, such as the leaf density of the trees behind the Campanile. The Gaussian blur is effective at revealing the lower frequencies and gauging the overall shape of the image, but not much else.



End Solution Part 1.4

1.5 Diffusion Model Sampling

In part 1.4, we use the diffusion model to denoise an image. Another thing we can do with the iterative_denoise function is to generate images from scratch. We can do this by setting i_start = 0 and passing in random noise. This effectively denoises pure noise. Please do this, and show 5 results of "a high quality photo".

Deliverables

Hints

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Begin Solution Part 1.5


DESCRIPTION: The iterative_denoise function essentially eliminates the random, static-like noise from the random input images. However, the produced images do not have defined features, especially in the higher frequencies where fine-grained detail would be found, since the randomness the image was generated from offers no guidance, and, with only a generic prompt conditioning the model, the outputs are still largely noisy.



End Solution Part 1.5

1.6 Classifier-Free Guidance (CFG)

You may have noticed that the generated images in the prior section are not very good, and some are completely nonsensical. In order to greatly improve image quality (at the expense of image diversity), we can use a technique called Classifier-Free Guidance.

In CFG, we compute both a conditional and an unconditional noise estimate. We denote these $\epsilon_c$ and $\epsilon_u$. Then, we let our new noise estimate be: (A.4) $\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$, where $\gamma$ controls the strength of CFG. Notice that for $\gamma = 0$, we get an unconditional noise estimate, and for $\gamma = 1$ we get the conditional noise estimate. The magic happens when $\gamma > 1$. In this case, we get much higher quality images. Why this happens is still up to vigorous debate. For more information on CFG, you can check out this blog post.

Please implement the iterative_denoise_cfg function, identical to the iterative_denoise function but using classifier-free guidance. To get an unconditional noise estimate, we can just pass an empty prompt embedding to the diffusion model (the model was trained to predict an unconditional noise estimate when given an empty text prompt).

Disclaimer: Before, we used "a high quality photo" as a "null" condition. Now, we will use the actual "" null prompt for unconditional guidance in CFG. In the later parts, you should always use the "" null prompt for unconditional guidance and use "a high quality photo" for unconditional generation.
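A minimal sketch of the CFG noise estimate, assuming the notebook-provided stage_1 pipeline, a conditional embedding cond_embeds ("a high quality photo"), and an unconditional embedding uncond_embeds (the "" prompt):

import torch

@torch.no_grad()
def cfg_noise_estimate(x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Eq. A.4: blend conditional and unconditional noise estimates (sketch)."""
    t_tensor = torch.tensor([t], device=x_t.device)
    eps_c = stage_1.unet(x_t, t_tensor, encoder_hidden_states=cond_embeds,
                         return_dict=False)[0][:, :3]
    eps_u = stage_1.unet(x_t, t_tensor, encoder_hidden_states=uncond_embeds,
                         return_dict=False)[0][:, :3]
    return eps_u + gamma * (eps_c - eps_u)

Inside iterative_denoise_cfg, this estimate simply replaces the single noise estimate used in part 1.4.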

Deliverables

Hints

Sample 1 with CFG

Sample 2 with CFG

Sample 3 with CFG

Sample 4 with CFG

Sample 5 with CFG

Begin Solution Part 1.6


DESCRIPTION: By using classifier-free guidance with the null prompt and a CFG scale of 7, the output images now resemble photographs instead of the noisy, grainy, texture-like outputs of the previous function. The kinds of images produced also reflect the type of training data the model has seen: high-resolution photographs usually happen to be taken of something noteworthy rather than something mundane.



End Solution Part 1.6

1.7 Image-to-image Translation

In part 1.4, we take a real image, add noise to it, and then denoise. This effectively allows us to make edits to existing images. The more noise we add, the larger the edit will be. This works because in order to denoise an image, the diffusion model must to some extent "hallucinate" new things -- the model has to be "creative." Another way to think about it is that the denoising process "forces" a noisy image back onto the manifold of natural images.

Here, we're going to take the original test image, noise it a little, and force it back onto the image manifold without any conditioning. Effectively, we're going to get an image that is similar to the test image (with a low-enough noise level). This follows the SDEdit algorithm.

To start, please run the forward process to get a noisy test image, and then run the iterative_denoise_cfg function using starting indices of [1, 3, 5, 7, 10, 20] and show the results, labeled with the starting index. You should see a series of "edits" to the original image, gradually matching the original image more and more closely.

Deliverables

Hints

Note: You should use CFG from this point forward.
SDEdit with i_start=1

SDEdit with i_start=3

SDEdit with i_start=5

SDEdit with i_start=7

SDEdit with i_start=10

SDEdit with i_start=20

Campanile

Begin Solution Part 1.7


DESCRIPTION: The forward process is used to generate various levels of noise on a given input image; then, the noisy image is run through the denoiser to see how much of the original image can be recovered by the multi-step denoising algorithm at different starting indices. As i_start approaches zero (i.e., the noise level grows), the output image bears little resemblance to the clean, unaltered input image.



End Solution Part 1.7

1.7.1 Editing Hand-Drawn and Web Images

This procedure works particularly well if we start with a nonrealistic image (e.g. painting, a sketch, some scribbles) and project it onto the natural image manifold.

Please experiment by starting with hand-drawn or other non-realistic images and see how you can get them onto the natural image manifold in fun ways.

We provide you with 2 ways to provide inputs to the model:

  1. Download images from the web
  2. Draw your own images

Please find an image from the internet and apply edits exactly as above. And also draw your own images, and apply edits exactly as above. Feel free to copy the prior cell here. For drawing inspiration, you can check out the examples on this project page.

Deliverables

Hints

Bear at i_start=1

Bear at i_start=3

Bear at i_start=5

Bear at i_start=7

Bear at i_start=10

Bear at i_start=20

Bear

House at i_start=1

House at i_start=3

House at i_start=5

House at i_start=7

House at i_start=10

House at i_start=20

Original House Sketch

Begin Solution Part 1.7.1


DESCRIPTION: As an experiment, I took an input image from the web that contained only fine details. As expected, it did not take many timesteps for the content in the white square to be completely overwritten due to its fine nature, so the information the image conveyed (people, house, background) quickly disappeared.



DESCRIPTION: For my first drawing, I attempted a sailboat to see if the denoising algorithm was able to predict it, but there seems to be a bias towards generating images of humans from noise.



DESCRIPTION: This time, I attempted to draw random movements with my cursor to see if there was a bias toward generating images of humans more than other objects, and the denoising algorithm interpreted the drawing as a person viewed at a distance, with some resemblance in shape at noise level 10.



End Solution Part 1.7.1

1.7.2 Inpainting

We can use the same procedure to implement inpainting (following the RePaint paper). That is, given an image $x_{orig}$ and a binary mask $\mathbf{m}$, we can create a new image that has the same content where $\mathbf{m}$ is 0, but new content wherever $\mathbf{m}$ is 1.

To do this, we can run the diffusion denoising loop. But at every step, after obtaining $x_t$, we "force" $x_t$ to have the same pixels as $x_{orig}$ where $\mathbf{m}$ is 0, i.e.:

(A.5) $x_t \leftarrow \mathbf{m}\, x_t + (1 - \mathbf{m})\, \text{forward}(x_{orig}, t)$

Essentially, we leave everything inside the edit mask alone, but we replace everything outside the edit mask with our original image -- with the correct amount of noise added for timestep t.

Please implement this below, and edit the picture to inpaint the top of the Campanile.
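One way the per-step forcing might look in code, assuming the forward function and strided_timesteps from earlier and a hypothetical denoise_step_cfg helper that performs one CFG denoising step of the loop from part 1.6:

import torch

@torch.no_grad()
def inpaint(x_orig, mask, prompt_embeds):
    """Generate where mask == 1, keep x_orig where mask == 0 (eq. A.5, sketch)."""
    x_t = torch.randn_like(x_orig)  # start from pure noise
    for i in range(len(strided_timesteps) - 1):
        t_prime = strided_timesteps[i + 1]
        x_t = denoise_step_cfg(x_t, i, prompt_embeds)  # hypothetical single CFG step
        # Force the known region back to the appropriately-noised original.
        x_t = mask * x_t + (1 - mask) * forward(x_orig, t_prime, alphas_cumprod)
    return x_t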

Deliverables

Hints

Campanile

Mask

Hole to Fill

Campanile Inpainted

Begin Solution Part 1.7.2


DESCRIPTION: Shown below are the three input images and their corresponding masks. It is the job of the denoiser to fill the white region of each mask, while the part of the image corresponding to the dark part of the mask is cut in from the unaltered input image at each step of the denoising process.



DESCRIPTION: For my two input images, I chose masks corresponding to their level of detail. For the image of Yosemite, I chose to exclude more of the image, since the landscape seemed fairly predictable for a generative denoiser. For the image of the grizzly bear, I chose to exclude a small but detailed section of the image. For both images, there are instances where mask boundaries are clearly visible, which is not the case for the image of the Campanile due to its small gap and the low-detail region of the mask.



End Solution Part 1.7.2

1.7.3 Text-Conditional Image-to-image Translation

Now, we will do the same thing as the previous section, but guide the projection with a text prompt. This is no longer pure "projection to the natural image manifold" but also adds control using language. This is simply a matter of changing the prompt from "a high quality photo" to any of the precomputed prompts we provide you (if you want to use your own prompts, see appendix).

Deliverables

Hints

Rocket Ship at noise level 1

Rocket Ship at noise level 3

Rocket Ship at noise level 5

Rocket Ship at noise level 7

Rocket Ship at noise level 10

Rocket Ship at noise level 20

Campanile

Begin Solution Part 1.7.3


DESCRIPTION: The results are an implementation of the classifier-free-guidance denoiser, with the inputs being the Campanile, Yosemite, and a grizzly bear at various noise levels. Since I chose to keep the same "rocket ship" prompt for all input images, there were some odd and unexpected results at the lower noise levels.



End Solution Part 1.7.3

1.8 Visual Anagrams

In this part, we are finally ready to implement Visual Anagrams and create optical illusions with diffusion models. In this part, we will create an image that looks like "an oil painting of people around a campfire", but when flipped upside down will reveal "an oil painting of an old man".

To do this, we will denoise an image $x_t$ at step $t$ normally with the prompt "an oil painting of an old man", to obtain noise estimate $\epsilon_1$. But at the same time, we will flip $x_t$ upside down, and denoise with the prompt "an oil painting of people around a campfire", to get noise estimate $\epsilon_2$. We can flip $\epsilon_2$ back, to make it right-side up, and average the two noise estimates. We can then perform a reverse/denoising diffusion step with the averaged noise estimate.

The full algorithm will be:

$\epsilon_1 = \text{UNet}(x_t, t, p_1)$

$\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$

$\epsilon = (\epsilon_1 + \epsilon_2) / 2$

where UNet is the diffusion model UNet from before, flip() is a function that flips the image, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$. Please implement the above algorithm and show an example of an illusion.
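A hedged sketch of the anagram noise estimate, reusing the cfg_noise_estimate sketch from part 1.6 and a notebook-provided uncond_embeds; the flip here is over the image height axis:

import torch

def anagram_noise_estimate(x_t, t, p1_embeds, p2_embeds, uncond_embeds):
    """Average an upright and an upside-down CFG noise estimate (sketch)."""
    eps1 = cfg_noise_estimate(x_t, t, p1_embeds, uncond_embeds)
    flipped = torch.flip(x_t, dims=[-2])  # flip the image upside down
    eps2 = torch.flip(cfg_noise_estimate(flipped, t, p2_embeds, uncond_embeds), dims=[-2])
    return (eps1 + eps2) / 2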

Deliverables

Hints

An Oil Painting of an Old Man

An Oil Painting of People around a Campfire

Begin Solution Part 1.8


DESCRIPTION: To implement this function, the number of calls to UNet was doubled compared to earlier functions in order to get two noise-estimate and variance values. Ultimately, the variance was ignored rather than averaged, since averaging it had no measurable effect on the results. The noise estimate was flipped back upright after being computed from the flipped x_t value. The given prompt generates clean, reliable results, while my own prompts required several attempts to get a successful result. As shown, not all results were successful.



End Solution Part 1.8

1.9 Hybrid Images

In this part we'll implement Factorized Diffusion and create hybrid images just like in project 2.

In order to create hybrid images with a diffusion model we can use a similar technique as above. We will create a composite noise estimate $\epsilon$, by estimating the noise with two different text prompts, and then combining low frequencies from one noise estimate with high frequencies of the other. The algorithm is:

$\epsilon_1 = \text{UNet}(x_t, t, p_1)$

$\epsilon_2 = \text{UNet}(x_t, t, p_2)$

$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$

where UNet is the diffusion model UNet, $f_{\text{lowpass}}$ is a low pass function, $f_{\text{highpass}}$ is a high pass function, and $p_1$ and $p_2$ are two different text prompt embeddings. Our final noise estimate is $\epsilon$. Please show an example of a hybrid image using this technique (you may have to run it multiple times to get a really good result, for the same reasons as above). We recommend that you use a Gaussian blur of kernel size 33 and sigma 2.
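A sketch of the hybrid noise estimate under the same assumptions, using torchvision's Gaussian blur as the low-pass filter (kernel size 33, sigma 2, as recommended above):

import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, p1_embeds, p2_embeds, uncond_embeds):
    """Low frequencies from prompt 1's noise estimate, high frequencies from prompt 2's (sketch)."""
    eps1 = cfg_noise_estimate(x_t, t, p1_embeds, uncond_embeds)
    eps2 = cfg_noise_estimate(x_t, t, p2_embeds, uncond_embeds)
    low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)
    return low + high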

Deliverables

Hints

Hybrid image of a skull and a waterfall

Begin Solution Part 1.9


DESCRIPTION: For these results, the recommended Gaussian kernel size and sigma value were used. To observe the effects described for each image, stand close to see the detailed features and stand far away to see the low-frequency features. The third result differs from the first two because the "face" of the image is in the high frequencies, in contrast to the "face" being present in the low frequencies of the earlier images. This is because, while both Trump and a cheeseburger have similar color schemes and palettes, only the cheeseburger is recognizable in the low frequencies (at a distance).



End Solution Part 1.9

Part 2: Bells & Whistles

Using your own Prompts and Upsampling Generations

We provide you with code in the notebook to use your own prompts and upsample your generations!

Part B: Diffusion Models from Scratch!

The second part of a larger project.

Due: 11/19/24 11:59pm

We recommend using GPUs from Colab to finish this project!

Overview

In part B you will train your own diffusion model on MNIST. Starter code can be found in the provided notebook.

START EARLY!

This project, in many ways, will be the most difficult project this semester.

Note: this is an updated, clearer version of the part B instructions. For the old version, please see here.

Part 1: Training a Single-Step Denoising UNet

Let's warm up by building a simple one-step denoiser. Given a noisy image $z$, we aim to train a denoiser $D_\theta$ such that it maps $z$ to a clean image $x$. To do so, we can optimize over an L2 loss: (B.1) $L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$

1.1 Implementing the UNet

In this project, we implement the denoiser as a UNet. It consists of a few downsampling and upsampling blocks with skip connections.

Figure 1: Unconditional UNet

The diagram above uses a number of standard tensor operations defined as follows:

Figure 2: Standard UNet Operations

These operations are defined in Figure 2 above; at a high level, each block changes the number of channels and/or the spatial resolution of the tensor as it moves down and back up the UNet.

We define composed operations using our simple operations in order to make our network deeper. This doesn't change the tensor's height, width, or number of channels, but simply adds more learnable parameters.

1.2 Using the UNet to Train a Denoiser

Recall from equation B.1 that we aim to solve the following denoising problem: given a noisy image $z$, we aim to train a denoiser $D_\theta$ such that it maps $z$ to a clean image $x$. To do so, we can optimize over an L2 loss $L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2$. To train our denoiser, we need to generate training data pairs of ($z$, $x$), where each $x$ is a clean MNIST digit. For each training batch, we can generate $z$ from $x$ using the following noising process: (B.2) $z = x + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Visualize the different noising processes over $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$, assuming normalized $x \in [0, 1]$. It should be similar to the following plot:
Figure 3: Varying levels of noise on MNIST digits
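A minimal sketch of generating the (z, x) training pairs, assuming x is a batch of clean MNIST digits with values in [0, 1]:

import torch

def add_noise(x, sigma):
    """z = x + sigma * epsilon, epsilon ~ N(0, I) (eq. B.2, sketch)."""
    return x + sigma * torch.randn_like(x)

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
noised = [add_noise(x, s) for s in sigmas]  # visualize each of these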

1.2.1 Training

Now, we will train the model to perform denoising.

Figure 4: Training Loss Curve

You should visualize denoised results on the test set at the end of training. Display sample results after the 1st and 5th epoch.

They should look something like these:

Figure 5: Results on digits from the test set after 1 epoch of training

Figure 6: Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

Our denoiser was trained on MNIST digits noised with $\sigma = 0.5$. Let's see how the denoiser performs on different $\sigma$'s that it wasn't trained for.

Visualize the denoiser results on test set digits with varying levels of noise $\sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]$.

Figure 7: Results on digits from the test set with varying noise levels.

Deliverables

Hint

Begin Solution Part 1


IMPORTANT NOTE: This function takes a few minutes to run, so it is best to refer to its stored output in the notebook to save time.


DESCRIPTION: Below is the noising process of the 28 x 28 images. This is the same forward process as used in part A and what will be fed into the inputs for denoising. Here are various levels of sigma.



DESCRIPTION: Below is an example of the denoising process executed after one epoch of training. This epoch is a great step towards a full denoiser, but there is still some noise present.



DESCRIPTION: Below is an example of the denoising process executed after five epochs of training. This epoch shows minor changes over the previous one, but the result is nonetheless noticeably clearer with more epochs of training and a lower loss that approaches zero.



DESCRIPTION: Below is the training loss curve for the 5 epochs used for the unconditional denoiser. The loss quickly drops and begins approaching zero very early in the process, as is reflected by the earlier images, which showed a sudden increase in performance at the start of training, with gradual improvements afterwards.



DESCRIPTION: Below, the fully trained denoiser is fed inputs with various levels of sigma to test its ability to denoise images and recover meaningful information. For all images, the denoised outputs have less contrast and clarity, but artifacts do not appear until the very high noise levels, confirming the success indicated by the loss curve.



End Solution Part 1

Part 2: Training a Diffusion Model

Now, we are ready for diffusion, where we will train a UNet model that can iteratively denoise an image. We will implement DDPM in this part.

Let's revisit the problem we solved in equation B.1:

$L = \mathbb{E}_{z,x}\,\|D_\theta(z) - x\|^2.$

We will first introduce one small difference: we can change our UNet to predict the added noise $\epsilon$ instead of the clean image $x$ (like in part A of the project). Mathematically, these are equivalent since $x = z - \sigma\epsilon$ (equation B.2). Therefore, we can turn equation B.1 into the following:

(B.3) $L = \mathbb{E}_{\epsilon,z}\,\|\epsilon_\theta(z) - \epsilon\|^2$

where $\epsilon_\theta$ is a UNet trained to predict noise.

For diffusion, we eventually want to sample a pure noise image $\epsilon \sim \mathcal{N}(0, I)$ and generate a realistic image $x$ from the noise. However, we saw in part A that one-step denoising does not yield good results. Instead, we need to iteratively denoise the image for better results.

Recall in part A that we used equation A.2 to generate noisy images $x_t$ from $x_0$ for some timestep $t \in \{0, 1, \dots, T\}$: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$. Intuitively, when $t = 0$ we want $x_t$ to be the clean image $x_0$, when $t = T$ we want $x_t$ to be pure noise $\epsilon$, and for $t \in \{1, \dots, T-1\}$, $x_t$ should be some linear combination of the two. The precise derivation of $\bar\alpha$ is beyond the scope of this project (see the DDPM paper for more details). Here is the DDPM recipe to build the list $\bar\alpha$ for $t \in \{0, 1, \dots, T\}$ utilizing lists $\alpha$ and $\beta$: $\beta$ is a list of values evenly spaced between a small starting value and a slightly larger ending value (the DDPM paper uses 0.0001 and 0.02); $\alpha_t = 1 - \beta_t$; and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ is a cumulative product.

Because we are working with simple MNIST digits, we can afford to have a smaller $T$ of 300 instead of the 1000 used in part A. Observe how $\bar\alpha_t$ is close to 1 for small $t$ and close to 0 for $t = T$. $\beta$ is known as the variance schedule; it controls the amount of noise added at each timestep.
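As a sketch, the schedule above might be built as follows; the linear endpoints 1e-4 and 0.02 are the DDPM paper's defaults and are an assumption here:

import torch

T = 300
betas = torch.linspace(1e-4, 0.02, T + 1)      # beta_t, the variance schedule
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alphas_cumprod = torch.cumprod(alphas, dim=0)  # alpha_bar_t: ~1 for small t, ~0 near T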

Now, to denoise image $x_t$, we could simply apply our UNet $\epsilon_\theta$ on $x_t$ and get the noise $\epsilon$. However, this won't work very well because the UNet is expecting the noisy image to have a noise variance of $\sigma = 0.5$ for best results, but the variance of $x_t$ varies with $t$. One could train $T$ separate UNets, but it is much easier to simply condition a single UNet on the timestep $t$, giving us our final objective: (B.4) $L = \mathbb{E}_{\epsilon, x_0, t}\,\|\epsilon_\theta(x_t, t) - \epsilon\|^2.$

2.1 Adding Time Conditioning to UNet

We need a way to inject scalar t into our UNet model to condition it. There are many ways to do this. Here is what we suggest:
Figure 8: Time-Conditioned UNet

This uses a new operator called FCBlock (fully-connected block) which we use to inject the conditioning signal into the UNet:

Figure 9: FCBlock for conditioning

Here Linear(F_in, F_out) is a linear layer with F_in input features and F_out output features. You can implement it using nn.Linear.

Since our conditioning signal $t$ is a scalar, F_in should be of size 1. We also recommend that you normalize $t$ to be in the range [0, 1] before embedding it, i.e. pass in $t/T$.
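One possible FCBlock, assuming (from Figure 9) that it is a Linear layer followed by a GELU and another Linear layer; adjust it to whatever the diagram actually specifies:

import torch.nn as nn

class FCBlock(nn.Module):
    """Fully-connected block for injecting a conditioning signal (sketch of Figure 9)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.GELU(),
            nn.Linear(out_features, out_features),
        )

    def forward(self, x):
        return self.net(x)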

You can embed t by following this pseudo code:


fc1_t = FCBlock(...)
fc2_t = FCBlock(...)

# the t passed in here should be normalized to be in the range [0, 1]
t1 = fc1_t(t)
t2 = fc2_t(t)

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = up1 + t2
# Follow diagram to get the output.
...
  

2.2 Training the UNet

Training our time-conditioned UNet $\epsilon_\theta(x_t, t)$ is now pretty easy. Basically, we pick a random image from the training set, a random $t$, and train the denoiser to predict the noise in $x_t$. We repeat this for different images and different $t$ values until the model converges and we are happy.

Algorithm B.1. Training time-conditioned UNet
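A hedged sketch of one training step of Algorithm B.1, assuming the betas/alphas/alphas_cumprod schedule above and a time-conditioned unet that takes a normalized (batch, 1) timestep tensor:

import torch
import torch.nn.functional as F

def train_step(unet, x0, optimizer, T=300):
    """Sample a random t, noise x0 to x_t, and regress the prediction onto epsilon (sketch)."""
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps
    pred = unet(x_t, (t.float() / T).unsqueeze(-1))  # t normalized to [0, 1]
    loss = F.mse_loss(pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()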

Figure 10: Time-Conditioned UNet training loss curve

2.3 Sampling from the UNet

The sampling process is very similar to part A, except we don't need to predict the variance like in the DeepFloyd model. Instead, we can use our list $\beta$.

Algorithm B.2. Sampling from time-conditioned UNet
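A sketch of Algorithm B.2 under the same assumptions (the unet signature and the schedule tensors are assumptions carried over from above):

import math
import torch

@torch.no_grad()
def sample(unet, n, T=300, device="cuda"):
    """Iteratively denoise pure noise into MNIST-sized samples (sketch of Algorithm B.2)."""
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        t_norm = torch.full((n, 1), t / T, device=device)
        eps = unet(x, t_norm)
        a, a_bar, b = alphas[t].item(), alphas_cumprod[t].item(), betas[t].item()
        x = (x - (1 - a) / math.sqrt(1 - a_bar) * eps) / math.sqrt(a) + math.sqrt(b) * z
    return x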

Epoch 1

Epoch 5

Epoch 10

Epoch 15

Epoch 20

Deliverables

Begin Solution Part 2.1, 2.2, 2.3


IMPORTANT NOTE: This function takes a while to run (over 10 minutes), but its saved output can be found in the notebook by scrolling through the code cells and their outputs.


DESCRIPTION: Below are samples taken at various epochs of the training of the time-conditioned denoising model, with random noise as its input. Once again, the images quickly become somewhat legible digits early in training; however, the more gradual changes seen after the first few samples require closer analysis.












DESCRIPTION: As reflected in the images above, the loss follows a sharp downward trajectory early in training and then asymptotically approaches zero. This is a sign that training is effective and the resulting model can successfully handle data similar to what it was trained on; however, the lack of a test set means different data could significantly reduce the accuracy of the model.



End Solution Part 2.1, 2.2, 2.3

2.4 Adding Class-Conditioning to UNet

To make the results better and give us more control for image generation, we can also optionally condition our UNet on the class of the digit 0-9. This will require adding 2 more FCBlocks to our UNet; we suggest that you make the class-conditioning vector $c$ a one-hot vector instead of a single scalar. Because we still want our UNet to work without being conditioned on the class, we implement dropout, where 10% of the time ($p_{\text{uncond}} = 0.1$) we drop the class-conditioning vector $c$ by setting it to 0. Here is one way to condition our UNet $\epsilon_\theta(x_t, t, c)$ on both time $t$ and class $c$:

fc1_t = FCBlock(...)
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

t1 = fc1_t(t)
c1 = fc1_c(c)
t2 = fc2_t(t)
c2 = fc2_c(c)

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
# Follow diagram to get the output.
...



      
Training for this section will be the same as time-only, with the only difference being the conditioning vector c and doing unconditional generation periodically.
Algorithm B.3. Training class-conditioned UNet

Figure 11: Class-conditioned UNet training loss curve

2.5 Sampling from the Class-Conditioned UNet

The sampling process is the same as part A, where we saw that conditional results aren't good unless we use classifier-free guidance. Use classifier-free guidance with $\gamma = 5.0$ for this part.
Algorithm B.4. Sampling from class-conditioned UNet
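A hedged sketch of Algorithm B.4, assuming a class-conditioned unet(x, t_norm, c) and the schedule tensors from above; the unconditional estimate is obtained by zeroing out the one-hot class vector, and the $\gamma = 5.0$ value follows the spec above:

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_cfg(unet, digit, n, gamma=5.0, T=300, device="cuda"):
    """Class-conditioned sampling with classifier-free guidance (sketch of Algorithm B.4)."""
    c = F.one_hot(torch.full((n,), digit, device=device), num_classes=10).float()
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in range(T, 0, -1):
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        t_norm = torch.full((n, 1), t / T, device=device)
        eps_c = unet(x, t_norm, c)                    # conditional estimate
        eps_u = unet(x, t_norm, torch.zeros_like(c))  # unconditional: class vector dropped
        eps = eps_u + gamma * (eps_c - eps_u)         # CFG blend (eq. A.4)
        a, a_bar, b = alphas[t].item(), alphas_cumprod[t].item(), betas[t].item()
        x = (x - (1 - a) / math.sqrt(1 - a_bar) * eps) / math.sqrt(a) + math.sqrt(b) * z
    return x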

Epoch 1

Epoch 5

Epoch 10

Epoch 15

Epoch 20

Deliverables

Begin Solution Part 2.4, 2.5


IMPORTANT NOTE: I was unfortunately unable to obtain samples of the denoised, class-conditioned digits because of an unresolved kernel error, which means I am only able to show the loss curve.


DESCRIPTION: The loss curve for these functions shows that the forward/training process was successful, but I was unable to get sampling running.



End Solution Part 2.4, 2.5

Acknowledgements

This project was a joint effort by Daniel Geng, Hang Gao, and Ryan Tabrizi, advised by Liyue Shen, Andrew Owens, and Alexei Efros.