diff --git a/README.md b/README.md
index f044c82..0ed4c99 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,77 @@
-CUDA Denoiser For CUDA Path Tracer
-==================================
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture**
+# Project 4 - CUDA Denoiser for CUDA Path Tracer
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
+* Jonas Oppenheim ([LinkedIn](https://www.linkedin.com/in/jonasoppenheim/), [GitHub](https://github.com/oppenheimj/), [personal](http://www.jonasoppenheim.com/))
+* Tested on: Windows 10, Ryzen 9 5950x, 32GB, RTX 3080 (personal machine)
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+## Introduction
+We saw during the previous project that it takes many hundreds or even thousands of iterations for the noise to dissipate in a path traced image. The purpose of this project is to implement a clever technique that denoises a path traced image after only a handful of iterations. The technique is described in the paper "[Edge-Avoiding A-Trous Wavelet Transform for fast Global Illumination Filtering](https://jo.dreggn.org/home/2010_atrous.pdf)" by Dammertz, Sewtz, Hanika, and Lensch.
 
-### (TODO: Your README)
+The naive way to reduce noise in a path traced image would be to apply a Gaussian blur filter. This would be naive because edges that should stay sharp would end up blurred. What we _really_ want is to blur only _within areas that, sort of, form one piece_.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+The idea presented in the paper is instead to store per-pixel information and use it to let each pixel compare itself to its neighbors and apply the blur selectively. First, the path tracer is run for a few iterations and per-pixel information is stored in what is called a "gbuffer". This information includes position and normal vectors. Then, every pixel looks at its surrounding pixels and compares its gbuffer data to the neighbors' gbuffer data to decide which neighbors are similar enough to be blurred together.
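+
+The weighting that makes the blur edge-aware is easiest to see in code. Below is a simplified sketch of the per-neighbor weight computation from the `kernDenoise` kernel later in this diff (the real kernel also scales the normal term by the step width): each of the color, normal, and position differences feeds an exponential falloff, and the product of the three modulates the blur kernel's weight for that neighbor.
+
+```
+// Sketch: edge-stopping weights for one neighbor j of center pixel i.
+// c_phi, n_phi, p_phi are the user-tunable color/normal/position weights.
+glm::vec3 d = color[i] - color[j];
+float c_w = min(exp(-glm::dot(d, d) / c_phi), 1.f);   // color similarity
+d = normal[i] - normal[j];
+float n_w = min(exp(-glm::dot(d, d) / n_phi), 1.f);   // normal similarity
+d = position[i] - position[j];
+float p_w = min(exp(-glm::dot(d, d) / p_phi), 1.f);   // position similarity
+
+// A neighbor only contributes strongly if it is similar in all three respects
+float weight = c_w * n_w * p_w * kernel[j];
+sum += color[j] * weight;
+cumulative_w += weight;  // used to normalize the accumulated sum at the end
+```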
+
+## Implementation
+The implementation was fairly straightforward. The paper provides some hints at implementation details toward the end, in the form of a GLSL fragment shader. I used a separate `void denoise()` CPU-side function wrapping a denoising kernel call. The assignment made it sound like we should denoise after every iteration of path tracing, and it's conceivable that this would have produced the best results. Instead, I invoke my denoising kernel a single time after all path tracing iterations have finished.
+
+## Questions
+### Quantitative
+1. The denoising procedure runs about as quickly as a single iteration of path tracing. This is a huge result: the visual gains from denoising are worth hundreds or even thousands of pathtrace iterations. The most efficient way to get a good result is clearly to perform a low number of pathtrace iterations followed by a single denoising pass.
+
+    ![qual1](img/qual_1.png)
+
+2. Without denoising, an acceptably smooth result is achieved by 1,000 iterations of path tracing (note that this is highly subjective). _With_ denoising, only 25 iterations are needed to achieve a comparably smooth result. The graininess in the whites is due to issues with color compression.
+
+    | 1,000 iterations of path tracing w/o denoising | 25 iterations of path tracing w/ denoising |
+    |---|---|
+    |![](img/1000_iter_pt.png)|![](img/25_iter_pt.png)|
+
+3. The runtime of this algorithm is clearly linear in the number of pixels, because the work done per pixel is constant for a given filter size. The slightly upward-bending curve suggests that there is some penalty in terms of hardware efficiency, e.g. block size.
+![](img/qual_3.png)
+
+4. The filter size is expanded on the CPU into a sequence of step widths:
+    ```
+    for (int power = 0; power < filterSize; power++) {
+        int stepWidth = 1 << power;
+        ...
+    }
+    ```
+    and then each thread inside the kernel on the GPU uses this `stepWidth`, along with an array of `glm::vec2`s, to compute tap offsets:
+    ```
+    for (int i = 0; i < 25; i++) {
+        glm::vec2 uv = pixelCoord + offset[i] * stepWidth;
+        ...
+    }
+    ```
+    Increasing the filter size changes the number of times the kernel executes, but not the cost of a single kernel invocation, so the runtime increases linearly. See the table below for how the footprint of a single pass grows with `stepWidth`.
+    ![](img/qual_4.png)
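+
+For intuition on how the footprint grows: a 5x5 tap pattern dilated by `stepWidth` spans `4 * stepWidth + 1` pixels, so each successive pass covers (a back-of-the-envelope figure; the effective support compounds across passes):
+
+| `power` | `stepWidth = 1 << power` | span of one pass |
+|---|---|---|
+| 0 | 1 | 5x5 |
+| 1 | 2 | 9x9 |
+| 2 | 4 | 17x17 |
+| 3 | 8 | 33x33 |
+| 4 | 16 | 65x65 |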
+
+### Qualitative
+1. Visual quality improves as the filter size increases up to about five (which translates to roughly 2^5*5=160, so a 160x160 footprint), after which point there is little improvement. This makes sense, since the "distance" in position and normal between the center pixel and the farthest-out pixels grows large enough that their color contributions are effectively erased.
+
+2. The denoising procedure seems to work best on diffuse materials with solid colors, because the colors of neighboring pixels are most likely to be similar. The diffuse sphere looks essentially perfect, while the edges of the reflective sphere still show some noise that couldn't be smoothed out.
+
+    | Diffuse sphere | Reflective sphere |
+    |---|---|
+    |![](img/diffuse_sphere.png)|![](img/reflective_sphere.png)|
+
+3. The results vary from scene to scene. Because the filter samples so few neighboring pixels compared to a full Gaussian filter, every pixel counts for a lot. In low-light scenes, where the image is extremely noisy, the denoising procedure struggled.
+
+    | Best cornell large light | Best cornell small light |
+    |---|---|
+    |![](img/best_cornell_biglight.png)|![](img/best_cornell_smalllight.png)|
+
+## Debug images
+| positions | normals |
+|---|---|
+|![](img/debug_pos.png)|![](img/debug_nor.png)|
+
+## Bloopers
+The bloopers were absolutely a highlight of the project. The bottom right is my favorite :)
+
+| | |
+|---|---|
+|![](img/blooper1.png)|![](img/blooper2.png)|
+|![](img/blooper3.png)|![](img/blooper4.png)|
\ No newline at end of file
diff --git a/img/1000_iter_pt.png b/img/1000_iter_pt.png
new file mode 100644
index 0000000..9b4e88f
Binary files /dev/null and b/img/1000_iter_pt.png differ
diff --git a/img/25_iter_pt.png b/img/25_iter_pt.png
new file mode 100644
index 0000000..24b4636
Binary files /dev/null and b/img/25_iter_pt.png differ
diff --git a/img/best_cornell_biglight.png b/img/best_cornell_biglight.png
new file mode 100644
index 0000000..a738645
Binary files /dev/null and b/img/best_cornell_biglight.png differ
diff --git a/img/best_cornell_smalllight.png b/img/best_cornell_smalllight.png
new file mode 100644
index 0000000..e03f48e
Binary files /dev/null and b/img/best_cornell_smalllight.png differ
diff --git a/img/blooper1.png b/img/blooper1.png
new file mode 100644
index 0000000..e438172
Binary files /dev/null and b/img/blooper1.png differ
diff --git a/img/blooper2.png b/img/blooper2.png
new file mode 100644
index 0000000..ff809bd
Binary files /dev/null and b/img/blooper2.png differ
diff --git a/img/blooper3.png b/img/blooper3.png
new file mode 100644
index 0000000..9772f22
Binary files /dev/null and b/img/blooper3.png differ
diff --git a/img/blooper4.png b/img/blooper4.png
new file mode 100644
index 0000000..a49d5c3
Binary files /dev/null and b/img/blooper4.png differ
diff --git a/img/debug_nor.png b/img/debug_nor.png
new file mode 100644
index 0000000..9b79c60
Binary files /dev/null and b/img/debug_nor.png differ
diff --git a/img/debug_pos.png b/img/debug_pos.png
new file mode 100644
index 0000000..1ecbe9c
Binary files /dev/null and b/img/debug_pos.png differ
diff --git a/img/diffuse_sphere.png b/img/diffuse_sphere.png
new file mode 100644
index 0000000..aeb7e5d
Binary files /dev/null and b/img/diffuse_sphere.png differ
diff --git a/img/qual_1.png b/img/qual_1.png
new file mode 100644
index 0000000..2b6b3e3
Binary files /dev/null and b/img/qual_1.png differ
diff --git a/img/qual_3.png b/img/qual_3.png
new file mode 100644
index 0000000..3a46da0
Binary files /dev/null and b/img/qual_3.png differ
diff --git a/img/qual_4.png b/img/qual_4.png
new file mode 100644
index 0000000..4a7a332
Binary files /dev/null and b/img/qual_4.png differ
diff --git a/img/reflective_sphere.png b/img/reflective_sphere.png
new file mode 100644
index 0000000..04f4e07
Binary files /dev/null and b/img/reflective_sphere.png differ
diff --git a/scenes/cornell.txt b/scenes/cornell.txt
index 83ff820..77ad551 100644
--- a/scenes/cornell.txt
+++ b/scenes/cornell.txt
@@ -52,7 +52,7 @@ EMITTANCE   0
 CAMERA
 RES         800 800
 FOVY        45
-ITERATIONS  5000
+ITERATIONS  10
 DEPTH       8
 FILE        cornell
 EYE         0.0 5 10.5
diff --git a/src/main.cpp b/src/main.cpp
index 4092ae4..f8bb3a8 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -1,6 +1,7 @@
 #include "main.h"
 #include "preview.h"
 #include <cstring>
+#include <chrono>
 
 #include "../imgui/imgui.h"
 #include "../imgui/imgui_impl_glfw.h"
@@ -23,11 +24,22 @@
 int ui_iterations = 0;
 int startupIterations = 0;
 int lastLoopIterations = 0;
 bool ui_showGbuffer = false;
+
 bool ui_denoise = false;
-int ui_filterSize = 80;
-float ui_colorWeight = 0.45f;
-float ui_normalWeight = 0.35f;
-float ui_positionWeight = 0.2f;
+bool lastLoopDenoise = false;
+
+int ui_filterSize = 5;
+int lastLoopFilterSize;
+
+float ui_colorWeight = 0.572f;
+float lastLoopColorWeight;
+
+float ui_normalWeight = 0.021f;
+float lastLoopNormalWeight;
+
+float ui_positionWeight = 0.789f;
+float lastLoopPositionWeight;
+
 bool ui_saveAndExit = false;
 
 static bool camchanged = true;
@@ -45,6 +57,8 @@
 int iteration;
 
 int width;
 int height;
 
+long duration_total_us;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -120,15 +134,41 @@ void saveImage() {
     //img.saveHDR(filename); // Save a Radiance HDR file
 }
 
+bool denoisingSettingChanged() {
+    bool settingChanged = false;
+
+    if (lastLoopFilterSize != ui_filterSize) {
+        lastLoopFilterSize = ui_filterSize;
+        settingChanged = true;
+    }
+
+    if (lastLoopColorWeight != ui_colorWeight) {
+        lastLoopColorWeight = ui_colorWeight;
+        settingChanged = true;
+    }
+
+    if (lastLoopNormalWeight != ui_normalWeight) {
+        lastLoopNormalWeight = ui_normalWeight;
+        settingChanged = true;
+    }
+
+    if (lastLoopPositionWeight != ui_positionWeight) {
+        lastLoopPositionWeight = ui_positionWeight;
+        settingChanged = true;
+    }
+
+    return settingChanged;
+}
+
 void runCuda() {
     if (lastLoopIterations != ui_iterations) {
-      lastLoopIterations = ui_iterations;
-      camchanged = true;
+        lastLoopIterations = ui_iterations;
+        camchanged = true;
     }
 
     if (camchanged) {
         iteration = 0;
-        Camera &cam = renderState->camera;
+        Camera& cam = renderState->camera;
         cameraPosition.x = zoom * sin(phi) * sin(theta);
         cameraPosition.y = zoom * cos(theta);
         cameraPosition.z = zoom * cos(phi) * sin(theta);
@@ -144,7 +184,7 @@
         cameraPosition += cam.lookAt;
         cam.position = cameraPosition;
         camchanged = false;
-      }
+    }
 
     // Map OpenGL buffer object for writing from CUDA on a single GPU
    // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
@@ -154,7 +194,7 @@
         pathtraceInit(scene);
     }
 
-    uchar4 *pbo_dptr = NULL;
+    uchar4* pbo_dptr = NULL;
     cudaGLMapBufferObject((void**)&pbo_dptr, pbo);
 
     if (iteration < ui_iterations) {
@@ -162,13 +202,43 @@
         // execute the kernel
         int frame = 0;
+
+        auto start = chrono::high_resolution_clock::now();
         pathtrace(frame, iteration);
+        duration_total_us += chrono::duration_cast<chrono::microseconds>(chrono::high_resolution_clock::now() - start).count();
+
+        if (iteration == ui_iterations) {
+            std::cout << "Pathtrace avg duration " << duration_total_us / ui_iterations << " us" << std::endl;
+            duration_total_us = 0;
+        }
+    }
+
+    if (ui_denoise && iteration == ui_iterations) {
+        if (denoisingSettingChanged() || lastLoopDenoise != ui_denoise) {
+            std::cout << "Need to denoise!" << std::endl;
+
+            lastLoopDenoise = ui_denoise;
+            denoiseFree();
+            denoiseInit(scene);
+
+            auto start = chrono::high_resolution_clock::now();
+            denoise(ui_filterSize, ui_colorWeight, ui_normalWeight, ui_positionWeight);
+            auto duration_us = chrono::duration_cast<chrono::microseconds>(chrono::high_resolution_clock::now() - start).count();
+
+            std::cout << "Denoising duration " << duration_us << " us" << std::endl;
+        }
+    }
+
+    if (lastLoopDenoise != ui_denoise) {
+        lastLoopDenoise = ui_denoise;
    }
 
    if (ui_showGbuffer) {
-     showGBuffer(pbo_dptr);
+        showGBuffer(pbo_dptr);
+    } else if (ui_denoise) {
+        showDenoise(pbo_dptr, iteration);
    } else {
-     showImage(pbo_dptr, iteration);
+        showImage(pbo_dptr, iteration);
    }
 
     // unmap buffer object
diff --git a/src/pathtrace.cu b/src/pathtrace.cu
index 23e5f90..99f84ed 100644
--- a/src/pathtrace.cu
+++ b/src/pathtrace.cu
@@ -67,19 +67,58 @@
     }
 }
 
+// Kernel that writes the denoised image to the OpenGL PBO directly.
+__global__ void sendDenoisedImageToPBO(uchar4* pbo, glm::ivec2 resolution, glm::vec3* image) {
+    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+    if (x < resolution.x && y < resolution.y) {
+        int index = x + (y * resolution.x);
+        glm::vec3 pix = image[index];
+
+        // Each thread writes one pixel location in the texture (texel)
+        pbo[index].w = 0;
+        pbo[index].x = pix.x * 20.f;
+        pbo[index].y = pix.y * 20.f;
+        pbo[index].z = pix.z * 20.f;
+    }
+}
+
 __global__ void gbufferToPBO(uchar4* pbo, glm::ivec2 resolution, GBufferPixel* gBuffer) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
 
+    // time-to-intersect visualization (disabled)
+    //if (x < resolution.x && y < resolution.y) {
+    //    int index = x + (y * resolution.x);
+    //    float timeToIntersect = gBuffer[index].t * 256.0;
+
+    //    pbo[index].w = 0;
+    //    pbo[index].x = timeToIntersect;
+    //    pbo[index].y = timeToIntersect;
+    //    pbo[index].z = timeToIntersect;
+    //}
+
+    // normal visualization
     if (x < resolution.x && y < resolution.y) {
         int index = x + (y * resolution.x);
-        float timeToIntersect = gBuffer[index].t * 256.0;
+        glm::vec3 normal2Color = glm::clamp(glm::normalize(gBuffer[index].normal) * 255.f, -255.f, 255.f);
 
         pbo[index].w = 0;
-        pbo[index].x = timeToIntersect;
-        pbo[index].y = timeToIntersect;
-        pbo[index].z = timeToIntersect;
+        pbo[index].x = normal2Color.x;
+        pbo[index].y = normal2Color.y;
+        pbo[index].z = normal2Color.z;
     }
+
+    // position visualization (disabled)
+    //if (x < resolution.x && y < resolution.y) {
+    //    int index = x + (y * resolution.x);
+    //    glm::vec3 position2Color = glm::clamp(glm::normalize(gBuffer[index].position) * 255.f, -255.f, 255.f);
+
+    //    pbo[index].w = 0;
+    //    pbo[index].x = position2Color.x;
+    //    pbo[index].y = position2Color.y;
+    //    pbo[index].z = position2Color.z;
+    //}
 }
 
 static Scene * hst_scene = NULL;
@@ -89,8 +128,53 @@
 static Material * dev_materials = NULL;
 static PathSegment * dev_paths = NULL;
 static ShadeableIntersection * dev_intersections = NULL;
 static GBufferPixel* dev_gBuffer = NULL;
-// TODO: static variables for device memory, any extra info you need, etc
-// ...
+
+static glm::vec2* dev_offsets = NULL;
+static float* dev_filter = NULL;
+
+static glm::vec3* dev_atrous_in = NULL;
+static glm::vec3* dev_atrous_out = NULL;
+
+glm::vec2* generateOffsets() {
+    // static so the array outlives this call; returning a non-static local
+    // here would hand the caller a dangling pointer
+    static glm::vec2 offsets[25];
+
+    int offsets_index = 0;
+
+    for (int i = -2; i < 3; i++) {
+        for (int j = -2; j < 3; j++) {
+            offsets[offsets_index++] = glm::vec2(i, j);
+        }
+    }
+
+    return offsets;
+}
+
+void denoiseInit(Scene *scene) {
+    hst_scene = scene;
+    const Camera& cam = hst_scene->state.camera;
+    const int pixelcount = cam.resolution.x * cam.resolution.y;
+
+    // 5x5 B3-spline kernel: the outer product of (1/16, 1/4, 3/8, 1/4, 1/16) with itself
+    float filter[25] = { 0.003765, 0.015019, 0.023792, 0.015019, 0.003765,
+        0.015019, 0.059912, 0.094907, 0.059912, 0.015019,
+        0.023792, 0.094907, 0.150342, 0.094907, 0.023792,
+        0.015019, 0.059912, 0.094907, 0.059912, 0.015019,
+        0.003765, 0.015019, 0.023792, 0.015019, 0.003765 };
+    cudaMalloc(&dev_filter, 25 * sizeof(float));
+    cudaMemcpy(dev_filter, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);
+
+    glm::vec2* offsets = generateOffsets();
+    cudaMalloc(&dev_offsets, 25 * sizeof(glm::vec2));
+    cudaMemcpy(dev_offsets, offsets, 25 * sizeof(glm::vec2), cudaMemcpyHostToDevice);
+
+    cudaMalloc(&dev_atrous_in, pixelcount * sizeof(glm::vec3));
+    cudaMemset(dev_atrous_in, 0, pixelcount * sizeof(glm::vec3));
+
+    cudaMalloc(&dev_atrous_out, pixelcount * sizeof(glm::vec3));
+    cudaMemset(dev_atrous_out, 0, pixelcount * sizeof(glm::vec3));
+
+    checkCUDAError("denoiseInit");
+    cudaDeviceSynchronize();
+}
 
 void pathtraceInit(Scene *scene) {
     hst_scene = scene;
@@ -113,33 +197,41 @@
 
     cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel));
 
-    // TODO: initialize any extra device memeory you need
+    cudaMalloc(&dev_materials, scene->materials.size() * sizeof(Material));
+    cudaMemcpy(dev_materials, scene->materials.data(), scene->materials.size() * sizeof(Material), cudaMemcpyHostToDevice);
 
     checkCUDAError("pathtraceInit");
+    cudaDeviceSynchronize();
+}
+
+void denoiseFree() {
+    cudaFree(dev_filter);
+    cudaFree(dev_offsets);
+    cudaFree(dev_atrous_in);
+    cudaFree(dev_atrous_out);
+
+    checkCUDAError("denoiseFree");
+    cudaDeviceSynchronize();
 }
 
 void pathtraceFree() {
-    cudaFree(dev_image);  // no-op if dev_image is null
+    // no-op if dev_image is null
+    cudaFree(dev_image);
     cudaFree(dev_paths);
     cudaFree(dev_geoms);
     cudaFree(dev_materials);
     cudaFree(dev_intersections);
     cudaFree(dev_gBuffer);
-    // TODO: clean up any extra device memory you created
 
     checkCUDAError("pathtraceFree");
+    cudaDeviceSynchronize();
 }
 
-/**
-* Generate PathSegments with rays from the camera through the screen into the
-* scene, which is the first bounce of rays.
-*
-* Antialiasing - add rays for sub-pixel sampling
-* motion blur - jitter rays "in time"
-* lens effect - jitter ray origin positions based on a lens
-*/
-__global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, PathSegment* pathSegments)
-{
+__global__ void generateRayFromCamera(
+    Camera cam,
+    int iter, int traceDepth,
+    PathSegment* pathSegments
+) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
 
@@ -148,31 +240,26 @@
         PathSegment & segment = pathSegments[index];
 
         segment.ray.origin = cam.position;
-      segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
+        segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
 
         segment.ray.direction = glm::normalize(cam.view
             - cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
-            - cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f)
-            );
+            - cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f));
 
-        segment.pixelIndex = index;
-        segment.remainingBounces = traceDepth;
+        segment.pixelIndex = index;
+        segment.remainingBounces = traceDepth;
     }
 }
 
 __global__ void computeIntersections(
-    int depth
-    , int num_paths
-    , PathSegment * pathSegments
-    , Geom * geoms
-    , int geoms_size
-    , ShadeableIntersection * intersections
-    )
-{
+    int depth, int num_paths,
+    PathSegment * pathSegments,
+    Geom * geoms, int geoms_size,
+    ShadeableIntersection * intersections
+) {
     int path_index = blockIdx.x * blockDim.x + threadIdx.x;
 
-    if (path_index < num_paths)
-    {
+    if (path_index < num_paths) {
        PathSegment pathSegment = pathSegments[path_index];
 
        float t;
@@ -187,23 +274,18 @@
 
        // naive parse through global geoms
-       for (int i = 0; i < geoms_size; i++)
-       {
+       for (int i = 0; i < geoms_size; i++) {
            Geom & geom = geoms[i];
 
-           if (geom.type == CUBE)
-           {
+           if (geom.type == CUBE) {
                t = boxIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
-           }
-           else if (geom.type == SPHERE)
-           {
+           } else if (geom.type == SPHERE) {
                t = sphereIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
            }
 
            // Compute the minimum t from the intersection tests to determine what
            // scene geometry object was hit first.
-           if (t > 0.0f && t_min > t)
-           {
+           if (t > 0.0f && t_min > t) {
                t_min = t;
                hit_geom_index = i;
                intersect_point = tmp_intersect;
@@ -211,12 +293,9 @@
            }
        }
 
-       if (hit_geom_index == -1)
-       {
+       if (hit_geom_index == -1) {
            intersections[path_index].t = -1.0f;
-       }
-       else
-       {
+       } else {
            //The ray hits something
            intersections[path_index].t = t_min;
            intersections[path_index].materialId = geoms[hit_geom_index].materialid;
@@ -226,72 +305,125 @@
    }
 }
 
 __global__ void shadeSimpleMaterials (
-   int iter
-   , int num_paths
-   , ShadeableIntersection * shadeableIntersections
-   , PathSegment * pathSegments
-   , Material * materials
-   )
-{
-   int idx = blockIdx.x * blockDim.x + threadIdx.x;
-   if (idx < num_paths)
-   {
-       ShadeableIntersection intersection = shadeableIntersections[idx];
-       PathSegment segment = pathSegments[idx];
-       if (segment.remainingBounces == 0) {
-           return;
+   int iter, int num_paths,
+   ShadeableIntersection * shadeableIntersections,
+   PathSegment * pathSegments, Material * materials
+) {
+   int idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+   if (idx < num_paths) {
+       ShadeableIntersection intersection = shadeableIntersections[idx];
+       PathSegment segment = pathSegments[idx];
+
+       if (segment.remainingBounces == 0) {
+           return;
+       }
+
+       if (intersection.t > 0.0f) {
+           segment.remainingBounces--;
+           thrust::default_random_engine rng = makeSeededRandomEngine(iter, idx, segment.remainingBounces);
+
+           Material material = materials[intersection.materialId];
+           glm::vec3 materialColor = material.color;
+
+           // If the material indicates that the object was a light, "light" the ray
+           if (material.emittance > 0.0f) {
+               segment.color *= (materialColor * material.emittance);
+               segment.remainingBounces = 0;
+           } else {
+               segment.color *= materialColor;
+               glm::vec3 intersectPos = intersection.t * segment.ray.direction + segment.ray.origin;
+               scatterRay(segment, intersectPos, intersection.surfaceNormal, material, rng);
+           }
+
+       // If there was no intersection, color the ray black.
+       // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
+       // used for opacity, in which case they can indicate "no opacity".
+       // This can be useful for post-processing and image compositing.
+       } else {
+           segment.color = glm::vec3(0.0f);
+           segment.remainingBounces = 0;
+       }
+
+       pathSegments[idx] = segment;
+   }
+}
 
-       if (intersection.t > 0.0f) { // if the intersection exists...
-           segment.remainingBounces--;
-           // Set up the RNG
-           thrust::default_random_engine rng = makeSeededRandomEngine(iter, idx, segment.remainingBounces);
-
-           Material material = materials[intersection.materialId];
-           glm::vec3 materialColor = material.color;
-
-           // If the material indicates that the object was a light, "light" the ray
-           if (material.emittance > 0.0f) {
-               segment.color *= (materialColor * material.emittance);
-               segment.remainingBounces = 0;
-           }
-           else {
-               segment.color *= materialColor;
-               glm::vec3 intersectPos = intersection.t * segment.ray.direction + segment.ray.origin;
-               scatterRay(segment, intersectPos, intersection.surfaceNormal, material, rng);
-           }
-       // If there was no intersection, color the ray black.
-       // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
-       // used for opacity, in which case they can indicate "no opacity".
-       // This can be useful for post-processing and image compositing.
-       } else {
-           segment.color = glm::vec3(0.0f);
-           segment.remainingBounces = 0;
-       }
-
-       pathSegments[idx] = segment;
-   }
-}
 
+__global__ void generateGBuffer (
+   int num_paths,
+   ShadeableIntersection* shadeableIntersections,
+   PathSegment* pathSegments,
+   GBufferPixel* gBuffer
+) {
+   int idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+   if (idx < num_paths) {
+       gBuffer[idx].normal = shadeableIntersections[idx].surfaceNormal;
+       // Reconstruct the first-hit position from the ray and the intersection distance
+       gBuffer[idx].position = getPointOnRay(pathSegments[idx].ray, shadeableIntersections[idx].t);
+   }
+}
-__global__ void generateGBuffer (
-   int num_paths,
-   ShadeableIntersection* shadeableIntersections,
-   PathSegment* pathSegments,
-   GBufferPixel* gBuffer) {
-   int idx = blockIdx.x * blockDim.x + threadIdx.x;
-   if (idx < num_paths)
-   {
-       gBuffer[idx].t = shadeableIntersections[idx].t;
-   }
-}
 
+__global__ void kernDenoise(
+   int resX, int resY,
+   GBufferPixel* gBuffer,
+   glm::vec3* image,
+   float* kernel,
+   glm::vec2* offset,
+   float stepWidth,
+   float c_phi, float n_phi, float p_phi,
+   glm::vec3* atrous_in, glm::vec3* atrous_out
+) {
+   const int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+   const int pixelcount = resX * resY;
+
+   if (index < pixelcount) {
+       const int pixelY = int(index / resX);
+       const int pixelX = index - (pixelY * resX);
+       glm::vec2 pixelCoord = glm::vec2(pixelX, pixelY);
+
+       glm::vec3 cval = atrous_in[index];
+       glm::vec3 nval = gBuffer[index].normal;
+       glm::vec3 pval = gBuffer[index].position;
+
+       glm::vec3 sum(0.f);
+       float cumulative_w = 0.0;
+
+       for (int i = 0; i < 25; i++) {
+           glm::vec2 uv = pixelCoord + offset[i] * stepWidth;
+
+           // Skip taps that fall outside the image; checking x and y separately
+           // prevents out-of-range x coordinates from wrapping onto adjacent rows
+           if (0 <= uv.x && uv.x < resX && 0 <= uv.y && uv.y < resY) {
+               int uvIndex = int(uv.x) + int(uv.y) * resX;
+
+               glm::vec3 ctmp = atrous_in[uvIndex];
+               glm::vec3 t = cval - ctmp;
+               float dist2 = glm::dot(t, t);
+               float c_w = min(exp(-(dist2) / c_phi), float(1.0));
+
+               glm::vec3 ntmp = gBuffer[uvIndex].normal;
+               t = nval - ntmp;
+               dist2 = max(glm::dot(t, t) / (stepWidth * stepWidth), float(0.0));
+               float n_w = min(exp(-(dist2) / n_phi), float(1.0));
+
+               glm::vec3 ptmp = gBuffer[uvIndex].position;
+               t = pval - ptmp;
+               dist2 = glm::dot(t, t);
+               float p_w = min(exp(-(dist2) / p_phi), float(1.0));
+
+               // Edge-stopping: similarity in color, normal, and position
+               // all modulate the base B3-spline kernel weight
+               float weight = c_w * n_w * p_w * kernel[i];
+
+               sum += ctmp * weight;
+               cumulative_w += weight;
+           }
+       }
+
+       atrous_out[index] = sum / cumulative_w;
+   }
+}
 
 // Add the current iteration's output to the overall image
-__global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterationPaths)
-{
+__global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterationPaths) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
 
-   if (index < nPaths)
-   {
+   if (index < nPaths) {
        PathSegment iterationPath = iterationPaths[index];
        image[iterationPath.pixelIndex] += iterationPath.color;
    }
@@ -309,39 +441,12 @@
    // 2D block for generating ray from camera
    const dim3 blockSize2d(8, 8);
    const dim3 blocksPerGrid2d(
-       (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
-       (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+       (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+       (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
 
    // 1D block for path tracing
    const int blockSize1d = 128;
 
-   ///////////////////////////////////////////////////////////////////////////
-
-   // Pathtracing Recap:
-   // * Initialize array of path rays (using rays that come out of the camera)
-   //   * You can pass the Camera object to that kernel.
-   //   * Each path ray must carry at minimum a (ray, color) pair,
-   //   * where color starts as the multiplicative identity, white = (1, 1, 1).
-   //   * This has already been done for you.
-   // * NEW: For the first depth, generate geometry buffers (gbuffers)
-   // * For each depth:
-   //   * Compute an intersection in the scene for each path ray.
-   //     A very naive version of this has been implemented for you, but feel
-   //     free to add more primitives and/or a better algorithm.
-   //     Currently, intersection distance is recorded as a parametric distance,
-   //     t, or a "distance along the ray." t = -1.0 indicates no intersection.
-   //   * Color is attenuated (multiplied) by reflections off of any object
-   // * Stream compact away all of the terminated paths.
-   //   You may use either your implementation or `thrust::remove_if` or its
-   //   cousins.
-   //   * Note that you can't really use a 2D kernel launch any more - switch
-   //     to 1D.
-   // * Shade the rays that intersected something or didn't bottom out.
-   //   That is, color the ray by performing a color computation according
-   //   to the shader, then generate a new ray to continue the ray path.
-   //   We recommend just updating the ray's PathSegment in place.
-   //   Note that this step may come before or after stream compaction,
-   //   since some shaders you write may also cause a path to terminate.
    // * Finally:
    //   * if not denoising, add this iteration's results to the image
    //   * TODO: if denoising, run kernels that take both the raw pathtraced result and the gbuffer, and put the result in the "pbo" from opengl
@@ -356,46 +461,36 @@
    // --- PathSegment Tracing Stage ---
    // Shoot ray into scene, bounce between objects, push shading chunks
 
-  // Empty gbuffer
-  cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel));
+   // Empty gbuffer
+   cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel));
 
    // clean shading chunks
    cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
 
-  bool iterationComplete = false;
+   bool iterationComplete = false;
    while (!iterationComplete) {
-   // tracing
-   dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
-   computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
-       depth
-       , num_paths
-       , dev_paths
-       , dev_geoms
-       , hst_scene->geoms.size()
-       , dev_intersections
-       );
-   checkCUDAError("trace one bounce");
-   cudaDeviceSynchronize();
-
-   if (depth == 0) {
-       generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
-   }
-
-   depth++;
-
-   shadeSimpleMaterials<<<numblocksPathSegmentTracing, blockSize1d>>> (
-       iter,
-       num_paths,
-       dev_intersections,
-       dev_paths,
-       dev_materials
-   );
-   iterationComplete = depth == traceDepth;
-   }
+       // tracing
+       dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
+       computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
+           depth, num_paths, dev_paths, dev_geoms, hst_scene->geoms.size(), dev_intersections);
+       checkCUDAError("trace one bounce");
+       cudaDeviceSynchronize();
+
+       if (depth == 0) {
+           // Run on the first bounce only, so the gbuffer holds first-hit data
+           generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
+       }
+
+       depth++;
 
-   // Assemble this iteration and apply it to the image
-   dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
+       shadeSimpleMaterials<<<numblocksPathSegmentTracing, blockSize1d>>> (
+           iter, num_paths, dev_intersections, dev_paths, dev_materials);
+       iterationComplete = depth == traceDepth;
+   }
+
+   // Assemble this iteration and apply it to the image
+   dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
    finalGather<<<numBlocksPixels, blockSize1d>>>(num_paths, dev_image, dev_paths);
 
    ///////////////////////////////////////////////////////////////////////////
 
@@ -407,9 +502,39 @@
        pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
 
    checkCUDAError("pathtrace");
+   cudaDeviceSynchronize();
+}
+
+void denoise(int filterSize, float c_phi, float n_phi, float p_phi) {
+   std::cout << "Denoising with weights " << c_phi << " " << n_phi << " " << p_phi << std::endl;
+
+   const Camera& cam = hst_scene->state.camera;
+   const int pixelcount = cam.resolution.x * cam.resolution.y;
+
+   // Seed the ping-pong buffers with the accumulated pathtraced image
+   cudaMemcpy(dev_atrous_in, dev_image, pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
+   checkCUDAError("Filling dev_atrous_in");
+
+   const int blockSize1d = 128;
+   dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
+
+   for (int power = 0; power < filterSize; power++) {
+       int stepWidth = 1 << power;
+
+       kernDenoise<<<numBlocksPixels, blockSize1d>>>(
+           cam.resolution.x, cam.resolution.y,
+           dev_gBuffer, dev_image,
+           dev_filter, dev_offsets,
+           stepWidth, c_phi, n_phi, p_phi,
+           dev_atrous_in, dev_atrous_out);
+
+       // Ping-pong: this pass's output is the next pass's input
+       glm::vec3* tmp = dev_atrous_in;
+       dev_atrous_in = dev_atrous_out;
+       dev_atrous_out = tmp;
+   }
+
+   cudaDeviceSynchronize();
 }
 
-// CHECKITOUT: this kernel "post-processes" the gbuffer/gbuffers into something that you can visualize for debugging.
 void showGBuffer(uchar4* pbo) {
    const Camera &cam = hst_scene->state.camera;
    const dim3 blockSize2d(8, 8);
@@ -431,3 +556,16 @@
    // Send results to OpenGL buffer for rendering
    sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, iter, dev_image);
 }
+
+void showDenoise(uchar4* pbo, int iter) {
+   const Camera& cam = hst_scene->state.camera;
+   const dim3 blockSize2d(8, 8);
+   const dim3 blocksPerGrid2d(
+       (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+       (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+   // Send the denoised result (left in dev_atrous_in after the final ping-pong swap) to the OpenGL buffer
+   //sendDenoisedImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, dev_atrous_in);
+   sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, iter, dev_atrous_in);
+}
diff --git a/src/pathtrace.h b/src/pathtrace.h
index 9e12f44..37e8354 100644
--- a/src/pathtrace.h
+++ b/src/pathtrace.h
@@ -6,5 +6,11 @@
 void pathtraceInit(Scene *scene);
 void pathtraceFree();
 void pathtrace(int frame, int iteration);
+
+void denoiseInit(Scene *scene);
+void denoiseFree();
+void denoise(int filterSize, float c_phi, float n_phi, float p_phi);
+
 void showGBuffer(uchar4 *pbo);
 void showImage(uchar4 *pbo, int iter);
+void showDenoise(uchar4* pbo, int iter);
diff --git a/src/preview.cpp b/src/preview.cpp
index 3ca2718..83305d3 100644
--- a/src/preview.cpp
+++ b/src/preview.cpp
@@ -214,10 +214,10 @@ void drawGui(int windowWidth, int windowHeight) {
 
    ImGui::Checkbox("Denoise", &ui_denoise);
 
-   ImGui::SliderInt("Filter Size", &ui_filterSize, 0, 100);
-   ImGui::SliderFloat("Color Weight", &ui_colorWeight, 0.0f, 10.0f);
-   ImGui::SliderFloat("Normal Weight", &ui_normalWeight, 0.0f, 10.0f);
-   ImGui::SliderFloat("Position Weight", &ui_positionWeight, 0.0f, 10.0f);
+   ImGui::SliderInt("Filter Size", &ui_filterSize, 1, 10);
+   ImGui::SliderFloat("Color Weight", &ui_colorWeight, 0.001f, 10.0f);
+   ImGui::SliderFloat("Normal Weight", &ui_normalWeight, 0.001f, 1.0f);
+   ImGui::SliderFloat("Position Weight", &ui_positionWeight, 0.001f, 1.0f);
 
    ImGui::Separator();
 
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
index da7e558..aff2771 100644
--- a/src/sceneStructs.h
+++ b/src/sceneStructs.h
@@ -66,17 +66,15 @@
 struct PathSegment {
    int remainingBounces;
 };
 
-// Use with a corresponding PathSegment to do:
-// 1) color contribution computation
-// 2) BSDF evaluation: generate a new ray
 struct ShadeableIntersection {
-  float t;
-  glm::vec3 surfaceNormal;
-  int materialId;
+   float t;
+   glm::vec3 surfaceNormal;
+   int materialId;
 };
 
-// CHECKITOUT - a simple struct for storing scene geometry information per-pixel.
-// What information might be helpful for guiding a denoising filter?
+// Per-pixel first-hit scene information used to guide the denoising filter
 struct GBufferPixel {
-  float t;
+   float t;
+   glm::vec3 normal;
+   glm::vec3 position;
+   glm::vec3 color;
 };