diff --git a/README.md b/README.md
index f044c821..9005c508 100644
--- a/README.md
+++ b/README.md
@@ -3,11 +3,102 @@ CUDA Denoiser For CUDA Path Tracer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yu-Chia Shen
+  * [LinkedIn](https://www.linkedin.com/in/ycshen0831/)
+* Tested on: Windows 10, i5-11400F @ 4.3GHz 16GB, RTX 3060 12GB (personal)
 
-### (TODO: Your README)
+# Overview
+This project implements a denoiser for a CUDA path tracer, guided by the color image and two geometry buffers (G-buffers): normals and positions. The denoiser smooths the image without blurring the edges between objects. The technique is based on the paper "Edge-Avoiding À-Trous Wavelet Transform for fast Global Illumination Filtering", available here: https://jo.dreggn.org/home/2010_atrous.pdf
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+![](./img/Laterns.png)
 
+# Result
+
+## **Before & After Comparison**
+| Raw Image | After Denoising |
+| :--------------------------: | :-------------------: | 
+| ![](./img/before.png) | ![](./img/after.png) |
+
+| Blurred Image | After Denoising |
+| :--------------------------: | :-------------------: | 
+| ![](./img/beforeBlur.jpg) | ![](./img/after.png) |
+
+## **GBuffer**
+| Normal Buffer |  Position Buffer | 
+| :--------------------------: | :-------------------: | 
+| ![](./img/normals_img.png) | ![](./img/positions_img.png) |
+
+# Performance Analysis
+
+## Time Analysis for Denoiser
+
+The following chart shows that the denoiser takes approximately as long as one iteration of the path tracer. Since denoising only needs to be applied after the final iteration, it does not add much to the total render time: its cost is roughly that of one extra path-tracing iteration.
+
+![](./img/Resolution%20(pixel)%20vs%20Run%20Time%20(ms)%20vs%20iter.png)
+
+## Visual Effect Analysis
+
+### Visual Effect
+| 1000 iterations without denoiser | 6000 iterations without denoiser | Denoised image, 100 iterations |
+| :--------------------------: | :-------------------: | :-------------------: | 
+| ![](./img/iteration/a1000.png) | ![](./img/iteration/a6000.png)  | ![](./img/iteration/aResult.png) | 
+
+The path tracer needs about 6000 iterations to match the quality of the denoised 100-iteration image. Even after 1000 iterations, the image without denoising is still noisy when zoomed in. The denoiser therefore greatly improves visual quality with far fewer iterations.
+
+### Image Difference Using Tools
+| 500 iterations without denoiser | Difference | Denoised image, 100 iterations |
+| :--------------------------: | :-------------------: | :-------------------: | 
+| ![](./img/iteration/cornell.2022-10-22_23-16-51z.500.0samp.png) | ![](./img/vs/vs500.png)  | ![](./img/denoiseResult.png) | 
+
+| 1000 iterations without denoiser | Difference | Denoised image, 100 iterations |
+| :--------------------------: | :-------------------: | :-------------------: | 
+| ![](./img/iteration/cornell.2022-10-22_19-24-22z.1000.0samp.png) | ![](./img/vs/vs1000.png)  | ![](./img/denoiseResult.png) | 
+
+| 6000 iterations without denoiser | Difference | Denoised image, 100 iterations |
+| :--------------------------: | :-------------------: | :-------------------: | 
+| ![](./img/iteration/cornell.2022-10-22_19-24-22z.6000.5samp.png) | ![](./img/vs/vs6000.png)  | ![](./img/denoiseResult.png) | 
+
+## **Resolution vs Execution Time**
+![](./img/chart/Resolution%20(pixel)%20vs%20Run%20Time%20(ms).png)
+
+## **Filter Size vs Execution Time**
+![](./img/chart/Filter%20Size%20(pixel)%20vs%20Run%20Time%20(ms).png)
+
+## **Visual Effect with Different Iteration Counts**
+| Denoised image, 5 iterations | Denoised image, 10 iterations | Denoised image, 20 iterations |
+| :--------------------------: | :-------------------: | :-------------------: | 
+| ![](./img/denoiseIter/cornell.2022-10-22_23-28-04z.5.5samp.png) | ![](./img/denoiseIter/cornell.2022-10-22_23-28-04z.10.4samp.png)  | ![](./img/denoiseIter/cornell.2022-10-22_23-28-04z.20.3samp.png) | 
+
+| Denoised image, 40 iterations | Denoised image, 80 iterations | Denoised image, 100 iterations |
+| :--------------------------: | :-------------------: | :-------------------: | 
+| ![](./img/denoiseIter/cornell.2022-10-22_23-28-04z.40.2samp.png) | ![](./img/denoiseIter/cornell.2022-10-22_23-28-04z.80.1samp.png)  | ![](./img/denoiseIter/cornell.2022-10-22_23-28-04z.100.0samp.png) | 
+
+Visual quality improves as the iteration count increases, but with diminishing returns: the images at 80 and 100 iterations look nearly identical.
+
+## **Visual Effect with Different Filter Sizes**
+| 8 x 8 | 20 x 20 |  40 x 40 |
+| :--------------------------: | :-------------------: |  :-------------------: | 
+| ![](./img/visual_filter_size/f8.png)  | ![](./img/visual_filter_size/f20.png) | ![](./img/visual_filter_size/f40.png) |
+
+| 80 x 80 | 200 x 200 | 400 x 400 |
+| :--------------------------: | :-------------------: |  :-------------------: | 
+| ![](./img/visual_filter_size/f80.png)  | ![](./img/visual_filter_size/f200.png) | ![](./img/visual_filter_size/f400.png) |
+
+Visual quality improves as the filter size increases, and object surfaces become smoother with larger filters. However, the improvement levels off: beyond 80 x 80 there is no visible further improvement, so visual quality does not scale uniformly with filter size.
+
+## Visual Effect with Different Materials
+
+| Material Type | Before | After |
+| :--------------------------: | :-------------------: |  :-------------------: | 
+| Diffuse  | ![](./img/material/diffuse.png) | ![](./img/material/diffuseDenoise.png) |
+| Specular  | ![](./img/before.png) | ![](./img/after.png) |
+
+The denoiser works better on diffuse objects than on specular ones: the contours on the specular object are blurred after denoising.
+
+## Visual Effect with Different Scenes
+| Light Source | 100 iterations | 1000 iterations |
+| :--------------------------: | :-------------------: |  :-------------------: | 
+| Small Light Source  | ![](./img/NewScene/cornell.2022-10-22_23-49-30z.101.1samp.png) | ![](./img/NewScene/cornell.2022-10-22_23-49-30z.1026.3samp.png) |
+| Large Light Source  | ![](./img/after.png) | ![](./img/NewScene/Large.png) |
+
+At 100 iterations, the scene with the large light source gives a better denoised result. With a small light source, fewer paths reach the light, so the image is much noisier before denoising, and too much noise in the input leads to a poor denoised result. Once the image has converged further, after 1000 iterations, the denoised result improves.
diff --git a/img/Filter Size (pixel) vs Run Time (ms) (1).png b/img/Filter Size (pixel) vs Run Time (ms) (1).png
new file mode 100644
index 00000000..f08b0dcc
Binary files /dev/null and b/img/Filter Size (pixel) vs Run Time (ms) (1).png differ
diff --git a/img/Filter Size (pixel) vs Run Time (ms).png b/img/Filter Size (pixel) vs Run Time (ms).png
new file mode 100644
index 00000000..983aed28
Binary files /dev/null and b/img/Filter Size (pixel) vs Run Time (ms).png differ
diff --git a/img/NewScene/Large.png b/img/NewScene/Large.png
new file mode 100644
index 00000000..2dd61c58
Binary files /dev/null and b/img/NewScene/Large.png differ
diff --git a/img/NewScene/cornell.2022-10-22_23-49-30z.101.1samp.png b/img/NewScene/cornell.2022-10-22_23-49-30z.101.1samp.png
new file mode 100644
index 00000000..3715412e
Binary files /dev/null and b/img/NewScene/cornell.2022-10-22_23-49-30z.101.1samp.png differ
diff --git a/img/NewScene/cornell.2022-10-22_23-49-30z.1026.3samp.png b/img/NewScene/cornell.2022-10-22_23-49-30z.1026.3samp.png
new file mode 100644
index 00000000..14b2d148
Binary files /dev/null and b/img/NewScene/cornell.2022-10-22_23-49-30z.1026.3samp.png differ
diff --git a/img/Resolution (pixel) vs Run Time (ms) vs iter.png b/img/Resolution (pixel) vs Run Time (ms) vs iter.png
new file mode 100644
index 00000000..dc0beffc
Binary files /dev/null and b/img/Resolution (pixel) vs Run Time (ms) vs iter.png differ
diff --git a/img/after.png b/img/after.png
new file mode 100644
index 00000000..a5e73b6f
Binary files /dev/null and b/img/after.png differ
diff --git a/img/before.png b/img/before.png
new file mode 100644
index 00000000..bf0e70b8
Binary files /dev/null and b/img/before.png differ
diff --git a/img/beforeBlur.jpg b/img/beforeBlur.jpg
new file mode 100644
index 00000000..bed9776e
Binary files /dev/null and b/img/beforeBlur.jpg differ
diff --git a/img/chart/Filter Size (pixel) vs Run Time (ms).png b/img/chart/Filter Size (pixel) vs Run Time (ms).png
new file mode 100644
index 00000000..983aed28
Binary files /dev/null and b/img/chart/Filter Size (pixel) vs Run Time (ms).png differ
diff --git a/img/chart/Resolution (pixel) vs Run Time (ms).png b/img/chart/Resolution (pixel) vs Run Time (ms).png
new file mode 100644
index 00000000..8b89d9aa
Binary files /dev/null and b/img/chart/Resolution (pixel) vs Run Time (ms).png differ
diff --git a/img/cornell.2022-10-22_04-08-05z.0samp.png b/img/cornell.2022-10-22_04-08-05z.0samp.png
new file mode 100644
index 00000000..163c3199
Binary files /dev/null and b/img/cornell.2022-10-22_04-08-05z.0samp.png differ
diff --git a/img/cornell.2022-10-22_04-08-05z.100samp.png b/img/cornell.2022-10-22_04-08-05z.100samp.png
new file mode 100644
index 00000000..d63b1b83
Binary files /dev/null and b/img/cornell.2022-10-22_04-08-05z.100samp.png differ
diff --git a/img/cornell.2022-10-22_04-08-05z.25samp.png b/img/cornell.2022-10-22_04-08-05z.25samp.png
new file mode 100644
index 00000000..fcd87f9f
Binary files /dev/null and b/img/cornell.2022-10-22_04-08-05z.25samp.png differ
diff --git a/img/cornell.2022-10-22_04-08-05z.50samp.png b/img/cornell.2022-10-22_04-08-05z.50samp.png
new file mode 100644
index 00000000..45345587
Binary files /dev/null and b/img/cornell.2022-10-22_04-08-05z.50samp.png differ
diff --git a/img/cornell.2022-10-22_04-08-05z.75samp.png b/img/cornell.2022-10-22_04-08-05z.75samp.png
new file mode 100644
index 00000000..d63b1b83
Binary files /dev/null and b/img/cornell.2022-10-22_04-08-05z.75samp.png differ
diff --git a/img/denoiseIter/cornell.2022-10-22_23-28-04z.10.4samp.png b/img/denoiseIter/cornell.2022-10-22_23-28-04z.10.4samp.png
new file mode 100644
index 00000000..7b58e215
Binary files /dev/null and b/img/denoiseIter/cornell.2022-10-22_23-28-04z.10.4samp.png differ
diff --git a/img/denoiseIter/cornell.2022-10-22_23-28-04z.100.0samp.png b/img/denoiseIter/cornell.2022-10-22_23-28-04z.100.0samp.png
new file mode 100644
index 00000000..0232480d
Binary files /dev/null and b/img/denoiseIter/cornell.2022-10-22_23-28-04z.100.0samp.png differ
diff --git a/img/denoiseIter/cornell.2022-10-22_23-28-04z.20.3samp.png b/img/denoiseIter/cornell.2022-10-22_23-28-04z.20.3samp.png
new file mode 100644
index 00000000..87df3c99
Binary files /dev/null and b/img/denoiseIter/cornell.2022-10-22_23-28-04z.20.3samp.png differ
diff --git a/img/denoiseIter/cornell.2022-10-22_23-28-04z.40.2samp.png b/img/denoiseIter/cornell.2022-10-22_23-28-04z.40.2samp.png
new file mode 100644
index 00000000..b8758881
Binary files /dev/null and b/img/denoiseIter/cornell.2022-10-22_23-28-04z.40.2samp.png differ
diff --git a/img/denoiseIter/cornell.2022-10-22_23-28-04z.5.5samp.png b/img/denoiseIter/cornell.2022-10-22_23-28-04z.5.5samp.png
new file mode 100644
index 00000000..7454305e
Binary files /dev/null and b/img/denoiseIter/cornell.2022-10-22_23-28-04z.5.5samp.png differ
diff --git a/img/denoiseIter/cornell.2022-10-22_23-28-04z.80.1samp.png b/img/denoiseIter/cornell.2022-10-22_23-28-04z.80.1samp.png
new file mode 100644
index 00000000..b3931613
Binary files /dev/null and b/img/denoiseIter/cornell.2022-10-22_23-28-04z.80.1samp.png differ
diff --git a/img/denoiseResult.png b/img/denoiseResult.png
new file mode 100644
index 00000000..48a2c09f
Binary files /dev/null and b/img/denoiseResult.png differ
diff --git a/img/iteration/a1000.png b/img/iteration/a1000.png
new file mode 100644
index 00000000..a57035f1
Binary files /dev/null and b/img/iteration/a1000.png differ
diff --git a/img/iteration/a6000.png b/img/iteration/a6000.png
new file mode 100644
index 00000000..7436858d
Binary files /dev/null and b/img/iteration/a6000.png differ
diff --git a/img/iteration/aResult.png b/img/iteration/aResult.png
new file mode 100644
index 00000000..a3650d9a
Binary files /dev/null and b/img/iteration/aResult.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.1000.0samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.1000.0samp.png
new file mode 100644
index 00000000..23f3857b
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.1000.0samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.10000.9samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.10000.9samp.png
new file mode 100644
index 00000000..3c23b908
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.10000.9samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.2000.1samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.2000.1samp.png
new file mode 100644
index 00000000..54b439dc
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.2000.1samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.3000.2samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.3000.2samp.png
new file mode 100644
index 00000000..9d0060e9
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.3000.2samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.4000.3samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.4000.3samp.png
new file mode 100644
index 00000000..b04c2ed0
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.4000.3samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.5000.4samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.5000.4samp.png
new file mode 100644
index 00000000..bb70aa8a
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.5000.4samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.6000.5samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.6000.5samp.png
new file mode 100644
index 00000000..b1cab3d6
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.6000.5samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.7000.6samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.7000.6samp.png
new file mode 100644
index 00000000..a9a6e672
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.7000.6samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.8000.7samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.8000.7samp.png
new file mode 100644
index 00000000..c6ac11fa
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.8000.7samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_19-24-22z.9000.8samp.png b/img/iteration/cornell.2022-10-22_19-24-22z.9000.8samp.png
new file mode 100644
index 00000000..b5132076
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_19-24-22z.9000.8samp.png differ
diff --git a/img/iteration/cornell.2022-10-22_23-16-51z.500.0samp.png b/img/iteration/cornell.2022-10-22_23-16-51z.500.0samp.png
new file mode 100644
index 00000000..81f507fd
Binary files /dev/null and b/img/iteration/cornell.2022-10-22_23-16-51z.500.0samp.png differ
diff --git a/img/material/diffuse.png b/img/material/diffuse.png
new file mode 100644
index 00000000..2a798d9b
Binary files /dev/null and b/img/material/diffuse.png differ
diff --git a/img/material/diffuseDenoise.png b/img/material/diffuseDenoise.png
new file mode 100644
index 00000000..23a2d320
Binary files /dev/null and b/img/material/diffuseDenoise.png differ
diff --git a/img/normals_img.png b/img/normals_img.png
new file mode 100644
index 00000000..9bd9dc93
Binary files /dev/null and b/img/normals_img.png differ
diff --git a/img/positions_img.png b/img/positions_img.png
new file mode 100644
index 00000000..679baa0b
Binary files /dev/null and b/img/positions_img.png differ
diff --git a/img/visual_filter_size/0.png b/img/visual_filter_size/0.png
new file mode 100644
index 00000000..6f89deff
Binary files /dev/null and b/img/visual_filter_size/0.png differ
diff --git a/img/visual_filter_size/f20.png b/img/visual_filter_size/f20.png
new file mode 100644
index 00000000..c9dba2cc
Binary files /dev/null and b/img/visual_filter_size/f20.png differ
diff --git a/img/visual_filter_size/f200.png b/img/visual_filter_size/f200.png
new file mode 100644
index 00000000..3d485d1e
Binary files /dev/null and b/img/visual_filter_size/f200.png differ
diff --git a/img/visual_filter_size/f40.png b/img/visual_filter_size/f40.png
new file mode 100644
index 00000000..f9139fde
Binary files /dev/null and b/img/visual_filter_size/f40.png differ
diff --git a/img/visual_filter_size/f400.png b/img/visual_filter_size/f400.png
new file mode 100644
index 00000000..d3f60d69
Binary files /dev/null and b/img/visual_filter_size/f400.png differ
diff --git a/img/visual_filter_size/f8.png b/img/visual_filter_size/f8.png
new file mode 100644
index 00000000..e69cb25e
Binary files /dev/null and b/img/visual_filter_size/f8.png differ
diff --git a/img/visual_filter_size/f80.png b/img/visual_filter_size/f80.png
new file mode 100644
index 00000000..cd3e30c8
Binary files /dev/null and b/img/visual_filter_size/f80.png differ
diff --git a/img/vs/vs1000.png b/img/vs/vs1000.png
new file mode 100644
index 00000000..407ea444
Binary files /dev/null and b/img/vs/vs1000.png differ
diff --git a/img/vs/vs500.png b/img/vs/vs500.png
new file mode 100644
index 00000000..2e6870d9
Binary files /dev/null and b/img/vs/vs500.png differ
diff --git a/img/vs/vs6000.png b/img/vs/vs6000.png
new file mode 100644
index 00000000..0dfad1d0
Binary files /dev/null and b/img/vs/vs6000.png differ
diff --git a/scenes/cornell_ceiling_light.txt b/scenes/cornell_ceiling_light.txt
index 15af5f19..1cebbfd3 100644
--- a/scenes/cornell_ceiling_light.txt
+++ b/scenes/cornell_ceiling_light.txt
@@ -52,7 +52,7 @@ EMITTANCE   0
 CAMERA
 RES         800 800
 FOVY        45
-ITERATIONS  10
+ITERATIONS  1000
 DEPTH       8
 FILE        cornell
 EYE         0.0 5 10.5
@@ -115,3 +115,9 @@ material 4
 TRANS       -1 4 -1
 ROTAT       0 0 0
 SCALE       3 3 3
+
+
+
+
+
+
diff --git a/src/main.cpp b/src/main.cpp
index 4092ae4a..4dc002b1 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -22,6 +22,9 @@ static double lastY;
 int ui_iterations = 0;
 int startupIterations = 0;
 int lastLoopIterations = 0;
+int lastPosWeight = 0;
+int lastNormalWeight = 0;
+int lastColorWeight = 0;
 bool ui_showGbuffer = false;
 bool ui_denoise = false;
 int ui_filterSize = 80;
@@ -29,6 +32,7 @@ float ui_colorWeight = 0.45f;
 float ui_normalWeight = 0.35f;
 float ui_positionWeight = 0.2f;
 bool ui_saveAndExit = false;
+int ui_typeGbuffer = 0;
 
 static bool camchanged = true;
 static float dtheta = 0, dphi = 0;
@@ -45,6 +49,8 @@ int iteration;
 int width;
 int height;
 
+int num_pic = 0;
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -112,7 +118,7 @@ void saveImage() {
 
     std::string filename = renderState->imageName;
     std::ostringstream ss;
-    ss << filename << "." << startTimeString << "." << samples << "samp";
+    ss << filename << "." << startTimeString << "." << samples << "." << num_pic++ << "samp";
     filename = ss.str();
 
     // CHECKITOUT
@@ -162,18 +168,26 @@ void runCuda() {
 
         // execute the kernel
         int frame = 0;
-        pathtrace(frame, iteration);
+        pathtrace(frame, iteration, ui_iterations == iteration);
     }
 
     if (ui_showGbuffer) {
-      showGBuffer(pbo_dptr);
-    } else {
+      showGBuffer(pbo_dptr, iteration, ui_typeGbuffer);
+    } 
+    else if (ui_denoise) {
+      showImageDenoise(pbo_dptr, iteration);
+    }
+    else {
       showImage(pbo_dptr, iteration);
     }
 
     // unmap buffer object
     cudaGLUnmapBufferObject(pbo);
 
+    if (iteration != 0 && iteration % 1000 == 0 && iteration != ui_iterations) {
+      saveImage();
+    }
+
     if (ui_saveAndExit) {
         saveImage();
         pathtraceFree();
diff --git a/src/main.h b/src/main.h
index 06d311a8..a94e5033 100644
--- a/src/main.h
+++ b/src/main.h
@@ -41,6 +41,7 @@ extern float ui_colorWeight;
 extern float ui_normalWeight;
 extern float ui_positionWeight;
 extern bool ui_saveAndExit;
+extern int ui_typeGbuffer;
 
 void runCuda();
 void keyCallback(GLFWwindow *window, int key, int scancode, int action, int mods);
diff --git a/src/pathtrace.cu b/src/pathtrace.cu
index 23e5f909..50575922 100644
--- a/src/pathtrace.cu
+++ b/src/pathtrace.cu
@@ -5,6 +5,7 @@
 #include <thrust/random.h>
 #include <thrust/remove.h>
 
+#include "main.h"
 #include "sceneStructs.h"
 #include "scene.h"
 #include "glm/glm.hpp"
@@ -15,119 +16,165 @@
 #include "interactions.h"
 
 #define ERRORCHECK 1
+#define DENOISE 1
 
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
-void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+void checkCUDAErrorFn(const char* msg, const char* file, int line) {
 #if ERRORCHECK
-    cudaDeviceSynchronize();
-    cudaError_t err = cudaGetLastError();
-    if (cudaSuccess == err) {
-        return;
-    }
+  cudaDeviceSynchronize();
+  cudaError_t err = cudaGetLastError();
+  if (cudaSuccess == err) {
+    return;
+  }
 
-    fprintf(stderr, "CUDA error");
-    if (file) {
-        fprintf(stderr, " (%s:%d)", file, line);
-    }
-    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+  fprintf(stderr, "CUDA error");
+  if (file) {
+    fprintf(stderr, " (%s:%d)", file, line);
+  }
+  fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
 #  ifdef _WIN32
-    getchar();
+  getchar();
 #  endif
-    exit(EXIT_FAILURE);
+  exit(EXIT_FAILURE);
 #endif
 }
 
 __host__ __device__
 thrust::default_random_engine makeSeededRandomEngine(int iter, int index, int depth) {
-    int h = utilhash((1 << 31) | (depth << 22) | iter) ^ utilhash(index);
-    return thrust::default_random_engine(h);
+  int h = utilhash((1 << 31) | (depth << 22) | iter) ^ utilhash(index);
+  return thrust::default_random_engine(h);
 }
 
 //Kernel that writes the image to the OpenGL PBO directly.
 __global__ void sendImageToPBO(uchar4* pbo, glm::ivec2 resolution,
-        int iter, glm::vec3* image) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
-
-    if (x < resolution.x && y < resolution.y) {
-        int index = x + (y * resolution.x);
-        glm::vec3 pix = image[index];
-
-        glm::ivec3 color;
-        color.x = glm::clamp((int) (pix.x / iter * 255.0), 0, 255);
-        color.y = glm::clamp((int) (pix.y / iter * 255.0), 0, 255);
-        color.z = glm::clamp((int) (pix.z / iter * 255.0), 0, 255);
-
-        // Each thread writes one pixel location in the texture (textel)
-        pbo[index].w = 0;
-        pbo[index].x = color.x;
-        pbo[index].y = color.y;
-        pbo[index].z = color.z;
-    }
+  int iter, glm::vec3* image) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x < resolution.x && y < resolution.y) {
+    int index = x + (y * resolution.x);
+    glm::vec3 pix = image[index];
+
+    glm::ivec3 color;
+    color.x = glm::clamp((int)(pix.x / iter * 255.0), 0, 255);
+    color.y = glm::clamp((int)(pix.y / iter * 255.0), 0, 255);
+    color.z = glm::clamp((int)(pix.z / iter * 255.0), 0, 255);
+
+    // Each thread writes one pixel location in the texture (texel)
+    pbo[index].w = 0;
+    pbo[index].x = color.x;
+    pbo[index].y = color.y;
+    pbo[index].z = color.z;
+  }
 }
 
-__global__ void gbufferToPBO(uchar4* pbo, glm::ivec2 resolution, GBufferPixel* gBuffer) {
-    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+__global__ void gbufferToPBO(uchar4* pbo, glm::vec3* image, glm::ivec2 resolution, GBufferPixel* gBuffer, int type) {
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
 
-    if (x < resolution.x && y < resolution.y) {
-        int index = x + (y * resolution.x);
-        float timeToIntersect = gBuffer[index].t * 256.0;
+  if (x < resolution.x && y < resolution.y) {
+    int index = x + (y * resolution.x);
+    float timeToIntersect = gBuffer[index].t * 256.0;
 
-        pbo[index].w = 0;
-        pbo[index].x = timeToIntersect;
-        pbo[index].y = timeToIntersect;
-        pbo[index].z = timeToIntersect;
+    glm::vec3 output;
+    if (type == 0) {
+      output.x = glm::abs(gBuffer[index].normal.x * 255.f);
+      output.y = glm::abs(gBuffer[index].normal.y * 255.f);
+      output.z = glm::abs(gBuffer[index].normal.z * 255.f);
+    }
+    else if (type == 1) {
+      output = gBuffer[index].pos * 256.f * 0.1f;
+    }
+    else {
+      output = glm::vec3(gBuffer[index].t) * 256.f;
     }
+
+    pbo[index].w = 0;
+    pbo[index].x = output.x;
+    pbo[index].y = output.y;
+    pbo[index].z = output.z;
+
+    image[index].x = output.x;
+    image[index].y = output.y;
+    image[index].z = output.z;
+  }
 }
 
-static Scene * hst_scene = NULL;
-static glm::vec3 * dev_image = NULL;
-static Geom * dev_geoms = NULL;
-static Material * dev_materials = NULL;
-static PathSegment * dev_paths = NULL;
-static ShadeableIntersection * dev_intersections = NULL;
+static Scene* hst_scene = NULL;
+static glm::vec3* dev_image = NULL;
+static Geom* dev_geoms = NULL;
+static Material* dev_materials = NULL;
+static PathSegment* dev_paths = NULL;
+static ShadeableIntersection* dev_intersections = NULL;
 static GBufferPixel* dev_gBuffer = NULL;
 // TODO: static variables for device memory, any extra info you need, etc
 // ...
 
-void pathtraceInit(Scene *scene) {
-    hst_scene = scene;
-    const Camera &cam = hst_scene->state.camera;
-    const int pixelcount = cam.resolution.x * cam.resolution.y;
+static glm::vec3* dev_image_denoise = NULL;
+static glm::vec3* dev_image_denoise_tmp = NULL;
+static glm::vec3* dev_image_gBuffer = NULL;
 
-    cudaMalloc(&dev_image, pixelcount * sizeof(glm::vec3));
-    cudaMemset(dev_image, 0, pixelcount * sizeof(glm::vec3));
+static cudaEvent_t startEvent = NULL;
+static cudaEvent_t endEvent = NULL;
 
-  	cudaMalloc(&dev_paths, pixelcount * sizeof(PathSegment));
+void pathtraceInit(Scene* scene) {
+  hst_scene = scene;
+  const Camera& cam = hst_scene->state.camera;
+  const int pixelcount = cam.resolution.x * cam.resolution.y;
 
-  	cudaMalloc(&dev_geoms, scene->geoms.size() * sizeof(Geom));
-  	cudaMemcpy(dev_geoms, scene->geoms.data(), scene->geoms.size() * sizeof(Geom), cudaMemcpyHostToDevice);
+  cudaMalloc(&dev_image, pixelcount * sizeof(glm::vec3));
+  cudaMemset(dev_image, 0, pixelcount * sizeof(glm::vec3));
 
-  	cudaMalloc(&dev_materials, scene->materials.size() * sizeof(Material));
-  	cudaMemcpy(dev_materials, scene->materials.data(), scene->materials.size() * sizeof(Material), cudaMemcpyHostToDevice);
+  cudaMalloc(&dev_paths, pixelcount * sizeof(PathSegment));
 
-  	cudaMalloc(&dev_intersections, pixelcount * sizeof(ShadeableIntersection));
-  	cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
+  cudaMalloc(&dev_geoms, scene->geoms.size() * sizeof(Geom));
+  cudaMemcpy(dev_geoms, scene->geoms.data(), scene->geoms.size() * sizeof(Geom), cudaMemcpyHostToDevice);
 
-    cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel));
+  cudaMalloc(&dev_materials, scene->materials.size() * sizeof(Material));
+  cudaMemcpy(dev_materials, scene->materials.data(), scene->materials.size() * sizeof(Material), cudaMemcpyHostToDevice);
 
-    // TODO: initialize any extra device memeory you need
+  cudaMalloc(&dev_intersections, pixelcount * sizeof(ShadeableIntersection));
+  cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
 
-    checkCUDAError("pathtraceInit");
+  cudaMalloc(&dev_gBuffer, pixelcount * sizeof(GBufferPixel));
+
+  // TODO: initialize any extra device memory you need
+
+  cudaMalloc(&dev_image_denoise, pixelcount * sizeof(glm::vec3));
+  cudaMemset(dev_image_denoise, 0, pixelcount * sizeof(glm::vec3));
+
+  cudaMalloc(&dev_image_denoise_tmp, pixelcount * sizeof(glm::vec3));
+  cudaMemset(dev_image_denoise_tmp, 0, pixelcount * sizeof(glm::vec3));
+
+  cudaMalloc(&dev_image_gBuffer, pixelcount * sizeof(glm::vec3));
+  cudaMemset(dev_image_gBuffer, 0, pixelcount * sizeof(glm::vec3));
+
+  cudaEventCreate(&startEvent);
+  cudaEventCreate(&endEvent);
+
+  checkCUDAError("pathtraceInit");
 }
 
 void pathtraceFree() {
-    cudaFree(dev_image);  // no-op if dev_image is null
-  	cudaFree(dev_paths);
-  	cudaFree(dev_geoms);
-  	cudaFree(dev_materials);
-  	cudaFree(dev_intersections);
-    cudaFree(dev_gBuffer);
-    // TODO: clean up any extra device memory you created
-
-    checkCUDAError("pathtraceFree");
+  cudaFree(dev_image);  // no-op if dev_image is null
+  cudaFree(dev_paths);
+  cudaFree(dev_geoms);
+  cudaFree(dev_materials);
+  cudaFree(dev_intersections);
+  cudaFree(dev_gBuffer);
+  // TODO: clean up any extra device memory you created
+
+  cudaFree(dev_image_denoise);
+  cudaFree(dev_image_denoise_tmp);
+  cudaFree(dev_image_gBuffer);
+
+  if (startEvent != NULL)
+    cudaEventDestroy(startEvent);
+  if (endEvent != NULL)
+    cudaEventDestroy(endEvent);
+
+  checkCUDAError("pathtraceFree");
 }
 
 /**
@@ -140,98 +187,98 @@ void pathtraceFree() {
 */
 __global__ void generateRayFromCamera(Camera cam, int iter, int traceDepth, PathSegment* pathSegments)
 {
-	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
-	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
 
-	if (x < cam.resolution.x && y < cam.resolution.y) {
-		int index = x + (y * cam.resolution.x);
-		PathSegment & segment = pathSegments[index];
+  if (x < cam.resolution.x && y < cam.resolution.y) {
+    int index = x + (y * cam.resolution.x);
+    PathSegment& segment = pathSegments[index];
 
-		segment.ray.origin = cam.position;
+    segment.ray.origin = cam.position;
     segment.color = glm::vec3(1.0f, 1.0f, 1.0f);
 
-		segment.ray.direction = glm::normalize(cam.view
-			- cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
-			- cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f)
-			);
+    segment.ray.direction = glm::normalize(cam.view
+      - cam.right * cam.pixelLength.x * ((float)x - (float)cam.resolution.x * 0.5f)
+      - cam.up * cam.pixelLength.y * ((float)y - (float)cam.resolution.y * 0.5f)
+    );
 
-		segment.pixelIndex = index;
-		segment.remainingBounces = traceDepth;
-	}
+    segment.pixelIndex = index;
+    segment.remainingBounces = traceDepth;
+  }
 }
 
 __global__ void computeIntersections(
-	int depth
-	, int num_paths
-	, PathSegment * pathSegments
-	, Geom * geoms
-	, int geoms_size
-	, ShadeableIntersection * intersections
-	)
+  int depth
+  , int num_paths
+  , PathSegment* pathSegments
+  , Geom* geoms
+  , int geoms_size
+  , ShadeableIntersection* intersections
+)
 {
-	int path_index = blockIdx.x * blockDim.x + threadIdx.x;
-
-	if (path_index < num_paths)
-	{
-		PathSegment pathSegment = pathSegments[path_index];
-
-		float t;
-		glm::vec3 intersect_point;
-		glm::vec3 normal;
-		float t_min = FLT_MAX;
-		int hit_geom_index = -1;
-		bool outside = true;
-
-		glm::vec3 tmp_intersect;
-		glm::vec3 tmp_normal;
-
-		// naive parse through global geoms
-
-		for (int i = 0; i < geoms_size; i++)
-		{
-			Geom & geom = geoms[i];
-
-			if (geom.type == CUBE)
-			{
-				t = boxIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
-			}
-			else if (geom.type == SPHERE)
-			{
-				t = sphereIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
-			}
-
-			// Compute the minimum t from the intersection tests to determine what
-			// scene geometry object was hit first.
-			if (t > 0.0f && t_min > t)
-			{
-				t_min = t;
-				hit_geom_index = i;
-				intersect_point = tmp_intersect;
-				normal = tmp_normal;
-			}
-		}
-
-		if (hit_geom_index == -1)
-		{
-			intersections[path_index].t = -1.0f;
-		}
-		else
-		{
-			//The ray hits something
-			intersections[path_index].t = t_min;
-			intersections[path_index].materialId = geoms[hit_geom_index].materialid;
-			intersections[path_index].surfaceNormal = normal;
-		}
-	}
+  int path_index = blockIdx.x * blockDim.x + threadIdx.x;
+
+  if (path_index < num_paths)
+  {
+    PathSegment pathSegment = pathSegments[path_index];
+
+    float t;
+    glm::vec3 intersect_point;
+    glm::vec3 normal;
+    float t_min = FLT_MAX;
+    int hit_geom_index = -1;
+    bool outside = true;
+
+    glm::vec3 tmp_intersect;
+    glm::vec3 tmp_normal;
+
+    // naive parse through global geoms
+
+    for (int i = 0; i < geoms_size; i++)
+    {
+      Geom& geom = geoms[i];
+
+      if (geom.type == CUBE)
+      {
+        t = boxIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
+      }
+      else if (geom.type == SPHERE)
+      {
+        t = sphereIntersectionTest(geom, pathSegment.ray, tmp_intersect, tmp_normal, outside);
+      }
+
+      // Compute the minimum t from the intersection tests to determine what
+      // scene geometry object was hit first.
+      if (t > 0.0f && t_min > t)
+      {
+        t_min = t;
+        hit_geom_index = i;
+        intersect_point = tmp_intersect;
+        normal = tmp_normal;
+      }
+    }
+
+    if (hit_geom_index == -1)
+    {
+      intersections[path_index].t = -1.0f;
+    }
+    else
+    {
+      //The ray hits something
+      intersections[path_index].t = t_min;
+      intersections[path_index].materialId = geoms[hit_geom_index].materialid;
+      intersections[path_index].surfaceNormal = normal;
+    }
+  }
 }
 
-__global__ void shadeSimpleMaterials (
+__global__ void shadeSimpleMaterials(
   int iter
   , int num_paths
-	, ShadeableIntersection * shadeableIntersections
-	, PathSegment * pathSegments
-	, Material * materials
-	)
+  , ShadeableIntersection* shadeableIntersections
+  , PathSegment* pathSegments
+  , Material* materials
+)
 {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < num_paths)
@@ -260,11 +307,12 @@ __global__ void shadeSimpleMaterials (
         glm::vec3 intersectPos = intersection.t * segment.ray.direction + segment.ray.origin;
         scatterRay(segment, intersectPos, intersection.surfaceNormal, material, rng);
       }
-    // If there was no intersection, color the ray black.
-    // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
-    // used for opacity, in which case they can indicate "no opacity".
-    // This can be useful for post-processing and image compositing.
-    } else {
+      // If there was no intersection, color the ray black.
+      // Lots of renderers use 4 channel color, RGBA, where A = alpha, often
+      // used for opacity, in which case they can indicate "no opacity".
+      // This can be useful for post-processing and image compositing.
+    }
+    else {
       segment.color = glm::vec3(0.0f);
       segment.remainingBounces = 0;
     }
@@ -273,161 +321,309 @@ __global__ void shadeSimpleMaterials (
   }
 }
 
-__global__ void generateGBuffer (
+__global__ void generateGBuffer(
   int num_paths,
   ShadeableIntersection* shadeableIntersections,
-	PathSegment* pathSegments,
+  PathSegment* pathSegments,
   GBufferPixel* gBuffer) {
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < num_paths)
   {
     gBuffer[idx].t = shadeableIntersections[idx].t;
+    gBuffer[idx].normal = shadeableIntersections[idx].surfaceNormal;
+    gBuffer[idx].pos = getPointOnRay(pathSegments[idx].ray, shadeableIntersections[idx].t);
   }
 }
 
 // Add the current iteration's output to the overall image
-__global__ void finalGather(int nPaths, glm::vec3 * image, PathSegment * iterationPaths)
+__global__ void finalGather(int nPaths, glm::vec3* image, PathSegment* iterationPaths)
 {
-	int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+  if (index < nPaths)
+  {
+    PathSegment iterationPath = iterationPaths[index];
+    image[iterationPath.pixelIndex] += iterationPath.color;
+  }
+}
+
+__global__ void denoiseATour(const Camera cam, const int stepWidth, const float c_phi, const float p_phi, const float n_phi,
+  const glm::vec3* image, glm::vec3* imageDenoise, const GBufferPixel* gBuffer)
+{
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+  if (x < cam.resolution.x && y < cam.resolution.y) {
+
+    const float gaussian[9] = { 0.00390625f, 0.015625f, 0.0234375f, 0.015625f, 0.0625f, 0.09375f, 0.0234375f, 0.09375f, 0.140625f };
+
+    glm::vec3 sum = glm::vec3(0.f);
+    float cum_w = 0.f;
+
+    int index = x + (y * cam.resolution.x);
 
-	if (index < nPaths)
-	{
-		PathSegment iterationPath = iterationPaths[index];
-		image[iterationPath.pixelIndex] += iterationPath.color;
-	}
+    glm::vec3 cval = image[index];
+    glm::vec3 pval = gBuffer[index].pos;
+    glm::vec3 nval = gBuffer[index].normal;
+
+    for (int j = -2; j <= 2; j++) {
+      for (int i = -2; i <= 2; i++) {
+
+        int uvX = int(x + i * stepWidth);
+        int uvY = int(y + j * stepWidth);
+
+        //uvX = min(max(uvX, 0), cam.resolution.x - 1);
+        //uvY = min(max(uvY, 0), cam.resolution.y - 1);
+
+        if (uvX < 0 || uvX >= cam.resolution.x) continue;
+        if (uvY < 0 || uvY >= cam.resolution.y) continue;
+
+        float kernelValue = gaussian[-abs(i) + 2 + (-abs(j) + 2) * 3]; // fold offsets (i, j) onto the stored 3x3 kernel quadrant
+
+        int itmp = uvX + cam.resolution.x * uvY;
+
+        glm::vec3 ctmp = image[itmp];
+        glm::vec3 t = cval - ctmp;
+        float dist2 = glm::dot(t, t);
+        float c_w = glm::min(std::exp(-(dist2) / (c_phi + EPSILON)), 1.f);
+
+        t = nval - gBuffer[itmp].normal;
+        dist2 = glm::max(glm::dot(t, t) / (stepWidth * stepWidth), 0.f);
+        float n_w = glm::min(std::exp(-(dist2) / (n_phi + EPSILON)), 1.f);
+
+        t = pval - gBuffer[itmp].pos;
+        dist2 = glm::dot(t, t);
+        float p_w = glm::min(std::exp(-(dist2) / (p_phi + EPSILON)), 1.f);
+
+        float weight = c_w * p_w * n_w;
+        sum += ctmp * weight * kernelValue;
+        cum_w += weight * kernelValue;
+      }
+    }
+
+    imageDenoise[index] = sum / cum_w;
+  }
+}
+
+__global__ void diffuseImage(int nPaths, int iter, glm::vec3* image) { // divides the accumulated color by iter to get the per-pixel average
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+  if (index < nPaths)
+  {
+    glm::vec3 color = image[index];
+    color.r /= iter;
+    color.b /= iter;
+    color.g /= iter;
+
+    image[index] = color;
+  }
 }
 
 /**
  * Wrapper for the __global__ call that sets up the kernel calls and does a ton
  * of memory management
  */
-void pathtrace(int frame, int iter) {
-    const int traceDepth = hst_scene->state.traceDepth;
-    const Camera &cam = hst_scene->state.camera;
-    const int pixelcount = cam.resolution.x * cam.resolution.y;
-
-	// 2D block for generating ray from camera
-    const dim3 blockSize2d(8, 8);
-    const dim3 blocksPerGrid2d(
-            (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
-            (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
-
-	// 1D block for path tracing
-	const int blockSize1d = 128;
-
-    ///////////////////////////////////////////////////////////////////////////
-
-    // Pathtracing Recap:
-    // * Initialize array of path rays (using rays that come out of the camera)
-    //   * You can pass the Camera object to that kernel.
-    //   * Each path ray must carry at minimum a (ray, color) pair,
-    //   * where color starts as the multiplicative identity, white = (1, 1, 1).
-    //   * This has already been done for you.
-    // * NEW: For the first depth, generate geometry buffers (gbuffers)
-    // * For each depth:
-    //   * Compute an intersection in the scene for each path ray.
-    //     A very naive version of this has been implemented for you, but feel
-    //     free to add more primitives and/or a better algorithm.
-    //     Currently, intersection distance is recorded as a parametric distance,
-    //     t, or a "distance along the ray." t = -1.0 indicates no intersection.
-    //     * Color is attenuated (multiplied) by reflections off of any object
-    //   * Stream compact away all of the terminated paths.
-    //     You may use either your implementation or `thrust::remove_if` or its
-    //     cousins.
-    //     * Note that you can't really use a 2D kernel launch any more - switch
-    //       to 1D.
-    //   * Shade the rays that intersected something or didn't bottom out.
-    //     That is, color the ray by performing a color computation according
-    //     to the shader, then generate a new ray to continue the ray path.
-    //     We recommend just updating the ray's PathSegment in place.
-    //     Note that this step may come before or after stream compaction,
-    //     since some shaders you write may also cause a path to terminate.
-    // * Finally:
-    //     * if not denoising, add this iteration's results to the image
-    //     * TODO: if denoising, run kernels that take both the raw pathtraced result and the gbuffer, and put the result in the "pbo" from opengl
-
-	generateRayFromCamera <<<blocksPerGrid2d, blockSize2d >>>(cam, iter, traceDepth, dev_paths);
-	checkCUDAError("generate camera ray");
-
-	int depth = 0;
-	PathSegment* dev_path_end = dev_paths + pixelcount;
-	int num_paths = dev_path_end - dev_paths;
-
-	// --- PathSegment Tracing Stage ---
-	// Shoot ray into scene, bounce between objects, push shading chunks
+void pathtrace(int frame, int iter, bool isLast) {
+  const int traceDepth = hst_scene->state.traceDepth;
+  const Camera& cam = hst_scene->state.camera;
+  const int pixelcount = cam.resolution.x * cam.resolution.y;
+
+  // 2D block for generating ray from camera
+  const dim3 blockSize2d(8, 8);
+  const dim3 blocksPerGrid2d(
+    (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+    (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+  // 1D block for path tracing
+  const int blockSize1d = 128;
+
+  ///////////////////////////////////////////////////////////////////////////
+
+  // Pathtracing Recap:
+  // * Initialize array of path rays (using rays that come out of the camera)
+  //   * You can pass the Camera object to that kernel.
+  //   * Each path ray must carry at minimum a (ray, color) pair,
+  //   * where color starts as the multiplicative identity, white = (1, 1, 1).
+  //   * This has already been done for you.
+  // * NEW: For the first depth, generate geometry buffers (gbuffers)
+  // * For each depth:
+  //   * Compute an intersection in the scene for each path ray.
+  //     A very naive version of this has been implemented for you, but feel
+  //     free to add more primitives and/or a better algorithm.
+  //     Currently, intersection distance is recorded as a parametric distance,
+  //     t, or a "distance along the ray." t = -1.0 indicates no intersection.
+  //     * Color is attenuated (multiplied) by reflections off of any object
+  //   * Stream compact away all of the terminated paths.
+  //     You may use either your implementation or `thrust::remove_if` or its
+  //     cousins.
+  //     * Note that you can't really use a 2D kernel launch any more - switch
+  //       to 1D.
+  //   * Shade the rays that intersected something or didn't bottom out.
+  //     That is, color the ray by performing a color computation according
+  //     to the shader, then generate a new ray to continue the ray path.
+  //     We recommend just updating the ray's PathSegment in place.
+  //     Note that this step may come before or after stream compaction,
+  //     since some shaders you write may also cause a path to terminate.
+  // * Finally:
+  //     * if not denoising, add this iteration's results to the image
+  //     * TODO: if denoising, run kernels that take both the raw pathtraced result and the gbuffer, and put the result in the "pbo" from opengl
+  if (isLast)
+    cudaEventRecord(startEvent);
+
+  generateRayFromCamera<<<blocksPerGrid2d, blockSize2d>>>(cam, iter, traceDepth, dev_paths);
+  checkCUDAError("generate camera ray");
+
+  int depth = 0;
+  PathSegment* dev_path_end = dev_paths + pixelcount;
+  int num_paths = dev_path_end - dev_paths;
+
+  // --- PathSegment Tracing Stage ---
+  // Shoot ray into scene, bounce between objects, push shading chunks
 
   // Empty gbuffer
   cudaMemset(dev_gBuffer, 0, pixelcount * sizeof(GBufferPixel));
 
-	// clean shading chunks
-	cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
+  // clean shading chunks
+  cudaMemset(dev_intersections, 0, pixelcount * sizeof(ShadeableIntersection));
 
   bool iterationComplete = false;
-	while (!iterationComplete) {
-
-	// tracing
-	dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
-	computeIntersections <<<numblocksPathSegmentTracing, blockSize1d>>> (
-		depth
-		, num_paths
-		, dev_paths
-		, dev_geoms
-		, hst_scene->geoms.size()
-		, dev_intersections
-		);
-	checkCUDAError("trace one bounce");
-	cudaDeviceSynchronize();
-
-  if (depth == 0) {
-    generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
+  while (!iterationComplete) {
+
+    // tracing
+    dim3 numblocksPathSegmentTracing = (num_paths + blockSize1d - 1) / blockSize1d;
+    computeIntersections<<<numblocksPathSegmentTracing, blockSize1d>>>(
+      depth
+      , num_paths
+      , dev_paths
+      , dev_geoms
+      , hst_scene->geoms.size()
+      , dev_intersections
+      );
+    checkCUDAError("trace one bounce");
+    cudaDeviceSynchronize();
+
+    if (depth == 0) {
+      generateGBuffer<<<numblocksPathSegmentTracing, blockSize1d>>>(num_paths, dev_intersections, dev_paths, dev_gBuffer);
+    }
+
+    depth++;
+
+    shadeSimpleMaterials<<<numblocksPathSegmentTracing, blockSize1d>>>(
+      iter,
+      num_paths,
+      dev_intersections,
+      dev_paths,
+      dev_materials
+      );
+    iterationComplete = depth == traceDepth;
   }
 
-	depth++;
+  if (isLast) {
+    cudaEventRecord(endEvent);
+    cudaEventSynchronize(endEvent);
+    float ms;
+    cudaEventElapsedTime(&ms, startEvent, endEvent);
+
+    std::cout << "========================" << std::endl;
+    std::cout << "| Time Spent (Iteration): " << ms << " ms |" << std::endl;
+    std::cout << "========================" << std::endl;
+  }
 
-  shadeSimpleMaterials<<<numblocksPathSegmentTracing, blockSize1d>>> (
-    iter,
-    num_paths,
-    dev_intersections,
-    dev_paths,
-    dev_materials
-  );
-  iterationComplete = depth == traceDepth;
-	}
 
   // Assemble this iteration and apply it to the image
   dim3 numBlocksPixels = (pixelcount + blockSize1d - 1) / blockSize1d;
-	finalGather<<<numBlocksPixels, blockSize1d>>>(num_paths, dev_image, dev_paths);
+  finalGather<<<numBlocksPixels, blockSize1d>>>(num_paths, dev_image, dev_paths);
+
+  ///////////////////////////////////////////////////////////////////////////
+
+#if DENOISE
+
+  if (isLast) {
+    cudaEventRecord(startEvent);
 
-    ///////////////////////////////////////////////////////////////////////////
+    cudaMemcpy(dev_image_denoise_tmp, dev_image, pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToDevice);
 
-    // CHECKITOUT: use dev_image as reference if you want to implement saving denoised images.
-    // Otherwise, screenshots are also acceptable.
-    // Retrieve image from GPU
-    cudaMemcpy(hst_scene->state.image.data(), dev_image,
-            pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+    diffuseImage<<<numBlocksPixels, blockSize1d>>>(pixelcount, iter, dev_image_denoise_tmp);
+
+    for (int stepWidth = 1; stepWidth * 4 <= ui_filterSize; stepWidth <<= 1) {
+      denoiseATour<<<blocksPerGrid2d, blockSize2d>>>(cam, stepWidth, ui_colorWeight, ui_positionWeight, ui_normalWeight,
+        dev_image_denoise_tmp, dev_image_denoise, dev_gBuffer);
+
+      std::swap(dev_image_denoise_tmp, dev_image_denoise);
+    }
+    std::swap(dev_image_denoise_tmp, dev_image_denoise); // undo the last swap so the final result ends up in dev_image_denoise
+
+    cudaEventRecord(endEvent);
+    cudaEventSynchronize(endEvent);
+    float ms;
+    cudaEventElapsedTime(&ms, startEvent, endEvent);
+
+    std::cout << "========================" << std::endl;
+    std::cout << "| Time Spent (Denoise): " << ms << " ms |" << std::endl;
+    std::cout << "========================" << std::endl;
+  }
 
-    checkCUDAError("pathtrace");
+#endif
+
+  // CHECKITOUT: use dev_image as reference if you want to implement saving denoised images.
+  // Otherwise, screenshots are also acceptable.
+  // Retrieve image from GPU
+  cudaMemcpy(hst_scene->state.image.data(), dev_image,
+    pixelcount * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+
+  checkCUDAError("pathtrace");
 }
 
 // CHECKITOUT: this kernel "post-processes" the gbuffer/gbuffers into something that you can visualize for debugging.
-void showGBuffer(uchar4* pbo) {
-    const Camera &cam = hst_scene->state.camera;
-    const dim3 blockSize2d(8, 8);
-    const dim3 blocksPerGrid2d(
-            (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
-            (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
-
-    // CHECKITOUT: process the gbuffer results and send them to OpenGL buffer for visualization
-    gbufferToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, dev_gBuffer);
+void showGBuffer(uchar4* pbo, int iter, int type) {
+  const Camera& cam = hst_scene->state.camera;
+  const dim3 blockSize2d(8, 8);
+  const dim3 blocksPerGrid2d(
+    (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+    (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+  // CHECKITOUT: process the gbuffer results and send them to OpenGL buffer for visualization
+  gbufferToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, dev_image_gBuffer, cam.resolution, dev_gBuffer, type);
+
+  cudaMemcpy(hst_scene->state.image.data(), dev_image_gBuffer,
+    cam.resolution.x * cam.resolution.y * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+
+  for (int x = 0; x < cam.resolution.x; x++) {
+    for (int y = 0; y < cam.resolution.y; y++) {
+      int index = x + (y * cam.resolution.x);
+      hst_scene->state.image[index] *= iter;
+    }
+  }
 }
 
 void showImage(uchar4* pbo, int iter) {
-const Camera &cam = hst_scene->state.camera;
-    const dim3 blockSize2d(8, 8);
-    const dim3 blocksPerGrid2d(
-            (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
-            (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
-
-    // Send results to OpenGL buffer for rendering
-    sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, iter, dev_image);
+  const Camera& cam = hst_scene->state.camera;
+  const dim3 blockSize2d(8, 8);
+  const dim3 blocksPerGrid2d(
+    (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+    (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+  // Send results to OpenGL buffer for rendering
+  sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, iter, dev_image);
 }
+
+void showImageDenoise(uchar4* pbo, int iter) {
+  const Camera& cam = hst_scene->state.camera;
+  const dim3 blockSize2d(8, 8);
+  const dim3 blocksPerGrid2d(
+    (cam.resolution.x + blockSize2d.x - 1) / blockSize2d.x,
+    (cam.resolution.y + blockSize2d.y - 1) / blockSize2d.y);
+
+  // Send results to OpenGL buffer for rendering
+  sendImageToPBO<<<blocksPerGrid2d, blockSize2d>>>(pbo, cam.resolution, 1, dev_image_denoise);
+
+  cudaMemcpy(hst_scene->state.image.data(), dev_image_denoise,
+    cam.resolution.x * cam.resolution.y * sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+
+  for (int x = 0; x < cam.resolution.x; x++) {
+    for (int y = 0; y < cam.resolution.y; y++) {
+      int index = x + (y * cam.resolution.x);
+      hst_scene->state.image[index] *= iter;
+    }
+  }
+}
\ No newline at end of file
diff --git a/src/pathtrace.h b/src/pathtrace.h
index 9e12f440..e573d19f 100644
--- a/src/pathtrace.h
+++ b/src/pathtrace.h
@@ -5,6 +5,7 @@
 
 void pathtraceInit(Scene *scene);
 void pathtraceFree();
-void pathtrace(int frame, int iteration);
-void showGBuffer(uchar4 *pbo);
+void pathtrace(int frame, int iteration, bool isLast);
+void showGBuffer(uchar4 *pbo, int iter, int type);
 void showImage(uchar4 *pbo, int iter);
+void showImageDenoise(uchar4* pbo, int iter);
diff --git a/src/preview.cpp b/src/preview.cpp
index 3ca27180..105eeacf 100644
--- a/src/preview.cpp
+++ b/src/preview.cpp
@@ -196,8 +196,8 @@ void drawGui(int windowWidth, int windowHeight) {
     ImGui::NewFrame();
 
     // Dear imgui define
-    ImVec2 minSize(300.f, 220.f);
-    ImVec2 maxSize((float)windowWidth * 0.5, (float)windowHeight * 0.3);
+    ImVec2 minSize(330.f, 250.f);
+    ImVec2 maxSize((float)windowWidth * 0.5, (float)windowHeight * 0.4);
     ImGui::SetNextWindowSizeConstraints(minSize, maxSize);
 
     ImGui::SetNextWindowPos(ui_hide ? ImVec2(-1000.f, -1000.f) : ImVec2(0.0f, 0.0f));
@@ -214,14 +214,15 @@ void drawGui(int windowWidth, int windowHeight) {
 
     ImGui::Checkbox("Denoise", &ui_denoise);
 
-    ImGui::SliderInt("Filter Size", &ui_filterSize, 0, 100);
-    ImGui::SliderFloat("Color Weight", &ui_colorWeight, 0.0f, 10.0f);
-    ImGui::SliderFloat("Normal Weight", &ui_normalWeight, 0.0f, 10.0f);
-    ImGui::SliderFloat("Position Weight", &ui_positionWeight, 0.0f, 10.0f);
+    ImGui::SliderInt("Filter Size", &ui_filterSize, 0, 400);
+    ImGui::SliderFloat("Color Weight", &ui_colorWeight, 0.0f, 2.0f);
+    ImGui::SliderFloat("Normal Weight", &ui_normalWeight, 0.0f, 2.0f);
+    ImGui::SliderFloat("Position Weight", &ui_positionWeight, 0.0f, 2.0f);
 
     ImGui::Separator();
 
-    ImGui::Checkbox("Show GBuffer", &ui_showGbuffer);
+    ImGui::Checkbox("Show GBuffer", &ui_showGbuffer);
+    ImGui::SliderInt("GBuffer Type (0: Normal, 1: Position, 2: Color)", &ui_typeGbuffer, 0, 2);
 
     ImGui::Separator();
 
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
index da7e558a..e59234c8 100644
--- a/src/sceneStructs.h
+++ b/src/sceneStructs.h
@@ -79,4 +79,7 @@ struct ShadeableIntersection {
 // What information might be helpful for guiding a denoising filter?
 struct GBufferPixel {
   float t;
+  glm::vec3 normal;
+  glm::vec3 pos;
+  int materialId;
 };