CIS565-Fall-2018 · emily-vo · Oct 15, 2018 · Oct 16, 2018
diff --git a/README.html b/README.html
diff --git a/README.md b/README.md
@@ -5,16 +5,85 @@ CUDA Rasterizer
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Emily Vo
+  * [LinkedIn](linkedin.com/in/emilyvo), [personal website](emilyhvo.com)
+* Tested on: Windows 10, i7-7700HQ @ 2.8GHz 16GB, GTX 1060 6GB (Personal Computer)
+Updated the CMakeLists.txt to sm_61.
 
-### (TODO: Your README)
+### Blinn
+![](img/blinn_normals.gif)
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Lambert
+![](img/lambert_normals.gif)
 
 
+### README
+This project is a rasterizer implemented in CUDA that follows a typical rasterization pipeline. First, the vertex attributes (e.g. eye position, eye normal, texture coordinates, and texture properties) are computed and assembled, and the primitives are assembled. The primitives are then passed to the rasterization step, where the program can rasterize triangles, points, or lines. Each fragment's normal, eye position, and UVs are determined through barycentric interpolation. The fragment color can be read from a texture in the rasterize step using the interpolated UVs. Finally, the fragments are passed to the render step, where two shaders are implemented: Blinn and Lambert.
+
+#### Features
+##### Point Cloud 
+
+![](img/pointcloud.png)
+
+To render a point cloud, I used the NDC x and y coordinates. I tested if they were in the bounds of the screen, and then simply colored the fragment with a normal.
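
The bounds test described above can be sketched in host-side C++ as follows. This is a minimal illustration, not the project's kernel: `PixelCoord` and `ndcToPixel` are hypothetical names, and the y flip is an assumed viewport convention.

```cpp
#include <cassert>

// Hypothetical pixel coordinate pair; not a type from the project.
struct PixelCoord {
    int x;
    int y;
};

// Maps NDC x/y in [-1, 1] to integer pixel coordinates and reports whether
// the point lands on screen. The y flip (NDC +y up, pixel rows growing
// downward) is an assumed viewport convention.
bool ndcToPixel(float ndcX, float ndcY, int width, int height, PixelCoord& out) {
    assert(width > 0 && height > 0);
    const float sx = (ndcX + 1.0f) * 0.5f * static_cast<float>(width);
    const float sy = (1.0f - ndcY) * 0.5f * static_cast<float>(height);
    if (sx < 0.0f || sy < 0.0f || sx >= width || sy >= height) {
        return false;  // off screen: no fragment is written
    }
    out.x = static_cast<int>(sx);
    out.y = static_cast<int>(sy);
    return true;
}
```

In a CUDA kernel, each thread would run this test for one vertex and write the normal as the fragment color only when the function returns true.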
+
+##### Wireframe Rendering
+
+![](img/linerender.PNG)
+
+For each line, I iterate over the x values from x-min to x-max and solve for y at each step using the point-slope formula. If the resulting x and y are both on screen, I color the fragment with the normal.
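
A minimal host-side C++ sketch of this x-stepping approach is shown here. `Pixel` and `rasterizeLine` are hypothetical names. The vertical-line branch is an addition for robustness that the description above does not mention.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

struct Pixel {
    int x;
    int y;
};

// Collects the on-screen pixels of the segment (x0, y0)-(x1, y1) by stepping
// x and solving y = y0 + slope * (x - x0) at each step.
std::vector<Pixel> rasterizeLine(float x0, float y0, float x1, float y1,
                                 int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<Pixel> pixels;
    if (x0 > x1) {  // always walk from x-min to x-max
        std::swap(x0, x1);
        std::swap(y0, y1);
    }
    const float dx = x1 - x0;
    if (dx < 1e-6f) {
        // Vertical segment: the slope is undefined, so step y instead.
        const int x = static_cast<int>(std::lround(x0));
        const int yMin = static_cast<int>(std::lround(std::min(y0, y1)));
        const int yMax = static_cast<int>(std::lround(std::max(y0, y1)));
        for (int y = yMin; y <= yMax; ++y) {
            if (x >= 0 && x < width && y >= 0 && y < height) {
                pixels.push_back({x, y});
            }
        }
        return pixels;
    }
    const float slope = (y1 - y0) / dx;
    const int xStart = static_cast<int>(std::ceil(x0));
    const int xEnd = static_cast<int>(std::floor(x1));
    for (int x = xStart; x <= xEnd; ++x) {
        const float y = y0 + slope * (static_cast<float>(x) - x0);
        const int yi = static_cast<int>(std::lround(y));
        if (x >= 0 && x < width && yi >= 0 && yi < height) {
            pixels.push_back({x, yi});
        }
    }
    return pixels;
}
```

Stepping only along x leaves gaps on steep lines, where one x step spans several rows. Stepping along the longer axis is a common way to avoid that.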
+
+##### Triangle Rendering
+
+![](img/blinn.png)
+
+![](img/lambert.png)
+
+To render a triangle, I iterate over the pixels in its bounding box and use barycentric coordinates to test whether each pixel lies inside the triangle. If it does, I color the fragment.
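
The bounding-box scan with a barycentric inside test can be sketched in host-side C++ like this. All names are hypothetical, and sampling at pixel centers is an assumption. The sketch counts covered pixels instead of writing fragments so that it stays self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec2 {
    float x;
    float y;
};

struct Bary {
    float w0;
    float w1;
    float w2;
};

// Twice the signed area of triangle (p, q, r).
float signedArea2(Vec2 p, Vec2 q, Vec2 r) {
    return (q.x - p.x) * (r.y - p.y) - (r.x - p.x) * (q.y - p.y);
}

// Barycentric weights of p with respect to triangle (v0, v1, v2).
Bary barycentric(Vec2 v0, Vec2 v1, Vec2 v2, Vec2 p) {
    const float area = signedArea2(v0, v1, v2);
    assert(area != 0.0f);  // degenerate triangles are not handled here
    return {signedArea2(p, v1, v2) / area,
            signedArea2(v0, p, v2) / area,
            signedArea2(v0, v1, p) / area};
}

// A point is inside (or on an edge) when no weight is negative.
bool isInside(Bary b) {
    return b.w0 >= 0.0f && b.w1 >= 0.0f && b.w2 >= 0.0f;
}

// Scans the screen-clamped bounding box and counts covered pixel centers.
int countCoveredPixels(Vec2 v0, Vec2 v1, Vec2 v2, int width, int height) {
    const int minX = std::max(0, static_cast<int>(std::floor(std::min({v0.x, v1.x, v2.x}))));
    const int maxX = std::min(width - 1, static_cast<int>(std::ceil(std::max({v0.x, v1.x, v2.x}))));
    const int minY = std::max(0, static_cast<int>(std::floor(std::min({v0.y, v1.y, v2.y}))));
    const int maxY = std::min(height - 1, static_cast<int>(std::ceil(std::max({v0.y, v1.y, v2.y}))));
    int covered = 0;
    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            const Vec2 center{x + 0.5f, y + 0.5f};
            if (isInside(barycentric(v0, v1, v2, center))) {
                ++covered;
            }
        }
    }
    return covered;
}
```

The same weights returned by `barycentric` are what a rasterizer would reuse to interpolate normals, eye positions, and UVs for the fragment.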
+
+##### Texture Mapping
+For vanilla texture mapping, I simply converted the UVs to texture space and converted those values to indices within the flat texture array.
+
+In general, reading from textures is slow because a fetch can miss the cache. On a cache miss, the value has to be fetched from device memory, which costs many more cycles.
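
The UV-to-index conversion for vanilla texture mapping might look like the following host-side C++ sketch. `texelIndex` is a hypothetical helper. Clamping out-of-range UVs and row-major storage are assumptions, not something the text above confirms.

```cpp
#include <algorithm>
#include <cassert>

// Converts UVs in [0, 1] to a texel index into a flat, row-major texture.
// For an interleaved multi-channel texture, the first component of the texel
// would sit at index * channelCount (channel layout is an assumption).
int texelIndex(float u, float v, int width, int height) {
    assert(width > 0 && height > 0);
    // Scale to texture space, then clamp so u == 1 or v == 1 stays in range.
    const int x = std::min(std::max(static_cast<int>(u * width), 0), width - 1);
    const int y = std::min(std::max(static_cast<int>(v * height), 0), height - 1);
    return y * width + x;
}
```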
+
+##### Bilinear Interpolation
+Bilinear interpolation is a basic resampling technique. When reading from the texture, I sample the 4 texels that surround the UV coordinates and pass the fractional distances to those texels as the t-values to the interpolation functions. Bilinear interpolation first interpolates along the x-direction at the upper and lower bounds of the square being interpolated, then interpolates along the y-direction between those two results to get the final value. In general, this produces smoother-looking textures.
+
+Because the texture is read 4 times per fragment, the computation is much more expensive.
+
+
+![](https://demofox2.files.wordpress.com/2015/04/bilinear.png?w=534&zoom=2)
+
+Source: https://blog.demofox.org/2015/04/30/bilinear-filtering-bilinear-interpolation/
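
A minimal host-side C++ sketch of the two-pass interpolation is given here, assuming a single-channel float texture for brevity. `mix` and `sampleBilinear` are hypothetical names, and mapping UVs onto `[0, width - 1]` is one of several possible texel-addressing conventions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Linear interpolation between a and b with weight t in [0, 1].
float mix(float a, float b, float t) {
    return a + (b - a) * t;
}

// Bilinearly samples a row-major, single-channel texture at (u, v).
float sampleBilinear(const std::vector<float>& texels, int width, int height,
                     float u, float v) {
    assert(width > 0 && height > 0);
    assert(texels.size() == static_cast<size_t>(width) * static_cast<size_t>(height));
    // Clamp UVs, then move to texel space.
    const float fx = std::min(std::max(u, 0.0f), 1.0f) * static_cast<float>(width - 1);
    const float fy = std::min(std::max(v, 0.0f), 1.0f) * static_cast<float>(height - 1);
    const int x0 = static_cast<int>(std::floor(fx));
    const int y0 = static_cast<int>(std::floor(fy));
    const int x1 = std::min(x0 + 1, width - 1);
    const int y1 = std::min(y0 + 1, height - 1);
    const float tx = fx - static_cast<float>(x0);  // fractional distance in x
    const float ty = fy - static_cast<float>(y0);  // fractional distance in y
    // Four texture reads: interpolate along x on both rows, then along y.
    const float top = mix(texels[y0 * width + x0], texels[y0 * width + x1], tx);
    const float bottom = mix(texels[y1 * width + x0], texels[y1 * width + x1], tx);
    return mix(top, bottom, ty);
}
```

For an RGB texture the same weights would be applied to each channel.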
+
+##### Perspective Correct UVs
+
+![](img/NoPC.png)
+
+In the case of a checkerboard plane, it is easy to see that without perspective correct UVs, the appearance of the plane is extremely distorted.
+
+![](img/PC.png)
+
+With perspective correct UVs, the texture is foreshortened correctly.
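
One common way to get perspective correct UVs is to interpolate `uv / z` and `1 / z` with the screen-space barycentric weights and divide at the end. The host-side C++ sketch below shows that formulation under the assumption that eye-space depths are non-zero and share a sign. `perspectiveCorrectUV` is a hypothetical name, and the project's kernel may organize the math differently.

```cpp
#include <cassert>

struct Vec2 {
    float x;
    float y;
};

// Perspective-correct interpolation of per-vertex UVs.
// bary:     screen-space barycentric weights of the fragment
// eyeDepth: eye-space depth of each vertex (non-zero, same sign)
// uv:       per-vertex texture coordinates
Vec2 perspectiveCorrectUV(const float bary[3], const float eyeDepth[3],
                          const Vec2 uv[3]) {
    float invDepth = 0.0f;     // interpolated 1/z
    float uOverDepth = 0.0f;   // interpolated u/z
    float vOverDepth = 0.0f;   // interpolated v/z
    for (int i = 0; i < 3; ++i) {
        assert(eyeDepth[i] != 0.0f);
        const float w = bary[i] / eyeDepth[i];
        invDepth += w;
        uOverDepth += w * uv[i].x;
        vOverDepth += w * uv[i].y;
    }
    assert(invDepth != 0.0f);
    return {uOverDepth / invDepth, vOverDepth / invDepth};
}
```

When all three depths are equal, the result reduces to plain barycentric interpolation. When they differ, the UVs are pulled toward the nearer vertex, which is what removes the distortion on the checkerboard.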
+
+#### Performance Analysis
+
+![](img/FPS_vs_primitive.png)
+
+This graph shows that rendering triangles is the most expensive, rendering lines is less expensive, and rendering points is the least expensive. This makes sense: triangles require rasterizing every covered pixel each frame, whereas points only require shading a few fragments.
+
+![](img/FPS_vs_shader.png)
+
+Flat shading is not expensive. Lambert is more expensive because Lambert's coefficient must be computed every frame. Blinn is the most expensive because it computes Lambert's law as well as the specular component, and weights their contributions in the final pixel color.
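
The extra work in each shader can be seen in a small host-side C++ sketch. It assumes "Blinn" refers to the Blinn-Phong half-vector specular model. All names and the `shininess` parameter are hypothetical, not taken from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 {
    float x;
    float y;
    float z;
};

float dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 normalize(Vec3 v) {
    const float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);  // zero-length vectors are not handled in this sketch
    return {v.x / len, v.y / len, v.z / len};
}

// Lambert: diffuse intensity is the clamped cosine between normal and light.
float lambertTerm(Vec3 normal, Vec3 toLight) {
    return std::max(dot(normalize(normal), normalize(toLight)), 0.0f);
}

// Blinn-Phong specular: clamped cosine between the normal and the half
// vector of the light and view directions, raised to a shininess exponent.
float blinnPhongSpecular(Vec3 normal, Vec3 toLight, Vec3 toEye, float shininess) {
    const Vec3 l = normalize(toLight);
    const Vec3 v = normalize(toEye);
    const Vec3 h = normalize({l.x + v.x, l.y + v.y, l.z + v.z});
    return std::pow(std::max(dot(normalize(normal), h), 0.0f), shininess);
}
```

A Blinn shader evaluates both functions per fragment and blends the results, while a Lambert shader needs only the first, which is consistent with the ordering in the graph.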
+
+![](img/FPS_vs_texturing.png)
+
+The first bar is vanilla texturing, the next is bilinear texturing, and the last is bilinear with perspective correction. Bilinear is expensive because it samples the texture 4 times per fragment, increasing the chances of a cache miss.
+
+![](img/featuresTime.png)
+
+In this graph, each stage of the pipeline is timed with different features enabled. Vertex and primitive assembly stays the same throughout, except when texture attributes need to be set, which explains why it slows down slightly when textures are turned on. Rasterization varies the most between features. It slows down substantially when triangles are turned on, and again when texture mapping is turned on. Rasterization with bilinear filtering becomes extremely slow, probably because of the increased likelihood of cache misses. Fortunately, perspective correction is not expensive to add on top of texturing, since it is just a handful of additional mathematical operations, and it is a huge benefit to the appearance of the renders.
+
 ### Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)

diff --git a/img/BL.png b/img/BL.png
diff --git a/img/FPS_vs_primitive.png b/img/FPS_vs_primitive.png
diff --git a/img/FPS_vs_shader.png b/img/FPS_vs_shader.png
diff --git a/img/FPS_vs_texturing.png b/img/FPS_vs_texturing.png
diff --git a/img/NoPC.png b/img/NoPC.png
diff --git a/img/PC.png b/img/PC.png
diff --git a/img/blinn.png b/img/blinn.png
diff --git a/img/blinn_normals.gif b/img/blinn_normals.gif
diff --git a/img/featuresTime.png b/img/featuresTime.png
diff --git a/img/lambert.png b/img/lambert.png
diff --git a/img/lambert_normals.gif b/img/lambert_normals.gif
diff --git a/img/linerender.PNG b/img/linerender.PNG
diff --git a/img/pointcloud.png b/img/pointcloud.png
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -6,5 +6,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/src/common.cu b/src/common.cu
@@ -0,0 +1,39 @@
+#include "common.h"
+
+void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+    cudaError_t err = cudaGetLastError();
+    if (cudaSuccess == err) {
+        return;
+    }
+
+    fprintf(stderr, "CUDA error");
+    if (file) {
+        fprintf(stderr, " (%s:%d)", file, line);
+    }
+    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+    exit(EXIT_FAILURE);
+}
+
+
+namespace StreamCompaction {
+    namespace Common {
+
+        /**
+         * Maps an array to an array of 0s and 1s for stream compaction. Elements
+         * which map to 0 will be removed, and elements which map to 1 will be kept.
+         */
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+            // TODO
+        }
+
+        /**
+         * Performs scatter on an array. That is, for each element in idata,
+         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+         */
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices) {
+            // TODO
+        }
+
+    }
+}
diff --git a/src/common.h b/src/common.h
@@ -0,0 +1,132 @@
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <cstdio>
+#include <cstring>
+#include <cmath>
+#include <algorithm>
+#include <chrono>
+#include <stdexcept>
+
+#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
+#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+
+/**
+ * Check for CUDA errors; print and exit if there was a problem.
+ */
+void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);
+
+inline int ilog2(int x) {
+    int lg = 0;
+    while (x >>= 1) {
+        ++lg;
+    }
+    return lg;
+}
+
+inline int ilog2ceil(int x) {
+    return x == 1 ? 0 : ilog2(x - 1) + 1;
+}
+
+namespace StreamCompaction {
+    namespace Common {
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
+
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices);
+
+	    /**
+	    * This class is used for timing the performance
+	    * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+	    */
+	    class PerformanceTimer
+	    {
+	    public:
+		    PerformanceTimer()
+		    {
+			    cudaEventCreate(&event_start);
+			    cudaEventCreate(&event_end);
+		    }
+
+		    ~PerformanceTimer()
+		    {
+			    cudaEventDestroy(event_start);
+			    cudaEventDestroy(event_end);
+		    }
+
+		    void startCpuTimer()
+		    {
+			    if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+			    cpu_timer_started = true;
+
+			    time_start_cpu = std::chrono::high_resolution_clock::now();
+		    }
+
+		    void endCpuTimer()
+		    {
+			    time_end_cpu = std::chrono::high_resolution_clock::now();
+
+			    if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+			    std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+			    prev_elapsed_time_cpu_milliseconds =
+				    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+			    cpu_timer_started = false;
+		    }
+
+		    void startGpuTimer()
+		    {
+			    if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+			    gpu_timer_started = true;
+
+			    cudaEventRecord(event_start);
+		    }
+
+		    void endGpuTimer()
+		    {
+			    cudaEventRecord(event_end);
+			    cudaEventSynchronize(event_end);
+
+			    if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+			    cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+			    gpu_timer_started = false;
+		    }
+
+		    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+		    {
+			    return prev_elapsed_time_cpu_milliseconds;
+		    }
+
+		    float getGpuElapsedTimeForPreviousOperation() //noexcept
+		    {
+			    return prev_elapsed_time_gpu_milliseconds;
+		    }
+
+		    // remove copy and move functions
+		    PerformanceTimer(const PerformanceTimer&) = delete;
+		    PerformanceTimer(PerformanceTimer&&) = delete;
+		    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+		    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+	    private:
+		    cudaEvent_t event_start = nullptr;
+		    cudaEvent_t event_end = nullptr;
+
+		    using time_point_t = std::chrono::high_resolution_clock::time_point;
+		    time_point_t time_start_cpu;
+		    time_point_t time_end_cpu;
+
+		    bool cpu_timer_started = false;
+		    bool gpu_timer_started = false;
+
+		    float prev_elapsed_time_cpu_milliseconds = 0.f;
+		    float prev_elapsed_time_gpu_milliseconds = 0.f;
+	    };
+    }
+}