8 changes: 8 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,8 @@
{
"files.associations": {
"charconv": "cpp",
"xstring": "cpp",
"xtree": "cpp",
"chrono": "cpp"
}
}
132 changes: 126 additions & 6 deletions README.md
@@ -3,12 +3,132 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Richard Chen
* [LinkedIn](https://www.linkedin.com/in/richardrlchen/)
* Tested on: Windows 11, i7-10875H @ 2.3GHz 16GB, RTX 2060 MAXQ 6GB (PC)

## Overview
The purpose of this project was to explore parallel algorithms, something GPUs
excel at. To that end, we used the GPU to perform stream compaction via scan.
Stream compaction filters elements out of a list, in this case discarding the zero
elements from the input. Scan (prefix sum) is like a fold, except that every
intermediate result is also kept.
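
As a rough CPU sketch (not the project's CUDA code, and using hypothetical names), compaction via scan boils down to three steps: map each element to a keep/discard flag, exclusive-scan the flags to get output indices, and scatter the kept elements.

```cpp
#include <vector>

// Minimal CPU sketch of compaction via exclusive scan; the GPU version
// performs the same three steps with separate kernels.
std::vector<int> compactViaScan(const std::vector<int> &in) {
    std::vector<int> flags(in.size()), indices(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        flags[i] = (in[i] != 0);                 // map: 1 = keep, 0 = discard
    }
    int running = 0;
    for (size_t i = 0; i < in.size(); ++i) {     // exclusive scan of the flags
        indices[i] = running;
        running += flags[i];
    }
    std::vector<int> out(running);               // scan total = compacted length
    for (size_t i = 0; i < in.size(); ++i) {     // scatter the kept elements
        if (flags[i]) { out[indices[i]] = in[i]; }
    }
    return out;
}
```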

## Features
* CPU implementation of scan
* CPU implementation of stream compaction
* Naive parallel version of scan for GPU implemented in CUDA (see the sketch after this list)
* Work efficient parallel version of scan for GPU implemented in CUDA
* Stream Compaction that leverages the GPU scan
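
The naive scan follows the Hillis-Steele pattern; the kernel below is a hypothetical sketch of one step (the actual kernel and host loop in this repo may differ). The host launches it once per offset = 1, 2, 4, ... while ping-ponging between two device buffers, producing an inclusive scan that can be shifted right by one element to make it exclusive.

```cuda
// Hypothetical sketch of a single naive (Hillis-Steele) scan step.
__global__ void kernNaiveScanStep(int n, int offset, int *odata, const int *idata) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n) {
        return;
    }
    if (index >= offset) {
        odata[index] = idata[index - offset] + idata[index];  // add partner element
    } else {
        odata[index] = idata[index];                          // no partner yet, copy through
    }
}
```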

## Example Output
Tested on a 256-element array:
```
****************
** SCAN TESTS **
****************
[ 28 7 23 2 29 27 22 19 2 38 26 47 45 ... 43 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 0 28 35 58 60 89 116 138 157 159 197 223 270 ... 6050 6093 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0003ms (std::chrono Measured)
[ 0 28 35 58 60 89 116 138 157 159 197 223 270 ... 6016 6031 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.243616ms (CUDA Measured)
a[1] = 28, b[1] = 0
FAIL VALUE
==== naive scan, non-power-of-two ====
elapsed time: 0.17504ms (CUDA Measured)
a[1] = 28, b[1] = 0
FAIL VALUE
==== work-efficient scan, power-of-two ====
elapsed time: 0.325728ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.335584ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.1088ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.048928ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 1 3 1 1 0 2 2 0 0 1 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0013ms (std::chrono Measured)
[ 3 1 3 1 1 2 2 1 2 2 2 1 2 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0012ms (std::chrono Measured)
[ 3 1 3 1 1 2 2 1 2 2 2 1 2 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0015ms (std::chrono Measured)
[ 3 1 3 1 1 2 2 1 2 2 2 1 2 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.36784ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.3792ms (CUDA Measured)
passed
Press any key to continue . . .
```

## Performance Analysis
### Optimizing Block Size
<img src="img/time_blocksize.png">

Block size does not seem to make a significant difference when tested on arrays of
length 2^16.
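
The sweep assumes the usual launch pattern: the threads-per-block count is the `blockSize` constant from `common.h`, and the grid is sized to cover the array. The snippet below is illustrative; `dev_bools` and `dev_idata` are placeholder device pointers, not names from this repo.

```cuda
// Illustrative launch: fullBlocksPerGrid covers n elements for the blockSize under test.
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
checkCUDAError("kernMapToBoolean failed");
```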

### Scan Performance as a Function of Array Length
<img src="img/scan_time.png">

Notice that the time axis is log scaled. The crossover between the naive and
work-efficient implementations happens around length 2^19.

### Stream Compaction Performance as a Function of Array Length
<img src="img/stream_time.png">

Notice that the time axis is log scaled. Even with arrays of up to length
2^28, the single-threaded CPU is still faster at stream compaction.
This shows that parallelism and concurrency should be applied situationally:
in many scenarios the overhead outweighs the benefits.

### Nsight
Profiling the execution on arrays of length 2^16
<br>
<img src="img/profile_timeline.png">

In the CUDA HW row, green represents host-to-device memory transfers and
red represents device-to-host transfers. These pairs neatly bracket
the naive implementation, the work-efficient implementation, and the Thrust
library call.

The naive GPU version runs until roughly the 936 ms mark and takes about half as
long as the work-efficient implementation. The work-efficient implementation
iterates through the levels of the "tree" twice, once on the upsweep and once on
the downsweep, so it launches roughly twice as many kernel steps. Until the arrays
are large enough, the work it saves does not overcome the overhead of those extra
steps. When tested on length 2^20 arrays, the work-efficient implementation was
faster.
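
For reference, the upsweep and downsweep can be sketched as the two kernels below; the names and exact indexing are illustrative (the repo's kernels may differ), and the array is assumed padded to a power of two with the root zeroed before the downsweep starts.

```cuda
// Hypothetical sketch of the Blelloch (work-efficient) scan kernels.
// Each is launched once per tree level with stride = 2, 4, 8, ..., n
// (upsweep ascending, downsweep descending), using n / stride threads.
__global__ void kernUpSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n) {
        return;
    }
    data[index + stride - 1] += data[index + (stride >> 1) - 1];  // parent = left + right
}

__global__ void kernDownSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n) {
        return;
    }
    int left = index + (stride >> 1) - 1;
    int right = index + stride - 1;
    int t = data[left];        // save left child
    data[left] = data[right];  // left child receives parent's partial sum
    data[right] += t;          // right child receives parent + old left
}
```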

The Thrust implementation is significantly faster and also makes far fewer kernel
calls. This seems reasonable: every kernel launch incurs overhead, so fitting the
operation into fewer launches cuts that cost substantially. Thrust likely also uses
shared memory, optimizes its memory access patterns, and performs the computation
more efficiently.
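
For comparison, the whole Thrust path can be reduced to a single library call; the wrapper below is a minimal sketch, not necessarily how the project's Thrust wrapper is written.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Minimal sketch: copy to the device, let Thrust do the exclusive scan, copy back.
void thrustScanSketch(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
```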

For all three implementations, there is a lot of idle time between kernel
executions, which suggests some form of memory or I/O bottleneck.


Binary file added img/profile_timeline.png
Binary file added img/scan_time.png
Binary file added img/stream_time.png
Binary file added img/time_blocksize.png
20 changes: 15 additions & 5 deletions src/main.cpp
@@ -13,21 +13,22 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 12; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main(int argc, char* argv[]) {
int main(int argc, char *argv[])
{
// Scan tests

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -81,7 +82,16 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
// int tmpTest[] = {0, 1, 2, 3, 4, 5, 6, 7};
// int *tmpTestOut = new int[8];
// zeroArray(8, tmpTestOut);
// printDesc("Small array slides example scan");
// StreamCompaction::Efficient::scan(8, tmpTestOut, tmpTest);
// printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// //printArray(NPOT, c, true);
// printCmpResult(NPOT, b, c);
// delete[] tmpTestOut;

printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
@@ -102,7 +112,7 @@ int main(int argc, char* argv[]) {

// Compaction tests

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

39 changes: 29 additions & 10 deletions stream_compaction/common.cu
@@ -1,38 +1,57 @@
#include "common.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
void checkCUDAErrorFn(const char *msg, const char *file, int line)
{
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
if (cudaSuccess == err)
{
return;
}

fprintf(stderr, "CUDA error");
if (file) {
if (file)
{
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
}


namespace StreamCompaction {
namespace Common {
namespace StreamCompaction
{
namespace Common
{

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int *bools, const int *idata)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}
bools[index] = !(idata[index] == 0);
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
const int *idata, const int *bools, const int *indices)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}
if (bools[index])
{
odata[indices[index]] = idata[index];
}
}

}
49 changes: 35 additions & 14 deletions stream_compaction/common.h
@@ -10,32 +10,41 @@
#include <chrono>
#include <stdexcept>

#define blockSize 1024

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define checkCUDAErrorWithLine(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
inline int ilog2(int x)
{
int lg = 0;
while (x >>= 1) {
while (x >>= 1)
{
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
inline int ilog2ceil(int x)
{
return x == 1 ? 0 : ilog2(x - 1) + 1;
}

namespace StreamCompaction {
namespace Common {
namespace StreamCompaction
{
namespace Common
{
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);
const int *idata, const int *bools, const int *indices);

/**
* This class is used for timing the performance
@@ -60,7 +69,10 @@ namespace StreamCompaction {

void startCpuTimer()
{
if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
if (cpu_timer_started)
{
throw std::runtime_error("CPU timer already started");
}
cpu_timer_started = true;

time_start_cpu = std::chrono::high_resolution_clock::now();
@@ -70,7 +82,10 @@
{
time_end_cpu = std::chrono::high_resolution_clock::now();

if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
if (!cpu_timer_started)
{
throw std::runtime_error("CPU timer not started");
}

std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
prev_elapsed_time_cpu_milliseconds =
@@ -81,7 +96,10 @@

void startGpuTimer()
{
if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
if (gpu_timer_started)
{
throw std::runtime_error("GPU timer already started");
}
gpu_timer_started = true;

cudaEventRecord(event_start);
@@ -92,7 +110,10 @@
cudaEventRecord(event_end);
cudaEventSynchronize(event_end);

if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
if (!gpu_timer_started)
{
throw std::runtime_error("GPU timer not started");
}

cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
gpu_timer_started = false;
@@ -109,10 +130,10 @@
}

// remove copy and move functions
PerformanceTimer(const PerformanceTimer&) = delete;
PerformanceTimer(PerformanceTimer&&) = delete;
PerformanceTimer& operator=(const PerformanceTimer&) = delete;
PerformanceTimer& operator=(PerformanceTimer&&) = delete;
PerformanceTimer(const PerformanceTimer &) = delete;
PerformanceTimer(PerformanceTimer &&) = delete;
PerformanceTimer &operator=(const PerformanceTimer &) = delete;
PerformanceTimer &operator=(PerformanceTimer &&) = delete;

private:
cudaEvent_t event_start = nullptr;