CIS565-Fall-2021 · JiyuHuang · Sep 19, 2021 · Sep 19, 2021 · Sep 20, 2021 · Sep 20, 2021
diff --git a/README.md b/README.md
@@ -3,12 +3,107 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Jiyu Huang
+  * [LinkedIn](https://www.linkedin.com/in/jiyu-huang-0123/)
+* Tested on: Windows 10, i7-6700 @ 3.41GHz 16GB, Quadro P1000 4096MB (Towne 062 Lab)
 
-### (TODO: Your README)
+### Overview
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project involves CUDA implementation of GPU parallel algorithms such as scan (prefix sum), stream compaction, and Radix sort. Specifically, the following are implemented:
 
+* CPU scan (serialized, used as comparison)
+* CPU stream compact (serialized, used as comparison)
+* CPU stream compact with CPU scan (serialized, used as comparison)
+* naive version of GPU scan
+* work-efficient GPU scan
+* work-efficient GPU scan with shared memory and no bank conflict
+* GPU stream compaction using optimized work-efficient GPU scan
+* GPU Radix sort using optimized work-efficient GPU scan
+
+Thrust library's version of exclusive scan is also used as comparison.
+
+### Performance Analysis
+
+The performance of various implementations of scan are illustrated below.
+
+![performance_chart](/img/performance_chart.png)
+![performance_chart_large](/img/performance_chart_large.png)
+
+* As can be seen from the graph, starting from array size 2^15 (32768), GPU algorithms show performance advantages towards the CPU implementation, due to the advantages of parallelism.
+
+* The naive version of GPU scan performs sufficiently well until the array size reaches 2^17 (131072), after which point the performance drops significantly. As shown in the large array graph, it becomes the slowest implementation, even slower than CPU implementation. This is due to the fact that the naive version of GPU scan is not work efficient and in total performs the most amount of computations.
+
+* The work-efficient version of GPU scan initially performs worse than other implementations, but catches up and ends up reducing execution time compared to CPU implemetation and naive GPU implementation. The initial slowness likely results from the extra amount of kernel invocations.
+
+* The shared memory optimization has a significant impact on improving performance and is the fastest implementation in this project, as it should be; operating on shared memory efficiently does prove to be much faster than operating on global memory.
+
+* Thrust library's scan function is almost always the fastest version (except for when the array size is small, or when the array size goes from 2^17 (131072) to 2^18 (262144), where Thrust's scan function has a sudden performance drop).
+
+I would like to delve deeper into the execution timelines for each implementation (and understand why Thrust's scan is so fast), but since I am using the lab computer with no admin access to enable Nsight tracing, I'm temporarily unable to do that.
+
+### Test Results
+
+The following test output is generated with array size of one million (2^20). Extra Radix sort tests (one for power-of-two, one for non-power-of-two) testing the sorting correctness are also included at the end.
+
+```
+****************
+** SCAN TESTS **
+****************
+    [  19  18  31  41  16  17   6  12  41   4   7  45  31 ...  11   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 1.6692ms    (std::chrono Measured)
+    [   0  19  37  68 109 125 142 148 160 201 205 212 257 ... 25680674 25680685 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 1.6274ms    (std::chrono Measured)
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 2.83338ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 2.64646ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 1.11798ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 0.829952ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 0.434304ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 0.463584ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   2   1   2   1   2   2   1   2   0   3   1   1 ...   0   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 3.0586ms    (std::chrono Measured)
+    [   2   2   1   2   1   2   2   1   2   3   1   1   1 ...   1   1 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 2.5556ms    (std::chrono Measured)
+    [   2   2   1   2   1   2   2   1   2   3   1   1   1 ...   2   1 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 5.8896ms    (std::chrono Measured)
+    [   2   2   1   2   1   2   2   1   2   3   1   1   1 ...   1   1 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 1.22845ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 1.25226ms    (CUDA Measured)
+    passed
+
+*****************************
+** RADIX SORT TESTS **
+*****************************
+    [   2   6  17   6  13  14  18  13   2  16  19   9   1 ...   4   0 ]
+==== radix sort, power-of-two ====
+    passed
+==== radix sort, non-power-of-two ====
+    passed
+```
diff --git a/img/data.xlsx b/img/data.xlsx
diff --git a/img/performance_chart.png b/img/performance_chart.png
diff --git a/img/performance_chart_large.png b/img/performance_chart_large.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 20; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -44,7 +44,7 @@ int main(int argc, char* argv[]) {
     printDesc("cpu scan, non-power-of-two");
     StreamCompaction::CPU::scan(NPOT, c, a);
     printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
-    printArray(NPOT, b, true);
+    //printArray(NPOT, b, true);
     printCmpResult(NPOT, b, c);
 
     zeroArray(SIZE, c);
@@ -147,6 +147,29 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+    printf("\n");
+    printf("*****************************\n");
+    printf("** RADIX SORT TESTS **\n");
+    printf("*****************************\n");
+
+    genArray(SIZE - 1, a, 20);  // Leave a 0 at the end to test that edge case
+    a[SIZE - 1] = 0;
+    printArray(SIZE, a, true);
+
+    zeroArray(SIZE, b);
+    printDesc("radix sort, power-of-two");
+    StreamCompaction::Efficient::radixSort(SIZE, b, a);
+    std::memcpy(c, a, SIZE * sizeof(int));
+    std::sort(c, c + SIZE);
+    printCmpResult(SIZE, b, c);
+
+    zeroArray(NPOT, b);
+    printDesc("radix sort, non-power-of-two");
+    StreamCompaction::Efficient::radixSort(NPOT, b, a);
+    std::memcpy(c, a, NPOT * sizeof(int));
+    std::sort(c, c + NPOT);
+    printCmpResult(NPOT, b, c);
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -23,7 +23,11 @@ namespace StreamCompaction {
          * which map to 0 will be removed, and elements which map to 1 will be kept.
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
-            // TODO
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            bools[index] = idata[index] == 0 ? 0 : 1;
         }
 
         /**
@@ -32,7 +36,13 @@ namespace StreamCompaction {
          */
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
-            // TODO
+            int index = blockIdx.x * blockDim.x + threadIdx.x;
+            if (index >= n) {
+                return;
+            }
+            if (bools[index] == 1) {
+                odata[indices[index]] = idata[index]; 
+            }
         }
 
     }

diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -13,6 +13,8 @@
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
 
+const int blockSize = 512;
+
 /**
  * Check for CUDA errors; print and exit if there was a problem.
  */

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -12,14 +12,21 @@ namespace StreamCompaction {
             return timer;
         }
 
+        void scanHelper(int n, int* odata, const int* idata) {
+            odata[0] = 0;
+            for (int i = 1; i < n; ++i) {
+                odata[i] = idata[i - 1] + odata[i - 1];
+            }
+        }
+
         /**
          * CPU scan (prefix sum).
          * For performance analysis, this is supposed to be a simple for loop.
          * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
          */
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            scanHelper(n, odata, idata);
             timer().endCpuTimer();
         }
 
@@ -30,9 +37,14 @@ namespace StreamCompaction {
          */
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int j = 0;
+            for (int i = 0; i < n; ++i) {
+                if (idata[i] != 0) {
+                    odata[j++] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return j;
         }
 
         /**
@@ -42,9 +54,21 @@ namespace StreamCompaction {
          */
         int compactWithScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
-            // TODO
+            int *boolArray = new int[n];
+            int *scanArray = new int[n];
+            for (int i = 0; i < n; ++i) {
+                boolArray[i] = idata[i] == 0 ? 0 : 1;
+            }
+            scanHelper(n, scanArray, boolArray);
+            int num = 0;
+            for (int i = 0; i < n; ++i) {
+                if (boolArray[i] != 0) {
+                    odata[scanArray[i]] = idata[i];
+                    ++num;
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return num;
         }
     }
 }