Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 101 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,107 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Jiyu Huang
* [LinkedIn](https://www.linkedin.com/in/jiyu-huang-0123/)
* Tested on: Windows 10, i7-6700 @ 3.41GHz 16GB, Quadro P1000 4096MB (Towne 062 Lab)

### (TODO: Your README)
### Overview

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project involves CUDA implementation of GPU parallel algorithms such as scan (prefix sum), stream compaction, and Radix sort. Specifically, the following are implemented:

* CPU scan (serialized, used as comparison)
* CPU stream compact (serialized, used as comparison)
* CPU stream compact with CPU scan (serialized, used as comparison)
* naive version of GPU scan
* work-efficient GPU scan
* work-efficient GPU scan with shared memory and no bank conflict
* GPU stream compaction using optimized work-efficient GPU scan
* GPU Radix sort using optimized work-efficient GPU scan

Thrust library's version of exclusive scan is also used as comparison.

### Performance Analysis

The performance of various implementations of scan are illustrated below.

![performance_chart](/img/performance_chart.png)
![performance_chart_large](/img/performance_chart_large.png)

* As can be seen from the graph, starting from array size 2^15 (32768), GPU algorithms show performance advantages towards the CPU implementation, due to the advantages of parallelism.

* The naive version of GPU scan performs sufficiently well until the array size reaches 2^17 (131072), after which point the performance drops significantly. As shown in the large array graph, it becomes the slowest implementation, even slower than CPU implementation. This is due to the fact that the naive version of GPU scan is not work efficient and in total performs the most amount of computations.

* The work-efficient version of GPU scan initially performs worse than other implementations, but catches up and ends up reducing execution time compared to CPU implemetation and naive GPU implementation. The initial slowness likely results from the extra amount of kernel invocations.

* The shared memory optimization has a significant impact on improving performance and is the fastest implementation in this project, as it should be; operating on shared memory efficiently does prove to be much faster than operating on global memory.

* Thrust library's scan function is almost always the fastest version (except for when the array size is small, or when the array size goes from 2^17 (131072) to 2^18 (262144), where Thrust's scan function has a sudden performance drop).

I would like to delve deeper into the execution timelines for each implementation (and understand why Thrust's scan is so fast), but since I am using the lab computer with no admin access to enable Nsight tracing, I'm temporarily unable to do that.

### Test Results

The following test output is generated with array size of one million (2^20). Extra Radix sort tests (one for power-of-two, one for non-power-of-two) testing the sorting correctness are also included at the end.

```
****************
** SCAN TESTS **
****************
[ 19 18 31 41 16 17 6 12 41 4 7 45 31 ... 11 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.6692ms (std::chrono Measured)
[ 0 19 37 68 109 125 142 148 160 201 205 212 257 ... 25680674 25680685 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.6274ms (std::chrono Measured)
passed
==== naive scan, power-of-two ====
elapsed time: 2.83338ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 2.64646ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.11798ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.829952ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.434304ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.463584ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 2 1 2 1 2 2 1 2 0 3 1 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 3.0586ms (std::chrono Measured)
[ 2 2 1 2 1 2 2 1 2 3 1 1 1 ... 1 1 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 2.5556ms (std::chrono Measured)
[ 2 2 1 2 1 2 2 1 2 3 1 1 1 ... 2 1 ]
passed
==== cpu compact with scan ====
elapsed time: 5.8896ms (std::chrono Measured)
[ 2 2 1 2 1 2 2 1 2 3 1 1 1 ... 1 1 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 1.22845ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.25226ms (CUDA Measured)
passed

*****************************
** RADIX SORT TESTS **
*****************************
[ 2 6 17 6 13 14 18 13 2 16 19 9 1 ... 4 0 ]
==== radix sort, power-of-two ====
passed
==== radix sort, non-power-of-two ====
passed
```
Binary file added img/data.xlsx
Binary file not shown.
Binary file added img/performance_chart.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/performance_chart_large.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
27 changes: 25 additions & 2 deletions src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
Expand Down Expand Up @@ -44,7 +44,7 @@ int main(int argc, char* argv[]) {
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);
//printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
Expand Down Expand Up @@ -147,6 +147,29 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("*****************************\n");
printf("** RADIX SORT TESTS **\n");
printf("*****************************\n");

genArray(SIZE - 1, a, 20); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("radix sort, power-of-two");
StreamCompaction::Efficient::radixSort(SIZE, b, a);
std::memcpy(c, a, SIZE * sizeof(int));
std::sort(c, c + SIZE);
printCmpResult(SIZE, b, c);

zeroArray(NPOT, b);
printDesc("radix sort, non-power-of-two");
StreamCompaction::Efficient::radixSort(NPOT, b, a);
std::memcpy(c, a, NPOT * sizeof(int));
std::sort(c, c + NPOT);
printCmpResult(NPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
Expand Down
14 changes: 12 additions & 2 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index >= n) {
return;
}
bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
Expand All @@ -32,7 +36,13 @@ namespace StreamCompaction {
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index >= n) {
return;
}
if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
Expand Down
2 changes: 2 additions & 0 deletions stream_compaction/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

const int blockSize = 512;

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
Expand Down
34 changes: 29 additions & 5 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,21 @@ namespace StreamCompaction {
return timer;
}

void scanHelper(int n, int* odata, const int* idata) {
odata[0] = 0;
for (int i = 1; i < n; ++i) {
odata[i] = idata[i - 1] + odata[i - 1];
}
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
scanHelper(n, odata, idata);
timer().endCpuTimer();
}

Expand All @@ -30,9 +37,14 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int j = 0;
for (int i = 0; i < n; ++i) {
if (idata[i] != 0) {
odata[j++] = idata[i];
}
}
timer().endCpuTimer();
return -1;
return j;
}

/**
Expand All @@ -42,9 +54,21 @@ namespace StreamCompaction {
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int *boolArray = new int[n];
int *scanArray = new int[n];
for (int i = 0; i < n; ++i) {
boolArray[i] = idata[i] == 0 ? 0 : 1;
}
scanHelper(n, scanArray, boolArray);
int num = 0;
for (int i = 0; i < n; ++i) {
if (boolArray[i] != 0) {
odata[scanArray[i]] = idata[i];
++num;
}
}
timer().endCpuTimer();
return -1;
return num;
}
}
}
Loading