8 changes: 8 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,8 @@
{
"files.associations": {
"charconv": "cpp",
"xstring": "cpp",
"xtree": "cpp",
"chrono": "cpp"
}
}
132 changes: 126 additions & 6 deletions README.md
@@ -3,12 +3,132 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Richard Chen
* [LinkedIn](https://www.linkedin.com/in/richardrlchen/)
* Tested on: Windows 11, i7-10875H @ 2.3GHz 16GB, RTX 2060 MAXQ 6GB (PC)

## Overview
The purpose of this project was to explore parallel algorithms, something GPUs
excel at. To that end, we used the GPU to perform stream compaction via scan.
Stream compaction filters elements out of a list, in this case discarding the zero
elements from the input. Scan (prefix sum) is like a fold, except that every
intermediate result is also kept.
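
As a rough CPU sketch (not the project's CUDA code, and using hypothetical names), compaction via scan boils down to three steps: map each element to a keep/discard flag, exclusive-scan the flags to get output indices, and scatter the kept elements.

```cpp
#include <vector>

// Minimal CPU sketch of compaction via exclusive scan; the GPU version
// performs the same three steps with separate kernels.
std::vector<int> compactViaScan(const std::vector<int> &in) {
    std::vector<int> flags(in.size()), indices(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        flags[i] = (in[i] != 0);                 // map: 1 = keep, 0 = discard
    }
    int running = 0;
    for (size_t i = 0; i < in.size(); ++i) {     // exclusive scan of the flags
        indices[i] = running;
        running += flags[i];
    }
    std::vector<int> out(running);               // scan total = compacted length
    for (size_t i = 0; i < in.size(); ++i) {     // scatter the kept elements
        if (flags[i]) { out[indices[i]] = in[i]; }
    }
    return out;
}
```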

## Features
* CPU implementation of scan
* CPU implementation of stream compaction
* Naive parallel version of scan for GPU implemented in CUDA (see the sketch after this list)
* Work efficient parallel version of scan for GPU implemented in CUDA
* Stream Compaction that leverages the GPU scan
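
The naive scan follows the Hillis-Steele pattern; the kernel below is a hypothetical sketch of one step (the actual kernel and host loop in this repo may differ). The host launches it once per offset = 1, 2, 4, ... while ping-ponging between two device buffers, producing an inclusive scan that can be shifted right by one element to make it exclusive.

```cuda
// Hypothetical sketch of a single naive (Hillis-Steele) scan step.
__global__ void kernNaiveScanStep(int n, int offset, int *odata, const int *idata) {
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (index >= n) {
        return;
    }
    if (index >= offset) {
        odata[index] = idata[index - offset] + idata[index];  // add partner element
    } else {
        odata[index] = idata[index];                          // no partner yet, copy through
    }
}
```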

## Example Output
Tested on a 256-element array:
```
****************
** SCAN TESTS **
****************
[ 28 7 23 2 29 27 22 19 2 38 26 47 45 ... 43 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 0 28 35 58 60 89 116 138 157 159 197 223 270 ... 6050 6093 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0003ms (std::chrono Measured)
[ 0 28 35 58 60 89 116 138 157 159 197 223 270 ... 6016 6031 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.243616ms (CUDA Measured)
a[1] = 28, b[1] = 0
FAIL VALUE
==== naive scan, non-power-of-two ====
elapsed time: 0.17504ms (CUDA Measured)
a[1] = 28, b[1] = 0
FAIL VALUE
==== work-efficient scan, power-of-two ====
elapsed time: 0.325728ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.335584ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.1088ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.048928ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 3 1 3 1 1 0 2 2 0 0 1 2 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0013ms (std::chrono Measured)
[ 3 1 3 1 1 2 2 1 2 2 2 1 2 ... 3 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0012ms (std::chrono Measured)
[ 3 1 3 1 1 2 2 1 2 2 2 1 2 ... 2 2 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0015ms (std::chrono Measured)
[ 3 1 3 1 1 2 2 1 2 2 2 1 2 ... 3 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.36784ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.3792ms (CUDA Measured)
passed
Press any key to continue . . .
```

## Performance Analysis
### Optimizing Block Size
<img src="img/time_blocksize.png">

Block size does not seem to make a significant difference when tested on arrays of
length 2^16.
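
The sweep assumes the usual launch pattern: the threads-per-block count is the `blockSize` constant from `common.h`, and the grid is sized to cover the array. The snippet below is illustrative; `dev_bools` and `dev_idata` are placeholder device pointers, not names from this repo.

```cuda
// Illustrative launch: fullBlocksPerGrid covers n elements for the blockSize under test.
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bools, dev_idata);
checkCUDAError("kernMapToBoolean failed");
```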

### Scan Performance as a Function of Array Length
<img src="img/scan_time.png">

Notice that the time axis is log scaled. The crossover between the naive and
work-efficient implementations happens around length 2^19.

### Stream Compaction Performance as a Function of Array Length
<img src="img/stream_time.png">

Notice that the time axis is log scaled. Even with arrays of up to length
2^28, the single-threaded CPU is still faster at stream compaction.
This shows that parallelism and concurrency should be applied situationally:
in many scenarios the overhead outweighs the benefits.

### Nsight
Profiling the execution on arrays of length 2^16
<br>
<img src="img/profile_timeline.png">

In the CUDA HW row, green represents host-to-device memory transfers and
red represents device-to-host transfers. These pairs neatly bracket
the naive implementation, the work-efficient implementation, and the Thrust
library call.

The naive GPU version runs until roughly the 936 ms mark and takes about half as
long as the work-efficient implementation. The work-efficient implementation
iterates through the levels of the "tree" twice, once on the upsweep and once on
the downsweep, so it launches roughly twice as many kernel steps. Until the arrays
are large enough, the work it saves does not overcome the overhead of those extra
steps. When tested on length 2^20 arrays, the work-efficient implementation was
faster.
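
For reference, the upsweep and downsweep can be sketched as the two kernels below; the names and exact indexing are illustrative (the repo's kernels may differ), and the array is assumed padded to a power of two with the root zeroed before the downsweep starts.

```cuda
// Hypothetical sketch of the Blelloch (work-efficient) scan kernels.
// Each is launched once per tree level with stride = 2, 4, 8, ..., n
// (upsweep ascending, downsweep descending), using n / stride threads.
__global__ void kernUpSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n) {
        return;
    }
    data[index + stride - 1] += data[index + (stride >> 1) - 1];  // parent = left + right
}

__global__ void kernDownSweep(int n, int stride, int *data) {
    int index = ((blockIdx.x * blockDim.x) + threadIdx.x) * stride;
    if (index + stride - 1 >= n) {
        return;
    }
    int left = index + (stride >> 1) - 1;
    int right = index + stride - 1;
    int t = data[left];        // save left child
    data[left] = data[right];  // left child receives parent's partial sum
    data[right] += t;          // right child receives parent + old left
}
```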

The Thrust implementation is significantly faster and also makes far fewer kernel
calls. This seems reasonable: every kernel launch incurs overhead, so fitting the
operation into fewer launches cuts that cost substantially. Thrust likely also uses
shared memory, optimizes its memory access patterns, and performs the computation
more efficiently.
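
For comparison, the whole Thrust path can be reduced to a single library call; the wrapper below is a minimal sketch, not necessarily how the project's Thrust wrapper is written.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Minimal sketch: copy to the device, let Thrust do the exclusive scan, copy back.
void thrustScanSketch(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
```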

For all three implementations, there is a lot of idle time between kernel
executions, which suggests some form of memory or I/O bottleneck.


Binary file added img/profile_timeline.png
Binary file added img/scan_time.png
Binary file added img/stream_time.png
Binary file added img/time_blocksize.png
20 changes: 15 additions & 5 deletions src/main.cpp
@@ -13,21 +13,22 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 12; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];

int main(int argc, char* argv[]) {
int main(int argc, char *argv[])
{
// Scan tests

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -81,7 +82,16 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
// int tmpTest[] = {0, 1, 2, 3, 4, 5, 6, 7};
// int *tmpTestOut = new int[8];
// zeroArray(8, tmpTestOut);
// printDesc("Small array slides example scan");
// StreamCompaction::Efficient::scan(8, tmpTestOut, tmpTest);
// printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// //printArray(NPOT, c, true);
// printCmpResult(NPOT, b, c);
// delete[] tmpTestOut;

printDesc("thrust scan, power-of-two");
StreamCompaction::Thrust::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
@@ -102,7 +112,7 @@ int main(int argc, char* argv[]) {

// Compaction tests

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

39 changes: 29 additions & 10 deletions stream_compaction/common.cu
@@ -1,38 +1,57 @@
#include "common.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
void checkCUDAErrorFn(const char *msg, const char *file, int line)
{
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
if (cudaSuccess == err)
{
return;
}

fprintf(stderr, "CUDA error");
if (file) {
if (file)
{
fprintf(stderr, " (%s:%d)", file, line);
}
fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
exit(EXIT_FAILURE);
}


namespace StreamCompaction {
namespace Common {
namespace StreamCompaction
{
namespace Common
{

/**
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int *bools, const int *idata)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}
bools[index] = !(idata[index] == 0);
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
const int *idata, const int *bools, const int *indices)
{
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index >= n)
{
return;
}
if (bools[index])
{
odata[indices[index]] = idata[index];
}
}

}
49 changes: 35 additions & 14 deletions stream_compaction/common.h
@@ -10,32 +10,41 @@
#include <chrono>
#include <stdexcept>

#define blockSize 1024

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define checkCUDAErrorWithLine(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

/**
* Check for CUDA errors; print and exit if there was a problem.
*/
void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);

inline int ilog2(int x) {
inline int ilog2(int x)
{
int lg = 0;
while (x >>= 1) {
while (x >>= 1)
{
++lg;
}
return lg;
}

inline int ilog2ceil(int x) {
inline int ilog2ceil(int x)
{
return x == 1 ? 0 : ilog2(x - 1) + 1;
}

namespace StreamCompaction {
namespace Common {
namespace StreamCompaction
{
namespace Common
{
__global__ void kernMapToBoolean(int n, int *bools, const int *idata);

__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);
const int *idata, const int *bools, const int *indices);

/**
* This class is used for timing the performance
@@ -60,7 +69,10 @@ namespace StreamCompaction {

void startCpuTimer()
{
if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
if (cpu_timer_started)
{
throw std::runtime_error("CPU timer already started");
}
cpu_timer_started = true;

time_start_cpu = std::chrono::high_resolution_clock::now();
@@ -70,7 +82,10 @@
{
time_end_cpu = std::chrono::high_resolution_clock::now();

if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
if (!cpu_timer_started)
{
throw std::runtime_error("CPU timer not started");
}

std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
prev_elapsed_time_cpu_milliseconds =
@@ -81,7 +96,10 @@

void startGpuTimer()
{
if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
if (gpu_timer_started)
{
throw std::runtime_error("GPU timer already started");
}
gpu_timer_started = true;

cudaEventRecord(event_start);
@@ -92,7 +110,10 @@
cudaEventRecord(event_end);
cudaEventSynchronize(event_end);

if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
if (!gpu_timer_started)
{
throw std::runtime_error("GPU timer not started");
}

cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
gpu_timer_started = false;
@@ -109,10 +130,10 @@
}

// remove copy and move functions
PerformanceTimer(const PerformanceTimer&) = delete;
PerformanceTimer(PerformanceTimer&&) = delete;
PerformanceTimer& operator=(const PerformanceTimer&) = delete;
PerformanceTimer& operator=(PerformanceTimer&&) = delete;
PerformanceTimer(const PerformanceTimer &) = delete;
PerformanceTimer(PerformanceTimer &&) = delete;
PerformanceTimer &operator=(const PerformanceTimer &) = delete;
PerformanceTimer &operator=(PerformanceTimer &&) = delete;

private:
cudaEvent_t event_start = nullptr;