diff --git a/README.md b/README.md
index 0e38ddb..28d3411 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,165 @@
-CUDA Stream Compaction
-======================
+**University of Pennsylvania, CIS 565: GPU Programming and Architecture**
+# Project 2 - CUDA Stream Compaction
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+* Jonas Oppenheim ([LinkedIn](https://www.linkedin.com/in/jonasoppenheim/), [GitHub](https://github.com/oppenheimj/), [personal](http://www.jonasoppenheim.com/))
+* Tested on: Windows 10, Ryzen 9 5950x, 32GB, RTX 3080 (personal machine)
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+## Introduction
+This is the second project of my GPU Programming course at UPenn. The goal of this project was to implement two different algorithms both on the CPU and GPU and compare performance. Specifically, we were tasked with implementing the scan and stream compaction algorithms. Brief descriptions of both follow.
+The [scan](https://en.wikipedia.org/wiki/Prefix_sum) algorithm computes running sums of array elements up to each index. For example, given array X=[x1, x2, ..., xn], the output would be [x1, x1+x2, x1+x2+x3, ..., x1+...+xn]. The two variations of this algorithm are _inclusive_ and _exclusive_ scan; the only distinction is whether element i includes x_i itself. The stream compaction algorithm is essentially a high-performance filter operation that is commonly used to remove zeros from an array, and it uses scan as one of its steps. What these two algorithms have in common is that they are simple to understand and trivial to implement on the CPU, yet they parallelize extremely well, meaning that they're begging to be implemented on the GPU. The details underlying both algorithms and their parallel implementations are provided in [Chapter 39 of GPU Gems 3](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+
+I implemented parts 1-4 and bonus part 7, which uses shared memory in the work-efficient GPU scan kernel.
+
+The next section covers a performance overview of the various implementations and the final section contains my concluding thoughts about the assignment.
+
+## Performance analysis
+The following two plots show how the Scan and Stream Compaction algorithms perform on arrays of varying sizes on both the CPU and GPU. The block size used on the GPU was 128. Array sizes were tested at both exact powers of two and non-powers of two. There was no noticeable difference in performance between the two, since the algorithm begins by simply padding the non-power-of-two array up to a power of two. The plots below show only the power-of-two results.
+
+
+
+
+The left figure shows Scan runtimes. The naive GPU implementation is essentially as slow as the sequential CPU implementation. As expected, Thrust outperforms my "efficient" GPU implementation, but at least there is a noticeable gap between my naive and work-efficient GPU implementations.
+
+The right figure shows the Stream Compaction runtimes. Both CPU implementations slow down rapidly as the array grows, while the GPU implementation stays fast until it hits an Out Of Memory error. This is a clear example of the space-time tradeoff in algorithmic efficiency.
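For reference, the padding mentioned above just rounds the array length up to the next power of two; a minimal sketch (`nextPowerOfTwo` is an illustrative stand-in for the `ilog2ceil`-based computation in the code):

```cpp
#include <cassert>

// Round n up to the next power of two; the non-power-of-two case then
// behaves like the padded power-of-two case, which is why the plots
// show no meaningful difference between the two.
int nextPowerOfTwo(int n) {
    int p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}
```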
+
+A few additional notes follow.
+- There was an extremely pernicious bug where I used a `float` instead of an `int` and ended up with semi-deterministic off-by-one-or-two errors at large array sizes. This same value is a `float` in GPU Gems 3, [Chapter 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), which is the inspiration for this project.
+- The GPU Scan function signature originally looked like
+ ```
+ void scan(int n, int *odata, const int *idata)
+ ```
+ This function returns nothing and takes two pointers to CPU-side arrays. It begins by copying `idata` to the GPU and ends by copying the result from the GPU into `odata`. An issue I ran into was that for sufficiently large arrays the function needs to make a recursive call; the reason is covered towards the end of the aforementioned GPU Gems 3 chapter.
+
+ I modified this function signature to look like
+ ```
+ int* scan(int n, int *odata, int *idata)
+ ```
+ If the number of blocks needed is greater than 1, then a recursive call is made, where `odata` is `NULL` and `idata` is a GPU-side pointer.
+
+ The base case is when the requisite number of blocks is 1. In this case, if `odata` is supplied then it is assumed that there was no recursive call and `idata` is a CPU-side pointer. If `odata` was not supplied, then this is assumed to have been a recursive call, `idata` is assumed to be a GPU-side pointer, and the function returns another pointer to GPU memory.
+
+ What follows is an example of the logs generated during recursive calls.
+ ```
+ ==== work-efficient scan, power-of-two ====
+
+ paddedN: 16777216
+ grid size: 65536
+ block size: 128
+
+ paddedN: 65536
+ grid size: 256
+ block size: 128
+
+ paddedN: 256
+ grid size: 1
+ block size: 128
+ ```
+ This function was difficult but fun to implement, and I am pleased with how well it kept up with Thrust's scan implementation. Part of the reason may be that my kernel copies global memory into shared memory, which is much faster to access.
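The recursive structure described in the notes above — scan each block, recursively scan the per-block sums, then add each block's increment back in — can be modeled on the CPU like this (a sketch with an illustrative block size and names, not the CUDA code itself):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU model of the recursive blocked exclusive scan: each "block" is
// scanned independently, the per-block totals are scanned recursively
// (the recursive call in the GPU version), and each block's increment
// is then added back in, mirroring the kernScan + kernIncrement steps.
std::vector<int> blockedScan(const std::vector<int>& in, int blockSize) {
    int n = (int)in.size();
    std::vector<int> out(n, 0);
    int nBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> blockSums(nBlocks, 0);

    // Exclusive scan within each block, recording the block's total.
    for (int b = 0; b < nBlocks; b++) {
        int sum = 0;
        for (int i = b * blockSize; i < std::min(n, (b + 1) * blockSize); i++) {
            out[i] = sum;
            sum += in[i];
        }
        blockSums[b] = sum;
    }

    // Base case: a single block's local scan is already the answer.
    if (nBlocks > 1) {
        std::vector<int> blockIncrements = blockedScan(blockSums, blockSize);
        for (int i = 0; i < n; i++) {
            out[i] += blockIncrements[i / blockSize];
        }
    }
    return out;
}
```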
+
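The `float`-vs-`int` bug mentioned above is easy to reproduce in isolation: beyond 2^24 a 32-bit `float` cannot represent every integer, so an integer running sum stored in a `float` can silently land one or two away from the true value (a standalone illustration, not the project's code):

```cpp
#include <cassert>

// Store an int in a 32-bit float and convert back. Below 2^24 this is
// lossless; above it, the float rounds to the nearest representable
// value, which is exactly the off-by-one-or-two behavior observed.
int roundTripThroughFloat(int n) {
    float f = static_cast<float>(n);
    return static_cast<int>(f);
}
```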
+## Concluding thoughts
+- The work-efficient parallel scan algorithm was the most complicated code I've ever written. While coding, I felt as though I was building a house of cards in my mind; any distraction would knock it over and I'd have to start over. It simply required a large mental cache to make progress.
+- I think that well-written CPU code is self-documenting; if you use descriptive variable and function names, and give each function a single purpose, then there is little need for code comments. I'm starting to believe that no matter how nicely written CUDA code is, it will always require comments for a future reader. With CPU code, I may sacrifice some performance for readability. With GPU code, I'm realizing that all readability is sacrificed for performance. Machine-efficient code is simply not readable.
+- I think this may be a good assignment to start with because it shows clearly how to take a simple CPU algorithm and implement it on the GPU for maximum performance. We were also shown many ways these algorithms can be optimized, but I was a little frustrated that that lecture came *the day before* this was due. I think we could get a lot out of spending more time optimizing this algorithm.
+
+## Program output
+```
+****************
+** SCAN TESTS **
+****************
+ [ 8 18 44 9 36 16 44 17 6 39 36 19 10 ... 42 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 0.0061ms (std::chrono Measured)
+
+==== cpu scan, non-power-of-two ====
+ elapsed time: 0.007ms (std::chrono Measured)
+ passed
+
+==== naive scan, power-of-two ====
+ elapsed time: 0.591936ms (CUDA Measured)
+ passed
+
+==== naive scan, non-power-of-two ====
+ elapsed time: 0.576512ms (CUDA Measured)
+ passed
+
+==== work-efficient scan, power-of-two ====
+
+paddedN: 16384
+grid size: 32
+block size: 256
+
+paddedN: 32
+grid size: 1
+block size: 16
+ elapsed time: 0ms (CUDA Measured)
+ passed
+
+==== work-efficient scan, non-power-of-two ====
+
+paddedN: 16384
+grid size: 32
+block size: 256
+
+paddedN: 32
+grid size: 1
+block size: 16
+ elapsed time: 0ms (CUDA Measured)
+ passed
+
+==== thrust scan, power-of-two ====
+ elapsed time: 0.195584ms (CUDA Measured)
+ passed
+
+==== thrust scan, non-power-of-two ====
+ elapsed time: 0.041984ms (CUDA Measured)
+ passed
+
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 0 0 1 0 2 2 3 1 3 3 0 3 2 ... 2 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 0.0285ms (std::chrono Measured)
+ passed
+
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 0.0255ms (std::chrono Measured)
+ passed
+
+==== cpu compact with scan ====
+ elapsed time: 0.0806ms (std::chrono Measured)
+ passed
+
+==== work-efficient compact, power-of-two ====
+
+paddedN: 16384
+grid size: 32
+block size: 256
+
+paddedN: 32
+grid size: 1
+block size: 16
+ elapsed time: 0.045056ms (CUDA Measured)
+ [ 1 2 2 3 1 3 3 3 2 3 1 1 2 ... 1 2 ]
+ passed
+
+==== work-efficient compact, non-power-of-two ====
+
+paddedN: 16384
+grid size: 32
+block size: 256
+
+paddedN: 32
+grid size: 1
+block size: 16
+ elapsed time: 0.041984ms (CUDA Measured)
+ [ 1 2 2 3 1 3 3 3 2 3 1 1 2 ... 1 3 ]
+ passed
+
+Press any key to continue . . .
+```
diff --git a/img/compaction.jpg b/img/compaction.jpg
new file mode 100644
index 0000000..9fe6423
Binary files /dev/null and b/img/compaction.jpg differ
diff --git a/img/download.png b/img/download.png
new file mode 100644
index 0000000..bf1b95a
Binary files /dev/null and b/img/download.png differ
diff --git a/img/proj2_1.png b/img/proj2_1.png
new file mode 100644
index 0000000..fcbef17
Binary files /dev/null and b/img/proj2_1.png differ
diff --git a/img/proj2_2.png b/img/proj2_2.png
new file mode 100644
index 0000000..85dcabe
Binary files /dev/null and b/img/proj2_2.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..c231574 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,39 +13,51 @@
#include
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 16;
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
int main(int argc, char* argv[]) {
- // Scan tests
-
printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");
- genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
+ genArray(SIZE - 1, a, 50);
+ // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);
- // initialize b using StreamCompaction::CPU::scan you implement
- // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
- // At first all cases passed because b && c are all zeroes.
+
+ // We have arrays a, b, and c.
+ // a - This array contains the original data
+ // b - This array we initialized with our CPU::scan() function
+ // c - This array is blank and gets repeatedly populated and then wiped each test
+
+ // For bug-finding only: Array of 1s to help find bugs in stream compaction or scan.
+ //onesArray(SIZE, c);
+ //printDesc("1s array for finding bugs");
+ //StreamCompaction::Naive::scan(SIZE, c, a);
+ //printArray(SIZE, c, true);
+ // This populates b, which is used for later comparisons.
zeroArray(SIZE, b);
printDesc("cpu scan, power-of-two");
StreamCompaction::CPU::scan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(SIZE, b, true);
+ //printArray(SIZE, b, true);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(NPOT, b, true);
+ //printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
@@ -53,12 +65,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
-
- /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
- onesArray(SIZE, c);
- printDesc("1s array for finding bugs");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printArray(SIZE, c, true); */
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
@@ -66,6 +73,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
@@ -73,13 +81,15 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
+ printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("thrust scan, power-of-two");
@@ -87,6 +97,7 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("thrust scan, non-power-of-two");
@@ -94,15 +105,15 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
+ std::cout << std::endl;
printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
printf("*****************************\n");
- // Compaction tests
-
- genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
+ genArray(SIZE - 1, a, 4);
+ // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);
@@ -115,16 +126,18 @@ int main(int argc, char* argv[]) {
count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedCount = count;
- printArray(count, b, true);
+ //printArray(count, b, true);
printCmpLenResult(count, expectedCount, b, b);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("cpu compact without scan, non-power-of-two");
count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
expectedNPOT = count;
- printArray(count, c, true);
+ //printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("cpu compact with scan");
@@ -132,22 +145,26 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
+ printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);
+ std::cout << std::endl;
zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
+ printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+ std::cout << std::endl;
- system("pause"); // stop Win32 console from closing on exit
+ // stop Win32 console from closing on exit
+ system("pause");
delete[] a;
delete[] b;
delete[] c;
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..6a32a91 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -3,6 +3,7 @@
#include "common.h"
+#include
namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
@@ -19,7 +20,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ odata[0] = 0;
+ for (int i = 1; i < n; i++) {
+ odata[i] = odata[i - 1] + idata[i - 1];
+ }
timer().endCpuTimer();
}
@@ -30,9 +34,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ int o = 0;
+ for (int i = 0; i < n; i++) {
+ if (idata[i] != 0) {
+ odata[o] = idata[i];
+ o++;
+ }
+ }
timer().endCpuTimer();
- return -1;
+ return o;
}
/**
@@ -42,9 +52,27 @@ namespace StreamCompaction {
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ int* mask = new int[n];
+ for (int i = 0; i < n; i++) {
+ mask[i] = idata[i] == 0 ? 0 : 1;
+ }
+
+ int* scatter = new int[n];
+
+ scatter[0] = 0;
+ for (int i = 1; i < n; i++) {
+ scatter[i] = scatter[i - 1] + mask[i - 1];
+ }
+
+ int o = 0;
+ for (int i = 0; i < n; i++) {
+ if (mask[i] != 0) {
+ odata[scatter[i]] = idata[i];
+ o++;
+ }
+ }
timer().endCpuTimer();
- return -1;
+ return o;
}
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..8ac9059 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,6 +2,7 @@
#include
#include "common.h"
#include "efficient.h"
+#include <iostream>
+#include <algorithm>
namespace StreamCompaction {
namespace Efficient {
@@ -12,13 +13,194 @@ namespace StreamCompaction {
return timer;
}
+ __global__ void kernIncrement(int n, int* global_odata, int* global_blockIncrements) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index < n) {
+ global_odata[2 * index + 0] += global_blockIncrements[blockIdx.x];
+ global_odata[2 * index + 1] += global_blockIncrements[blockIdx.x];
+ }
+ }
+
+ __global__ void kernScan(int nThreadsNeeded, int n, int* global_idata, int* global_odata, int* global_blockSums) {
+ // n is 2x the blockSize (total num elements being operated on by the block)
+ extern __shared__ int shared_array[];
+
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index >= nThreadsNeeded) {
+ return;
+ }
+
+ shared_array[2 * threadIdx.x + 0] = global_idata[2 * index + 0];
+ shared_array[2 * threadIdx.x + 1] = global_idata[2 * index + 1];
+
+ // SWEEP UP
+ int offset = 1;
+ // Cute way of looping e.g. if n = 1024, then d = 512, 256, 128, 64, 32, 16, 8, 4, 2, 1
+ for (int d = n >> 1; d > 0; d >>= 1) {
+ __syncthreads();
+
+ if (threadIdx.x < d) {
+ int thisThreadsIndex1 = offset * (2 * threadIdx.x + 1) - 1;
+ int thisThreadsIndex2 = offset * (2 * threadIdx.x + 2) - 1;
+
+ shared_array[thisThreadsIndex2] += shared_array[thisThreadsIndex1];
+ }
+
+ offset *= 2;
+ }
+
+ // SWEEP DOWN
+ if (threadIdx.x == 0) {
+ if (global_blockSums) {
+ global_blockSums[blockIdx.x] = shared_array[n - 1];
+ }
+
+ shared_array[n - 1] = 0;
+ }
+
+ for (int d = 1; d < n; d *= 2) {
+ // offset starts at n (its value after the up-sweep) and halves each iteration
+ offset >>= 1;
+ __syncthreads();
+
+ if (threadIdx.x < d) {
+ int thisThreadsIndex1 = offset * (2 * threadIdx.x + 1) - 1;
+ int thisThreadsIndex2 = offset * (2 * threadIdx.x + 2) - 1;
+
+ int tmp = shared_array[thisThreadsIndex1];
+ shared_array[thisThreadsIndex1] = shared_array[thisThreadsIndex2];
+ shared_array[thisThreadsIndex2] += tmp;
+ }
+ }
+
+ __syncthreads();
+ global_odata[2 * index + 0] = shared_array[2 * threadIdx.x + 0];
+ global_odata[2 * index + 1] = shared_array[2 * threadIdx.x + 1];
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
+ int* scan(int n, int *odata, int *idata) {
+ int paddedN = 1 << ilog2ceil(n);
+ int nThreadsNeeded = paddedN / 2;
+ int blockSize = std::min(128, nThreadsNeeded);
+ int nBlocks = ceil(nThreadsNeeded * 1.0 / blockSize);
+ int nElementsPerBlock = 2 * blockSize;
+
+ std::cout << std::endl;
+ std::cout << "paddedN: " << paddedN << std::endl;
+ std::cout << "grid size: " << nBlocks << std::endl;
+ std::cout << "block size: " << blockSize << std::endl;
+
+ int* device_odata;
+ cudaMalloc((void**)&device_odata, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc device_odata failed!");
+ cudaDeviceSynchronize();
+
+ if (nBlocks == 1) {
+ if (odata) {
+ int* device_idata;
+ cudaMalloc((void**)&device_idata, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc device_idata failed!");
+ cudaDeviceSynchronize();
+
+ cudaMemcpy(device_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy device_idata failed!");
+ cudaDeviceSynchronize();
+
+ timer().startGpuTimer();
+ kernScan<<<1, blockSize, nElementsPerBlock * sizeof(int)>>>(nThreadsNeeded, nElementsPerBlock, device_idata, device_odata, NULL);
+ timer().endGpuTimer();
+ checkCUDAError("kernScan failed!");
+ cudaDeviceSynchronize();
+
+ cudaMemcpy(odata, device_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy device_odata failed!");
+ cudaDeviceSynchronize();
+ return NULL;
+ } else {
+ kernScan<<<1, blockSize, nElementsPerBlock * sizeof(int)>>>(nThreadsNeeded, nElementsPerBlock, idata, device_odata, NULL);
+ checkCUDAError("kernScan failed!");
+ cudaDeviceSynchronize();
+
+ return device_odata;
+ }
+ } else {
+ int* device_idata;
+ cudaMalloc((void**)&device_idata, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc device_idata failed!");
+ cudaDeviceSynchronize();
+
+ int* device_blockSums;
+ cudaMalloc((void**)&device_blockSums, nBlocks * sizeof(int));
+ checkCUDAError("cudaMalloc device_blockSums failed!");
+ cudaDeviceSynchronize();
+
+ cudaMemcpy(device_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy idata failed!");
+ cudaDeviceSynchronize();
+
+ //if (odata) {
+ // timer().startGpuTimer();
+ //}
+
+ kernScan<<<nBlocks, blockSize, nElementsPerBlock * sizeof(int)>>>(nThreadsNeeded, nElementsPerBlock, device_idata, device_odata, device_blockSums);
+ checkCUDAError("kernScan failed!");
+ cudaDeviceSynchronize();
+
+ int* device_blockIncrements = scan(nBlocks, NULL, device_blockSums); // (blockSumScan)
+ checkCUDAError("Recursive call failed!");
+ cudaDeviceSynchronize();
+
+
+ kernIncrement<<<nBlocks, blockSize>>>(nThreadsNeeded, device_odata, device_blockIncrements);
+ checkCUDAError("kernIncrement failed!");
+ cudaDeviceSynchronize();
+
+ //if (odata) {
+ // timer().endGpuTimer();
+ //}
+
+ if (!odata) {
+ return device_odata;
+ }
+ }
+
+ cudaMemcpy(odata, device_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy failed!");
+ cudaDeviceSynchronize();
+
+ return NULL;
+ }
+
+ __global__ void kernMapToBoolean(int n, int* global_idata, int* global_booleanMask) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (n <= index) {
+ return;
+ }
+
+ global_booleanMask[index] = global_idata[index] ? 1 : 0;
+ }
+
+ __global__ void kernScatter(int n, int* global_idata, int* global_odata, int* global_booleanMask, int* global_booleanMaskScan) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (n <= index) {
+ return;
+ }
+
+ if (global_booleanMask[index]) {
+ global_odata[global_booleanMaskScan[index]] = global_idata[index];
+ }
+
+ __syncthreads();
+ if (index == 0) {
+ global_booleanMaskScan[n - 1] += global_booleanMask[n - 1];
+ }
}
/**
@@ -30,11 +212,41 @@ namespace StreamCompaction {
* @param idata The array of elements to compact.
* @returns The number of elements remaining after compaction.
*/
- int compact(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- return -1;
+ int compact(int n, int *odata, int *idata) {
+ int paddedN = 1 << ilog2ceil(n);
+ int blockSize = 256;
+ int nBlocks = ceil(paddedN * 1.0 / blockSize);
+
+ int* device_idata;
+ cudaMalloc((void**)&device_idata, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc device_idata failed!");
+
+ int* device_booleanMask;
+ cudaMalloc((void**)&device_booleanMask, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc device_booleanMask failed!");
+
+ int* device_odata;
+ cudaMalloc((void**)&device_odata, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc device_odata failed!");
+
+ cudaMemcpy(device_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ cudaDeviceSynchronize();
+
+ timer().startGpuTimer();
+ kernMapToBoolean<<<nBlocks, blockSize>>>(n, device_idata, device_booleanMask);
+ cudaDeviceSynchronize();
+
+ int* device_booleanMaskScan = scan(n, NULL, device_booleanMask);
+
+ kernScatter<<<nBlocks, blockSize>>>(n, device_idata, device_odata, device_booleanMask, device_booleanMaskScan);
+ timer().endGpuTimer();
+
+ int size;
+ cudaMemcpy(&size, device_booleanMaskScan + paddedN - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ cudaMemcpy(odata, device_odata, paddedN * sizeof(int), cudaMemcpyDeviceToHost);
+
+ cudaFree(device_idata);
+ cudaFree(device_booleanMask);
+ cudaFree(device_odata);
+ cudaFree(device_booleanMaskScan);
+ return size;
}
}
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..068767e 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -6,8 +6,8 @@ namespace StreamCompaction {
namespace Efficient {
StreamCompaction::Common::PerformanceTimer& timer();
- void scan(int n, int *odata, const int *idata);
+ int* scan(int n, int *odata, int *idata);
- int compact(int n, int *odata, const int *idata);
+ int compact(int n, int *odata, int *idata);
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..20fab94 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,6 +2,7 @@
#include
#include "common.h"
#include "naive.h"
+#include
namespace StreamCompaction {
namespace Naive {
@@ -11,15 +12,72 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
- // TODO: __global__
+
+ __global__ void kernIterateScan(int n, int d, int* srcBuffer, int* desBuffer) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (n <= index) {
+ return;
+ }
+
+ int p = 1 << (d - 1);
+ if (index < n) {
+ desBuffer[index] = srcBuffer[index] + (p <= index ? srcBuffer[index - p] : 0);
+ }
+ }
+
+ // This is totally amateurish, but hey, this *is* supposed to be the "naive" scan!
+ __global__ void kernConvertInclusiveToExclusive(int n, int* srcBuffer, int* desBuffer) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+ if (index && index < n) {
+ desBuffer[index] = srcBuffer[index-1];
+ }
+ }
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
+ int paddedN = 1 << ilog2ceil(n);
+
+ int blockSize = 128;
+ int gridSize = ceil(paddedN * 1.0 / blockSize);
+ int* device_flipflopA;
+ int* device_flipflopB;
+
+ int* device_srcBuffer;
+ int* device_desBuffer;
+
+ cudaMalloc((void**)&device_flipflopA, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc dev_A failed!");
+
+ cudaMalloc((void**)&device_flipflopB, paddedN * sizeof(int));
+ checkCUDAError("cudaMalloc dev_B failed!");
+
+ cudaMemcpy(device_flipflopA, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ cudaDeviceSynchronize();
+
+ timer().startGpuTimer();
+ for (int d = 1; d <= ilog2ceil(n); d++) {
+ device_srcBuffer = d % 2 == 0 ? device_flipflopB : device_flipflopA;
+ device_desBuffer = d % 2 == 0 ? device_flipflopA : device_flipflopB;
+ kernIterateScan<<<gridSize, blockSize>>>(n, d, device_srcBuffer, device_desBuffer);
+ cudaDeviceSynchronize();
+ checkCUDAError("Iteration error");
+ }
+
+ kernConvertInclusiveToExclusive<<<gridSize, blockSize>>>(n, device_desBuffer, device_srcBuffer);
+ cudaDeviceSynchronize();
+ timer().endGpuTimer();
+
+ cudaMemcpy(odata, device_srcBuffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+ cudaDeviceSynchronize();
+
+ cudaFree(device_flipflopA);
+ cudaFree(device_flipflopB);
+
+ odata[0] = 0; // kernConvertInclusiveToExclusive leaves index 0 untouched, so write the identity here
}
}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..c1a25f4 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -14,15 +14,24 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO use `thrust::exclusive_scan`
- // example: for device_vectors dv_in and dv_out:
- // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
- timer().endGpuTimer();
+ thrust::host_vector<int> thrustHostVector(idata, idata + n);
+ thrust::device_vector<int> thrustDeviceVector = thrustHostVector;
+
+ timer().startGpuTimer();
+ thrust::exclusive_scan(thrustDeviceVector.begin(), thrustDeviceVector.end(), thrustDeviceVector.begin());
+ timer().endGpuTimer();
+
+ // This came from Stack Overflow.
+ // There is undoubtedly a better way to do this.
+ int i = 0;
+ for (thrust::device_vector<int>::iterator iter = thrustDeviceVector.begin(); iter != thrustDeviceVector.end(); iter++) {
+ odata[i++] = *iter;
+ }
}
}
}