diff --git a/README.md b/README.md
index 0e38ddb..b7b6615 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,210 @@
-CUDA Stream Compaction
-======================
-
+# CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Haoquan Liang
+ * [LinkedIn](https://www.linkedin.com/in/leohaoquanliang/)
+* Tested on: Windows 10, Ryzen 7 5800X 8 Core 3.80 GHz, NVIDIA GeForce RTX 3080 Ti 12 GB
+
+# Table of Contents
+* [Features](#features)
+* [Performance Analysis](#performance-analysis)
+* [Extra Credit](#extra-credit)
+* [Debugging Notes/Example](#debugging-notesexample)
+* [Output Sample](#output-sample)
+* [Future Improvement](#future-improvement)
+
+
+
+# Features
+* CPU Scan
+* CPU Compaction with/without Scan
+* GPU Naive Scan
+* GPU Work-efficient Scan
+* GPU Compaction with Scan
+* Thrust Scan/Compaction
+* GPU Scan Optimization (Extra Credit)
+* Radix Sort (Extra Credit)
+
+
+
+# Performance Analysis
+All data below was collected in Release mode.
+## Scan
+
+![Scan performance, power-of-two input](img/scan-pot.png)
+![Scan performance, non-power-of-two input](img/scan-npot.png)
+
+Thrust's implementation is always the fastest. The GPU work-efficient scan starts outperforming the CPU once the input size exceeds 2^20, while the GPU naive scan is consistently the slowest.
+This is expected: the GPU implementations carry extra overhead (kernel launches and extra passes over the data), so the benefits of parallelism only show up once the input is large enough.
+
+## Compaction
+
+![Compaction performance, power-of-two input](img/compact-pot.png)
+![Compaction performance, non-power-of-two input](img/compact-npot.png)
+
+CPU compaction with scan is the least efficient here, which makes sense: it does a lot of extra work (the scan and scatter passes) without any benefit from parallelism. GPU work-efficient compaction starts outperforming the plain CPU implementation once the input size exceeds 2^18. That is earlier than in the scan case, because compaction's map and scatter steps parallelize very well.
+
+## Sort
+
+![Sort performance, power-of-two input](img/sort-pot.png)
+![Sort performance, non-power-of-two input](img/sort-npot.png)
+
+The CPU side uses std::sort, which is typically implemented as introsort and runs in O(N log N) time. That is already a very efficient comparison sort, but the GPU radix sort still manages to beat it once the input size exceeds 2^23.
+
+
+
+# Extra Credit
+### Why is My GPU Approach So Slow?
+There are several reasons the GPU approach can end up slower; here are some of them:
+* The work-efficient GPU scan performs more total operations than the sequential CPU scan. That is inherent to the algorithm, so there is not much to improve there.
+* The implementation uses global memory instead of the much faster shared memory.
+* In both the up-sweep and the down-sweep, many threads idle when d is large. I added an early exit so a thread returns immediately if its index does no work at the current d. A further improvement would be to launch only the active threads and remap their indices, which also improves data locality (see the sketch after this list).
+* Block size affects performance. I tried varying the block size with the input size, as well as very small and very large fixed block sizes; all of these hurt performance. Thrust appears to use a block size of 128, so I used 128 in my implementation too. It would be interesting to dig into the reasons in the future.
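+
+As a sketch of that indexing change (hypothetical code, not what is currently in the repo; `kernUpSweepCompact` is an illustrative name), each level d could launch only `n >> (d + 1)` threads and remap each dense thread id to the slot it owns, so no thread idles behind the `index % (1 << (d + 1))` test:
+```
+// Hypothetical variant of kernUpSweep: launch exactly (n >> (d + 1)) threads.
+__global__ void kernUpSweepCompact(int numActive, int d, int* x) {
+    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+    if (tid >= numActive) {
+        return;
+    }
+    int stride = 1 << (d + 1);   // distance between partial sums at this level
+    int index = tid * stride;    // remap the dense thread id to its slot
+    x[index + stride - 1] += x[index + (1 << d) - 1];
+}
+
+// Launch with a shrinking grid instead of a fixed n-sized one:
+// int numActive = n >> (d + 1);
+// dim3 grid((numActive + blockSize - 1) / blockSize);
+// kernUpSweepCompact<<<grid, blockSize>>>(numActive, d, dev_data);
+```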
+
+### Radix Sort
+I implemented radix sort and wrote a test case that uses std::sort on the CPU as a reference. The algorithm produces correct results and starts outperforming the CPU once the input size exceeds 2^23.
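+
+Each pass of the sort splits the array on one bit using the e/f/t/d arrays (the same names used in efficient.cu). As a small hand-computed example for bit 0 (illustrative, not program output):
+```
+input      = [4 7 2 6 3]
+e          = [1 0 1 1 0]   // 1 where bit 0 is 0 ("false")
+f          = [0 1 1 2 3]   // exclusive scan of e
+totalFalse = e[4] + f[4] = 0 + 3 = 3
+t[i]       = i - f[i] + totalFalse = [3 3 4 4 4]
+d[i]       = e[i] ? f[i] : t[i]    = [0 3 1 2 4]
+output[d[i]] = input[i]  ->  [4 2 6 7 3]   // 0-bit elements first, order preserved
+```
+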
+In main.cpp, I generate a random input array a, sort it with std::sort at a power-of-two size (into b) and a non-power-of-two size (into c), then call StreamCompaction::Efficient::radixSort() for both sizes (into d) and print whether each GPU result matches the CPU reference.
+This is the result:
+```
+*****************************
+** RADIX SORT TESTS **
+*****************************
+ [ 71 23 76 5 97 9 58 85 81 38 37 70 2 ... 80 0 ]
+==== cpu std::sort, power-of-two ====
+ elapsed time: 1086.39ms (std::chrono Measured)
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ]
+==== cpu std::sort, non-power-of-two ====
+ elapsed time: 1067.79ms (std::chrono Measured)
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 99 99 ]
+==== radix sort, power-of-two ====
+ elapsed time: 803.583ms (CUDA Measured)
+ passed
+==== radix sort, non-power-of-two ====
+ elapsed time: 785.894ms (CUDA Measured)
+ passed
+```
+produced by these lines in main.cpp:
+```
+ printf("\n");
+ printf("*****************************\n");
+ printf("** RADIX SORT TESTS **\n");
+ printf("*****************************\n");
+ // Sort Test
+ genArray(SIZE - 1, a, 100);
+ printArray(SIZE, a, true);
+
+ zeroArray(SIZE, b);
+ printDesc("cpu std::sort, power-of-two");
+ StreamCompaction::CPU::sort(SIZE, b, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ printArray(SIZE, b, true);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu std::sort, non-power-of-two");
+ StreamCompaction::CPU::sort(NPOT, c, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ printArray(NPOT, c, true);
+
+ zeroArray(SIZE, d);
+ printDesc("radix sort, power-of-two");
+ StreamCompaction::Efficient::radixSort(SIZE, d, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printCmpResult(SIZE, b, d);
+
+ zeroArray(SIZE, d);
+ printDesc("radix sort, non-power-of-two");
+ StreamCompaction::Efficient::radixSort(NPOT, d, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printCmpResult(NPOT, c, d);
+
+```
+
+
+
+# Debugging Notes/Example
+For this project I mainly used the Watch and Memory tools in the Visual Studio debugger. To see clearly how an array changes at each step, I copy the intermediate result from the kernel to a local CPU array, use Watch to find its address, and then inspect it in the Memory window to check whether it holds the expected result.
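+
+A minimal sketch of that inspection pattern (the names `dbg` and `dev_d` are illustrative, not code from the repo):
+```
+// copy a device array back to the host right after a kernel
+// so it can be inspected in the VS Memory window
+int* dbg = new int[arraySize];
+cudaDeviceSynchronize();   // make sure the kernel has finished
+cudaMemcpy(dbg, dev_d, arraySize * sizeof(int), cudaMemcpyDeviceToHost);
+// set a breakpoint here, add `dbg` to Watch, and open its address in Memory
+delete[] dbg;
+```
+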
+Following is an example:
+
+My radix sort was producing a wrong result, so I used the method above to step through each stage. At step 5 the result was not what I expected: all the 0s and 2s should come before the 1s and 3s, but instead they were mixed together. That pointed to the kernScatter function, and checking it led me straight to the bug.
+
+
+
+# Output Sample
+
+For input size 2^26:
+```
+
+****************
+** SCAN TESTS **
+****************
+ [ 21 17 24 5 47 20 8 18 12 37 41 44 21 ... 25 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 36.0359ms (std::chrono Measured)
+ [ 0 21 38 62 67 114 134 142 160 172 209 250 294 ... 1643665856 1643665881 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 35.3576ms (std::chrono Measured)
+ [ 0 21 38 62 67 114 134 142 160 172 209 250 294 ... 1643665797 1643665811 ]
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 53.4917ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 47.9857ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 24.1193ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 21.9279ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 0.766976ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 0.86528ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 3 2 1 3 3 0 3 2 2 1 1 0 2 ... 2 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 128.336ms (std::chrono Measured)
+ [ 3 2 1 3 3 3 2 2 1 1 2 3 2 ... 2 2 ]
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 122.404ms (std::chrono Measured)
+ [ 3 2 1 3 3 3 2 2 1 1 2 3 2 ... 2 2 ]
+ passed
+==== cpu compact with scan ====
+ elapsed time: 271.842ms (std::chrono Measured)
+ [ 3 2 1 3 3 3 2 2 1 1 2 3 2 ... 2 2 ]
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 33.268ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 33.4285ms (CUDA Measured)
+ passed
-### (TODO: Your README)
+*****************************
+** RADIX SORT TESTS **
+*****************************
+ [ 34 42 27 4 19 27 7 9 22 13 49 20 41 ... 8 0 ]
+==== cpu std::sort, power-of-two ====
+ elapsed time: 946.368ms (std::chrono Measured)
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
+==== cpu std::sort, non-power-of-two ====
+ elapsed time: 908.604ms (std::chrono Measured)
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
+==== radix sort, power-of-two ====
+ elapsed time: 846.838ms (CUDA Measured)
+ passed
+==== radix sort, non-power-of-two ====
+ elapsed time: 815.581ms (CUDA Measured)
+ passed
+```
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+
+# Future Improvement
+* Use shared memory for the GPU scan (see the sketch after this list)
+* Improve data locality so that more idle threads can be retired early
+* Dig into Thrust's implementation to understand why it is so efficient
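+
+As a starting point for the shared-memory idea, a per-block Blelloch scan might look like the sketch below (hypothetical code: it assumes the input is padded to a multiple of `2 * blockDim.x`, and the per-block sums in `blockSums` would still need to be scanned and added back in a second pass):
+```
+// Hypothetical per-block exclusive scan in shared memory (Blelloch).
+// Launch: kernBlockScan<<<grid, blockSize, 2 * blockSize * sizeof(int)>>>(...)
+__global__ void kernBlockScan(int* odata, const int* idata, int* blockSums) {
+    extern __shared__ int temp[];
+    int tid = threadIdx.x;
+    int offset = 2 * blockIdx.x * blockDim.x;
+    temp[2 * tid]     = idata[offset + 2 * tid];
+    temp[2 * tid + 1] = idata[offset + 2 * tid + 1];
+
+    // up-sweep: build partial sums in shared memory
+    for (int stride = 1; stride < 2 * blockDim.x; stride *= 2) {
+        __syncthreads();
+        int idx = (tid + 1) * stride * 2 - 1;
+        if (idx < 2 * blockDim.x) {
+            temp[idx] += temp[idx - stride];
+        }
+    }
+    // record the block total, then zero the root for the down-sweep
+    if (tid == 0) {
+        blockSums[blockIdx.x] = temp[2 * blockDim.x - 1];
+        temp[2 * blockDim.x - 1] = 0;
+    }
+    // down-sweep: distribute the partial sums back down the tree
+    for (int stride = blockDim.x; stride >= 1; stride /= 2) {
+        __syncthreads();
+        int idx = (tid + 1) * stride * 2 - 1;
+        if (idx < 2 * blockDim.x) {
+            int t = temp[idx - stride];
+            temp[idx - stride] = temp[idx];
+            temp[idx] += t;
+        }
+    }
+    __syncthreads();
+    odata[offset + 2 * tid]     = temp[2 * tid];
+    odata[offset + 2 * tid + 1] = temp[2 * tid + 1];
+}
+```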
diff --git a/img/compact-npot.png b/img/compact-npot.png
new file mode 100644
index 0000000..11cb35d
Binary files /dev/null and b/img/compact-npot.png differ
diff --git a/img/compact-pot.png b/img/compact-pot.png
new file mode 100644
index 0000000..a45b93b
Binary files /dev/null and b/img/compact-pot.png differ
diff --git a/img/debug_radix_sort0.png b/img/debug_radix_sort0.png
new file mode 100644
index 0000000..a09634d
Binary files /dev/null and b/img/debug_radix_sort0.png differ
diff --git a/img/debug_radix_sort1.png b/img/debug_radix_sort1.png
new file mode 100644
index 0000000..34619f4
Binary files /dev/null and b/img/debug_radix_sort1.png differ
diff --git a/img/debug_radix_sort2.png b/img/debug_radix_sort2.png
new file mode 100644
index 0000000..faba2a8
Binary files /dev/null and b/img/debug_radix_sort2.png differ
diff --git a/img/debug_radix_sort3.png b/img/debug_radix_sort3.png
new file mode 100644
index 0000000..a010c87
Binary files /dev/null and b/img/debug_radix_sort3.png differ
diff --git a/img/debug_radix_sort4.png b/img/debug_radix_sort4.png
new file mode 100644
index 0000000..e6bd54a
Binary files /dev/null and b/img/debug_radix_sort4.png differ
diff --git a/img/debug_radix_sort5.png b/img/debug_radix_sort5.png
new file mode 100644
index 0000000..3e3829d
Binary files /dev/null and b/img/debug_radix_sort5.png differ
diff --git a/img/debug_radix_sort6.png b/img/debug_radix_sort6.png
new file mode 100644
index 0000000..1558452
Binary files /dev/null and b/img/debug_radix_sort6.png differ
diff --git a/img/debug_radix_sort7--changed-D.png b/img/debug_radix_sort7--changed-D.png
new file mode 100644
index 0000000..9060ad6
Binary files /dev/null and b/img/debug_radix_sort7--changed-D.png differ
diff --git a/img/debug_radix_sort8-still-wrong.png b/img/debug_radix_sort8-still-wrong.png
new file mode 100644
index 0000000..4adcb99
Binary files /dev/null and b/img/debug_radix_sort8-still-wrong.png differ
diff --git a/img/debug_radix_sort9-yes.png b/img/debug_radix_sort9-yes.png
new file mode 100644
index 0000000..894fd82
Binary files /dev/null and b/img/debug_radix_sort9-yes.png differ
diff --git a/img/scan-npot.png b/img/scan-npot.png
new file mode 100644
index 0000000..77ab74a
Binary files /dev/null and b/img/scan-npot.png differ
diff --git a/img/scan-pot.png b/img/scan-pot.png
new file mode 100644
index 0000000..9a35893
Binary files /dev/null and b/img/scan-pot.png differ
diff --git a/img/sort-npot.png b/img/sort-npot.png
new file mode 100644
index 0000000..0a5d8f5
Binary files /dev/null and b/img/sort-npot.png differ
diff --git a/img/sort-pot.png b/img/sort-pot.png
new file mode 100644
index 0000000..f30fe78
Binary files /dev/null and b/img/sort-pot.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..c26b930 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,11 +13,13 @@
#include
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 26; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
+int* d = new int[SIZE];
+
int main(int argc, char* argv[]) {
// Scan tests
@@ -147,8 +149,42 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+ printf("\n");
+ printf("*****************************\n");
+ printf("** RADIX SORT TESTS **\n");
+ printf("*****************************\n");
+ // Sort Test
+ genArray(SIZE - 1, a, 50);
+ printArray(SIZE, a, true);
+
+ zeroArray(SIZE, b);
+ printDesc("cpu std::sort, power-of-two");
+ StreamCompaction::CPU::sort(SIZE, b, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ printArray(SIZE, b, true);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu std::sort, non-power-of-two");
+ StreamCompaction::CPU::sort(NPOT, c, a);
+ printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
+ printArray(NPOT, c, true);
+
+ zeroArray(SIZE, d);
+ printDesc("radix sort, power-of-two");
+ StreamCompaction::Efficient::radixSort(SIZE, d, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printCmpResult(SIZE, b, d);
+
+ zeroArray(SIZE, d);
+ printDesc("radix sort, non-power-of-two");
+ StreamCompaction::Efficient::radixSort(NPOT, d, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printCmpResult(NPOT, c, d);
+
+
system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
+ delete[] d;
}
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..72b08f2 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -23,16 +23,28 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
- // TODO
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ bools[index] = idata[index] == 0 ? 0 : 1;
}
/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
- __global__ void kernScatter(int n, int *odata,
- const int *idata, const int *bools, const int *indices) {
- // TODO
+ __global__ void kernScatter(int n, int* odata,
+ const int* idata, const int* bools, const int* indices) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ if (bools[index]) {
+ odata[indices[index]] = idata[index];
+ }
+
}
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..19fe7c9 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,6 +1,5 @@
#include <cstdio>
+#include <algorithm> // needed for std::sort below
#include "cpu.h"
-
#include "common.h"
namespace StreamCompaction {
@@ -19,7 +18,12 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ int identity = 0;
+ odata[0] = identity; // exclusive scan
+ for (int i = 1; i < n; ++i)
+ {
+ odata[i] = odata[i - 1] + idata[i - 1];
+ }
timer().endCpuTimer();
}
@@ -30,9 +34,17 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
- // TODO
+ int oIndex = 0;
+ for (int i = 0; i < n; ++i)
+ {
+ if (idata[i] != 0)
+ {
+ odata[oIndex] = idata[i];
+ oIndex++;
+ }
+ }
timer().endCpuTimer();
- return -1;
+ return oIndex; // the number of elements remaining
}
/**
@@ -41,10 +53,45 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
+ int* keepData = new int[n];
+ // Step 1 and Step 2
+ timer().startCpuTimer();
+ int identity = 0;
+ odata[0] = identity; // exclusive scan
+
+ for (int i = 0; i < n; ++i)
+ {
+ keepData[i] = (idata[i] == 0)? 0 : 1;
+ if (i > 0)
+ odata[i] = odata[i - 1] + keepData[i - 1];
+
+ }
+
+ // Step 3
+ int oIndex = 0;
+ for (int i = 0; i < n; ++i)
+ {
+ if (keepData[i])
+ {
+ odata[oIndex] = idata[i];
+ oIndex++;
+ }
+ }
+ timer().endCpuTimer();
+ delete[] keepData;
+ return oIndex;
+ }
+
+ /*
+ * CPU sort using std::sort
+ */
+ void sort(int n, int* odata, const int* idata)
+ {
+ memcpy(odata, idata, n * sizeof(int));
timer().startCpuTimer();
- // TODO
+ std::sort(odata, odata + n);
timer().endCpuTimer();
- return -1;
}
+
}
}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 873c047..222b77a 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -11,5 +11,7 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata);
int compactWithScan(int n, int *odata, const int *idata);
+
+ void sort(int n, int* odata, const int* idata);
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..cf0c998 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "efficient.h"
+#define blockSize 128
+
namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +14,74 @@ namespace StreamCompaction {
return timer;
}
+ __global__ void kernUpSweep(int n, int d, int* x)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ // only indices that are multiples of 2^(d+1) do work at this level
+ if (index % (1 << (d + 1)) == 0)
+ {
+ x[index + (1 << (d + 1)) - 1] += x[index + (1 << d) - 1];
+ }
+ }
+
+ __global__ void kernDownSweep(int n, int d, int* x)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+
+ if (index % (1 << (d + 1)) == 0)
+ {
+ // save the left child, overwrite it with the value passed down,
+ // and add the saved sum into the right child
+ int t = x[index + (1 << d) - 1];
+ x[index + (1 << d) - 1] = x[index + (1 << (d + 1)) - 1];
+ x[index + (1 << (d + 1)) - 1] += t;
+ }
+ }
+ void upDownSweep(int n, int* data)
+ {
+ dim3 fullBlocksPerGrid((blockSize + n - 1) / blockSize);
+
+ for (int d = 0; d <= ilog2ceil(n) - 1; ++d) {
+ kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(n, d, data);
+ }
+ cudaDeviceSynchronize();
+
+ // set the root to the identity (0) before the down-sweep
+ cudaMemset(data + n - 1, 0, sizeof(int));
+ checkCUDAError("cudaMemset failed!");
+
+ for (int d = ilog2ceil(n) - 1; d >= 0; --d) {
+ kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(n, d, data);
+ }
+ cudaDeviceSynchronize();
+ }
+
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
- void scan(int n, int *odata, const int *idata) {
+ void scan(int n, int* odata, const int* idata) {
+ int intermArraySize = 1 << ilog2ceil(n); // pad up to the next power of two
+
+ int* dev_data;
+ cudaMalloc((void**)&dev_data, intermArraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_data failed!");
+ cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
timer().startGpuTimer();
- // TODO
+
+ upDownSweep(intermArraySize, dev_data);
+
timer().endGpuTimer();
+ cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
+ cudaFree(dev_data);
+
}
/**
@@ -30,11 +93,191 @@ namespace StreamCompaction {
* @param idata The array of elements to compact.
* @returns The number of elements remaining after compaction.
*/
- int compact(int n, int *odata, const int *idata) {
+ int compact(int n, int* odata, const int* idata) {
+ int arraySize = 1 << ilog2ceil(n);
+ // size the grid for the padded array, since kernScatter runs over arraySize
+ dim3 fullBlocksPerGrid((arraySize + blockSize - 1) / blockSize);
+
+ int* dev_indices;
+ int* dev_bool;
+ int* dev_idata;
+ int* dev_odata;
+ cudaMalloc((void**)&dev_indices, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_indices failed!");
+ cudaMalloc((void**)&dev_bool, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_bool failed!");
+ cudaMalloc((void**)&dev_idata, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_idata failed!");
+
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy idata to dev_idata failed!");
+
+ // zero dev_bool's padded tail so the exclusive scan over arraySize,
+ // and the lastBool read below, see valid data in the non-power-of-two case
+ cudaMemset(dev_bool + n, 0, (arraySize - n) * sizeof(int));
+ checkCUDAError("cudaMemset dev_bool failed!");
+
+
timer().startGpuTimer();
- // TODO
+ // Step 1
+ StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, dev_bool, dev_idata);
+ cudaDeviceSynchronize();
+
+ cudaMemcpy(dev_indices, dev_bool, arraySize * sizeof(int), cudaMemcpyDeviceToDevice);
+ checkCUDAError("cudaMemcpy dev_bool to dev_indices failed!");
+
+
+ // Step 2
+ upDownSweep(arraySize, dev_indices);
+
+ int returnSize = 0;
+ cudaMemcpy(&returnSize, dev_indices + arraySize - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_indices to host failed!");
+
+ // Di shared this edge case with me:
+ // the scan is exclusive, so dev_indices[arraySize - 1] does not include
+ // the last element's bool; when the input ends in a non-zero element the
+ // count would be off by one, so add the last bool back in.
+ int lastBool = 0;
+ cudaMemcpy(&lastBool, dev_bool + arraySize - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_bool to host failed!");
+ returnSize += lastBool;
+
+
+ cudaMalloc((void**)&dev_odata, returnSize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+
+
+ // Step 3
+ StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(arraySize, dev_odata,
+ dev_idata, dev_bool, dev_indices);
+ cudaDeviceSynchronize();
+
+
timer().endGpuTimer();
- return -1;
+
+ cudaMemcpy(odata, dev_odata, returnSize * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_odata to host failed!");
+
+ cudaFree(dev_indices);
+ cudaFree(dev_odata);
+ cudaFree(dev_idata);
+ cudaFree(dev_bool);
+
+ return returnSize;
}
+
+ __global__ void kernComputeE(int n, int i, int* odata, int* idata)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ // e = 1 when bit i of the element is 0 ("false"), 0 when the bit is 1
+ odata[index] = (idata[index] << (31 - i)) >> 31 ? 0 : 1;
+ }
+ __global__ void kernComputeT(int n, int f, int* odata, int* idata)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ // t[i] = i - f[i] + totalFalse (destination for elements whose bit is 1)
+ odata[index] = index - idata[index] + f;
+ }
+ __global__ void kernComputeD(int n, int* d, int* b, int* t, int* f)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ // b here is the e array: e == 1 (bit is 0) scatters to f, otherwise to t
+ d[index] = b[index] == 0 ? t[index] : f[index];
+ }
+
+ __global__ void kernScatter(int n, int* d, int* odata, int* idata)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ odata[d[index]] = idata[index];
+ }
+
+ void radixSort(int n, int* odata, const int* idata)
+ {
+ int arraySize = 1 << ilog2ceil(n);
+ // size the grid for the padded array, since several kernels run over arraySize
+ dim3 fullBlocksPerGrid((arraySize + blockSize - 1) / blockSize);
+
+ int* dev_e;
+ int* dev_f;
+ int* dev_t;
+ int* dev_d;
+ int* dev_idata;
+ int* dev_odata;
+ cudaMalloc((void**)&dev_e, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_e failed!");
+ cudaMalloc((void**)&dev_f, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_f failed!");
+ cudaMalloc((void**)&dev_t, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_t failed!");
+ cudaMalloc((void**)&dev_d, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_d failed!");
+ cudaMalloc((void**)&dev_idata, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_idata failed!");
+ cudaMalloc((void**)&dev_odata, arraySize * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy idata to dev_idata failed!");
+
+ timer().startGpuTimer();
+
+ for (int i = 0; i < 32; ++i)
+ {
+ // Step 1: compute e array
+ kernComputeE<<<fullBlocksPerGrid, blockSize>>>(n, i, dev_e, dev_idata);
+ cudaDeviceSynchronize();
+ // pad the tail with 0s for the non-power-of-two case
+ if (arraySize > n)
+ {
+ cudaMemset(dev_e + n, 0, (arraySize - n) * sizeof(int));
+ }
+
+
+ // Step 2: exclusive scan e array and store it in f
+ cudaMemcpy(dev_f, dev_e, arraySize * sizeof(int), cudaMemcpyDeviceToDevice);
+ upDownSweep(arraySize, dev_f);
+ cudaDeviceSynchronize();
+
+
+ // Step 3: compute total false
+ int totalFalse = 0;
+ int lastF = 0;
+ cudaMemcpy(&totalFalse, dev_e + arraySize - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_e to host failed!");
+ cudaDeviceSynchronize();
+
+ cudaMemcpy(&lastF, dev_f + arraySize - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_f to host failed!");
+ cudaDeviceSynchronize();
+ totalFalse += lastF;
+
+ // Step 4: compute t array
+ kernComputeT<<<fullBlocksPerGrid, blockSize>>>(arraySize, totalFalse, dev_t, dev_f);
+ cudaDeviceSynchronize();
+
+
+ // Step 5: scatter
+ kernComputeD<<<fullBlocksPerGrid, blockSize>>>(arraySize, dev_d, dev_e, dev_t, dev_f);
+ cudaDeviceSynchronize();
+
+ kernScatter<<<fullBlocksPerGrid, blockSize>>>(arraySize, dev_d, dev_odata, dev_idata);
+
+ cudaMemcpy(dev_idata, dev_odata, arraySize * sizeof(int), cudaMemcpyDeviceToDevice);
+ }
+ timer().endGpuTimer();
+
+ // only copy back the first n elements; the rest is padding
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ cudaFree(dev_e);
+ cudaFree(dev_f);
+ cudaFree(dev_t);
+ cudaFree(dev_d);
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ }
+
+
}
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4f..fb46db4 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -8,6 +8,12 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata);
+ void upDownSweep(int n, int* data);
+
int compact(int n, int *odata, const int *idata);
+
+ void radixSort(int n, int* odata, const int* idata);
+
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..8e95d78 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -3,6 +3,8 @@
#include "common.h"
#include "naive.h"
+#define blockSize 128
+
namespace StreamCompaction {
namespace Naive {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,14 +14,61 @@ namespace StreamCompaction {
return timer;
}
// TODO: __global__
+ __global__ void KernNaiveScanIteration(int n, int d, int* odata, int* idata)
+ {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n)
+ {
+ return;
+ }
+ // each element adds the value 2^(d-1) positions to its left;
+ // elements without such a neighbor just carry their value over
+ if (index >= (1 << (d - 1)))
+ odata[index] = idata[index - (1 << (d - 1))] + idata[index];
+ else
+ odata[index] = idata[index];
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+ int* dev_odata1;
+ int* dev_odata2;
+ cudaMalloc((void**)&dev_odata1, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata1 failed!");
+ cudaMalloc((void**)&dev_odata2, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata2 failed!");
+
+ cudaMemcpy(dev_odata1, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+ cudaMemcpy(dev_odata2, idata, sizeof(int) * n, cudaMemcpyHostToDevice);
+
timer().startGpuTimer();
- // TODO
+
+ // ping-pong between the two buffers; pick the starting direction from the
+ // parity of ilog2ceil(n) so the final iteration always writes dev_odata1
+ int nParity = ilog2ceil(n) % 2;
+
+ for (int d = 1; d <= ilog2ceil(n); ++d)
+ {
+ if (d % 2 != nParity)
+ {
+ KernNaiveScanIteration<<<fullBlocksPerGrid, blockSize>>>(n, d, dev_odata2, dev_odata1);
+ }
+ else
+ {
+ KernNaiveScanIteration<<<fullBlocksPerGrid, blockSize>>>(n, d, dev_odata1, dev_odata2);
+ }
+ }
+
+ // inclusive to exclusive: shift right by one and prepend the identity 0
+ odata[0] = 0;
+ cudaMemcpy(odata + 1, dev_odata1, sizeof(int) * (n - 1), cudaMemcpyDeviceToHost);
timer().endGpuTimer();
+
+ //Clean up
+ cudaFree(dev_odata1);
+ cudaFree(dev_odata2);
+
}
}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..ee072c3 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,18 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ thrust::device_vector<int> dv_in(idata, idata + n);
+ thrust::device_vector<int> dv_out(n);
+
timer().startGpuTimer();
- // TODO use `thrust::exclusive_scan`
- // example: for device_vectors dv_in and dv_out:
- // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
+ thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+
timer().endGpuTimer();
+
+ thrust::copy(dv_out.begin(), dv_out.end(), odata);
+
}
}
}