172 changes: 166 additions & 6 deletions README.md
@@ -3,12 +3,172 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Shixuan Fang
* [LinkedIn](https://www.linkedin.com/in/shixuan-fang-4aba78222/)
* Tested on: Windows 11, i7-12700k, RTX3080Ti (Personal Computer)

### (TODO: Your README)
# Overview

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
This project is mainly about the **Parallel Scan (prefix sum)** and **Stream Compaction** algorithms, implemented with CUDA. Scan computes the prefix sum of an array, and stream compaction removes all elements of an array that meet a certain condition. These algorithms seem inherently sequential at first glance, but on the GPU they can be converted into very efficient parallel algorithms.

One application of parallel scan is the Summed Area Table, an important algorithm in real-time rendering, especially for pre-computation. Another is Radix Sort, a sorting algorithm that can run in parallel. Stream compaction is very important in ray tracing, where it helps remove unnecessary rays.
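
As a concrete illustration of the two primitives, here is a minimal sequential sketch (the helper names ```exclusiveScan``` and ```compact``` are illustrative, not this project's actual API):

```
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (size_t i = 1; i < in.size(); i++)
        out[i] = out[i - 1] + in[i - 1];
    return out;
}

// Scan-based compaction, keeping the non-zero elements:
// 1) map each element to a 0/1 keep-flag, 2) exclusive-scan the flags,
// 3) scatter each kept element to the index given by its scanned flag.
std::vector<int> compact(const std::vector<int>& in) {
    if (in.empty()) return {};
    std::vector<int> flags(in.size());
    for (size_t i = 0; i < in.size(); i++) flags[i] = (in[i] != 0);
    std::vector<int> idx = exclusiveScan(flags);
    std::vector<int> out(idx.back() + flags.back());
    for (size_t i = 0; i < in.size(); i++)
        if (flags[i]) out[idx[i]] = in[i];
    return out;
}
```

The GPU versions below parallelize exactly these steps: the map and scatter are embarrassingly parallel, and the scan itself is parallelized with the algorithms described next.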

# Description
In this project, I mainly implemented these algorithms:
- Naive CPU Scan & Stream Compaction
- Naive GPU Scan (see the sketch after this list)
- Optimized Efficient GPU Scan & Stream Compaction (which breaks the scan into up-sweep and down-sweep stages)
- GPU Scan with the Thrust Library
- More-Efficient GPU Scan with a dynamic number of blocks and threads (extra credit)
- GPU Scan using Shared Memory & Hardware Optimization (extra credit)
- Radix Sort (extra credit)
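
Since the naive GPU scan's kernel is not shown elsewhere in this README, here is a sketch of one step of it (my reconstruction of the usual Hillis-Steele formulation from the course slides; the kernel name and signature are assumptions, not the project's exact code):

```
// One step of the naive (Hillis-Steele) scan. The host launches this
// log2(n) times with offset = 1, 2, 4, ..., ping-ponging idata/odata.
__global__ void kernNaiveScanStep(int n, int offset, int* odata, const int* idata) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    if (index >= offset)
        odata[index] = idata[index - offset] + idata[index];
    else
        odata[index] = idata[index];
}
```

This produces an inclusive scan; shifting the result right by one element (inserting a leading zero) yields the exclusive scan the tests expect.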

# Output

```
****************
** SCAN TESTS **
****************
[ 37 0 28 30 42 12 17 24 24 11 7 27 6 ... 23 0 ]
==== cpu scan, power-of-two ====
elapsed time: 53.0417ms (std::chrono Measured)
==== cpu scan, non-power-of-two ====
elapsed time: 53.5831ms (std::chrono Measured)
passed
==== naive scan, power-of-two ====
elapsed time: 9.51712ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 9.58666ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 8.21565ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 7.78179ms (CUDA Measured)
passed
==== optimized work-efficient scan, power-of-two ====
elapsed time: 4.05126ms (CUDA Measured)
passed
==== optimized work-efficient scan, non-power-of-two ====
elapsed time: 4.13146ms (CUDA Measured)
passed
==== work-efficient scan with shared memory, power-of-two ====
elapsed time: 1.93226ms (CUDA Measured)
passed
==== work-efficient scan with shared memory, non power-of-two ====
elapsed time: 1.75846ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.15917ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.641792ms (CUDA Measured)
passed

*****************************
** Radix Sort TESTS **
*****************************
==== Radix sort ====
elapsed time: 0.641792ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 2 1 2 1 3 1 2 2 2 2 3 3 2 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 58.6994ms (std::chrono Measured)
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 58.559ms (std::chrono Measured)
passed
==== cpu compact with scan ====
elapsed time: 146.589ms (std::chrono Measured)
passed
==== work-efficient compact, power-of-two ====
elapsed time: 11.0158ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 10.992ms (CUDA Measured)
passed
```


# Performance Analysis

- Scan runtime with different block sizes, at array size = 2^25.

![Performance with different block size (1)](https://user-images.githubusercontent.com/54868517/190933826-32efa973-fad8-4ba4-9508-5a614bd00540.png)


As seen in this graph, **blockSize = 256** gives the best performance for the Naive, Efficient, and Efficient-with-shared-memory scans, so their block sizes are set to **256**; **blockSize = 128** performs best for the Optimized Efficient scan, which is therefore set to **128**.


- Comparison of all the GPU Scan implementations against the CPU version of Scan.

![17d41e321d148779ee125d48673dcc7](https://user-images.githubusercontent.com/54868517/190935115-3d188d84-bad3-4d8c-b0e6-bf09634840b1.jpg)

- In this graph, the array size starts at 2^20 and ends at 2^28. In general, the GPU scan algorithms beat the serial CPU scan once the array grows past roughly 2^16 elements, thanks to the sheer amount of parallel computation.
- As the graph clearly shows, aside from the Thrust library, Efficient scan with shared memory performs best: it is much faster than the CPU scan and takes only about half the time of the Optimized Efficient scan. This implies that **global memory access** is one of the biggest performance bottlenecks.
- We also notice that the Optimized Efficient scan (with a dynamic block number/grid size) outperforms the basic Efficient scan. This implies that **the number of blocks launched for a kernel does affect performance**. When we update the number of blocks in the up-sweep and down-sweep stages dynamically, the kernels run faster.
- ```thrust::exclusive_scan``` is always the fastest method, even much faster than the Efficient scan with shared memory.
- As seen in the following screenshot from the Nsight timeline, the Thrust API calls ```cudaMemcpyAsync()``` and ```cudaStreamSynchronize()```, which my implementation does not use. According to a Stack Overflow post, ```cudaMemcpyAsync()``` **returns control to the host immediately (just like a kernel launch does) rather than waiting for the data copy to complete**. I assume this greatly improves performance, since other work can proceed while the copy is in flight.
![be2a7736d65f0a09e21ce97efc51d27](https://user-images.githubusercontent.com/54868517/190936662-7465d555-925c-4a09-aa21-9e67cc1f1aea.jpg)
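
For reference, here is a minimal sketch of that async-copy pattern (my illustration, not Thrust's internals; note that ```cudaMemcpyAsync()``` is only truly asynchronous when the host buffer is pinned):

```
int n = 1 << 20;
int* h_data;                     // pinned host buffer, required for async copies
cudaMallocHost(&h_data, n * sizeof(int));
int* d_data;
cudaMalloc(&d_data, n * sizeof(int));

cudaStream_t stream;
cudaStreamCreate(&stream);
// Returns to the host immediately; the copy runs in the background on `stream`.
cudaMemcpyAsync(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice, stream);
// ... the CPU can do unrelated work here while the copy is in flight ...
cudaStreamSynchronize(stream);   // block until all work queued on `stream` is done
```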

# Extra Credit

**1. Optimized GPU Efficient Scan**
```
__global__ void kernUpSweep(int n, int stride, int* data)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    if (index % stride != 0 || index + stride - 1 >= n) return;
    data[index + stride - 1] += data[index + stride / 2 - 1];
}
```
Both the kernUpSweep and kernDownSweep kernels used in the basic scan function contain ```%``` (modulo) operations, which are very slow on the GPU. After optimization, the new kernOptimizedUpSweep looks like this:
```
__global__ void kernOptimizedUpSweep(int n, int d, int offset, int* data)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    if (index < d) {
        int ai = offset * (2 * index + 1) - 1;
        int bi = offset * (2 * index + 2) - 1;
        data[bi] += data[ai];
    }
}
```
This simple change alone improved performance considerably.
The other change is how many blocks are launched: the block count is halved on each iteration of the up-sweep loop and doubled on each iteration of the down-sweep loop, as shown below.
```
for (int d = Num >> 1; d > 0; d >>= 1) {
    blockNum = (d + blockSize - 1) / blockSize;
    kernOptimizedUpSweep<<<blockNum, blockSize>>>(Num, d, offset, dev_data);
    offset <<= 1;
}
```
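
For completeness, here is a sketch of the matching down-sweep stage. The ```kernOptimizedDownSweep``` body is my assumption, following the standard swap-and-add step from GPU Gems 3; the host loop mirrors the up-sweep, with the block count growing as ```d``` doubles:

```
__global__ void kernOptimizedDownSweep(int n, int d, int offset, int* data)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    if (index < d) {
        int ai = offset * (2 * index + 1) - 1;
        int bi = offset * (2 * index + 2) - 1;
        int t = data[ai];
        data[ai] = data[bi];  // left child receives the parent's value
        data[bi] += t;        // right child: parent's value + left subtree sum
    }
}

// Host side: zero the root, then double the active thread count each pass.
cudaMemset(dev_data + Num - 1, 0, sizeof(int));
for (int d = 1; d < Num; d <<= 1) {
    offset >>= 1;
    blockNum = (d + blockSize - 1) / blockSize;
    kernOptimizedDownSweep<<<blockNum, blockSize>>>(Num, d, offset, dev_data);
}
```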

**2. GPU Efficient Scan with shared memory**

This is mainly achieved by the kernel ```kernOptimizedPerBlockScan```, which is adapted from GPU Gems 3, Chapter 39.
The source code in GPU Gems 3 only works with a single block, so I changed it to let multiple blocks run the algorithm in parallel, and then implemented the algorithm shown in the class slides.
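
Here is a minimal sketch of the per-block part in the GPU Gems 3 style. It is illustrative only: it assumes ```n``` is padded to a multiple of ```2 * blockDim.x```, that ```2 * blockDim.x * sizeof(int)``` bytes of dynamic shared memory are passed at launch, and that a second-level scan over ```blockSums``` (plus a uniform add) stitches the blocks together; the real ```kernOptimizedPerBlockScan``` also handles bank-conflict padding:

```
// Exclusive scan of one 2*blockDim.x-element tile in shared memory.
// (n is assumed padded to a multiple of 2*blockDim.x, so no bounds checks.)
__global__ void kernPerBlockScan(int n, int* odata, const int* idata, int* blockSums) {
    extern __shared__ int temp[];
    int thid = threadIdx.x;
    int blockOffset = blockIdx.x * blockDim.x * 2;
    temp[2 * thid]     = idata[blockOffset + 2 * thid];
    temp[2 * thid + 1] = idata[blockOffset + 2 * thid + 1];

    int offset = 1;
    for (int d = blockDim.x; d > 0; d >>= 1) {      // up-sweep in shared memory
        __syncthreads();
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }
    if (thid == 0) {
        blockSums[blockIdx.x] = temp[2 * blockDim.x - 1];  // save the block total
        temp[2 * blockDim.x - 1] = 0;                      // clear the root
    }
    for (int d = 1; d < 2 * blockDim.x; d <<= 1) {  // down-sweep
        offset >>= 1;
        __syncthreads();
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    odata[blockOffset + 2 * thid]     = temp[2 * thid];
    odata[blockOffset + 2 * thid + 1] = temp[2 * thid + 1];
}
```

Because each tile's scan lives entirely in shared memory, the kernel touches global memory only twice per element (one load, one store), which is where the roughly 2x speedup over the global-memory version comes from.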

**3. Radix Sort**

I have also implemented Radix Sort, which can be seen in ```StreamCompaction::Efficient::RadixSort()```.
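
Radix sort builds directly on scan: each pass splits the array stably on one bit, zeros first, then ones. Below is a sequential sketch of a single pass (a hypothetical helper, not the actual kernel code; on the GPU, each loop becomes a kernel and the scan is the work-efficient scan above):

```
#include <vector>

// One pass of LSD radix sort over `bit`. Stable: elements keep their
// relative order within the zeros group and within the ones group.
void radixSortPass(int n, int bit, int* odata, const int* idata) {
    std::vector<int> e(n), f(n);
    for (int i = 0; i < n; i++)
        e[i] = ((idata[i] >> bit) & 1) ? 0 : 1;   // 1 if this bit is 0
    f[0] = 0;                                      // exclusive scan of e
    for (int i = 1; i < n; i++) f[i] = f[i - 1] + e[i - 1];
    int totalFalses = f[n - 1] + e[n - 1];         // total number of zeros
    for (int i = 0; i < n; i++) {
        // zeros go to f[i]; ones go after all zeros, preserving order
        int dst = e[i] ? f[i] : totalFalses + (i - f[i]);
        odata[dst] = idata[i];
    }
}
```

Running this pass for each bit from least to most significant, ping-ponging the two buffers, sorts non-negative integers.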

Here is an example of how I tested it.
```
*****************************
** Radix Sort TESTS **
*****************************
[ 43 34 35 24 32 8 19 0 ]
==== Radix sort ====
[ 0 8 19 24 32 34 35 43 ]
==== Result from std::sort ====
[ 0 8 19 24 32 34 35 43 ]
elapsed time: 0.493472ms (CUDA Measured)
passed
```
73 changes: 62 additions & 11 deletions src/main.cpp
@@ -13,15 +13,20 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
int* sort = new int[SIZE];


int main(int argc, char* argv[]) {
    // Scan tests

    //printPowerOf2(SIZE);
    //printPowerOf2(NPOT);

    printf("\n");
    printf("****************\n");
    printf("** SCAN TESTS **\n");
@@ -34,17 +39,19 @@ int main(int argc, char* argv[]) {
    // initialize b using StreamCompaction::CPU::scan you implement
    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
    // At first all cases passed because b && c are all zeroes.


    zeroArray(SIZE, b);
    printDesc("cpu scan, power-of-two");
    StreamCompaction::CPU::scan(SIZE, b, a);
    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
    printArray(SIZE, b, true);
    //printArray(SIZE, b, true);

    zeroArray(SIZE, c);
    printDesc("cpu scan, non-power-of-two");
    StreamCompaction::CPU::scan(NPOT, c, a);
    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
    printArray(NPOT, b, true);
    //printArray(NPOT, b, true);
    printCmpResult(NPOT, b, c);

    zeroArray(SIZE, c);
@@ -54,18 +61,19 @@ int main(int argc, char* argv[]) {
    //printArray(SIZE, c, true);
    printCmpResult(SIZE, b, c);

    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
    onesArray(SIZE, c);
    printDesc("1s array for finding bugs");
    StreamCompaction::Naive::scan(SIZE, c, a);
    printArray(SIZE, c, true); */
    ////For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
    //onesArray(SIZE, c);
    //printDesc("1s array for finding bugs");
    //StreamCompaction::Naive::scan(NPOT, c, a);
    //printArray(SIZE, c, true);

    zeroArray(SIZE, c);
    printDesc("naive scan, non-power-of-two");
    StreamCompaction::Naive::scan(NPOT, c, a);
    printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
    //printArray(SIZE, c, true);
    printCmpResult(NPOT, b, c);


    zeroArray(SIZE, c);
    printDesc("work-efficient scan, power-of-two");
@@ -81,6 +89,35 @@ int main(int argc, char* argv[]) {
    //printArray(NPOT, c, true);
    printCmpResult(NPOT, b, c);

    zeroArray(SIZE, c);
    printDesc("optimized work-efficient scan, power-of-two");
    StreamCompaction::Efficient::scanOptimized(SIZE, c, a);
    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
    //printArray(SIZE, c, true);
    printCmpResult(SIZE, b, c);

    zeroArray(SIZE, c);
    printDesc("optimized work-efficient scan, non-power-of-two");
    StreamCompaction::Efficient::scanOptimized(NPOT, c, a);
    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
    //printArray(NPOT, c, true);
    printCmpResult(NPOT, b, c);

    ////printArray(SIZE, a, true);
    zeroArray(SIZE, c);
    printDesc("work-efficient scan with shared memory, power-of-two");
    StreamCompaction::Efficient::scanWithSharedMemory(SIZE, c, a);
    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
    //printArray(SIZE, c, true);
    printCmpResult(SIZE, b, c);

    //zeroArray(SIZE, c);
    printDesc("work-efficient scan with shared memory, non power-of-two");
    StreamCompaction::Efficient::scanWithSharedMemory(NPOT, c, a);
    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
    //printArray(SIZE, c, true);
    printCmpResult(NPOT, b, c);

    zeroArray(SIZE, c);
    printDesc("thrust scan, power-of-two");
    StreamCompaction::Thrust::scan(SIZE, c, a);
@@ -95,6 +132,19 @@ int main(int argc, char* argv[]) {
    //printArray(NPOT, c, true);
    printCmpResult(NPOT, b, c);

    printf("\n");
    printf("*****************************\n");
    printf("** Radix Sort TESTS **\n");
    printf("*****************************\n");

    zeroArray(SIZE, c);
    printDesc("Radix sort");
    StreamCompaction::Efficient::RadixSort(SIZE, c, a);
    for (int i = 0; i < SIZE; i++) sort[i] = c[i];
    std::sort(sort, sort + SIZE);
    // RadixSort lives in StreamCompaction::Efficient, so read that timer
    printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
    printCmpResult(SIZE, sort, c);

    printf("\n");
    printf("*****************************\n");
    printf("** STREAM COMPACTION TESTS **\n");
@@ -115,22 +165,22 @@ int main(int argc, char* argv[]) {
    count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
    expectedCount = count;
    printArray(count, b, true);
    //printArray(count, b, true);
    printCmpLenResult(count, expectedCount, b, b);

    zeroArray(SIZE, c);
    printDesc("cpu compact without scan, non-power-of-two");
    count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
    expectedNPOT = count;
    printArray(count, c, true);
    //printArray(count, c, true);
    printCmpLenResult(count, expectedNPOT, b, c);

    zeroArray(SIZE, c);
    printDesc("cpu compact with scan");
    count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
    printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
    printArray(count, c, true);
    //printArray(count, c, true);
    printCmpLenResult(count, expectedCount, b, c);

    zeroArray(SIZE, c);
@@ -147,6 +197,7 @@ int main(int argc, char* argv[]) {
    //printArray(count, c, true);
    printCmpLenResult(count, expectedNPOT, b, c);


    system("pause"); // stop Win32 console from closing on exit
    delete[] a;
    delete[] b;
7 changes: 7 additions & 0 deletions src/testing_helpers.hpp
@@ -69,6 +69,13 @@ void printArray(int n, int *a, bool abridged = false) {
    printf("]\n");
}

void printPowerOf2(int n) {
    // Note: n must be positive; (n & (n - 1)) == 0 alone would also accept 0.
    if (n > 0 && (n & (n - 1)) == 0)
        printf("N is a power of 2\n");
    else
        printf("N is not a power of 2\n");
}

template<typename T>
void printElapsedTime(T time, std::string note = "")
{
8 changes: 8 additions & 0 deletions stream_compaction/common.cu
@@ -1,4 +1,5 @@
#include "common.h"
#include "device_launch_parameters.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
    cudaError_t err = cudaGetLastError();
@@ -24,6 +25,9 @@ namespace StreamCompaction {
         */
        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
            // Map each element to 1 if it should be kept (non-zero), 0 otherwise.
            int index = blockIdx.x * blockDim.x + threadIdx.x;
            if (index >= n) return;
            bools[index] = (idata[index] == 0) ? 0 : 1;
        }

        /**
@@ -33,6 +37,10 @@ namespace StreamCompaction {
        __global__ void kernScatter(int n, int *odata,
                const int *idata, const int *bools, const int *indices) {
            // Scatter each kept element to the index produced by scanning the bools.
            int index = blockIdx.x * blockDim.x + threadIdx.x;
            if (index >= n) return;
            if (bools[index] == 1)
                odata[indices[index]] = idata[index];
        }

}