Commit 3de29d5

author
Zixin Zhang
committed
first draft
1 parent 574f2a4 commit 3de29d5

File tree

6 files changed

+354
-13
lines changed

README.md

Lines changed: 71 additions & 10 deletions
@@ -37,6 +37,76 @@ For all GPU Scan algorithms, I choose to implement inclusive Scan first, and the

 ## Performance Analysis

+![scan](images/scan.png)
+
+When the array size is under 20,000 elements, CPU Scan is faster than the other algorithms. As the array size increases, GPU Naive Scan becomes faster than the rest of the algorithms. The Thrust implementation's performance is more stable than that of the other algorithms.
+
+Output when array size is 65536:
+
+```
+****************
+** SCAN TESTS **
+****************
+[ 27 40 6 30 21 41 41 26 20 5 6 29 41 ... 32 0 ]
+==== cpu scan, power-of-two ====
+elapsed time: 0.0972ms (std::chrono Measured)
+[ 0 27 67 73 103 124 165 206 232 252 257 263 292 ... 1599954 1599986 ]
+
+==== cpu scan, non-power-of-two ====
+elapsed time: 0.085ms (std::chrono Measured)
+[ 0 27 67 73 103 124 165 206 232 252 257 263 292 ... 1599856 1599858 ]
+passed
+
+==== work-efficient scan, power-of-two ====
+elapsed time: 0.178144ms (CUDA Measured)
+passed
+==== work-efficient scan, non-power-of-two ====
+elapsed time: 0.096544ms (CUDA Measured)
+passed
+==== naive scan, power-of-two ====
+elapsed time: 0.091232ms (CUDA Measured)
+passed
+==== naive scan, non-power-of-two ====
+elapsed time: 0.182464ms (CUDA Measured)
+passed
+==== thrust scan, power-of-two ====
+elapsed time: 0.10432ms (CUDA Measured)
+[ 0 27 67 73 103 124 165 206 232 252 257 263 292 ... 1599954 1599986 ]
+passed
+==== thrust scan, non-power-of-two ====
+elapsed time: 0.075776ms (CUDA Measured)
+[ 0 27 67 73 103 124 165 206 232 252 257 263 292 ... 1599856 1599858 ]
+passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+[ 0 1 0 1 3 3 2 1 0 1 2 1 2 ... 3 0 ]
+==== cpu compact without scan, power-of-two ====
+elapsed time: 0.1293ms (std::chrono Measured)
+[ 1 1 3 3 2 1 1 2 1 2 2 1 3 ... 3 2 ]
+passed
+==== cpu compact without scan, non-power-of-two ====
+elapsed time: 0.1319ms (std::chrono Measured)
+[ 1 1 3 3 2 1 1 2 1 2 2 1 3 ... 3 3 ]
+passed
+==== cpu compact with scan ====
+elapsed time: 0.6768ms (std::chrono Measured)
+[ 1 1 3 3 2 1 1 2 1 2 2 1 3 ... 3 2 ]
+passed
+==== work-efficient compact, power-of-two ====
+elapsed time: 0.096544ms (CUDA Measured)
+[ 1 1 3 3 2 1 1 2 1 2 2 1 3 ... 3 2 ]
+passed
+==== work-efficient compact, non-power-of-two ====
+elapsed time: 0.096544ms (CUDA Measured)
+[ 1 1 3 3 2 1 1 2 1 2 2 1 3 ... 3 3 ]
+passed
+Press any key to continue . . .
+```
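
As a reading aid for the output above, here is a minimal CPU sketch of what the tests check: an exclusive scan, and a stream compaction built on top of it. The function names `exclusiveScan` and `compactWithScan` are hypothetical and are not taken from the repository. The sketch also assumes that compaction means "drop the zero elements", which is what the printed arrays suggest.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] holds the sum of in[0..i-1], so out[0] is 0.
// Hypothetical helper, not the repository's implementation.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything before element i
        running += in[i];
    }
    return out;
}

// Stream compaction via scan (assumed semantics: keep non-zero elements in order).
std::vector<int> compactWithScan(const std::vector<int>& in) {
    // Step 1: map each element to a 0/1 flag.
    std::vector<int> flags(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        flags[i] = (in[i] != 0) ? 1 : 0;
    }

    // Step 2: the exclusive scan of the flags gives each kept element its output slot.
    std::vector<int> indices = exclusiveScan(flags);

    // Step 3: scatter the kept elements to their slots.
    std::size_t count = 0;
    if (!in.empty()) {
        count = static_cast<std::size_t>(indices.back() + flags.back());
    }
    std::vector<int> out(count);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (flags[i] != 0) {
            out[static_cast<std::size_t>(indices[i])] = in[i];
        }
    }
    return out;
}
```

Feeding the first 13 printed input values `27 40 6 30 21 41 41 26 20 5 6 29 41` to `exclusiveScan` reproduces the printed prefix `0 27 67 73 103 124 165 206 232 252 257 263 292`.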
+
+
+
 ### Block Size

 RTX 3080 Stats:
@@ -56,7 +126,7 @@ I want to choose a block configuration that would result in the largest number o

 - You need 1536/512 = 3 blocks to fully occupy the SM. The SM allows up to 16 blocks, so all 3 can be resident at once. Thus, the actual number of threads that can run on this SM is 3 * 512 = 1536, which occupies 1536/1536 = 100% of the SM.
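
The arithmetic in this bullet can be sketched as a small helper. This is a simplified model that looks only at the thread and block limits per SM and ignores register and shared-memory limits. The struct and function names are hypothetical, and the 1536-thread and 16-block limits are the ones quoted in the text.

```cpp
#include <algorithm>

// Per-SM limits supplied by the caller (hypothetical struct for illustration).
struct SmLimits {
    int maxThreadsPerSm;  // e.g. 1536, as quoted in the text
    int maxBlocksPerSm;   // e.g. 16, as quoted in the text
};

// Number of blocks that fit on one SM, limited by both the thread budget
// and the block-count limit. Registers and shared memory are NOT modeled.
int residentBlocks(const SmLimits& sm, int threadsPerBlock) {
    return std::min(sm.maxBlocksPerSm, sm.maxThreadsPerSm / threadsPerBlock);
}

// Threads that can actually be resident on the SM for this block size.
int residentThreads(const SmLimits& sm, int threadsPerBlock) {
    return residentBlocks(sm, threadsPerBlock) * threadsPerBlock;
}

// Fraction of the SM's thread capacity that is occupied.
double occupancy(const SmLimits& sm, int threadsPerBlock) {
    return static_cast<double>(residentThreads(sm, threadsPerBlock)) /
           static_cast<double>(sm.maxThreadsPerSm);
}
```

With `SmLimits{1536, 16}` and 512 threads per block, `residentBlocks` gives 3, `residentThreads` gives 1536, and `occupancy` gives 1.0, matching the numbers above.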

-## Naive Scan Analysis
+## Naive Scan

 - Implemented ```NaiveGPUScan``` using shared memory.
 - Each thread is assigned to evolve the contents of one element in the input array.
@@ -83,15 +153,6 @@ Understand thread to data mapping:
 - (threadIdx.x + 1) * stride * 2 - 1 = (0 + 1) * 2 * 2 - 1 = 3
 - (threadIdx.x + 1) * stride * 2 - 1 = (1 + 1) * 2 * 2 - 1 = 7
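
The index formula in the two bullets above can be checked with a short CPU simulation. The sketch assumes the formula belongs to the reduction phase of a work-efficient scan, where the thread mapped to index `idx` adds in the element `stride` positions to its left. The source shows only the formula, so the surrounding loop is an assumption, and `reductionIndex` and `reductionPhase` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Maps a thread index to the array element it updates at a given stride,
// using the formula from the text: (threadIdx.x + 1) * stride * 2 - 1.
std::size_t reductionIndex(std::size_t threadIdx, std::size_t stride) {
    return (threadIdx + 1) * stride * 2 - 1;
}

// CPU simulation of an assumed reduction phase: for each stride, every
// "thread" t whose mapped index is in range adds the element `stride`
// positions to the left. Within one stride, the indices written and the
// indices read never coincide, so a sequential loop gives the same result
// as running the threads in parallel.
void reductionPhase(std::vector<int>& data) {
    const std::size_t n = data.size();
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t t = 0;; ++t) {
            const std::size_t idx = reductionIndex(t, stride);
            if (idx >= n) {
                break;  // the remaining threads have no element at this stride
            }
            data[idx] += data[idx - stride];
        }
    }
}
```

For `stride = 2`, `reductionIndex(0, 2)` is 3 and `reductionIndex(1, 2)` is 7, as in the bullets. Running `reductionPhase` on `{1, 2, 3, 4, 5, 6, 7, 8}` leaves the total, 36, in the last element.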

-# Question
-
-```
-genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
-a[SIZE - 1] = 0;
-printArray(SIZE, a, true);
-```
-Why leave 0?
-


 ## Bloopers

images/plotting/.ipynb_checkpoints/CUDA Flocking-checkpoint.ipynb

Lines changed: 189 additions & 0 deletions
Large diffs are not rendered by default.

images/plotting/CUDA Flocking.ipynb

Lines changed: 91 additions & 0 deletions
Large diffs are not rendered by default.

images/scan.png

24.2 KB

src/main.cpp

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 #include "testing_helpers.hpp"

 // The tests default to an array of size 1 << 8 = 256
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 16; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];

stream_compaction/common.h

Lines changed: 2 additions & 2 deletions
@@ -11,8 +11,8 @@
 #include <stdexcept>

 /*! Block size used for CUDA kernel launch. */
-#define blockSize 128
-#define sectionSize 128
+#define blockSize 1024
+#define sectionSize 1024

 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 // usage: checkCUDAError("a descriptive name of this error")
