
Commit 574f2a4: README first draft
Author: Zixin Zhang
Parent: 20e9597


README.md

Lines changed: 101 additions & 10 deletions
<p align="center">
  <h1 align="center">Prefix Sum and Stream Compaction</h1>
  <h2 align="center">Author: (Charles) Zixin Zhang</h2>
  <p align="center">
    CPU and GPU Implementations of the Exclusive Prefix Sum (Scan) Algorithm and Stream Compaction in CUDA C
  </p>
</p>

---

## Highlights

XXXXX

Tested on:

```
[SM 8.6 NVIDIA GeForce RTX 3080]
Max threads per block: 1024
Shared memory per block: 49152 bytes
Max threads per SM: 1536
Max blocks per SM: 16
Max grid size: 2147483647, 65535, 65535
```
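
These stats can be reproduced with a device-properties query; a minimal sketch (not necessarily the code used for this README):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0
    printf("[SM %d.%d %s]\n", prop.major, prop.minor, prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max blocks per SM: %d\n", prop.maxBlocksPerMultiProcessor);
    printf("Max grid size: %d, %d, %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```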

---

## Features

- CPU Scan & Stream Compaction
- Naive GPU Scan Algorithm Using Shared Memory
- Work-Efficient GPU Scan Using Shared Memory & Stream Compaction
- Thrust's Scan Algorithm

For all GPU scan algorithms, I chose to implement an inclusive scan first and then convert the result to an exclusive scan. The conversion can be done in parallel with minimal code, as in the sketch below.
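
A minimal sketch of that conversion, assuming 0 as the identity for addition; the kernel and variable names (```kernInclusiveToExclusive```, ```in```, ```out```) are illustrative, not taken from this repo:

```
__global__ void kernInclusiveToExclusive(int n, const int* in, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Exclusive scan element i equals inclusive scan element i - 1;
    // element 0 is the identity (0 for addition).
    out[i] = (i == 0) ? 0 : in[i - 1];
}
```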

## Performance Analysis

### Block Size

Given the RTX 3080 stats listed above, I want to choose a block configuration that yields the largest number of threads resident on the SM.

:heavy_check_mark: 512 threads per block

- We need 1536 / 512 = 3 blocks to fully occupy the SM, and since the SM allows up to 16 resident blocks, all 3 fit. The number of threads that can run on this SM is therefore 3 * 512 = 1536, which is 1536 / 1536 = 100% occupancy. The sketch below double-checks this with the CUDA occupancy API.
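
A minimal sketch of that check; ```kernScanStub``` is a hypothetical stand-in, and since occupancy depends on the real kernel's register and shared-memory usage, the actual scan kernel should be substituted here:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernScanStub(int n, int* data) {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    for (int blockSize = 128; blockSize <= 1024; blockSize *= 2) {
        int blocksPerSM = 0;
        // How many blocks of this size fit on one SM for this kernel?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, kernScanStub, blockSize, 0 /* dynamic smem */);
        float occupancy = (float)(blocksPerSM * blockSize)
                        / prop.maxThreadsPerMultiProcessor;
        printf("block size %4d: %2d blocks/SM, occupancy %.0f%%\n",
               blockSize, blocksPerSM, occupancy * 100.f);
    }
    return 0;
}
```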

## Naive Scan Analysis

- Implemented ```NaiveGPUScan``` using shared memory.
- Each thread is responsible for one element of the input array.
- This is largely a four-step process (steps 1 and 3 are sketched below):
  - Compute the scan result for each section, then store each section's block sum in ```sumArray```.
  - Scan the block sums.
  - Add scanned block sum ```i``` to all values of scanned block ```i + 1```.
  - Convert the inclusive scan to an exclusive scan.

In my implementation, the naive kernel processes up to 128 elements per section, using 128 threads per block. For an input of 1,000,000 elements, that is ceil(1,000,000 / 128) = 7813 thread blocks. With up to 2,147,483,647 thread blocks available in the x-dimension of the grid, the naive kernel can handle up to 2,147,483,647 * 128, about 275 billion elements.
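
A minimal sketch of steps 1 and 3 under these assumptions (128-element sections, Kogge-Stone scan within a section); the kernel names are illustrative, not the repo's exact code:

```
#define SECTION_SIZE 128  // elements (and threads) per section

// Step 1: Kogge-Stone inclusive scan of one section in shared memory;
// the last thread also records the section total in sumArray.
__global__ void kernNaiveSectionScan(int n, const int* in, int* out,
                                     int* sumArray) {
    __shared__ int XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? in[i] : 0;

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        int val = 0;
        if (threadIdx.x >= stride) {   // guard: never read XY[-1]
            val = XY[threadIdx.x - stride];
        }
        __syncthreads();               // finish all reads before any write
        XY[threadIdx.x] += val;
    }

    if (i < n) out[i] = XY[threadIdx.x];
    if (threadIdx.x == blockDim.x - 1) {
        sumArray[blockIdx.x] = XY[threadIdx.x];  // section total for step 2
    }
}

// Step 3: after sumArray itself has been scanned (step 2), add scanned
// block sum i to every element of scanned block i + 1.
__global__ void kernAddScannedBlockSums(int n, int* out,
                                        const int* scannedSums) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && blockIdx.x > 0) {
        out[i] += scannedSums[blockIdx.x - 1];
    }
}
```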

## Work-Efficient Scan

Understanding the thread-to-data mapping:

```int index = (threadIdx.x + 1) * stride * 2 - 1;```

- (threadIdx.x + 1) shifts the thread indices from 0, 1, 2, 3, ... to 1, 2, 3, 4, ..., so every index becomes a positive integer.
- (threadIdx.x + 1) * stride * 2 - 1 then maps consecutive threads to data indices 2 * stride apart (see the upsweep sketch after this list):
  - When stride = 1, we want thread 0 to map to data index 1, thread 1 to data index 3, etc.
    - (0 + 1) * 1 * 2 - 1 = 1
    - (1 + 1) * 1 * 2 - 1 = 3
  - When stride = 2, we want thread 0 to map to data index 3, thread 1 to data index 7, etc.
    - (0 + 1) * 2 * 2 - 1 = 3
    - (1 + 1) * 2 * 2 - 1 = 7
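
A minimal sketch of the upsweep (reduce) phase that uses this mapping; ```XY``` holds one section in shared memory, and the names are illustrative:

```
#define SECTION_SIZE 128

__global__ void kernUpsweep(int n, int* data) {
    __shared__ int XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? data[i] : 0;

    for (int stride = 1; stride < SECTION_SIZE; stride *= 2) {
        __syncthreads();
        // stride = 1: threads 0, 1, 2, ... touch XY[1], XY[3], XY[5], ...
        // stride = 2: threads 0, 1, 2, ... touch XY[3], XY[7], XY[11], ...
        int index = (threadIdx.x + 1) * stride * 2 - 1;
        if (index < SECTION_SIZE) {
            XY[index] += XY[index - stride];  // fold in the left partial sum
        }
    }
    __syncthreads();
    if (i < n) data[i] = XY[threadIdx.x];
}
```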

## Question

```
genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);
```

Why leave a 0 at the end?

## Bloopers

### #1

```
CUDA error (d:\dev\565\project2-stream-compaction\stream_compaction\naive.cu:84): memCpy back failed!: an illegal memory access was encountered

83 cudaMemcpy(odata, d_OutputData, size, cudaMemcpyDeviceToHost);
84 checkCUDAError("memCpy back failed!");
```

- I encountered this error while implementing the naive version of the scan algorithm (before handling arbitrary-length inputs). At first, I suspected the culprit was on line 83, because line 84 reports the error. However, the culprit actually resided in my ```kernNaiveGPUScan``` function, where I accessed ```XY[-1]``` inside the for loop.
- Fix: add an if-statement to make sure we never access ```XY[-1]```. Also make sure the ```__syncthreads()``` calls are **not** inside the if-statement, as in the sketch below.

> When a ```__syncthreads()``` statement is placed in an if-statement, either all or none of the threads in a block execute the path that includes the ```__syncthreads()```. (PMPP, p. 59)
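
A sketch of the buggy and corrected loop shapes (my reconstruction, not the repo's exact code):

```
// Buggy shape (the blooper): no guard, so thread 0 reads XY[-1] when
// stride = 1, which is the illegal memory access reported above.
//
//   for (int stride = 1; stride < blockDim.x; stride *= 2) {
//       __syncthreads();
//       XY[threadIdx.x] += XY[threadIdx.x - stride];   // XY[-1]!
//   }

// Fixed shape: guard the read, but keep every __syncthreads() outside
// the if-statement so all threads in the block reach it.
for (int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    int val = 0;
    if (threadIdx.x >= stride) {       // never read XY[-1]
        val = XY[threadIdx.x - stride];
    }
    __syncthreads();                   // all reads finish before any write
    XY[threadIdx.x] += val;
}
```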