### In this project, I implement three real-time lighting methods: a naive O(n) loop over every light, clustered forward plus lighting, and clustered deferred lighting.

Most game engines nowadays use a mix of forward plus and clustered forward plus, though many years ago, [DOOM 2016](https://advances.realtimerendering.com/s2016/Siggraph2016_idTech6.pdf) used a form of clustered lights with cleverly scalarized access.

In 2017, Michal Drobot presented a more optimized version of clustered forward plus for [Call of Duty: Infinite Warfare](https://advances.realtimerendering.com/s2017/2017_Sig_Improved_Culling_final.pdf), using Z-Binning to efficiently bin lights by depth and improving memory performance over plain clustering. Ultimately, the ideas of forward plus are still used today in modern real-time rendering to process and render tons of lights.

So to summarize, the overall features implemented are:

- Naive lighting solution
- Clustered Forward Plus Lighting
- Clustered Deferred Lighting
## Introduction to Clustered Forward Plus Lighting

To understand what clustered forward plus lighting is, we need to start from how we would typically render a scene without any optimizations, then gradually build up to the implemented solution.

The goal is ultimately to light our scene: given a pixel and the lights in the scene, we want to know how that pixel is shaded based on the contribution of each light. If a light is far enough away, attenuation means it shouldn't light the pixel at all; similarly, pixels within the lighting radius of a point light should be colored brightly.

---
### Naive Lighting

In a typical forward rendering pipeline, vertex data from a host-side vertex buffer is passed to the GPU, where it is processed by the vertex shader; after primitive assembly and rasterization, it becomes fragments shaded by the fragment shader.

In the fragment shader, we can typically shade a fragment by looping through all the lights in our scene and accumulating the contribution of each light.

The base code handles contributions as such, before it's ultimately applied to the final color:
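
As a rough CPU-side illustration of such a loop (a Python sketch, not the project's shader code), each fragment visits every light and sums a diffuse term scaled by distance falloff. The `attenuate` falloff, the dictionary layout of a light, and all function names are assumptions made for this example.

```python
import math

def attenuate(distance, radius):
    """Smooth falloff that reaches zero at the light's radius (assumed form)."""
    t = min(distance / radius, 1.0)
    return (1.0 - t * t) ** 2  # hypothetical falloff; any cutoff curve works

def shade_naive(position, normal, albedo, lights, radius):
    """Accumulate diffuse contributions from every light: O(n) per fragment."""
    total = [0.0, 0.0, 0.0]
    for light in lights:
        to_light = [l - p for l, p in zip(light["pos"], position)]
        distance = math.sqrt(sum(c * c for c in to_light))
        if distance == 0.0:
            continue  # light sits exactly on the fragment; skip it
        direction = [c / distance for c in to_light]
        lambert = max(sum(n * d for n, d in zip(normal, direction)), 0.0)
        weight = lambert * attenuate(distance, radius)
        for i in range(3):
            total[i] += light["color"][i] * weight
    # Apply the accumulated light to the surface color.
    return [a * t for a, t in zip(albedo, total)]
```

Note that a light far outside its radius still costs a full loop iteration even though its weight is zero, which is exactly the waste the later techniques try to avoid.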

And the result is a simple lit scene. However, it's easy to see how this struggles as we scale the number of lights, as it's an `O(n)` computation to evaluate every single light in the scene for every fragment!

Imagine how much computation is wasted evaluating faraway or even occluded lights; the performance is unfortunately unacceptable for rendering hundreds of lights in a scene.

---
### Forward Plus Tiled Rendering

As mentioned above, our naive lighting method has severe scaling issues and would greatly benefit from localizing light hotspots.

In 2012, [Harada et al.](https://takahiroharada.wordpress.com/wp-content/uploads/2015/04/forward_plus.pdf) at AMD introduced the concept of tiled rendering, which bins lights into 2D screenspace tiles. Instead of shading a fragment with all the lights in the scene, the paper proposes shading with only the lights binned to the 2D tile containing the fragment.

<br>

*Graphic from Harada, McKee, and Yang's paper. The left shows a scene with 3,072 lights rendered at 1280x720 resolution, while the right shows a light heatmap representing the number of lights binned per tile. Red tiles have 50 lights, green have 25, and blue have 0.*

This way, **lighting contributions are localized,** and the number of lights processed per fragment is limited by the most lights that can be stored in a tile. This significantly reduces the number of lights processed per fragment!

While the light culling idea comes from tiled deferred pipelines, the paper applies it to forward rendering, and it's easy to implement using compute shaders! We can implement our tile construction and light culling as such:
```
for each 2D cluster (# of clusters determined by 2D tile size and screen dimensions):
    Compute viewspace frustum AABB bounds
    Loop through all lights such that:
        if the light intersects with the frustum tile, add it to the bin
```
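
The pseudocode above can be sketched on the CPU roughly as follows. This is only an illustration under stated assumptions: the per-tile view-space AABBs are taken as already computed, every light shares one radius, and the names `sphere_intersects_aabb` and `bin_lights` are hypothetical rather than taken from the project.

```python
def sphere_intersects_aabb(center, radius, box_min, box_max):
    """True if a sphere overlaps an axis-aligned box (closest-point test)."""
    dist_sq = 0.0
    for c, lo, hi in zip(center, box_min, box_max):
        nearest = min(max(c, lo), hi)  # clamp the center onto the box
        dist_sq += (c - nearest) ** 2
    return dist_sq <= radius * radius

def bin_lights(tile_bounds, lights, radius, max_lights_per_tile):
    """Return, for each tile AABB, the indices of the lights touching it."""
    bins = []
    for box_min, box_max in tile_bounds:
        indices = []
        for i, light_pos in enumerate(lights):
            if len(indices) >= max_lights_per_tile:
                break  # bin is full; remaining lights are dropped
            if sphere_intersects_aabb(light_pos, radius, box_min, box_max):
                indices.append(i)
        bins.append(indices)
    return bins
```

On the GPU, the outer loop would typically become one compute invocation per tile, but the intersection test stays the same.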

To tighten the bins along depth, 2D forward plus pipelines also cull lights by the min/max scene depth in a tile, so that lights outside that range don't need to be binned.
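
A minimal sketch of that depth test, assuming depths are positive view-space distances from the camera and that the tile's min/max depth has already been reduced from the depth buffer (the function name is hypothetical):

```python
def light_in_depth_range(light_depth, radius, tile_min_depth, tile_max_depth):
    """Keep a light only if its sphere overlaps the tile's [min, max] depth span."""
    return (light_depth + radius >= tile_min_depth
            and light_depth - radius <= tile_max_depth)
```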

<div align="center">
<img src="img/tiledDepth.png" height="300px">
<br>
<i>Image from CIS 5650 slides, red boxes visualize min/max depth ranges per 2D tile</i>
</div>
<br>

---
### Clustered Forward Plus Rendering

While very promising, tiled 2D forward plus still has a limitation: when a tile covers an extremely large min/max Z range, we unfortunately bin a ton of lights into it, reintroducing the earlier problem of evaluating faraway lights.

Instead of 2D tiles, we can use 3D clusters, where each cluster also has a Z-range for binning lights. This solves the localization problem at the expense of more memory to store the extra dimension of clusters. This is known as [Clustered Rendering](https://www.highperformancegraphics.org/previous/www_2012/media/Papers/HPG2012_Papers_Olsson.pdf), introduced by Ola Olsson et al. at HPG 2012.
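
To make the extra dimension concrete, here is a sketch of how a fragment could be mapped to a flat cluster index. The exponential depth slicing, the `near`/`far` parameters, and the index layout are assumptions for illustration; this README does not state how its Z slices are distributed.

```python
import math

def cluster_index(pixel_x, pixel_y, view_depth, tile_size, screen_width,
                  screen_height, z_slices, near, far):
    """Map a fragment to a flat cluster index, using exponential depth slices."""
    tiles_x = math.ceil(screen_width / tile_size)
    tiles_y = math.ceil(screen_height / tile_size)
    tile_x = min(int(pixel_x // tile_size), tiles_x - 1)
    tile_y = min(int(pixel_y // tile_size), tiles_y - 1)
    # Slice k covers depths from near*(far/near)^(k/n) to near*(far/near)^((k+1)/n).
    depth = min(max(view_depth, near), far)
    slice_z = int(math.log(depth / near) / math.log(far / near) * z_slices)
    slice_z = min(slice_z, z_slices - 1)  # depth == far lands in the last slice
    return tile_x + tile_y * tiles_x + slice_z * tiles_x * tiles_y
```

Exponential slices keep clusters near the camera thin, where depth precision matters most; uniform slices would be simpler but make the nearest clusters disproportionately large on screen.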

<i>Visualization of clusters from Olsson's presentation.</i>
<br>
<br>

<img src="img/heatmap.png" height="300px">
</div>
<br>

Here's a light heatmap of the clusters in my Sponza scene with 5,000 lights. Each cluster uses a 32px by 32px tile size and stores a max of 500 lights. The number of lights in a cluster is normalized to [0,1], and the color is determined by interpolating from blue to green to red: fully red clusters store the max number of lights, green ones store 250, and blue ones store 0.
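
The color ramp described above can be sketched as a two-segment linear blend. The function name is hypothetical, and the actual shader may interpolate differently.

```python
def heatmap_color(light_count, max_lights):
    """Blend blue -> green -> red as a cluster fills up; returns (r, g, b)."""
    t = min(max(light_count / max_lights, 0.0), 1.0)
    if t < 0.5:
        s = t / 0.5  # 0 at empty (blue), 1 at half full (green)
        return (0.0, s, 1.0 - s)
    s = (t - 0.5) / 0.5  # 0 at half full (green), 1 at full (red)
    return (s, 1.0 - s, 0.0)
```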

---
### Clustered Deferred Rendering

To address overdraw and wasted light calculations on fragments that aren't ultimately visible, we can switch from a forward pipeline to a **deferred** one to optimize performance.

Deferred rendering differs from forward rendering by evaluating all shading calculations in a single later pass, where we only shade fragments visible to the camera. This is done by first rasterizing our scene into G-Buffers, textures that store scene information such as albedo (material color), depth, and normals.

<table>
  <tr>
    <th>Albedo</th>
    <th>Normals</th>
  </tr>
  <tr>
    <td>
      <img src="img/albedo.png">
    </td>
    <td>
      <img src="img/normals.png">
    </td>
  </tr>
  <tr>
    <th>Depth</th>
    <th>Final Composite</th>
  </tr>
  <tr>
    <td>
      <img src="img/depth.png">
    </td>
    <td>
      <img src="img/composite.png">
    </td>
  </tr>
</table>

Using these G-Buffers allows us to construct the final scene by compositing their contents, shading each fragment using only the on-screen normals, depth, etc.

For geometrically expensive scenes, deferred rendering works wonderfully by eliminating the heavy shading work wasted on overdraw. However, this comes at the cost of memory: assuming our G-Buffers are full resolution with relatively expensive texture formats per buffer, writing and then reading all of them on the GPU can be incredibly expensive in bandwidth.
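
As a back-of-the-envelope illustration of that cost, the sketch below totals the bytes for one set of full-resolution targets. The resolution and the per-target formats are hypothetical examples, not the formats used in this project.

```python
def gbuffer_bytes(width, height, bytes_per_texel_by_target):
    """Total bytes for one set of full-resolution G-Buffer targets."""
    return width * height * sum(bytes_per_texel_by_target.values())

# Hypothetical layout: these formats are illustrative, not the project's.
layout = {
    "albedo (8-bit RGBA)": 4,
    "normal (16-bit float RGBA)": 8,
    "depth (32-bit float)": 4,
}
total = gbuffer_bytes(1920, 1080, layout)
print(f"{total / 2**20:.1f} MiB")  # prints 31.6 MiB
```

Every frame writes this data once in the geometry pass and reads it back in the shading pass, so the bandwidth cost scales with both resolution and format width.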

Most games nowadays don't use pure deferred rendering for these memory reasons, and instead opt for a depth pre-pass to cheaply reject occluded fragments and prevent overdraw. Deferred rendering also makes transparency incredibly difficult, so transparency is often ignored in such pipelines.
## Performance Analysis

In theory, with these optimizations, clustered deferred should perform best, followed by clustered forward plus, with naive as the slowest. I analyzed performance across the three based on different numbers of lights, clusters, and work group sizes.

Ultimately, I found that clustered deferred performed best consistently across all tests, and that forward plus held a clear advantage over naive.

I tested performance by disabling the move-light compute shaders and using a light radius of 1 to get enough naive readings from the FPS stats.

---
### Performance vs. Number of Lights

As noted previously, both the deferred and forward plus solutions scaled much better with larger numbers of lights.

Frame time (ms) for naive increases linearly, while both forward plus and deferred scale nearly logarithmically. This is mostly due to the localized binning of lights, which allows fewer lighting calculations per fragment based on its cluster. Deferred runs much faster than forward because it shades only what ends up on screen in a single pass, avoiding overdraw from Sponza's complex geometry.

I would expect that for simpler scenes, like a simple plane, the overhead of sampling deferred's expensive texture formats would cause worse performance compared to forward plus, since most fragments are visible on screen immediately with little overdraw.

---
### Performance vs. Number of Clusters (Based on Tile Size)

To test performance based on the number of clusters, I changed the tile size, where smaller tile sizes correspond to more clusters and larger tile sizes correspond to fewer clusters. For this analysis, I kept the number of clusters along the Z axis constant at 32.

From testing, I found that performance is best at tile sizes of around 64 and 128, which **I've calculated to be about 28,160 and 7,040 clusters respectively, with just under 10,000 clusters being optimal**.
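
The cluster counts follow from the tile size and screen dimensions. In the sketch below, the 2560x1400 canvas is a hypothetical resolution (the one used for testing is not stated); with it, tile sizes of 64 and 128 give 28,160 and 7,040 clusters, in line with the figures above.

```python
import math

def cluster_count(width, height, tile_size, z_slices=32):
    """Number of clusters for a screen split into square tiles and Z slices."""
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    return tiles_x * tiles_y * z_slices

# Hypothetical 2560x1400 canvas; the resolution used for testing is not stated.
for tile_size in (32, 64, 128, 256):
    print(tile_size, cluster_count(2560, 1400, tile_size))
```

Halving the tile size roughly quadruples the cluster count, which is why memory use climbs so quickly toward the small-tile end of the range.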

Having fewer clusters, or bigger tile sizes, causes the respective frustum bounding boxes to grow in volume, meaning each stores more lights, and therefore more lights are processed per fragment. With more clusters, the boxes shrink, and each cluster is likely to store fewer lights.

While it seems ideal to have more clusters so that fewer lights are processed, this requires more memory since we're storing more clusters (keep in mind that the max light count per cluster needs to be fixed at compile time). Figuring out this balance is not trivial; ultimately, the sweet spot sits between memory bandwidth (from storing more clusters) and computation cost (from processing more lights).
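
To put a rough number on the memory side of this trade-off, the sketch below assumes each cluster stores a light count plus a fixed array of 4-byte light indices. That layout and the index width are assumptions; the 500-light cap and the cluster counts come from the sections above.

```python
def light_list_bytes(num_clusters, max_lights_per_cluster, bytes_per_index=4):
    """Bytes for fixed-size per-cluster light lists: a count plus index slots."""
    per_cluster = (1 + max_lights_per_cluster) * bytes_per_index
    return num_clusters * per_cluster

for clusters in (7040, 28160):
    size = light_list_bytes(clusters, 500)
    print(clusters, f"{size / 2**20:.1f} MiB")
```

Under these assumptions, going from 7,040 to 28,160 clusters grows the light lists from roughly 13.5 MiB to roughly 53.8 MiB, even though most slots in each list would sit unused.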