// chapters/tile_based_rendering_best_practices.adoc
By using the right render pass configurations and memory flags, you give the implementation the freedom to keep intermediate data in fast on-chip memory.

While every vendor's design is slightly different, tile-based GPUs generally share common characteristics.
First, it is important to realize that the **tile size** is determined by the hardware and is not something you can query or control in core Vulkan.
Depending on the device and the complexity of your attachments, tiles might be anything from 16x16 to 64x128 pixels.
The GPU chooses a size that fits its internal memory budget.
Larger tiles generally allow for better parallelism, but they also require more on-chip memory.
Some vendor extensions (like `VK_QCOM_tile_shading`) might expose these details, but for a cross-platform app, you should assume the tile size is opaque.
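
Even though the exact tile size is opaque, it helps to have an intuition for how it shapes the binning workload. Here is a minimal sketch of the arithmetic; the function and the tile dimensions used below are hypothetical, not queried from any hardware:

```c
#include <stdint.h>

/* Number of tiles needed to cover a framebuffer. The tile dimensions
 * are hypothetical: core Vulkan offers no way to query them. */
uint32_t tiles_for_extent(uint32_t width, uint32_t height,
                          uint32_t tile_w, uint32_t tile_h)
{
    uint32_t cols = (width  + tile_w - 1) / tile_w;  /* ceiling division */
    uint32_t rows = (height + tile_h - 1) / tile_h;
    return cols * rows;
}
```

For a 1920x1080 target, 32x32 tiles mean 60 x 34 = 2040 tiles to bin and shade, while 64x128 tiles cut that to 30 x 9 = 270 larger tiles, each requiring more on-chip memory.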
Second, the **on-chip memory** used for tiles is managed entirely by the driver.
When using **Dynamic Rendering** (`VK_KHR_dynamic_rendering`), you specify these load and store operations directly when you begin rendering.
This extension simplifies your code by removing the need for render pass and framebuffer objects, but the hardware logic remains identical.
You must remain disciplined about your load and store operations to avoid performance regressions.
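
As an illustration, here is a sketch (not a complete renderer) of tiler-friendly attachment setup with dynamic rendering; `cmd`, `swapchainView`, `depthView`, and `extent` are assumed to be created elsewhere:

```c
#include <vulkan/vulkan.h>

/* Sketch: begin a dynamic rendering pass with load/store ops chosen
 * for a tile-based GPU. Handles are assumed to exist already. */
void begin_main_pass(VkCommandBuffer cmd, VkImageView swapchainView,
                     VkImageView depthView, VkExtent2D extent)
{
    VkRenderingAttachmentInfoKHR color = {
        .sType       = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO_KHR,
        .imageView   = swapchainView,
        .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        /* CLEAR initializes tile memory directly; it avoids reading
         * the old framebuffer contents back from system RAM. */
        .loadOp      = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp     = VK_ATTACHMENT_STORE_OP_STORE,  /* presented later */
        .clearValue  = { .color = { .float32 = { 0.f, 0.f, 0.f, 1.f } } },
    };
    VkRenderingAttachmentInfoKHR depth = {
        .sType       = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO_KHR,
        .imageView   = depthView,
        .imageLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
        .loadOp      = VK_ATTACHMENT_LOAD_OP_CLEAR,
        /* Depth is never read after this pass: skip the write-back. */
        .storeOp     = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .clearValue  = { .depthStencil = { 1.f, 0 } },
    };
    VkRenderingInfoKHR info = {
        .sType = VK_STRUCTURE_TYPE_RENDERING_INFO_KHR,
        .renderArea = { .offset = { 0, 0 }, .extent = extent },
        .layerCount = 1,
        .colorAttachmentCount = 1,
        .pColorAttachments = &color,
        .pDepthAttachment = &depth,
    };
    vkCmdBeginRenderingKHR(cmd, &info);
}
```

In real code, `vkCmdBeginRenderingKHR` is fetched with `vkGetDeviceProcAddr` unless you target Vulkan 1.3, where the non-KHR entry point is core.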
[[transient-attachments]]
=== Transient Attachments and Lazy Allocation
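
A typical pattern is sketched below; the resolution and format are illustrative assumptions, and real code must also pick a memory type reporting `VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT` when allocating:

```c
#include <vulkan/vulkan.h>

/* Sketch: a depth attachment that can live entirely in tile memory.
 * Resolution and format here are assumptions for illustration. */
const VkImageCreateInfo transientDepth = {
    .sType       = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType   = VK_IMAGE_TYPE_2D,
    .format      = VK_FORMAT_D24_UNORM_S8_UINT,
    .extent      = { 1920, 1080, 1 },
    .mipLevels   = 1,
    .arrayLayers = 1,
    .samples     = VK_SAMPLE_COUNT_1_BIT,
    .tiling      = VK_IMAGE_TILING_OPTIMAL,
    /* TRANSIENT plus a lazily allocated memory type tells the driver
     * the image may never need physical backing in system RAM. */
    .usage       = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                   VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
};
```

Pair this with `LOAD_OP_CLEAR` and `STORE_OP_DONT_CARE` so the driver never has to materialize the backing memory at all.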
It's a powerful tool, but it is strictly restricted to the **current pixel**.
You cannot use this extension to read neighboring pixels.
Common post-processing effects like bloom, FXAA, or blurs still require a separate sampling pass because they depend on a wider neighborhood of data that might cross tile boundaries.
[[vk-ext-robustness2]]
=== Robustness and Performance
Safety and performance often go hand-in-hand on mobile.
The `VK_EXT_robustness2` extension provides stricter guarantees about out-of-bounds access.
While it might seem like a debugging tool, it is highly relevant for TBR performance.
Out-of-bounds array or descriptor access on a mobile GPU can trigger expensive hardware recovery paths or even cause device hangs that require a full system reset.
By enabling features like `nullDescriptor`, you can simplify your shader logic and let the hardware handle edge cases through well-defined, efficient paths.
This is far better than the unpredictable (and often slow) behavior of undefined out-of-bounds access.
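
For instance, here is a sketch of enabling the feature at device creation; queue setup and the preceding `vkGetPhysicalDeviceFeatures2` support query are omitted:

```c
#include <vulkan/vulkan.h>

/* Sketch: chain VK_EXT_robustness2's nullDescriptor feature into
 * device creation. Query support first with
 * vkGetPhysicalDeviceFeatures2; that step is omitted here. */
VkPhysicalDeviceRobustness2FeaturesEXT robustness2 = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ROBUSTNESS_2_FEATURES_EXT,
    .nullDescriptor = VK_TRUE,  /* unbound descriptors behave as null */
};

const char *deviceExts[] = { VK_EXT_ROBUSTNESS_2_EXTENSION_NAME };

VkDeviceCreateInfo deviceInfo = {
    .sType                   = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .pNext                   = &robustness2,  /* feature chain */
    .enabledExtensionCount   = 1,
    .ppEnabledExtensionNames = deviceExts,
    /* pQueueCreateInfos and the rest omitted for brevity */
};
```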
[[performance-considerations]]
== Advanced Performance Tuning
On mobile GPUs, concurrency is everything.
The GPU has a limited number of execution units (EUs), and it tries to run thousands of shader instances in parallel to hide memory latency.
If a single shader is too complex—using a large number of registers or running for hundreds of lines—it can "clog" the EUs and prevent other work from starting.

A common rule of thumb is to avoid monolithic shaders that exceed 800 lines of code.
While modern hardware is becoming more capable, complex shaders still increase register pressure.
If a shader uses too many General Purpose Registers (GPRs), the GPU may only be able to run a few threads at a time on each EU, leaving the rest of the hardware idle.
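
The arithmetic behind this is simple occupancy math. The register-file size below is a made-up figure for illustration; real values vary by vendor and are rarely documented:

```c
#include <stdint.h>

/* How many threads fit on one EU if each thread holds `gprs` 32-bit
 * registers? The register file size is a hypothetical figure. */
uint32_t threads_per_eu(uint32_t register_file_bytes, uint32_t gprs)
{
    return register_file_bytes / (gprs * 4);  /* 4 bytes per 32-bit GPR */
}
```

With a hypothetical 64 KiB register file, a lean 32-GPR shader can keep 512 threads in flight per EU, while a 128-GPR shader manages only 128, a 4x drop in the EU's ability to hide memory latency.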
If you find yourself with a massive shader that is dragging down performance, consider splitting it into smaller draws or subpasses, each with its own smaller unit of work.
While this adds a small amount of overhead, the increase in EU concurrency can often lead to a net gain in frame rate.
You can also help the compiler by keeping your shader logic organized.
Advanced samplers on some hardware can even perform operations like convolution, offloading filtering work from your shader code.

[[synchronization-and-subpasses]]
=== Synchronization and Pipeline Flow
In a power-constrained environment, you want the CPU and GPU to work as independently as possible.
Frequent synchronization points—like calling `vkQueueWaitIdle`—can cause the GPU to stall while waiting for the CPU, or vice versa.
When using traditional render passes, try to structure them so that the driver can "merge" subpasses.
This allows the GPU to pass data between subpasses entirely through tile memory, avoiding expensive writes to system RAM.
For this to work, subpasses usually need a simple dependency chain and consistent attachments.
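
The key ingredient is a per-region dependency between the subpasses. A sketch follows; the subpass indices and stage/access masks assume a G-buffer write in subpass 0 consumed as an input attachment in subpass 1:

```c
#include <vulkan/vulkan.h>

/* Sketch: BY_REGION promises that subpass 1 only reads the current
 * pixel, so the driver can keep the data in tile memory and merge
 * the two subpasses instead of flushing to system RAM. */
const VkSubpassDependency gbufferToLighting = {
    .srcSubpass      = 0,
    .dstSubpass      = 1,
    .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
```

Without `VK_DEPENDENCY_BY_REGION_BIT`, the dependency is framebuffer-global, and many drivers will break the pass into separate full-screen passes.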
For the best throughput, ensure your swapchain has enough images (usually 3 or 4 for heavy loads) to keep the GPU busy while the CPU prepares the next frame.
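
Choosing that count still has to respect the surface's reported limits. A small helper, sketched under the assumption that you query `VkSurfaceCapabilitiesKHR` elsewhere:

```c
#include <stdint.h>

/* Clamp a desired swapchain image count (3-4 for heavy loads) to the
 * surface limits. Per the Vulkan spec, a reported maxImageCount of 0
 * means there is no upper limit. */
uint32_t choose_image_count(uint32_t desired,
                            uint32_t surface_min, uint32_t surface_max)
{
    uint32_t count = desired > surface_min ? desired : surface_min;
    if (surface_max != 0 && count > surface_max)
        count = surface_max;
    return count;
}
```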