Commit 4fe9c18

Address new feedback.
1 parent a4546b9 commit 4fe9c18

1 file changed: +6 -17 lines changed

chapters/tile_based_rendering_best_practices.adoc

Lines changed: 6 additions & 17 deletions
@@ -28,9 +28,8 @@ By using the right render pass configurations and memory flags, you give the imp
 
 While every vendor has a slightly different design, they generally share common characteristics.
 First, it is important to realize that the **tile size** is determined by the hardware and is not something you can query or control in core Vulkan.
-Depending on the device and the complexity of your attachments, tiles might be anything from 16x16 to 64x128 pixels.
+Depending on the device and the complexity of your attachments, tiles might be anything in size of pixels.
 The GPU chooses a size that fits its internal memory budget.
-Larger tiles generally allow for better parallelism, but they also require more on-chip memory.
 Some vendor extensions (like `VK_QCOM_tile_shading`) might expose these details, but for a cross-platform app, you should assume the tile size is opaque.
 
 Second, the **on-chip memory** used for tiles is managed entirely by the driver.
@@ -70,6 +69,8 @@ When using **Dynamic Rendering** (`VK_KHR_dynamic_rendering`), you specify these
 This extension simplifies your code by removing the need for render pass and framebuffer objects, but the hardware logic remains identical.
 You must remain disciplined about your load and store operations to avoid performance regressions.
 
+When using traditional render passes, try to structure them so that the driver can "merge" subpasses.
+
 [[transient-attachments]]
 === Transient Attachments and Lazy Allocation
 
@@ -123,16 +124,6 @@ It's a powerful tool, but it is strictly restricted to the **current pixel**.
 You cannot use this extension to read neighboring pixels.
 Common post-processing effects like bloom, FXAA, or blurs still require a separate sampling pass because they depend on a wider neighborhood of data that might cross tile boundaries.
 
-[[vk-ext-robustness2]]
-=== Robustness and Performance
-
-Safety and performance often go hand-in-hand on mobile.
-The `VK_EXT_robustness2` extension provides stricter guarantees about out-of-bounds access.
-While it might seem like a debugging tool, it is highly relevant for TBR performance.
-
-Out-of-bounds array or descriptor access on a mobile GPU can trigger expensive hardware recovery paths or even cause device hangs that require a full system reset.
-By enabling features like `nullDescriptor`, you can simplify your shader logic and let the hardware handle edge cases through well-defined, efficient paths.
-This is far better than the unpredictable (and often slow) behavior of undefined out-of-bounds access.
 
 [[performance-considerations]]
 == Advanced Performance Tuning
@@ -172,10 +163,10 @@ On mobile GPUs, concurrency is everything.
 The GPU has a limited number of execution units (EUs), and it tries to run thousands of shader instances in parallel to hide memory latency.
 If a single shader is too complex—using a large number of registers or running for hundreds of lines—it can "clog" the EUs and prevent other work from starting.
 
-A common rule of thumb is to avoid shaders that exceed 800 lines of code.
+A common rule of thumb is to avoid shaders that have monolithic large files of code.
 While modern hardware is becoming more capable, complex shaders still increase register pressure.
 If a shader uses too many General Purpose Registers (GPRs), the GPU may only be able to run a few threads at a time on each EU, leaving the rest of the hardware idle.
-If you find yourself with a massive shader that is dragging down performance, consider splitting it into smaller draws or subpasses.
+If you find yourself with a massive shader that is dragging down performance, consider splitting it into smaller draws or use subpasses with their own smaller units of work.
 While this adds a small amount of overhead, the increase in EU concurrency can often lead to a net gain in frame rate.
 
 You can also help the compiler by keeping your shader logic organized.
@@ -204,13 +195,11 @@ Advanced samplers on some hardware can even perform operations like convolution,
 [[synchronization-and-subpasses]]
 === Synchronization and Pipeline Flow
 
-In a power-constrained environment, you want the CPU and GPU to work as independently as possible.
 Frequent synchronization points—like calling `vkQueueWaitIdle`—can cause the GPU to stall while waiting for the CPU, or vice versa.
 
-When using traditional render passes, try to structure them so that the driver can "merge" subpasses.
 This allows the GPU to pass data between subpasses entirely through tile memory, avoiding expensive writes to system RAM.
 For this to work, subpasses usually need a simple dependency chain and consistent attachments.
-Similarly, for the best throughput, ensure your swapchain has enough images (usually 3 or 4 for heavy loads) to keep the GPU busy while the CPU prepares the next frame.
+For the best throughput, ensure your swapchain has enough images (usually 3 or 4 for heavy loads) to keep the GPU busy while the CPU prepares the next frame.
 
 [[best-practices-summary]]
 == Summary of Best Practices
