Commit 560b9f3

addressing more comments.
1 parent 5c7a132 commit 560b9f3

2 files changed, +102 -3 lines changed

en/Building_a_Simple_Engine/Mobile_Development/03_performance_optimizations.adoc

Lines changed: 15 additions & 0 deletions
@@ -6,6 +6,11 @@

 Mobile devices have significantly different hardware constraints compared to desktop systems. In this section, we'll explore key performance optimizations that are essential for achieving good performance on mobile platforms.

+[NOTE]
+====
+This chapter covers general mobile performance. For practices that arise specifically because the GPU is tile-based (TBR), see link:04_rendering_approaches.adoc[Rendering Approaches: Tile-Based Rendering].
+====
+
 === Texture Optimizations

 [NOTE]
@@ -181,6 +186,11 @@ struct OptimizedVertex {
 };
 ----

+[NOTE]
+====
+If you are targeting tile-based GPUs (TBR), bandwidth can be heavily impacted by attachment load/store behavior and tile flushes. See link:04_rendering_approaches.adoc[Rendering Approaches] — sections “Attachment Load/Store Operations on Tilers” and “Pipelining on Tilers: Subpass Dependencies and BY_REGION” for concrete guidance.
+====
+
 === Draw Call Optimizations

 Mobile GPUs are particularly sensitive to draw call overhead:
@@ -191,6 +201,11 @@ Mobile GPUs are particularly sensitive to draw call overhead:

 3. *Level of Detail (LOD)*: Implement LOD systems to reduce geometry complexity for distant objects.

+[NOTE]
+====
+On tile-based GPUs, reducing CPU overhead is important, but keeping work and data on-chip via proper pipelining and subpasses often yields larger gains. See link:04_rendering_approaches.adoc[Rendering Approaches] — “Pipelining on Tilers: Subpass Dependencies and BY_REGION” for barrier/subpass patterns, and “Attachment Load/Store Operations on Tilers” for loadOp/storeOp guidance that avoids external memory traffic.
+====
+
 === Vendor-Specific Optimizations

 Different mobile GPU vendors have specific architectures that may benefit from targeted optimizations.

en/Building_a_Simple_Engine/Mobile_Development/04_rendering_approaches.adoc

Lines changed: 87 additions & 3 deletions
@@ -47,7 +47,7 @@ depth_attachment.setStencilStoreOp(vk::AttachmentStoreOp::eDontCare);
 depth_attachment.setInitialLayout(vk::ImageLayout::eUndefined);
 depth_attachment.setFinalLayout(vk::ImageLayout::eDepthStencilAttachmentOptimal);

-// When creating the image, use the transient flag
+// When creating the image, mark the attachment as transient
 vk::ImageCreateInfo image_info{};
 image_info.setImageType(vk::ImageType::e2D);
 image_info.setExtent(vk::Extent3D(width, height, 1));
@@ -56,9 +56,10 @@ image_info.setArrayLayers(1);
 image_info.setFormat(depth_format);
 image_info.setTiling(vk::ImageTiling::eOptimal);
 image_info.setInitialLayout(vk::ImageLayout::eUndefined);
-image_info.setUsage(vk::ImageUsageFlagBits::eDepthStencilAttachment);
+image_info.setUsage(vk::ImageUsageFlagBits::eDepthStencilAttachment | vk::ImageUsageFlagBits::eTransientAttachment);
 image_info.setSamples(vk::SampleCountFlagBits::e1);
-image_info.setFlags(vk::ImageCreateFlagBits::eTransient); // Transient flag
+// Prefer lazily allocated memory for transient attachments when supported
+// Choose memory with vk::MemoryPropertyFlagBits::eLazilyAllocated
 ----

 * *Render Pass Structure*: Design your render passes to take advantage of
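
The new comments in the hunk above say to prefer lazily allocated memory for the transient attachment but do not show how a memory type might be chosen. The following is a minimal sketch of one way to do it, not the tutorial's implementation. To stay free of any Vulkan dependency it models each memory type as a plain `std::uint32_t` property mask. The helper names `find_memory_type` and `pick_transient_memory_type` are hypothetical. In a real application, the per-type flags would presumably come from `vk::PhysicalDeviceMemoryProperties::memoryTypes[i].propertyFlags` and the allowed-type mask from `vk::MemoryRequirements::memoryTypeBits`.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Numeric values of the corresponding VkMemoryPropertyFlagBits.
constexpr std::uint32_t kDeviceLocal     = 0x01; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kLazilyAllocated = 0x10; // VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT

// Hypothetical helper: index of the first memory type that is permitted by
// type_bits (bit i set means type i is allowed) and has every flag in `required`.
std::optional<std::uint32_t> find_memory_type(const std::vector<std::uint32_t>& type_flags,
                                              std::uint32_t type_bits,
                                              std::uint32_t required) {
    for (std::uint32_t i = 0; i < type_flags.size() && i < 32; ++i) {
        const bool allowed = (type_bits & (1u << i)) != 0;
        if (allowed && (type_flags[i] & required) == required) {
            return i;
        }
    }
    return std::nullopt;
}

// Hypothetical helper: prefer a lazily allocated type for a transient
// attachment, and fall back to a device-local type when none is available.
std::optional<std::uint32_t> pick_transient_memory_type(const std::vector<std::uint32_t>& type_flags,
                                                        std::uint32_t type_bits) {
    if (auto lazy = find_memory_type(type_flags, type_bits, kLazilyAllocated)) {
        return lazy;
    }
    return find_memory_type(type_flags, type_bits, kDeviceLocal);
}
```

The fallback matters because lazily allocated memory is optional: not every device exposes such a memory type, so the sketch degrades to ordinary device-local memory rather than failing.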
@@ -105,6 +106,89 @@ vk::RenderPass render_pass = device.createRenderPass(render_pass_info);

 * *Optimize for Tile Size*: Consider the tile size when designing your rendering algorithm. For example, if you know the tile size is 16x16, you might organize your data or algorithms to work efficiently with that size.

+===== Attachment Load/Store Operations on Tilers
+
+On tile-based GPUs, correctly using loadOp and storeOp is one of the highest-impact optimizations:
+
+- Clear attachments with loadOp = CLEAR and initialLayout = UNDEFINED when you don't need previous contents. This avoids an external memory read for the tile.
+- Use storeOp = DONT_CARE for attachments whose results are not needed after the render pass (e.g., transient depth or intermediate color targets). This can prevent flushing the tile back to main memory.
+- For the swapchain image (or any image you will sample/transfer from later), use storeOp = STORE and set finalLayout appropriately (e.g., PRESENT_SRC_KHR for the swapchain).
+- For MSAA, resolve within the same render pass so the hardware can resolve from tile memory and only store the resolved image to external memory.
+
+[source,cpp]
+----
+// Color attachment that we clear and present
+vk::AttachmentDescription color_attachment{};
+color_attachment.setFormat(swapchain_format);
+color_attachment.setSamples(vk::SampleCountFlagBits::e1);
+color_attachment.setLoadOp(vk::AttachmentLoadOp::eClear);
+color_attachment.setStoreOp(vk::AttachmentStoreOp::eStore); // we need to present
+color_attachment.setStencilLoadOp(vk::AttachmentLoadOp::eDontCare);
+color_attachment.setStencilStoreOp(vk::AttachmentStoreOp::eDontCare);
+color_attachment.setInitialLayout(vk::ImageLayout::eUndefined); // no need to load previous contents
+color_attachment.setFinalLayout(vk::ImageLayout::ePresentSrcKHR);
+
+// Depth attachment used only within the pass
+vk::AttachmentDescription depth_attachment{};
+depth_attachment.setFormat(depth_format);
+depth_attachment.setSamples(vk::SampleCountFlagBits::e1);
+depth_attachment.setLoadOp(vk::AttachmentLoadOp::eClear);
+depth_attachment.setStoreOp(vk::AttachmentStoreOp::eDontCare); // don't flush depth to memory
+depth_attachment.setStencilLoadOp(vk::AttachmentLoadOp::eDontCare);
+depth_attachment.setStencilStoreOp(vk::AttachmentStoreOp::eDontCare);
+depth_attachment.setInitialLayout(vk::ImageLayout::eUndefined);
+depth_attachment.setFinalLayout(vk::ImageLayout::eDepthStencilAttachmentOptimal);
+----
+
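
To give a rough sense of what the load/store choices above can save, the sketch below estimates the external memory traffic of one attachment per frame. It is a deliberately simplified first-order model under stated assumptions: it ignores framebuffer compression, partial tiles, and any driver-specific behavior, and it assumes each load or store moves the whole attachment exactly once. The `AttachmentTraffic` struct and `external_bytes_per_frame` function are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Hypothetical description of one attachment for a first-order traffic estimate.
struct AttachmentTraffic {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t bytes_per_pixel; // e.g. 4 for an assumed 32-bit color or depth format
    bool loads;                    // true when loadOp is LOAD (CLEAR and DONT_CARE read nothing)
    bool stores;                   // true when storeOp is STORE (DONT_CARE writes nothing)
};

// Estimated bytes moved between tile memory and external memory per frame,
// assuming each load or store transfers the full attachment exactly once.
inline std::uint64_t external_bytes_per_frame(const AttachmentTraffic& a) {
    const std::uint64_t size =
        static_cast<std::uint64_t>(a.width) * a.height * a.bytes_per_pixel;
    const std::uint64_t transfers = (a.loads ? 1u : 0u) + (a.stores ? 1u : 0u);
    return size * transfers;
}
```

Under this model, a 1920x1080 attachment at 4 bytes per pixel is 8,294,400 bytes. A depth attachment configured with LOAD and STORE would move about twice that per frame, while the CLEAR plus DONT_CARE configuration shown in the diff moves none.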
+[NOTE]
+====
+If you use dynamic rendering, the same rules apply via vk::RenderingAttachmentInfo loadOp/storeOp fields.
+See Vulkan Guide for background: Render Passes and Subpasses, Tile-based GPUs.
+====
+
+===== Pipelining on Tilers: Subpass Dependencies and BY_REGION
+
+Tile-based GPUs benefit from fine-grained synchronization that keeps work and data on-chip:
+
+- Prefer subpasses with input attachments to keep producer/consumer within the same render pass, enabling tile-local reads.
+- Use vk::DependencyFlagBits::eByRegion to scope hazards to the pixel regions actually written/read, avoiding unnecessary tile flushes.
+- Avoid over-broad barriers (e.g., ALL_COMMANDS, MEMORY_READ/WRITE) that serialize the pipeline and may force external memory traffic. Use precise stage/access masks.
+
+Example: dependency from a color-writing subpass to a subpass that reads that color as an input attachment.
+
+[source,cpp]
+----
+vk::SubpassDependency dep{};
+dep.setSrcSubpass(0);
+dep.setDstSubpass(1);
+dep.setSrcStageMask(vk::PipelineStageFlagBits::eColorAttachmentOutput);
+dep.setDstStageMask(vk::PipelineStageFlagBits::eFragmentShader);
+dep.setSrcAccessMask(vk::AccessFlagBits::eColorAttachmentWrite);
+dep.setDstAccessMask(vk::AccessFlagBits::eInputAttachmentRead);
+dep.setDependencyFlags(vk::DependencyFlagBits::eByRegion);
+----
+
+Example: external dependency to the first subpass of a render pass, allowing pipelining with prior pass while limiting scope by region.
+
+[source,cpp]
+----
+vk::SubpassDependency externalDep{};
+externalDep.setSrcSubpass(VK_SUBPASS_EXTERNAL);
+externalDep.setDstSubpass(0);
+externalDep.setSrcStageMask(vk::PipelineStageFlagBits::eColorAttachmentOutput);
+externalDep.setDstStageMask(vk::PipelineStageFlagBits::eEarlyFragmentTests | vk::PipelineStageFlagBits::eColorAttachmentOutput);
+externalDep.setSrcAccessMask(vk::AccessFlagBits::eColorAttachmentWrite);
+externalDep.setDstAccessMask(vk::AccessFlagBits::eDepthStencilAttachmentWrite | vk::AccessFlagBits::eColorAttachmentWrite);
+externalDep.setDependencyFlags(vk::DependencyFlagBits::eByRegion);
+----
+
+[NOTE]
+====
+With Synchronization2 (vkCmdPipelineBarrier2 and friends) avoid ALL_COMMANDS and prefer the minimal set of stages/access that capture your hazard. Use render pass/subpass structure when possible—it's the most tiler-friendly way to express pipelining.
+====
+
+For further guidance, see the xref:https://docs.vulkan.org/guide/latest/[Vulkan Guide] topics on Tile-based GPUs, Render Passes, and Synchronization.
+
 ===== Memory Management

 To improve the efficiency of memory allocation in TBR architectures:
