Commit 3ad6ecd

meta: Add tiler optimization extension documentation.
Signed-off-by: Hans-Kristian Arntzen <post@arntzen-software.no>
1 parent 617401c commit 3ad6ecd

1 file changed: +305 -0 lines changed

# Tiler Optimization Extensions

To allow for optimal rendering performance on GPUs which support tiling, vkd3d-proton
provides some special interfaces to expose this style of rendering by keeping render targets on-chip,
enabling framebuffer fetch mechanisms which the standard D3D12 APIs do not expose.

Applications which intend to run D3D12 over Proton on mobile hardware such as Adreno
are able to leverage this new interface with minimal changes to the renderer.
The expectation is that this interface will mostly be relevant for VR games
which are more likely to cater to mobile concerns.

## Alternative to D3D12 RenderPass API

The default [D3D12 RenderPass API](https://microsoft.github.io/DirectX-Specs/d3d/RenderPasses.html)
supports extensions for tiler optimizations with APIs like
`D3D12_RENDER_PASS_ENDING_ACCESS_PRESERVE_LOCAL_SRV`.
However, this API is fundamentally incompatible with Vulkan, and virtually no mobile hardware
can implement it, including the mobile hardware that vkd3d-proton cares about.
The vkd3d-proton interface described here should be supportable on a wide range of hardware.

It relies on [VK_KHR_dynamic_rendering_local_read](https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_dynamic_rendering_local_read.html)
as well as [VK_KHR_unified_image_layouts](https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_unified_image_layouts.html).

### Additional programmable blending support

Unlike D3D12's RenderPass API, this extended API can express programmable blending
where an attachment can be sampled from even when in `D3D12_RESOURCE_STATE_RENDER_TARGET`
or `D3D12_RESOURCE_STATE_DEPTH_WRITE` resource states.

### Shader reuse

The intent is that existing shaders can be reused.
For example, this shader could be used for programmable blending:

```
Texture2D<float4> RenderTarget : register(t5, space6);

float4 blend(float4 dst, float4 src)
{
    // Or whatever you want.
    return lerp(dst, src, src.a);
}

float4 main(float4 pos : SV_Position, float4 color : COLOR) : SV_Target
{
    float4 RT = RenderTarget.Load(int3(pos.xy, 0));
    return blend(RT, color);
}
```

Here, new APIs remap `t5, space6` to e.g. RTV #0 in the root signature.
Applications can also redirect normal `Texture2D` SRVs to the bound depth or stencil attachments.
This allows for typical deferred rendering scenarios where the G-buffer is read from on-chip memory instead
of textures, which saves significant memory bandwidth.
Multi-sampled input attachments are also supported. This can be used for e.g. custom HDR resolves on-chip.

Layered rendering and view instancing are also supported.
However, in this case, a non-arrayed `Texture2D` or `Texture2DMS` is still used in the shader.
The implementation samples from the corresponding layer implicitly.
61+
### `OMSetRenderTargets` support
62+
63+
This new interface is compatible with both "immediate" `OMSetRenderTargets` style rendering
64+
as well as the more dedicated RenderPass APIs. However, to be as tiler friendly as possible,
65+
it is recommended to use the RenderPass API to get the most out of this interface since
66+
it can express e.g. discarding G-buffer attachments which may no longer
67+
need to remain valid after lighting pass is done.
68+
69+
## New Device APIs
70+
71+
```
72+
typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPING
73+
{
74+
UINT RegisterSpace;
75+
UINT ShaderRegister;
76+
} D3D12_VK_INPUT_ATTACHMENT_MAPPING;
77+
78+
typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS
79+
{
80+
UINT NumRenderTargets;
81+
D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT];
82+
BOOL EnableDepth;
83+
BOOL EnableStencil;
84+
D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth;
85+
D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil;
86+
} D3D12_VK_INPUT_ATTACHMENT_MAPPINGS;
87+
88+
typedef enum D3D12_VK_TILER_OPTIMIZATION_TIER
89+
{
90+
D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0,
91+
D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1,
92+
} D3D12_VK_TILER_OPTIMIZATION_TIER;
93+
94+
[
95+
uuid(b7798d22-9fce-434d-8eeb-c3cef1056125),
96+
object,
97+
local,
98+
pointer_default(unique)
99+
]
100+
interface ID3D12DeviceExt2 : ID3D12DeviceExt1
101+
{
102+
D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier();
103+
HRESULT OptInToTilerOptimizations();
104+
UINT GetInputAttachmentDescriptorsCount();
105+
HRESULT CreateRootSignatureWithInputAttachments(
106+
UINT node_mask,
107+
const void *bytecode, SIZE_T bytecode_length,
108+
const D3D12_VK_INPUT_ATTACHMENT_MAPPINGS *mappings,
109+
REFIID riid, void **root_signature);
110+
void CreateInputAttachmentDescriptors(
111+
UINT render_target_descriptor_count,
112+
const D3D12_CPU_DESCRIPTOR_HANDLE *render_target_descriptors,
113+
BOOL single_descriptor_handle,
114+
const D3D12_CPU_DESCRIPTOR_HANDLE *depth_descriptor,
115+
const D3D12_CPU_DESCRIPTOR_HANDLE *stencil_descriptor,
116+
D3D12_CPU_DESCRIPTOR_HANDLE base_descriptor);
117+
}
118+
```

### `GetTilerOptimizationTier()`

This is a simple query to check if these APIs are supported by the device.
There is currently only one feature tier.

### `HRESULT OptInToTilerOptimizations()`

This is intended to be called right after the `ID3D12Device` is created, or
as early as possible in the lifetime of the device.
Calling this modifies the implementation in certain ways to make it compatible with
tiler optimizations without adding a lot of extra API churn which would clutter an application.
This call is not thread-safe and must not be called concurrently with any other API command.
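
As a rough sketch of the intended start-up sequence, the helper below queries the tier and only then opts in. `TryEnableTilerOptimizations` and `FakeDevice` are hypothetical names introduced for illustration, and the enum is re-declared as a stand-in so the snippet compiles without the D3D12 headers. With a real `ID3D12DeviceExt2` pointer the same two calls would be made through `->`.

```cpp
#include <cassert>

// Stand-in declarations (assumption: a real build takes these from the
// vkd3d-proton extension header instead).
typedef long HRESULT;
enum D3D12_VK_TILER_OPTIMIZATION_TIER
{
    D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0,
    D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1,
};

// Hypothetical helper: accepts any object exposing the two device methods.
// Returns true only if the tier is supported and the opt-in call succeeded.
template <typename DeviceExt2>
bool TryEnableTilerOptimizations(DeviceExt2 &device)
{
    if (device.GetTilerOptimizationTier() < D3D12_VK_TILER_OPTIMIZATION_TIER_1)
        return false; // Keep using the normal SRV path.

    // Non-negative HRESULT values indicate success (same test as SUCCEEDED()).
    return device.OptInToTilerOptimizations() >= 0;
}

// Fake device used only to exercise the helper in this sketch.
struct FakeDevice
{
    D3D12_VK_TILER_OPTIMIZATION_TIER tier;
    bool opted_in;

    explicit FakeDevice(D3D12_VK_TILER_OPTIMIZATION_TIER t) : tier(t), opted_in(false) {}

    D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier() const { return tier; }
    HRESULT OptInToTilerOptimizations() { opted_in = true; return 0; }
};
```

Since the opt-in is not thread-safe, a helper like this would be called once on the thread that creates the device, before any resources or views are created.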

The differences are:

- If a resource is created with `ALLOW_RENDER_TARGET` or `ALLOW_DEPTH_STENCIL`
  and the image can be sampled from (i.e., no `D3D12_RESOURCE_FLAG_DENY_SHADER_RESOURCE`),
  `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically.
- When creating RTV views, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically.
- When creating DSV views, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically in some cases:
  - For DSV views with a single plane, the usage is added automatically.
  - For DSV views with both planes, the DSV is only input attachment enabled if
    there is exactly one plane which is marked read-only with
    `D3D12_DSV_FLAG_READ_ONLY_DEPTH` or `D3D12_DSV_FLAG_READ_ONLY_STENCIL`.
    The read-only aspect is compatible with input attachments.
    (This is somewhat awkward, but it removes a lot of extra API churn, and is very unlikely to come up in practice).

NOTE: If the application needs to use both depth and stencil input attachments at the same time,
two DSVs can be created, one with read-only depth, and the other with read-only stencil. The
separate DSVs can then be passed to `CreateInputAttachmentDescriptors()`.
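
The DSV rule above can be restated as a small predicate. This is only a sketch of the documented rule, not the actual vkd3d-proton code. `DsvIsInputAttachmentEnabled` and the `DsvFlags` enum are hypothetical stand-ins; the flag values mirror `D3D12_DSV_FLAG_READ_ONLY_DEPTH` (0x1) and `D3D12_DSV_FLAG_READ_ONLY_STENCIL` (0x2) from `d3d12.h`.

```cpp
#include <cassert>

// Stand-in for D3D12_DSV_FLAGS; values match d3d12.h.
enum DsvFlags
{
    DSV_FLAG_NONE = 0x0,
    DSV_FLAG_READ_ONLY_DEPTH = 0x1,
    DSV_FLAG_READ_ONLY_STENCIL = 0x2,
};

// Hypothetical helper restating the rule from the list above.
// plane_count: 1 for a single-plane view, 2 when the view covers
// both the depth and the stencil plane.
bool DsvIsInputAttachmentEnabled(unsigned plane_count, unsigned flags)
{
    if (plane_count == 1)
        return true; // Single plane: usage is always added.

    bool read_only_depth = (flags & DSV_FLAG_READ_ONLY_DEPTH) != 0;
    bool read_only_stencil = (flags & DSV_FLAG_READ_ONLY_STENCIL) != 0;

    // Both planes: exactly one plane must be marked read-only.
    return read_only_depth != read_only_stencil;
}
```

Under this reading, a two-plane DSV with neither or both read-only flags set does not get input attachment usage, which is why the NOTE above suggests two separate DSVs when both aspects need to be read.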

### `UINT GetInputAttachmentDescriptorsCount()`

To be able to read from on-chip memory, the application allocates special SRVs in the descriptor heap.
Rather than normal texture SRVs, Vulkan requires the use of `VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT`,
so the normal `CreateShaderResourceView()` will not work.

When writing descriptors to the heap, multiple descriptors are written together as a group in
a layout which is opaque to the application. The expectation is that 10 CBV_SRV_UAV descriptors
are consumed (8 RT + Depth + Stencil), but it may be different due to descriptor packing concerns.
(`vkGetDescriptorSetLayoutSizeEXT()` determines the number of descriptors.)

For simplicity and practicality of the implementation, the number of descriptors is fixed at the upper bound.
New input attachment descriptors need only be allocated once per render pass, and a few extra wasted descriptors
should not be a major concern.
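
One way to budget heap space is to reserve one group of descriptors per render pass. The sketch below shows the address arithmetic only. `CpuDescriptorHandle` is a stand-in for `D3D12_CPU_DESCRIPTOR_HANDLE`, and `InputAttachmentGroupBase` and its parameters are hypothetical. It assumes `group_size` comes from `GetInputAttachmentDescriptorsCount()` and `increment` from `ID3D12Device::GetDescriptorHandleIncrementSize()` for the CBV_SRV_UAV heap type.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for D3D12_CPU_DESCRIPTOR_HANDLE (a single pointer-sized field).
struct CpuDescriptorHandle
{
    std::size_t ptr;
};

// Hypothetical helper: returns the first descriptor of the group reserved
// for render pass `pass_index`.
//   heap_start : CPU handle of the first descriptor in the heap
//   first_slot : heap slot where the input attachment region begins
//   group_size : value returned by GetInputAttachmentDescriptorsCount()
//   increment  : descriptor handle increment size for CBV_SRV_UAV heaps
CpuDescriptorHandle InputAttachmentGroupBase(CpuDescriptorHandle heap_start,
        unsigned first_slot, unsigned group_size,
        unsigned increment, unsigned pass_index)
{
    std::size_t slot = std::size_t(first_slot) + std::size_t(pass_index) * group_size;
    CpuDescriptorHandle handle = { heap_start.ptr + slot * increment };
    return handle;
}
```

The group size should always be queried rather than hard-coded to 10, since the text above notes that packing concerns can change it.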

### `HRESULT CreateRootSignatureWithInputAttachments()`

This is equivalent to `CreateRootSignature()`, except that extra information can be added
for input attachments. `mappings` can be `NULL`, in which case the call is equivalent
to `CreateRootSignature()`. (It is not reasonable to modify the encoded RootSignature payload to
hack in support for this, so this was determined to be the most practical solution.)

Input attachment mappings only work for non-arrayed descriptors. I.e., shaders which access
the bound attachments through bindless means will not work with this interface,
since the compiler needs to statically map resource variables to a render target index to
be able to take advantage of on-chip data.

Input attachment mappings can conflict with normal descriptor table bindings,
i.e. override existing descriptor table bindings in the root signature.
In this case, the input attachment mapping takes precedence.
This allows applications to keep using the normal SRV path on most implementations,
but selectively "opt-in" to the fast path when supported without having to modify the shader code.

When a `Texture2D` or `Texture2DMS` is mapped to an input attachment, that texture must only be used
with simple `::Load()` functions or equivalent. It cannot be used with a sampler object.
Misuse will lead to PSO creation failure.

The coordinate passed to the load is ignored, except for the sample index, and is replaced with the current pixel coordinate.
To make this transformation transparent, the pixel shader can load from `int2(SV_Position.xy)`, which also works on the fallback SRV path.

When mappings are used, the root signature must have at least 1 DWORD available
for the implementation to pass down additional data.
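
To make the mapping concrete, the sketch below fills in `D3D12_VK_INPUT_ATTACHMENT_MAPPINGS` so that `register(t5, space6)` is redirected to render target 0. The struct definitions are stand-in copies of the IDL so that the snippet compiles on its own, and `MakeBlendMappings` is a hypothetical helper. It assumes that entry `i` of `RenderTargets` describes the SRV that reads render target `i`.

```cpp
#include <cassert>

// Stand-in declarations mirroring the IDL (a real build uses the
// vkd3d-proton extension header and d3d12.h instead).
typedef unsigned int UINT;
typedef int BOOL;
static const UINT D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT = 8;

struct D3D12_VK_INPUT_ATTACHMENT_MAPPING
{
    UINT RegisterSpace;
    UINT ShaderRegister;
};

struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS
{
    UINT NumRenderTargets;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT];
    BOOL EnableDepth;
    BOOL EnableStencil;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil;
};

// Hypothetical helper: redirect `Texture2D : register(t5, space6)` to
// render target 0, with no depth or stencil input attachment.
D3D12_VK_INPUT_ATTACHMENT_MAPPINGS MakeBlendMappings()
{
    D3D12_VK_INPUT_ATTACHMENT_MAPPINGS mappings = {};
    mappings.NumRenderTargets = 1;
    mappings.RenderTargets[0].RegisterSpace = 6;  // space6
    mappings.RenderTargets[0].ShaderRegister = 5; // t5
    mappings.EnableDepth = 0;
    mappings.EnableStencil = 0;
    return mappings;
}
```

The resulting struct would be passed as `mappings` to `CreateRootSignatureWithInputAttachments()`, while the serialized root signature blob stays the one the application already uses for the normal SRV path.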

### `void CreateInputAttachmentDescriptors()`

Takes an equivalent of the `OMSetRenderTargets` arguments and writes input attachment descriptors for them.
`GetInputAttachmentDescriptorsCount()` consecutive CBV_SRV_UAV descriptors are clobbered.

The main difference is that depth and stencil descriptors are separate in this interface.
The RTV or DSV descriptors need not be the exact same `D3D12_CPU_DESCRIPTOR_HANDLE` passed to `OMSetRenderTargets()`,
but they must be equivalent except for any read-only DSV state.

TODO: Add an interface for RenderPass API desc as well?

NULL RTVs or DSVs are ignored, and the matching descriptor in the heap is not modified.
Using input attachments to sample from a NULL RTV or DSV is undefined behavior.
Just use normal SRVs instead.

### PSO considerations

An input attachment which intends to read from a render target must define that render target
in the PSO by using a sufficiently large `NumRenderTargets`.
If an SRV is mapped to render target `N`, and `N` is greater than or equal to `NumRenderTargets`,
the input attachment must not be read from.

The RTV format can be `DXGI_FORMAT_UNKNOWN` if the render target is only used as an input attachment
in the PSO.

Depth-stencil input attachments can be read from even with `DSVFormat` equal to `DXGI_FORMAT_UNKNOWN`.

## New CommandList APIs

```
[
    uuid(9c228166-bf9e-464c-9078-ecf20a13271a),
    object,
    local,
    pointer_default(unique)
]
interface ID3D12GraphicsCommandListExt2 : ID3D12GraphicsCommandListExt1
{
    void InputAttachmentPixelBarrier();
    void SetRootSignatureInputAttachments(D3D12_GPU_DESCRIPTOR_HANDLE handle);
    void SetInputAttachmentFeedback(UINT render_target_concurrent_mask, BOOL depth_concurrent, BOOL stencil_concurrent);
}
```

### `void InputAttachmentPixelBarrier()`

While an image is in `RENDER_TARGET` or `DEPTH_WRITE` resource states
(or equivalent `D3D12_BARRIER_LAYOUT_RENDER_TARGET` or `D3D12_BARRIER_LAYOUT_DEPTH_STENCIL_WRITE`),
it cannot be sampled from as an input attachment without performing a per-pixel barrier.
This can be called at any time, even inside a render pass.
Only render target writes before the pixel barrier are visible to input attachment reads after the barrier.

NOTE: Unlike D3D12, Vulkan supports this use case in the `VK_IMAGE_LAYOUT_GENERAL` image layout,
which is why this feature requires `VK_KHR_unified_image_layouts`.
This is a pure memory barrier, and not a layout transition.

### `void SetRootSignatureInputAttachments()`

Binds the descriptors for input attachments for graphics pipelines.
Unlike normal root parameters, this argument is never invalidated by binding new graphics root signatures.
It also does not need a root parameter index.
It can safely be called once per `OMSetRenderTargets()` and forgotten about.
The descriptor handle must point into the currently bound descriptor heap.

### `void SetInputAttachmentFeedback()`

Programmable blending use cases and G-buffer deferred rendering are similar, but have different data access patterns.

In programmable blending, there is concurrent access of the render target while sampling from it.
Even with appropriate barriers in place, there may be hazards when render targets are compressed
(compressed render targets typically operate on some block structure),
leading to garbage pixels being read in the input attachment unless the implementation knows about this case up front.
In typical G-buffer deferred rendering there is no such issue, since the data flow is clearly separated into a writing phase
followed by a read-only phase.

By default, feedback is disabled. For render targets, if bit N is set in the mask, feedback is enabled for render target N.

IMPORTANT: Calling this will end the render pass internally, so this should not be called last-minute while inside a render pass.
For performance, set this state up front, alongside `OMSetRenderTargets()` or right before `BeginRenderPass()`.
If you don't know up front, just enable full feedback for the render pass.
It should be fine on most implementations anyway.
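
The render target mask is a plain bit field, so it can be built from a list of render target indices. A minimal sketch follows. `MakeFeedbackMask` is a hypothetical helper, and the limit of 8 mirrors `D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT`.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical helper: sets bit N for every render target index N that is
// written and read concurrently (i.e. used for programmable blending).
unsigned MakeFeedbackMask(std::initializer_list<unsigned> render_target_indices)
{
    unsigned mask = 0;
    for (unsigned index : render_target_indices)
    {
        assert(index < 8); // At most 8 simultaneous render targets in D3D12.
        mask |= 1u << index;
    }
    return mask;
}
```

For example, the result of `MakeFeedbackMask({0, 2})` would be passed as `render_target_concurrent_mask` when render targets 0 and 2 are blended programmably while the others are only written.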

#### Note on framebuffer coherency

This extension does not support full coherency between framebuffer writes and input attachment reads,
meaning that attempting programmable blending with overlapping geometry will not work, even
when feedback is enabled.

To outline the "levels" of input attachment access and what is and isn't supported:

##### Simple case, no feedback needed (supported in TIER_1)

- Write to attachment
- InputAttachmentPixelBarrier
- Only read from attachment via input attachment (this is basically an SRV now)

##### Basic programmable blending, feedback needed (supported in TIER_1)

- Render full-screen quad while reading from the attachment in the shader at the same time
- InputAttachmentPixelBarrier
- Render full-screen quad while reading from the attachment in the shader at the same time
- InputAttachmentPixelBarrier
- Render full-screen quad while reading from the attachment in the shader at the same time
- ...
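
The basic programmable blending sequence above can be sketched as a loop that places a pixel barrier between consecutive full-screen passes. This is an illustration of call ordering only. `RecordingCommandList` is a fake that logs call names, `DrawFullScreenQuad` is a hypothetical stand-in for the application's draw, and `RecordBlendPasses` assumes that only render target 0 is blended.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Fake command list which only records the order of calls.
struct RecordingCommandList
{
    std::vector<std::string> calls;

    void SetInputAttachmentFeedback(unsigned rt_mask, int depth, int stencil)
    {
        (void)depth;
        (void)stencil;
        calls.push_back("feedback:" + std::to_string(rt_mask));
    }
    void InputAttachmentPixelBarrier() { calls.push_back("barrier"); }
    // Hypothetical stand-in for a draw that reads and writes render target 0.
    void DrawFullScreenQuad() { calls.push_back("draw"); }
};

// Hypothetical helper: feedback is set once up front (before the render
// pass in a real renderer), then each pass is separated by a pixel barrier.
template <typename CommandList>
void RecordBlendPasses(CommandList &list, unsigned pass_count)
{
    list.SetInputAttachmentFeedback(1u << 0, 0, 0);
    for (unsigned i = 0; i < pass_count; i++)
    {
        if (i != 0)
            list.InputAttachmentPixelBarrier();
        list.DrawFullScreenQuad();
    }
}
```

Each quad covers every pixel exactly once, so no two fragments of the same pass touch the same pixel. That is what keeps this pattern within TIER_1.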

##### Fully coherent programmable blending (not supported in TIER_1)

- Render complex mesh with overlapping geometry, each fragment achieving correct programmable blending.

Vulkan supports the fully coherent use case through `VK_EXT_rasterization_order_attachment_access`,
but this is supported only by a few select mobile IHVs. In theory, it could be exposed through a TIER_2 at some point
if need be.
