|
| 1 | +# Tiler Optimization Extensions |
| 2 | + |
| 3 | +To allow for optimal rendering performance on tile-based GPUs, vkd3d-proton |
| 4 | +provides special interfaces that keep render targets on-chip, |
| 5 | +enabling framebuffer fetch mechanisms which the standard D3D12 APIs do not expose. |
| 6 | + |
| 7 | +Applications which intend to run D3D12 over Proton on mobile hardware such as Adreno |
| 8 | +are able to leverage this new interface with minimal changes to the renderer. |
| 9 | +The expectation is that this interface will mostly be relevant for VR games |
| 10 | +which are more likely to cater to mobile concerns. |
| 11 | + |
| 12 | +## Alternative to D3D12 RenderPass API |
| 13 | + |
| 14 | +The default [D3D12 RenderPass API](https://microsoft.github.io/DirectX-Specs/d3d/RenderPasses.html) |
| 15 | +supports extensions for tiler optimizations with APIs like |
| 16 | +`D3D12_RENDER_PASS_ENDING_ACCESS_PRESERVE_LOCAL_SRV`. |
| 17 | +However, this API is fundamentally incompatible with Vulkan, and it is also unimplementable by |
| 18 | +virtually all mobile hardware, including mobile hardware that vkd3d-proton cares about. |
| 19 | +This vkd3d-proton interface should be supported on a wide range of hardware. |
| 20 | + |
| 21 | +It relies on [VK_KHR_dynamic_rendering_local_read](https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_dynamic_rendering_local_read.html) |
| 22 | +as well as [VK_KHR_unified_image_layouts](https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_unified_image_layouts.html). |
| 23 | + |
| 24 | +### Additional programmable blending support |
| 25 | + |
| 26 | +Unlike D3D12's RenderPass API, this extended API can express programmable blending |
| 27 | +where an attachment can be sampled from even when in `D3D12_RESOURCE_STATE_RENDER_TARGET` |
| 28 | +or `D3D12_RESOURCE_STATE_DEPTH_WRITE` resource states. |
| 29 | + |
| 30 | +### Shader reuse |
| 31 | + |
| 32 | +The intent is that existing shaders can be reused. |
| 33 | +For example, this shader could be used for programmable blending: |
| 34 | + |
| 35 | +``` |
| 36 | +Texture2D<float4> RenderTarget : register(t5, space6); |
| 37 | +
|
| 38 | +float4 blend(float4 dst, float4 src) |
| 39 | +{ |
| 40 | + // Or whatever you want. |
| 41 | + return lerp(dst, src, src.a); |
| 42 | +} |
| 43 | +
|
| 44 | +float4 main(float4 pos : SV_Position, float4 color : COLOR) : SV_Target |
| 45 | +{ |
| 46 | + float4 RT = RenderTarget.Load(int3(pos.xy, 0)); |
| 47 | + return blend(RT, color); |
| 48 | +} |
| 49 | +``` |
| 50 | + |
| 51 | +New APIs remap `t5, space6` to e.g. RTV #0 in the root signature. |
| 52 | +Applications can also redirect normal Texture2D SRVs to the bound depth or stencil attachments. |
| 53 | +This allows for typical deferred rendering scenarios where the G-buffer is read from on-chip memory instead |
| 54 | +of textures. |
| 55 | +Multi-sampled attachments are also supported. |
| 56 | +This can be used for e.g. custom HDR resolves on-chip. |
| 57 | + |
| 58 | +Layered rendering and view instancing are also supported. |
| 59 | +However, in this case, a non-arrayed `Texture2D` or `Texture2DMS` is still used in the shader. |
| 60 | +The implementation samples from the corresponding layer implicitly. |
| 61 | + |
| 62 | +### `OMSetRenderTargets` support |
| 63 | + |
| 64 | +This new interface is compatible with both "immediate" `OMSetRenderTargets` style rendering |
| 65 | +as well as the more dedicated RenderPass APIs. However, to be as tiler friendly as possible, |
| 66 | +it is recommended to use the RenderPass API to get the most out of this interface. |
| 67 | + |
| 68 | +## New Device APIs |
| 69 | + |
| 70 | +``` |
| 71 | +typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPING |
| 72 | +{ |
| 73 | + UINT RegisterSpace; |
| 74 | + UINT ShaderRegister; |
| 75 | +} D3D12_VK_INPUT_ATTACHMENT_MAPPING; |
| 76 | +
|
| 77 | +typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS |
| 78 | +{ |
| 79 | + UINT NumRenderTargets; |
| 80 | + D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT]; |
| 81 | + BOOL EnableDepth; |
| 82 | + BOOL EnableStencil; |
| 83 | + D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth; |
| 84 | + D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil; |
| 85 | +} D3D12_VK_INPUT_ATTACHMENT_MAPPINGS; |
| 86 | +
|
| 87 | +typedef enum D3D12_VK_TILER_OPTIMIZATION_TIER |
| 88 | +{ |
| 89 | + D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0, |
| 90 | + D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1, |
| 91 | +} D3D12_VK_TILER_OPTIMIZATION_TIER; |
| 92 | +
|
| 93 | +[ |
| 94 | + uuid(b7798d22-9fce-434d-8eeb-c3cef1056125), |
| 95 | + object, |
| 96 | + local, |
| 97 | + pointer_default(unique) |
| 98 | +] |
| 99 | +interface ID3D12DeviceExt2 : ID3D12DeviceExt1 |
| 100 | +{ |
| 101 | + D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier(); |
| 102 | + HRESULT OptInToTilerOptimizations(); |
| 103 | + UINT GetInputAttachmentDescriptorsCount(); |
| 104 | + HRESULT CreateRootSignatureWithInputAttachments( |
| 105 | + UINT node_mask, |
| 106 | + const void *bytecode, SIZE_T bytecode_length, |
| 107 | + const D3D12_VK_INPUT_ATTACHMENT_MAPPINGS *mappings, |
| 108 | + REFIID riid, void **root_signature); |
| 109 | + void CreateInputAttachmentDescriptors(D3D12_CPU_DESCRIPTOR_HANDLE base_descriptor, |
| 110 | + UINT render_target_descriptor_count, |
| 111 | + const D3D12_CPU_DESCRIPTOR_HANDLE *render_target_descriptors, |
| 112 | + BOOL single_descriptor_handle, |
| 113 | + const D3D12_CPU_DESCRIPTOR_HANDLE *depth_descriptor, |
| 114 | + const D3D12_CPU_DESCRIPTOR_HANDLE *stencil_descriptor); |
| 115 | +} |
| 116 | +``` |
| 117 | + |
| 118 | +### `GetTilerOptimizationTier()` |
| 119 | + |
| 120 | +This is a simple query to check if these APIs are supported by the device. |
| 121 | +There is currently only one feature tier. |
| 122 | + |
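As a rough illustration of the query followed by the opt-in, the sketch below checks the tier and only then calls `OptInToTilerOptimizations()`. The enum is a local copy of the declaration above so the snippet compiles on its own; `FakeDeviceExt2` and `TryEnableTilerOptimizations` are hypothetical names used only for this example, and real code would call the same two methods on an `ID3D12DeviceExt2` pointer.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins so the sketch is self-contained; real code would include
// the vkd3d-proton headers instead.
using HRESULT = std::int32_t;
enum D3D12_VK_TILER_OPTIMIZATION_TIER {
    D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0,
    D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1,
};

// Hypothetical test double exposing the two ID3D12DeviceExt2 methods used here.
struct FakeDeviceExt2 {
    D3D12_VK_TILER_OPTIMIZATION_TIER tier = D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED;
    int opt_in_calls = 0;
    D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier() { return tier; }
    HRESULT OptInToTilerOptimizations() { ++opt_in_calls; return 0; /* S_OK */ }
};

// Returns true only if the device reports at least tier 1 and the opt-in succeeds.
// Intended to run once, right after device creation, from a single thread.
template <typename DeviceExt2>
bool TryEnableTilerOptimizations(DeviceExt2 &device) {
    if (device.GetTilerOptimizationTier() < D3D12_VK_TILER_OPTIMIZATION_TIER_1)
        return false;  // Keep using the normal SRV path.
    return device.OptInToTilerOptimizations() >= 0;  // Same check as SUCCEEDED().
}
```

An application would keep the result around and choose between the input attachment path and the normal SRV path based on it.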
| 123 | +### `HRESULT OptInToTilerOptimizations()` |
| 124 | + |
| 125 | +This is intended to be called right after the ID3D12Device is created. |
| 126 | +Setting this modifies the implementation in certain ways to make it compatible with |
| 127 | +tiler optimizations without adding a lot of extra API churn. |
| 128 | +This call is not thread-safe and should not be called concurrently with any other API command. |
| 129 | + |
| 130 | +The differences are: |
| 131 | + |
| 132 | +- If a resource is created with `ALLOW_RENDER_TARGET` or `ALLOW_DEPTH_STENCIL`, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is |
| 133 | + added automatically. |
| 134 | +- When creating RTV views, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically. |
| 135 | +- When creating DSV views, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically in some cases: |
| 136 | + - For DSV views with a single plane, the usage is added automatically. |
| 137 | + - For DSV views with both planes, the DSV is only input attachment enabled if |
| 138 | + there is exactly one plane which is marked read-only with |
| 139 | + `D3D12_DSV_FLAG_READ_ONLY_DEPTH` or `D3D12_DSV_FLAG_READ_ONLY_STENCIL`. |
| 140 | + The read-only aspect is compatible with input attachments. |
| 141 | + (This is somewhat awkward, but it removes a lot of extra API churn, and is very unlikely to come up in practice). |
| 142 | + |
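The DSV rule above can be restated as a small predicate. This is only a sketch of the documented behavior, not the implementation's actual code; `DsvGetsInputAttachmentUsage` is a hypothetical helper name, and the flag constants mirror the values of `D3D12_DSV_FLAG_READ_ONLY_DEPTH` and `D3D12_DSV_FLAG_READ_ONLY_STENCIL` from `d3d12.h`.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors of the D3D12_DSV_FLAGS values from d3d12.h.
constexpr std::uint32_t kReadOnlyDepth = 0x1;    // D3D12_DSV_FLAG_READ_ONLY_DEPTH
constexpr std::uint32_t kReadOnlyStencil = 0x2;  // D3D12_DSV_FLAG_READ_ONLY_STENCIL

// Hypothetical helper: does a DSV view get input attachment usage after opt-in?
// plane_count is 1 for depth-only or stencil-only views, 2 for depth+stencil.
bool DsvGetsInputAttachmentUsage(unsigned plane_count, std::uint32_t dsv_flags) {
    assert(plane_count == 1 || plane_count == 2);
    if (plane_count == 1)
        return true;  // Single-plane views always get the usage.
    const bool read_only_depth = (dsv_flags & kReadOnlyDepth) != 0;
    const bool read_only_stencil = (dsv_flags & kReadOnlyStencil) != 0;
    // With both planes, exactly one of them must be read-only.
    return read_only_depth != read_only_stencil;
}
```

Under this reading, a depth+stencil DSV with neither flag or with both flags set would not be usable as an input attachment.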
| 143 | +### `UINT GetInputAttachmentDescriptorsCount()` |
| 144 | + |
| 145 | +To be able to read from on-chip memory, the application allocates special SRVs in the descriptor heap. |
| 146 | +Rather than normal texture SRVs, Vulkan requires the use of `VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT`. |
| 147 | +When writing descriptors to the heap, multiple descriptors are written together as a group in |
| 148 | +a layout which is opaque to the application. The expectation is that 10 CBV_SRV_UAV descriptors |
| 149 | +are consumed (8 RT + Depth + Stencil), but it may be different due to descriptor packing concerns. |
| 150 | + |
| 151 | +For simplicity and practicality of the implementation, the number of descriptors is fixed at the upper bound. |
| 152 | +New input attachment descriptors need only be allocated once per render pass. |
| 153 | + |
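Since each render pass consumes one fixed-size group of descriptors, reserving heap space reduces to simple arithmetic. The sketch below assumes the application sets aside a contiguous range of the CBV_SRV_UAV heap for these groups; `CpuHandle` is a local stand-in for `D3D12_CPU_DESCRIPTOR_HANDLE`, `InputAttachmentGroupHandle` is a hypothetical helper, and `increment_size` is assumed to come from `GetDescriptorHandleIncrementSize()` for the heap type.

```cpp
#include <cassert>
#include <cstddef>

// Local stand-in for D3D12_CPU_DESCRIPTOR_HANDLE.
struct CpuHandle {
    std::size_t ptr;
};

// Hypothetical helper: CPU handle of the first descriptor in group `group_index`.
// descriptors_per_group is the value of GetInputAttachmentDescriptorsCount().
CpuHandle InputAttachmentGroupHandle(CpuHandle heap_start,
                                     unsigned first_slot,
                                     unsigned group_index,
                                     unsigned descriptors_per_group,
                                     unsigned increment_size) {
    assert(descriptors_per_group > 0 && increment_size > 0);
    const std::size_t slot =
        static_cast<std::size_t>(first_slot) +
        static_cast<std::size_t>(group_index) * descriptors_per_group;
    return CpuHandle{heap_start.ptr + slot * increment_size};
}
```

The returned handle would be the `base_descriptor` for one render pass; the matching GPU handle can be computed the same way from the heap's GPU start.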
| 154 | +### `HRESULT CreateRootSignatureWithInputAttachments()` |
| 155 | + |
| 156 | +This is equivalent to CreateRootSignature, except that extra information can be added |
| 157 | +for input attachments. `mappings` can be `NULL` in which case the call is equivalent |
| 158 | +to `CreateRootSignature()`. (It is not reasonable to modify the encoded RootSignature payload to |
| 159 | +hack in support for this, so this was determined to be the most practical solution.) |
| 160 | + |
| 161 | +Input attachment mappings only work for non-arrayed descriptors. I.e., shaders which access |
| 162 | +the bound attachments through bindless means will not work with this interface |
| 163 | +since the compiler needs to statically map resource variables to a render target index to |
| 164 | +be able to take advantage of on-chip data. |
| 165 | + |
| 166 | +Input attachment mappings can conflict with normal descriptor table bindings, |
| 167 | +i.e. override existing descriptor table bindings in the root signature. |
| 168 | +In this case, the input attachment mapping takes precedence. |
| 169 | +This allows applications to keep using the normal SRV path on most implementations, |
| 170 | +but selectively "opt-in" to the fast path when supported without having to modify the shader code. |
| 171 | + |
| 172 | +When a `Texture2D` or `Texture2DMS` is mapped to an input attachment, that texture must only be used |
| 173 | +with simple `::Load()` functions or equivalent. It cannot be used with a sampler object. |
| 174 | +Misuse will lead to PSO creation failure. |
| 175 | + |
| 176 | +All coordinates except for the sample index are ignored and replaced with the current pixel coordinate. |
| 177 | +To make this transformation transparent, the pixel shader can load from `int2(SV_Position.xy)`. |
| 178 | + |
| 179 | +When mappings are used, the root signature must have at least 1 DWORD available |
| 180 | +for the implementation to pass down additional data. |
| 181 | + |
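For the programmable blending shader shown earlier (`t5, space6` mapped to RTV #0), the `mappings` argument might be filled in roughly as follows. The struct definitions are local copies of the declarations above so the snippet compiles without the vkd3d-proton headers; `MapLeadingRenderTargets` is a hypothetical helper, and it only covers the case where render targets 0 to `count - 1` are all mapped.

```cpp
#include <cassert>

// Local copies of the declarations above, for a self-contained sketch.
using UINT = unsigned int;
using BOOL = int;
constexpr UINT D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT = 8;

struct D3D12_VK_INPUT_ATTACHMENT_MAPPING {
    UINT RegisterSpace;
    UINT ShaderRegister;
};

struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS {
    UINT NumRenderTargets;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT];
    BOOL EnableDepth;
    BOOL EnableStencil;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil;
};

// Hypothetical helper: maps render targets 0..count-1 to the given SRV slots.
// Depth and stencil mappings stay disabled.
D3D12_VK_INPUT_ATTACHMENT_MAPPINGS MapLeadingRenderTargets(
    const D3D12_VK_INPUT_ATTACHMENT_MAPPING *slots, UINT count) {
    assert(count <= D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT);
    D3D12_VK_INPUT_ATTACHMENT_MAPPINGS mappings = {};  // Zero-initialize everything.
    mappings.NumRenderTargets = count;
    for (UINT i = 0; i < count; ++i)
        mappings.RenderTargets[i] = slots[i];
    return mappings;
}
```

The result would be passed as `mappings` to `CreateRootSignatureWithInputAttachments()` together with the usual serialized root signature blob. Assigning the fields by name avoids mixing up `RegisterSpace` and `ShaderRegister`, which appear in the opposite order from the HLSL `register(t5, space6)` syntax.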
| 182 | +### `void CreateInputAttachmentDescriptors()` |
| 183 | + |
| 184 | +Takes the equivalent of the `OMSetRenderTargets` arguments and writes the matching input attachment descriptors starting at `base_descriptor`. |
| 185 | +`GetInputAttachmentDescriptorsCount()` number of consecutive CBV_SRV_UAV descriptors are consumed. |
| 186 | + |
| 187 | +The main difference is that depth and stencil descriptors are separate in this interface. |
| 188 | + |
| 189 | +The RTV or DSV descriptors need not be the exact same ones passed to `OMSetRenderTargets()`, |
| 190 | +but they must be equivalent except for any read-only DSV state. |
| 191 | + |
| 192 | +TODO: Add an interface for RenderPass API desc as well. |
| 193 | + |
| 194 | +NULL RTVs or DSVs are ignored, and the matching descriptor in the heap is not modified. |
| 195 | +Using input attachments to sample from a NULL RTV or DSV is undefined behavior. |
| 196 | +Just use normal SRVs instead. |
| 197 | + |
| 198 | +### PSO considerations |
| 199 | + |
| 200 | +An input attachment which intends to read from a render target must define that render target |
| 201 | +in the PSO by using a sufficiently large `NumRenderTargets`. |
| 202 | +If an SRV is mapped to render target `N`, and `N` is greater-or-equal to `NumRenderTargets`, |
| 203 | +the input attachment must not be read from. |
| 204 | + |
| 205 | +The RTV format can be `DXGI_FORMAT_UNKNOWN` if the render target is only used as an input attachment |
| 206 | +in the PSO. |
| 207 | + |
| 208 | +Depth and stencil input attachments can be read even with `DSVFormat` equal to `DXGI_FORMAT_UNKNOWN`. |
| 209 | + |
| 210 | +## New CommandList APIs |
| 211 | + |
| 212 | +``` |
| 213 | +[ |
| 214 | + uuid(9c228166-bf9e-464c-9078-ecf20a13271a), |
| 215 | + object, |
| 216 | + local, |
| 217 | + pointer_default(unique) |
| 218 | +] |
| 219 | +interface ID3D12GraphicsCommandListExt2 : ID3D12GraphicsCommandListExt1 |
| 220 | +{ |
| 221 | + void InputAttachmentPixelBarrier(); |
| 222 | + void SetRootSignatureInputAttachments(D3D12_GPU_DESCRIPTOR_HANDLE handle); |
| 223 | + void SetInputAttachmentFeedback(UINT render_target_concurrent_mask, BOOL depth_concurrent, BOOL stencil_concurrent); |
| 224 | +} |
| 225 | +``` |
| 226 | + |
| 227 | +### `void InputAttachmentPixelBarrier()` |
| 228 | + |
| 229 | +While an image is in `RENDER_TARGET` or `DEPTH_WRITE` resource states (or equivalent in enhanced barriers), |
| 230 | +it cannot be sampled from as an input attachment without performing a per-pixel barrier. |
| 231 | +This can be called at any time, even inside a render pass. |
| 232 | +Only render target writes before the pixel barrier are visible to input attachment reads after the barrier. |
| 233 | + |
| 234 | +NOTE: Unlike D3D12, Vulkan supports this use case in the `VK_IMAGE_LAYOUT_GENERAL` image layout, |
| 235 | +which is why this feature requires `VK_KHR_unified_image_layouts`. |
| 236 | + |
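For programmable blending of overlapping geometry, one plausible pattern is to place a pixel barrier between consecutive draws so that each draw can read what the previous one wrote. The sketch below records that ordering against a fake command list. `FakeCommandList` and `DrawBlendedLayers` are hypothetical names; in real code `DrawInstanced()` lives on `ID3D12GraphicsCommandList` and the barrier on `ID3D12GraphicsCommandListExt2`, so two interface pointers would be involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical recorder standing in for the command list interfaces.
struct FakeCommandList {
    std::vector<std::string> log;
    void DrawInstanced(unsigned, unsigned, unsigned, unsigned) { log.push_back("draw"); }
    void InputAttachmentPixelBarrier() { log.push_back("barrier"); }
};

// Draws `layer_count` overlapping layers, each reading the render target as an
// input attachment. Assumes nothing earlier in the pass wrote pixels that the
// first layer needs to read; otherwise a barrier is needed before it as well.
template <typename CommandList>
void DrawBlendedLayers(CommandList &cmd, unsigned layer_count) {
    for (unsigned i = 0; i < layer_count; ++i) {
        if (i != 0)
            cmd.InputAttachmentPixelBarrier();  // Make layer i-1 visible to layer i.
        cmd.DrawInstanced(3, 1, 0, 0);          // Placeholder draw arguments.
    }
}
```

A deferred renderer would instead need a single barrier between the G-buffer write phase and the lighting phase.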
| 237 | +### `void SetRootSignatureInputAttachments()` |
| 238 | + |
| 239 | +Binds the descriptors for input attachments. |
| 240 | +Unlike normal root parameters, this argument is never invalidated by binding new root signatures. |
| 241 | +It can safely be called once per OMSetRenderTargets and forgotten about. |
| 242 | +The descriptor handle must point to the currently bound descriptor heap. |
| 243 | + |
| 244 | +### `void SetInputAttachmentFeedback()` |
| 245 | + |
| 246 | +Programmable blending use cases and G-buffer deferred rendering are similar, but have different data access patterns. |
| 247 | + |
| 248 | +In programmable blending, there is concurrent access of the render target while sampling from it. |
| 249 | +In typical G-buffer deferred rendering there is no such issue, since the data flow is clearly separated into a write phase followed by a read-only phase. |
| 250 | + |
| 251 | +Even with appropriate barriers in place, there may be hazards when render targets are compressed leading to garbage pixels |
| 252 | +being read in the input attachment unless some care is taken. |
| 253 | + |
| 254 | +IMPORTANT: Calling this will end the render pass internally, so this should not be called last-minute while inside a render pass. |
| 255 | +For performance, set this state up front, alongside `OMSetRenderTargets()` or right before `BeginRenderPass()`. |
| 256 | +If you don't know up front, just enable full feedback for the render pass. |
| 257 | +It should be fine on most implementations anyway. |
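A small helper can build the `render_target_concurrent_mask` argument. This sketch assumes that bit `N` of the mask corresponds to render target `N`, which the text above does not state explicitly; `MakeConcurrentMask` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical helper. ASSUMPTION: bit N of the mask refers to render target N.
std::uint32_t MakeConcurrentMask(std::initializer_list<unsigned> render_target_indices) {
    std::uint32_t mask = 0;
    for (unsigned index : render_target_indices) {
        assert(index < 8);  // D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT
        mask |= 1u << index;
    }
    return mask;
}
```

Under the same assumption, full feedback for all eight render targets would be `0xff`, with `depth_concurrent` and `stencil_concurrent` both set to `TRUE`.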