305 changes: 305 additions & 0 deletions docs/tiler_optimization_extensions.md
@@ -0,0 +1,305 @@
# Tiler Optimization Extensions

To allow for optimal rendering performance on GPUs which support tiling, vkd3d-proton
provides special interfaces that expose this style of rendering by keeping render targets on-chip,
enabling framebuffer fetch mechanisms which the standard D3D12 APIs do not expose.

Applications which intend to run D3D12 over Proton on mobile hardware such as Adreno
are able to leverage this new interface with minimal changes to the renderer.
The expectation is that this interface will mostly be relevant for VR games
which are more likely to cater to mobile concerns.

## Alternative to D3D12 RenderPass API

The default [D3D12 RenderPass API](https://microsoft.github.io/DirectX-Specs/d3d/RenderPasses.html)
supports extensions for tiler optimizations with APIs like
`D3D12_RENDER_PASS_ENDING_ACCESS_PRESERVE_LOCAL_SRV`.
However, this API is fundamentally incompatible with Vulkan, and it is also unimplementable by
virtually all mobile hardware, including mobile hardware that vkd3d-proton cares about.
This vkd3d-proton interface should be supportable on a wide range of hardware.

It relies on [VK_KHR_dynamic_rendering_local_read](https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_dynamic_rendering_local_read.html)
as well as [VK_KHR_unified_image_layouts](https://docs.vulkan.org/refpages/latest/refpages/source/VK_KHR_unified_image_layouts.html).

### Additional programmable blending support

Unlike D3D12's RenderPass API, this extended API can express programmable blending
where an attachment can be sampled from even when in `D3D12_RESOURCE_STATE_RENDER_TARGET`
or `D3D12_RESOURCE_STATE_DEPTH_WRITE` resource states.

### Shader reuse

The intent is that existing shaders can be reused.
For example, this shader could be used for programmable blending:

```
Texture2D<float4> RenderTarget : register(t5, space6);

float4 blend(float4 dst, float4 src)
{
    // Or whatever you want.
    return lerp(dst, src, src.a);
}

float4 main(float4 pos : SV_Position, float4 color : COLOR) : SV_Target
{
    float4 RT = RenderTarget.Load(int3(pos.xy, 0));
    return blend(RT, color);
}
```

where we have new APIs for remapping `t5, space6` to e.g. RTV #0 in the root signature.
Applications can also redirect normal Texture2D SRVs to the bound depth or stencil attachments.
This allows for typical deferred rendering scenarios where the G-buffer is read from on-chip memory instead
of from textures, which saves significant memory bandwidth.
Multi-sampled input attachments are also supported. This can be used for e.g. custom HDR resolves on-chip.

Layered rendering and view instancing are also supported.
However, in this case, a non-arrayed `Texture2D` or `Texture2DMS` is still used in the shader.
The implementation samples from the corresponding layer implicitly.

### `OMSetRenderTargets` support

This new interface is compatible with both "immediate" `OMSetRenderTargets` style rendering
and the more dedicated RenderPass APIs. However, to be as tiler friendly as possible,
prefer the RenderPass API, since it can express e.g. discarding G-buffer attachments
which no longer need to remain valid after the lighting pass is done.

## New Device APIs

```
typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPING
{
    UINT RegisterSpace;
    UINT ShaderRegister;
} D3D12_VK_INPUT_ATTACHMENT_MAPPING;

typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS
{
    UINT NumRenderTargets;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT];
    BOOL EnableDepth;
    BOOL EnableStencil;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil;
} D3D12_VK_INPUT_ATTACHMENT_MAPPINGS;

typedef enum D3D12_VK_TILER_OPTIMIZATION_TIER
{
    D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0,
    D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1,
} D3D12_VK_TILER_OPTIMIZATION_TIER;

[
    uuid(b7798d22-9fce-434d-8eeb-c3cef1056125),
    object,
    local,
    pointer_default(unique)
]
interface ID3D12DeviceExt2 : ID3D12DeviceExt1
{
    D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier();
    HRESULT OptInToTilerOptimizations();
    UINT GetInputAttachmentDescriptorsCount();
    HRESULT CreateRootSignatureWithInputAttachments(
            UINT node_mask,
            const void *bytecode, SIZE_T bytecode_length,
            const D3D12_VK_INPUT_ATTACHMENT_MAPPINGS *mappings,
            REFIID riid, void **root_signature);
    void CreateInputAttachmentDescriptors(
            UINT render_target_descriptor_count,
            const D3D12_CPU_DESCRIPTOR_HANDLE *render_target_descriptors,
            BOOL single_descriptor_handle,
            const D3D12_CPU_DESCRIPTOR_HANDLE *depth_descriptor,
            const D3D12_CPU_DESCRIPTOR_HANDLE *stencil_descriptor,
            D3D12_CPU_DESCRIPTOR_HANDLE base_descriptor);
}
```

### `GetTilerOptimizationTier()`

This is a simple query to check if these APIs are supported by the device.
There is currently only one feature tier.
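
As a rough sketch of how an application might gate its fast path on this query, the helper below checks the tier and then performs the opt-in call described in the next section. The names `try_enable_tiler_path` and `FakeDeviceExt2` are hypothetical, and the types are local stand-ins mirroring the IDL above so that the snippet compiles without the vkd3d-proton headers.

```cpp
#include <cassert>

// Local stand-ins mirroring the IDL above (assumption: real code would
// include the vkd3d-proton extension headers instead).
using HRESULT = long;
enum D3D12_VK_TILER_OPTIMIZATION_TIER
{
    D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0,
    D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1,
};

// Hypothetical helper. Accepts any pointer-like object exposing the two
// ID3D12DeviceExt2 methods used here.
template <typename DeviceExt2Ptr>
bool try_enable_tiler_path(DeviceExt2Ptr device_ext)
{
    assert(device_ext != nullptr);
    if (device_ext->GetTilerOptimizationTier() < D3D12_VK_TILER_OPTIMIZATION_TIER_1)
        return false; // Not supported: keep using the normal SRV path.
    // The opt-in is not thread-safe, so call this before other threads
    // start using the device.
    return device_ext->OptInToTilerOptimizations() >= 0; // same test as SUCCEEDED()
}

// Fake device, used only to exercise the helper without a real device.
struct FakeDeviceExt2
{
    D3D12_VK_TILER_OPTIMIZATION_TIER tier;
    bool opted_in = false;
    D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier() { return tier; }
    HRESULT OptInToTilerOptimizations() { opted_in = true; return 0; }
};
```

With a real device, the pointer would be obtained by querying the `ID3D12Device` for `ID3D12DeviceExt2`. If that query fails, the application simply stays on its existing path.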

### `HRESULT OptInToTilerOptimizations()`

This is intended to be called right after the ID3D12Device is created or
as early as possible in the lifetime of the device.
Calling this modifies the implementation in certain ways to make it compatible with
tiler optimizations without adding a lot of extra API churn which would clutter an application.
This call is not thread-safe and should not be called concurrently with any other API command.

The differences are:

- If a resource is created with `ALLOW_RENDER_TARGET` or `ALLOW_DEPTH_STENCIL`
and the image can be sampled from (i.e., no `D3D12_RESOURCE_FLAG_DENY_SHADER_RESOURCE`),
`VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically.
- When creating RTV views, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically.
- When creating DSV views, `VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT` is added automatically in some cases:
  - For DSV views with a single plane, the usage is added automatically.
  - For DSV views with both planes, the DSV is only input attachment enabled if
    there is exactly one plane which is marked read-only with
    `D3D12_DSV_FLAG_READ_ONLY_DEPTH` or `D3D12_DSV_FLAG_READ_ONLY_STENCIL`.
    The read-only aspect is compatible with input attachments.
    (This is somewhat awkward, but it removes a lot of extra API churn, and is very unlikely to come up in practice).

NOTE: If the application needs to use both depth and stencil input attachments at the same time,
two DSVs can be created, one with read-only depth, and the other with read-only stencil. The
separate DSVs can then be passed to `CreateInputAttachmentDescriptors()`.
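
The DSV rule above can be restated as a small predicate. This is only an illustration of the documented behavior, not code from the implementation. `dsv_gets_input_attachment_usage` is a hypothetical name, and the flag values are redeclared locally (matching `d3d12.h`) to keep the snippet self-contained.

```cpp
#include <cassert>

// Values match d3d12.h; redeclared here so the sketch stands alone.
enum D3D12_DSV_FLAGS : unsigned
{
    D3D12_DSV_FLAG_NONE = 0,
    D3D12_DSV_FLAG_READ_ONLY_DEPTH = 0x1,
    D3D12_DSV_FLAG_READ_ONLY_STENCIL = 0x2,
};

// Hypothetical helper: returns true when, per the rules above, a DSV would
// be created with input attachment usage after opting in.
bool dsv_gets_input_attachment_usage(bool has_depth_plane, bool has_stencil_plane,
        unsigned flags)
{
    assert(has_depth_plane || has_stencil_plane);
    if (!(has_depth_plane && has_stencil_plane))
        return true; // Single-plane DSV: usage is always added.
    const bool ro_depth = (flags & D3D12_DSV_FLAG_READ_ONLY_DEPTH) != 0;
    const bool ro_stencil = (flags & D3D12_DSV_FLAG_READ_ONLY_STENCIL) != 0;
    return ro_depth != ro_stencil; // Exactly one plane must be read-only.
}
```

Under this reading, an application that wants both aspects creates one DSV with `D3D12_DSV_FLAG_READ_ONLY_DEPTH` and another with `D3D12_DSV_FLAG_READ_ONLY_STENCIL`, as the note describes.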

### `UINT GetInputAttachmentDescriptorsCount()`

To be able to read from on-chip memory, the application allocates special SRVs in the descriptor heap.
Rather than normal texture SRVs, Vulkan requires the use of `VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT`,
so the normal `CreateShaderResourceView()` will not work.

When writing descriptors to the heap, multiple descriptors are written together as a group in
a layout which is opaque to the application. The expectation is that 10 CBV_SRV_UAV descriptors
are consumed (8 RT + Depth + Stencil), but it may be different due to descriptor packing concerns.
(`vkGetDescriptorSetLayoutSizeEXT()` determines the number of descriptors.)

For simplicity and practicality of the implementation, the number of descriptors is fixed at the upper bound.
New input attachment descriptors need only be allocated once per render pass, and a few extra wasted descriptors
should not be a major concern.
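
One way an application might reserve these descriptor groups is a simple bump allocator over its CBV_SRV_UAV heap, sketched below. `CpuHandle` is a local stand-in for `D3D12_CPU_DESCRIPTOR_HANDLE`, and `InputAttachmentSlots` is a hypothetical helper. The group size is assumed to come from `GetInputAttachmentDescriptorsCount()` and the increment from `ID3D12Device::GetDescriptorHandleIncrementSize()`.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for D3D12_CPU_DESCRIPTOR_HANDLE (assumption: same single-field layout).
struct CpuHandle { std::size_t ptr; };

// Hypothetical bump allocator handing out one input attachment group per
// render pass from a region of a CBV_SRV_UAV descriptor heap.
struct InputAttachmentSlots
{
    CpuHandle heap_start;   // First descriptor of the reserved region.
    std::size_t increment;  // Bytes between consecutive descriptors.
    unsigned group_size;    // Descriptors consumed per group.
    unsigned heap_capacity; // Total descriptors in the reserved region.
    unsigned next = 0;      // Index of the next free descriptor.

    // Returns the base handle of a fresh group of group_size descriptors.
    CpuHandle allocate_group()
    {
        assert(next + group_size <= heap_capacity);
        CpuHandle base{heap_start.ptr + std::size_t(next) * increment};
        next += group_size;
        return base;
    }
};
```

The returned handle would be passed as `base_descriptor` to `CreateInputAttachmentDescriptors()`. Since a group only needs to be allocated once per render pass, a linear allocator like this is likely sufficient.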

### `HRESULT CreateRootSignatureWithInputAttachments()`

This is equivalent to `CreateRootSignature()`, except that extra information can be added
for input attachments. `mappings` can be `NULL` in which case the call is equivalent
to `CreateRootSignature()`. (It is not reasonable to modify the encoded RootSignature payload to
hack in support for this, so this was determined to be the most practical solution.)

Input attachment mappings only work for non-arrayed descriptors. I.e., shaders which access
the bound attachments through bindless means will not work with this interface
since the compiler needs to statically map resource variables to a render target index to
be able to take advantage of on-chip data.

Input attachment mappings can conflict with normal descriptor table bindings,
i.e. override existing descriptor table bindings in the root signature.
In this case, the input attachment mapping takes precedence.
This allows applications to keep using the normal SRV path on most implementations,
but selectively "opt-in" to the fast path when supported without having to modify the shader code.

When a `Texture2D` or `Texture2DMS` is mapped to an input attachment, that texture must only be used
with simple `::Load()` functions or equivalent. It cannot be used with a sampler object.
Misuse will lead to PSO creation failure.

All coordinates except the sample index are ignored and replaced with the current pixel coordinate.
To make this transformation transparent, the pixel shader can load from `int2(SV_Position.xy)`, so that the normal SRV fallback path reads the same texel.

When mappings are used, the root signature must have at least 1 DWORD available
for the implementation to pass down additional data.
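
To make the mapping concrete, the sketch below fills `D3D12_VK_INPUT_ATTACHMENT_MAPPINGS` so that `t5, space6` from the earlier shader is redirected to RTV #0. The struct definitions are local copies of the IDL above, and `make_rt_mappings` is a hypothetical helper. The sketch assumes render targets are mapped contiguously from index 0, since this document does not describe how to leave a lower index unmapped.

```cpp
#include <cassert>

// Local copies of the IDL definitions so the sketch is self-contained.
using UINT = unsigned int;
using BOOL = int;
constexpr UINT D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT = 8;

struct D3D12_VK_INPUT_ATTACHMENT_MAPPING
{
    UINT RegisterSpace;
    UINT ShaderRegister;
};

struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS
{
    UINT NumRenderTargets;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT];
    BOOL EnableDepth;
    BOOL EnableStencil;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth;
    D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil;
};

// Hypothetical helper: maps render targets 0..count-1 to the given SRV
// bindings and leaves depth and stencil disabled.
D3D12_VK_INPUT_ATTACHMENT_MAPPINGS make_rt_mappings(
        const D3D12_VK_INPUT_ATTACHMENT_MAPPING *rts, UINT count)
{
    assert(count <= D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT);
    D3D12_VK_INPUT_ATTACHMENT_MAPPINGS mappings = {};
    mappings.NumRenderTargets = count;
    for (UINT i = 0; i < count; i++)
        mappings.RenderTargets[i] = rts[i];
    return mappings;
}
```

For the shader shown earlier, the single entry would be `{6, 5}` (register space first, then shader register). The resulting struct would be passed as `mappings` to `CreateRootSignatureWithInputAttachments()`.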

### `void CreateInputAttachmentDescriptors()`

Takes the equivalent of an `OMSetRenderTargets` call and writes input attachment descriptors for those views.
`GetInputAttachmentDescriptorsCount()` consecutive CBV_SRV_UAV descriptors are clobbered, starting at `base_descriptor`.

The main difference is that depth and stencil descriptors are separate in this interface.
The RTV or DSV descriptors need not be the exact same `D3D12_CPU_DESCRIPTOR_HANDLE` passed to `OMSetRenderTargets()`,
but they must be equivalent except for any read-only DSV state.

TODO: Add an interface for RenderPass API desc as well?

NULL RTVs or DSVs are ignored, and the matching descriptor in the heap is not modified.
Using input attachments to sample from a NULL RTV or DSV is undefined behavior.
Just use normal SRVs instead.

### PSO considerations

An input attachment which intends to read from a render target must define that render target
in the PSO by using a sufficiently large `NumRenderTargets`.
If an SRV is mapped to render target `N`, and `N` is greater than or equal to `NumRenderTargets`,
the input attachment must not be read from.

The RTV format can be `DXGI_FORMAT_UNKNOWN` if the render target is only used as an input attachment
in the PSO.

Depth and stencil input attachments can be read from even when `DSVFormat` is `DXGI_FORMAT_UNKNOWN`.

## New CommandList APIs

```
[
    uuid(9c228166-bf9e-464c-9078-ecf20a13271a),
    object,
    local,
    pointer_default(unique)
]
interface ID3D12GraphicsCommandListExt2 : ID3D12GraphicsCommandListExt1
{
    void InputAttachmentPixelBarrier();
    void SetRootSignatureInputAttachments(D3D12_GPU_DESCRIPTOR_HANDLE handle);
    void SetInputAttachmentFeedback(UINT render_target_concurrent_mask, BOOL depth_concurrent, BOOL stencil_concurrent);
}
```

### `void InputAttachmentPixelBarrier()`

While an image is in `RENDER_TARGET` or `DEPTH_WRITE` resource states
(or equivalent `D3D12_BARRIER_LAYOUT_RENDER_TARGET` or `D3D12_BARRIER_LAYOUT_DEPTH_STENCIL_WRITE`),
it cannot be sampled from as an input attachment without performing a per-pixel barrier.
This can be called at any time, even inside a render pass.
Only render target writes before the pixel barrier are visible to input attachment reads after the barrier.

NOTE: Unlike D3D12, Vulkan supports this use case in the `VK_IMAGE_LAYOUT_GENERAL` image layout,
which is why this feature requires `VK_KHR_unified_image_layouts`.
This is a pure memory barrier, and not a layout transition.

### `void SetRootSignatureInputAttachments()`

Binds the descriptors for input attachments for graphics pipelines.
Unlike normal root parameters, this argument is never invalidated by binding new graphics root signatures.
It also does not need a root parameter index.
It can safely be called once per `OMSetRenderTargets()` call and then left alone.
The descriptor handle must point to the currently bound descriptor heap.

### `void SetInputAttachmentFeedback()`

Programmable blending use cases and G-buffer deferred rendering are similar, but have different data access patterns.

In programmable blending, the render target is accessed concurrently while it is being sampled from.
Even with appropriate barriers in place, there may be hazards when render targets are compressed
(compressed render targets typically operate on some block structure),
leading to garbage pixels being read through the input attachment unless the implementation knows about this case up front.
In typical G-buffer deferred rendering there is no such issue, since the data flow is clearly separated into a writing phase
followed by a read-only phase.

By default, feedback is disabled. For render targets, if bit N is set in the mask, feedback is enabled for render target N.

IMPORTANT: Calling this will end the render pass internally, so this should not be called last-minute while inside a render pass.
For performance, set this state up front, alongside `OMSetRenderTargets()` or right before `BeginRenderPass()`.
If you don't know up front, just enable full feedback for the render pass.
It should be fine on most implementations anyway.
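
As a small illustration of the mask encoding, the hypothetical helper below builds `render_target_concurrent_mask` from a list of render target indices, setting bit N for render target N as described above.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical helper: builds the render_target_concurrent_mask argument
// for SetInputAttachmentFeedback() from render target indices.
unsigned feedback_mask(std::initializer_list<unsigned> render_targets)
{
    unsigned mask = 0;
    for (unsigned rt : render_targets)
    {
        assert(rt < 8); // D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT
        mask |= 1u << rt;
    }
    return mask;
}
```

For example, `feedback_mask({0, 2})` yields `0x5`, enabling feedback for render targets 0 and 2. An application that does not know its access pattern up front could pass all indices it binds.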

#### Note on framebuffer coherency

The input attachments in this extension do not support fully coherent framebuffers and input attachments,
meaning that attempting programmable blending with overlapping geometry will not work, even
when feedback is enabled.

To outline the "levels" of input attachment access and what is and isn't supported:

##### Simple case, no feedback needed (supported in TIER_1)

- Write to attachment
- InputAttachmentPixelBarrier
- Only read from attachment via input attachment (this is basically an SRV now)

##### Basic programmable blending, feedback needed (supported in TIER_1)

- Render full-screen quad while reading from the attachment in the shader at the same time
- InputAttachmentPixelBarrier
- Render full-screen quad while reading from the attachment in the shader at the same time
- InputAttachmentPixelBarrier
- Render full-screen quad while reading from the attachment in the shader at the same time
- ...
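
The sequence above could be recorded roughly as follows. This is a sketch under assumptions: `blend_passes` is a hypothetical helper, the draw callback stands in for whatever the application uses to render its full-screen quad, and `RecordingList` is a fake command list that only logs calls so the ordering can be inspected without a GPU.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: records pass_count full-screen blend passes on render
// target 0, separated by pixel barriers. CmdList is any type exposing the two
// ID3D12GraphicsCommandListExt2 methods used here.
template <typename CmdList, typename DrawFn>
void blend_passes(CmdList &cl, unsigned pass_count, DrawFn draw_fullscreen_quad)
{
    assert(pass_count > 0);
    // Enable feedback for RT 0 up front, since changing it later ends the
    // render pass internally.
    cl.SetInputAttachmentFeedback(1u << 0, false, false);
    for (unsigned i = 0; i < pass_count; i++)
    {
        if (i != 0)
            cl.InputAttachmentPixelBarrier(); // Make the previous pass visible.
        draw_fullscreen_quad(cl);
    }
}

// Fake command list that logs call names, used only for illustration.
struct RecordingList
{
    std::vector<std::string> calls;
    void SetInputAttachmentFeedback(unsigned, int, int) { calls.push_back("feedback"); }
    void InputAttachmentPixelBarrier() { calls.push_back("barrier"); }
};
```

With three passes, the recorded order is feedback, draw, barrier, draw, barrier, draw. Each pass reads only what earlier passes wrote, which is why this pattern stays within TIER_1.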

##### Fully coherent programmable blending (not supported in TIER_1)

- Render complex mesh with overlapping geometry, each fragment achieving correct programmable blending.

Vulkan supports the fully coherent use case through `VK_EXT_rasterization_order_attachment_access`,
but this is supported only by a few select mobile IHVs. In theory, it could be exposed through a TIER_2 at some point
if need be.
13 changes: 13 additions & 0 deletions include/vkd3d_command_list_vkd3d_ext.idl
@@ -40,3 +40,16 @@ interface ID3D12GraphicsCommandListExt1 : ID3D12GraphicsCommandListExt
{
HRESULT LaunchCubinShaderEx(D3D12_CUBIN_DATA_HANDLE *handle, UINT32 block_x, UINT32 block_y, UINT32 block_z, UINT32 smem_size, const void *params, UINT32 param_size, const void *raw_params, UINT32 raw_params_count);
}

[
uuid(9c228166-bf9e-464c-9078-ecf20a13271a),
object,
local,
pointer_default(unique)
]
interface ID3D12GraphicsCommandListExt2 : ID3D12GraphicsCommandListExt1
{
void InputAttachmentPixelBarrier();
void SetRootSignatureInputAttachments(D3D12_GPU_DESCRIPTOR_HANDLE handle);
void SetInputAttachmentFeedback(UINT render_target_concurrent_mask, BOOL depth_concurrent, BOOL stencil_concurrent);
}
47 changes: 47 additions & 0 deletions include/vkd3d_device_vkd3d_ext.idl
@@ -111,6 +111,53 @@ interface ID3D12DeviceExt1 : ID3D12DeviceExt
HRESULT GetVulkanQueueInfoEx(ID3D12CommandQueue *queue, VkQueue *vk_queue, UINT32 *vk_queue_index, UINT32 *vk_queue_flags, UINT32 *vk_queue_family);
}

typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPING
{
UINT RegisterSpace;
UINT ShaderRegister;
} D3D12_VK_INPUT_ATTACHMENT_MAPPING;

typedef struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS
{
UINT NumRenderTargets;
D3D12_VK_INPUT_ATTACHMENT_MAPPING RenderTargets[D3D12_SIMULTANEOUS_RENDER_TARGET_COUNT];
BOOL EnableDepth;
BOOL EnableStencil;
D3D12_VK_INPUT_ATTACHMENT_MAPPING Depth;
D3D12_VK_INPUT_ATTACHMENT_MAPPING Stencil;
} D3D12_VK_INPUT_ATTACHMENT_MAPPINGS;

typedef enum D3D12_VK_TILER_OPTIMIZATION_TIER
{
D3D12_VK_TILER_OPTIMIZATION_NOT_SUPPORTED = 0,
D3D12_VK_TILER_OPTIMIZATION_TIER_1 = 1,
} D3D12_VK_TILER_OPTIMIZATION_TIER;

[
uuid(b7798d22-9fce-434d-8eeb-c3cef1056125),
object,
local,
pointer_default(unique)
]
interface ID3D12DeviceExt2 : ID3D12DeviceExt1
{
D3D12_VK_TILER_OPTIMIZATION_TIER GetTilerOptimizationTier();
HRESULT OptInToTilerOptimizations();
UINT GetInputAttachmentDescriptorsCount();
HRESULT CreateRootSignatureWithInputAttachments(
UINT node_mask,
const void *bytecode, SIZE_T bytecode_length,
const D3D12_VK_INPUT_ATTACHMENT_MAPPINGS *mappings,
REFIID riid, void **root_signature);
void CreateInputAttachmentDescriptors(
UINT render_target_descriptor_count,
const D3D12_CPU_DESCRIPTOR_HANDLE *render_target_descriptors,
BOOL single_descriptor_handle,
const D3D12_CPU_DESCRIPTOR_HANDLE *depth_descriptor,
const D3D12_CPU_DESCRIPTOR_HANDLE *stencil_descriptor,
D3D12_CPU_DESCRIPTOR_HANDLE base_descriptor);
}

/* 1:1 implementation of ffx_antilag2_dx12.h */
struct AmdAntiLagAPIData_v1
{
5 changes: 5 additions & 0 deletions include/vkd3d_shader.h
@@ -31,6 +31,8 @@
extern "C" {
#endif /* __cplusplus */

struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS;

enum vkd3d_shader_compiler_option
{
VKD3D_SHADER_STRIP_DEBUG = 0x00000001,
@@ -326,6 +328,9 @@ struct vkd3d_shader_interface_info
unsigned int root_parameter_mapping_count;
const void *root_signature_blob;
size_t root_signature_blob_size;

const struct D3D12_VK_INPUT_ATTACHMENT_MAPPINGS *input_attachment_mappings;
unsigned int input_attachment_mappings_desc_set;
};

struct vkd3d_shader_descriptor_table
2 changes: 1 addition & 1 deletion khronos/Vulkan-Headers
Submodule Vulkan-Headers updated 68 files
+9 −13 .github/workflows/ci.yml
+4 −4 BUILD.gn
+40 −55 CMakeLists.txt
+58 −27 Makefile.release
+23 −23 include/vk_video/vulkan_video_codec_av1std.h
+1 −1 include/vk_video/vulkan_video_codec_av1std_decode.h
+1 −1 include/vk_video/vulkan_video_codec_av1std_encode.h
+9 −9 include/vk_video/vulkan_video_codec_h264std.h
+2 −2 include/vk_video/vulkan_video_codec_h264std_decode.h
+1 −1 include/vk_video/vulkan_video_codec_h264std_encode.h
+24 −24 include/vk_video/vulkan_video_codec_h265std.h
+2 −2 include/vk_video/vulkan_video_codec_h265std_decode.h
+1 −1 include/vk_video/vulkan_video_codec_h265std_encode.h
+9 −9 include/vk_video/vulkan_video_codec_vp9std.h
+1 −1 include/vk_video/vulkan_video_codec_vp9std_decode.h
+1 −1 include/vk_video/vulkan_video_codecs_common.h
+2 −1 include/vulkan/vk_icd.h
+3 −0 include/vulkan/vk_layer.h
+1 −1 include/vulkan/vk_platform.h
+5,271 −3,364 include/vulkan/vulkan.cppm
+1 −1 include/vulkan/vulkan.h
+21,472 −10,137 include/vulkan/vulkan.hpp
+7 −1 include/vulkan/vulkan_android.h
+59 −2 include/vulkan/vulkan_beta.h
+4,379 −2,212 include/vulkan/vulkan_core.h
+5 −1 include/vulkan/vulkan_directfb.h
+7,160 −7,373 include/vulkan/vulkan_enums.hpp
+3,765 −1,549 include/vulkan/vulkan_extension_inspection.hpp
+7,860 −6,424 include/vulkan/vulkan_format_traits.hpp
+21 −1 include/vulkan/vulkan_fuchsia.h
+21,178 −15,101 include/vulkan/vulkan_funcs.hpp
+3 −1 include/vulkan/vulkan_ggp.h
+14,569 −8,843 include/vulkan/vulkan_handles.hpp
+12,232 −9,232 include/vulkan/vulkan_hash.hpp
+49 −39 include/vulkan/vulkan_hpp_macros.hpp
+3 −1 include/vulkan/vulkan_ios.h
+3 −1 include/vulkan/vulkan_macos.h
+9 −1 include/vulkan/vulkan_metal.h
+74 −5 include/vulkan/vulkan_ohos.h
+24,551 −20,781 include/vulkan/vulkan_raii.hpp
+7 −1 include/vulkan/vulkan_screen.h
+686 −596 include/vulkan/vulkan_shared.hpp
+8,051 −4,253 include/vulkan/vulkan_static_assertions.hpp
+113,108 −66,457 include/vulkan/vulkan_structs.hpp
+6,561 −5,587 include/vulkan/vulkan_to_string.hpp
+3 −1 include/vulkan/vulkan_vi.h
+308 −155 include/vulkan/vulkan_video.cppm
+398 −247 include/vulkan/vulkan_video.hpp
+5 −1 include/vulkan/vulkan_wayland.h
+31 −1 include/vulkan/vulkan_win32.h
+5 −1 include/vulkan/vulkan_xcb.h
+5 −1 include/vulkan/vulkan_xlib.h
+5 −1 include/vulkan/vulkan_xlib_xrandr.h
+3 −1 registry/apiconventions.py
+370 −67 registry/base_generator.py
+33 −2 registry/cgenerator.py
+44 −53 registry/generator.py
+1 −1 registry/parse_dependency.py
+158 −22 registry/reg.py
+3 −2 registry/spec_tools/conventions.py
+1 −1 registry/spec_tools/util.py
+1 −1 registry/stripAPI.py
+22,093 −15,595 registry/validusage.json
+120 −119 registry/video.xml
+4,051 −2,249 registry/vk.xml
+17 −1 registry/vkconventions.py
+142 −32 registry/vulkan_object.py
+5 −2 tests/CMakeLists.txt