Compute filter downsample and other fixes #166

zlatinski · 2025-10-18T04:00:11Z

1ab36f1 common: add helper PoolNodeHandler class to CommandBufferPool
fa9c970 common: VkBufferResource documentation
d37c7d0 VulkanShaderCompiler: Fix shaderc singleton
4c7ca1d common: VulkanFilter - fixed the header file
327b7a2 common: VulkanDeviceContext set queue family for XFER
a004479 common: Optimize shader by precomputing half-resolution in push constants
f2fc2d4 common: Add row/column replication flags to bitmask
7efca55 common: Refactor filter options to use bitmask flags instead of multiple bools
ed989f8 common: Fix Y-subsampling to use fixed 2x2 blocks regardless of output format
821db54 common: add subsampled Y output image to the filter
7394fb5 common: YuvCompute - Add Y-only and CbCr-only plane support

Copilot

Pull Request Overview

Adds Y-subsampled (2x2) luma output to the compute filter pipeline, introduces a subsampled Y image pool to the encoder, refactors filter configuration to bitmask flags, and fixes/optimizes shader compilation and dispatch.

Adds optional subsampled Y output (binding 9) and half-res dispatch in VulkanFilterYuvCompute, with new push constants and flags
Allocates and uses a dedicated subsampled Y image pool in the encoder; updates call sites to new filter API
Introduces a thread-safe shared shaderc compiler singleton; various utility and testing additions

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 17 comments.

Show a summary per file

File	Description
vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.h	Adds subsampled image resource member and getter; integrates with pools
vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.cpp	Allocates/configures subsampled Y pool; updates filter flags and RecordCommandBuffer call
vk_video_decoder/libs/VkVideoDecoder/VkVideoDecoder.cpp	Adapts to new RecordCommandBuffer signature
scripts/test_y_subsample.py	New end-to-end test harness for subsampling; remote/local execution
common/libs/VkCodecUtils/vulkan_internal_dump.cpp	New helper for dumping YCbCr/subsampled surfaces to files
common/libs/VkCodecUtils/VulkanShaderCompiler.h/.cpp	Adds shared shaderc compiler with refcount and options; thread-safety
common/libs/VkCodecUtils/VulkanFilterYuvCompute.h/.cpp	Flag refactor, binding 9 support, 2x2 block processing, push constants, descriptor updates
common/libs/VkCodecUtils/VulkanFilter.h	Changes base virtual RecordCommandBuffer signature; adds optional subsampled output
common/libs/VkCodecUtils/VulkanDeviceContext.cpp	Improves transfer queue family selection
common/libs/VkCodecUtils/VulkanComputePipeline.h	Adds setters for manual pipeline creation
common/libs/VkCodecUtils/VulkanCommandBufferPool.h	Adds PoolNodeHandler RAII helper for command buffers
common/libs/VkCodecUtils/VkImageResource.h/.cpp	Adjusts image view/plane handling (constructor/getters/destructor)
common/libs/VkCodecUtils/VkBufferResource.h/.cpp	API docs and naming cleanup; minor refactor

Comments suppressed due to low confidence (1)

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp:1

This overload accepts subsampledYImageView but never binds it to binding 9, so the shader's subsampledYImage write will target an uninitialized descriptor. Mirror the binding 9 write logic from the image->image path when subsampledYImageView is provided.

/*

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

scripts/test_y_subsample.py

common/libs/VkCodecUtils/vulkan_internal_dump.cpp

common/libs/VkCodecUtils/VkImageResource.cpp

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

common/libs/VkCodecUtils/VulkanFilter.h

common/libs/VkCodecUtils/VkBufferResource.h

Copilot

Pull Request Overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-10-18T07:26:29Z

vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.cpp

+            m_vkDevCtx->GetComputeQueueFamilyIdx(),
+            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,  // Device-local for max GPU performance
+            nullptr,  // pVideoProfile
+            VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image


The subsampled Y images are created with formats VK_FORMAT_R8_UNORM or VK_FORMAT_R16_UNORM (non-multiplanar), but Configure() is passed VK_IMAGE_ASPECT_PLANE_0_BIT. For non-multiplanar formats the image view aspect must be VK_IMAGE_ASPECT_COLOR_BIT; using PLANE_0 will cause invalid image view creation. Replace VK_IMAGE_ASPECT_PLANE_0_BIT with VK_IMAGE_ASPECT_COLOR_BIT.

Suggested change

VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image

VK_IMAGE_ASPECT_COLOR_BIT, // Single-plane Y image (non-multiplanar)

Copilot · 2025-10-18T07:26:29Z

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

+                                false,                 // isInput (it's output)
+                                9,                     // binding
+                                0,                     // set
+                                true);                 // imageArray


Binding 9 is declared as image2DArray, but the subsampled image pool is configured without arrays (useImageArray/useImageViewArray are false). This descriptor/view type mismatch will fail validation. Either change this to a non-array binding (imageArray=false) or configure the subsampled image pool to create array images/views.

Suggested change

true); // imageArray

false); // imageArray

Copilot · 2025-10-18T07:26:29Z

vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.cpp

+            false,    // useImageArray
+            false,    // useImageViewArray  


The subsampled Y shader binding (binding 9) is generated as image2DArray, but the pool here disables image arrays and array views. To keep descriptor/view types compatible, either set useImageArray/useImageViewArray to true for this pool or adjust the shader to use image2D (non-array) for the subsampled Y output.

Suggested change

false, // useImageArray

false, // useImageViewArray

true, // useImageArray

true, // useImageViewArray

Copilot · 2025-10-18T07:26:30Z

common/libs/VkCodecUtils/VulkanShaderCompiler.cpp

+            std::cerr << "VulkanShaderCompiler: Failed to initialize shared shaderc compiler!" << std::endl;
+            return nullptr;
+        }
+        std::cout << "VulkanShaderCompiler: Initialized shared shaderc compiler" << std::endl;


[nitpick] Unconditional std::cout logging in a library can be noisy in production contexts. Consider gating these logs behind a debug flag/environment variable or using the project's logging facility with appropriate levels.

Copilot · 2025-10-18T07:26:30Z

common/libs/VkCodecUtils/VulkanShaderCompiler.cpp

+        // Last instance - release the shared compiler
+        shaderc_compiler_release(g_sharedCompiler);
+        g_sharedCompiler = nullptr;
+        std::cout << "VulkanShaderCompiler: Released shared shaderc compiler" << std::endl;


[nitpick] Unconditional std::cout logging in a library can be noisy in production contexts. Consider gating these logs behind a debug flag/environment variable or using the project's logging facility with appropriate levels.

t-boiko · 2025-11-27T13:32:46Z

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

+        shaderStr <<
+            "    // Calculate chroma output dimensions\n"
+            "    uint chromaWidth = (pushConstants.outputWidth + " << (chromaHorzRatio - 1) << ") / " << chromaHorzRatio << ";\n"
+            "    uint chromaHeight = (pushConstants.outputHeight + " << (chromaVertRatio - 1) << ") / " << chromaVertRatio << ";\n";


Since these are always powers of 2, division can be replaced with right shift:
4:4:4: chromaHorzRatio=1, chromaVertRatio=1 → shift 0, shift 0
4:2:2: chromaHorzRatio=2, chromaVertRatio=1 → shift 1, shift 0
4:2:0: chromaHorzRatio=2, chromaVertRatio=2 → shift 1, shift 1
However, it is possible a compiler optimizes this

t-boiko · 2025-11-27T13:38:47Z

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

-            shaderStr << "true";
+            shaderStr <<
+                "    ivec2 inputChromaMax = ivec2((pushConstants.inputWidth + " << (chromaHorzRatio-1) << ") / " << chromaHorzRatio << " - 1, "
+                "(pushConstants.inputHeight + " << (chromaVertRatio-1) << ") / " << chromaVertRatio << " - 1);\n";


the same as above

t-boiko · 2025-11-27T19:46:25Z

common/libs/VkCodecUtils/VkImageResource.cpp

+                                                    VK_IMAGE_ASPECT_PLANE_1_BIT |
+                                                    VK_IMAGE_ASPECT_PLANE_2_BIT);
+
+        while (yCbCrAspectMask != 0) {


Shouldn't we just use if(yCbCrAspectMask != 0){} instead of while(yCbCrAspectMask != 0){...break;} here?

Sorry, I don't understand what you are proposing here.

I meant to change to if (yCbCrAspectMask ! = 0) at line 243 and remove correspondingbreak at line 280

I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?

t-boiko · 2025-11-27T19:52:24Z

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

+                                     bool enableShift,
+                                     bool hasChroma = true)
+{
+    if (!enableShift) {


Optional: Can we check this before calling the function? and not to use enableShift as an extra parameter at all?
But from the other side it accumulates all the ifs here, in general it's OK

All these functions are in general inlined by the compiler, so it doesn't really matter.

t-boiko · 2025-11-27T21:55:45Z

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

+                }
+                float scale = 1.0f / (chromaHorzRatio * chromaVertRatio);
+                shaderStr << "    cb *= " << scale << ";\n";
+                shaderStr << "    cr *= " << scale << ";\n";


so, we use box filter for chroma downsampling. we could reduce aliasing with lowpass filter but it's OK to use box in our case

We don't care about the filtering for AQ analyses. We want this as cheap as possible.

zlatinski · 2025-12-04T00:46:57Z

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp

+                }
+                float scale = 1.0f / (chromaHorzRatio * chromaVertRatio);
+                shaderStr << "    cb *= " << scale << ";\n";
+                shaderStr << "    cr *= " << scale << ";\n";


The ratios are never 0:

// Get input/output chroma subsampling ratios
const uint32_t inputChromaHorzRatio = (inputMpInfo != nullptr) ? (1 << inputMpInfo->planesLayout.secondaryPlaneSubsampledX) : 1;
const uint32_t inputChromaVertRatio = (inputMpInfo != nullptr) ? (1 << inputMpInfo->planesLayout.secondaryPlaneSubsampledY) : 1;
const uint32_t outputChromaHorzRatio = (outputMpInfo != nullptr) ? (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledX) : 1;
const uint32_t outputChromaVertRatio = (outputMpInfo != nullptr) ? (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledY) : 1;

They can be "1" - or "1 << value".

srinathkr-nv

Please add some details in the commit message for the commit "filter: fix the normalization handling".

In the commit "common: Add row/column replication flags to bitmask", please add a line specifying that the FILTER_ prefix for the filter flags introduced in the previous commit is being removed, and the reason for removing the prefix (I'm assuming it is for conciseness).

srinathkr-nv · 2025-12-23T07:34:13Z

common/libs/VkCodecUtils/VkImageResource.cpp

+                                                    VK_IMAGE_ASPECT_PLANE_1_BIT |
+                                                    VK_IMAGE_ASPECT_PLANE_2_BIT);
+
+        while (yCbCrAspectMask != 0) {


I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?

srinathkr-nv · 2025-12-23T07:35:17Z

common/libs/VkCodecUtils/VulkanComputePipeline.h

        return m_pipeline;
    }

+    // Setters for manual pipeline creation (bypassing shaderc for pre-compiled SPIR-V)


None of these additions are used anywhere in this series of changes. Can they be removed?

None of these additions are used anywhere in this series of changes. Can they be removed?

OK, yes. I have a separate commit for that, now.

These lines are still present in the latest commit series and continue to be unused.

srinathkr-nv · 2025-12-23T07:36:31Z

common/libs/VkCodecUtils/VulkanCommandBufferPool.h

+     *       // Submit command buffer using cmd.GetNode()->...
+     *   }  // Auto cleanup on scope exit
+     */
+    class PoolNodeHandler


This class isn't being used anywhere, either in this series or in #171 . Can you describe the need for it?

I was thinking of using it as part of a slightly different design, but then decided against it. So, I've removed this commit now.

srinathkr-nv · 2025-12-23T08:19:44Z

vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.cpp

+            filterFlags |= VulkanFilterYuvCompute::FLAG_OUTPUT_LSB_TO_MSB_SHIFT;
+        }
+#ifdef NV_AQ_GPU_LIB_SUPPORTED
+        if (m_aqAnalyzes) {


Lines 1335 - 1340 should be moved to #171 , as that is where the NV_AQ_GPU_LIB_SUPPORTED define is being introduced.

This was done intentionally, so if the AQ integration is not used, not to have the overhead of using the sub-sampling. This define doesn't do any harm here, because it is not defined yet.
I didn't want to build infrastructure just for 2x2 sub-sampling to be overwritten later on with AQ. I would rather not change this.

srinathkr-nv

The commit "encoder: quard m_inputSubsampledImage with AQ supported" has a typo in the commit message (quard -> guard), but this is also a change which seems more suited for inclusion in #171 rather than here because the NV_AQ_GPU_LIB_SUPPORTED define is being introduced there.

Add support to the image views of of Y and Cb/Cr only planes. Add an aspect mask to the image pool. Signed-off-by: Tony Zlatinski <[email protected]>

This commit adds 2x2 Y-channel subsampling capability to VulkanFilterYuvCompute for Adaptive Quantization (AQ) preprocessing, with optimal GPU performance through chroma-resolution block-based dispatch. FEATURES ADDED: 1. Y-Subsampling Output (Binding 9) - Added descriptor binding 9 for subsampled Y output image (R8/R16 format) - Generates half-resolution luma map via 2x2 box filter averaging - Output available in VkVideoEncodeFrameInfo::subsampledImageResource - Proper ref-counted lifetime management within encode frame info 2. Optimal Chroma-Resolution Dispatch - Dispatch at chroma resolution instead of luma resolution - One thread processes: NxM Y pixels + 1 CbCr pixel + 1 subsampled Y - 75% thread reduction for 4:2:0 formats (e.g., 1920x1080: 2.07M → 518K threads) - Zero branch divergence (all threads execute same code path) - Localized memory access pattern (2x2 block reads/writes) - Applied to all 4 RecordCommandBuffer overloads (Image/Buffer combinations) 3. Clean Gen Function Architecture - GenBlockCoordinates() - Block coordinate calculation with replication support - GenReadYCbCrBlock() - Read NxM Y pixels + 1 chroma sample - GenConvertYCbCrBlock() - Format conversion for entire block - GenApplyBlockOutputShift() - LSB-to-MSB shift for 10/12-bit output - GenWriteYCbCrBlock() - Write NxM Y pixels + 1 chroma sample - GenComputeAndWriteSubsampledY() - Box filter and subsampled Y output 4. Unified RecordCommandBuffer API - Added optional subsampledImageView parameter to all 4 overloads - Backward compatible (nullptr default) - Conditional binding 9 descriptor (only when subsampledImageView provided) - Consistent parameter order across all overloads (bufferIdx first) TECHNICAL DETAILS: Shader Generation Pipeline (InitYCBCRCOPY): - Stage 1-2: Header, push constants, descriptor bindings - Stage 3-8: Helper functions (normalization, conversion, if needed) - Stage 9: Main function declaration - Stage 10: GenBlockCoordinates() - chromaPos, lumaPos, bounds, replication - Stage 11: GenReadYCbCrBlock() - Read 4 Y + 1 CbCr with edge clamping - Stage 12: GenConvertYCbCrBlock() - Apply format conversion if needed - Stage 13: GenApplyBlockOutputShift() - LSB→MSB shift for 10/12-bit - Stage 14: GenWriteYCbCrBlock() - Write 4 Y + 1 CbCr to output - Stage 15: GenComputeAndWriteSubsampledY() - Box filter to binding 9 Dispatch Strategy: - OLD: Dispatch at luma resolution, conditional chroma processing * 1920x1080: 2,073,600 threads, 50% branch divergence - NEW: Dispatch at chroma resolution, block-based processing * 1920x1080 (4:2:0): 518,400 threads (75% reduction) * Each thread: 4 Y reads, 1 CbCr read, conversions, 4 Y writes, 1 CbCr write, 1 subsampled Y write * Zero branch divergence, better cache locality Resource Management: - Added subsampledImageResource to VkVideoEncodeFrameInfo struct - Ref-counted lifetime management (released in Reset()) - Image pool allocation in VkVideoEncoder::InitEncoder() - Half-resolution format matches output bit depth (R8 or R16) - Usage flags: STORAGE | SAMPLED | TRANSFER_SRC | TRANSFER_DST Format Support: - Chroma subsampling: 4:2:0, 4:2:2, 4:4:4 - Bit depths: 8, 10, 12, 16-bit - Plane layouts: 2-plane (NV12) and 3-plane (I420) - Input/Output: Image and buffer combinations PERFORMANCE: Thread Count Reduction (4:2:0): - 352x288: 101,376 → 25,344 (75% reduction) - 1280x720: 921,600 → 230,400 (75% reduction) - 1920x1080: 2,073,600 → 518,400 (75% reduction) - 3840x2160: 8,294,400 → 2,073,600 (75% reduction) Benefits: - Lower thread launch overhead - Better GPU occupancy (higher work per thread) - Zero warp divergence - Improved cache utilization (localized 2x2 reads) - Estimated 2-4x speedup on YCbCr conversion pass TESTING: Verified on NVIDIA GeForce RTX 4080: - H.264 8-bit @ 352x288: 10 frames ✓ - H.264 8-bit @ 1920x1080: 100 frames ✓ (stress test, no pool exhaustion) - HEVC 10-bit @ 176x144: 5 frames ✓ (LSB-to-MSB shift verified) - AV1 8-bit @ 352x288: 10 frames ✓ Build status: Zero errors, zero warnings (clean build) API CHANGES: VulkanFilterYuvCompute.h: - Added optional subsampledImageView parameter to all RecordCommandBuffer overloads - Added subsampledImageResourceInfo parameter for future extensibility VkVideoEncoder.h: - Added subsampledImageResource member to VkVideoEncodeFrameInfo - Released in Reset() for proper pool recycling COMPATIBILITY: - Backward compatible (subsampledImageView defaults to nullptr) - Existing code works without changes - Binding 9 only used when subsampledImageView is provided - No API breaks Signed-off-by: Tony Zlatinski <[email protected]>

Fixed GenHandleSourcePositionWithReplicate to clamp to valid range. Problem: - GenHandleSourcePositionWithReplicate clamped to width/height: ivec2 srcPos = min(pos, ivec2(pushConstants.inputWidth, pushConstants.inputHeight)); - For 1920x1080, allows coordinates up to (1920, 1080) - Valid range is [0..1919] x [0..1079] - Off-by-one error allowed sampling outside image bounds Root Cause: min(pos, inputWidth) can return inputWidth (e.g., 1920) But valid coords are [0..inputWidth-1] (e.g., [0..1919]) Solution: - Changed to: min(pos, ivec2(inputWidth - 1, inputHeight - 1)) - Now clamps to [0..1919] x [0..1079] for 1920x1080 - Prevents out-of-bounds image access Verified other clamping code: ✅ inputLumaMax uses (inputWidth - 1, inputHeight - 1) - correct ✅ inputChromaMax uses (halfInputWidth - 1, halfInputHeight - 1) - correct ✅ All min(lumaPos + offset, inputLumaMax) operations - correct Impact: - Fixes potential GPU memory access violations - Fixes edge pixel replication (was potentially reading invalid data) - Critical fix for stability Note: This function is used in YCBCR2RGBA filter, not the YCBCRCOPY path Signed-off-by: Tony Zlatinski <[email protected]>

…t format Critical fix for AQ Y-subsampling when output is 4:4:4 (or any non-4:2:0). Problem: - Previously used output format's chroma subsampling to determine Y block size - When output is 4:4:4 (no chroma subsampling), generated 1x1 blocks - This resulted in NO actual Y downsampling (subsampledY = yOut00 * 1) - AQ algorithms received full-resolution Y data instead of 2x2 downsampled Root Cause: const uint32_t chromaHorzRatio = (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledX); For 4:4:4: secondaryPlaneSubsampledX = 0, so chromaHorzRatio = 1 Fix: - Always use fixed 2x2 Y subsampling for binding 9 (AQ input) - Independent of output format's chroma subsampling - const uint32_t ySubsampleHorzRatio = 2; - const uint32_t ySubsampleVertRatio = 2; Result: ✅ Reads 2x2 Y block (y00, y10, y01, y11) ✅ Writes all 4 Y pixels to output (works for 4:4:4) ✅ Computes proper box filter: (y00+y10+y01+y11)*0.25 ✅ Writes half-resolution Y to binding 9 ✅ Works correctly for 4:2:0, 4:2:2, AND 4:4:4 outputs common: Add 4:4:4 output support to Y-subsampling filter Fixed GenWriteYCbCrBlock to handle 4:4:4 output formats correctly. Problem: - Always dispatched with 2x2 blocks for optimal Y subsampling - For 4:4:4 output, was only writing 1 chroma sample instead of 2x2 - This caused green/corrupted chroma in 4:4:4 encoded output Solution: - Added outputChromaHorzSubsampling/VertSubsampling parameters to GenWriteYCbCrBlock - Detects 4:4:4 output (subsampling == 1) - For 4:4:4: writes 2x2 chroma pixels (one per Y pixel) - For 4:2:0/4:2:2: writes 1 chroma sample (existing behavior) Result: ✅ 4:4:4 input → 4:4:4 output works correctly ✅ 4:2:0 input → 4:2:0 output works correctly (existing) ✅ Always 2x2 dispatch for optimal efficiency ✅ Always 2x2 Y subsampling for AQ (independent of format) common: Fix dispatch to always use half Y resolution (2x2 blocks) Critical fix: dispatch was using output chroma resolution instead of fixed 2x2. Problem: - RecordCommandBuffer dispatch used output format chroma subsampling - For 4:4:4 output: chromaSubsampleX/Y = 1, dispatched at FULL resolution - For 1920x1080 4:4:4: dispatched 1920x1080 threads instead of 960x540 - Caused incorrect processing and wasted 75% of GPU threads Root Cause: const uint32_t chromaWidth = (width + chromaSubsampleX - 1) / chromaSubsampleX; For 4:4:4: chromaSubsampleX=1, so chromaWidth = width (WRONG!) Solution: - Always dispatch at half Y resolution: (width+1)/2 x (height+1)/2 - One thread per 2x2 Y block, independent of output format - Matches shader expectations (GenBlockCoordinates uses 2x2) Fixed in 3 RecordCommandBuffer overloads: 1. Image->Image (line 2698) 2. Buffer->Image (line 2875) 3. Image->Buffer (line 3053) Result: ✅ 4:4:4: 960x540 dispatch (correct!) ✅ 4:2:0: 960x540 dispatch (same efficiency) ✅ Consistent with shader 2x2 block processing ✅ 75% thread efficiency for all formats common: Fix Buffer->Buffer dispatch to use half Y resolution Completed the dispatch fix for all 4 RecordCommandBuffer overloads. Problem: - Buffer->Buffer case was missed in previous commit - Was still dispatching at full resolution instead of half - Inconsistent with other 3 overloads and shader expectations Solution: - Changed from full resolution to (width+1)/2 x (height+1)/2 - Now all 4 cases dispatch at half Y resolution All RecordCommandBuffer overloads now fixed: 1. Image->Image ✅ (previous commit) 2. Buffer->Image ✅ (previous commit) 3. Image->Buffer ✅ (previous commit) 4. Buffer->Buffer ✅ (this commit) Result: Complete consistency across all path variants common: Add 4:4:4 input support - read and average 4 chroma samples Critical fix for 4:4:4 input formats to preserve chroma detail. Problem: - GenReadYCbCrBlock always read only 1 chroma sample - For 4:4:4 input → 4:4:4 output, this downsampled chroma to 4:2:0 then upsampled - Lost all chroma detail, caused incorrect colors Root Cause: // Old code - always reads 1 chroma float cb = imageLoad(inputImageCb, ivec3(srcChromaPos, ...)).r; For 4:4:4 input, each Y pixel has its own chroma, but we only read 1 sample and replicated it to all 4 output pixels. Solution: - Pass inputChromaHorzSubsampling/VertSubsampling to GenReadYCbCrBlock - For 4:4:4 input (subsampling==1): read 4 chroma samples and average - For 4:2:0 input (subsampling==2): read 1 chroma sample (existing) Generated code for 4:4:4 input: // Read 4 chroma samples and average float cb00 = imageLoad(inputImageCb, lumaPos + ivec2(0,0)).r; float cr00 = imageLoad(inputImageCr, lumaPos + ivec2(0,0)).r; float cb10 = imageLoad(inputImageCb, lumaPos + ivec2(1,0)).r; // ... (4 samples total) cb = (cb00 + cb10 + cb01 + cb11) * 0.25; Result: ✅ 4:4:4 input → 4:4:4 output: preserves chroma detail ✅ 4:2:0 input → 4:2:0 output: unchanged (1 sample) ✅ 4:4:4 input → 4:2:0 output: proper downsampling ✅ All input/output format combinations work correctly common: Make Y-subsampling optional via enableYSubsampling parameter Added enableYSubsampling parameter to control binding 9 generation. Problem: - Y subsampling (binding 9) was always generated even if not needed - m_enableYSubsampling flag existed but wasn't configurable - Wasted resources when AQ is not being used Solution: - Added enableYSubsampling parameter (default=false) to: • VulkanFilterYuvCompute::Create() • VulkanFilterYuvCompute constructor - Conditional generation of: • Binding 9 descriptor layout (only if enabled) • GenComputeAndWriteSubsampledY() call (only if enabled) - Added runtime check in RecordCommandBuffer(): • Asserts if subsampledImageView provided but flag is false • Returns VK_ERROR_INITIALIZATION_FAILED to catch API misuse Usage: // Without Y subsampling (default, backward compatible) VulkanFilterYuvCompute::Create(...); // enableYSubsampling defaults to false // With Y subsampling for AQ VulkanFilterYuvCompute::Create(..., true); // enables binding 9 Benefits: ✅ Backward compatible (defaults to false) ✅ No wasted shader code when AQ not used ✅ No wasted binding 9 descriptor when not needed ✅ Runtime validation for API misuse ✅ Clean separation of concerns Signed-off-by: Tony Zlatinski <[email protected]>

…ple bools Replaced 3 separate boolean parameters with a single uint32_t bitmask. Problem: - VulkanFilterYuvCompute::Create() had 3 separate bool parameters: • inputEnableMsbToLsbShift • outputEnableLsbToMsbShift • enableYSubsampling - Hard to extend with new options - Verbose API with growing parameter list Solution: - Created FilterFlags enum with bitmask values: • FILTER_FLAG_NONE = 0 • FILTER_FLAG_INPUT_MSB_TO_LSB_SHIFT = (1 << 0) • FILTER_FLAG_OUTPUT_LSB_TO_MSB_SHIFT = (1 << 1) • FILTER_FLAG_ENABLE_Y_SUBSAMPLING = (1 << 2) - Replaced 3 bool parameters with single uint32_t filterFlags - Updated Create() and constructor signatures - Updated call sites in encoder and decoder API Change: // Old (3 separate bools): Create(..., false, true, true); // New (single bitmask): uint32_t flags = FILTER_FLAG_OUTPUT_LSB_TO_MSB_SHIFT | FILTER_FLAG_ENABLE_Y_SUBSAMPLING; Create(..., flags); Benefits: ✅ Cleaner API with fewer parameters ✅ Easy to extend with new flags ✅ Self-documenting with named constants ✅ Type-safe enum Add row/column replication flags to bitmask Added two new flags for edge replication control. New Flags: - FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE: Replicate one row/column at edges - FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL: Replicate all out-of-bounds pixels Implementation: - m_enableRowAndColumnReplication is set to true if either flag is set - Previously was hardcoded to true in constructor - Now configurable via filterFlags parameter Constructor initialization: m_enableRowAndColumnReplication = (filterFlags & (FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE | FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL)) != 0 Encoder updated: - Added FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL to default flags - Maintains previous behavior (replication was always on) Complete FilterFlags enum: FLAG_NONE = 0 FLAG_INPUT_MSB_TO_LSB_SHIFT = (1 << 0) FLAG_OUTPUT_LSB_TO_MSB_SHIFT = (1 << 1) FLAG_ENABLE_Y_SUBSAMPLING = (1 << 2) FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE = (1 << 3) FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL = (1 << 4) Result: All filter options now controlled via clean bitmask API Signed-off-by: Tony Zlatinski <[email protected]>

…ants Eliminated redundant division operations in every shader thread. Problem: - GenBlockCoordinates generated division in shader main(): uint chromaWidth = (pushConstants.outputWidth + 1) / 2; uint chromaHeight = (pushConstants.outputHeight + 1) / 2; - Computed 518,400 times per frame for 1920x1080 (every thread!) - inputChromaMax also computed divisions per thread - Wasted ALU cycles on trivial math Solution: - Added halfInputWidth/Height and halfOutputWidth/Height to PushConstants - CPU computes once: (width + 1) / 2 - Shader uses precomputed: pushConstants.halfOutputWidth - Updated all 4 RecordCommandBuffer overloads to populate fields Generated shader changes: Before: uint chromaWidth = (pushConstants.outputWidth + 1) / 2; ivec2 inputChromaMax = ivec2((pushConstants.inputWidth + 1) / 2 - 1, ...); After: uint chromaWidth = pushConstants.halfOutputWidth; // Direct read! ivec2 inputChromaMax = ivec2(pushConstants.halfInputWidth - 1, ...); PushConstants struct updated: + uint halfInputWidth, halfInputHeight (precomputed) + uint halfOutputWidth, halfOutputHeight (precomputed) Push constant size fix: - Changed from hardcoded "6 * sizeof(uint32_t)" - Now uses sizeof(PushConstants) for correctness Performance gain: ✅ Eliminates 1M+ division ops per frame (1920x1080) ✅ 2 integer divisions → 2 direct reads per thread ✅ Cleaner shader code ✅ All tests still passing Signed-off-by: Tony Zlatinski <[email protected]>

Signed-off-by: Tony Zlatinski <[email protected]>

Fix corruption with singleton pattern and explicit Vulkan target Problem: Multiple VulkanShaderCompiler instances were each creating their own shaderc_compiler_t, causing state corruption and runtime shader compilation failures with errors like: - 'double' : not supported with this profile: none - compute shaders require es profile with version 310 or above Root Causes: 1. Each VulkanShaderCompiler created a separate shaderc_compiler_t instance 2. shaderc_compile_into_spv() was called with nullptr options (no target env) 3. Multiple compilers interfering with each other's internal state Solution: 1. Singleton Pattern: - Single shared shaderc_compiler_t for entire process (g_sharedCompiler) - Reference counting (g_compilerRefCount) to manage lifecycle - Thread-safe with std::mutex lock during compilation 2. Explicit Compile Options: - Set target_env to vulkan 1.2 (not default/none) - Set target SPIR-V version to 1.5 - Properly initialize and release compile options 3. VulkanComputePipeline Setters: - Added SetPipeline/SetShaderModule/SetPipelineCache/SetDeviceContext - Allows external code to populate pipeline without calling CreatePipeline() - Enables pre-compiled SPIR-V to bypass shaderc entirely Impact: - Fixes runtime shader compilation errors for VulkanFilterYuvCompute - Reduces overhead (one compiler instead of N) - Thread-safe compilation - Proper resource cleanup on shutdown Signed-off-by: Tony Zlatinski <[email protected]>

Signed-off-by: Tony Zlatinski <[email protected]>

- Fix GLSL type error by casting yuv to vec3 in normalization shader - Change denormalization functions to return uint/uvec types with proper clamping and rounding for correct integer output - Add hasChroma parameter to GenConvertYCbCrBlock, GenApplyBlockOutputShift, and GenConvertYCbCrFormat to support luma-only (Y-plane) formats - Fix typo: subsampledYImage -> subsampledImageY - Add GetFormatBitDepth() helper for correct bit depth detection when MSB/LSB shift is enabled on 16-bit formats - Conditionally generate chroma packing functions only when output has chroma Signed-off-by: Tony Zlatinski <[email protected]>

- Wrap m_inputSubsampledImagePool and subsampledImageResource with #ifdef NV_AQ_GPU_LIB_SUPPORTED to allow building without AQ library - Only enable FLAG_ENABLE_Y_SUBSAMPLING when m_aqAnalyzes is active - Align subsampled image extents to 32 pixels for AV1 worst case - Increase pool size by 4 for better resource availability - Guard GetSubsampledYImagePool() accessor with AQ compile flag Signed-off-by: Tony Zlatinski <[email protected]>

- Add R10X6 and R12X4 single-channel packed formats to image view creation - Extend vkFormatInfo table with 10/12-bit packed formats: R10X6/R12X4 (1-ch), R10X6G10X6/R12X4G12X4 (2-ch), and 4-channel variants - Replace local getFormatTexelSize() with vkFormatLookUp() for texel size - Add VulkanAqProcessor.h include guarded with NV_AQ_GPU_LIB_SUPPORTED Signed-off-by: Tony Zlatinski <[email protected]>

zlatinski · 2025-12-24T22:34:44Z

Please add some details in the commit message for the commit "filter: fix the normalization handling".

Yes, I've missed to add description and sing those commit. Fixed now.

In the commit "common: Add row/column replication flags to bitmask", please add a line specifying that the FILTER_ prefix for the filter flags introduced in the previous commit is being removed, and the reason for removing the prefix (I'm assuming it is for conciseness).

OK, sure. I actually merged those two commits.

I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?

Yes, because of the breaks. This is the way I'm avoiding goto(s) that I don't tolerate.

zlatinski · 2025-12-24T22:57:43Z

The commit "encoder: quard m_inputSubsampledImage with AQ supported" has a typo in the commit message (quard -> guard), but this is also a change which seems more suited for inclusion in #171 rather than here because the NV_AQ_GPU_LIB_SUPPORTED define is being introduced there.

Fixed. Thank you for the review!

zlatinski requested review from Copilot and mklefrancois October 18, 2025 04:01

Copilot AI reviewed Oct 18, 2025

View reviewed changes

zlatinski force-pushed the compute-filter-downsample+fixes branch from 1ab36f1 to 90e21fd Compare October 18, 2025 07:18

zlatinski requested a review from Copilot October 18, 2025 07:21

zlatinski force-pushed the compute-filter-downsample+fixes branch from 90e21fd to 9f1ef6c Compare October 18, 2025 07:25

Copilot AI reviewed Oct 18, 2025

View reviewed changes

t-boiko reviewed Nov 27, 2025

View reviewed changes

zlatinski commented Dec 4, 2025

View reviewed changes

zlatinski force-pushed the compute-filter-downsample+fixes branch from 2ce3a24 to bb1776c Compare December 4, 2025 16:20

srinathkr-nv suggested changes Dec 23, 2025

View reviewed changes

zlatinski added 13 commits December 24, 2025 16:02

common: YuvCompute - Add Y and CbCr-only plane

b75646a

Add support to the image views of of Y and Cb/Cr only planes. Add an aspect mask to the image pool. Signed-off-by: Tony Zlatinski <[email protected]>

common: VulkanDeviceContext set queue family for XFER

45d0936

Signed-off-by: Tony Zlatinski <[email protected]>

common: VulkanFilter - fixed the header file

160a15f

Signed-off-by: Tony Zlatinski <[email protected]>

common: VkBufferResource documentation

ae348ad

Signed-off-by: Tony Zlatinski <[email protected]>

zlatinski force-pushed the compute-filter-downsample+fixes branch 2 times, most recently from d5c7cd4 to 8e2d3ea Compare December 25, 2025 00:05

zlatinski merged commit 5bbfaab into main Jan 1, 2026
5 checks passed

	VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image
	VK_IMAGE_ASPECT_COLOR_BIT, // Single-plane Y image (non-multiplanar)

Compute filter downsample and other fixes #166

Compute filter downsample and other fixes #166

Uh oh!

Conversation

zlatinski commented Oct 18, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

t-boiko Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

t-boiko Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

t-boiko Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

t-boiko Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srinathkr-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

t-boiko Nov 27, 2025 •

edited

Loading

t-boiko Nov 27, 2025 •

edited

Loading

t-boiko Dec 4, 2025 •

edited

Loading

t-boiko Nov 27, 2025 •

edited

Loading

zlatinski commented Dec 24, 2025 •

edited

Loading