Skip to content

Conversation

@zlatinski
Copy link
Contributor

1ab36f1 common: add helper PoolNodeHandler class to CommandBufferPool
fa9c970 common: VkBufferResource documentation
d37c7d0 VulkanShaderCompiler: Fix shaderc singleton
4c7ca1d common: VulkanFilter - fixed the header file
327b7a2 common: VulkanDeviceContext set queue family for XFER
a004479 common: Optimize shader by precomputing half-resolution in push constants
f2fc2d4 common: Add row/column replication flags to bitmask
7efca55 common: Refactor filter options to use bitmask flags instead of multiple bools
ed989f8 common: Fix Y-subsampling to use fixed 2x2 blocks regardless of output format
821db54 common: add subsampled Y output image to the filter
7394fb5 common: YuvCompute - Add Y-only and CbCr-only plane support

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds Y-subsampled (2x2) luma output to the compute filter pipeline, introduces a subsampled Y image pool to the encoder, refactors filter configuration to bitmask flags, and fixes/optimizes shader compilation and dispatch.

  • Adds optional subsampled Y output (binding 9) and half-res dispatch in VulkanFilterYuvCompute, with new push constants and flags
  • Allocates and uses a dedicated subsampled Y image pool in the encoder; updates call sites to new filter API
  • Introduces a thread-safe shared shaderc compiler singleton; various utility and testing additions

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.h Adds subsampled image resource member and getter; integrates with pools
vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.cpp Allocates/configures subsampled Y pool; updates filter flags and RecordCommandBuffer call
vk_video_decoder/libs/VkVideoDecoder/VkVideoDecoder.cpp Adapts to new RecordCommandBuffer signature
scripts/test_y_subsample.py New end-to-end test harness for subsampling; remote/local execution
common/libs/VkCodecUtils/vulkan_internal_dump.cpp New helper for dumping YCbCr/subsampled surfaces to files
common/libs/VkCodecUtils/VulkanShaderCompiler.h/.cpp Adds shared shaderc compiler with refcount and options; thread-safety
common/libs/VkCodecUtils/VulkanFilterYuvCompute.h/.cpp Flag refactor, binding 9 support, 2x2 block processing, push constants, descriptor updates
common/libs/VkCodecUtils/VulkanFilter.h Changes base virtual RecordCommandBuffer signature; adds optional subsampled output
common/libs/VkCodecUtils/VulkanDeviceContext.cpp Improves transfer queue family selection
common/libs/VkCodecUtils/VulkanComputePipeline.h Adds setters for manual pipeline creation
common/libs/VkCodecUtils/VulkanCommandBufferPool.h Adds PoolNodeHandler RAII helper for command buffers
common/libs/VkCodecUtils/VkImageResource.h/.cpp Adjusts image view/plane handling (constructor/getters/destructor)
common/libs/VkCodecUtils/VkBufferResource.h/.cpp API docs and naming cleanup; minor refactor
Comments suppressed due to low confidence (1)

common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp:1

  • This overload accepts subsampledYImageView but never binds it to binding 9, so the shader's subsampledYImage write will target an uninitialized descriptor. Mirror the binding 9 write logic from the image->image path when subsampledYImageView is provided.
/*

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@zlatinski zlatinski force-pushed the compute-filter-downsample+fixes branch from 1ab36f1 to 90e21fd Compare October 18, 2025 07:18
@zlatinski zlatinski requested a review from Copilot October 18, 2025 07:21
@zlatinski zlatinski force-pushed the compute-filter-downsample+fixes branch from 90e21fd to 9f1ef6c Compare October 18, 2025 07:25
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

m_vkDevCtx->GetComputeQueueFamilyIdx(),
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, // Device-local for max GPU performance
nullptr, // pVideoProfile
VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image
Copy link

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subsampled Y images are created with formats VK_FORMAT_R8_UNORM or VK_FORMAT_R16_UNORM (non-multiplanar), but Configure() is passed VK_IMAGE_ASPECT_PLANE_0_BIT. For non-multiplanar formats the image view aspect must be VK_IMAGE_ASPECT_COLOR_BIT; using PLANE_0 will cause invalid image view creation. Replace VK_IMAGE_ASPECT_PLANE_0_BIT with VK_IMAGE_ASPECT_COLOR_BIT.

Suggested change
VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image
VK_IMAGE_ASPECT_COLOR_BIT, // Single-plane Y image (non-multiplanar)

Copilot uses AI. Check for mistakes.
false, // isInput (it's output)
9, // binding
0, // set
true); // imageArray
Copy link

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binding 9 is declared as image2DArray, but the subsampled image pool is configured without arrays (useImageArray/useImageViewArray are false). This descriptor/view type mismatch will fail validation. Either change this to a non-array binding (imageArray=false) or configure the subsampled image pool to create array images/views.

Suggested change
true); // imageArray
false); // imageArray

Copilot uses AI. Check for mistakes.
Comment on lines 1349 to 1350
false, // useImageArray
false, // useImageViewArray
Copy link

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subsampled Y shader binding (binding 9) is generated as image2DArray, but the pool here disables image arrays and array views. To keep descriptor/view types compatible, either set useImageArray/useImageViewArray to true for this pool or adjust the shader to use image2D (non-array) for the subsampled Y output.

Suggested change
false, // useImageArray
false, // useImageViewArray
true, // useImageArray
true, // useImageViewArray

Copilot uses AI. Check for mistakes.
std::cerr << "VulkanShaderCompiler: Failed to initialize shared shaderc compiler!" << std::endl;
return nullptr;
}
std::cout << "VulkanShaderCompiler: Initialized shared shaderc compiler" << std::endl;
Copy link

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Unconditional std::cout logging in a library can be noisy in production contexts. Consider gating these logs behind a debug flag/environment variable or using the project's logging facility with appropriate levels.

Copilot uses AI. Check for mistakes.
// Last instance - release the shared compiler
shaderc_compiler_release(g_sharedCompiler);
g_sharedCompiler = nullptr;
std::cout << "VulkanShaderCompiler: Released shared shaderc compiler" << std::endl;
Copy link

Copilot AI Oct 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Unconditional std::cout logging in a library can be noisy in production contexts. Consider gating these logs behind a debug flag/environment variable or using the project's logging facility with appropriate levels.

Copilot uses AI. Check for mistakes.
shaderStr <<
" // Calculate chroma output dimensions\n"
" uint chromaWidth = (pushConstants.outputWidth + " << (chromaHorzRatio - 1) << ") / " << chromaHorzRatio << ";\n"
" uint chromaHeight = (pushConstants.outputHeight + " << (chromaVertRatio - 1) << ") / " << chromaVertRatio << ";\n";
Copy link
Contributor

@t-boiko t-boiko Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are always powers of 2, division can be replaced with right shift:
4:4:4: chromaHorzRatio=1, chromaVertRatio=1 → shift 0, shift 0
4:2:2: chromaHorzRatio=2, chromaVertRatio=1 → shift 1, shift 0
4:2:0: chromaHorzRatio=2, chromaVertRatio=2 → shift 1, shift 1
However, it is possible a compiler optimizes this

shaderStr << "true";
shaderStr <<
" ivec2 inputChromaMax = ivec2((pushConstants.inputWidth + " << (chromaHorzRatio-1) << ") / " << chromaHorzRatio << " - 1, "
"(pushConstants.inputHeight + " << (chromaVertRatio-1) << ") / " << chromaVertRatio << " - 1);\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same as above

VK_IMAGE_ASPECT_PLANE_1_BIT |
VK_IMAGE_ASPECT_PLANE_2_BIT);

while (yCbCrAspectMask != 0) {
Copy link
Contributor

@t-boiko t-boiko Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just use if(yCbCrAspectMask != 0){} instead of while(yCbCrAspectMask != 0){...break;} here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand what you are proposing here.

Copy link
Contributor

@t-boiko t-boiko Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to change to if (yCbCrAspectMask ! = 0) at line 243 and remove correspondingbreak at line 280

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?

bool enableShift,
bool hasChroma = true)
{
if (!enableShift) {
Copy link
Contributor

@t-boiko t-boiko Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Can we check this before calling the function? and not to use enableShift as an extra parameter at all?
But from the other side it accumulates all the ifs here, in general it's OK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these functions are in general inlined by the compiler, so it doesn't really matter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, agree

}
float scale = 1.0f / (chromaHorzRatio * chromaVertRatio);
shaderStr << " cb *= " << scale << ";\n";
shaderStr << " cr *= " << scale << ";\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, we use box filter for chroma downsampling. we could reduce aliasing with lowpass filter but it's OK to use box in our case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't care about the filtering for AQ analyses. We want this as cheap as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, agree

}
float scale = 1.0f / (chromaHorzRatio * chromaVertRatio);
shaderStr << " cb *= " << scale << ";\n";
shaderStr << " cr *= " << scale << ";\n";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ratios are never 0:

// Get input/output chroma subsampling ratios
const uint32_t inputChromaHorzRatio = (inputMpInfo != nullptr) ? (1 << inputMpInfo->planesLayout.secondaryPlaneSubsampledX) : 1;
const uint32_t inputChromaVertRatio = (inputMpInfo != nullptr) ? (1 << inputMpInfo->planesLayout.secondaryPlaneSubsampledY) : 1;
const uint32_t outputChromaHorzRatio = (outputMpInfo != nullptr) ? (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledX) : 1;
const uint32_t outputChromaVertRatio = (outputMpInfo != nullptr) ? (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledY) : 1;

They can be "1" - or "1 << value".

@zlatinski zlatinski force-pushed the compute-filter-downsample+fixes branch from 2ce3a24 to bb1776c Compare December 4, 2025 16:20
Copy link
Contributor

@srinathkr-nv srinathkr-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some details in the commit message for the commit "filter: fix the normalization handling".

In the commit "common: Add row/column replication flags to bitmask", please add a line specifying that the FILTER_ prefix for the filter flags introduced in the previous commit is being removed, and the reason for removing the prefix (I'm assuming it is for conciseness).

VK_IMAGE_ASPECT_PLANE_1_BIT |
VK_IMAGE_ASPECT_PLANE_2_BIT);

while (yCbCrAspectMask != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?

return m_pipeline;
}

// Setters for manual pipeline creation (bypassing shaderc for pre-compiled SPIR-V)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these additions are used anywhere in this series of changes. Can they be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these additions are used anywhere in this series of changes. Can they be removed?

OK, yes. I have a separate commit for that, now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are still present in the latest commit series and continue to be unused.

* // Submit command buffer using cmd.GetNode()->...
* } // Auto cleanup on scope exit
*/
class PoolNodeHandler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class isn't being used anywhere, either in this series or in #171 . Can you describe the need for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of using it as part of a slightly different design, but then decided against it. So, I've removed this commit now.

filterFlags |= VulkanFilterYuvCompute::FLAG_OUTPUT_LSB_TO_MSB_SHIFT;
}
#ifdef NV_AQ_GPU_LIB_SUPPORTED
if (m_aqAnalyzes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 1335 - 1340 should be moved to #171 , as that is where the NV_AQ_GPU_LIB_SUPPORTED define is being introduced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done intentionally, so if the AQ integration is not used, not to have the overhead of using the sub-sampling. This define doesn't do any harm here, because it is not defined yet.
I didn't want to build infrastructure just for 2x2 sub-sampling to be overwritten later on with AQ. I would rather not change this.

Copy link
Contributor

@srinathkr-nv srinathkr-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit "encoder: quard m_inputSubsampledImage with AQ supported" has a typo in the commit message (quard -> guard), but this is also a change which seems more suited for inclusion in #171 rather than here because the NV_AQ_GPU_LIB_SUPPORTED define is being introduced there.

Add support to the image views of of Y and Cb/Cr only planes.
Add an aspect mask to the image pool.

Signed-off-by: Tony Zlatinski <[email protected]>
This commit adds 2x2 Y-channel subsampling capability to VulkanFilterYuvCompute
for Adaptive Quantization (AQ) preprocessing, with optimal GPU performance through
chroma-resolution block-based dispatch.

FEATURES ADDED:

1. Y-Subsampling Output (Binding 9)
   - Added descriptor binding 9 for subsampled Y output image (R8/R16 format)
   - Generates half-resolution luma map via 2x2 box filter averaging
   - Output available in VkVideoEncodeFrameInfo::subsampledImageResource
   - Proper ref-counted lifetime management within encode frame info

2. Optimal Chroma-Resolution Dispatch
   - Dispatch at chroma resolution instead of luma resolution
   - One thread processes: NxM Y pixels + 1 CbCr pixel + 1 subsampled Y
   - 75% thread reduction for 4:2:0 formats (e.g., 1920x1080: 2.07M → 518K threads)
   - Zero branch divergence (all threads execute same code path)
   - Localized memory access pattern (2x2 block reads/writes)
   - Applied to all 4 RecordCommandBuffer overloads (Image/Buffer combinations)

3. Clean Gen Function Architecture
   - GenBlockCoordinates() - Block coordinate calculation with replication support
   - GenReadYCbCrBlock() - Read NxM Y pixels + 1 chroma sample
   - GenConvertYCbCrBlock() - Format conversion for entire block
   - GenApplyBlockOutputShift() - LSB-to-MSB shift for 10/12-bit output
   - GenWriteYCbCrBlock() - Write NxM Y pixels + 1 chroma sample
   - GenComputeAndWriteSubsampledY() - Box filter and subsampled Y output

4. Unified RecordCommandBuffer API
   - Added optional subsampledImageView parameter to all 4 overloads
   - Backward compatible (nullptr default)
   - Conditional binding 9 descriptor (only when subsampledImageView provided)
   - Consistent parameter order across all overloads (bufferIdx first)

TECHNICAL DETAILS:

Shader Generation Pipeline (InitYCBCRCOPY):
- Stage 1-2: Header, push constants, descriptor bindings
- Stage 3-8: Helper functions (normalization, conversion, if needed)
- Stage 9: Main function declaration
- Stage 10: GenBlockCoordinates() - chromaPos, lumaPos, bounds, replication
- Stage 11: GenReadYCbCrBlock() - Read 4 Y + 1 CbCr with edge clamping
- Stage 12: GenConvertYCbCrBlock() - Apply format conversion if needed
- Stage 13: GenApplyBlockOutputShift() - LSB→MSB shift for 10/12-bit
- Stage 14: GenWriteYCbCrBlock() - Write 4 Y + 1 CbCr to output
- Stage 15: GenComputeAndWriteSubsampledY() - Box filter to binding 9

Dispatch Strategy:
- OLD: Dispatch at luma resolution, conditional chroma processing
  * 1920x1080: 2,073,600 threads, 50% branch divergence
- NEW: Dispatch at chroma resolution, block-based processing
  * 1920x1080 (4:2:0): 518,400 threads (75% reduction)
  * Each thread: 4 Y reads, 1 CbCr read, conversions, 4 Y writes, 1 CbCr write, 1 subsampled Y write
  * Zero branch divergence, better cache locality

Resource Management:
- Added subsampledImageResource to VkVideoEncodeFrameInfo struct
- Ref-counted lifetime management (released in Reset())
- Image pool allocation in VkVideoEncoder::InitEncoder()
- Half-resolution format matches output bit depth (R8 or R16)
- Usage flags: STORAGE | SAMPLED | TRANSFER_SRC | TRANSFER_DST

Format Support:
- Chroma subsampling: 4:2:0, 4:2:2, 4:4:4
- Bit depths: 8, 10, 12, 16-bit
- Plane layouts: 2-plane (NV12) and 3-plane (I420)
- Input/Output: Image and buffer combinations

PERFORMANCE:

Thread Count Reduction (4:2:0):
- 352x288:    101,376 → 25,344  (75% reduction)
- 1280x720:   921,600 → 230,400 (75% reduction)
- 1920x1080:  2,073,600 → 518,400 (75% reduction)
- 3840x2160:  8,294,400 → 2,073,600 (75% reduction)

Benefits:
- Lower thread launch overhead
- Better GPU occupancy (higher work per thread)
- Zero warp divergence
- Improved cache utilization (localized 2x2 reads)
- Estimated 2-4x speedup on YCbCr conversion pass

TESTING:

Verified on NVIDIA GeForce RTX 4080:
- H.264 8-bit @ 352x288: 10 frames ✓
- H.264 8-bit @ 1920x1080: 100 frames ✓ (stress test, no pool exhaustion)
- HEVC 10-bit @ 176x144: 5 frames ✓ (LSB-to-MSB shift verified)
- AV1 8-bit @ 352x288: 10 frames ✓

Build status: Zero errors, zero warnings (clean build)

API CHANGES:

VulkanFilterYuvCompute.h:
- Added optional subsampledImageView parameter to all RecordCommandBuffer overloads
- Added subsampledImageResourceInfo parameter for future extensibility

VkVideoEncoder.h:
- Added subsampledImageResource member to VkVideoEncodeFrameInfo
- Released in Reset() for proper pool recycling

COMPATIBILITY:

- Backward compatible (subsampledImageView defaults to nullptr)
- Existing code works without changes
- Binding 9 only used when subsampledImageView is provided
- No API breaks

Signed-off-by: Tony Zlatinski <[email protected]>
Fixed GenHandleSourcePositionWithReplicate to clamp to valid range.

Problem:
- GenHandleSourcePositionWithReplicate clamped to width/height:
  ivec2 srcPos = min(pos, ivec2(pushConstants.inputWidth, pushConstants.inputHeight));
- For 1920x1080, allows coordinates up to (1920, 1080)
- Valid range is [0..1919] x [0..1079]
- Off-by-one error allowed sampling outside image bounds

Root Cause:
  min(pos, inputWidth) can return inputWidth (e.g., 1920)
  But valid coords are [0..inputWidth-1] (e.g., [0..1919])

Solution:
- Changed to: min(pos, ivec2(inputWidth - 1, inputHeight - 1))
- Now clamps to [0..1919] x [0..1079] for 1920x1080
- Prevents out-of-bounds image access

Verified other clamping code:
✅ inputLumaMax uses (inputWidth - 1, inputHeight - 1) - correct
✅ inputChromaMax uses (halfInputWidth - 1, halfInputHeight - 1) - correct
✅ All min(lumaPos + offset, inputLumaMax) operations - correct

Impact:
- Fixes potential GPU memory access violations
- Fixes edge pixel replication (was potentially reading invalid data)
- Critical fix for stability

Note: This function is used in YCBCR2RGBA filter, not the YCBCRCOPY path
Signed-off-by: Tony Zlatinski <[email protected]>
…t format

Critical fix for AQ Y-subsampling when output is 4:4:4 (or any non-4:2:0).

Problem:
- Previously used output format's chroma subsampling to determine Y block size
- When output is 4:4:4 (no chroma subsampling), generated 1x1 blocks
- This resulted in NO actual Y downsampling (subsampledY = yOut00 * 1)
- AQ algorithms received full-resolution Y data instead of 2x2 downsampled

Root Cause:
  const uint32_t chromaHorzRatio = (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledX);
  For 4:4:4: secondaryPlaneSubsampledX = 0, so chromaHorzRatio = 1

Fix:
- Always use fixed 2x2 Y subsampling for binding 9 (AQ input)
- Independent of output format's chroma subsampling
- const uint32_t ySubsampleHorzRatio = 2;
- const uint32_t ySubsampleVertRatio = 2;

Result:
✅ Reads 2x2 Y block (y00, y10, y01, y11)
✅ Writes all 4 Y pixels to output (works for 4:4:4)
✅ Computes proper box filter: (y00+y10+y01+y11)*0.25
✅ Writes half-resolution Y to binding 9
✅ Works correctly for 4:2:0, 4:2:2, AND 4:4:4 outputs

common: Add 4:4:4 output support to Y-subsampling filter

Fixed GenWriteYCbCrBlock to handle 4:4:4 output formats correctly.

Problem:
- Always dispatched with 2x2 blocks for optimal Y subsampling
- For 4:4:4 output, was only writing 1 chroma sample instead of 2x2
- This caused green/corrupted chroma in 4:4:4 encoded output

Solution:
- Added outputChromaHorzSubsampling/VertSubsampling parameters to GenWriteYCbCrBlock
- Detects 4:4:4 output (subsampling == 1)
- For 4:4:4: writes 2x2 chroma pixels (one per Y pixel)
- For 4:2:0/4:2:2: writes 1 chroma sample (existing behavior)

Result:
✅ 4:4:4 input → 4:4:4 output works correctly
✅ 4:2:0 input → 4:2:0 output works correctly (existing)
✅ Always 2x2 dispatch for optimal efficiency
✅ Always 2x2 Y subsampling for AQ (independent of format)

common: Fix dispatch to always use half Y resolution (2x2 blocks)

Critical fix: dispatch was using output chroma resolution instead of fixed 2x2.

Problem:
- RecordCommandBuffer dispatch used output format chroma subsampling
- For 4:4:4 output: chromaSubsampleX/Y = 1, dispatched at FULL resolution
- For 1920x1080 4:4:4: dispatched 1920x1080 threads instead of 960x540
- Caused incorrect processing and wasted 75% of GPU threads

Root Cause:
  const uint32_t chromaWidth = (width + chromaSubsampleX - 1) / chromaSubsampleX;
  For 4:4:4: chromaSubsampleX=1, so chromaWidth = width (WRONG!)

Solution:
- Always dispatch at half Y resolution: (width+1)/2 x (height+1)/2
- One thread per 2x2 Y block, independent of output format
- Matches shader expectations (GenBlockCoordinates uses 2x2)

Fixed in 3 RecordCommandBuffer overloads:
1. Image->Image (line 2698)
2. Buffer->Image (line 2875)
3. Image->Buffer (line 3053)

Result:
✅ 4:4:4: 960x540 dispatch (correct!)
✅ 4:2:0: 960x540 dispatch (same efficiency)
✅ Consistent with shader 2x2 block processing
✅ 75% thread efficiency for all formats

common: Fix Buffer->Buffer dispatch to use half Y resolution

Completed the dispatch fix for all 4 RecordCommandBuffer overloads.

Problem:
- Buffer->Buffer case was missed in previous commit
- Was still dispatching at full resolution instead of half
- Inconsistent with other 3 overloads and shader expectations

Solution:
- Changed from full resolution to (width+1)/2 x (height+1)/2
- Now all 4 cases dispatch at half Y resolution

All RecordCommandBuffer overloads now fixed:
1. Image->Image ✅ (previous commit)
2. Buffer->Image ✅ (previous commit)
3. Image->Buffer ✅ (previous commit)
4. Buffer->Buffer ✅ (this commit)

Result: Complete consistency across all path variants

common: Add 4:4:4 input support - read and average 4 chroma samples

Critical fix for 4:4:4 input formats to preserve chroma detail.

Problem:
- GenReadYCbCrBlock always read only 1 chroma sample
- For 4:4:4 input → 4:4:4 output, this downsampled chroma to 4:2:0 then upsampled
- Lost all chroma detail, caused incorrect colors

Root Cause:
  // Old code - always reads 1 chroma
  float cb = imageLoad(inputImageCb, ivec3(srcChromaPos, ...)).r;

For 4:4:4 input, each Y pixel has its own chroma, but we only read 1 sample
and replicated it to all 4 output pixels.

Solution:
- Pass inputChromaHorzSubsampling/VertSubsampling to GenReadYCbCrBlock
- For 4:4:4 input (subsampling==1): read 4 chroma samples and average
- For 4:2:0 input (subsampling==2): read 1 chroma sample (existing)

Generated code for 4:4:4 input:
  // Read 4 chroma samples and average
  float cb00 = imageLoad(inputImageCb, lumaPos + ivec2(0,0)).r;
  float cr00 = imageLoad(inputImageCr, lumaPos + ivec2(0,0)).r;
  float cb10 = imageLoad(inputImageCb, lumaPos + ivec2(1,0)).r;
  // ... (4 samples total)
  cb = (cb00 + cb10 + cb01 + cb11) * 0.25;

Result:
✅ 4:4:4 input → 4:4:4 output: preserves chroma detail
✅ 4:2:0 input → 4:2:0 output: unchanged (1 sample)
✅ 4:4:4 input → 4:2:0 output: proper downsampling
✅ All input/output format combinations work correctly

common: Make Y-subsampling optional via enableYSubsampling parameter

Added enableYSubsampling parameter to control binding 9 generation.

Problem:
- Y subsampling (binding 9) was always generated even if not needed
- m_enableYSubsampling flag existed but wasn't configurable
- Wasted resources when AQ is not being used

Solution:
- Added enableYSubsampling parameter (default=false) to:
  • VulkanFilterYuvCompute::Create()
  • VulkanFilterYuvCompute constructor
- Conditional generation of:
  • Binding 9 descriptor layout (only if enabled)
  • GenComputeAndWriteSubsampledY() call (only if enabled)
- Added runtime check in RecordCommandBuffer():
  • Asserts if subsampledImageView provided but flag is false
  • Returns VK_ERROR_INITIALIZATION_FAILED to catch API misuse

Usage:
  // Without Y subsampling (default, backward compatible)
  VulkanFilterYuvCompute::Create(...); // enableYSubsampling defaults to false

  // With Y subsampling for AQ
  VulkanFilterYuvCompute::Create(..., true); // enables binding 9

Benefits:
✅ Backward compatible (defaults to false)
✅ No wasted shader code when AQ not used
✅ No wasted binding 9 descriptor when not needed
✅ Runtime validation for API misuse
✅ Clean separation of concerns

Signed-off-by: Tony Zlatinski <[email protected]>
…ple bools

Replaced 3 separate boolean parameters with a single uint32_t bitmask.

Problem:
- VulkanFilterYuvCompute::Create() had 3 separate bool parameters:
  • inputEnableMsbToLsbShift
  • outputEnableLsbToMsbShift
  • enableYSubsampling
- Hard to extend with new options
- Verbose API with growing parameter list

Solution:
- Created FilterFlags enum with bitmask values:
  • FILTER_FLAG_NONE                   = 0
  • FILTER_FLAG_INPUT_MSB_TO_LSB_SHIFT  = (1 << 0)
  • FILTER_FLAG_OUTPUT_LSB_TO_MSB_SHIFT = (1 << 1)
  • FILTER_FLAG_ENABLE_Y_SUBSAMPLING    = (1 << 2)

- Replaced 3 bool parameters with single uint32_t filterFlags
- Updated Create() and constructor signatures
- Updated call sites in encoder and decoder

API Change:
  // Old (3 separate bools):
  Create(..., false, true, true);

  // New (single bitmask):
  uint32_t flags = FILTER_FLAG_OUTPUT_LSB_TO_MSB_SHIFT |
                   FILTER_FLAG_ENABLE_Y_SUBSAMPLING;
  Create(..., flags);

Benefits:
✅ Cleaner API with fewer parameters
✅ Easy to extend with new flags
✅ Self-documenting with named constants
✅ Type-safe enum

Add row/column replication flags to bitmask

Added two new flags for edge replication control.

New Flags:
- FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE: Replicate one row/column at edges
- FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL: Replicate all out-of-bounds pixels

Implementation:
- m_enableRowAndColumnReplication is set to true if either flag is set
- Previously was hardcoded to true in constructor
- Now configurable via filterFlags parameter

Constructor initialization:
  m_enableRowAndColumnReplication =
    (filterFlags & (FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE |
                    FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL)) != 0

Encoder updated:
- Added FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL to default flags
- Maintains previous behavior (replication was always on)

Complete FilterFlags enum:
  FLAG_NONE                               = 0
  FLAG_INPUT_MSB_TO_LSB_SHIFT             = (1 << 0)
  FLAG_OUTPUT_LSB_TO_MSB_SHIFT            = (1 << 1)
  FLAG_ENABLE_Y_SUBSAMPLING               = (1 << 2)
  FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE  = (1 << 3)
  FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL  = (1 << 4)

Result: All filter options now controlled via clean bitmask API
Signed-off-by: Tony Zlatinski <[email protected]>
…ants

Eliminated redundant division operations in every shader thread.

Problem:
- GenBlockCoordinates generated division in shader main():
  uint chromaWidth = (pushConstants.outputWidth + 1) / 2;
  uint chromaHeight = (pushConstants.outputHeight + 1) / 2;
- Computed 518,400 times per frame for 1920x1080 (every thread!)
- inputChromaMax also computed divisions per thread
- Wasted ALU cycles on trivial math

Solution:
- Added halfInputWidth/Height and halfOutputWidth/Height to PushConstants
- CPU computes once: (width + 1) / 2
- Shader uses precomputed: pushConstants.halfOutputWidth
- Updated all 4 RecordCommandBuffer overloads to populate fields

Generated shader changes:
  Before:
    uint chromaWidth = (pushConstants.outputWidth + 1) / 2;
    ivec2 inputChromaMax = ivec2((pushConstants.inputWidth + 1) / 2 - 1, ...);

  After:
    uint chromaWidth = pushConstants.halfOutputWidth;  // Direct read!
    ivec2 inputChromaMax = ivec2(pushConstants.halfInputWidth - 1, ...);

PushConstants struct updated:
+ uint halfInputWidth, halfInputHeight   (precomputed)
+ uint halfOutputWidth, halfOutputHeight (precomputed)

Push constant size fix:
- Changed from hardcoded "6 * sizeof(uint32_t)"
- Now uses sizeof(PushConstants) for correctness

Performance gain:
✅ Eliminates 1M+ division ops per frame (1920x1080)
✅ 2 integer divisions → 2 direct reads per thread
✅ Cleaner shader code
✅ All tests still passing

Signed-off-by: Tony Zlatinski <[email protected]>
Fix corruption with singleton pattern and explicit Vulkan target

Problem:
Multiple VulkanShaderCompiler instances were each creating their own
shaderc_compiler_t, causing state corruption and runtime shader compilation
failures with errors like:
  - 'double' : not supported with this profile: none
  - compute shaders require es profile with version 310 or above

Root Causes:
1. Each VulkanShaderCompiler created a separate shaderc_compiler_t instance
2. shaderc_compile_into_spv() was called with nullptr options (no target env)
3. Multiple compilers interfering with each other's internal state

Solution:
1. Singleton Pattern:
   - Single shared shaderc_compiler_t for entire process (g_sharedCompiler)
   - Reference counting (g_compilerRefCount) to manage lifecycle
   - Thread-safe with std::mutex lock during compilation

2. Explicit Compile Options:
   - Set target_env to vulkan 1.2 (not default/none)
   - Set target SPIR-V version to 1.5
   - Properly initialize and release compile options

3. VulkanComputePipeline Setters:
   - Added SetPipeline/SetShaderModule/SetPipelineCache/SetDeviceContext
   - Allows external code to populate pipeline without calling CreatePipeline()
   - Enables pre-compiled SPIR-V to bypass shaderc entirely

Impact:
- Fixes runtime shader compilation errors for VulkanFilterYuvCompute
- Reduces overhead (one compiler instead of N)
- Thread-safe compilation
- Proper resource cleanup on shutdown

Signed-off-by: Tony Zlatinski <[email protected]>
- Fix GLSL type error by casting yuv to vec3 in normalization shader
- Change denormalization functions to return uint/uvec types with proper
  clamping and rounding for correct integer output
- Add hasChroma parameter to GenConvertYCbCrBlock, GenApplyBlockOutputShift,
  and GenConvertYCbCrFormat to support luma-only (Y-plane) formats
- Fix typo: subsampledYImage -> subsampledImageY
- Add GetFormatBitDepth() helper for correct bit depth detection when
  MSB/LSB shift is enabled on 16-bit formats
- Conditionally generate chroma packing functions only when output has chroma

Signed-off-by: Tony Zlatinski <[email protected]>
- Wrap m_inputSubsampledImagePool and subsampledImageResource with
  #ifdef NV_AQ_GPU_LIB_SUPPORTED to allow building without AQ library
- Only enable FLAG_ENABLE_Y_SUBSAMPLING when m_aqAnalyzes is active
- Align subsampled image extents to 32 pixels for AV1 worst case
- Increase pool size by 4 for better resource availability
- Guard GetSubsampledYImagePool() accessor with AQ compile flag

Signed-off-by: Tony Zlatinski <[email protected]>
- Add R10X6 and R12X4 single-channel packed formats to image view creation
- Extend vkFormatInfo table with 10/12-bit packed formats:
  R10X6/R12X4 (1-ch), R10X6G10X6/R12X4G12X4 (2-ch), and 4-channel variants
- Replace local getFormatTexelSize() with vkFormatLookUp() for texel size
- Add VulkanAqProcessor.h include guarded with NV_AQ_GPU_LIB_SUPPORTED

Signed-off-by: Tony Zlatinski <[email protected]>
@zlatinski
Copy link
Contributor Author

zlatinski commented Dec 24, 2025

Please add some details in the commit message for the commit "filter: fix the normalization handling".

Yes, I've missed to add description and sing those commit. Fixed now.

In the commit "common: Add row/column replication flags to bitmask", please add a line specifying that the FILTER_ prefix for the filter flags introduced in the previous commit is being removed, and the reason for removing the prefix (I'm assuming it is for conciseness).

OK, sure. I actually merged those two commits.

I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?

Yes, because of the breaks. This is the way I'm avoiding goto(s) that I don't tolerate.

@zlatinski
Copy link
Contributor Author

The commit "encoder: quard m_inputSubsampledImage with AQ supported" has a typo in the commit message (quard -> guard), but this is also a change which seems more suited for inclusion in #171 rather than here because the NV_AQ_GPU_LIB_SUPPORTED define is being introduced there.

Fixed. Thank you for the review!

@zlatinski zlatinski force-pushed the compute-filter-downsample+fixes branch 2 times, most recently from d5c7cd4 to 8e2d3ea Compare December 25, 2025 00:05
@zlatinski zlatinski merged commit 5bbfaab into main Jan 1, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants