-
Notifications
You must be signed in to change notification settings - Fork 50
Compute filter downsample and other fixes #166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds Y-subsampled (2x2) luma output to the compute filter pipeline, introduces a subsampled Y image pool to the encoder, refactors filter configuration to bitmask flags, and fixes/optimizes shader compilation and dispatch.
- Adds optional subsampled Y output (binding 9) and half-res dispatch in VulkanFilterYuvCompute, with new push constants and flags
- Allocates and uses a dedicated subsampled Y image pool in the encoder; updates call sites to new filter API
- Introduces a thread-safe shared shaderc compiler singleton; various utility and testing additions
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.h | Adds subsampled image resource member and getter; integrates with pools |
| vk_video_encoder/libs/VkVideoEncoder/VkVideoEncoder.cpp | Allocates/configures subsampled Y pool; updates filter flags and RecordCommandBuffer call |
| vk_video_decoder/libs/VkVideoDecoder/VkVideoDecoder.cpp | Adapts to new RecordCommandBuffer signature |
| scripts/test_y_subsample.py | New end-to-end test harness for subsampling; remote/local execution |
| common/libs/VkCodecUtils/vulkan_internal_dump.cpp | New helper for dumping YCbCr/subsampled surfaces to files |
| common/libs/VkCodecUtils/VulkanShaderCompiler.h/.cpp | Adds shared shaderc compiler with refcount and options; thread-safety |
| common/libs/VkCodecUtils/VulkanFilterYuvCompute.h/.cpp | Flag refactor, binding 9 support, 2x2 block processing, push constants, descriptor updates |
| common/libs/VkCodecUtils/VulkanFilter.h | Changes base virtual RecordCommandBuffer signature; adds optional subsampled output |
| common/libs/VkCodecUtils/VulkanDeviceContext.cpp | Improves transfer queue family selection |
| common/libs/VkCodecUtils/VulkanComputePipeline.h | Adds setters for manual pipeline creation |
| common/libs/VkCodecUtils/VulkanCommandBufferPool.h | Adds PoolNodeHandler RAII helper for command buffers |
| common/libs/VkCodecUtils/VkImageResource.h/.cpp | Adjusts image view/plane handling (constructor/getters/destructor) |
| common/libs/VkCodecUtils/VkBufferResource.h/.cpp | API docs and naming cleanup; minor refactor |
Comments suppressed due to low confidence (1)
common/libs/VkCodecUtils/VulkanFilterYuvCompute.cpp:1
- This overload accepts subsampledYImageView but never binds it to binding 9, so the shader's subsampledYImage write will target an uninitialized descriptor. Mirror the binding 9 write logic from the image->image path when subsampledYImageView is provided.
/*
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
1ab36f1 to
90e21fd
Compare
90e21fd to
9f1ef6c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| m_vkDevCtx->GetComputeQueueFamilyIdx(), | ||
| VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, // Device-local for max GPU performance | ||
| nullptr, // pVideoProfile | ||
| VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image |
Copilot
AI
Oct 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subsampled Y images are created with formats VK_FORMAT_R8_UNORM or VK_FORMAT_R16_UNORM (non-multiplanar), but Configure() is passed VK_IMAGE_ASPECT_PLANE_0_BIT. For non-multiplanar formats the image view aspect must be VK_IMAGE_ASPECT_COLOR_BIT; using PLANE_0 will cause invalid image view creation. Replace VK_IMAGE_ASPECT_PLANE_0_BIT with VK_IMAGE_ASPECT_COLOR_BIT.
| VK_IMAGE_ASPECT_PLANE_0_BIT, // Y-plane only image | |
| VK_IMAGE_ASPECT_COLOR_BIT, // Single-plane Y image (non-multiplanar) |
| false, // isInput (it's output) | ||
| 9, // binding | ||
| 0, // set | ||
| true); // imageArray |
Copilot
AI
Oct 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Binding 9 is declared as image2DArray, but the subsampled image pool is configured without arrays (useImageArray/useImageViewArray are false). This descriptor/view type mismatch will fail validation. Either change this to a non-array binding (imageArray=false) or configure the subsampled image pool to create array images/views.
| true); // imageArray | |
| false); // imageArray |
| false, // useImageArray | ||
| false, // useImageViewArray |
Copilot
AI
Oct 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subsampled Y shader binding (binding 9) is generated as image2DArray, but the pool here disables image arrays and array views. To keep descriptor/view types compatible, either set useImageArray/useImageViewArray to true for this pool or adjust the shader to use image2D (non-array) for the subsampled Y output.
| false, // useImageArray | |
| false, // useImageViewArray | |
| true, // useImageArray | |
| true, // useImageViewArray |
| std::cerr << "VulkanShaderCompiler: Failed to initialize shared shaderc compiler!" << std::endl; | ||
| return nullptr; | ||
| } | ||
| std::cout << "VulkanShaderCompiler: Initialized shared shaderc compiler" << std::endl; |
Copilot
AI
Oct 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Unconditional std::cout logging in a library can be noisy in production contexts. Consider gating these logs behind a debug flag/environment variable or using the project's logging facility with appropriate levels.
| // Last instance - release the shared compiler | ||
| shaderc_compiler_release(g_sharedCompiler); | ||
| g_sharedCompiler = nullptr; | ||
| std::cout << "VulkanShaderCompiler: Released shared shaderc compiler" << std::endl; |
Copilot
AI
Oct 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Unconditional std::cout logging in a library can be noisy in production contexts. Consider gating these logs behind a debug flag/environment variable or using the project's logging facility with appropriate levels.
| shaderStr << | ||
| " // Calculate chroma output dimensions\n" | ||
| " uint chromaWidth = (pushConstants.outputWidth + " << (chromaHorzRatio - 1) << ") / " << chromaHorzRatio << ";\n" | ||
| " uint chromaHeight = (pushConstants.outputHeight + " << (chromaVertRatio - 1) << ") / " << chromaVertRatio << ";\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are always powers of 2, division can be replaced with right shift:
4:4:4: chromaHorzRatio=1, chromaVertRatio=1 → shift 0, shift 0
4:2:2: chromaHorzRatio=2, chromaVertRatio=1 → shift 1, shift 0
4:2:0: chromaHorzRatio=2, chromaVertRatio=2 → shift 1, shift 1
However, it is possible a compiler optimizes this
| shaderStr << "true"; | ||
| shaderStr << | ||
| " ivec2 inputChromaMax = ivec2((pushConstants.inputWidth + " << (chromaHorzRatio-1) << ") / " << chromaHorzRatio << " - 1, " | ||
| "(pushConstants.inputHeight + " << (chromaVertRatio-1) << ") / " << chromaVertRatio << " - 1);\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same as above
| VK_IMAGE_ASPECT_PLANE_1_BIT | | ||
| VK_IMAGE_ASPECT_PLANE_2_BIT); | ||
|
|
||
| while (yCbCrAspectMask != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just use if(yCbCrAspectMask != 0){} instead of while(yCbCrAspectMask != 0){...break;} here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand what you are proposing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to change to if (yCbCrAspectMask ! = 0) at line 243 and remove correspondingbreak at line 280
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?
| bool enableShift, | ||
| bool hasChroma = true) | ||
| { | ||
| if (!enableShift) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: Can we check this before calling the function? and not to use enableShift as an extra parameter at all?
But from the other side it accumulates all the ifs here, in general it's OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these functions are in general inlined by the compiler, so it doesn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, agree
| } | ||
| float scale = 1.0f / (chromaHorzRatio * chromaVertRatio); | ||
| shaderStr << " cb *= " << scale << ";\n"; | ||
| shaderStr << " cr *= " << scale << ";\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, we use box filter for chroma downsampling. we could reduce aliasing with lowpass filter but it's OK to use box in our case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't care about the filtering for AQ analyses. We want this as cheap as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, agree
| } | ||
| float scale = 1.0f / (chromaHorzRatio * chromaVertRatio); | ||
| shaderStr << " cb *= " << scale << ";\n"; | ||
| shaderStr << " cr *= " << scale << ";\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ratios are never 0:
// Get input/output chroma subsampling ratios
const uint32_t inputChromaHorzRatio = (inputMpInfo != nullptr) ? (1 << inputMpInfo->planesLayout.secondaryPlaneSubsampledX) : 1;
const uint32_t inputChromaVertRatio = (inputMpInfo != nullptr) ? (1 << inputMpInfo->planesLayout.secondaryPlaneSubsampledY) : 1;
const uint32_t outputChromaHorzRatio = (outputMpInfo != nullptr) ? (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledX) : 1;
const uint32_t outputChromaVertRatio = (outputMpInfo != nullptr) ? (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledY) : 1;
They can be "1" - or "1 << value".
2ce3a24 to
bb1776c
Compare
srinathkr-nv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some details in the commit message for the commit "filter: fix the normalization handling".
In the commit "common: Add row/column replication flags to bitmask", please add a line specifying that the FILTER_ prefix for the filter flags introduced in the previous commit is being removed, and the reason for removing the prefix (I'm assuming it is for conciseness).
| VK_IMAGE_ASPECT_PLANE_1_BIT | | ||
| VK_IMAGE_ASPECT_PLANE_2_BIT); | ||
|
|
||
| while (yCbCrAspectMask != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same question as @t-boiko. Why is the while-loop used here? Is it to allow for an early termination i..e the break on line 266?
| return m_pipeline; | ||
| } | ||
|
|
||
| // Setters for manual pipeline creation (bypassing shaderc for pre-compiled SPIR-V) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these additions are used anywhere in this series of changes. Can they be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these additions are used anywhere in this series of changes. Can they be removed?
OK, yes. I have a separate commit for that, now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are still present in the latest commit series and continue to be unused.
| * // Submit command buffer using cmd.GetNode()->... | ||
| * } // Auto cleanup on scope exit | ||
| */ | ||
| class PoolNodeHandler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class isn't being used anywhere, either in this series or in #171 . Can you describe the need for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of using it as part of a slightly different design, but then decided against it. So, I've removed this commit now.
| filterFlags |= VulkanFilterYuvCompute::FLAG_OUTPUT_LSB_TO_MSB_SHIFT; | ||
| } | ||
| #ifdef NV_AQ_GPU_LIB_SUPPORTED | ||
| if (m_aqAnalyzes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 1335 - 1340 should be moved to #171 , as that is where the NV_AQ_GPU_LIB_SUPPORTED define is being introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was done intentionally, so if the AQ integration is not used, not to have the overhead of using the sub-sampling. This define doesn't do any harm here, because it is not defined yet.
I didn't want to build infrastructure just for 2x2 sub-sampling to be overwritten later on with AQ. I would rather not change this.
srinathkr-nv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit "encoder: quard m_inputSubsampledImage with AQ supported" has a typo in the commit message (quard -> guard), but this is also a change which seems more suited for inclusion in #171 rather than here because the NV_AQ_GPU_LIB_SUPPORTED define is being introduced there.
Add support to the image views of of Y and Cb/Cr only planes. Add an aspect mask to the image pool. Signed-off-by: Tony Zlatinski <[email protected]>
This commit adds 2x2 Y-channel subsampling capability to VulkanFilterYuvCompute for Adaptive Quantization (AQ) preprocessing, with optimal GPU performance through chroma-resolution block-based dispatch. FEATURES ADDED: 1. Y-Subsampling Output (Binding 9) - Added descriptor binding 9 for subsampled Y output image (R8/R16 format) - Generates half-resolution luma map via 2x2 box filter averaging - Output available in VkVideoEncodeFrameInfo::subsampledImageResource - Proper ref-counted lifetime management within encode frame info 2. Optimal Chroma-Resolution Dispatch - Dispatch at chroma resolution instead of luma resolution - One thread processes: NxM Y pixels + 1 CbCr pixel + 1 subsampled Y - 75% thread reduction for 4:2:0 formats (e.g., 1920x1080: 2.07M → 518K threads) - Zero branch divergence (all threads execute same code path) - Localized memory access pattern (2x2 block reads/writes) - Applied to all 4 RecordCommandBuffer overloads (Image/Buffer combinations) 3. Clean Gen Function Architecture - GenBlockCoordinates() - Block coordinate calculation with replication support - GenReadYCbCrBlock() - Read NxM Y pixels + 1 chroma sample - GenConvertYCbCrBlock() - Format conversion for entire block - GenApplyBlockOutputShift() - LSB-to-MSB shift for 10/12-bit output - GenWriteYCbCrBlock() - Write NxM Y pixels + 1 chroma sample - GenComputeAndWriteSubsampledY() - Box filter and subsampled Y output 4. Unified RecordCommandBuffer API - Added optional subsampledImageView parameter to all 4 overloads - Backward compatible (nullptr default) - Conditional binding 9 descriptor (only when subsampledImageView provided) - Consistent parameter order across all overloads (bufferIdx first) TECHNICAL DETAILS: Shader Generation Pipeline (InitYCBCRCOPY): - Stage 1-2: Header, push constants, descriptor bindings - Stage 3-8: Helper functions (normalization, conversion, if needed) - Stage 9: Main function declaration - Stage 10: GenBlockCoordinates() - chromaPos, lumaPos, bounds, replication - Stage 11: GenReadYCbCrBlock() - Read 4 Y + 1 CbCr with edge clamping - Stage 12: GenConvertYCbCrBlock() - Apply format conversion if needed - Stage 13: GenApplyBlockOutputShift() - LSB→MSB shift for 10/12-bit - Stage 14: GenWriteYCbCrBlock() - Write 4 Y + 1 CbCr to output - Stage 15: GenComputeAndWriteSubsampledY() - Box filter to binding 9 Dispatch Strategy: - OLD: Dispatch at luma resolution, conditional chroma processing * 1920x1080: 2,073,600 threads, 50% branch divergence - NEW: Dispatch at chroma resolution, block-based processing * 1920x1080 (4:2:0): 518,400 threads (75% reduction) * Each thread: 4 Y reads, 1 CbCr read, conversions, 4 Y writes, 1 CbCr write, 1 subsampled Y write * Zero branch divergence, better cache locality Resource Management: - Added subsampledImageResource to VkVideoEncodeFrameInfo struct - Ref-counted lifetime management (released in Reset()) - Image pool allocation in VkVideoEncoder::InitEncoder() - Half-resolution format matches output bit depth (R8 or R16) - Usage flags: STORAGE | SAMPLED | TRANSFER_SRC | TRANSFER_DST Format Support: - Chroma subsampling: 4:2:0, 4:2:2, 4:4:4 - Bit depths: 8, 10, 12, 16-bit - Plane layouts: 2-plane (NV12) and 3-plane (I420) - Input/Output: Image and buffer combinations PERFORMANCE: Thread Count Reduction (4:2:0): - 352x288: 101,376 → 25,344 (75% reduction) - 1280x720: 921,600 → 230,400 (75% reduction) - 1920x1080: 2,073,600 → 518,400 (75% reduction) - 3840x2160: 8,294,400 → 2,073,600 (75% reduction) Benefits: - Lower thread launch overhead - Better GPU occupancy (higher work per thread) - Zero warp divergence - Improved cache utilization (localized 2x2 reads) - Estimated 2-4x speedup on YCbCr conversion pass TESTING: Verified on NVIDIA GeForce RTX 4080: - H.264 8-bit @ 352x288: 10 frames ✓ - H.264 8-bit @ 1920x1080: 100 frames ✓ (stress test, no pool exhaustion) - HEVC 10-bit @ 176x144: 5 frames ✓ (LSB-to-MSB shift verified) - AV1 8-bit @ 352x288: 10 frames ✓ Build status: Zero errors, zero warnings (clean build) API CHANGES: VulkanFilterYuvCompute.h: - Added optional subsampledImageView parameter to all RecordCommandBuffer overloads - Added subsampledImageResourceInfo parameter for future extensibility VkVideoEncoder.h: - Added subsampledImageResource member to VkVideoEncodeFrameInfo - Released in Reset() for proper pool recycling COMPATIBILITY: - Backward compatible (subsampledImageView defaults to nullptr) - Existing code works without changes - Binding 9 only used when subsampledImageView is provided - No API breaks Signed-off-by: Tony Zlatinski <[email protected]>
Fixed GenHandleSourcePositionWithReplicate to clamp to valid range. Problem: - GenHandleSourcePositionWithReplicate clamped to width/height: ivec2 srcPos = min(pos, ivec2(pushConstants.inputWidth, pushConstants.inputHeight)); - For 1920x1080, allows coordinates up to (1920, 1080) - Valid range is [0..1919] x [0..1079] - Off-by-one error allowed sampling outside image bounds Root Cause: min(pos, inputWidth) can return inputWidth (e.g., 1920) But valid coords are [0..inputWidth-1] (e.g., [0..1919]) Solution: - Changed to: min(pos, ivec2(inputWidth - 1, inputHeight - 1)) - Now clamps to [0..1919] x [0..1079] for 1920x1080 - Prevents out-of-bounds image access Verified other clamping code: ✅ inputLumaMax uses (inputWidth - 1, inputHeight - 1) - correct ✅ inputChromaMax uses (halfInputWidth - 1, halfInputHeight - 1) - correct ✅ All min(lumaPos + offset, inputLumaMax) operations - correct Impact: - Fixes potential GPU memory access violations - Fixes edge pixel replication (was potentially reading invalid data) - Critical fix for stability Note: This function is used in YCBCR2RGBA filter, not the YCBCRCOPY path Signed-off-by: Tony Zlatinski <[email protected]>
…t format Critical fix for AQ Y-subsampling when output is 4:4:4 (or any non-4:2:0). Problem: - Previously used output format's chroma subsampling to determine Y block size - When output is 4:4:4 (no chroma subsampling), generated 1x1 blocks - This resulted in NO actual Y downsampling (subsampledY = yOut00 * 1) - AQ algorithms received full-resolution Y data instead of 2x2 downsampled Root Cause: const uint32_t chromaHorzRatio = (1 << outputMpInfo->planesLayout.secondaryPlaneSubsampledX); For 4:4:4: secondaryPlaneSubsampledX = 0, so chromaHorzRatio = 1 Fix: - Always use fixed 2x2 Y subsampling for binding 9 (AQ input) - Independent of output format's chroma subsampling - const uint32_t ySubsampleHorzRatio = 2; - const uint32_t ySubsampleVertRatio = 2; Result: ✅ Reads 2x2 Y block (y00, y10, y01, y11) ✅ Writes all 4 Y pixels to output (works for 4:4:4) ✅ Computes proper box filter: (y00+y10+y01+y11)*0.25 ✅ Writes half-resolution Y to binding 9 ✅ Works correctly for 4:2:0, 4:2:2, AND 4:4:4 outputs common: Add 4:4:4 output support to Y-subsampling filter Fixed GenWriteYCbCrBlock to handle 4:4:4 output formats correctly. Problem: - Always dispatched with 2x2 blocks for optimal Y subsampling - For 4:4:4 output, was only writing 1 chroma sample instead of 2x2 - This caused green/corrupted chroma in 4:4:4 encoded output Solution: - Added outputChromaHorzSubsampling/VertSubsampling parameters to GenWriteYCbCrBlock - Detects 4:4:4 output (subsampling == 1) - For 4:4:4: writes 2x2 chroma pixels (one per Y pixel) - For 4:2:0/4:2:2: writes 1 chroma sample (existing behavior) Result: ✅ 4:4:4 input → 4:4:4 output works correctly ✅ 4:2:0 input → 4:2:0 output works correctly (existing) ✅ Always 2x2 dispatch for optimal efficiency ✅ Always 2x2 Y subsampling for AQ (independent of format) common: Fix dispatch to always use half Y resolution (2x2 blocks) Critical fix: dispatch was using output chroma resolution instead of fixed 2x2. Problem: - RecordCommandBuffer dispatch used output format chroma subsampling - For 4:4:4 output: chromaSubsampleX/Y = 1, dispatched at FULL resolution - For 1920x1080 4:4:4: dispatched 1920x1080 threads instead of 960x540 - Caused incorrect processing and wasted 75% of GPU threads Root Cause: const uint32_t chromaWidth = (width + chromaSubsampleX - 1) / chromaSubsampleX; For 4:4:4: chromaSubsampleX=1, so chromaWidth = width (WRONG!) Solution: - Always dispatch at half Y resolution: (width+1)/2 x (height+1)/2 - One thread per 2x2 Y block, independent of output format - Matches shader expectations (GenBlockCoordinates uses 2x2) Fixed in 3 RecordCommandBuffer overloads: 1. Image->Image (line 2698) 2. Buffer->Image (line 2875) 3. Image->Buffer (line 3053) Result: ✅ 4:4:4: 960x540 dispatch (correct!) ✅ 4:2:0: 960x540 dispatch (same efficiency) ✅ Consistent with shader 2x2 block processing ✅ 75% thread efficiency for all formats common: Fix Buffer->Buffer dispatch to use half Y resolution Completed the dispatch fix for all 4 RecordCommandBuffer overloads. Problem: - Buffer->Buffer case was missed in previous commit - Was still dispatching at full resolution instead of half - Inconsistent with other 3 overloads and shader expectations Solution: - Changed from full resolution to (width+1)/2 x (height+1)/2 - Now all 4 cases dispatch at half Y resolution All RecordCommandBuffer overloads now fixed: 1. Image->Image ✅ (previous commit) 2. Buffer->Image ✅ (previous commit) 3. Image->Buffer ✅ (previous commit) 4. Buffer->Buffer ✅ (this commit) Result: Complete consistency across all path variants common: Add 4:4:4 input support - read and average 4 chroma samples Critical fix for 4:4:4 input formats to preserve chroma detail. Problem: - GenReadYCbCrBlock always read only 1 chroma sample - For 4:4:4 input → 4:4:4 output, this downsampled chroma to 4:2:0 then upsampled - Lost all chroma detail, caused incorrect colors Root Cause: // Old code - always reads 1 chroma float cb = imageLoad(inputImageCb, ivec3(srcChromaPos, ...)).r; For 4:4:4 input, each Y pixel has its own chroma, but we only read 1 sample and replicated it to all 4 output pixels. Solution: - Pass inputChromaHorzSubsampling/VertSubsampling to GenReadYCbCrBlock - For 4:4:4 input (subsampling==1): read 4 chroma samples and average - For 4:2:0 input (subsampling==2): read 1 chroma sample (existing) Generated code for 4:4:4 input: // Read 4 chroma samples and average float cb00 = imageLoad(inputImageCb, lumaPos + ivec2(0,0)).r; float cr00 = imageLoad(inputImageCr, lumaPos + ivec2(0,0)).r; float cb10 = imageLoad(inputImageCb, lumaPos + ivec2(1,0)).r; // ... (4 samples total) cb = (cb00 + cb10 + cb01 + cb11) * 0.25; Result: ✅ 4:4:4 input → 4:4:4 output: preserves chroma detail ✅ 4:2:0 input → 4:2:0 output: unchanged (1 sample) ✅ 4:4:4 input → 4:2:0 output: proper downsampling ✅ All input/output format combinations work correctly common: Make Y-subsampling optional via enableYSubsampling parameter Added enableYSubsampling parameter to control binding 9 generation. Problem: - Y subsampling (binding 9) was always generated even if not needed - m_enableYSubsampling flag existed but wasn't configurable - Wasted resources when AQ is not being used Solution: - Added enableYSubsampling parameter (default=false) to: • VulkanFilterYuvCompute::Create() • VulkanFilterYuvCompute constructor - Conditional generation of: • Binding 9 descriptor layout (only if enabled) • GenComputeAndWriteSubsampledY() call (only if enabled) - Added runtime check in RecordCommandBuffer(): • Asserts if subsampledImageView provided but flag is false • Returns VK_ERROR_INITIALIZATION_FAILED to catch API misuse Usage: // Without Y subsampling (default, backward compatible) VulkanFilterYuvCompute::Create(...); // enableYSubsampling defaults to false // With Y subsampling for AQ VulkanFilterYuvCompute::Create(..., true); // enables binding 9 Benefits: ✅ Backward compatible (defaults to false) ✅ No wasted shader code when AQ not used ✅ No wasted binding 9 descriptor when not needed ✅ Runtime validation for API misuse ✅ Clean separation of concerns Signed-off-by: Tony Zlatinski <[email protected]>
…ple bools
Replaced 3 separate boolean parameters with a single uint32_t bitmask.
Problem:
- VulkanFilterYuvCompute::Create() had 3 separate bool parameters:
• inputEnableMsbToLsbShift
• outputEnableLsbToMsbShift
• enableYSubsampling
- Hard to extend with new options
- Verbose API with growing parameter list
Solution:
- Created FilterFlags enum with bitmask values:
• FILTER_FLAG_NONE = 0
• FILTER_FLAG_INPUT_MSB_TO_LSB_SHIFT = (1 << 0)
• FILTER_FLAG_OUTPUT_LSB_TO_MSB_SHIFT = (1 << 1)
• FILTER_FLAG_ENABLE_Y_SUBSAMPLING = (1 << 2)
- Replaced 3 bool parameters with single uint32_t filterFlags
- Updated Create() and constructor signatures
- Updated call sites in encoder and decoder
API Change:
// Old (3 separate bools):
Create(..., false, true, true);
// New (single bitmask):
uint32_t flags = FILTER_FLAG_OUTPUT_LSB_TO_MSB_SHIFT |
FILTER_FLAG_ENABLE_Y_SUBSAMPLING;
Create(..., flags);
Benefits:
✅ Cleaner API with fewer parameters
✅ Easy to extend with new flags
✅ Self-documenting with named constants
✅ Type-safe enum
Add row/column replication flags to bitmask
Added two new flags for edge replication control.
New Flags:
- FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE: Replicate one row/column at edges
- FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL: Replicate all out-of-bounds pixels
Implementation:
- m_enableRowAndColumnReplication is set to true if either flag is set
- Previously was hardcoded to true in constructor
- Now configurable via filterFlags parameter
Constructor initialization:
m_enableRowAndColumnReplication =
(filterFlags & (FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE |
FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL)) != 0
Encoder updated:
- Added FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL to default flags
- Maintains previous behavior (replication was always on)
Complete FilterFlags enum:
FLAG_NONE = 0
FLAG_INPUT_MSB_TO_LSB_SHIFT = (1 << 0)
FLAG_OUTPUT_LSB_TO_MSB_SHIFT = (1 << 1)
FLAG_ENABLE_Y_SUBSAMPLING = (1 << 2)
FLAG_ENABLE_ROW_COLUMN_REPLICATION_ONE = (1 << 3)
FLAG_ENABLE_ROW_COLUMN_REPLICATION_ALL = (1 << 4)
Result: All filter options now controlled via clean bitmask API
Signed-off-by: Tony Zlatinski <[email protected]>
…ants
Eliminated redundant division operations in every shader thread.
Problem:
- GenBlockCoordinates generated division in shader main():
uint chromaWidth = (pushConstants.outputWidth + 1) / 2;
uint chromaHeight = (pushConstants.outputHeight + 1) / 2;
- Computed 518,400 times per frame for 1920x1080 (every thread!)
- inputChromaMax also computed divisions per thread
- Wasted ALU cycles on trivial math
Solution:
- Added halfInputWidth/Height and halfOutputWidth/Height to PushConstants
- CPU computes once: (width + 1) / 2
- Shader uses precomputed: pushConstants.halfOutputWidth
- Updated all 4 RecordCommandBuffer overloads to populate fields
Generated shader changes:
Before:
uint chromaWidth = (pushConstants.outputWidth + 1) / 2;
ivec2 inputChromaMax = ivec2((pushConstants.inputWidth + 1) / 2 - 1, ...);
After:
uint chromaWidth = pushConstants.halfOutputWidth; // Direct read!
ivec2 inputChromaMax = ivec2(pushConstants.halfInputWidth - 1, ...);
PushConstants struct updated:
+ uint halfInputWidth, halfInputHeight (precomputed)
+ uint halfOutputWidth, halfOutputHeight (precomputed)
Push constant size fix:
- Changed from hardcoded "6 * sizeof(uint32_t)"
- Now uses sizeof(PushConstants) for correctness
Performance gain:
✅ Eliminates 1M+ division ops per frame (1920x1080)
✅ 2 integer divisions → 2 direct reads per thread
✅ Cleaner shader code
✅ All tests still passing
Signed-off-by: Tony Zlatinski <[email protected]>
Signed-off-by: Tony Zlatinski <[email protected]>
Signed-off-by: Tony Zlatinski <[email protected]>
Fix corruption with singleton pattern and explicit Vulkan target Problem: Multiple VulkanShaderCompiler instances were each creating their own shaderc_compiler_t, causing state corruption and runtime shader compilation failures with errors like: - 'double' : not supported with this profile: none - compute shaders require es profile with version 310 or above Root Causes: 1. Each VulkanShaderCompiler created a separate shaderc_compiler_t instance 2. shaderc_compile_into_spv() was called with nullptr options (no target env) 3. Multiple compilers interfering with each other's internal state Solution: 1. Singleton Pattern: - Single shared shaderc_compiler_t for entire process (g_sharedCompiler) - Reference counting (g_compilerRefCount) to manage lifecycle - Thread-safe with std::mutex lock during compilation 2. Explicit Compile Options: - Set target_env to vulkan 1.2 (not default/none) - Set target SPIR-V version to 1.5 - Properly initialize and release compile options 3. VulkanComputePipeline Setters: - Added SetPipeline/SetShaderModule/SetPipelineCache/SetDeviceContext - Allows external code to populate pipeline without calling CreatePipeline() - Enables pre-compiled SPIR-V to bypass shaderc entirely Impact: - Fixes runtime shader compilation errors for VulkanFilterYuvCompute - Reduces overhead (one compiler instead of N) - Thread-safe compilation - Proper resource cleanup on shutdown Signed-off-by: Tony Zlatinski <[email protected]>
Signed-off-by: Tony Zlatinski <[email protected]>
- Fix GLSL type error by casting yuv to vec3 in normalization shader - Change denormalization functions to return uint/uvec types with proper clamping and rounding for correct integer output - Add hasChroma parameter to GenConvertYCbCrBlock, GenApplyBlockOutputShift, and GenConvertYCbCrFormat to support luma-only (Y-plane) formats - Fix typo: subsampledYImage -> subsampledImageY - Add GetFormatBitDepth() helper for correct bit depth detection when MSB/LSB shift is enabled on 16-bit formats - Conditionally generate chroma packing functions only when output has chroma Signed-off-by: Tony Zlatinski <[email protected]>
- Wrap m_inputSubsampledImagePool and subsampledImageResource with #ifdef NV_AQ_GPU_LIB_SUPPORTED to allow building without AQ library - Only enable FLAG_ENABLE_Y_SUBSAMPLING when m_aqAnalyzes is active - Align subsampled image extents to 32 pixels for AV1 worst case - Increase pool size by 4 for better resource availability - Guard GetSubsampledYImagePool() accessor with AQ compile flag Signed-off-by: Tony Zlatinski <[email protected]>
- Add R10X6 and R12X4 single-channel packed formats to image view creation - Extend vkFormatInfo table with 10/12-bit packed formats: R10X6/R12X4 (1-ch), R10X6G10X6/R12X4G12X4 (2-ch), and 4-channel variants - Replace local getFormatTexelSize() with vkFormatLookUp() for texel size - Add VulkanAqProcessor.h include guarded with NV_AQ_GPU_LIB_SUPPORTED Signed-off-by: Tony Zlatinski <[email protected]>
Yes, I've missed to add description and sing those commit. Fixed now.
OK, sure. I actually merged those two commits.
Yes, because of the breaks. This is the way I'm avoiding goto(s) that I don't tolerate. |
Fixed. Thank you for the review! |
d5c7cd4 to
8e2d3ea
Compare
1ab36f1 common: add helper PoolNodeHandler class to CommandBufferPool
fa9c970 common: VkBufferResource documentation
d37c7d0 VulkanShaderCompiler: Fix shaderc singleton
4c7ca1d common: VulkanFilter - fixed the header file
327b7a2 common: VulkanDeviceContext set queue family for XFER
a004479 common: Optimize shader by precomputing half-resolution in push constants
f2fc2d4 common: Add row/column replication flags to bitmask
7efca55 common: Refactor filter options to use bitmask flags instead of multiple bools
ed989f8 common: Fix Y-subsampling to use fixed 2x2 blocks regardless of output format
821db54 common: add subsampled Y output image to the filter
7394fb5 common: YuvCompute - Add Y-only and CbCr-only plane support