This shader can be used to decode meshlets encoded via meshopt_encodeMeshlet. The implementation is not optimized yet and represents a line-by-line port of the scalar C++ decoding; it will be improved separately. Each meshlet is decoded serially in a separate thread. Vertex data is easy to decode using wave intrinsics; this will be done separately too.
The load mask is easy to construct dynamically based on the length. This is not faster, but it aligns better with the GPU decoding method, and it's the only remaining function-local static array, so we might as well remove it.
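The dynamic mask construction can be sketched as follows; the helper name and exact layout are illustrative, not the code from this change:

```cpp
#include <assert.h>
#include <stdint.h>

// Build a mask that keeps the first `length` bytes of a 32-bit word,
// replacing a function-local static lookup table along the lines of
//   static const uint32_t kMask[5] = {0, 0xff, 0xffff, 0xffffff, ~0u};
// (helper name and layout are illustrative)
static uint32_t loadMask(unsigned length)
{
	assert(length <= 4);

	// (1 << length*8) - 1, with the length == 4 case handled separately
	// to avoid an undefined 32-bit shift
	return length == 4 ? ~0u : (1u << (length * 8)) - 1;
}
```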
This change replaces the CPU-friendly equivalent with a branch (plus a simplified version of our branchless CPU code, which could generate branches depending on the support for predicated loads...) with a fully branchless decoder. While restarts usually happen only in the first triangle of a meshlet, so the restart branch is coherent, occasionally meshlets have restarts in the middle of the sequence, which is rarely aligned between different meshlets. A branchless sequence never seems to be a regression and usually results in ~15% better throughput on NV GPUs.
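As an illustration of the general pattern (not the actual codec logic), a branch on a restart condition can be turned into arithmetic selection, which avoids divergence when the condition differs between threads of the same wave:

```cpp
#include <assert.h>
#include <stdint.h>

// Branchy version: pick a fresh value on restart, otherwise derive from the
// previous one (the update rule here is a stand-in for the real decoder logic).
static uint32_t nextBranchy(uint32_t prev, uint32_t fresh, bool restart)
{
	if (restart)
		return fresh;
	return prev + 1;
}

// Branchless version: compute both candidates and blend with a mask;
// compilers typically lower this to a select/csel with no control flow.
static uint32_t nextBranchless(uint32_t prev, uint32_t fresh, bool restart)
{
	uint32_t mask = restart ? ~0u : 0u;
	return (fresh & mask) | ((prev + 1) & ~mask);
}
```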
Similarly to the CPU SIMD decoding, we can decode triangles in pairs; we still use the same branchless scalar logic, but this allows us to read the code byte just once, which helps reduce inefficient cache traffic and improves performance further by up to 10%.
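The pairwise structure can be illustrated as follows: when each byte holds two 4-bit codes, one load serves two triangles. The nibble layout below is illustrative, not the actual encoded format:

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Process codes two at a time: a single byte fetch yields the codes for a
// pair of triangles (low nibble first, then high nibble; layout illustrative).
static void decodeCodePairs(const uint8_t* codes, size_t triangle_count, uint8_t* out)
{
	for (size_t i = 0; i < triangle_count; i += 2)
	{
		uint8_t byte = codes[i / 2]; // read once per pair

		out[i] = byte & 0xf;
		if (i + 1 < triangle_count)
			out[i + 1] = byte >> 4;
	}
}
```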
Reformatted meshletdec.slang using slangd and added minor clarifications to the code and the documentation.
Link the Slang example shader for better discovery.
`demo/meshletdec.slang` can be used to decode meshlets encoded via `meshopt_encodeMeshlet`. Each meshlet is decoded serially in a separate thread; this implementation approach works well for stream decode (when a large set of meshlets needs to be decoded). It's not a good fit for mesh shaders, unless multiple sub-meshlets are used and decoded on multiple threads. Note that the vertex decoding in particular can be performed using cooperative wave decoding; triangle decoding might be implementable in a similar fashion but it would require further research. In either case, this implementation doesn't pursue that, and is designed for applications that already have streaming decode that runs on the GPU, in case `meshopt_decodeMeshlets` is inconvenient to use.

The example implementation decodes each vertex/triangle into uint32; alternative engine-specific packing code should be easy to incorporate, modulo potential efficiency concerns about byte/unaligned writes. In that case it might also be worthwhile to decode meshlet data into shared memory and then write repacked data into global memory using optimally aligned transactions. All of this is left as an exercise to the reader :)
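For instance, an engine that stores each meshlet triangle as three 8-bit local indices packed into one uint32 could swap in packing code along these lines (a purely hypothetical layout, not the format the shader emits):

```cpp
#include <assert.h>
#include <stdint.h>

// Hypothetical engine-specific packing: three 8-bit local vertex indices
// per uint32, low byte first; the top byte is left zero.
static uint32_t packTriangle(uint32_t a, uint32_t b, uint32_t c)
{
	assert(a < 256 && b < 256 && c < 256);
	return a | (b << 8) | (c << 16);
}
```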
On NVIDIA GeForce RTX 5070, this implementation decodes 64/96 meshlets at 17-23B triangles/sec (equivalent to 120-150 GB/s of output data); that is roughly equivalent to ~16 Zen4 cores, except that multi-core CPU decoding quickly hits the memory bandwidth limit, well below 100+ GB/s on typical systems. This should run fine in async compute too if it's co-scheduled with ALU-intensive code, as the decoding is light on ALU and is mostly bound by L2 cache access.
This contribution is sponsored by Valve.