demo: Add an example Slang meshlet decoder#1018

Merged
zeux merged 6 commits into master from mlc-slang on Feb 12, 2026

zeux commented on Feb 11, 2026

demo/meshletdec.slang can be used to decode meshlets encoded via meshopt_encodeMeshlet.

Each meshlet is decoded serially in a separate thread; this implementation approach works well for stream decode (when a large set of meshlets needs to be decoded). It's not a good fit for mesh shaders, unless multiple sub-meshlets are used and decoded on multiple threads. Note that the vertex decoding in particular can be performed using cooperative wave decoding; triangle decoding might be implementable in a similar fashion but it would require further research. In either case, this implementation doesn't pursue that, and is designed for applications that already have streaming decode that runs on the GPU, in case meshopt_decodeMeshlets is inconvenient to use.

The example implementation decodes each vertex/triangle into uint32; alternative engine-specific packing code should be easy to incorporate, modulo potential efficiency concerns about byte/unaligned writes. In that case it might also be worthwhile to decode meshlet data into shared memory and then write repacked data into global memory using optimally aligned transactions. All of this is left as an exercise to the reader :)
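
As an illustration of what "engine-specific packing" could mean, here is a minimal C++ sketch. It assumes, without confirmation from the shader source, that a decoded triangle is three 8-bit meshlet-local indices stored in the low 24 bits of a uint32; `packTriangle` and `storeTriangle3` are hypothetical helper names, not part of meshoptimizer.

```cpp
#include <cstdint>

// Assumed layout (not confirmed by the PR text): three 8-bit meshlet-local
// vertex indices in the low 24 bits of a uint32, top byte unused.
inline uint32_t packTriangle(uint8_t a, uint8_t b, uint8_t c)
{
    return uint32_t(a) | (uint32_t(b) << 8) | (uint32_t(c) << 16);
}

// Hypothetical engine-specific repack: store the same triangle as 3 tightly
// packed bytes. On a GPU these would be the byte/unaligned writes that the
// description warns about.
inline void storeTriangle3(uint8_t* out, uint32_t tri)
{
    out[0] = uint8_t(tri);
    out[1] = uint8_t(tri >> 8);
    out[2] = uint8_t(tri >> 16);
}
```

The 3-byte form saves a quarter of the output size but gives up 4-byte alignment, which is the trade-off that staging through shared memory is meant to hide.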

On NVIDIA GeForce RTX 5070, this implementation decodes 64/96 meshlets at 17-23B triangles/sec (equivalent to 120-150 GB/s of output data). This is roughly equivalent to 16 Zen4 cores, except that multi-core CPU decoding quickly hits the memory bandwidth limit, which is well below 100 GB/s on typical systems. This should run fine in async compute too if it's co-scheduled with ALU-intensive code, as the decoding is light on ALU and is mostly bound by L2 cache access.
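
The relationship between triangle rate and output bandwidth can be checked with a short calculation. This sketch assumes that "64/96" means up to 64 vertices and 96 triangles per meshlet, that every meshlet is full, and that each vertex and each triangle is written as one 4-byte uint32; `outputBytesPerSecond` is a hypothetical helper.

```cpp
#include <cstdint>

// Output bytes per second for a stream of full meshlets, assuming every
// vertex and every triangle is written as one 4-byte uint32.
inline double outputBytesPerSecond(double trianglesPerSecond, uint32_t maxVertices, uint32_t maxTriangles)
{
    const double bytesPerMeshlet = 4.0 * (maxVertices + maxTriangles);
    const double meshletsPerSecond = trianglesPerSecond / maxTriangles;
    return bytesPerMeshlet * meshletsPerSecond;
}
```

Under these assumptions, 17e9 and 23e9 triangles/sec come out at roughly 113 and 153 GB/s, which is in line with the quoted 120-150 GB/s range.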

This contribution is sponsored by Valve.

zeux added 6 commits February 10, 2026 19:44
This shader can be used to decode meshlets encoded via meshopt_encodeMeshlet.
The implementation is not optimized yet and represents a line-by-line port
of the scalar C++ decoding; it will be improved separately.

Each meshlet is decoded serially in a separate thread. Vertex data is easy
to decode using wave intrinsics; this will be done separately too.

The load mask is easy to construct dynamically based on the length. This is
not faster but it aligns better with the GPU decoding method, and it's the
only remaining function-local static array so we might as well remove it.
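
A minimal sketch of the idea, written in C++ rather than Slang. It assumes the mask selects the low `length` bytes of a 32-bit load with `length` in 0..4; the actual mask layout in the decoder is not shown in this PR text, and both function names are hypothetical.

```cpp
#include <cstdint>

// Table-based version: a function-local static array indexed by length.
inline uint32_t lengthMaskTable(uint32_t length)
{
    static const uint32_t masks[5] = {0x0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu};
    return masks[length];
}

// Computed version: build the same mask from the length alone.
// The shift is done in 64 bits so that length == 4 does not shift a
// 32-bit value by 32, which would be undefined behavior in C++.
inline uint32_t lengthMaskComputed(uint32_t length)
{
    return uint32_t((uint64_t(1) << (length * 8)) - 1);
}
```

Both functions return identical values for lengths 0 through 4, so the table can be dropped without changing results.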

This change replaces the CPU-friendly equivalent, which uses a branch (plus
a simplified version of our branchless CPU code that could generate
branches depending on the support for predicated loads...), with a fully
branchless decoder. While restarts usually happen only in the first
triangle of a meshlet, which keeps the restart branch coherent, meshlets
occasionally have restarts in the middle of the sequence, and these are
rarely aligned between different meshlets.

A branchless sequence seems to never be a regression and usually
results in ~15% better throughput on NV GPUs.
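
To illustrate the general technique, here is a toy C++ sketch that is not the meshoptimizer triangle format. It assumes a made-up strip scheme in which a restart takes three fresh indices and a continuation reuses the previous two; `ToyState`, `stepBranchy`, `selectMask` and `stepBranchless` are all hypothetical names.

```cpp
#include <cstdint>

// Toy decoder state: the last emitted triangle and the next unused index.
struct ToyState
{
    uint32_t a, b, c, next;
};

// Branchy version: threads that disagree on `restart` diverge on a GPU.
inline void stepBranchy(ToyState& s, uint32_t restart)
{
    if (restart & 1u)
    {
        s.a = s.next;
        s.b = s.next + 1;
        s.c = s.next + 2;
        s.next += 3;
    }
    else
    {
        s.a = s.b;
        s.b = s.c;
        s.c = s.next;
        s.next += 1;
    }
}

// Returns x where mask bits are set and y elsewhere.
inline uint32_t selectMask(uint32_t mask, uint32_t x, uint32_t y)
{
    return (x & mask) | (y & ~mask);
}

// Branchless version: both outcomes are computed and blended with a mask.
inline void stepBranchless(ToyState& s, uint32_t restart)
{
    const uint32_t m = 0u - (restart & 1u); // all ones when restart is set
    const uint32_t a = selectMask(m, s.next, s.b);
    const uint32_t b = selectMask(m, s.next + 1, s.c);
    const uint32_t c = selectMask(m, s.next + 2, s.next);
    s.a = a;
    s.b = b;
    s.c = c;
    s.next += 1 + (restart & 1u) * 2;
}
```

Both versions produce the same state for any sequence of flags; the branchless one does slightly more arithmetic per step in exchange for identical control flow across threads.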

Similarly to the CPU SIMD decoding, we can decode triangles in pairs; we
still use the same branchless scalar logic, but this allows us to read
the code byte just once, which helps reduce inefficient cache traffic
and improves performance further by up to 10%.
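
The benefit of pairing can be sketched in C++ as follows. This assumes, purely for illustration, that each code byte holds two 4-bit triangle codes with the low nibble first; the real bit layout is not stated in this PR text, and `codeAt` and `codePairAt` are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>

// One triangle at a time: every call re-reads the byte that holds its code,
// so each byte is fetched twice over the course of the stream.
inline uint32_t codeAt(const uint8_t* codes, size_t triangle)
{
    const uint8_t byte = codes[triangle / 2];
    return (triangle & 1) ? (byte >> 4) : (byte & 15);
}

// Two triangles at a time: a single byte read yields both codes.
inline void codePairAt(const uint8_t* codes, size_t pair, uint32_t& first, uint32_t& second)
{
    const uint8_t byte = codes[pair];
    first = byte & 15;
    second = byte >> 4;
}
```

The decoded codes are the same either way; the paired form halves the number of loads from the code stream.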

Reformatted meshletdec.slang using slangd and added minor clarifications
to the code and the documentation.

Link the Slang example shader for better discovery.
@zeux zeux merged commit d033a68 into master Feb 12, 2026
13 checks passed
@zeux zeux deleted the mlc-slang branch February 12, 2026 17:29