
clusterizer: Implement global meshlet flow#851

Merged
zeux merged 10 commits into master from clflow on Mar 11, 2025


@zeux zeux commented Mar 10, 2025

Up until now, the clusterizer made all decisions about how to continue and restart meshlets based on local information: either the current meshlet (for continuation) or the previous meshlet (for restart). On large meshes, this often resulted in a meandering traversal that left gaps in the mesh. These gaps would need to be filled later; because the gaps had uneven sizes, this could result in disconnected clusters.

This change introduces global flow: the starting triangle for meshlets is now selected to prioritize a particular global traversal of the mesh. This is based on sorting by distance to a specific anchor point (arbitrarily chosen to be the negative corner of the bounding box), as well as by the sum of live counts, which was the restart metric before this change. Both are important: distance sorting forces meshlets to cover gaps in the mesh earlier, which reduces the chance of disconnects, while live sorting results in a much cleaner local meshlet fill.
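As a minimal sketch of the selection described above: the code below computes the anchor as the minimum corner of the bounding box and picks a seed by comparing the live sum first and the distance to the anchor second. The `Vec3` and `Seed` types, the function names, and the lexicographic way of combining the two criteria are assumptions made for illustration; this description does not spell out the exact combination, and this is not the library's code.

```cpp
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <vector>

struct Vec3
{
	float x, y, z;
};

// Hypothetical per-seed record; not meshoptimizer's actual data layout.
struct Seed
{
	unsigned int triangle; // triangle index
	Vec3 center;           // triangle centroid
	unsigned int live;     // sum of live triangle counts over the triangle's three vertices
};

// Anchor point: the negative (minimum) corner of the bounding box.
Vec3 computeAnchor(const std::vector<Vec3>& positions)
{
	assert(!positions.empty());

	Vec3 corner = positions[0];
	for (const Vec3& p : positions)
	{
		corner.x = p.x < corner.x ? p.x : corner.x;
		corner.y = p.y < corner.y ? p.y : corner.y;
		corner.z = p.z < corner.z ? p.z : corner.z;
	}
	return corner;
}

static float distanceTo(const Vec3& a, const Vec3& b)
{
	float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
	return sqrtf(dx * dx + dy * dy + dz * dz);
}

// Returns the position of the best seed in the list.
// Assumed ordering: smaller live sum wins; ties are broken by distance to the anchor.
size_t selectSeed(const std::vector<Seed>& seeds, const Vec3& anchor)
{
	assert(!seeds.empty());

	size_t best = 0;
	float best_distance = distanceTo(seeds[0].center, anchor);

	for (size_t i = 1; i < seeds.size(); ++i)
	{
		float d = distanceTo(seeds[i].center, anchor);

		if (seeds[i].live < seeds[best].live || (seeds[i].live == seeds[best].live && d < best_distance))
		{
			best = i;
			best_distance = d;
		}
	}
	return best;
}
```

Because the seed list is small, a linear scan like this is cheap; a weighted sum of the two terms would be another plausible way to combine them.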

Because of the live sorting, we can't easily use the KD tree (also, since we don't remove nodes from a KD tree, querying the same point would become progressively slower as more and more triangles need to be skipped). This technically turns the sort above into an O(N) operation; if done before every meshlet, the entire process becomes O(N^2) and unusably slow for large meshes.
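A rough cost estimate makes the quadratic behavior concrete. The symbols are introduced here for illustration only: $N$ is the triangle count, $M$ the number of meshlets (proportional to $N$ when the meshlet size is bounded), and $K$ a fixed upper bound on how many candidate triangles are scored per meshlet.

$$
\underbrace{M \cdot O(N)}_{\text{full scan before every meshlet}} = O(N^2)
\qquad \text{versus} \qquad
\underbrace{M \cdot O(K)}_{\text{bounded candidate set}} = O(N)
\quad \text{for } M = \Theta(N),\; K = O(1).
$$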

Instead, we maintain a set of triangle seeds of a limited size, and add a few seeds after finishing every meshlet, minimizing the metric above. Some corners are cut for performance, such as just selecting a single neighbor triangle per vertex and using a simpler replacement logic. The set is re-scored when starting every meshlet; this needs to be done for live triangles as we don't maintain that metric per triangle in an incremental fashion; however, since the set is of a limited size, the entire process stays linear and the performance degradation for meshlet generation is minimal (<1%).
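The bookkeeping for such a bounded seed set can be sketched as follows. The constants `kMaxSeeds` and `kAddSeeds`, the function names, and the flat-array layout are hypothetical; the text above only says the set has a limited size and gains a few seeds per meshlet. The overwrite-the-tail behavior when the set is full follows the last commit message in this PR.

```cpp
#include <assert.h>
#include <stddef.h>
#include <vector>

// Hypothetical limits chosen for illustration.
constexpr size_t kMaxSeeds = 8;
constexpr size_t kAddSeeds = 2;

// Drops seeds whose triangles were already emitted, compacting the array in place.
// Returns the new seed count.
size_t pruneSeeds(unsigned int* seeds, size_t seed_count, const std::vector<unsigned char>& emitted)
{
	size_t kept = 0;
	for (size_t i = 0; i < seed_count; ++i)
		if (!emitted[seeds[i]])
			seeds[kept++] = seeds[i];
	return kept;
}

// Adds the seeds gathered from the meshlet that was just finished.
// If they do not all fit, the last few existing seeds are overwritten instead of
// discarding the new ones. Returns the new seed count.
size_t appendSeeds(unsigned int* seeds, size_t seed_count, const unsigned int* fresh, size_t fresh_count)
{
	assert(seed_count <= kMaxSeeds && fresh_count <= kAddSeeds);

	if (seed_count + fresh_count > kMaxSeeds)
		seed_count = kMaxSeeds - fresh_count;

	for (size_t i = 0; i < fresh_count; ++i)
		seeds[seed_count++] = fresh[i];

	return seed_count;
}
```

Both functions touch at most `kMaxSeeds` entries, so the per-meshlet cost of maintaining and re-scoring the set does not grow with the mesh size.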

This results in a significant improvement in cluster disconnections in various meshlet configurations (note, while 1.0 is the theoretical optimum, on a few of these meshes the mesh has many small disjoint features which makes 1.0 impossible to ever reach):

[image: chart of cluster disconnection results across meshlet configurations]

Reducing cluster splits also results in a small reduction in boundary size (which slightly improves vertex sharing and reduces locked edges in clustered simplification) and occasionally a small reduction in overall meshlet count. Testing the rasterization performance on geometry-dense scenes with various cluster culling optimizations yields a small runtime speedup (3-5% depending on the mesh and GPU; AMD and NV were tested). Raytracing performance seems to be affected to a smaller degree, because the test harness uses a more aggressive "flex" setup than the chart above, but the overall number of meshlets is also slightly reduced because fewer of them need to be split.

This optimization interplays well with the previous optimization (#794): both local and global criteria are crucial to get this right, and future improvements might be possible in both. "prior" is the behavior as of the previous meshoptimizer release (v0.21):

[image: results comparison including the "prior" (v0.21) behavior]

This contribution is sponsored by Valve.

zeux added 10 commits March 6, 2025 11:20
Further changes to clusterizer require analysis for the "stock" raster
cluster configuration as well.
We need a specific point to anchor the overall traversal to, to ensure
the traversal stays local and expands in all directions equally. Currently
we start at an origin in mesh local space and do not control the flow;
it doesn't matter as much *where* we start, but the origin may be
embedded inside the mesh which makes the order less predictable.
To maintain meshlet global flow, we need to start new meshlets in the
order that prioritizes smaller distances to the corner. However, to
maintain local flow we still have to minimize the live_triangle based
scoring we use right now. To avoid high algorithmic complexity we need
to maintain a list of triangles of a limited size that we will use to
evaluate these; for now, just add the triangle that's closest to the
corner to the list.
When the next triangle does not fit into the meshlet, we analyze the
neighbors of each vertex and append the best triangles to the seed list.

This process is not precise: we do not filter out duplicate neighbors
between different vertices, and we do not sort the replaced triangles
perfectly, just taking the first available slot instead.
Whenever the best triangle doesn't fit into the current meshlet we now
prune the seed list and select the best seed triangle according to the
live+distance metric; this replaces the logic for using the meshlet
neighbors. That helps ensure the global flow of meshlets remains
simultaneously optimal from the liveness perspective and clustered
spatially, which reduces the chance of split meshlets down the line.

The integration is currently partial, as the algorithm structure doesn't
lend itself to a single natural place to add this logic; notably, if a
meshlet is split, we actually don't use it for seeding and don't
correctly select the starting triangle for the next meshlet, which
will be addressed in the future.
For now we aren't filtering the seeds from the meshlet precisely, and we
may get duplicates: we select one neighbor triangle per vertex which
means we will likely see the same triangle as a neighbor of another
vertex in the same meshlet.

Using non-strict comparison for replacement mitigates this issue
somewhat, as it allows replacing the triangle with itself instead of
forcing the triangle to occupy another slot. This improves the flow by
giving more options to choose from during selection later.
Instead of using the seeds when the adjacency selection didn't yield a
result, we use the seeds when the meshlet is going to be split; this
covers some cases when adjacency runs out of triangles, and also covers
split_factor based splits, slightly improving the flow for flex variant
as well.
The liveness based scoring is now part of the seed selection, so we no
longer need special cases in getNeighborTriangle.
When the seed list is at capacity, instead of discarding the seeds from
the new meshlet we now replace the last few seeds. This seems to
slightly improve the flow for larger meshes, effectively "compacting"
the list a little at every iteration.
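
The imprecise per-meshlet seed collection described in the commit messages above (take the first available slot, compare non-strictly) might look roughly like the sketch below. `Candidate`, `kCandidateSlots`, the function names, and the convention that a lower score is better are all assumptions made for illustration, not the library's actual code.

```cpp
#include <assert.h>
#include <float.h>
#include <stddef.h>

// Hypothetical number of seeds kept per finished meshlet.
constexpr size_t kCandidateSlots = 4;

struct Candidate
{
	unsigned int triangle; // triangle index, ~0u when the slot is empty
	float score;           // assumed: lower is better
};

// Marks every slot as empty.
void resetCandidates(Candidate (&best)[kCandidateSlots])
{
	for (Candidate& c : best)
		c = {~0u, FLT_MAX};
}

// Places the candidate into the first slot whose score it matches or beats.
// The slots are not kept sorted, so the result is only approximately "top-n".
// The non-strict comparison lets a triangle that is offered twice overwrite
// its own slot instead of spilling into a later one.
bool offerCandidate(Candidate (&best)[kCandidateSlots], Candidate candidate)
{
	for (size_t i = 0; i < kCandidateSlots; ++i)
	{
		if (candidate.score <= best[i].score)
		{
			best[i] = candidate;
			return true;
		}
	}
	return false;
}
```

With a strict `<` in the comparison, offering triangle 10 (score 1), then triangle 20 (score 5), then triangle 10 again would skip slot 0 and overwrite triangle 20 in slot 1, leaving a duplicate and losing a distinct option. Duplicates are still possible when the matching slot is not the first one that qualifies, which is consistent with the mitigation being described as partial.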

zeux commented Mar 10, 2025

For future me: I've experimented with more precise replacement criteria, including 506f03c and a fully precise 4 element top-n; unfortunately, they all improve the results only ever-so-slightly in aggregate, and make results worse on a couple of meshes where it would be nice to not regress. This might be due to some properties of the metric that I don't fully understand, or due to the metric just being suboptimal; it's possible that a better global metric will be able to take better advantage of this, but for now it's probably best to leave this as is even though the replacement is somewhat ad-hoc in certain cases.

@zeux zeux merged commit 7634a68 into master Mar 11, 2025
26 checks passed
@zeux zeux deleted the clflow branch March 11, 2025 15:14