
Speed up HLSL preprocessing and prepared SPIR-V hot paths#1029

Open
AnastaZIuk wants to merge 46 commits into master from unroll

Conversation

@AnastaZIuk (Member) commented Mar 24, 2026

Summary

  • reduce Wave include overhead in the hot HLSL path with an explicit per-session include cache using separate read and write session caches
  • classify builtin and generated include roots by provenance instead of guessing from include spelling
  • teach source-built nsc to accept -isystem so toolchain include roots can be registered explicitly in builtins-off flows
  • keep the prepared-SPIR-V hot-path improvements: a single-entrypoint trim fast path and validation once per unique content hash
  • thread one IGPUPipelineCache through compute, resolve, ImGui, and fullscreen present in the paired EX31 flow
  • update the Examples pointer to the paired Devsh-Graphics-Programming/Nabla-Examples-and-Tests#262
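The "separate read and write session caches" idea from the first bullet can be sketched as follows. This is an illustrative, hypothetical shape (the names IncludeSessionCache, readCache, writeCache, and record are not Nabla's real API): lookups consult an immutable snapshot from a previous session plus the current session's own resolutions, and fresh resolutions are recorded so they can seed the next session's read cache.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of a per-preprocess-session include cache with
// separate read and write sides. Names are illustrative only.
struct IncludeSessionCache {
    using Map = std::unordered_map<std::string, std::string>; // include name -> resolved contents

    const Map* readCache = nullptr; // snapshot from a prior session, never mutated here
    Map writeCache;                 // filled as this session resolves includes

    std::optional<std::string> lookup(const std::string& name) const {
        if (auto it = writeCache.find(name); it != writeCache.end())
            return it->second;
        if (readCache)
            if (auto it = readCache->find(name); it != readCache->end())
                return it->second;
        return std::nullopt; // caller must resolve from disk/generators, then record()
    }

    void record(const std::string& name, std::string contents) {
        writeCache.emplace(name, std::move(contents));
    }
};
```

Keeping the read side immutable during a session means repeated lookups of the same header never race with insertions, and a miss on both sides is the only case that pays for a real filesystem or generator resolution.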

Root cause

Three costs were stacking on top of each other.

First, the preprocess part comes from avoidable HLSL include debt in the hot path:

  • path_tracing/concepts.hlsl on the base branch pulls bxdf/common.hlsl only to synthesize a placeholder interaction for Ray::setInteraction; that edge comes from 4d186db76f
  • member_test_macros.hlsl on the base branch uses the umbrella boost/preprocessor.hpp even though this header only needs a narrow subset; that comes from 72972a9d6e
  • the custom Wave include bridge on this path was introduced in 12afd3d42d, which added the custom Boost.Wave context and include-path classes for the HLSL preprocessor; dxc_compile_flags pragma bookkeeping was layered on later in ae4386064cf. Later merges, cleanup, depfile plumbing, and backports carried the same path forward, but they are not the semantic origin of the extra per-include work

Second, the base include-loader path paid redundant work before preprocessing reached DXC. The current disk-backed include body load path in IShaderCompiler.cpp comes from 5ac3b55552 and later loader reshapes like cc37325f28c. Per-lookup content hashing on that path was added in cf9a866623. The hot include bridge also lacked an explicit notion of builtin/generated include roots, which made toolchain headers harder to classify and cache cleanly.
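The per-lookup hashing cost mentioned above can be avoided by deferring the hash until a caller actually asks for it. The sketch below is hypothetical (IncludeResult and contentHash are illustrative names, and FNV-1a stands in for whatever hash the real loader uses); the point is that session-cache hits never pay for hashing the body.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Illustrative include-lookup result that computes its content hash lazily,
// so hot lookups that never need the hash skip the work entirely.
struct IncludeResult {
    std::string contents;
    mutable std::optional<uint64_t> hash; // computed on first demand

    uint64_t contentHash() const {
        if (!hash) {
            uint64_t h = 1469598103934665603ull; // FNV-1a offset basis (hash choice is illustrative)
            for (unsigned char c : contents) { h ^= c; h *= 1099511628211ull; }
            hash = h;
        }
        return *hash;
    }
};
```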

Third, the pre-fast-path trimmer always validated and walked the incoming module before it could know whether the requested entrypoint set already matched the prepared shader. The old flow is visible in ISPIRVEntryPointTrimmer.cpp#L104-L246. That shape comes from cfb4bd1da6 and 9f3f823124.
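The fast path this PR adds can be reduced to a cheap set comparison before any module walk. The helper below is a simplified sketch, not ISPIRVEntryPointTrimmer's real signature: if the prepared module is already a single-entrypoint shader and that entrypoint set equals the requested one, trimming is a no-op and the validate-and-walk can be skipped.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical pre-check (names illustrative): returns true when trimming
// would change nothing, i.e. the module already exports exactly the one
// requested entrypoint.
inline bool canSkipTrim(const std::set<std::string>& moduleEntryPoints,
                        const std::set<std::string>& requested) {
    return moduleEntryPoints.size() == 1 && moduleEntryPoints == requested;
}
```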

The fullscreen-present helper was introduced in 2b08a15064. In that shape CFullScreenTriangle.cpp#L120 did not yet thread an external pipeline cache, so compute and present could not populate the same cache blob.
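The pipeline-cache threading fix can be pictured with a toy model (this is not Nabla's IGPUPipelineCache API; PipelineCache and createPipeline are stand-ins): once every pipeline creation call receives the same cache object, compiled state from the compute pass becomes visible to resolve, ImGui, and present instead of each helper building against its own private or null cache.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a driver pipeline cache: entries accumulate as pipelines
// are created against it, so later creations can reuse earlier work.
struct PipelineCache {
    std::vector<std::string> entries; // stand-in for cached pipeline blobs
};

struct Pipeline { std::string name; };

// Every creation site takes the shared cache; a real driver would look up
// and merge compiled binaries here rather than just appending a name.
inline Pipeline createPipeline(PipelineCache* cache, const std::string& name) {
    if (cache)
        cache->entries.push_back(name);
    return Pipeline{name};
}
```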

What this changes

  • cache and reuse include resolution results explicitly per preprocess session through separate read and write session caches
  • classify builtin and generated roots when they are registered instead of inferring special treatment from include spelling
  • let nsc accept -isystem and map those roots to system-classified include search paths in source-built flows
  • keep toolchain and generated headers on the fast path without changing the normal "" versus <> search semantics
  • trim token bookkeeping in CWaveStringResolver
  • replace the umbrella Boost include in member_test_macros.hlsl with the narrow Boost headers it actually uses
  • remove redundant public HLSL includes from hot headers and stop pulling bxdf/common.hlsl into path_tracing/concepts.hlsl
  • short-circuit ISPIRVEntryPointTrimmer when the incoming module is already a prepared single-entrypoint shader
  • cache successful validation per unique SPIR-V blob so hot paths keep validation without paying for it again
  • thread an external pipeline cache through FullScreenTriangle so EX31 can share one cache object across compute and present
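The "cache successful validation per unique SPIR-V blob" bullet can be sketched like this. The shape is hypothetical (ValidationCache and validateOnce are illustrative names, and FNV-1a merely stands in for the real content hash): the validator runs once per unique blob content, successes are remembered by hash, and failures are deliberately not cached so a bad blob is re-reported every time.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative content hash; the real code may use a different function.
inline uint64_t fnv1a(const std::vector<uint8_t>& blob) {
    uint64_t h = 1469598103934665603ull;
    for (uint8_t b : blob) { h ^= b; h *= 1099511628211ull; }
    return h;
}

struct ValidationCache {
    std::unordered_set<uint64_t> validated;

    // `validator` stands in for the real SPIR-V validation call.
    template<typename F>
    bool validateOnce(const std::vector<uint8_t>& blob, F&& validator) {
        const uint64_t h = fnv1a(blob);
        if (validated.count(h))
            return true;  // identical content already validated this run
        if (!validator(blob))
            return false; // failures are not cached
        validated.insert(h);
        return true;
    }
};
```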

Validation

Validation was run on AMD Ryzen 5 5600G with Radeon Graphics (6C/12T).

Local Release sweeps of source-built nsc with -P over the current EX31 scene rules, taken from the generated build commands, show:

  • 9 heavy scene rules total
  • min 2.424 s
  • avg 2.503 s
  • max 2.632 s

Local source-built nsc preprocess profiles on the current EX31 heavy sphere rule show:

  • builtins OFF: include_requests=586, include_lookups=316, resolution_cache_skips=270, session_lookup_found=0
  • builtins ON: include_requests=586, include_lookups=234, resolution_cache_skips=352, session_lookup_found=44

The paired EX31 branch builds and runs in RelWithDebInfo with both builtins modes. Current warm-cache validation on the paired branch is:

  • builtins OFF: first_render_submit_ms=1533
  • builtins ON: first_render_submit_ms=1850

Prepared-shader and pipeline-cache validation on the paired EX31 branch is recorded in Devsh-Graphics-Programming/Nabla-Examples-and-Tests#262.

@AnastaZIuk AnastaZIuk changed the title Support EX31 precompiled path tracer fast paths on unroll Reduce HLSL preprocess overhead and speed up prepared SPIR-V hot paths Mar 24, 2026
@AnastaZIuk AnastaZIuk changed the title Reduce HLSL preprocess overhead and speed up prepared SPIR-V hot paths Speed up HLSL preprocessing and prepared SPIR-V hot paths Mar 24, 2026
Comment on lines -683 to +749
-if (auto contents = m_defaultFileSystemLoader->getInclude(requestingSourceDir.string(), lookupName))
-    retVal = std::move(contents);
-else retVal = std::move(trySearchPaths(lookupName));
+if (asset::detail::isGloballyResolvedIncludeName(lookupName))
+{
+    if (auto contents = tryIncludeGenerators(lookupName))
+        retVal = std::move(contents);
+    else if (auto contents = trySearchPaths(lookupName, needHash))
+        retVal = std::move(contents);
+    else
+        retVal = m_defaultFileSystemLoader->getInclude(requestingSourceDir.string(), lookupName, needHash);
+}
+else
+{
+    if (auto contents = m_defaultFileSystemLoader->getInclude(requestingSourceDir.string(), lookupName, needHash))
+        retVal = std::move(contents);
+    else if (auto contents = tryIncludeGenerators(lookupName))
+        retVal = std::move(contents);
+    else
+        retVal = std::move(trySearchPaths(lookupName, needHash));
+}

explain the reason for this change

you shouldn't try different include generators, the include generators should only be reachable with #include <> and not #include ""

Also why should the precedence of a search path and default include loaders change depending on the path ?

