Commit ea5ce04
Re-enable ReadyToRun AOT compilation (#10848)
* feat(build): enable ReadyToRun AOT compilation for Nethermind.Runner
Enable PublishReadyToRun when publishing with a specific RuntimeIdentifier
to improve startup performance. Switch Dockerfiles from -a to -r flag so
the RID is set explicitly for crossgen2. Add CI publish-runner job to
validate R2R builds on linux-x64 and linux-arm64.
* fix(build): drop --locked-mode from RID-specific restore
Passing -r <rid> to dotnet restore --locked-mode fails with NU1004
because the lock file contains all five RuntimeIdentifiers but -r
narrows it to one, causing a mismatch.
* fix(build): restore locked-mode by separating RID from restore
Restore without -r so the lock file's full RuntimeIdentifiers set
matches, then pass -r to publish only where it sets RuntimeIdentifier
for R2R without conflicting with the lock file.
* fix(build): drop --no-restore from publish for R2R runtime packs
crossgen2 needs the target runtime pack which is only fetched when
restoring with RuntimeIdentifier + PublishReadyToRun. Let publish do
its own implicit restore to fetch the RID-specific runtime pack while
the locked-mode restore still validates package integrity.
* feat(benchmarks): upgrade BDN to nightly with R2R toolchain support
Bump BenchmarkDotNet to 0.16.0-nightly.20260310.466 which includes
composite R2R toolchain (dotnet/BenchmarkDotNet#2967). Nightly feeds
are scoped to benchmark projects only via RestoreAdditionalProjectSources
in Directory.Build.props.
* fix(benchmarks): add BDN nightly feeds to nuget.config with source mapping
RestoreAdditionalProjectSources doesn't bypass package source mapping,
so the feeds must be declared in nuget.config. Package source mapping
restricts them to BenchmarkDotNet and Microsoft.Diagnostics.Runtime
packages only — non-benchmark projects are unaffected.
* fix(benchmarks): adapt to BDN 0.16.0 API changes
- Remove obsolete EvaluateOverheadAttribute from LruCache benchmarks
- Replace InProcessNoEmitToolchain.Instance with .Default
* fix(build): restrict NuGet audit to nuget.org only
The dotnet-tools and bdn-nightly feeds don't support vulnerability
auditing, causing NU1900 errors with TreatWarningsAsErrors.
* feat(build): enable composite ReadyToRun for cross-assembly inlining
Composite R2R compiles all assemblies as a single unit, enabling
crossgen2 to inline across assembly boundaries (e.g. Nethermind.Evm
into Nethermind.Core). Tier-1 PGO still recompiles hot paths at
runtime with full optimizations.
* feat(build): enable OptimizationPreference=Speed and GC Large Pages
- OptimizationPreference=Speed tells crossgen2 to favor execution
speed over code size in R2R output
- GC Large Pages (2MB pages) reduces TLB misses for large heaps;
silently falls back to regular pages if OS permissions are missing
* feat(pgo): add static PGO profile infrastructure and collection workflow
- Add PublishReadyToRunMibcPaths to csproj (conditional on file existence)
- Add Dockerfile.pgo that builds without R2R and enables EventPipe for
JIT profile collection
- Add collect-pgo-profile.yml workflow that runs EXPB realblocks,
extracts .nettrace, converts to .mibc via dotnet-pgo, and uploads
as artifact. Runs weekly on Sunday and optionally creates a PR.
* fix(pgo): trigger collection on push to ready-to-run branch
Add push trigger on ready-to-run branch so the workflow runs
automatically on this commit. To be changed to weekly-only before merge.
* fix(pgo): remove auto-PR creation, artifact-only workflow
* fix(pgo): use local image name and add pull_policy:never for EXPB
EXPB tries to docker pull from Docker Hub. Use a local-only image
tag and inject pull_policy:never into the rendered config to prevent
the pull attempt.
* fix(pgo): use publish-docker.yml to build and push PGO image
Use the same publish-docker.yml → Docker Hub route as regular EXPB
benchmarks. Trigger publish-docker with Dockerfile.pgo, wait for it,
then run EXPB against the pushed nethermindeth/nethermind:pgo-collect
image.
* fix(pgo): remove DOTNET_ReadyToRun=0 to fix slow startup
App assemblies are already built without R2R via PublishReadyToRun=false.
Disabling framework R2R caused all .NET runtime code to be JIT-compiled,
making startup too slow for EXPB. Also reduce circular buffer to 1GB.
* fix(pgo): restore full keywords and override EventSource.IsSupported
- Restore MethodDiagnostic and TypeDiagnostic keywords (0x1E000080018)
for richer profile data
- Override EventSource.IsSupported=false from runtimeconfig.json via
DOTNET_System_Diagnostics_Tracing_EventSource_IsSupported=true
(runtimeconfig disables EventSource which blocks EventPipe collection)
* fix(pgo): reduce EventPipe overhead for faster startup
- Reduce circular buffer from 1024MB to 256MB
- Lower trace level from 5 (Verbose) to 4 (Informational), sufficient
for dotnet-pgo method-level profile data
* feat(pgo): add edge/block profiling for branch optimization
Enable DOTNET_WritePGOData to capture Tier-1 edge/block profiling
data alongside EventPipe method traces. The workflow now merges both
sources into the final .mibc via dotnet-pgo merge, giving crossgen2
branch layout optimization data in addition to method-level profiles.
* fix(pgo): build dotnet-pgo from source, not available on NuGet
dotnet-pgo is not published as a global tool. Build it from the
dotnet/runtime repo using sparse checkout to fetch only the tool.
* fix(pgo): tail Nethermind container logs during EXPB run
Add background docker logs tail prefixed with [nethermind] so we can
see what Nethermind is doing while EXPB waits for JSON-RPC availability.
* fix(pgo): default state layout to flat
* fix(pgo): disable GC Large Pages in PGO collection image
GC Large Pages causes a 93 GiB virtual memory reservation that exceeds
the EXPB container's 64GB memory limit. Disable via env var override
since runtimeconfig.json has System.GC.LargePages=true baked in.
* fix(pgo): remove global.json before building dotnet-pgo
The dotnet/runtime global.json requires SDK 11.0 preview. Remove it
so the build uses the installed SDK (10.0).
* fix(pgo): use unique Docker tag per run to avoid cache
Use pgo-$run_id as the tag so each run pushes and pulls a fresh image,
avoiding stale Docker Hub CDN or local layer cache.
* fix(pgo): include dotnet-pgo project dependencies in sparse checkout
dotnet-pgo references ILCompiler.Reflection.ReadyToRun and
ILCompiler.TypeSystem from sibling directories.
* fix(pgo): fix GC env var casing and cap region range to 32 GiB
- Fix DOTNET_gcLargePages to DOTNET_GCLargePages (correct casing)
- Add DOTNET_GCRegionRange=0x800000000 (32 GiB) to cap GC virtual
memory reservation within the 64GB container limit
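The hex value above can be checked with a short calculation. This is only an illustrative sketch: the helper name `region_range_hex` is hypothetical, and it assumes the value is a plain byte count written in hexadecimal, as the `0x800000000 (32 GiB)` pairing in the message suggests.

```python
GIB = 1024 ** 3  # bytes per GiB


def region_range_hex(gib: int) -> str:
    """Format a size in GiB as a hexadecimal byte count (hypothetical helper)."""
    return hex(gib * GIB)


def region_range_gib(value: str) -> int:
    """Convert a hexadecimal byte count back to whole GiB."""
    return int(value, 16) // GIB


print(region_range_hex(32))             # 0x800000000
print(region_range_gib("0x800000000"))  # 32
```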
* fix(pgo): add Microsoft.NETCore.Platforms to sparse checkout for dotnet-pgo build
* fix(pgo): use no-cone sparse checkout with repo build infra for dotnet-pgo
* fix(pgo): use full shallow clone of dotnet/runtime instead of sparse checkout
* fix(pgo): use runtime build.sh to build dotnet-pgo with proper Arcade SDK
Removing global.json broke Arcade SDK resolution, which broke CPM
package version resolution (NU1015). Use the runtime repo's own build
script which handles SDK acquisition and build infrastructure.
* fix(pgo): install required SDK and build only dotnet-pgo project
The clr.tools subset builds ILCompiler which needs native libjitinterface_x64.so.
Instead, install the SDK version from global.json (preserves Arcade SDK + CPM)
and build only the dotnet-pgo csproj directly.
* fix(pgo): narrow trace/jit file search to PGO data directories
Search specific PGO output directories instead of broad RUNNER_TEMP/tmp
scans. Use newest .nettrace file when multiple exist.
* fix(pgo): tolerate partial clr.tools build failure if dotnet-pgo produced
ILCompiler needs native libjitinterface_x64.so which isn't available,
but dotnet-pgo builds successfully before that failure. Allow the build
to fail and check for dotnet-pgo.dll output. Use a wrapper script to
invoke it via dotnet.
* fix(pgo): validate PGO search directories exist before find
Skip missing directories to avoid find errors, and add || true to
prevent set -e from aborting on empty find results.
* fix(pgo): specify explicit nettrace filename pattern in EventPipe output
* fix(pgo): persist PGO data via writable EXPB volume
EXPB removes the container during cleanup, destroying the anonymous
Docker volume with EventPipe/PGO data. Fix by injecting a writable
'pgo' extra_volume in the EXPB config so data persists in the EXPB
output directory. Extract step now reads from there instead of the
dead container.
* fix(pgo): fix YAML syntax error in python3 injection, use sed instead
* fix(pgo): copy entire dotnet-pgo output dir, not just the DLL
The wrapper script failed with missing libhostpolicy.so because only
dotnet-pgo.dll was copied. The runtimeconfig.json, deps.json, and
dependency DLLs are needed for framework-dependent execution.
* fix(pgo): fix PGODataPath to use full file path and build only dotnet-pgo
PGODataPath must be a complete file path, not a directory — the runtime
calls fopen directly on the value, so a directory path fails silently.
Also build only dotnet-pgo.csproj instead of the full clr.tools subset
to avoid the ILCompiler libjitinterface_x64.so error.
* fix(pgo): use InitializeDotNetCli for SDK install, fix PGODataPath
- build.sh --restore doesn't install SDK to .dotnet/; use
eng/common/tools.sh InitializeDotNetCli directly instead
- PGODataPath must be a full file path, not a directory — the runtime
calls fopen directly on the value, so a directory path fails silently
- Build only dotnet-pgo.csproj, not the full clr.tools subset
* fix(pgo): revert to build.sh clr.tools with ILCompiler failure tolerated
InitializeDotNetCli doesn't work outside build.sh context. Go back to
the approach that successfully produced the .mibc artifact: build the
full clr.tools subset and tolerate the ILCompiler failure.
* fix(pgo): use absolute path for --projects in build.sh
Arcade's Build.proj doesn't resolve relative paths from the repo root.
* fix: suppress NU1900 so flaky NuGet feeds don't break builds
NU1900 fires when vulnerability metadata can't be fetched from a
package source. With TreatWarningsAsErrors this breaks the build
when the dotnet-tools Azure DevOps feed is unreachable. Actual
vulnerability findings (NU1901-NU1904) remain errors.
* fix(pgo): don't merge .jit files into .mibc — incompatible formats
dotnet-pgo merge expects .mibc (ZIP/PE) but WritePGOData produces a
JIT-internal text format meant for ReadPGOData. The .jit file is still
uploaded as a raw artifact for optional runtime-level PGO.
* fix(pgo): pass Nethermind assemblies as --reference to dotnet-pgo
Extract assemblies from the PGO Docker image so dotnet-pgo can resolve
generic method signatures against R2R tables, eliminating the 'Unable
to parse' warnings for generics like Dictionary<K,V>, List<T>, etc.
* chore(pgo): bump upload-artifact to v7 (Node.js 24 support)
* feat(pgo): add PGO profile for R2R-guided optimization
Generated from 5000 mainnet blocks via EventPipe profiling. Crossgen2
automatically picks this up via PublishReadyToRunMibcPaths in the
Runner csproj, guiding method layout and inlining in R2R compilation.
* chore(pgo): remove WritePGOData/jit generation
The .jit file is 31MB and only useful for runtime-level JIT PGO via
ReadPGOData. With R2R composite enabled, hot methods are already
AOT-compiled using the .mibc profile — the .jit adds marginal value.
* feat(pgo): run weekly on master and create PR to update .mibc profile
- Remove push trigger on ready-to-run branch
- Keep schedule (Sunday 02:00 UTC) and workflow_dispatch
- Add update-pgo-profile job that downloads the .mibc artifact,
checks for changes, and creates a PR to update the checked-in profile
- Add contents:write and pull-requests:write permissions
* fix: preserve inherited NoWarn in Nethermind.KeyStore.csproj
Was overwriting $(NoWarn) instead of appending, losing the NU1900
suppression from Directory.Build.props.
* feat(pgo): apply .mibc profile to all R2R builds, not just Runner
Move PublishReadyToRunMibcPaths to root Directory.Build.props so any
project publishing with R2R (including BDN benchmarks) picks up the
PGO profile automatically.
* fix: restore --no-restore flag and inline rid variable in Dockerfiles
Address PR review feedback:
- Add back --no-restore to dotnet publish for reproducible builds
- Inline the rid variable directly in the -r argument
* fix: pass RID to dotnet restore for R2R runtime pack resolution
With -r (RuntimeIdentifier), PublishReadyToRun activates and needs the
crossgen2 runtime pack. Restore must know the RID to fetch it.
* feat: add entrypoint script to configure GC region range from cgroup limits
Reads the container memory limit from cgroups v1/v2 and sets
DOTNET_GCRegionRange to 75% of available memory. This ensures large
pages (enabled in runtimeconfig) don't over-commit when the container
has a memory limit. Users can override via DOTNET_GCRegionRange env var.
Chiseled image has no shell so retains the direct entrypoint.
* Revert "feat: add entrypoint script to configure GC region range from cgroup limits"
This reverts commit a517005.
* fix: cap GCRegionRange to 32 GiB in Docker images
Prevents CLR_E_GC_OOM (0x8013200E) crash loops when the .NET runtime
over-estimates available memory during GC initialization. The 32 GiB
cap is virtual address space only (not physical), safe for all
deployment sizes, and overridable via -e DOTNET_GCRegionRange.
* fix: don't pass -r to dotnet restore to preserve lock file compatibility
Passing -r to restore changes the RID set, breaking --locked-mode
(NU1004). Restore without RID, let publish handle runtime-specific
package resolution. Drops --no-restore since publish needs to resolve
the R2R crossgen2 runtime pack for the target RID.
* fix: revert to master's -a flag for dotnet publish
-r breaks the lock file RID set (NU1004). Master's -a $arch works
correctly with --locked-mode and still triggers R2R with .mibc profile.
* fix: use -r for R2R publish, let publish resolve crossgen2 runtime pack
-a with --no-restore silently skips R2R because the crossgen2 runtime
pack is never downloaded. Use -r without --no-restore so publish
handles the RID-specific restore. The initial dotnet restore
--locked-mode pins NuGet packages; publish adds only the runtime pack.
* fix: two-phase restore for locked packages + R2R runtime pack
First restore validates NuGet packages against the lock file.
Second restore with -r fetches the crossgen2 runtime pack for R2R.
Publish uses --no-restore since everything is already resolved.
* feat: per-RID lock files for reproducible R2R Docker builds
Generate platform-specific lock files (packages.lock.linux-x64.json,
packages.lock.linux-arm64.json) so dotnet restore -r can run in
--locked-mode with the crossgen2 runtime pack included. Single restore
+ --no-restore on publish for full reproducibility.
* fix: copy per-RID lock file to packages.lock.json before restore
dotnet restore doesn't forward -p:NuGetLockFilePath to NuGet's
restore target. Copy the platform-specific lock file over the
default name so --locked-mode reads it correctly.
* fix(pgo): move MibcPaths to Directory.Build.targets for correct eval order
PublishReadyToRunMibcPaths in Directory.Build.props was evaluated
before the csproj set PublishReadyToRun=true, so the condition never
matched. Move to Directory.Build.targets which runs after the csproj.
Confirmed crossgen2 receives the profile via PublishReadyToRunMibcPaths.
* fix: revert to locked restore + publish with implicit R2R restore
Per-RID lock files don't work cross-platform — generating on Windows
doesn't produce Linux RID sections. Use the simple approach: locked
restore for NuGet packages, then publish -r handles crossgen2 runtime
pack via its own implicit restore.
* fix: two-phase restore for locked NuGet + deterministic R2R runtime pack
1. dotnet restore --locked-mode: pins NuGet packages to lock file
2. dotnet restore -r: adds crossgen2 runtime pack (deterministic from
the SDK version, which is pinned by Docker image digest)
3. dotnet publish --no-restore: no network access, fully reproducible
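As a rough sketch, the three steps above could be assembled as command lines like this. The project path, the `-c release` configuration, and the helper name are assumptions for illustration; only `--locked-mode`, `-r`, and `--no-restore` come from the message, and the actual Dockerfiles may pass additional properties.

```python
import shlex
from typing import List


def two_phase_restore_commands(project: str, rid: str) -> List[List[str]]:
    """Build the restore/restore/publish sequence (hypothetical helper)."""
    return [
        # 1. Validate NuGet packages against the lock file, with no RID.
        ["dotnet", "restore", project, "--locked-mode"],
        # 2. RID-specific restore so the crossgen2 runtime pack is fetched.
        ["dotnet", "restore", project, "-r", rid],
        # 3. Publish without touching the network again.
        ["dotnet", "publish", project, "-c", "release", "-r", rid, "--no-restore"],
    ]


# "src/Nethermind/Nethermind.Runner" is an assumed path, used only as an example.
for cmd in two_phase_restore_commands("src/Nethermind/Nethermind.Runner", "linux-x64"):
    print(shlex.join(cmd))
```

Keeping the RID out of the first command is the point of the split: the locked restore sees the lock file's full RuntimeIdentifiers set, and only the second restore narrows to one RID.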
* fix: copy Directory.Build.targets into Docker build context
The new root Directory.Build.targets (for PGO .mibc path) wasn't
included in the Docker COPY commands, causing MSB4019.
* fix: pass PublishReadyToRun=true to RID restore for crossgen2 pack
dotnet restore -r sets RuntimeIdentifiers (plural) but the csproj
condition checks RuntimeIdentifier (singular), so PublishReadyToRun
isn't activated and the crossgen2 runtime pack isn't fetched.
* fix: address PR review feedback
- Fix update-pgo-profile job: needs: collect-pgo → needs: collect
- Filter publish-docker run discovery by head_sha to avoid matching
unrelated runs
- Copy Directory.Build.targets into Dockerfile.diag build context
* fix(pgo): pin EXPB and dotnet/runtime versions for reproducibility
- Pin EXPB to commit 56f83b1 (no releases/tags available)
- Pin dotnet/runtime to v10.0.5 tag matching our SDK 10.0.201 and
aspnet:10.0.5 Docker base images
* fix: address PR review feedback (round 3)
- Dockerfile.pgo: add two-phase restore with --no-restore on publish
- Log tail: use exec + process group kill to reliably terminate
docker logs -f (subshell PID alone didn't stop the child process)
- build-solutions.yml: mirror two-phase restore approach with
--no-restore on publish for deterministic CI builds
* feat: auto-detect hugepages and enable GC Large Pages via entrypoint
Remove System.GC.LargePages from runtimeconfig — it's now enabled
dynamically by the entrypoint script only when hugepages are available
(/sys/kernel/mm/hugepages free_hugepages > 0).
When enabled, GCRegionRange is capped to min(32 GiB, 75% of container
memory limit) to prevent CLR_E_GC_OOM. All env vars are skipped if
already set by the user.
Chiseled image has no shell so large pages must be opted-in manually
via -e DOTNET_GCLargePages=1 -e DOTNET_GCRegionRange=0x800000000.
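The hugepage check described above might look roughly like the following. The real entrypoint is a shell script, so this Python version only illustrates the logic; the `hugepages-2048kB` subdirectory and the function names are assumptions based on the usual Linux sysfs layout for 2 MiB pages.

```python
from pathlib import Path


def free_2mib_hugepages(root: str = "/sys/kernel/mm/hugepages") -> int:
    """Return the number of free 2 MiB hugepages, or 0 if unavailable."""
    counter = Path(root) / "hugepages-2048kB" / "free_hugepages"
    try:
        return int(counter.read_text().strip())
    except (OSError, ValueError):
        # Missing sysfs entry or unreadable content: treat as no hugepages.
        return 0


def should_enable_large_pages(free_pages: int, user_env: dict) -> bool:
    """Enable only when hugepages exist and the user has not set the variable."""
    return free_pages > 0 and "DOTNET_GCLargePages" not in user_env


print(should_enable_large_pages(512, {}))                            # True
print(should_enable_large_pages(512, {"DOTNET_GCLargePages": "0"}))  # False
print(should_enable_large_pages(0, {}))                              # False
```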
* fix: address PR review feedback
- Align download-artifact to v7 matching upload-artifact v7
- Remove redundant second restore in Dockerfile.pgo (PublishReadyToRun=false
doesn't need crossgen2 runtime pack, so RID-specific restore is a no-op)
* fix: restrict second restore to offline sources for reproducibility
The RID-specific restore for the crossgen2 runtime pack now only
resolves from local sources (SDK library-packs + NuGet package cache)
to ensure no unexpected packages are pulled from the network. All
NuGet dependencies are already pinned by the first --locked-mode
restore; the second restore only needs the SDK-bundled runtime pack.
* fix: add SDK packs directory to offline restore sources
The crossgen2 runtime pack lives in /usr/share/dotnet/packs/, not just
library-packs. Also add DOTNET_ROOT/packs for CI where the SDK may be
in a different location.
* fix: revert offline restore — PackageSourceMapping blocks unnamed sources
NuGet's PackageSourceMapping ignores --source paths that aren't named
in nuget.config, causing NU1100 for the crossgen2 runtime pack. The
pack is deterministic from the SDK version (pinned by Docker image
digest), so standard restore is reproducible without offline sources.
* fix: include available memory in GCRegionRange calculation
Use min(32 GiB, available_memory * 75%, cgroup_limit * 75%) to cap
the GC region range. Previously only checked cgroup limits, which
misses bare-metal/VM deployments where memory may be limited but
no cgroup is configured.
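A minimal sketch of the min(32 GiB, available * 75%, cgroup_limit * 75%) rule at this point in the history is shown below. The parsing helpers are hypothetical; they assume the cgroup v2 convention that an unlimited `memory.max` reads `max`, and that `/proc/meminfo` reports `MemAvailable` in kB.

```python
from typing import Optional

GIB = 1024 ** 3


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Parse a cgroup memory limit; None means unlimited (cgroup v2 'max')."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_mem_available(meminfo: str) -> Optional[int]:
    """Extract MemAvailable from /proc/meminfo text, converted to bytes."""
    for line in meminfo.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    return None


def region_range_cap(available: Optional[int], cgroup_limit: Optional[int],
                     hard_cap: int = 32 * GIB) -> int:
    """min(hard cap, 75% of available memory, 75% of cgroup limit)."""
    candidates = [hard_cap]
    for limit in (available, cgroup_limit):
        if limit is not None:
            candidates.append(limit * 3 // 4)
    return min(candidates)


print(region_range_cap(16 * GIB, None) // GIB)      # 12
print(region_range_cap(64 * GIB, 8 * GIB) // GIB)   # 6
print(region_range_cap(256 * GIB, None) // GIB)     # 32
```

Later entries in this log replace the 75% factor and drop the 32 GiB cap, so this reflects only the rule as stated in this step.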
* fix: use 37.5% for GCRegionRange to account for large pages double-commit
The GC has a known bug where large pages cause ~2x the region range to
be committed (dotnet/runtime#103203). Since large pages can't be lazily
committed, GCRegionRange becomes actual physical memory. Use 37.5%
(half of 75%) so the effective commit stays within 75% of available
memory after the GC doubles it.
* fix: cap GCRegionRange to available hugepages
Large pages are unpageable — mmap with MAP_HUGETLB fails if we request
more than the host has pre-allocated. Cap the region range to
free_hugepages * 2 MiB to ensure the allocation can succeed.
* fix: skip large pages if effective region range is too small
Don't enable large pages if the calculated GCRegionRange would be
below 4 GiB — the GC heap would be too constrained and regular pages
with lazy commit give better flexibility. The 4 GiB threshold means
~8 GiB committed (due to 2x bug) which is roughly a 16 GiB minimum
system for large pages to activate.
* fix: use 75% for bare metal, 37.5% only when cgroup limit is set
The 2x double-commit bug only matters when a cgroup memory limit is
enforced (OOM-killer counts committed pages). On bare metal, Linux
overcommit handles the wasted commit harmlessly.
48 GB available bare metal: 48 * 75% = 36 GiB region range (usable heap)
48 GB container with limit: 48 * 37.5% = 18 GiB region range (safe from OOM-kill)
* fix: always use 37.5% — hugepages are unpageable, no overcommit
Hugepages are pinned physical RAM with no overcommit or swap fallback.
The 2x double-commit must physically fit in available memory or mmap
fails. Use 37.5% unconditionally so region_range * 2 <= 75% of memory.
48 GB available: 37.5% = 18 GiB range, 36 GiB committed, 12 GiB headroom
* fix: raise large pages minimum to 18 GiB region range
Nethermind needs ~18 GiB usable GC heap for state caching. With the
2x double-commit this requires ~48 GiB available memory. Don't enable
large pages on smaller systems — regular pages with lazy commit give
better flexibility when memory is constrained.
* fix: remove 32 GiB cap — 37.5% rule and hugepage check are sufficient
The cap was a safety net for the OOM crash. With 37.5% of available
memory, 18 GiB minimum, and free_hugepages cap, the allocation is
guaranteed to fit. Removing the cap lets large-memory machines (128+
GiB) use more GC heap.
128 GiB available: 37.5% = 48 GiB usable heap (was capped at 32 GiB)
256 GiB available: 37.5% = 96 GiB usable heap
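Putting the final rules together (37.5% of available memory, capped to free hugepages, 18 GiB minimum, no 32 GiB cap), the calculation can be sketched as follows. The function name is hypothetical, and treating the 18 GiB threshold as inclusive is an assumption inferred from the 48 GiB example earlier in the log.

```python
from typing import Optional

GIB = 1024 ** 3
HUGEPAGE_BYTES = 2 * 1024 ** 2   # 2 MiB pages, as stated in the message
MIN_REGION_RANGE = 18 * GIB      # minimum usable heap for large pages


def large_page_region_range(available_bytes: int, free_hugepages: int) -> Optional[int]:
    """Return the GC region range in bytes, or None if large pages stay off."""
    by_memory = available_bytes * 3 // 8          # 37.5%, leaving room for the 2x commit
    by_hugepages = free_hugepages * HUGEPAGE_BYTES
    region = min(by_memory, by_hugepages)
    # Assumption: a range of exactly 18 GiB is accepted (inclusive threshold).
    return region if region >= MIN_REGION_RANGE else None


plenty = 10 ** 6  # enough free hugepages that memory is the limiting factor
print(large_page_region_range(128 * GIB, plenty) // GIB)  # 48
print(large_page_region_range(256 * GIB, plenty) // GIB)  # 96
print(large_page_region_range(48 * GIB, plenty) // GIB)   # 18
print(large_page_region_range(32 * GIB, plenty))          # None
```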
* fix: set execute permission on entrypoint.sh
* feat: add Nethermind container log tailing to benchmark workflow
Background-tail the Nethermind container logs during EXPB runs for
visibility into startup failures, exceptions, and shutdown behavior.
Matches the pattern used in collect-pgo-profile.yml.
* fix: stream container logs live instead of buffering until end
Save stdout to fd3 before expb redirects to file. Container log tail
writes to fd3 so logs appear progressively in GitHub Actions output
while expb output is captured separately for analysis.
1 parent db2ef91, commit ea5ce04
File tree
22 files changed, +661 -15 lines changed
- .github/workflows
- scripts
- src/Nethermind
- Nethermind.Benchmark/Core
- Nethermind.Evm.Benchmark
- Nethermind.KeyStore
- Nethermind.Runner
- pgo