Commit ea5ce04

Re-enable ReadyToRun AOT compilation (#10848)
* feat(build): enable ReadyToRun AOT compilation for Nethermind.Runner
  Enable PublishReadyToRun when publishing with a specific RuntimeIdentifier to improve startup performance. Switch Dockerfiles from -a to -r flag so the RID is set explicitly for crossgen2. Add CI publish-runner job to validate R2R builds on linux-x64 and linux-arm64.
* fix(build): drop --locked-mode from RID-specific restore
  Passing -r <rid> to dotnet restore --locked-mode fails with NU1004 because the lock file contains all five RuntimeIdentifiers but -r narrows it to one, causing a mismatch.
* fix(build): restore locked-mode by separating RID from restore
  Restore without -r so the lock file's full RuntimeIdentifiers set matches, then pass -r to publish only where it sets RuntimeIdentifier for R2R without conflicting with the lock file.
* fix(build): drop --no-restore from publish for R2R runtime packs
  crossgen2 needs the target runtime pack which is only fetched when restoring with RuntimeIdentifier + PublishReadyToRun. Let publish do its own implicit restore to fetch the RID-specific runtime pack while the locked-mode restore still validates package integrity.
* feat(benchmarks): upgrade BDN to nightly with R2R toolchain support
  Bump BenchmarkDotNet to 0.16.0-nightly.20260310.466 which includes composite R2R toolchain (dotnet/BenchmarkDotNet#2967). Nightly feeds are scoped to benchmark projects only via RestoreAdditionalProjectSources in Directory.Build.props.
* fix(benchmarks): add BDN nightly feeds to nuget.config with source mapping
  RestoreAdditionalProjectSources doesn't bypass package source mapping, so the feeds must be declared in nuget.config. Package source mapping restricts them to BenchmarkDotNet and Microsoft.Diagnostics.Runtime packages only, so non-benchmark projects are unaffected.
* fix(benchmarks): adapt to BDN 0.16.0 API changes
  - Remove obsolete EvaluateOverheadAttribute from LruCache benchmarks
  - Replace InProcessNoEmitToolchain.Instance with .Default
* fix(build): restrict NuGet audit to nuget.org only
  The dotnet-tools and bdn-nightly feeds don't support vulnerability auditing, causing NU1900 errors with TreatWarningsAsErrors.
* feat(build): enable composite ReadyToRun for cross-assembly inlining
  Composite R2R compiles all assemblies as a single unit, enabling crossgen2 to inline across assembly boundaries (e.g. Nethermind.Evm into Nethermind.Core). Tier-1 PGO still recompiles hot paths at runtime with full optimizations.
* feat(build): enable OptimizationPreference=Speed and GC Large Pages
  - OptimizationPreference=Speed tells crossgen2 to favor execution speed over code size in R2R output
  - GC Large Pages (2MB pages) reduces TLB misses for large heaps; silently falls back to regular pages if OS permissions are missing
* feat(pgo): add static PGO profile infrastructure and collection workflow
  - Add PublishReadyToRunMibcPaths to csproj (conditional on file existence)
  - Add Dockerfile.pgo that builds without R2R and enables EventPipe for JIT profile collection
  - Add collect-pgo-profile.yml workflow that runs EXPB realblocks, extracts .nettrace, converts to .mibc via dotnet-pgo, and uploads as artifact. Runs weekly on Sunday and optionally creates a PR.
* fix(pgo): trigger collection on push to ready-to-run branch
  Add push trigger on ready-to-run branch so the workflow runs automatically on this commit. To be changed to weekly-only before merge.
* fix(pgo): remove auto-PR creation, artifact-only workflow
* fix(pgo): use local image name and add pull_policy:never for EXPB
  EXPB tries to docker pull from Docker Hub. Use a local-only image tag and inject pull_policy:never into the rendered config to prevent the pull attempt.
* fix(pgo): use publish-docker.yml to build and push PGO image
  Use the same publish-docker.yml → Docker Hub route as regular EXPB benchmarks. Trigger publish-docker with Dockerfile.pgo, wait for it, then run EXPB against the pushed nethermindeth/nethermind:pgo-collect image.
* fix(pgo): remove DOTNET_ReadyToRun=0 to fix slow startup
  App assemblies are already built without R2R via PublishReadyToRun=false. Disabling framework R2R caused all .NET runtime code to be JIT-compiled, making startup too slow for EXPB. Also reduce circular buffer to 1GB.
* fix(pgo): restore full keywords and override EventSource.IsSupported
  - Restore MethodDiagnostic and TypeDiagnostic keywords (0x1E000080018) for richer profile data
  - Override EventSource.IsSupported=false from runtimeconfig.json via DOTNET_System_Diagnostics_Tracing_EventSource_IsSupported=true (runtimeconfig disables EventSource which blocks EventPipe collection)
* fix(pgo): reduce EventPipe overhead for faster startup
  - Reduce circular buffer from 1024MB to 256MB
  - Lower trace level from 5 (Verbose) to 4 (Informational), sufficient for dotnet-pgo method-level profile data
* feat(pgo): add edge/block profiling for branch optimization
  Enable DOTNET_WritePGOData to capture Tier-1 edge/block profiling data alongside EventPipe method traces. The workflow now merges both sources into the final .mibc via dotnet-pgo merge, giving crossgen2 branch layout optimization data in addition to method-level profiles.
* fix(pgo): build dotnet-pgo from source, not available on NuGet
  dotnet-pgo is not published as a global tool. Build it from the dotnet/runtime repo using sparse checkout to fetch only the tool.
* fix(pgo): tail Nethermind container logs during EXPB run
  Add background docker logs tail prefixed with [nethermind] so we can see what Nethermind is doing while EXPB waits for JSON-RPC availability.
* fix(pgo): default state layout to flat
* fix(pgo): disable GC Large Pages in PGO collection image
  GC Large Pages causes a 93 GiB virtual memory reservation that exceeds the EXPB container's 64GB memory limit. Disable via env var override since runtimeconfig.json has System.GC.LargePages=true baked in.
* fix(pgo): remove global.json before building dotnet-pgo
  The dotnet/runtime global.json requires SDK 11.0 preview. Remove it so the build uses the installed SDK (10.0).
* fix(pgo): use unique Docker tag per run to avoid cache
  Use pgo-$run_id as the tag so each run pushes and pulls a fresh image, avoiding stale Docker Hub CDN or local layer cache.
* fix(pgo): include dotnet-pgo project dependencies in sparse checkout
  dotnet-pgo references ILCompiler.Reflection.ReadyToRun and ILCompiler.TypeSystem from sibling directories.
* fix(pgo): fix GC env var casing and cap region range to 32 GiB
  - Fix DOTNET_gcLargePages to DOTNET_GCLargePages (correct casing)
  - Add DOTNET_GCRegionRange=0x800000000 (32 GiB) to cap GC virtual memory reservation within the 64GB container limit
* fix(pgo): add Microsoft.NETCore.Platforms to sparse checkout for dotnet-pgo build
* fix(pgo): use no-cone sparse checkout with repo build infra for dotnet-pgo
* fix(pgo): use full shallow clone of dotnet/runtime instead of sparse checkout
* fix(pgo): use runtime build.sh to build dotnet-pgo with proper Arcade SDK
  Removing global.json broke Arcade SDK resolution, which broke CPM package version resolution (NU1015). Use the runtime repo's own build script which handles SDK acquisition and build infrastructure.
* fix(pgo): install required SDK and build only dotnet-pgo project
  The clr.tools subset builds ILCompiler which needs native libjitinterface_x64.so. Instead, install the SDK version from global.json (preserves Arcade SDK + CPM) and build only the dotnet-pgo csproj directly.
* fix(pgo): narrow trace/jit file search to PGO data directories
  Search specific PGO output directories instead of broad RUNNER_TEMP/tmp scans. Use newest .nettrace file when multiple exist.
* fix(pgo): tolerate partial clr.tools build failure if dotnet-pgo produced
  ILCompiler needs native libjitinterface_x64.so which isn't available, but dotnet-pgo builds successfully before that failure. Allow the build to fail and check for dotnet-pgo.dll output. Use a wrapper script to invoke it via dotnet.
* fix(pgo): validate PGO search directories exist before find
  Skip missing directories to avoid find errors, and add || true to prevent set -e from aborting on empty find results.
* fix(pgo): specify explicit nettrace filename pattern in EventPipe output
* fix(pgo): persist PGO data via writable EXPB volume
  EXPB removes the container during cleanup, destroying the anonymous Docker volume with EventPipe/PGO data. Fix by injecting a writable 'pgo' extra_volume in the EXPB config so data persists in the EXPB output directory. Extract step now reads from there instead of the dead container.
* fix(pgo): fix YAML syntax error in python3 injection, use sed instead
* fix(pgo): copy entire dotnet-pgo output dir, not just the DLL
  The wrapper script failed with missing libhostpolicy.so because only dotnet-pgo.dll was copied. The runtimeconfig.json, deps.json, and dependency DLLs are needed for framework-dependent execution.
* fix(pgo): fix PGODataPath to use full file path and build only dotnet-pgo
  PGODataPath must be a complete file path, not a directory: the runtime calls fopen directly on the value, so a directory path fails silently. Also build only dotnet-pgo.csproj instead of the full clr.tools subset to avoid the ILCompiler libjitinterface_x64.so error.
* fix(pgo): use InitializeDotNetCli for SDK install, fix PGODataPath
  - build.sh --restore doesn't install SDK to .dotnet/; use eng/common/tools.sh InitializeDotNetCli directly instead
  - PGODataPath must be a full file path, not a directory: the runtime calls fopen directly on the value, so a directory path fails silently
  - Build only dotnet-pgo.csproj, not the full clr.tools subset
* fix(pgo): revert to build.sh clr.tools with ILCompiler failure tolerated
  InitializeDotNetCli doesn't work outside build.sh context. Go back to the approach that successfully produced the .mibc artifact: build the full clr.tools subset and tolerate the ILCompiler failure.
* fix(pgo): use absolute path for --projects in build.sh
  Arcade's Build.proj doesn't resolve relative paths from the repo root.
* fix: suppress NU1900 so flaky NuGet feeds don't break builds
  NU1900 fires when vulnerability metadata can't be fetched from a package source. With TreatWarningsAsErrors this breaks the build when the dotnet-tools Azure DevOps feed is unreachable. Actual vulnerability findings (NU1901-NU1904) remain errors.
* fix(pgo): don't merge .jit files into .mibc (incompatible formats)
  dotnet-pgo merge expects .mibc (ZIP/PE) but WritePGOData produces a JIT-internal text format meant for ReadPGOData. The .jit file is still uploaded as a raw artifact for optional runtime-level PGO.
* fix(pgo): pass Nethermind assemblies as --reference to dotnet-pgo
  Extract assemblies from the PGO Docker image so dotnet-pgo can resolve generic method signatures against R2R tables, eliminating the 'Unable to parse' warnings for generics like Dictionary<K,V>, List<T>, etc.
* chore(pgo): bump upload-artifact to v7 (Node.js 24 support)
* feat(pgo): add PGO profile for R2R-guided optimization
  Generated from 5000 mainnet blocks via EventPipe profiling. Crossgen2 automatically picks this up via PublishReadyToRunMibcPaths in the Runner csproj, guiding method layout and inlining in R2R compilation.
* chore(pgo): remove WritePGOData/jit generation
  The .jit file is 31MB and only useful for runtime-level JIT PGO via ReadPGOData. With R2R composite enabled, hot methods are already AOT-compiled using the .mibc profile; the .jit adds marginal value.
* feat(pgo): run weekly on master and create PR to update .mibc profile
  - Remove push trigger on ready-to-run branch
  - Keep schedule (Sunday 02:00 UTC) and workflow_dispatch
  - Add update-pgo-profile job that downloads the .mibc artifact, checks for changes, and creates a PR to update the checked-in profile
  - Add contents:write and pull-requests:write permissions
* fix: preserve inherited NoWarn in Nethermind.KeyStore.csproj
  Was overwriting $(NoWarn) instead of appending, losing the NU1900 suppression from Directory.Build.props.
* feat(pgo): apply .mibc profile to all R2R builds, not just Runner
  Move PublishReadyToRunMibcPaths to root Directory.Build.props so any project publishing with R2R (including BDN benchmarks) picks up the PGO profile automatically.
* fix: restore --no-restore flag and inline rid variable in Dockerfiles
  Address PR review feedback:
  - Add back --no-restore to dotnet publish for reproducible builds
  - Inline the rid variable directly in the -r argument
* fix: pass RID to dotnet restore for R2R runtime pack resolution
  With -r (RuntimeIdentifier), PublishReadyToRun activates and needs the crossgen2 runtime pack. Restore must know the RID to fetch it.
* feat: add entrypoint script to configure GC region range from cgroup limits
  Reads the container memory limit from cgroups v1/v2 and sets DOTNET_GCRegionRange to 75% of available memory. This ensures large pages (enabled in runtimeconfig) don't over-commit when the container has a memory limit. Users can override via DOTNET_GCRegionRange env var. Chiseled image has no shell so retains the direct entrypoint.
* Revert "feat: add entrypoint script to configure GC region range from cgroup limits"
  This reverts commit a517005.
* fix: cap GCRegionRange to 32 GiB in Docker images
  Prevents CLR_E_GC_OOM (0x8013200E) crash loops when the .NET runtime over-estimates available memory during GC initialization. The 32 GiB cap is virtual address space only (not physical), safe for all deployment sizes, and overridable via -e DOTNET_GCRegionRange.
* fix: don't pass -r to dotnet restore to preserve lock file compatibility
  Passing -r to restore changes the RID set, breaking --locked-mode (NU1004). Restore without RID, let publish handle runtime-specific package resolution. Drops --no-restore since publish needs to resolve the R2R crossgen2 runtime pack for the target RID.
* fix: revert to master's -a flag for dotnet publish
  -r breaks the lock file RID set (NU1004). Master's -a $arch works correctly with --locked-mode and still triggers R2R with .mibc profile.
* fix: use -r for R2R publish, let publish resolve crossgen2 runtime pack
  -a with --no-restore silently skips R2R because the crossgen2 runtime pack is never downloaded. Use -r without --no-restore so publish handles the RID-specific restore. The initial dotnet restore --locked-mode pins NuGet packages; publish adds only the runtime pack.
* fix: two-phase restore for locked packages + R2R runtime pack
  First restore validates NuGet packages against the lock file. Second restore with -r fetches the crossgen2 runtime pack for R2R. Publish uses --no-restore since everything is already resolved.
* feat: per-RID lock files for reproducible R2R Docker builds
  Generate platform-specific lock files (packages.lock.linux-x64.json, packages.lock.linux-arm64.json) so dotnet restore -r can run in --locked-mode with the crossgen2 runtime pack included. Single restore + --no-restore on publish for full reproducibility.
* fix: copy per-RID lock file to packages.lock.json before restore
  dotnet restore doesn't forward -p:NuGetLockFilePath to NuGet's restore target. Copy the platform-specific lock file over the default name so --locked-mode reads it correctly.
* fix(pgo): move MibcPaths to Directory.Build.targets for correct eval order
  PublishReadyToRunMibcPaths in Directory.Build.props was evaluated before the csproj set PublishReadyToRun=true, so the condition never matched. Move to Directory.Build.targets which runs after the csproj. Confirmed crossgen2 receives the profile via PublishReadyToRunMibcPaths.
* fix: revert to locked restore + publish with implicit R2R restore
  Per-RID lock files don't work cross-platform: generating on Windows doesn't produce Linux RID sections. Use the simple approach: locked restore for NuGet packages, then publish -r handles crossgen2 runtime pack via its own implicit restore.
* fix: two-phase restore for locked NuGet + deterministic R2R runtime pack
  1. dotnet restore --locked-mode: pins NuGet packages to lock file
  2. dotnet restore -r: adds crossgen2 runtime pack (deterministic from the SDK version, which is pinned by Docker image digest)
  3. dotnet publish --no-restore: no network access, fully reproducible
* fix: copy Directory.Build.targets into Docker build context
  The new root Directory.Build.targets (for PGO .mibc path) wasn't included in the Docker COPY commands, causing MSB4019.
* fix: pass PublishReadyToRun=true to RID restore for crossgen2 pack
  dotnet restore -r sets RuntimeIdentifiers (plural) but the csproj condition checks RuntimeIdentifier (singular), so PublishReadyToRun isn't activated and the crossgen2 runtime pack isn't fetched.
* fix: address PR review feedback
  - Fix update-pgo-profile job: needs: collect-pgo → needs: collect
  - Filter publish-docker run discovery by head_sha to avoid matching unrelated runs
  - Copy Directory.Build.targets into Dockerfile.diag build context
* fix(pgo): pin EXPB and dotnet/runtime versions for reproducibility
  - Pin EXPB to commit 56f83b1 (no releases/tags available)
  - Pin dotnet/runtime to v10.0.5 tag matching our SDK 10.0.201 and aspnet:10.0.5 Docker base images
* fix: address PR review feedback (round 3)
  - Dockerfile.pgo: add two-phase restore with --no-restore on publish
  - Log tail: use exec + process group kill to reliably terminate docker logs -f (subshell PID alone didn't stop the child process)
  - build-solutions.yml: mirror two-phase restore approach with --no-restore on publish for deterministic CI builds
* feat: auto-detect hugepages and enable GC Large Pages via entrypoint
  Remove System.GC.LargePages from runtimeconfig; it's now enabled dynamically by the entrypoint script only when hugepages are available (/sys/kernel/mm/hugepages free_hugepages > 0). When enabled, GCRegionRange is capped to min(32 GiB, 75% of container memory limit) to prevent CLR_E_GC_OOM. All env vars are skipped if already set by the user. Chiseled image has no shell so large pages must be opted-in manually via -e DOTNET_GCLargePages=1 -e DOTNET_GCRegionRange=0x800000000.
* fix: address PR review feedback
  - Align download-artifact to v7 matching upload-artifact v7
  - Remove redundant second restore in Dockerfile.pgo (PublishReadyToRun=false doesn't need crossgen2 runtime pack, so RID-specific restore is a no-op)
* fix: restrict second restore to offline sources for reproducibility
  The RID-specific restore for the crossgen2 runtime pack now only resolves from local sources (SDK library-packs + NuGet package cache) to ensure no unexpected packages are pulled from the network. All NuGet dependencies are already pinned by the first --locked-mode restore; the second restore only needs the SDK-bundled runtime pack.
* fix: add SDK packs directory to offline restore sources
  The crossgen2 runtime pack lives in /usr/share/dotnet/packs/, not just library-packs. Also add DOTNET_ROOT/packs for CI where the SDK may be in a different location.
* fix: revert offline restore, PackageSourceMapping blocks unnamed sources
  NuGet's PackageSourceMapping ignores --source paths that aren't named in nuget.config, causing NU1100 for the crossgen2 runtime pack. The pack is deterministic from the SDK version (pinned by Docker image digest), so standard restore is reproducible without offline sources.
* fix: include available memory in GCRegionRange calculation
  Use min(32 GiB, available_memory * 75%, cgroup_limit * 75%) to cap the GC region range. Previously only checked cgroup limits, which misses bare-metal/VM deployments where memory may be limited but no cgroup is configured.
* fix: use 37.5% for GCRegionRange to account for large pages double-commit
  The GC has a known bug where large pages cause ~2x the region range to be committed (dotnet/runtime#103203). Since large pages can't be lazily committed, GCRegionRange becomes actual physical memory. Use 37.5% (half of 75%) so the effective commit stays within 75% of available memory after the GC doubles it.
* fix: cap GCRegionRange to available hugepages
  Large pages are unpageable: mmap with MAP_HUGETLB fails if we request more than the host has pre-allocated. Cap the region range to free_hugepages * 2 MiB to ensure the allocation can succeed.
* fix: skip large pages if effective region range is too small
  Don't enable large pages if the calculated GCRegionRange would be below 4 GiB: the GC heap would be too constrained and regular pages with lazy commit give better flexibility. The 4 GiB threshold means ~8 GiB committed (due to 2x bug) which is roughly a 16 GiB minimum system for large pages to activate.
* fix: use 75% for bare metal, 37.5% only when cgroup limit is set
  The 2x double-commit bug only matters when a cgroup memory limit is enforced (OOM-killer counts committed pages). On bare metal, Linux overcommit handles the wasted commit harmlessly.
  48 GB available bare metal: 48 * 75% = 36 GiB region range (usable heap)
  48 GB container with limit: 48 * 37.5% = 18 GiB region range (safe from OOM-kill)
* fix: always use 37.5% (hugepages are unpageable, no overcommit)
  Hugepages are pinned physical RAM with no overcommit or swap fallback. The 2x double-commit must physically fit in available memory or mmap fails. Use 37.5% unconditionally so region_range * 2 <= 75% of memory.
  48 GB available: 37.5% = 18 GiB range, 36 GiB committed, 12 GiB headroom
* fix: raise large pages minimum to 18 GiB region range
  Nethermind needs ~18 GiB usable GC heap for state caching. With the 2x double-commit this requires ~48 GiB available memory. Don't enable large pages on smaller systems; regular pages with lazy commit give better flexibility when memory is constrained.
* fix: remove 32 GiB cap (37.5% rule and hugepage check are sufficient)
  The cap was a safety net for the OOM crash. With 37.5% of available memory, 18 GiB minimum, and free_hugepages cap, the allocation is guaranteed to fit. Removing the cap lets large-memory machines (128+ GiB) use more GC heap.
  128 GiB available: 37.5% = 48 GiB usable heap (was capped at 32 GiB)
  256 GiB available: 37.5% = 96 GiB usable heap
* fix: set execute permission on entrypoint.sh
* feat: add Nethermind container log tailing to benchmark workflow
  Background-tail the Nethermind container logs during EXPB runs for visibility into startup failures, exceptions, and shutdown behavior. Matches the pattern used in collect-pgo-profile.yml.
* fix: stream container logs live instead of buffering until end
  Save stdout to fd3 before expb redirects to file. Container log tail writes to fd3 so logs appear progressively in GitHub Actions output while expb output is captured separately for analysis.
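
The final GCRegionRange rules described in the message (37.5% of available memory, capped to free hugepages, 18 GiB minimum) can be sketched in shell roughly as follows. This is an illustration under stated assumptions, not the contents of entrypoint.sh: the names compute_region_range and to_hex are hypothetical, 2 MiB hugepages are assumed, and the caller is assumed to pass the smaller of available memory and any cgroup limit.

```shell
#!/bin/sh
# Sketch only: names are hypothetical, not taken from entrypoint.sh.

GIB=$((1024 * 1024 * 1024))
HUGEPAGE_BYTES=$((2 * 1024 * 1024))   # assumes 2 MiB hugepages
MIN_RANGE=$((18 * GIB))               # 18 GiB minimum from the commit message

# compute_region_range MEM_BYTES FREE_HUGEPAGES
# Prints the region range in bytes, or 0 when large pages should stay off.
compute_region_range() {
  # 37.5% = 3/8, so range * 2 (the double commit) stays within 75% of memory.
  range=$(( $1 / 8 * 3 ))
  # Hugepages are pre-allocated; never request more than are free.
  cap=$(( $2 * HUGEPAGE_BYTES ))
  if [ "$cap" -lt "$range" ]; then range=$cap; fi
  # Below the minimum, regular pages with lazy commit are preferred.
  if [ "$range" -lt "$MIN_RANGE" ]; then range=0; fi
  echo "$range"
}

# to_hex BYTES: format a byte count the way DOTNET_GCRegionRange is written.
to_hex() {
  printf '0x%X\n' "$1"
}
```

With 48 GiB available and enough free hugepages, compute_region_range yields 18 GiB, matching the worked example in the message; to_hex applied to 32 GiB gives 0x800000000, the value quoted for DOTNET_GCRegionRange. A real script would also skip the whole calculation when the user has already set the environment variables.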
1 parent db2ef91 commit ea5ce04

22 files changed: 661 additions, 15 deletions

.github/workflows/build-solutions.yml

Lines changed: 33 additions & 0 deletions
@@ -42,3 +42,36 @@ jobs:
 
       - name: Build ${{ matrix.solution }}.slnx
         run: dotnet build src/Nethermind/${{ matrix.solution }}.slnx -c ${{ matrix.config }} --no-restore
+
+  publish-runner:
+    name: Publish Nethermind.Runner [${{ matrix.target.label }}]
+    runs-on: ${{ matrix.target.runner }}
+    strategy:
+      fail-fast: false
+      matrix:
+        target:
+          - { runner: ubuntu-latest, label: linux-x64, rid: linux-x64 }
+          - { runner: ubuntu-24.04-arm, label: linux-arm64, rid: linux-arm64 }
+    steps:
+      - name: Free up disk space
+        uses: jlumbroso/free-disk-space@v1.3.1
+        with:
+          large-packages: false
+          tool-cache: false
+
+      - name: Check out repository
+        uses: actions/checkout@v6
+
+      - name: Set up .NET
+        uses: actions/setup-dotnet@v5
+        with:
+          cache: true
+          cache-dependency-path: src/Nethermind/Nethermind.Runner/packages.lock.json
+
+      - name: Restore Nethermind.Runner
+        run: |
+          dotnet restore src/Nethermind/Nethermind.Runner/Nethermind.Runner.csproj --locked-mode
+          dotnet restore src/Nethermind/Nethermind.Runner/Nethermind.Runner.csproj -r ${{ matrix.target.rid }} -p:PublishReadyToRun=true
+
+      - name: Publish Nethermind.Runner for ${{ matrix.target.rid }}
+        run: dotnet publish src/Nethermind/Nethermind.Runner/Nethermind.Runner.csproj -c release -r ${{ matrix.target.rid }} --no-restore --no-self-contained
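
To try the publish-runner steps outside CI, the two-phase restore followed by an offline publish might be wrapped in a small script like the one below. The dotnet commands and project path are copied from the workflow; the run helper, the DRY_RUN switch, and the publish_r2r name are hypothetical conveniences for this sketch.

```shell
#!/bin/sh
# Sketch: local reproduction of the publish-runner job steps.
PROJECT=src/Nethermind/Nethermind.Runner/Nethermind.Runner.csproj

# run CMD...: print the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then "$@"; fi
}

# publish_r2r RID: two-phase restore, then publish without restoring again.
publish_r2r() {
  # Phase 1: validate NuGet packages against packages.lock.json.
  run dotnet restore "$PROJECT" --locked-mode &&
  # Phase 2: RID-specific restore so the crossgen2 runtime pack is fetched.
  run dotnet restore "$PROJECT" -r "$1" -p:PublishReadyToRun=true &&
  # Publish from what the two restores already resolved.
  run dotnet publish "$PROJECT" -c release -r "$1" --no-restore --no-self-contained
}
```

Running `DRY_RUN=1 publish_r2r linux-x64` from the repository root prints the three commands without executing them; dropping DRY_RUN runs them and stops at the first failure.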