@yvan-sraka yvan-sraka commented Aug 11, 2025

This PR implements a migration from GitHub's standard runners to a hybrid infrastructure combining self-hosted and ephemeral Blacksmith runners for building Nix packages.
The implementation includes runner selection, dynamic build matrix generation, and optimized caching strategies to improve build performance and cost efficiency.

Problem Statement

The previous CI implementation had several limitations:

  1. Monolithic build process: A single job attempted to build all packages across all architectures
  2. Inefficient resource allocation: All packages used the same runner type regardless of build complexity
  3. Limited parallelization: Builds couldn't be efficiently distributed across different runner types
  4. Redundant builds: No mechanism to skip packages already available in the binary cache
  5. Poor cost optimization: Large, expensive builds ran on the same infrastructure as small, quick builds
  6. Poor job output clarity: No separation of build results made it hard to identify issues

Solution Architecture

High-Level Design

┌─────────────────┐
│   nix-eval      │  Evaluates flake, generates build matrix
│   (Blacksmith)  │  Identifies cached vs. uncached packages
└────────┬────────┘  Identifies large packages
         │
         ├──────────────────┬──────────────────┐
         │                  │                  │
         v                  v                  v
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ aarch64-linux  │ │ aarch64-darwin │ │ x86_64-linux   │
│ Self-hosted/   │ │ Self-hosted    │ │ Blacksmith     │
│ Blacksmith     │ │ (macOS)        │ │ Ephemeral      │
└────────────────┘ └────────────────┘ └────────────────┘

Architecture Components

  1. Nix Evaluation Phase (nix-eval.yml):

    • Runs on a powerful ephemeral runner (32vcpu)
    • Evaluates all flake outputs using nix-eval-jobs
    • Checks cache status for each package
    • Generates optimized build matrices per architecture

  2. Build Phases (separate jobs per architecture):

    • aarch64-linux: Self-hosted or Blacksmith ARM runners
    • aarch64-darwin: Self-hosted macOS runners
    • x86_64-linux: Blacksmith ephemeral runners

  3. Runner Selection Logic:

    • KVM-required packages → Self-hosted runners with KVM support
    • Large packages (Rust, PostGIS) → 32vcpu runners
    • Standard packages → 8vcpu runners
    • Darwin packages → Self-hosted macOS runners
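
The selection rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the PR's implementation: the Package class, the select_runner function, the self-hosted label names, the "kvm" feature name, and the x86_64 Blacksmith label are hypothetical. Only the label blacksmith-32vcpu-ubuntu-2404-arm appears in the sample output of this PR, and the order in which the rules are checked is also assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Hypothetical view of one flake output to build."""
    attr: str
    system: str
    required_system_features: list[str] = field(default_factory=list)


def select_runner(pkg: Package) -> dict:
    """Return a runs_on mapping for one package (label names are assumptions)."""
    features = set(pkg.required_system_features)
    if pkg.system.endswith("-darwin"):
        # Darwin packages go to self-hosted macOS runners.
        return {"labels": ["self-hosted", "macOS"]}
    if "kvm" in features:
        # KVM-required packages need a self-hosted runner with KVM support.
        return {"labels": ["self-hosted", "kvm"]}
    # Large packages get 32vcpu runners, everything else 8vcpu.
    size = "32vcpu" if "big-parallel" in features else "8vcpu"
    suffix = "-arm" if pkg.system.startswith("aarch64") else ""
    return {"labels": [f"blacksmith-{size}-ubuntu-2404{suffix}"]}
```

For example, select_runner(Package("checks.aarch64-linux.pg_graphql_15", "aarch64-linux", ["big-parallel"])) yields the 32vcpu ARM label shown in the Output Format section.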

Key Components

1. Dynamic Matrix Generation (github-matrix Package)

Location: nix/packages/github-matrix/

Core Responsibilities:

  • Evaluates Nix flake outputs using nix-eval-jobs (https://github.com/nix-community/nix-eval-jobs)
  • Determines package dependencies and build order using topological sorting
  • Identifies cached packages to skip redundant builds
  • Assigns appropriate runners based on package requirements
  • Generates GitHub Actions-compatible JSON matrices
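
A rough sketch of the ordering and cache-skipping responsibilities, assuming the dependency graph and the set of cached attributes have already been extracted from the evaluation output. The function name build_order and both argument shapes are hypothetical; graphlib is part of the Python standard library (3.9+), and the PR does not say which sorting implementation github-matrix uses.

```python
from graphlib import TopologicalSorter


def build_order(deps: dict[str, set[str]], cached: set[str]) -> list[str]:
    """Order attributes so dependencies come first, dropping cached ones.

    deps maps each attribute to the set of attributes it depends on.
    cached holds attributes already present in the binary cache.
    """
    ordered = TopologicalSorter(deps).static_order()
    return [attr for attr in ordered if attr not in cached]


if __name__ == "__main__":
    deps = {"postgresql": set(), "ext": {"postgresql"}, "tests": {"ext"}}
    print(build_order(deps, cached={"postgresql"}))
```

A cyclic graph makes static_order() raise graphlib.CycleError, which a real tool would probably want to report as an evaluation error.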

Package Size Detection:

  • Uses requiredSystemFeatures = ["big-parallel"] in package definitions
  • Automatically allocates 32vcpu runners for:
    • Rust-based extensions (pg_graphql, pg_jsonschema, wrappers)
    • PostGIS (complex C++ builds)
    • pgvector with heavy dependencies

Output Format:

{
  "aarch64_linux": {
    "include": [
      {
        "attr": "checks.aarch64-linux.pg_graphql_15",
        "name": "pg_graphql-15.7",
        "system": "aarch64-linux",
        "runs_on": {"labels": ["blacksmith-32vcpu-ubuntu-2404-arm"]},
        "postgresql_version": "15"
      }
    ]
  },
  "x86_64_linux": {...},
  "aarch64_darwin": {...}
}
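
One way to assemble output of this shape is to group job entries by system and turn each system name into a key by replacing hyphens with underscores. The sketch below assumes each job is already a dictionary with the fields shown above; build_matrix is a hypothetical name and not necessarily how github-matrix is structured.

```python
import json
from collections import defaultdict


def build_matrix(jobs: list[dict]) -> dict:
    """Group job entries into one {"include": [...]} matrix per system."""
    matrix = defaultdict(lambda: {"include": []})
    for job in jobs:
        key = job["system"].replace("-", "_")  # aarch64-linux -> aarch64_linux
        matrix[key]["include"].append(job)
    return dict(matrix)


if __name__ == "__main__":
    jobs = [
        {
            "attr": "checks.aarch64-linux.pg_graphql_15",
            "name": "pg_graphql-15.7",
            "system": "aarch64-linux",
            "runs_on": {"labels": ["blacksmith-32vcpu-ubuntu-2404-arm"]},
            "postgresql_version": "15",
        }
    ]
    print(json.dumps(build_matrix(jobs), indent=2))
```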

2. Custom Nix Installation Actions

Two reusable GitHub Actions unify Nix installation across the different runner types.

Ephemeral Runners (nix-install-ephemeral)

Location: .github/actions/nix-install-ephemeral/

Purpose: Set up Nix on fresh Blacksmith runners where Nix is not pre-installed

Features:

  • Installs Nix 2.31.2 using cachix/install-nix-action
  • Configures binary cache substituters
  • Optionally sets up AWS credentials for cache pushing
  • Creates post-build hook for automatic cache uploads

Configuration:

- uses: ./.github/actions/nix-install-ephemeral
  with:
    push-to-cache: 'true'  # Enable for build jobs
  env:
    DEV_AWS_ROLE: ${{ secrets.DEV_AWS_ROLE }}
    NIX_SIGN_SECRET_KEY: ${{ secrets.NIX_SIGN_SECRET_KEY }}

Cache Upload Mechanism:

  • Post-build hook automatically uploads successful builds to S3
  • Uses Nix signing keys for trusted binary cache
  • Hook script: /etc/nix/upload-to-cache.sh
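
To illustrate what such a hook does: Nix runs the post-build hook with the freshly built output paths in the OUT_PATHS environment variable, separated by spaces, and the hook copies them to the cache. The actual hook in this PR is a shell script whose contents are not shown here; the Python sketch below only builds the nix copy command line, and the cache URL, the function name, and the omission of signing are all assumptions.

```python
import os
import shlex


def upload_command(out_paths: str, cache_url: str) -> list[str]:
    """Build the `nix copy` invocation for the paths Nix just built."""
    paths = out_paths.split()
    if not paths:
        return []  # nothing was built, so there is nothing to upload
    return ["nix", "copy", "--to", cache_url, *paths]


if __name__ == "__main__":
    # Hypothetical cache URL; the real bucket and signing setup are not shown in this PR.
    cmd = upload_command(os.environ.get("OUT_PATHS", ""), "s3://example-nix-cache")
    print(shlex.join(cmd))
```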

Self-Hosted Runners (nix-install-self-hosted)

Location: .github/actions/nix-install-self-hosted/

Purpose: Configure AWS credentials on persistent self-hosted runners where Nix is pre-installed

Features:

  • Assumes AWS IAM role via OIDC
  • Writes credentials to /etc/nix/aws/nix-aws-credentials
  • Supports custom role duration (default 5 hours)

3. Reusable Nix Eval Workflow

Location: .github/workflows/nix-eval.yml

Purpose: Shared workflow for matrix generation

Features:

  • Callable from other workflows via workflow_call
  • Outputs structured JSON matrix
  • Runs on high-performance ephemeral runner
  • Handles optional AWS credentials for cache access

4. Restructured Build Workflow

Location: .github/workflows/nix-build.yml

New Structure:

jobs:
  nix-eval:
    # Generate build matrices
    uses: ./.github/workflows/nix-eval.yml

  nix-build-aarch64-linux:
    needs: nix-eval
    strategy:
      matrix: ${{ fromJSON(needs.nix-eval.outputs.matrix).aarch64_linux }}
    # Build ARM Linux packages

  nix-build-aarch64-darwin:
    needs: nix-eval
    strategy:
      matrix: ${{ fromJSON(needs.nix-eval.outputs.matrix).aarch64_darwin }}
    # Build macOS ARM packages

  nix-build-x86_64-linux:
    needs: nix-eval
    strategy:
      matrix: ${{ fromJSON(needs.nix-eval.outputs.matrix).x86_64_linux }}
    # Build x86_64 Linux packages

  run-testinfra:
    needs: [nix-build-aarch64-linux, ...]
    # Only run if all builds succeeded or were skipped

  run-tests:
    needs: [nix-build-aarch64-linux, ...]
    # Run test suite

Key Improvements:

  1. Parallel Architecture Builds: Each architecture builds independently
  2. Smart Job Skipping: Uses !cancelled() combined with checks that each upstream job succeeded or was skipped
  3. Dynamic Job Names: Include PostgreSQL version for clarity
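
As a rough model of the skipping rule in item 2: the workflow itself expresses this as a GitHub Actions if: expression, and Python is used here only for illustration. A downstream job runs when the workflow was not cancelled and every job it needs reported either success or skipped. The function name and argument shapes are hypothetical; the result strings are the values GitHub Actions reports in needs.<job>.result.

```python
ALLOWED_RESULTS = {"success", "skipped"}


def should_run(cancelled: bool, needs_results: dict[str, str]) -> bool:
    """Model of `!cancelled()` plus success-or-skipped checks on upstream jobs."""
    if cancelled:
        return False
    return all(result in ALLOWED_RESULTS for result in needs_results.values())
```

With this rule, an architecture whose packages were all cached (and whose build job was therefore skipped) does not block the test jobs, while a failed build does.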

Related PRs

@yvan-sraka yvan-sraka requested review from jfroche and samrose August 11, 2025 10:11
@yvan-sraka yvan-sraka self-assigned this Aug 11, 2025
@yvan-sraka yvan-sraka requested review from a team as code owners August 11, 2025 10:11
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch from 8b61ad4 to 76aa79b Compare August 11, 2025 15:36
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch from 76aa79b to c75bf58 Compare September 12, 2025 13:46
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch 16 times, most recently from 1eb74b8 to db1e5e4 Compare September 29, 2025 14:29
@jfroche jfroche force-pushed the custom-github-runners branch 5 times, most recently from 003d671 to 840005b Compare September 29, 2025 21:14
jfroche and others added 28 commits November 25, 2025 23:55
When building a PostgreSQL extension, the build matrix may include
the same extension multiple times for different PostgreSQL versions.
This change makes it easier to identify which job corresponds to which PostgreSQL
version in the workflow runs.
treefmt is already included in the pre-commit hooks check.
Dynamically assign larger runners (32vcpu) for Rust and PostGIS extensions
while using smaller runners (8vcpu) for standard packages.
Add pytest tests for the package
Add nix-eval-jobs in path for the package
The matrix job returns the type of runner, so we can configure the nix
installation step accordingly.
Our changes were merged upstream, so we can now track the original
repository again.
…default

- Replace DeterminateSystems/nix-installer-action with custom nix-install-ephemeral action across all workflows
- Change default push-to-cache from 'true' to 'false' to avoid unnecessary Nix/AWS configuration
- Explicitly enable push-to-cache only for nix-build and nix-eval workflows where caching is beneficial
We might not need the full 8vcpu for aarch64-linux builds, so this
change reduces the runner size to 4vcpu to shorten the wait for available
Blacksmith runners.
Fix a hang in github-matrix when nix-eval-jobs encountered errors. The cause was a subprocess pipe deadlock: the stderr buffer would fill up while only stdout was being read.

This change ensures that evaluation errors are visible and that the workflow fails properly while still showing which packages succeeded.
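
The class of deadlock described in this commit can be avoided by draining both pipes concurrently. The sketch below shows one common way to do that with subprocess.Popen.communicate(); it is not necessarily the approach taken in the commit, and run_capture is a hypothetical helper name.

```python
import subprocess
import sys


def run_capture(cmd: list[str]) -> tuple[int, str, str]:
    """Run cmd and collect stdout and stderr without risking a pipe deadlock.

    Reading proc.stdout to the end before touching proc.stderr can hang once
    the child fills the stderr pipe buffer. communicate() reads both pipes
    at the same time, so neither buffer can block the child.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    stdout, stderr = proc.communicate()
    return proc.returncode, stdout, stderr


if __name__ == "__main__":
    # A child that writes far more to stderr than a pipe buffer usually holds.
    child = "import sys; sys.stderr.write('e' * 200000); sys.stdout.write('ok')"
    code, out, err = run_capture([sys.executable, "-c", child])
    print(code, out, len(err))
```

communicate() buffers all output in memory, so a tool that wants to stream results line by line would need reader threads or similar instead.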
…isibility

Integrates github-action-utils library to improve error and warning
visibility in GitHub Actions UI through workflow command annotations.
Refactor error handling to collect and group evaluation errors similar to warnings. Errors with the same message are now displayed together with a list of affected attributes.
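
The grouping described in this commit might look roughly like the following. The helper names and the (attribute, message) input shape are hypothetical, and the sketch prints raw GitHub Actions workflow commands instead of calling the github-action-utils library that the PR integrates.

```python
from collections import defaultdict


def group_errors(errors: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (attribute, message) pairs by message, keeping first-seen order."""
    grouped = defaultdict(list)
    for attr, message in errors:
        grouped[message].append(attr)
    return dict(grouped)


def escape_data(text: str) -> str:
    """Escape characters that are special in workflow command messages."""
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def format_error(message: str, attrs: list[str]) -> str:
    """Render one grouped error as an ::error workflow command."""
    body = f"{message}\nAffected attributes: {', '.join(attrs)}"
    return f"::error title=Evaluation error::{escape_data(body)}"


if __name__ == "__main__":
    errors = [("pkg_a", "missing input"), ("pkg_b", "missing input"), ("pkg_c", "assertion failed")]
    for message, attrs in group_errors(errors).items():
        print(format_error(message, attrs))
```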
Extract core error messages and format them better for GitHub Actions
annotations.
Add nix-eval to needs dependencies and check its result in conditional expressions to prevent downstream test jobs from running when evaluation fails.
@jfroche jfroche force-pushed the custom-github-runners branch from c1c4fd7 to 9b659e4 Compare November 25, 2025 22:56
We are running an older version of the 'result' library that uses
'_value' instead of 'ok_value' to access the successful result of a
computation.