@yvan-sraka yvan-sraka commented Aug 11, 2025

This PR implements a migration from GitHub's standard runners to a hybrid infrastructure combining self-hosted and ephemeral Blacksmith runners for building Nix packages.
The implementation includes runner selection, dynamic build matrix generation, and optimized caching strategies to improve build performance and cost efficiency.

Problem Statement

The previous CI implementation had several limitations:

  1. Monolithic build process: A single job attempted to build all packages across all architectures
  2. Inefficient resource allocation: All packages used the same runner type regardless of build complexity
  3. Limited parallelization: Builds couldn't be efficiently distributed across different runner types
  4. Redundant builds: No mechanism to skip packages already available in the binary cache
  5. Poor cost optimization: Large, expensive builds ran on the same infrastructure as small, quick builds
  6. Poor job output clarity: No separation of build results made it hard to identify issues

Solution Architecture

High-Level Design

┌─────────────────┐
│   nix-eval      │  Evaluates flake, generates build matrix
│   (Blacksmith)  │  Identifies cached vs. uncached packages
└────────┬────────┘  Identifies large packages
         │
         ├──────────────────┬──────────────────┐
         │                  │                  │
         v                  v                  v
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ aarch64-linux  │ │ aarch64-darwin │ │ x86_64-linux   │
│ Self-hosted/   │ │ Self-hosted    │ │ Blacksmith     │
│ Blacksmith     │ │ (macOS)        │ │ Ephemeral      │
└────────────────┘ └────────────────┘ └────────────────┘

Architecture Components

  1. Nix Evaluation Phase (nix-eval.yml):

    • Runs on powerful ephemeral runner (32vcpu)
    • Evaluates all flake outputs using nix-eval-jobs
    • Checks cache status for each package
    • Generates optimized build matrices per architecture

  2. Build Phases (separate jobs per architecture):

    • aarch64-linux: Self-hosted or Blacksmith ARM runners
    • aarch64-darwin: Self-hosted macOS runners
    • x86_64-linux: Blacksmith ephemeral runners

  3. Runner Selection Logic:

    • KVM-required packages → Self-hosted runners with KVM support
    • Large packages (Rust, PostGIS) → 32vcpu runners
    • Standard packages → 8vcpu runners
    • Darwin packages → Self-hosted macOS runners
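
The selection rules above can be sketched in Python roughly as follows. This is a minimal illustration, not the code in the github-matrix package: the Package class, the "kvm" feature name, and every runner label except blacksmith-32vcpu-ubuntu-2404-arm (which appears in the output example of this description) are assumptions, as is the precedence of the Darwin rule over the others.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Hypothetical record for one evaluated flake output."""
    attr: str
    system: str
    required_features: list = field(default_factory=list)


def select_runner(pkg: Package) -> dict:
    """Return a runs_on mapping for a package, following the rules listed above."""
    features = set(pkg.required_features)
    if pkg.system == "aarch64-darwin":
        # Darwin packages always go to self-hosted macOS runners (labels are hypothetical).
        return {"labels": ["self-hosted", "macOS"]}
    if "kvm" in features:
        # KVM-required packages need self-hosted runners (labels are hypothetical).
        return {"labels": ["self-hosted", "kvm"]}
    # Large packages are marked with the "big-parallel" system feature.
    size = "32vcpu" if "big-parallel" in features else "8vcpu"
    suffix = "-arm" if pkg.system == "aarch64-linux" else ""
    return {"labels": [f"blacksmith-{size}-ubuntu-2404{suffix}"]}
```

For example, a package on aarch64-linux that declares big-parallel maps to the blacksmith-32vcpu-ubuntu-2404-arm label used in the sample matrix.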

Key Components

1. Dynamic Matrix Generation (github-matrix Package)

Location: nix/packages/github-matrix/

Core Responsibilities:

  • Evaluates Nix flake outputs using nix-eval-jobs (https://github.com/nix-community/nix-eval-jobs)
  • Determines package dependencies and build order using topological sorting
  • Identifies cached packages to skip redundant builds
  • Assigns appropriate runners based on package requirements
  • Generates GitHub Actions-compatible JSON matrices
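
As a rough sketch of the ordering and cache-skipping responsibilities above, the standard-library graphlib module can produce a dependency-first order and then drop cached packages. The function name, the shape of the deps mapping, and the package names in the usage note are illustrative assumptions, not taken from the github-matrix package.

```python
from graphlib import TopologicalSorter


def build_order(deps: dict, cached: set) -> list:
    """Return uncached packages in an order where dependencies come first.

    deps maps each package name to the set of packages it depends on.
    """
    order = TopologicalSorter(deps).static_order()
    # Packages already present in the binary cache do not need a build job.
    return [name for name in order if name not in cached]
```

With deps = {"postgis": {"postgresql", "gdal"}, "postgresql": set(), "gdal": set()} and cached = {"postgresql"}, the result contains gdal before postgis and omits postgresql.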

Package Size Detection:

  • Uses requiredSystemFeatures = ["big-parallel"] in package definitions
  • Automatically allocates 32vcpu runners for:
    • Rust-based extensions (pg_graphql, pg_jsonschema, wrappers)
    • PostGIS (complex C++ builds)
    • pgvector with heavy dependencies

Output Format:

{
  "aarch64_linux": {
    "include": [
      {
        "attr": "checks.aarch64-linux.pg_graphql_15",
        "name": "pg_graphql-15.7",
        "system": "aarch64-linux",
        "runs_on": {"labels": ["blacksmith-32vcpu-ubuntu-2404-arm"]},
        "postgresql_version": "15"
      }
    ]
  },
  "x86_64_linux": {...},
  "aarch64_darwin": {...}
}
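
A structure of this shape could be assembled along the following lines. This is only a sketch under the assumption that each job entry already carries its system and runs_on fields; to_matrix is a hypothetical name, not a function from the package.

```python
import json
from collections import defaultdict


def to_matrix(jobs: list) -> dict:
    """Group job entries by system into per-architecture `include` matrices."""
    grouped = defaultdict(list)
    for job in jobs:
        # "aarch64-linux" becomes the key "aarch64_linux", as in the output above.
        grouped[job["system"].replace("-", "_")].append(job)
    return {key: {"include": entries} for key, entries in grouped.items()}


if __name__ == "__main__":
    example = [{
        "attr": "checks.aarch64-linux.pg_graphql_15",
        "name": "pg_graphql-15.7",
        "system": "aarch64-linux",
        "runs_on": {"labels": ["blacksmith-32vcpu-ubuntu-2404-arm"]},
        "postgresql_version": "15",
    }]
    print(json.dumps(to_matrix(example), indent=2))
```

Keeping one top-level key per architecture lets each build job read only its own slice of the matrix with fromJSON.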

2. Custom Nix Installation Actions

Two reusable GitHub Actions unify Nix installation across the different runner types.

Ephemeral Runners (nix-install-ephemeral)

Location: .github/actions/nix-install-ephemeral/

Purpose: Set up Nix on fresh Blacksmith runners where Nix is not pre-installed

Features:

  • Installs Nix 2.31.2 using cachix/install-nix-action
  • Configures binary cache substituters
  • Optionally sets up AWS credentials for cache pushing
  • Creates post-build hook for automatic cache uploads

Configuration:

- uses: ./.github/actions/nix-install-ephemeral
  with:
    push-to-cache: 'true'  # Enable for build jobs
  env:
    DEV_AWS_ROLE: ${{ secrets.DEV_AWS_ROLE }}
    NIX_SIGN_SECRET_KEY: ${{ secrets.NIX_SIGN_SECRET_KEY }}

Cache Upload Mechanism:

  • Post-build hook automatically uploads successful builds to S3
  • Uses Nix signing keys for trusted binary cache
  • Hook script: /etc/nix/upload-to-cache.sh
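
For context, Nix runs a post-build hook with the OUT_PATHS environment variable set to the space-separated store paths that were just built. The actual hook here is the shell script at /etc/nix/upload-to-cache.sh, whose contents are not shown in this description; the Python sketch below only illustrates the kind of nix copy command such a hook would run. The function name and the cache URL in the usage note are hypothetical.

```python
import os


def upload_command(cache_url: str, env=None) -> list:
    """Build the `nix copy` argv a post-build hook could run for fresh outputs."""
    env = os.environ if env is None else env
    # Nix passes the freshly built store paths in OUT_PATHS, separated by spaces.
    out_paths = env.get("OUT_PATHS", "").split()
    if not out_paths:
        return []  # nothing was built, so there is nothing to upload
    return ["nix", "copy", "--to", cache_url, *out_paths]
```

Calling upload_command("s3://example-cache", {"OUT_PATHS": "/nix/store/aaa-foo"}) yields an argv suitable for subprocess.run; an empty list signals that the hook can exit without doing anything.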

Self-Hosted Runners (nix-install-self-hosted)

Location: .github/actions/nix-install-self-hosted/

Purpose: Configure AWS credentials on persistent self-hosted runners where Nix is pre-installed

Features:

  • Assumes AWS IAM role via OIDC
  • Writes credentials to /etc/nix/aws/nix-aws-credentials
  • Supports custom role duration (default 5 hours)

3. Reusable Nix Eval Workflow

Location: .github/workflows/nix-eval.yml

Purpose: Shared workflow for matrix generation

Features:

  • Callable from other workflows via workflow_call
  • Outputs structured JSON matrix
  • Runs on high-performance ephemeral runner
  • Handles optional AWS credentials for cache access

4. Restructured Build Workflow

Location: .github/workflows/nix-build.yml

New Structure:

jobs:
  nix-eval:
    # Generate build matrices
    uses: ./.github/workflows/nix-eval.yml

  nix-build-aarch64-linux:
    needs: nix-eval
    strategy:
      matrix: ${{ fromJSON(needs.nix-eval.outputs.matrix).aarch64_linux }}
    # Build ARM Linux packages

  nix-build-aarch64-darwin:
    needs: nix-eval
    strategy:
      matrix: ${{ fromJSON(needs.nix-eval.outputs.matrix).aarch64_darwin }}
    # Build macOS ARM packages

  nix-build-x86_64-linux:
    needs: nix-eval
    strategy:
      matrix: ${{ fromJSON(needs.nix-eval.outputs.matrix).x86_64_linux }}
    # Build x86_64 Linux packages

  run-testinfra:
    needs: [nix-build-aarch64-linux, ...]
    # Only run if all builds succeed or skip

  run-tests:
    needs: [nix-build-aarch64-linux, ...]
    # Run test suite

Key Improvements:

  1. Parallel Architecture Builds: Each architecture builds independently
  2. Smart Job Skipping: Downstream jobs use !cancelled() and run only when every upstream build succeeded or was skipped
  3. Dynamic Job Names: Include PostgreSQL version for clarity
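
The gating and naming behaviour can be modelled in Python as below. This is a sketch of the logic only, not the workflow expressions themselves: the result strings mirror the values GitHub Actions reports in needs.<job>.result, while the function names and the exact job name format are assumptions.

```python
def should_run(upstream_results: list, cancelled: bool = False) -> bool:
    """Model `!cancelled()` combined with 'every upstream build succeeded or was skipped'."""
    if cancelled:
        return False
    return all(result in ("success", "skipped") for result in upstream_results)


def job_name(entry: dict) -> str:
    """Build a display name that includes the PostgreSQL version when one is present."""
    version = entry.get("postgresql_version")
    # Hypothetical format; the workflow's real naming scheme is not shown here.
    return f"{entry['name']} (PostgreSQL {version})" if version else entry["name"]
```

Treating "skipped" as acceptable matters when a build job has nothing to do, for example because every package for that architecture is already in the binary cache.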

Related PRs

@yvan-sraka yvan-sraka requested review from jfroche and samrose August 11, 2025 10:11
@yvan-sraka yvan-sraka self-assigned this Aug 11, 2025
@yvan-sraka yvan-sraka requested review from a team as code owners August 11, 2025 10:11
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch from 8b61ad4 to 76aa79b Compare August 11, 2025 15:36
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch from 76aa79b to c75bf58 Compare September 12, 2025 13:46
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch 16 times, most recently from 1eb74b8 to db1e5e4 Compare September 29, 2025 14:29
@jfroche jfroche force-pushed the custom-github-runners branch 5 times, most recently from 003d671 to 840005b Compare September 29, 2025 21:14
jfroche and others added 27 commits November 19, 2025 20:32
Refactor GitHub Actions workflow to run build checks in parallel across different
architectures (aarch64-linux, aarch64-darwin) with separate job matrices.
Create a single nix-eval job to determine packages to build, removing
redundant extension and check matrices.
When building a PostgreSQL extension, the build matrix may include
the same extension multiple times, once for each PostgreSQL version.
This change makes it easier to identify which job corresponds to which PostgreSQL
version in the workflow runs.
treefmt is already included in the pre-commit hooks check.
Dynamically assign larger runners (32vcpu) for Rust and PostGIS extensions
while using smaller runners (8vcpu) for standard packages.
Add pytest tests for the package
Add nix-eval-jobs in path for the package
The matrix job returns the type of runner, so we can configure the nix
installation step accordingly.
Our changes were merged upstream, so we can now track the original
repository again.
…default

- Replace DeterminateSystems/nix-installer-action with custom nix-install-ephemeral action across all workflows
- Change default push-to-cache from 'true' to 'false' to prevent unnecessary nix/aws configurations
- Explicitly enable push-to-cache only for nix-build and nix-eval workflows where caching is beneficial
We might not need the full 8vcpu for aarch64-linux builds, so this
change reduces the runner size to 4vcpu to shorten the wait for
available Blacksmith runners.
@yvan-sraka yvan-sraka force-pushed the custom-github-runners branch from 21a9736 to aa4b344 Compare November 19, 2025 19:32

@samrose samrose left a comment

From reviewing the code, I don’t see any reason we can’t merge. We just need to generate images and test.

samrose commented Nov 19, 2025

Added a change request just to block the merge until we test.
