@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-03-15
109 changes: 109 additions & 0 deletions openspec/changes/2026-03-15-packer-s3-build-cache/design.md
@@ -0,0 +1,109 @@
## Context

The packer build pipeline produces two image layers: `base` and `cassandra` (db). The bottlenecks are:

- `packer/base/install/install_bcc.sh`: Downloads BCC v0.35.0 source, runs `cmake` + `make -j$(nproc)` (C++ compilation). Installs to `/usr/lib/libbcc*`, `/usr/share/bcc/`, and Python bindings. Takes ~15–20 minutes. Version is pinned — the compiled output never changes for a given version + arch combination.

- `packer/cassandra/install/install_cassandra.sh`: Has three install paths controlled by `cassandra_versions.yaml`:
1. Download from Apache CDN by version prefix (no URL, no branch) — already fast
2. Download tarball from explicit URL — already fast
3. Clone git repo + `ant realclean && ant` build (URL ending in `.git` + `branch` field) — slow; pays full build cost every time

The YAML currently uses paths 1 and 2 for all versions (including `trunk` and `5.0-HEAD` which point to nightly pre-built tarballs on GitHub Releases). Path 3 is available for custom branch builds and will be used as the project evolves.

The `aws` CLI is already installed on packer instances (`install_awscli.sh` runs before these scripts).

## Goals / Non-Goals

**Goals:**
- Cache the BCC compiled artifact in S3, keyed by version + architecture. Skip compilation on cache hit.
- Cache Cassandra git-branch build artifacts in S3, keyed by version name + git SHA of the branch tip. Skip ant build on cache hit.
- Make cache operations best-effort: any S3 failure falls through to a normal build.
- Allow cache bypass via `PACKER_CACHE_SKIP=1` for debugging or forced rebuilds.
- Keep the S3 bucket name configurable via `PACKER_CACHE_BUCKET`.

**Non-Goals:**
- Caching Cassandra binary downloads (Apache CDN / GitHub Releases) — these are already pre-built and fast to download.
- Caching sidecar builds (addressed in a separate PR).
- Creating or managing the S3 bucket from Kotlin — the bucket is provisioned separately (same bucket used for cluster data, or a dedicated build cache bucket).
- Invalidating the cache automatically when build scripts change — the cache keys capture build inputs only (version + arch for BCC, version + git SHA for Cassandra), not the install scripts. If a script change alters the produced artifact, the affected entries must be invalidated manually.

## Decisions

### 1. Cache key design

**BCC**: `bcc/bcc-v{VERSION}-{ARCH}.tar.gz`
- `VERSION` = `BCC_VERSION` variable (e.g., `0.35.0`)
- `ARCH` = output of `uname -m` (e.g., `x86_64`, `aarch64`)
- Rationale: BCC is version-pinned. The output is fully determined by version + architecture. No content hash needed.

**Cassandra source builds**: `cassandra/{VERSION}-{GIT_SHA}.tar.gz`
- `VERSION` = the version field from `cassandra_versions.yaml` (e.g., `trunk`, `5.1-dev`)
- `GIT_SHA` = `git rev-parse HEAD` after cloning (first 12 chars)
- Rationale: Branch HEAD moves. The git SHA uniquely identifies the exact source code that was compiled. Old entries accumulate in S3 but storage cost is negligible vs build time.

Both keys are prefixed with `packer-build-cache/` within the bucket to keep them organized.
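Putting the two schemes together, the keys could be computed along these lines (the values shown are illustrative, not real build output):

```shell
BCC_VERSION=0.35.0
ARCH=x86_64                   # from `uname -m` on the build instance
version=trunk                 # version field from cassandra_versions.yaml
GIT_SHA=a1b2c3d4e5f6          # first 12 chars of `git rev-parse HEAD`

BCC_KEY="packer-build-cache/bcc/bcc-v${BCC_VERSION}-${ARCH}.tar.gz"
CASSANDRA_KEY="packer-build-cache/cassandra/${version}-${GIT_SHA}.tar.gz"

echo "${BCC_KEY}"        # packer-build-cache/bcc/bcc-v0.35.0-x86_64.tar.gz
echo "${CASSANDRA_KEY}"  # packer-build-cache/cassandra/trunk-a1b2c3d4e5f6.tar.gz
```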

### 2. What gets cached for BCC

The cache tarball captures the installed output:
- `/usr/lib/libbcc*`
- `/usr/lib/libbpf*` (installed by BCC build)
- `/usr/share/bcc/`
- `/usr/lib/python3/dist-packages/bcc/` (Python bindings)

The tarball is created with `tar -czf` over those install paths (including `/usr/lib/libbpf*` when present) and extracted with `tar -xzf <archive> -C /`.

**Alternative considered:** Cache the entire build directory and re-run `make install`. Rejected — more complex and slower than caching the install artifacts.

### 3. Cache write timing

For BCC: upload happens immediately after `sudo make install` succeeds and the Python bindings are verified.

For Cassandra source builds: upload happens after the built directory is moved to `/usr/local/cassandra/$version` and configured.

Cache writes are best-effort — a failure to upload does not fail the build (`aws s3 cp ... || true`).

### 4. Cache reads are also best-effort

If the `aws s3 cp` download fails for any reason (bucket doesn't exist, no permissions, network error), the script proceeds with the normal build path. This ensures the cache never blocks a build.

**Implementation**: Use `aws s3 cp --no-progress` wrapped in an `if` statement. A non-zero exit falls through to the build.
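The `if` wrapper matters under `set -euo pipefail`: a bare failing `aws s3 cp` would abort the script, while using it as an `if` condition suppresses errexit for that one command. A minimal sketch of the control flow (the `fetch_from_cache` stand-in and paths are hypothetical; it fails unconditionally here to demonstrate the miss path without the AWS CLI):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for `aws s3 cp --no-progress "s3://$1/$2" "$3"`.
# Always fails, to show that a miss does not abort the script.
fetch_from_cache() {
  false
}

if fetch_from_cache "example-bucket" "packer-build-cache/key.tar.gz" "/tmp/artifact.tar.gz"; then
  result="hit"
else
  # Falls through to the normal build path; errexit did not fire.
  result="miss"
fi
echo "cache ${result}: proceeding"
```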

### 5. Shared helper function

A shared `s3_cache_get` / `s3_cache_put` helper will be extracted to `packer/lib/s3_cache.sh` and sourced by both `install_bcc.sh` and `install_cassandra.sh`. This avoids duplicating the bucket name resolution and error handling logic.

**Alternative considered:** Inline the logic in each script. Rejected — duplication makes it harder to change the bucket naming convention later.

**Deployment**: Packer's `shell` provisioner uploads only the single script it runs, so `s3_cache.sh` would not exist at any relative path once the script executes on the remote build machine. Both `base.pkr.hcl` and `cassandra.pkr.hcl` must therefore include a `provisioner "file"` block that uploads `../lib/s3_cache.sh` to `/tmp/s3_cache.sh` before any script that sources it. The scripts source it from this fixed path.

**Provisioner ordering for BCC in `base.pkr.hcl`**: `install_awscli.sh` must run before `install_bcc.sh` because the cache helpers call `aws s3 cp`.
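A minimal sketch of what `packer/lib/s3_cache.sh` could look like, assuming `s3_cache_get` reports a miss (non-zero) whenever the bucket is unset or `PACKER_CACHE_SKIP=1`, so that callers fall through to a normal build; exact flags and messages are assumptions:

```shell
#!/usr/bin/env bash
# s3_cache_get <bucket> <key> <dest>: returns 0 on a cache hit, 1 otherwise.
s3_cache_get() {
  local bucket="$1" key="$2" dest="$3"
  [ -z "${bucket}" ] && return 1                    # caching disabled: treat as miss
  [ "${PACKER_CACHE_SKIP:-0}" = "1" ] && return 1   # bypass requested: treat as miss
  if aws s3 cp --no-progress "s3://${bucket}/${key}" "${dest}"; then
    echo "Cache hit: s3://${bucket}/${key}"
    return 0
  fi
  echo "Cache miss: s3://${bucket}/${key}"
  return 1
}

# s3_cache_put <bucket> <key> <src>: best-effort upload, always returns 0.
s3_cache_put() {
  local bucket="$1" key="$2" src="$3"
  [ -z "${bucket}" ] && return 0
  [ "${PACKER_CACHE_SKIP:-0}" = "1" ] && return 0
  # Never fail the build on an upload error
  aws s3 cp --no-progress "${src}" "s3://${bucket}/${key}" ||
    echo "WARNING: cache upload failed for ${key} (ignored)"
  return 0
}
```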

### 6. `PACKER_CACHE_SKIP` bypass

Setting `PACKER_CACHE_SKIP=1` skips both cache reads and writes. Useful when:
- Debugging a build issue and wanting a clean compile
- The cache entry is suspected to be corrupted
- Testing that the build still works from scratch

### 7. IAM requirements

The packer instance profile needs:
```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::{PACKER_CACHE_BUCKET}/packer-build-cache/*"
}
```

Note: there is no separate `s3:HeadObject` IAM action; `HeadObject` requests are authorized by `s3:GetObject`.

This will be documented but not automated — the IAM policy for packer instances is managed outside this codebase.

## Risks / Trade-offs

- **Stale BCC cache**: not possible by design, since `BCC_VERSION` is part of the key; bumping the version automatically selects a new cache entry.
- **Corrupted cache entry**: If a cache upload was interrupted, the S3 object may be partial. The `tar` extraction will fail, causing the build to fall back to compiling from scratch (best-effort read handles this).
- **S3 bucket not configured**: If `PACKER_CACHE_BUCKET` is unset, cache operations are skipped entirely and the build proceeds normally. This is the safe default for users who haven't set up caching.
- **Architecture mismatch**: `uname -m` in the key prevents cross-architecture cache hits.
- **Base OS change**: The BCC cache key (`bcc-v{VERSION}-{ARCH}`) does not include the OS release. If the base AMI moves to a new Ubuntu LTS (e.g., 24.04 → 26.04), the existing cache entry would match but the binary may be incompatible. Mitigation: set `PACKER_CACHE_SKIP=1` whenever the base OS is upgraded, or manually delete the affected S3 objects.
37 changes: 37 additions & 0 deletions openspec/changes/2026-03-15-packer-s3-build-cache/proposal.md
@@ -0,0 +1,37 @@
## Why

Packer AMI builds are slow due to repeated compilation of large artifacts whose content never changes between builds. Two key bottlenecks have been identified:

1. **BCC compilation**: `install_bcc.sh` compiles BCC v0.35.0 from source on every build using cmake + make. This C++ compilation takes 15–20 minutes. The version is pinned, so the compiled output is always identical for a given architecture. There is no reason to recompile it.

2. **Cassandra source builds**: `install_cassandra.sh` supports building Cassandra from a git branch (clone + `ant realclean && ant`). When branch-based versions are configured in `cassandra_versions.yaml`, the full build runs from scratch every time, including Maven dependency resolution (which is then explicitly cleared between versions).

AMIs are built infrequently, but the operator provisions clusters from the same AMI repeatedly. Every AMI build currently pays these costs in full even when none of the inputs have changed.

## What Changes

- Add an S3-based binary cache to `install_bcc.sh`. Before compiling, check S3 for a cached tarball keyed by BCC version and architecture. On a cache hit, download and extract instead of compiling. On a miss, compile as normal then upload the result to S3.

- Add an S3-based binary cache to `install_cassandra.sh` for the git-branch build path only. Cache key is derived from the version name and git SHA of the branch tip. Downloaded binary releases (Apache CDN, GitHub releases) do not need caching — they are already pre-built tarballs.

- The S3 bucket name is configurable via the `PACKER_CACHE_BUCKET` environment variable. Cache operations are best-effort: S3 unavailability or a missing object falls back to a normal build.

- Both scripts expose a `PACKER_CACHE_SKIP` environment variable to bypass the cache (useful for forcing a clean rebuild).

## Capabilities

### New Capabilities

- `packer-s3-build-cache`: S3-backed binary artifact cache for compiled packer dependencies. Reduces image build time by 15–40 minutes on cache hits.

### Modified Capabilities

- `packer-base-image`: `install_bcc.sh` now checks/populates S3 cache before compiling BCC.
- `packer-db-image`: `install_cassandra.sh` now checks/populates S3 cache for git-branch Cassandra builds.

## Impact

- **AMI build time**: Reduces to ~5 minutes for BCC (download vs compile) and ~2–5 minutes per Cassandra source version (download vs ant build) on cache hits.
- **First build / cache miss**: No change — builds proceed exactly as today and populate the cache.
- **IAM permissions**: The packer instance role needs `s3:GetObject` and `s3:PutObject` on the cache bucket (`HeadObject` requests are authorized by `s3:GetObject`; no separate action exists).
- **No Kotlin changes**: This is entirely packer/bash scope.
30 changes: 30 additions & 0 deletions openspec/changes/2026-03-15-packer-s3-build-cache/tasks.md
@@ -0,0 +1,30 @@
## 1. Shared S3 cache helper

- [x] 1.1 Create `packer/lib/s3_cache.sh` with `s3_cache_get` and `s3_cache_put` functions
- `s3_cache_get <bucket> <key> <dest>`: downloads `s3://<bucket>/<key>` to `<dest>`, returns 0 on success, 1 on any failure
- `s3_cache_put <bucket> <key> <src>`: uploads `<src>` to `s3://<bucket>/<key>`, best-effort (failure is non-fatal)
- When `PACKER_CACHE_BUCKET` is unset or empty, both functions no-op: `s3_cache_get` returns 1 (treated as a cache miss) and `s3_cache_put` returns 0
- The same no-op behavior applies when `PACKER_CACHE_SKIP=1`

## 2. BCC cache in install_bcc.sh

- [x] 2.1 Source `packer/lib/s3_cache.sh` at the top of `install_bcc.sh`
- [x] 2.2 Compute `CACHE_KEY="packer-build-cache/bcc/bcc-v${BCC_VERSION}-$(uname -m).tar.gz"` before the build
- [x] 2.3 Add cache-read block before the cmake/make section: call `s3_cache_get`, and if successful, extract the tarball to `/` and skip compilation
- [x] 2.4 Add post-install cache-write block: after verifying the Python import, create a tarball of the installed BCC files and call `s3_cache_put`
- [x] 2.5 Verify the existing Python import check (`python3 -c "import bcc"`) still runs on both cache-hit and cache-miss paths

## 3. Cassandra source build cache in install_cassandra.sh

- [x] 3.1 Source `packer/lib/s3_cache.sh` at the top of `install_cassandra.sh`
- [x] 3.2 In the git-branch build path (the `else` branch): after `git clone`, capture `GIT_SHA=$(git -C "$version" rev-parse --short=12 HEAD)`
- [x] 3.3 Compute `CACHE_KEY="packer-build-cache/cassandra/${version}-${GIT_SHA}.tar.gz"`
- [x] 3.4 Add cache-read block: call `s3_cache_get`; if successful, extract the tarball to `/usr/local/cassandra/` and skip the ant build
- [x] 3.5 Add post-build cache-write block: after `sudo mv "$version" "/usr/local/cassandra/$version"`, create a tarball of the installed version directory and call `s3_cache_put`
- [x] 3.6 Ensure the configuration block (conf backup, `cassandra.in.sh` append) still runs on both cache-hit and cache-miss paths

## 4. Documentation

- [x] 4.1 Document `PACKER_CACHE_BUCKET` and `PACKER_CACHE_SKIP` in `packer/README.md`
- [x] 4.2 Document required IAM permissions (`s3:GetObject`, `s3:PutObject`) for the packer instance role
- [x] 4.3 Document the S3 key structure and how to manually invalidate a cache entry (delete the S3 object)
57 changes: 57 additions & 0 deletions packer/README.md
@@ -75,6 +75,63 @@ See [TESTING.md](TESTING.md) for comprehensive testing documentation including:
- CI integration
- Best practices

## S3 Build Cache

Packer scripts support an optional S3-backed binary cache to skip expensive compilation steps on repeat builds.

Two scripts benefit from caching:
- `base/install/install_bcc.sh` — caches compiled BCC artifacts (~15–20 min saved per cache hit)
- `cassandra/install/install_cassandra.sh` — caches git-branch Cassandra builds (~20 min saved per cache hit)

### Environment Variables

| Variable | Description |
|---|---|
| `PACKER_CACHE_BUCKET` | S3 bucket name for the build cache. When unset or empty, cache operations are skipped and builds proceed normally. |
| `PACKER_CACHE_SKIP` | Set to `1` to bypass both cache reads and writes. Useful for forced rebuilds or debugging. |

### S3 Key Structure

All cache entries are stored under the `packer-build-cache/` prefix within the bucket:

| Artifact | Key Pattern | Example |
|---|---|---|
| BCC | `packer-build-cache/bcc/bcc-v{VERSION}-{ARCH}.tar.gz` | `packer-build-cache/bcc/bcc-v0.35.0-x86_64.tar.gz` |
| Cassandra source build | `packer-build-cache/cassandra/{VERSION}-{GIT_SHA}.tar.gz` | `packer-build-cache/cassandra/trunk-a1b2c3d4e5f6.tar.gz` |

### Required IAM Permissions

The packer instance profile needs the following permissions on the cache bucket:

```json
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::{PACKER_CACHE_BUCKET}/packer-build-cache/*"
}
```

(There is no separate `s3:HeadObject` IAM action; `HeadObject` requests are authorized by `s3:GetObject`.)

### Cache Invalidation

Cache entries are content-addressed (version + architecture for BCC; version + git SHA for Cassandra source builds), so they are automatically invalidated when the underlying inputs change.

To manually invalidate a cache entry, delete the corresponding S3 object:

```shell
# Invalidate BCC cache for a specific version and architecture
aws s3 rm "s3://${PACKER_CACHE_BUCKET}/packer-build-cache/bcc/bcc-v0.35.0-x86_64.tar.gz"

# Invalidate a Cassandra source build cache entry
aws s3 rm "s3://${PACKER_CACHE_BUCKET}/packer-build-cache/cassandra/trunk-a1b2c3d4e5f6.tar.gz"

# List all cache entries
aws s3 ls "s3://${PACKER_CACHE_BUCKET}/packer-build-cache/" --recursive
```

### Cache Behavior

Cache operations are best-effort. If the bucket is unavailable, permissions are missing, or a download fails, the build falls back to compiling from scratch. A failed cache write never fails the build.

## Directory Structure

```
12 changes: 9 additions & 3 deletions packer/base/base.pkr.hcl
@@ -115,13 +115,19 @@ build {
}


# install AWS CLI v2 (must run before install_bcc.sh which uses aws s3 cp for caching)
provisioner "shell" {
script = "install/install_bcc.sh"
script = "install/install_awscli.sh"
}

# upload S3 cache helpers before any script that sources them
provisioner "file" {
source = "../lib/s3_cache.sh"
destination = "/tmp/s3_cache.sh"
}

# install BCC (restores from / populates the S3 cache)
provisioner "shell" {
script = "install/install_awscli.sh"
script = "install/install_bcc.sh"
}

# install k3s (disabled, not auto-started)
44 changes: 44 additions & 0 deletions packer/base/install/install_bcc.sh
@@ -3,6 +3,10 @@ set -euo pipefail

echo "=== Running: install_bcc.sh ==="

# Uploaded to /tmp/s3_cache.sh by the packer file provisioner in base.pkr.hcl
# shellcheck source=../../lib/s3_cache.sh
source "/tmp/s3_cache.sh"

BCC_VERSION=0.35.0
WORK_DIR=""

@@ -25,6 +29,32 @@ fi

echo "Installing BCC version ${BCC_VERSION}..."

CACHE_KEY="packer-build-cache/bcc/bcc-v${BCC_VERSION}-$(uname -m).tar.gz"
CACHE_ARCHIVE=$(mktemp --suffix=".tar.gz")

# Try to restore from S3 cache before compiling.
# Use nested if for tar extraction so a corrupted archive falls back to compilation
# rather than aborting the build under set -euo pipefail.
if s3_cache_get "${PACKER_CACHE_BUCKET:-}" "${CACHE_KEY}" "${CACHE_ARCHIVE}"; then
echo "Extracting BCC from cache..."
if sudo tar -xzf "${CACHE_ARCHIVE}" -C /; then
rm -f "${CACHE_ARCHIVE}"
echo "Verifying BCC installation from cache..."
if ! python3 -c "import bcc" 2>/dev/null; then
echo "ERROR: BCC Python module not found after cache restore"
exit 1
fi
echo "BCC ${BCC_VERSION} restored from cache successfully"
echo "✓ install_bcc.sh completed successfully"
exit 0
else
echo "WARNING: Cache archive extraction failed, falling back to compilation"
rm -f "${CACHE_ARCHIVE}"
fi
else
rm -f "${CACHE_ARCHIVE}"
fi

# Remove any existing BCC installations
echo "Removing existing BCC packages..."
sudo apt update
@@ -89,5 +119,19 @@ if ! python3 -c "import bcc" 2>/dev/null; then
exit 1
fi

# Upload compiled artifacts to S3 cache (best-effort)
echo "Creating cache archive of installed BCC artifacts..."
CACHE_ARCHIVE=$(mktemp --suffix=".tar.gz")
# Collect installed paths; libbpf may not always be present
BCC_CACHE_PATHS=(/usr/lib/libbcc* /usr/share/bcc /usr/lib/python3/dist-packages/bcc)
if compgen -G "/usr/lib/libbpf*" > /dev/null 2>&1; then
  BCC_CACHE_PATHS+=(/usr/lib/libbpf*)
fi
# Only upload if the archive was created successfully; a partial archive would
# poison the cache (the read path would fall back, but waste a download)
if sudo tar -czf "${CACHE_ARCHIVE}" "${BCC_CACHE_PATHS[@]}"; then
  # Make archive readable by the non-root user running aws s3 cp
  sudo chmod a+r "${CACHE_ARCHIVE}"
  s3_cache_put "${PACKER_CACHE_BUCKET:-}" "${CACHE_KEY}" "${CACHE_ARCHIVE}"
else
  echo "WARNING: failed to create cache archive, skipping upload"
fi
rm -f "${CACHE_ARCHIVE}"

echo "BCC ${BCC_VERSION} installed successfully"
echo "✓ install_bcc.sh completed successfully"
6 changes: 6 additions & 0 deletions packer/cassandra/cassandra.pkr.hcl
@@ -159,6 +159,12 @@ build {
destination = "/tmp/cassandra.in.sh"
}

# upload S3 cache helpers before install_cassandra.sh which sources them
provisioner "file" {
source = "../lib/s3_cache.sh"
destination = "/tmp/s3_cache.sh"
}

provisioner "shell" {
environment_vars = [
# we need this to be set because install_cassandra checks for it and exits if it's not there