openspec: packer S3 build cache proposal #571
Conversation
Add OpenSpec change proposal for caching compiled packer artifacts in S3 to reduce AMI build time by 15–40 minutes on cache hits.

Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
@claude openspec apply
Claude finished @rustyrazorblade's task in 3m 2s

Applying OpenSpec Change: Packer S3 Build Cache

Summary

Implemented all 4 tasks from the OpenSpec change:
- Add `packer/lib/s3_cache.sh` with `s3_cache_get`/`s3_cache_put` helpers
- Integrate the S3 cache into `install_bcc.sh` (keyed by version + arch)
- Integrate the S3 cache into the `install_cassandra.sh` git-branch build path (keyed by version + git SHA of the branch tip)
- Document `PACKER_CACHE_BUCKET`, `PACKER_CACHE_SKIP`, IAM requirements, and the cache key structure in `packer/README.md`

Cache operations are best-effort: S3 unavailability falls back to a normal build. When `PACKER_CACHE_BUCKET` is unset, the cache is disabled.

Closes #569

Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
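The helper contract described above can be sketched as follows. This is a hypothetical reconstruction of `packer/lib/s3_cache.sh`, not the actual file: the function names and env-var semantics come from the PR description, everything else is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of packer/lib/s3_cache.sh based on the PR description.
# Cache is disabled when PACKER_CACHE_BUCKET is unset or PACKER_CACHE_SKIP=1.

# s3_cache_get KEY DEST -> 0 on a cache hit (archive downloaded to DEST), 1 otherwise.
s3_cache_get() {
    local key="$1" dest="$2"
    [[ -n "${PACKER_CACHE_BUCKET:-}" ]] || return 1     # cache disabled
    [[ "${PACKER_CACHE_SKIP:-0}" != "1" ]] || return 1  # explicit bypass
    aws s3 cp "s3://${PACKER_CACHE_BUCKET}/${key}" "$dest" || return 1
}

# s3_cache_put KEY SRC -> always returns 0; upload failures are non-fatal.
s3_cache_put() {
    local key="$1" src="$2"
    [[ -n "${PACKER_CACHE_BUCKET:-}" ]] || return 0
    [[ "${PACKER_CACHE_SKIP:-0}" != "1" ]] || return 0
    if aws s3 cp "$src" "s3://${PACKER_CACHE_BUCKET}/${key}"; then
        echo "Cache write complete"
    else
        echo "Cache write failed; continuing (non-fatal)" >&2
    fi
    return 0
}
```

A provisioner script would then branch on the hit with `if s3_cache_get "$KEY" "$ARCHIVE"; then … fi` and fall through to a normal build on a miss.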
PR Review: openspec packer S3 build cache proposal

Good proposal overall: the problem is real and the approach is sound. A few issues to address before implementation:

Critical: Script ordering issue in
PR Review: Packer S3 Build Cache - Full Review

Good overall design. The best-effort fallback philosophy is sound and the shared helper pattern in `packer/lib/s3_cache.sh` is clean. A few issues are worth addressing before merging.

Bug 1: Corrupted cache archives will fail the build, not fall back

In `install_bcc.sh`, the cache-restore block runs under `set -euo pipefail`. If a partially uploaded archive passes the `aws s3 cp` download but fails tar extraction, `set -euo pipefail` causes the script to exit immediately instead of falling back to a normal build. The design doc says the tar extraction will fail and the build will fall back to compiling from scratch, but that is not what the code does. The same issue exists on the cache-restore path in `install_cassandra.sh`.

Fix: wrap the tar extraction in its own error handler and fall through to the build on failure, or temporarily disable errexit for the restore block.

Bug 2: Cache archive owned by root may not be readable during upload

In `install_bcc.sh`, `sudo tar -czf` creates the `CACHE_ARCHIVE` file owned by root with restrictive permissions. The `aws s3 cp` in `s3_cache_put` runs as the normal user and will likely get a permission-denied error, which then gets swallowed by the OR-true guard. The cache write silently fails every time.

Fix: either `sudo chmod a+r` the archive before calling `s3_cache_put`, or run the entire `aws s3 cp` under sudo. The same pattern in `install_cassandra.sh` has the same problem.

Minor: Cloned repo not cleaned up on cache hit (`install_cassandra.sh`)

On a cache hit, the version clone directory is left behind in the working directory. For Cassandra repos this can be several hundred MB. An `rm -rf` of the clone directory should be added after a successful cache restore.

Minor: Misleading log message in `s3_cache_put`

The message "Cache write complete (errors are non-fatal)" is printed even when the upload failed. Consider using a conditional to log the actual outcome.
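The fix for Bug 1 can be demonstrated in isolation. The sketch below is illustrative, not the PR's actual code: it shows why running `tar` inside an `if` condition prevents `set -euo pipefail` from aborting on a corrupted archive, and where the `sudo chmod a+r` from Bug 2 would go.

```shell
#!/usr/bin/env bash
# Demonstration: a corrupted cache archive must not abort the script under
# errexit. All variable names and paths here are illustrative.
set -euo pipefail

CACHE_ARCHIVE=$(mktemp)
echo "not a real gzip archive" > "$CACHE_ARCHIVE"   # simulate a truncated upload

CACHE_HIT=0
# Running tar inside the `if` condition suppresses errexit for that command,
# so a bad archive falls through instead of killing the build.
if tar -xzf "$CACHE_ARCHIVE" -C "$(mktemp -d)" 2>/dev/null; then
    CACHE_HIT=1
else
    echo "Corrupted cache archive; falling back to a normal build" >&2
    rm -f "$CACHE_ARCHIVE"
fi

if (( ! CACHE_HIT )); then
    echo "normal build path"
    # After a real build, the upload side would be (sketch):
    #   sudo tar -czf "$CACHE_ARCHIVE" -C / <payload path>
    #   sudo chmod a+r "$CACHE_ARCHIVE"   # Bug 2 fix: readable by non-root aws CLI
    #   s3_cache_put "$CACHE_KEY" "$CACHE_ARCHIVE"
fi
```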
Observation: tasks.md shows unchecked tasks but the implementation is already in the PR

The `tasks.md` file has all items unchecked, but the implementation scripts are already included in this PR. If this is intentional (proposal and implementation together), the tasks should be marked complete. If the tasks are meant to track future work, the implementation files should not be here yet.

Minor: `CACHE_ARCHIVE` temp file not covered by cleanup trap (`install_bcc.sh`)

If the script exits during cache restoration, the `CACHE_ARCHIVE` temp file is not deleted; the existing cleanup trap only handles `WORK_DIR`. Low impact, since packer instances are ephemeral.

Design looks solid otherwise:

- The S3 key design (version + arch for BCC, version + git SHA for Cassandra) is content-addressed correctly.
- The `PACKER_CACHE_SKIP` bypass is a good escape hatch.
- The documentation in `packer/README.md` is thorough and clear.
- The IAM policy in both the docs and the design doc is correct.
- The shared helper avoids duplication correctly.

The two bugs (corrupted-cache fallback under errexit, and the sudo-owned archive not being readable by the aws-cli user) are the most impactful and should be addressed before merging.
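The cleanup-trap gap could be closed with a single EXIT trap covering both temp paths; a sketch, assuming the `WORK_DIR` and `CACHE_ARCHIVE` names used in the review:

```shell
# Sketch: one EXIT trap that removes both the work directory and the cache
# archive temp file (WORK_DIR / CACHE_ARCHIVE are the names from the review).
WORK_DIR=$(mktemp -d)
CACHE_ARCHIVE=$(mktemp)

cleanup() {
    rm -rf "$WORK_DIR"
    rm -f "${CACHE_ARCHIVE:-}"   # :- tolerates the var being unset under set -u
}
trap cleanup EXIT
```

Because the trap fires on any exit, the archive is removed even when the script dies mid-restore.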
@claude address the feedback
Claude finished @rustyrazorblade's task in 4m 30s

Addressed all review feedback. Here's what was changed:

Tasks
- Move `install_awscli.sh` before `install_bcc.sh` in `base.pkr.hcl` so that `aws s3 cp` is available when the cache helpers run
- Add a `provisioner "file"` block to upload `s3_cache.sh` to `/tmp/s3_cache.sh` in both `base.pkr.hcl` and `cassandra.pkr.hcl`; update the scripts to source from this fixed path (the relative `SCRIPT_DIR` path breaks in packer runs, since each script is uploaded to a random `/tmp` location)
- Wrap the tar extraction in a nested `if` so that a corrupted cache archive falls back to a normal build under `set -euo pipefail` instead of aborting
- Add `sudo chmod a+r` on cache archives before passing them to `s3_cache_put` so the non-root aws CLI user can read root-owned tar files
- Clean up the cloned repo on cache hit in `install_cassandra.sh` (saves several hundred MB on the build instance)
- Fix the misleading "Cache write complete" log message in `s3_cache_put` to report the actual success or failure
- Mark all `tasks.md` items as complete
- Update `design.md` to document the file-provisioner deployment, the provisioner ordering requirement, and the OS-change cache invalidation risk

Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
PR Review: Packer S3 Build Cache

Overall this is a solid, well-structured proposal. The design doc is thorough, the shared helper pattern is the right call, and the best-effort semantics are implemented correctly throughout. A few issues to address before merging.

Bug: Cache env vars not forwarded to remote packer shell provisioners

Packer shell provisioners do not inherit the caller's environment variables automatically; only variables listed in `environment_vars` reach the remote script. Neither `PACKER_CACHE_BUCKET` nor `PACKER_CACHE_SKIP` appears in `environment_vars` in either HCL file. The `install_bcc.sh` provisioner has no `environment_vars` at all, and the `install_cassandra.sh` provisioner only forwards `INSTALL_CASSANDRA=1`. The effect: both scripts will always see `PACKER_CACHE_BUCKET` as unset, the helper functions will no-op, and the cache will silently never work even when the user sets the env var locally.

Fix for `base.pkr.hcl` (the same pattern is needed in `cassandra.pkr.hcl`): add both vars to `environment_vars` using packer's `env()` interpolation. `env()` returns the empty string when the variable is unset, which is the correct safe default (cache skipped).

Issue: Hard failure sourcing `/tmp/s3_cache.sh` outside packer

Both scripts unconditionally source `/tmp/s3_cache.sh`. If someone runs a script directly (a local Docker-based packer test, the `testPackerScript` Gradle task, manual debugging), the file won't exist and the script aborts immediately, before doing any real work. TESTING.md and the `testPackerScript` Gradle task specifically support running individual provisioner scripts in isolation; this change silently breaks that workflow. A runtime guard handles both cases: check whether the file exists with `[[ -f /tmp/s3_cache.sh ]]`, and if not, define no-op stubs for `s3_cache_get` (return 1) and `s3_cache_put` (return 0).
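The suggested runtime guard might look like the following sketch at the top of each provisioner script. The stub bodies follow the review's return-code suggestion; the log message is an assumption.

```shell
#!/usr/bin/env bash
# Source the cache helper if packer's file provisioner uploaded it; otherwise
# define no-op stubs so the script still runs standalone (local testing).
if [[ -f /tmp/s3_cache.sh ]]; then
    # shellcheck source=/dev/null
    source /tmp/s3_cache.sh
else
    echo "s3_cache.sh not found; caching disabled" >&2
    s3_cache_get() { return 1; }   # every lookup is a cache miss
    s3_cache_put() { return 0; }   # every upload is a silent no-op
fi
```

With the stubs in place, the rest of the script needs no changes: a miss simply routes it down the normal build path.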
Minor: BCC cache key does not include the OS version

The design doc acknowledges this risk: if the base AMI moves to a new Ubuntu LTS, the existing cache entry will still match by key, but the binary may be incompatible. The documented mitigation is manual (`PACKER_CACHE_SKIP=1`). Adding `lsb_release -rs` to the key (e.g. `bcc-v0.35.0-x86_64-24.04.tar.gz`) would make invalidation automatic on OS upgrades and prevent a class of hard-to-debug runtime failures. Not a blocker, but worth considering.

Nit: Integer flag compared as string

In `install_cassandra.sh`, `CACHE_HIT` is compared as a string with `== 0`. Since it is an integer flag, the arithmetic form `(( ! CACHE_HIT ))` is more idiomatic.

Things done well
The env var forwarding issue would prevent the cache from ever working, and the sourcing issue would break local testing. Both should be fixed before merging.
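The OS-aware cache key suggested in the review could be built like this. The version string, fallback value, and key layout are illustrative, not taken from the PR's code:

```shell
# Illustrative cache-key construction that folds in the OS release so that an
# Ubuntu LTS upgrade invalidates old cache entries automatically.
BCC_VERSION="v0.35.0"                                        # example version
ARCH=$(uname -m)                                             # e.g. x86_64 or aarch64
OS_RELEASE=$(lsb_release -rs 2>/dev/null || echo "unknown")  # e.g. 24.04
CACHE_KEY="bcc-${BCC_VERSION}-${ARCH}-${OS_RELEASE}.tar.gz"
echo "$CACHE_KEY"
```

Falling back to `unknown` keeps the script working on hosts without `lsb_release`, at the cost of a shared key for those builds.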
Add OpenSpec change proposal for caching compiled packer artifacts (BCC, Cassandra source builds) in S3 to reduce AMI build time by 15–40 minutes on cache hits.
Closes #569