Fix amd64 AMI Builds#2229
Open
mmlb wants to merge 15 commits into
Open
Conversation
PostgreSQL Extension Dependency Analysis: PR #2229
SummaryNo extensions had dependencies with MAJOR version updates. Full Analysis ResultsPostgreSQL 15 Extension DependenciesPostgreSQL 17 Extension DependenciesOrioleDB 17 Extension Dependencies |
PostgreSQL Package Dependency Analysis: PR #2229
SummaryNo packages had MAJOR version updates. Full Analysis ResultsPostgreSQL 15 Dependency ChangesExtracting PostgreSQL 15 dependencies...
Runtime Closure Size
Raw Dependency ClosurePostgreSQL 17 Dependency ChangesExtracting PostgreSQL 17 dependencies...
Runtime Closure Size
Raw Dependency Closure |
945edec to
7d7bc35
Compare
Doesn't make sense to have non-breaking space so looks fish to some tools, rightfully so.
No need to set EXECUTION_ID var and then use it for a differently named var later, just export the expected name from the beginning. I also sorted the nix/build-ami args for an easier time comparing to Release AMI Nix workflow.
Release AMI Nix ends up always building stage1 AMI so lets take in a force input that will pass it along as BUILD_AMI_NIX_FORCE_BUILD. Setting the env is behind an `if` since nix/build-ami will actually treat it as a tri-state. Passing in GIT_SHA too so that nix/build-ami can use it for finding the AMI just like Release AMI Nix does.
These are especially helpful for local actions because GHA does not show the jobs/steps within, it all looks like one "Build AMI" step.
This way when we get a hit we know its for exact same inputs. Changes to
build-ami script can cause a different AMI to have been built so should
be part of the INPUT_HASH calculation. Using $0 takes into account any
${packerSources} changes due to normal nix behavior.
Release AMI Nix *always* builds stage1 so lets make sure we have a way to behave similarly. Also need to make sure we find the correct AMI via architecture and sourceSha. Also sync'ing up the filters used to find stage-1, added architecture and sourceSha. The architecture tag filter should be obviously ok/no-op and sourceSha was already always being passed via build-ami action so all the AMIs should have it as long as it was built in CI. It seems redundant to have both sourceSha and inputHash filters since inputHash should always be better than sourceSha, when content changes inputHash changes and sourceSha is obviously different but a different sourceSha could be from changes to unrelated files. I'll take extra builds for now though just to get some clarity here, will make it better in the near future.
In CI it will abort so that the workflow can grab ec2 console output, hopefully catching any kernel messages there. Otherwise set to ask if stdout is a tty so that the caller can poke around if they want to debug. If stdout is not a tty then we will go with the default.
This is mostly just clean ups, it seems better to add things to the env pre/post build instead of having to use the annoyingly long output syntax or re-derive/parse from matrix values or packer files. I renamed the arch specific matrix values to something that reads better, arch makes more sense as a single value representing the architecture and target is a common name for gha matrix grouping. I changed most uses of the various postgres versions from using matrix value or getting from packer file to env vars. This way we use the same name consistently throughout the file and the names are more explicit. Its easy to guess the difference between POSTGRES_MAJOR_VERSION and POSTGRES_SUPABASE_VERSION compared to postgres_version and PG_VERSION.
No need to repeat ourselves when we have a perfectly good abstraction already.
This way testinfra and release-ami workflow/logs are more comparable.
Order things the same, add Debug AWS role secret step, use AWS_REGION env var instead of hard coding the region everywhere... etc just so we can get a better looking diff between them.
This wasn't really necessary so lets revert.
This is causing the plugin's cpu usage to go from ~.6% -> ~600%! See hashicorp/packer-plugin-amazon#676
mmlb
commented
Jun 20, 2026
| if ''${CI:-false}; then | ||
| echo "::notice::Setting packer build -on-error=abort since this is CI, this is different than non-CI runs!" | ||
| on_error=abort | ||
| elif ! [[ -t 1 ]]; then |
Collaborator
Author
There was a problem hiding this comment.
This should be -t 0 actually
samrose
approved these changes
Jun 20, 2026
samrose
left a comment
Collaborator
There was a problem hiding this comment.
Don't know for sure but Potential issue: the release workflow’s path filter no longer covers all files used by the workflow, so some future AMI-build changes may not trigger this release workflow automatically.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What kind of change does this PR introduce?
Bug fix and refactors.
What is the current behavior?
Flaky AMI builds.
What is the new behavior?
No more flaky AMI builds.
Additional context
We've been seeing issues with building the AMI, specifically the
Release AMI Nixworkflow. Took quite some debugging to figure out that we were hit with hashicorp/packer-plugin-amazon#676 ultimately due to packer not supporting lock files and always chasing the newest release possible.The issue boiled down to packer_plugin_amazon causing ~600% cpu load (6 cpus @ 100%) on the blacksmith machines which ended up causing the network to backup until the connection is forced closed by a middleware/hv and the build fail. Looks like arm machines were managing the load well enough to continue passing but we were also affected by the bug there too.
I'm still not sure why
Release AMI Nixseemed to trigger this bug more reliably even after re-working it to use the samebuid-amiaction and thusbuild-aminix package but in the meantime I was able to get a reproducer and narrow down on the issue. I blamed the blacksmith networking (because if its not DNS, its the network) but that was only another symptom (would be nice to have a better indicator/behavior though). Luckily support pointed out that the cpus were at 100% and that caused the network to drop. I don't know how I didn't notice that myself on the runner's metrics page. I recall focusing on network io instead. Once I saw the cpu usage I tried to replicate on my own machine to factor out blacksmith possibly throttling/slow cpu but saw the same behavior which finally led me the bug in question.Fix is to pin to 1.8.0 for now. I'll be working on the CI in general next and will enable renovate once I get it so we aren't running long builds for unrelated changes and are sure that testinfra and co are required to pass for PRs to be mergeable. I am keeping all the "unrelated" cleanups though because they are good and kaizen.