fix: pre-set ECR Lambda pull policy to prevent concurrent SetRepositoryPolicy race condition (#8190)#8945
Conversation
Implements issue aws#3888 to auto-create ECR repositories during packaging, matching sam deploy behavior. Enables package-once, deploy-many CI/CD workflows with managed ECR repos. - Add --resolve-image-repos CLI option to sam package - Call sync_ecr_stack() to auto-create managed ECR repositories - Add validation requiring --s3-bucket when flag is used - Add conflict detection with --image-repositories - Add unit tests for validation logic Closes aws#3888
…add integration test
…ryPolicy race condition (aws#8190) - Add _ensure_ecr_lambda_pull_policy() called before changeset creation to pre-set a stable SAMCliLambdaECRAccess SID on all referenced ECR repos. - Add _upsert_ecr_lambda_policy() to idempotently set/merge the policy, handling AccessDeniedException gracefully and retrying on concurrent SetRepositoryPolicy conflicts (ResourceInUseException). - Add ECRPolicySetError exception for unrecoverable policy failures. - Add 21 unit tests in test_ecr_policy_helpers.py covering all branches. - Patch _ensure_ecr_lambda_pull_policy in TestSamDeployCommand to isolate deploy-flow tests from ECR side-effects. - Fix test_updates_imageuri_when_pointing_to_local_archive: replace fragile CWD-relative file creation with pathlib.Path.is_file mock.
dcbcc82 to
3ae260f
Compare
roger-zhangg
left a comment
There was a problem hiding this comment.
Reviewed the full diff. This is a well-designed fix for the concurrent SetRepositoryPolicy race condition (issue #8190).
What works well:
-
Race condition prevention — Pre-setting the ECR policy before CloudFormation creates the changeset ensures Lambda always has pull access regardless of concurrent writes from CF's per-function
SetRepositoryPolicycalls. -
SID-based idempotency — Using a deterministic
SAMCliLambdaECRAccessSID means repeated deploys safely replace the existing SAM-managed statement without accumulating duplicates. -
Preserves customer policies — The code filters out only the SAM-owned SID and preserves all other existing statements in the policy document.
-
Graceful degradation —
AccessDeniedExceptionis caught and logged as a warning rather than blocking the deployment, which is the right tradeoff: users who lack ECR policy permissions might have manually configured policies already. -
Shared-repo deduplication — The set-based dedup on repo names correctly handles multiple functions sharing a single ECR repo (the exact scenario from issue #8190).
-
Comprehensive tests — 21 new unit tests cover: URI parsing, deduplication logic, no-existing-policy, merge-with-existing, idempotent replace, access denied soft-fail, unexpected error propagation, the 7-function shared-repo scenario, and mixed repos.
Minor notes (non-blocking):
force=Falseonset_repository_policyis correct — it prevents overwriting policies that includeaws:PrincipalOrgIDconditions, which is the safe default.- There is a theoretical TOCTOU window between
get_repository_policyandset_repository_policy, but since this runs once in the CLI before changeset creation (not concurrently), it is not a practical concern. - The unrelated test fix in
test_template.py(replacing CWD-relative file creation with a mock) is a clean improvement.
LGTM.
Problem
Fixes #8190.
When deploying a SAM application with multiple Lambda functions referencing the same or different ECR repositories, CloudFormation calls
ecr:SetRepositoryPolicyconcurrently — once per Lambda. Each call overwrites the existing policy rather than merging it, so whichever write lands last wins and earlier Lambdas lose access, resulting in a 403 on image pull.Solution
Before creating the changeset, SAM CLI now pre-sets a stable
SAMCliLambdaECRAccesspolicy SID on every ECR repository referenced by the deployment (viaImageRepository/ImageRepositories). Because the SID is deterministic, repeated calls are idempotent and safe.Three private helpers are added to
samcli/commands/deploy/deploy_context.py:_extract_ecr_repo_name_ensure_ecr_lambda_pull_policy_upsert_ecr_lambda_policyfor each_upsert_ecr_lambda_policySAMCliLambdaECRAccessstatement; retries on concurrentSetRepositoryPolicyconflicts; skips gracefully onAccessDeniedException_ensure_ecr_lambda_pull_policyis called fromDeployContext.run()immediately beforecreate_and_wait_for_changeset.Changes
samcli/commands/deploy/deploy_context.py— ECR policy helpers + call sitesamcli/commands/deploy/exceptions.py— newECRPolicySetError(UserException)tests/unit/commands/deploy/test_ecr_policy_helpers.py— 21 new unit tests covering all branches of the three helperstests/unit/commands/deploy/test_deploy_context.py— patch_ensure_ecr_lambda_pull_policyat class level to isolate existing deploy tests from ECR side-effectstests/unit/commands/_utils/test_template.py— fixtest_updates_imageuri_when_pointing_to_local_archive: replace fragile CWD-relative file creation (which caused aPermissionErroron macOS) with apathlib.Path.is_filemockTesting
Ruff and mypy also pass (mypy pre-existing errors are unrelated to this change).