[DPC-5227] Add sweeper lambda for cleaning up ECR images#403

Open
lukey-luke wants to merge 34 commits into main from ls/task-dpc-5227-sweeper-lambda

Conversation

@lukey-luke lukey-luke commented Mar 6, 2026

🎫 Ticket

DPC-5227

🛠 Changes

  • This PR adds a new ECR image cleanup lambda function to replace the previous lifecycle policy approach
  • Opt-in deletion:
    • All repos are scanned, but no images are removed unless a repo is explicitly opted in
    • Repositories opted in to lambda-managed cleanup are defined per app and environment in main.tf HERE
    • Terraform will create a corresponding SSM entry to share this information with the Python lambda (/<app>/<env>/ecr-cleanup/repos)
    • Deletion criteria: an image must be ...
      • Not running in a live ECS container
      • Outside the 5 most recent for its repo
      • Older than 14 days
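The three criteria above amount to a filter along these lines (a sketch only: `select_images_for_deletion` is a hypothetical helper, field names follow the shape of ECR `describe_images` results, and the constants mirror the defaults stated above):

```python
from datetime import datetime, timedelta, timezone

KEEP_COUNT = 5     # always keep the 5 most recent images per repo
MAX_AGE_DAYS = 14  # only delete images older than 14 days


def select_images_for_deletion(images, running_digests, now=None):
    """Return the image dicts that meet all three deletion criteria.

    images: dicts with 'imageDigest' and 'imagePushedAt' (tz-aware datetime),
    matching the shape of ECR describe_images results.
    running_digests: digests referenced by live ECS containers.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=MAX_AGE_DAYS)
    # Sort newest first so the first KEEP_COUNT entries are always retained
    by_recency = sorted(images, key=lambda i: i['imagePushedAt'], reverse=True)
    return [
        img for img in by_recency[KEEP_COUNT:]
        if img['imageDigest'] not in running_digests
        and img['imagePushedAt'] < cutoff
    ]
```

An image survives if any one criterion protects it: recency, age, or being in use.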

ℹ️ Context

  • The existing ECR lifecycle policy for dpc is currently located HERE.
  • There were two incidents related to the lifecycle policy. This technical spike outlines the options explored for image management, in order to avoid the same class of issues going forward. Confluence link HERE
  • Additional context on slack thread HERE
  • Additional work to review logs from dry run and increase repositories opted in is captured in a follow-up ticket: DPC-5259
  • Additional work to improve code and make things more extensible: DPC-5263

🧪 Validation

Manually verified functionality in the test env:

cd terraform/services/ecr-cleanup
tofu init -backend-config=../../backends/dpc-test.s3.tfbackend -reconfigure
tofu plan -var env=test -var app=dpc
tofu apply -var app=dpc -var env=test
aws lambda invoke --function-name dpc-test-ecr-cleanup --region us-east-1 response.json   

Screenshot 2026-03-10 at 12 23 42 PM

See additional logs at /aws/lambda/dpc-test-ecr-cleanup on CloudWatch HERE

Added tests to cover the Python code:

cd terraform/services/ecr-cleanup/lambda_src
pytest .

@lukey-luke lukey-luke requested a review from a team as a code owner March 6, 2026 23:46
@lukey-luke
Author

Note: CI failure for scripts/tofu-plan HERE

Looks like we're getting rate limited.

  │ Error: Failed to install provider
  │ 
  │ Error while installing hashicorp/aws v5.100.0: github.com: GET
  │ https://github.com/opentofu/terraform-provider-aws/releases/download/v5.100.0/terraform-provider-aws_5.100.0_linux_amd64.zip
  │ giving up after 3 attempt(s)
  ╵

Tofu plan successful for all dpc environments

@lukey-luke lukey-luke requested a review from a team March 7, 2026 00:06

@Jose-verdance Jose-verdance left a comment

Hey Luke, this looks really good, just some minor suggestions!

for page in paginator.paginate():
    for repo in page['repositories']:
        name = repo['repositoryName']
        if name.startswith(f'{app}-'):

Hey @lukey-luke, since we have the set of opted in repos, why not use that to match the repos?

Author

One of the requests from @mianava was to log repositories that would be subject to removal as a "dry run" before opting into a list of repos to delete with the new lambda.

Because DPC shares an account with BCDA, I thought it would be most straightforward to use account + prefix to look at "all images for DPC that we might want to opt in" and go from there. I'm open to other suggestions if you think there's a better way to capture repos before deleting them @Jose-verdance

@@ -0,0 +1 @@
TARGET_ENVS="dpc-dev dpc-test dpc-sandbox dpc-prod"
Contributor

Shouldn't this only be test and prod?

Author

I copied this from the api-waf conf.sh HERE.

I believe this is correct if we're rolling out for just DPC first. However, it looks like the bcda- prefix apparently works for an account-wide tofu plan:

if [[ "$APP" == "bcda" && ("$ENV" == "test" || "$ENV" == "prod") ]]; then

Contributor

api-waf runs a separate instance for each environment. There is no difference between 'dev' and 'test' (or 'sandbox' and 'prod') in terms of the ECR repositories. Unlike api-waf, it would be redundant to run this lambda for 'dev' and 'test'.

@@ -0,0 +1,23 @@
variable "app" {
Contributor

I thought we were just having one lambda for all the teams?

Author

I think the goal is to have a working example for DPC and then this can be extended to other teams.

Note: first pass is to only include a dry run with repositories logged. We probably want to have this deployed in production as a working example and then carry that over.

I drafted a follow-up item to review logs for ecr-cleanup lambda here: DPC-5259

description = "The application environment"
type        = string
validation {
  condition = contains(["dev", "test", "sandbox", "prod"], var.env)
Contributor

Shouldn't this be test or prod only? We don't have distinctions between dev/test and sandbox/prod repositories.

Author

We originally discussed using existing tag prefixes (e.g. the "rls-r" format for production images), but after starting this work it seems that image cleanup can be handled entirely by the lambda, avoiding the need for a lifecycle policy, which has been subject to 2 recent incidents - 1 of which was a dev image being deployed to prod.

Contributor

Sorry, I don't follow the connection between my statement and your response. Please see my clarification in conf.sh

import boto3
from botocore.exceptions import ClientError

KEEP_COUNT = 5
Contributor

Both BCDA and DPC use different strategies based on tag prefix, so this universal rule set seems a bit constraining.

Author

This is a default value and can be made configurable by other teams if needed, but it doesn't have to change to make the lambda work for DPC.

Contributor

The current lifecycle policy protects 5 release images and deletes all other images after 14 days. The global policy implemented here does not match that. I think it would be better to have a way to implement these sorts of differentials (as I know that BCDA will require it) here. However, as a first pass, we can leave it as is and resolve it in another ticket if you can't complete such changes this sprint.

Contributor

Yes! Retrieving these from parameters is preferred.

Making these configurable makes sense but I am also ok with updating this as part of a separate ticket to include any other changes related to making this lambda adaptable for other teams.

@jdettmannnava
Contributor

Fails pylint on my machine

@lukey-luke
Author

After some additional discussion, it seems defining a list of repositories based on env + app should suffice to enable teams to opt in to image cleanup. I've updated the PR description with an example featuring logs from the dpc nonprod account.

I've drafted another follow-up item: DPC-5263 - Review configuration options for ECR cleanup

@lukey-luke lukey-luke requested review from a team and jdettmannnava March 10, 2026 19:46
Contributor

@mianava mianava left a comment

This looks great! Switching this to cdap-test for evaluation will enable us to extend the use of one function used across repos and apps instead of many functions.

@@ -0,0 +1 @@
TARGET_ENVS="dpc-test dpc-prod"
Contributor

I believe we'll want this issued in cdap-test and cdap-prod so we just have 1 lambda (per account).

  "ecr:BatchDeleteImage",
]
resources = [
  "arn:aws:ecr:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:repository/*"
Contributor

Let's flip this so it comes from a data block.

Author

@mianava can you clarify the ask here?

Are you saying to move statements for both SSMAccess and ECRAccess to separate blocks similar to the way iam_policy_document is managed for KMS keys HERE

@mianava Did you mean to say a local block?

Contributor

No, I mean from a data block. You can look up the ARN for the ECR repositories using:

data "aws_ecr_repository" "this" {
  for_each = toset(var.ecr_repo_names)
  name     = each.key
}

This ensures the repositories exist and the permission is accurate.

Comment on lines +12 to +13
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
Contributor

We can favor the 'standards' module here in place of these.

Contributor

This may not be needed as we can set the function up under "cdap" and will not require these values in ecr_repository lookups.

Author

No longer needed: I've specified the ARN for SSM directly and used for_each in a separate data block for repositories HERE

name        = "/${var.app}/${var.env}/ecr-cleanup/repos"
type        = "SecureString"
description = "Comma-separated list of ECR repository names to clean up"
value       = jsonencode(local.repo_list_by_app_env[var.app][var.env])
Contributor

In the status quo, we might get overwrites from one or the other. I would favor using Terraform for this for version control - removing the SSM parameter altogether.

Author

Covered in follow-up item: DPC-5263

@lukey-luke lukey-luke requested a review from mianava March 11, 2026 20:38
mianava
mianava previously approved these changes Mar 11, 2026
    """
    response = client.get_parameter(Name=ssm_param_name, WithDecryption=True)
    value = response['Parameter']['Value']
    return json.loads(value)

After looking at this further, I think we should check whether client.get_parameter encounters an error due to not finding the parameter, or whether it returns an empty value. We should fail gracefully, print to the console, and potentially even alert us if this happens. If you're creating a follow-up ticket for expanding the lambda further, we can move alerting off this ticket's scope, but we should still handle those scenarios and include unit tests to cover that as well.


@Jose-verdance Jose-verdance left a comment

Hey Luke, this is looking great, I just left some comments with additional suggestions and things that can be done in a follow-up to avoid scope creep!

        log_images_for_deletion(repo, to_delete)
        log({'msg': f'Cleanup complete for repo: {repo}'})
    except ClientError as e:
        log({'msg': f'Error processing repo {repo}: {e}', 'repo': repo})

Thinking about this further, this lambda is going to be very important moving forward and should probably be doing more than logging that it failed.
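One way to go beyond logging (a sketch, not the PR's implementation: `clean_one` stands in for the per-repo cleanup, and the final raise marks the Lambda invocation as failed, so a CloudWatch alarm on the function's Errors metric could alert on it):

```python
def clean_repos(repos, clean_one):
    """Run cleanup for each repo, collecting failures instead of stopping
    at the first one, then fail the invocation if anything went wrong."""
    failures = {}
    for repo in repos:
        try:
            clean_one(repo)
        except Exception as e:  # broad on purpose: one bad repo shouldn't skip the rest
            failures[repo] = str(e)
            print({'msg': f'Error processing repo {repo}: {e}', 'repo': repo})
    if failures:
        # Raising here makes the Lambda invocation itself report as failed
        raise RuntimeError(f'ECR cleanup failed for {sorted(failures)}')
```

This keeps the "process every repo" behavior of the current loop while still surfacing failures to the invoker.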
