[DPC-5227] Add sweeper lambda for cleaning up ECR images#403

Open
lukey-luke wants to merge 34 commits into main from ls/task-dpc-5227-sweeper-lambda

Conversation

@lukey-luke lukey-luke commented Mar 6, 2026

🎫 Ticket

DPC-5227

🛠 Changes

  • This PR adds a new ECR image cleanup lambda function to replace the previous lifecycle policy approach
  • Opt-in deletion:
    • All repos are scanned, but no images are removed unless a repo is explicitly opted in
    • Repositories opted in to lambda-managed cleanup are defined per app and environment in main.tf HERE
    • Terraform will create a corresponding SSM entry to share this information with the Python lambda (/<app>/<env>/ecr-cleanup/repos)
    • Deletion criteria: an image must be ...
      • Not running in a live ECS container
      • Outside the 5 most recent for its repo
      • Older than 14 days
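The three criteria above amount to a filter along these lines (a sketch only: `select_images_for_deletion` is a hypothetical helper, field names follow the shape of ECR `describe_images` results, and the constants mirror the defaults stated above):

```python
from datetime import datetime, timedelta, timezone

KEEP_COUNT = 5     # always keep the 5 most recent images per repo
MAX_AGE_DAYS = 14  # only delete images older than 14 days


def select_images_for_deletion(images, running_digests, now=None):
    """Return the image dicts that meet all three deletion criteria.

    images: dicts with 'imageDigest' and 'imagePushedAt' (tz-aware datetime),
    matching the shape of ECR describe_images results.
    running_digests: digests referenced by live ECS containers.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=MAX_AGE_DAYS)
    # Sort newest first so the first KEEP_COUNT entries are always retained
    by_recency = sorted(images, key=lambda i: i['imagePushedAt'], reverse=True)
    return [
        img for img in by_recency[KEEP_COUNT:]
        if img['imageDigest'] not in running_digests
        and img['imagePushedAt'] < cutoff
    ]
```

An image survives if any one criterion protects it: recency, age, or being in use.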

ℹ️ Context

  • The existing ECR lifecycle policy for dpc is currently located HERE.
  • There were two incidents related to the lifecycle policy. This technical spike outlines the options explored for image management, in order to avoid the same class of issues going forward. Confluence link HERE
  • Additional context on slack thread HERE
  • Additional work to review logs from dry run and increase repositories opted in is captured in a follow-up ticket: DPC-5259
  • Additional work to improve code and make things more extensible: DPC-5263

🧪 Validation

Manually verified functionality in the test env:

cd terraform/services/ecr-cleanup
tofu init -backend-config=../../backends/dpc-test.s3.tfbackend -reconfigure
tofu plan -var env=test -var app=dpc
tofu apply -var app=dpc -var env=test
aws lambda invoke --function-name dpc-test-ecr-cleanup --region us-east-1 response.json   

Screenshot 2026-03-10 at 12 23 42 PM

See additional logs at /aws/lambda/dpc-test-ecr-cleanup on CloudWatch HERE

Added tests to cover the Python code:

cd terraform/services/ecr-cleanup/lambda_src
pytest .

@lukey-luke lukey-luke requested a review from a team as a code owner March 6, 2026 23:46
@lukey-luke
Author

Note: CI failure for scripts/tofu-plan HERE

Looks like we're getting rate limited.

  │ Error: Failed to install provider
  │ 
  │ Error while installing hashicorp/aws v5.100.0: github.com: GET
  │ https://github.com/opentofu/terraform-provider-aws/releases/download/v5.100.0/terraform-provider-aws_5.100.0_linux_amd64.zip
  │ giving up after 3 attempt(s)
  ╵

Tofu plan successful for all dpc environments

@lukey-luke lukey-luke requested a review from a team March 7, 2026 00:06

@Jose-verdance Jose-verdance left a comment

Hey Luke, this looks really good, just some minor suggestions!

for page in paginator.paginate():
    for repo in page['repositories']:
        name = repo['repositoryName']
        if name.startswith(f'{app}-'):

Hey @lukey-luke, since we have the set of opted in repos, why not use that to match the repos?

Author

One of the requests from @mianava was to log repositories that would be subject to removal as a "dry run" before opting into a list of repos to delete with the new lambda.

Because DPC shares an account with BCDA, I thought it would be most straightforward to use account + prefix to look at "all images for DPC that we might want to opt in" and go from there. I'm open to other suggestions if you think there's a better way to capture repos before deleting them @Jose-verdance

@@ -0,0 +1 @@
TARGET_ENVS="dpc-dev dpc-test dpc-sandbox dpc-prod"
Contributor

Shouldn't this only be test and prod?

Author

I copied this from the api-waf conf.sh HERE.

I believe this is correct if we're rolling out for just DPC first. However, it looks like the bcda- prefix apparently works for an account-wide tofu plan:

if [[ "$APP" == "bcda" && ("$ENV" == "test" || "$ENV" == "prod") ]]; then

Contributor

api-waf runs a separate instance for each environment. There is no difference between 'dev' and 'test' (or 'sandbox' and 'prod') in terms of the ECR repositories. Unlike api-waf, it would be redundant to run this lambda for 'dev' and 'test'.

@@ -0,0 +1,23 @@
variable "app" {
Contributor

I thought we were just having one lambda for all the teams?

Author

I think the goal is to have a working example for DPC and then this can be extended to other teams.

Note: first pass is to only include a dry run with repositories logged. We probably want to have this deployed in production as a working example and then carry that over.

I drafted a follow-up item to review logs for ecr-cleanup lambda here: DPC-5259

description = "The application environment"
type        = string
validation {
  condition = contains(["dev", "test", "sandbox", "prod"], var.env)
Contributor

Shouldn't this be test or prod only? We don't have distinctions between dev/test and sandbox/prod repositories.

Author

We originally discussed using existing tag prefixes (e.g. the "rls-r" format for production images), but after starting this work it seems that image cleanup can be handled entirely by the lambda, avoiding the need for a lifecycle policy, which has been subject to 2 recent incidents - 1 of which was a dev image being deployed to prod.

Contributor

Sorry, I don't follow the connection between my statement and your response. Please see my clarification in conf.sh

import boto3
from botocore.exceptions import ClientError

KEEP_COUNT = 5
Contributor

Both BCDA and DPC use different strategies based on tag prefix, so this universal rule set seems a bit constraining.

Author

This is a default value and can be made configurable by other teams if needed, but it doesn't have to change to make the lambda work for DPC.

Contributor

The current lifecycle policy protects 5 release images and deletes all other images after 14 days. The global policy implemented here does not match that. I think it would be better to have a way to implement these sorts of differentials (as I know that BCDA will require it) here. However, as a first pass, we can leave it as is and resolve it in another ticket if you can't complete such changes this sprint.

Contributor

Yes! Retrieving these from parameters is preferred.

Making these configurable makes sense but I am also ok with updating this as part of a separate ticket to include any other changes related to making this lambda adaptable for other teams.

@jdettmannnava
Contributor

Fails pylint on my machine

@lukey-luke
Author

After some additional discussion, it seems defining a list of repositories based on env + app should suffice to enable teams to opt in to image cleanup. I've updated the PR description with an example featuring logs from the dpc nonprod account.

I've drafted another follow-up item: DPC-5263 - Review configuration options for ECR cleanup

@lukey-luke lukey-luke requested review from a team and jdettmannnava March 10, 2026 19:46
Contributor

@mianava mianava left a comment

This looks great! Switching this to cdap-test for evaluation will enable us to extend the use of one function used across repos and apps instead of many functions.

@@ -0,0 +1 @@
TARGET_ENVS="dpc-test dpc-prod"
Contributor

I believe we'll want this issued in cdap-test and cdap-prod so we just have 1 lambda (per account).

  "ecr:BatchDeleteImage",
]
resources = [
  "arn:aws:ecr:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:repository/*"
Contributor

Let's flip this so it comes from a data block.

Author

@mianava can you clarify the ask here?

Are you saying to move statements for both SSMAccess and ECRAccess to separate blocks similar to the way iam_policy_document is managed for KMS keys HERE

@mianava Did you mean to say a local block?

Contributor

No, I mean from a data block. You can look up the ARN for the ECR repositories using:

data "aws_ecr_repository" "this" {
  for_each = toset(var.ecr_repo_names)
  name     = each.key
}

This ensures the repositories exist and the permission is accurate.

Comment on lines +12 to +13
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
Contributor

We can favor the 'standards' module here in place of these.

Contributor

This may not be needed as we can set the function up under "cdap" and will not require these values in ecr_repository lookups.

Author

No longer needed: I've specified the ARN for SSM directly and used for_each in a separate data block for repositories HERE

name        = "/${var.app}/${var.env}/ecr-cleanup/repos"
type        = "SecureString"
description = "Comma-separated list of ECR repository names to clean up"
value       = jsonencode(local.repo_list_by_app_env[var.app][var.env])
Contributor

In the status quo, we might get overwrites from one or the other. I would favor using Terraform for this for version control - removing the SSM parameter altogether.

Author

Covered in follow-up item: DPC-5263

@lukey-luke lukey-luke requested a review from mianava March 11, 2026 20:38
mianava
mianava previously approved these changes Mar 11, 2026
    """
    response = client.get_parameter(Name=ssm_param_name, WithDecryption=True)
    value = response['Parameter']['Value']
    return json.loads(value)

After looking at this further, I think we should check whether client.get_parameter encounters an error due to not finding the parameter, or whether it returns an empty value. We should fail gracefully, print to the console, and potentially even alert us if this happens. If you're creating a follow-up ticket for expanding the lambda further, we can move alerting off this ticket's scope, but we should still handle those scenarios and include unit tests to cover that as well.


@Jose-verdance Jose-verdance left a comment

Hey Luke, this is looking great, I just left some comments with additional suggestions and things that can be done in a follow-up to avoid scope creep!

        log_images_for_deletion(repo, to_delete)
        log({'msg': f'Cleanup complete for repo: {repo}'})
    except ClientError as e:
        log({'msg': f'Error processing repo {repo}: {e}', 'repo': repo})

Thinking about this further, this lambda is going to be very important moving forward and should probably be doing more than logging that it failed.
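One way to go beyond logging (a sketch, not the PR's implementation: `clean_one` stands in for the per-repo cleanup, and the final raise marks the Lambda invocation as failed, so a CloudWatch alarm on the function's Errors metric could alert on it):

```python
def clean_repos(repos, clean_one):
    """Run cleanup for each repo, collecting failures instead of stopping
    at the first one, then fail the invocation if anything went wrong."""
    failures = {}
    for repo in repos:
        try:
            clean_one(repo)
        except Exception as e:  # broad on purpose: one bad repo shouldn't skip the rest
            failures[repo] = str(e)
            print({'msg': f'Error processing repo {repo}: {e}', 'repo': repo})
    if failures:
        # Raising here makes the Lambda invocation itself report as failed
        raise RuntimeError(f'ECR cleanup failed for {sorted(failures)}')
```

This keeps the "process every repo" behavior of the current loop while still surfacing failures to the invoker.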
