Skip to content
Open
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
31327c1
add sweeper lambda to replace ECR policy
lukey-luke Mar 2, 2026
631cb70
simplify logic to delete images
lukey-luke Mar 5, 2026
b4e9474
update get_parameter() to work w/ sops params
lukey-luke Mar 6, 2026
6fa7acd
pass in repo_list as a variable
lukey-luke Mar 6, 2026
96590a0
use archive provider for deploying lambda
lukey-luke Mar 6, 2026
0943dc3
fix way repo_list is passed in to create param
lukey-luke Mar 6, 2026
c946d4b
add a README with instruction to test and deploy
lukey-luke Mar 6, 2026
5c9b179
update to log all repos except delete opt-ins
lukey-luke Mar 6, 2026
9bb60b2
fix log message to reflect protected images
lukey-luke Mar 6, 2026
917e769
add default value for repo_list
lukey-luke Mar 6, 2026
54a1cc2
update python code based on PR feedback
lukey-luke Mar 9, 2026
3ef89b3
Apply suggestion from @jdettmannnava
lukey-luke Mar 9, 2026
3d89f19
include repository list in locals for main.tf
lukey-luke Mar 9, 2026
6ec405f
Merge remote-tracking branch 'origin/ls/task-dpc-5227-sweeper-lambda'…
lukey-luke Mar 9, 2026
44dbff5
only use test and prod envs for managing ECR images
lukey-luke Mar 10, 2026
15313b1
improve test coverage - add mock for ECR batch delete calls
lukey-luke Mar 10, 2026
59eeb24
remove keys for dev and sandbox - not used anymore
lukey-luke Mar 10, 2026
1799267
update lambda to reduce time window to max 14 days
lukey-luke Mar 10, 2026
3192133
update ECS policy to enable detecting which images are actively runni…
lukey-luke Mar 10, 2026
fbaf578
Merge branch 'main' into ls/task-dpc-5227-sweeper-lambda
lukey-luke Mar 10, 2026
8c043cc
formatting
lukey-luke Mar 10, 2026
9803b96
Merge remote-tracking branch 'origin/ls/task-dpc-5227-sweeper-lambda'…
lukey-luke Mar 10, 2026
98a1174
add default value for repos array in function body
lukey-luke Mar 10, 2026
150835e
use 30 days for threshold for when images are deleted
lukey-luke Mar 11, 2026
d422c78
Apply suggestions from code review
lukey-luke Mar 11, 2026
5a97873
make more general for repo list to apply to entire account
lukey-luke Mar 11, 2026
ceb9a61
break out policy documents and avoid string interpolation for specify…
lukey-luke Mar 11, 2026
4c357e9
update README with instructions for manually invoking lambda
lukey-luke Mar 11, 2026
2ab74f6
fix permission issue for get_all_repos()
lukey-luke Mar 11, 2026
f391f41
allow python lambda to read repos that would be deleted after opting in
lukey-luke Mar 11, 2026
6718684
tofu formatting
lukey-luke Mar 11, 2026
16ff0f0
use cdap app for tofu apply
lukey-luke Mar 11, 2026
b2f8ac1
add ListTagsForResource to gha policy
lukey-luke Mar 11, 2026
9e8cab2
add exception handling for fails SSM param access
lukey-luke Mar 11, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,4 @@ terraform.tfvars
!.terraform-docs.yml

__pycache__
venv/
31 changes: 31 additions & 0 deletions terraform/services/ecr-cleanup/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
# OpenTofu for ecr-cleanup function and associated infra

This service sets up the infrastructure for the ecr-cleanup lambda function, which runs nightly to delete old ECR images while protecting any image referenced by an active ECS task definition.

## Updating the lambda code

The executable for this lambda is in lambda_src. It must pass both pylint and pytest checks.

### Run the tests
```bash
cd lambda_src
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
pip install pylint pytest
pylint lambda_function.py
pytest .
```

## Configuring repositories

Pass the list of ECR repositories to manage via the `repo_list` variable. Terraform will create and manage the SSM parameter.

## Manual deploy

Pass in a backend file when running terraform init. Example:

```bash
export AWS_REGION=us-east-1
tofu init -backend-config=../../backends/dpc-test.s3.tfbackend
tofu apply -var app=dpc -var env=test
```
1 change: 1 addition & 0 deletions terraform/services/ecr-cleanup/conf.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
TARGET_ENVS="dpc-test dpc-prod"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we'll want this issued in cdap-test and cdap-prod so we just have 1 lambda (per account).

169 changes: 169 additions & 0 deletions terraform/services/ecr-cleanup/lambda_src/lambda_function.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,169 @@
"""
Cleans up old ECR images while protecting images referenced by currently running ECS tasks.
"""

from datetime import datetime, timedelta, timezone
import json
import os
import boto3
from botocore.exceptions import ClientError

KEEP_COUNT = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both BCDA and DPC use different strategies based on tag prefix, so this universal rule set seems a bit constraining.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a default value and can be extended to be configurable by other teams if needed, but this doesn't have to change to make lambda work for DPC

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current lifecycle policy protects 5 release images and deletes all other images after 14 days. The global policy implemented here does not match that. I think it would be better to have a way to implement these sorts of differentials (as I know that BCDA will require it) here. However, as a first pass, we can leave it as is and resolve it in another ticket if you can't complete such changes this sprint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Retrieving these from parameters is preferred.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making these configurable makes sense but I am also ok with updating this as part of a separate ticket to include any other changes related to making this lambda adaptable for other teams.

MAX_AGE_DAYS = 30
AWS_BATCH_SIZE = 100

ssm_client = boto3.client('ssm')
ecs_client = boto3.client('ecs')
ecr_client = boto3.client('ecr')


def log(data):
"""Adds a UTC timestamp to data and prints it as a JSON log line."""
data['time'] = datetime.now(timezone.utc).isoformat()
print(json.dumps(data, default=str))


def parse_image_ref(image_uri):
"""Parses an ECR image URI and returns a (repo_name, ref) tuple
where ref is a digest, tag, or 'latest'."""
if '@' in image_uri:
repo_part, digest = image_uri.split('@', 1)
repo_name = repo_part.split('/')[-1]
return repo_name, digest

last_component = image_uri.split('/')[-1]

if ':' in last_component:
repo_name, tag = last_component.rsplit(':', 1)
return repo_name, tag

return last_component, 'latest'


def get_images_to_delete(client, repo_name, protected_refs):
"""
Fetches all images for repo_name and returns those eligible for deletion:
1. Not referenced in protected_refs,
2. Outside the KEEP_COUNT newest
3. Older than MAX_AGE_DAYS.
"""
cutoff = datetime.now(timezone.utc) - timedelta(days=MAX_AGE_DAYS)

images = []
paginator = client.get_paginator('describe_images')
for page in paginator.paginate(repositoryName=repo_name):
images.extend(page['imageDetails'])

candidates = []
for img in images:
digest = img['imageDigest']
tags = img.get('imageTags', [])
if digest not in protected_refs and not any(tag in protected_refs for tag in tags):
candidates.append(img)

candidates.sort(key=lambda x: x['imagePushedAt'], reverse=True)

remainder = candidates[KEEP_COUNT:]

return [img for img in remainder if img['imagePushedAt'] < cutoff]


def get_repo_list(client, ssm_param_name):
"""
Read an SSM SecureString parameter and return a list of ECR repository names.
Note: this uses SecureString to maintain compatibility with the existing SOPS mechanism.
"""
response = client.get_parameter(Name=ssm_param_name, WithDecryption=True)
value = response['Parameter']['Value']
return json.loads(value)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at this further, I think we should check if client.get_parameter encounters an error due to not finding the parameter or if it returns an empty value. We should fail gracefully and print to the console and potentially even alert us if this happens. If your creating a follow up ticket for expanding the lambda further we can move alerting off this tickets scope, but we should still handle those scenarios and include unit test to cover that as well.



def get_all_repos(client, app):
"""Returns all ECR repository names that belong to the given app prefix."""
repos = set()
paginator = client.get_paginator('describe_repositories')
for page in paginator.paginate():
for repo in page['repositories']:
name = repo['repositoryName']
if name.startswith(f'{app}-'):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lukey-luke, since we have the set of opted in repos, why not use that to match the repos?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the requests from @mianava was to log repostories that would be subject to removal as a "dry run" before opting into a list of repo's to delete with the new lambda.

Because DPC shares an account with BCDA, I thought it would be most straightforward to use account + prefix to look at "all images for DPC that we might want to opt in" and go from there. I'm open to other suggestions if you think there's a better way to capture repos before deleting them @Jose-verdance

repos.add(name)
return repos


def get_protected_image_refs(client):
"""
Return a set of all image tags and digests referenced by currently RUNNING ECS tasks.
"""
refs = set()

cluster_arns = []
for page in client.get_paginator('list_clusters').paginate():
cluster_arns.extend(page['clusterArns'])

for cluster_arn in cluster_arns:
task_arns = []
for page in client.get_paginator('list_tasks').paginate(
cluster=cluster_arn, desiredStatus='RUNNING'
):
task_arns.extend(page['taskArns'])

for i in range(0, len(task_arns), AWS_BATCH_SIZE):
batch = task_arns[i:i + AWS_BATCH_SIZE]
try:
resp = client.describe_tasks(cluster=cluster_arn, tasks=batch)
for task in resp['tasks']:
for container in task.get('containers', []):
_, ref = parse_image_ref(container.get('image', ''))
refs.add(ref)
except ClientError as e:
log({'msg': f'Error describing tasks in cluster {cluster_arn}: {e}'})

return refs


def delete_images(client, repo_name, images):
"""Deletes the given images from the ECR repository in batches."""
image_ids = [{'imageDigest': img['imageDigest']} for img in images]
for i in range(0, len(image_ids), AWS_BATCH_SIZE):
batch = image_ids[i:i + AWS_BATCH_SIZE]
client.batch_delete_image(repositoryName=repo_name, imageIds=batch)

def log_images_for_deletion(repo, images):
"""Logs images that would be deleted if the repo were opted in."""
for img in images:
log({'msg': 'Would delete image (not opted in)', 'repo': repo,
'digest': img['imageDigest']})

def lambda_handler(event, context): # pylint: disable=unused-argument
"""
Main entry point for lambda function.
Reads configured repos from SSM, which are opted in for lambda to clean up.
Reviews active ECS task definitions, then deletes eligible images that
are old enough and no longer running.
For repos associated with the app but not opted in, logs images that would
be deleted without taking action.
"""
ssm_param = f"/{os.environ['APP']}/{os.environ['ENV']}/ecr-cleanup/repos"

opted_in = set(get_repo_list(ssm_client, ssm_param))
all_repos = get_all_repos(ecr_client, os.environ['APP'])
log({'msg': 'Collected repository list for app',
'app': os.environ['APP'],
'repos': list(all_repos)})

protected_refs = get_protected_image_refs(ecs_client)
log({'msg': 'Built protected image set from running ECS tasks',
'repos': list(protected_refs)})

for repo in all_repos:
try:
to_delete = get_images_to_delete(ecr_client, repo, protected_refs)

if len(to_delete):
if repo in opted_in:
delete_images(ecr_client, repo, to_delete)
else:
log_images_for_deletion(repo, to_delete)
log({'msg': f'Cleanup complete for repo: {repo}'})
except ClientError as e:
log({'msg': f'Error processing repo {repo}: {e}', 'repo': repo})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this further, this lambda is going to be very important moving forward and should probably be doing more than logging that it failed.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
boto3==1.40.52
Loading
Loading