[DPC-5227] Add sweeper lambda for cleaning up ECR images#403
[DPC-5227] Add sweeper lambda for cleaning up ECR images#403lukey-luke wants to merge 34 commits intomainfrom
Conversation
|
Note: CI failure for scripts/tofu-plan HERE Looks like we're getting rate limited. Tofu plan successful for all dpc environments |
Jose-verdance
left a comment
There was a problem hiding this comment.
Hey Luke, this looks really good, just some minor suggestion!
| for page in paginator.paginate(): | ||
| for repo in page['repositories']: | ||
| name = repo['repositoryName'] | ||
| if name.startswith(f'{app}-'): |
There was a problem hiding this comment.
Hey @lukey-luke, since we have the set of opted in repos, why not use that to match the repos?
There was a problem hiding this comment.
One of the requests from @mianava was to log repostories that would be subject to removal as a "dry run" before opting into a list of repo's to delete with the new lambda.
Because DPC shares an account with BCDA, I thought it would be most straightforward to use account + prefix to look at "all images for DPC that we might want to opt in" and go from there. I'm open to other suggestions if you think there's a better way to capture repos before deleting them @Jose-verdance
| @@ -0,0 +1 @@ | |||
| TARGET_ENVS="dpc-dev dpc-test dpc-sandbox dpc-prod" | |||
There was a problem hiding this comment.
Shouldn't this only be test and prod?
There was a problem hiding this comment.
api-waf runs a separate instance for each environment. There is no difference between 'dev' and 'test' (or 'sanbox' and 'prod') in terms of the ECR repositories. Unlike api-waf, it would be redundant to run this lambda for 'dev' and 'test'.
| @@ -0,0 +1,23 @@ | |||
| variable "app" { | |||
There was a problem hiding this comment.
I thought we were just having one lambda for all the teams?
There was a problem hiding this comment.
I think the goal is to have a working example for DPC and then this can be extended to other teams.
Note: first pass is to only include a dry run with repositories logged. We probably want to have this deployed in production as a working example and then carry that over.
I drafted a follow-up item to review logs for ecr-cleanup lambda here: DPC-5259
| description = "The application environment" | ||
| type = string | ||
| validation { | ||
| condition = contains(["dev", "test", "sandbox", "prod"], var.env) |
There was a problem hiding this comment.
Shouldn't this be test or prod only? We don't have distinctions between dev/test and sandbox/prod repositories.
There was a problem hiding this comment.
Originally discussed using existing prefixes (e.g. "rls-r" format for production images), but after starting this work, it seems that image cleanup can be handled entirely by lambda and avoid the need for lifecycle policy which has been subject to 2 recent incidents - 1 of which was a dev image being deployed to prod.
There was a problem hiding this comment.
Sorry, I don't follow the connection between my statement and your response. Please see my clarification in conf.sh
terraform/services/ecr-cleanup/lambda_src/test_lambda_function.py
Outdated
Show resolved
Hide resolved
| import boto3 | ||
| from botocore.exceptions import ClientError | ||
|
|
||
| KEEP_COUNT = 5 |
There was a problem hiding this comment.
Both BCDA and DPC use different strategies based on tag prefix, so this universal rule set seems a bit constraining.
There was a problem hiding this comment.
This is a default value and can be extended to be configurable by other teams if needed, but this doesn't have to change to make lambda work for DPC
There was a problem hiding this comment.
The current lifecycle policy protects 5 release images and deletes all other images after 14 days. The global policy implemented here does not match that. I think it would be better to have a way to implement these sorts of differentials (as I know that BCDA will require it) here. However, as a first pass, we can leave it as is and resolve it in another ticket if you can't complete such changes this sprint.
There was a problem hiding this comment.
Yes! Retrieving these from parameters is preferred.
There was a problem hiding this comment.
Making these configurable makes sense but I am also ok with updating this as part of a separate ticket to include any other changes related to making this lambda adaptable for other teams.
terraform/services/ecr-cleanup/lambda_src/test_lambda_function.py
Outdated
Show resolved
Hide resolved
terraform/services/ecr-cleanup/lambda_src/test_lambda_function.py
Outdated
Show resolved
Hide resolved
terraform/services/ecr-cleanup/lambda_src/test_lambda_function.py
Outdated
Show resolved
Hide resolved
|
Fails pylint on my machine |
Co-authored-by: jdettmannnava <145699825+jdettmannnava@users.noreply.github.com>
| import boto3 | ||
| from botocore.exceptions import ClientError | ||
|
|
||
| KEEP_COUNT = 5 |
There was a problem hiding this comment.
The current lifecycle policy protects 5 release images and deletes all other images after 14 days. The global policy implemented here does not match that. I think it would be better to have a way to implement these sorts of differentials (as I know that BCDA will require it) here. However, as a first pass, we can leave it as is and resolve it in another ticket if you can't complete such changes this sprint.
| description = "The application environment" | ||
| type = string | ||
| validation { | ||
| condition = contains(["dev", "test", "sandbox", "prod"], var.env) |
There was a problem hiding this comment.
Sorry, I don't follow the connection between my statement and your response. Please see my clarification in conf.sh
| @@ -0,0 +1 @@ | |||
| TARGET_ENVS="dpc-dev dpc-test dpc-sandbox dpc-prod" | |||
There was a problem hiding this comment.
api-waf runs a separate instance for each environment. There is no difference between 'dev' and 'test' (or 'sanbox' and 'prod') in terms of the ECR repositories. Unlike api-waf, it would be redundant to run this lambda for 'dev' and 'test'.
|
After some additional discussion, it seems defining a list of repositories based on env + app should suffice to enable teams to opt-in to image cleanup. I've updated PR description with updated example featuring logs from the dpc nonprod account. I've drafted another follow-up item: DPC-5263 - Review configuration options for ECR cleanup |
mianava
left a comment
There was a problem hiding this comment.
This looks great! Switching this to cdap-test for evaluation will enable us to extend the use of one function used across repos and apps instead of many functions.
| import boto3 | ||
| from botocore.exceptions import ClientError | ||
|
|
||
| KEEP_COUNT = 5 |
There was a problem hiding this comment.
Yes! Retrieving these from parameters is preferred.
| @@ -0,0 +1 @@ | |||
| TARGET_ENVS="dpc-test dpc-prod" | |||
There was a problem hiding this comment.
I believe we'll want this issued in cdap-test and cdap-prod so we just have 1 lambda (per account).
| "ecr:BatchDeleteImage", | ||
| ] | ||
| resources = [ | ||
| "arn:aws:ecr:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:repository/*" |
There was a problem hiding this comment.
Let's flip this so it comes from a data block.
There was a problem hiding this comment.
@mianava Did you mean to say a local block?
There was a problem hiding this comment.
No, I mean from a data block. You can look up the arn for the ECR repositories using
data "aws_ecr_repository" "this" {
for_each = toset(var.ecr_repo_names)
name = each.key
}
this ensures the repositories exist and the permission is accurate
| data "aws_caller_identity" "current" {} | ||
| data "aws_region" "current" {} |
There was a problem hiding this comment.
We can favor the 'standards' module here in place of these.
There was a problem hiding this comment.
This may not be needed as we can set the function up under "cdap" and will not require these values in ecr_repository lookups.
There was a problem hiding this comment.
No longer needed I've specified arn for SSM directly and used for_each in a separate data block for repositories HERE
| name = "/${var.app}/${var.env}/ecr-cleanup/repos" | ||
| type = "SecureString" | ||
| description = "Comma-separated list of ECR repository names to clean up" | ||
| value = jsonencode(local.repo_list_by_app_env[var.app][var.env]) |
There was a problem hiding this comment.
In the status quo, we might get overwrites from one or the other. I would favor using Terraform for this for version control - removing the SSM parameter all together.
Co-authored-by: mianava <miaferguson@navapbc.com>
| """ | ||
| response = client.get_parameter(Name=ssm_param_name, WithDecryption=True) | ||
| value = response['Parameter']['Value'] | ||
| return json.loads(value) |
There was a problem hiding this comment.
After looking at this further, I think we should check if client.get_parameter encounters an error due to not finding the parameter or if it returns an empty value. We should fail gracefully and print to the console and potentially even alert us if this happens. If your creating a follow up ticket for expanding the lambda further we can move alerting off this tickets scope, but we should still handle those scenarios and include unit test to cover that as well.
Jose-verdance
left a comment
There was a problem hiding this comment.
Hey Luke, this is looking great, I just left some comments with some additional suggestion and stuff that can be done it a follow up to avoid scope creep!
| import boto3 | ||
| from botocore.exceptions import ClientError | ||
|
|
||
| KEEP_COUNT = 5 |
There was a problem hiding this comment.
Making these configurable makes sense but I am also ok with updating this as part of a separate ticket to include any other changes related to making this lambda adaptable for other teams.
| log_images_for_deletion(repo, to_delete) | ||
| log({'msg': f'Cleanup complete for repo: {repo}'}) | ||
| except ClientError as e: | ||
| log({'msg': f'Error processing repo {repo}: {e}', 'repo': repo}) |
There was a problem hiding this comment.
Thinking about this further, this lambda is going to be very important moving forward and should probably be doing more than logging that it failed.
🎫 Ticket
DPC-5227
🛠 Changes
main.tfHERE/<app>/<env>/ecr-cleanup/repos)ℹ️ Context
🧪 Validation
Manually verified functionality in test env
See additional logs at
/aws/lambda/dpc-test-ecr-cleanupon cloudwatch HEREAdded tests to cover python code: