-
Notifications
You must be signed in to change notification settings - Fork 40
fix(cli-integ): add retry for iam eventual consistency issue and migration tests for java #788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cdk migrate java
& fetchDockerLoginCredential
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #788 +/- ##
==========================================
- Coverage 81.52% 80.83% -0.69%
==========================================
Files 63 63
Lines 8611 8611
Branches 1038 1032 -6
==========================================
- Hits 7020 6961 -59
- Misses 1561 1619 +58
- Partials 30 31 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
// Write the credentials back to stdout | ||
fs.writeFileSync(1, JSON.stringify(credentials)); | ||
|
||
const deadline = Date.now() + 60_000; // 60 seconds timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't know you could have _ in numbers. nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall could we have a wrapper method : withBackoffRetry or something like that with a max retry /timeout parameter ?
This would allow us to wrap any such tests in that method and have a centralized place for all retry logic than per test
Could be in future that other tests might come up which need these retries
|
||
// Retry on AccessDenied errors due to eventual consistency in AWS IAM. | ||
// Newly created roles and policies may take time to propagate across regions. | ||
if (e.name === 'AccessDenied') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to match with error type : Like class/interface type rather than the name ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a very heave-handed approach to what appears to be mainly an issue with our internal integration tests (since I am not aware of customer reports).
In any case, this change needs to be a separate PR as it is Cx facing.
cc @rix0rrr
const deadline = Date.now() + 60_000; // 60 seconds timeout | ||
let lastError: Error | undefined; | ||
|
||
while (Date.now() < deadline) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably can be implemented at the SDK level using retries:
https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-smithy-util-retry/
https://github.com/smithy-lang/smithy-typescript/blob/main/packages/types/src/retry.ts
7b6b015
to
2b6c048
Compare
Signed-off-by: github-actions <[email protected]>
captureStderr: false, | ||
let output: string = ''; | ||
|
||
await retry(process.stdout, 'Getting docker credentials', retry.forSeconds(60), async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be retry
or waitForAssumeRole
or even retryOnMatchingErrors
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should not be waitForAssumeRole
.
We are retrying this because in this call, there is a credential fetching that needs a wait period for IAM eventual consistency. We could use retryOnMatchingErrors
, but there is no clean way to catch error and retry based on that. The package just returns exited with code 1
in case of errors.
So we are retrying anyway and not based on errors, thus the generic retry
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final thought should we add a sleep time to retry function ? Is that something that can be used like the wait and retry we do in retryOnMatchingErrors
? Would that add any value ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It already has a sleep time of 5 seconds.
…ation tests for java (#788) ### Background The CDK pipelines have been experiencing intermittent failures due to flaky tests that typically pass on retry. This pull request addresses the investigation of the two most frequent failing tests. <img width="1307" height="530" alt="image (1)" src="https://github.com/user-attachments/assets/c03da25a-6921-4358-8a12-81db8722d437" /> ### AWS IAM Eventual Consistency Issue Test: `docker-credential-cdk-assets can assume role and fetch ECR credentials` Issue: Docker credential fetching fails with AccessDenied errors because newly created IAM roles and policies require time to propagate across AWS regions. Fix: Implemented a 60-second retry mechanism for `fetchDockerLoginCredentials()` when encountering AccessDenied errors. ### CDK Migration Test Instability Test: `cdk migrate java deploys successfully` Issue: Java CDK migration tests fail sporadically due to Maven Central repository rate limiting errors & dependency resolution failure Fix: Implemented full test retry logic as these transient network-related issues could not be reproduced in local environments. ### Impact These changes should improve pipeline stability and reduce the need for manual intervention. --- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license --------- Signed-off-by: github-actions <[email protected]> Co-authored-by: github-actions <[email protected]>
…ation tests for java (aws#788) ### Background The CDK pipelines have been experiencing intermittent failures due to flaky tests that typically pass on retry. This pull request addresses the investigation of the two most frequent failing tests. <img width="1307" height="530" alt="image (1)" src="https://github.com/user-attachments/assets/c03da25a-6921-4358-8a12-81db8722d437" /> ### AWS IAM Eventual Consistency Issue Test: `docker-credential-cdk-assets can assume role and fetch ECR credentials` Issue: Docker credential fetching fails with AccessDenied errors because newly created IAM roles and policies require time to propagate across AWS regions. Fix: Implemented a 60-second retry mechanism for `fetchDockerLoginCredentials()` when encountering AccessDenied errors. ### CDK Migration Test Instability Test: `cdk migrate java deploys successfully` Issue: Java CDK migration tests fail sporadically due to Maven Central repository rate limiting errors & dependency resolution failure Fix: Implemented full test retry logic as these transient network-related issues could not be reproduced in local environments. ### Impact These changes should improve pipeline stability and reduce the need for manual intervention. --- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license --------- Signed-off-by: github-actions <[email protected]> Co-authored-by: github-actions <[email protected]>
Background
The CDK pipelines have been experiencing intermittent failures due to flaky tests that typically pass on retry. This pull request addresses the investigation of the two most frequent failing tests.
AWS IAM Eventual Consistency Issue
Test:
docker-credential-cdk-assets can assume role and fetch ECR credentials
Issue: Docker credential fetching fails with AccessDenied errors because newly created IAM roles and policies require time to propagate across AWS regions.
Fix: Implemented a 60-second retry mechanism for
fetchDockerLoginCredentials()
when encountering AccessDenied errors.CDK Migration Test Instability
Test:
cdk migrate java deploys successfully
Issue: Java CDK migration tests fail sporadically due to Maven Central repository rate limiting errors & dependency resolution failure
Fix: Implemented full test retry logic as these transient network-related issues could not be reproduced in local environments.
Impact
These changes should improve pipeline stability and reduce the need for manual intervention.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license