@abidhasan-aws abidhasan-aws commented Aug 12, 2025

Background

The CDK pipelines have been experiencing intermittent failures due to flaky tests that typically pass on retry. This pull request addresses the two most frequently failing tests.

image (1)

AWS IAM Eventual Consistency Issue

Test: docker-credential-cdk-assets can assume role and fetch ECR credentials

Issue: Docker credential fetching fails with AccessDenied errors because newly created IAM roles and policies require time to propagate across AWS regions.

Fix: Implemented a 60-second retry mechanism for fetchDockerLoginCredentials() when encountering AccessDenied errors.
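A minimal sketch of the deadline-based retry described above, for readers who want the shape of the loop in one place. The helper name `fetchWithIamRetry`, its parameters, and the 5-second pause are assumptions for illustration, not the code merged in this PR; only the 60-second deadline and the AccessDenied check come from the description.

```typescript
// Hypothetical helper (not the PR's actual implementation): keeps calling
// `fetchCredentials` until it succeeds, fails with something other than
// AccessDenied, or the deadline passes.
async function fetchWithIamRetry<T>(
  fetchCredentials: () => Promise<T>,
  timeoutMs = 60_000, // 60 seconds, as in the description
  pauseMs = 5_000, // assumed pause between attempts
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;

  while (Date.now() < deadline) {
    try {
      return await fetchCredentials();
    } catch (e) {
      // Only AccessDenied is treated as an eventual-consistency symptom;
      // every other error is rethrown immediately.
      if (!(e instanceof Error) || e.name !== 'AccessDenied') {
        throw e;
      }
      lastError = e;
      await new Promise<void>((resolve) => setTimeout(resolve, pauseMs));
    }
  }

  // Deadline passed: surface the last AccessDenied error seen.
  throw lastError ?? new Error('Timed out before the first attempt');
}
```

Rethrowing non-AccessDenied errors right away keeps genuine misconfigurations from being hidden behind a 60-second wait.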

CDK Migration Test Instability

Test: cdk migrate java deploys successfully

Issue: Java CDK migration tests fail sporadically due to Maven Central repository rate-limiting errors and dependency resolution failures.

Fix: Implemented full test retry logic as these transient network-related issues could not be reproduced in local environments.

Impact

These changes should improve pipeline stability and reduce the need for manual intervention.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@aws-cdk-automation aws-cdk-automation requested a review from a team August 12, 2025 17:32
@github-actions github-actions bot added the p2 label Aug 12, 2025
@abidhasan-aws abidhasan-aws changed the title add retry for cdk migrate java & fetchDockerLoginCredential fix: mitigate and add retry for flaky tests Aug 12, 2025

codecov-commenter commented Aug 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.83%. Comparing base (c6585ad) to head (653d9a2).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #788      +/-   ##
==========================================
- Coverage   81.52%   80.83%   -0.69%     
==========================================
  Files          63       63              
  Lines        8611     8611              
  Branches     1038     1032       -6     
==========================================
- Hits         7020     6961      -59     
- Misses       1561     1619      +58     
- Partials       30       31       +1     
Flag Coverage Δ
suite.unit 80.83% <ø> (-0.69%) ⬇️

@abidhasan-aws abidhasan-aws changed the title fix: mitigate and add retry for flaky tests test: add retry mechanisms for IAM eventual consistency and CDK migration tests for java Aug 13, 2025
@abidhasan-aws abidhasan-aws changed the title test: add retry mechanisms for IAM eventual consistency and CDK migration tests for java test: add retry mechanisms for iam eventual consistency and migration tests for java Aug 13, 2025
@abidhasan-aws abidhasan-aws changed the title test: add retry mechanisms for iam eventual consistency and migration tests for java fix: add retry mechanisms for iam eventual consistency and migration tests for java Aug 13, 2025
@abidhasan-aws abidhasan-aws changed the title fix: add retry mechanisms for iam eventual consistency and migration tests for java fix: add retry for iam eventual consistency issue and migration tests for java Aug 13, 2025
@abidhasan-aws abidhasan-aws marked this pull request as ready for review August 13, 2025 09:52
// Write the credentials back to stdout
fs.writeFileSync(1, JSON.stringify(credentials));

const deadline = Date.now() + 60_000; // 60 seconds timeout

Member

Didn't know you could have _ in numbers. nice!

@kumvprat kumvprat left a comment

Overall, could we have a wrapper method, withBackoffRetry or something like that, with a max-retry/timeout parameter?

This would let us wrap any such test in that method and keep all retry logic in one central place rather than per test.

In the future, other tests might come up that need these retries.
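To make the suggestion concrete, here is one possible shape for such a wrapper. Only the name `withBackoffRetry` comes from the comment above; the options object, the exponential doubling, and the 30-second cap are assumptions, and this is a sketch rather than anything that exists in the repository.

```typescript
// Hypothetical options for the proposed wrapper.
interface BackoffOptions {
  maxAttempts: number; // total attempts, including the first one
  initialDelayMs: number; // wait before the second attempt
  maxDelayMs?: number; // upper bound for the doubled delay
}

// Runs `operation` until it resolves or `maxAttempts` is exhausted,
// doubling the wait between attempts.
async function withBackoffRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: BackoffOptions,
): Promise<T> {
  const { maxAttempts, initialDelayMs, maxDelayMs = 30_000 } = options;
  let delayMs = initialDelayMs;
  let lastError: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (e) {
      lastError = e;
      if (attempt === maxAttempts) break; // no point sleeping after the last try
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }

  throw lastError ?? new Error('maxAttempts must be at least 1');
}
```

A test body could then be wrapped as `await withBackoffRetry(() => runTest(), { maxAttempts: 3, initialDelayMs: 1_000 })`, where `runTest` stands in for whatever the test does.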


// Retry on AccessDenied errors due to eventual consistency in AWS IAM.
// Newly created roles and policies may take time to propagate across regions.
if (e.name === 'AccessDenied') {

Would it be better to match on the error type (a class or interface) rather than on the name?
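For comparison, a small sketch of the two matching styles. The `AccessDeniedError` class below is a local stand-in defined only for this example, not a class exported by any AWS package. One general caveat with `instanceof` in JavaScript is that it can fail when the error was created by a different copy of a module or was reconstructed from serialized data, which is why the sketch falls back to the name check.

```typescript
// Local stand-in error class, defined only for this illustration.
class AccessDeniedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'AccessDenied';
  }
}

// Matches by class first, then falls back to the structural `name` check.
function isAccessDenied(e: unknown): boolean {
  if (e instanceof AccessDeniedError) {
    return true;
  }
  return (
    typeof e === 'object' &&
    e !== null &&
    (e as { name?: unknown }).name === 'AccessDenied'
  );
}
```

With a guard like this, the condition in the snippet above would read `if (isAccessDenied(e))`.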

Contributor

This seems like a very heavy-handed approach to what appears to be mainly an issue with our internal integration tests (I am not aware of any customer reports).

In any case, this change needs to be a separate PR, as it is customer-facing.

cc @rix0rrr

const deadline = Date.now() + 60_000; // 60 seconds timeout
let lastError: Error | undefined;

while (Date.now() < deadline) {

Signed-off-by: github-actions <[email protected]>
captureStderr: false,
let output: string = '';

await retry(process.stdout, 'Getting docker credentials', retry.forSeconds(60), async () => {

Should this be retry, waitForAssumeRole, or even retryOnMatchingErrors?

Contributor Author
@abidhasan-aws abidhasan-aws Aug 14, 2025

It should not be waitForAssumeRole.

We retry here because this call fetches credentials, which needs a wait period for IAM eventual consistency. We could use retryOnMatchingErrors, but there is no clean way to catch the error and retry based on it: the package just reports "exited with code 1" when an error occurs.

So we retry unconditionally rather than on specific errors, hence the generic retry.

Final thought: should we add a sleep time to the retry function? Could it work like the wait-and-retry we do in retryOnMatchingErrors? Would that add any value?

Contributor Author

It already has a sleep time of 5 seconds.

@abidhasan-aws abidhasan-aws disabled auto-merge August 15, 2025 11:19
@mrgrain mrgrain enabled auto-merge August 15, 2025 11:41
@mrgrain mrgrain changed the title fix: add retry for iam eventual consistency issue and migration tests for java fix(cli-tests): add retry for iam eventual consistency issue and migration tests for java Aug 15, 2025
@mrgrain mrgrain changed the title fix(cli-tests): add retry for iam eventual consistency issue and migration tests for java fix(cli-integ): add retry for iam eventual consistency issue and migration tests for java Aug 15, 2025
@mrgrain mrgrain added this pull request to the merge queue Aug 15, 2025
Merged via the queue into main with commit 093e5a6 Aug 15, 2025
36 of 37 checks passed
@mrgrain mrgrain deleted the fix-flaky-cli-integ-tests branch August 15, 2025 12:28
iankhou pushed a commit that referenced this pull request Aug 21, 2025
…ation tests for java (#788)

vivian12300 pushed a commit to vivian12300/aws-cdk-cli that referenced this pull request Aug 26, 2025
…ation tests for java (aws#788)

6 participants