
Fix CloudControl API throttling errors causing immediate failures #2940

Open

psantus wants to merge 5 commits into hashicorp:main from psantus:b/retry-on-throttling

Conversation

@psantus psantus commented Dec 11, 2025

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request
  • The resources and data sources in this provider are generated from the CloudFormation schema, so they can only support the actions that the underlying schema supports. For this reason submitted bugs should be limited to defects in the generation and runtime code of the provider. Customizing behavior of the resource, or noting a gap in behavior are not valid bugs and should be submitted as enhancements to AWS via the CloudFormation Open Coverage Roadmap.

Closes: #2939

Rollback Plan

If a change needs to be reverted, we will publish an updated version of the library.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

Description

Implement provider-level retry to mitigate the absence of SDK-level retry.

Problem

CloudControl API operations fail immediately when encountering throttling errors instead of retrying automatically, causing unnecessary deployment
failures during high-concurrency scenarios.

Error observed:
waiter state transitioned to FAILED. StatusMessage: Rate exceeded for operation 'CREATE'. ErrorCode: Throttling

Root Cause

The default CloudControl waiter in the AWS SDK for Go v2 treats every OperationStatus: "FAILED" response as a terminal failure, regardless of the ErrorCode.
This means throttling errors (ErrorCode: "Throttling") are not retried, even though they should be.

SDK Issue: Filed upstream at aws-sdk-go-v2#3248

Solution

Override the default waiter retry logic to properly handle throttling errors:

  1. Waiter-level retry: Continue polling when ErrorCode: "Throttling" is returned instead of failing immediately (see the sketch after this list)
  2. Client-level retry: Add explicit throttling error classification for initial API calls
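
For illustration, a minimal sketch of the waiter-level idea, assuming the generated aws-sdk-go-v2 cloudcontrol ResourceRequestSuccess waiter and its Retryable hook; the function name and the raw "Throttling" string comparison are illustrative choices, not the PR's actual RetryGetResourceRequestStatus code:

```go
package cloudcontrolsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/cloudcontrol"
	"github.com/aws/aws-sdk-go-v2/service/cloudcontrol/types"
)

// newThrottlingAwareWaiter wraps the generated ResourceRequestSuccess waiter so
// that a FAILED progress event caused by throttling keeps the waiter polling
// instead of being treated as a terminal failure.
func newThrottlingAwareWaiter(client *cloudcontrol.Client) *cloudcontrol.ResourceRequestSuccessWaiter {
	return cloudcontrol.NewResourceRequestSuccessWaiter(client, func(o *cloudcontrol.ResourceRequestSuccessWaiterOptions) {
		defaultRetryable := o.Retryable // generated default, already set when option functions run

		o.Retryable = func(ctx context.Context, in *cloudcontrol.GetResourceRequestStatusInput,
			out *cloudcontrol.GetResourceRequestStatusOutput, err error) (bool, error) {
			// Keep waiting when the only reason for FAILED is throttling.
			if err == nil && out != nil && out.ProgressEvent != nil &&
				out.ProgressEvent.OperationStatus == types.OperationStatusFailed &&
				string(out.ProgressEvent.ErrorCode) == "Throttling" {
				return true, nil
			}
			// Every other outcome keeps the generated waiter's default decision.
			return defaultRetryable(ctx, in, out, err)
		}
	})
}

// Illustrative usage:
//   waiter := newThrottlingAwareWaiter(client)
//   err := waiter.Wait(ctx, &cloudcontrol.GetResourceRequestStatusInput{RequestToken: token}, maxWaitTime)
```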

Changes

  • internal/service/cloudcontrol/waiter.go: Add throttling retry logic to RetryGetResourceRequestStatus
  • internal/provider/provider.go: Configure the CloudControl client with throttling-aware retry behavior (see the sketch below)
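
For illustration, a minimal sketch of the client-level side, assuming the provider constructs the CloudControl client with cloudcontrol.NewFromConfig; retry.NewStandard and retry.AddWithErrorCodes are existing aws-sdk-go-v2 helpers, but the MaxAttempts value and the helper name here are illustrative, not this PR's actual provider.go wiring:

```go
package providersketch

import (
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/service/cloudcontrol"
)

// newCloudControlClient builds a CloudControl client whose retryer also treats
// the "Throttling" error code as retryable for the initial API calls
// (CreateResource, UpdateResource, DeleteResource, ...).
func newCloudControlClient(cfg aws.Config) *cloudcontrol.Client {
	return cloudcontrol.NewFromConfig(cfg, func(o *cloudcontrol.Options) {
		o.Retryer = retry.AddWithErrorCodes(
			retry.NewStandard(func(so *retry.StandardOptions) {
				so.MaxAttempts = 10 // illustrative value, not the PR's setting
			}),
			"Throttling", // classify this code as retryable at the client level
		)
	})
}
```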

Testing

  • ✅ Throttling errors now retry instead of failing
  • ✅ Non-throttling errors still fail immediately as expected
  • ✅ All CloudControl operations (create/update/delete) benefit from the fix


@psantus psantus requested a review from a team as a code owner December 11, 2025 17:49
Member

@YakDriver YakDriver left a comment

Thank you for this contribution @psantus! You've identified a real issue with CloudControl API throttling handling.

Key Concerns

  1. Error matching approach: The string-based error matching in provider.go is fragile. AWS SDK v2 provides typed errors that should be used instead.

  2. Missing tests: This PR lacks unit tests to verify the retry behavior works correctly for throttling errors while still failing fast for other errors.

  3. Upstream SDK issue: You've filed aws/aws-sdk-go-v2#3248, which is the right approach. Have you received any response from the AWS SDK team? If they're planning to fix this in the SDK, we should wait for that rather than implementing a workaround.

Recommendations

  • Use typed error checking (errors.As) instead of string matching (see the sketch after this list)
  • Add unit tests for both retry paths (waiter and client-level)
  • Consider if this should wait for the upstream SDK fix
  • Verify the actual error codes returned by CloudControl API match "Throttling"
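
For reference, a minimal sketch of what the typed-error check could look like with smithy-go's APIError interface; isThrottlingError and the set of matched codes are illustrative, and the exact codes should be verified against what the CloudControl API actually returns (per the last recommendation above):

```go
package providersketch

import (
	"errors"

	"github.com/aws/smithy-go"
)

// isThrottlingError reports whether err is an AWS API error whose error code
// indicates throttling, using typed unwrapping rather than matching on the
// error message string.
func isThrottlingError(err error) bool {
	var apiErr smithy.APIError
	if !errors.As(err, &apiErr) {
		return false
	}
	switch apiErr.ErrorCode() {
	case "Throttling", "ThrottlingException", "TooManyRequestsException":
		return true
	}
	return false
}
```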

The waiter-level fix in waiter.go looks reasonable, but the client-level retry in provider.go needs improvement for robustness.

What's the status of the upstream SDK issue?

Author

psantus commented Feb 4, 2026

@YakDriver thanks for your review!

What's the status of the upstream SDK issue?

No clue; it's being investigated by the service team (they posted an internal issue number on the GitHub ticket).
However, independently of the SDK layer, the Service Team determined in a separate thread (via AWS Support) that the original throttling did not occur in the Cloud Control API itself, but in the underlying service (I was creating 3 SageMakerImageVersions at the same time, and that API's default throttling quota was that low). The SageMaker team raised their default quota, so the underlying issue at the deepest layer got solved.

If you'd be willing to take this fix, despite the lack of status/ETA on other layers, then I'd be happy to improve my contribution to meet your recommendations.

@psantus psantus requested a review from YakDriver February 15, 2026 12:58
Author

psantus commented Feb 15, 2026

@YakDriver I addressed your comments above. Thanks a lot!
