Fix CloudControl API throttling errors causing immediate failures #2940
psantus wants to merge 5 commits into hashicorp:main
Thank you for this contribution @psantus! You've identified a real issue with CloudControl API throttling handling.
Key Concerns
- Error matching approach: The string-based error matching in provider.go is fragile. AWS SDK v2 provides typed errors that should be used instead.
- Missing tests: This PR lacks unit tests to verify the retry behavior works correctly for throttling errors while still failing fast for other errors.
- Upstream SDK issue: You've filed aws/aws-sdk-go-v2#3248, which is the right approach. Have you received any response from the AWS SDK team? If they're planning to fix this in the SDK, we should wait for that rather than implementing a workaround.
Recommendations
- Use typed error checking (errors.As) instead of string matching
- Add unit tests for both retry paths (waiter and client-level)
- Consider if this should wait for the upstream SDK fix
- Verify the actual error codes returned by CloudControl API match "Throttling"
The waiter-level fix in waiter.go looks reasonable, but the client-level retry in provider.go needs improvement for robustness.
What's the status of the upstream SDK issue?
@YakDriver thanks for your review!
No clue; it's being investigated by the service team (they posted an internal issue number on the GitHub ticket). If you'd be willing to take this fix despite the lack of status/ETA on other layers, I'd be happy to improve my contribution to meet your recommendations.
@YakDriver I've addressed your comments above. Thanks a lot!
Community Note
Closes: #2939
Rollback Plan
If a change needs to be reverted, we will publish an updated version of the library.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.
Description
Implement provider-level retry to mitigate absence of SDK-level retry.
Problem
CloudControl API operations fail immediately when encountering throttling errors instead of retrying automatically, causing unnecessary deployment failures during high-concurrency scenarios.
Error observed:
waiter state transitioned to FAILED. StatusMessage: Rate exceeded for operation 'CREATE'. ErrorCode: Throttling
Root Cause
The default CloudControl waiter in the AWS SDK for Go v2 treats every OperationStatus: "FAILED" response as a terminal failure, regardless of the ErrorCode.
As a result, throttling errors (ErrorCode: "Throttling") are not retried, even though they are transient and should be.
SDK Issue: Filed upstream at aws-sdk-go-v2#3248
Solution
Override the default waiter retry logic to properly handle throttling errors:
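A minimal sketch of that decision, not the PR's actual code: the waiter inspects the error code before treating FAILED as terminal. The progressEvent struct is a hypothetical stand-in for the status fields quoted in the error above, classify and the outcome values are illustrative names, and the non-FAILED status strings are assumptions. Wiring this into the SDK waiter, and choosing whether a throttled result means polling again or re-issuing the request, is left out.

```go
package main

import "fmt"

// progressEvent is a hypothetical stand-in holding only the fields the
// decision needs; the real SDK type carries more.
type progressEvent struct {
	OperationStatus string
	ErrorCode       string
	StatusMessage   string
}

type outcome int

const (
	keepPolling    outcome = iota // request still running
	succeeded                     // request finished successfully
	retryThrottled                // FAILED, but only because of throttling
	failedTerminal                // FAILED (or unknown) for any other reason
)

// classify separates throttling failures from genuine terminal failures
// instead of treating every FAILED status the same way.
func classify(ev progressEvent) outcome {
	switch ev.OperationStatus {
	case "SUCCESS":
		return succeeded
	case "PENDING", "IN_PROGRESS":
		return keepPolling
	case "FAILED":
		if ev.ErrorCode == "Throttling" {
			return retryThrottled
		}
		return failedTerminal
	default:
		// Unknown or cancelled states are treated as terminal here.
		return failedTerminal
	}
}

func main() {
	ev := progressEvent{
		OperationStatus: "FAILED",
		ErrorCode:       "Throttling",
		StatusMessage:   "Rate exceeded for operation 'CREATE'",
	}
	fmt.Println(classify(ev) == retryThrottled)
}
```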
Changes
Testing
Rollback Plan