feat(aws): add retry logic with exponential backoff #616
ArangoGutierrez merged 1 commit into NVIDIA:main
Conversation
Pull request overview
This PR introduces a generic exponential-backoff retry helper for AWS operations and extends the CLI with new validate and provision commands for managing environments and reprovisioning instances.
Changes:
- Added `WithRetry` and `RetryConfig` in the AWS provider with exponential backoff and jitter, plus comprehensive unit tests for timing, error classification, and generic type support.
- Added a `validate` CLI command to pre-validate environment YAML (YAML structure, required fields, SSH keys, AWS credentials, and component dependencies).
- Added a `provision` CLI command to (re)run provisioning on existing instances or arbitrary SSH hosts, including optional kubeconfig retrieval, and wired the new commands into the main CLI entrypoint.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `pkg/provider/aws/retry.go` | Implements generic retry helper with configurable max retries, backoff, and AWS-style retryable error detection. |
| `pkg/provider/aws/retry_test.go` | Adds unit tests covering default config, retry behavior, context cancellation, backoff timing, capping, and generics. |
| `cmd/cli/validate/validate.go` | Introduces `holodeck validate` command with multiple validation stages (YAML, required fields, SSH keys, AWS creds, component config). |
| `cmd/cli/provision/provision.go` | Introduces `holodeck provision` command supporting instance-ID mode and SSH mode, cluster vs single-node provisioning, and kubeconfig download. |
| `cmd/cli/main.go` | Registers new CLI subcommands (describe, get, provision, scp, ssh, update, validate) and updates the top-level help/usage examples accordingly. |
```go
if attempt < cfg.MaxRetries {
	// Add jitter
	jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
	sleepDuration := backoff + jitter
```
WithRetry computes jitter using rand.Int63n(int64(backoff / 2)) without guarding against a zero backoff, so a configuration with InitialBackoff == 0 will cause a runtime panic (Int63n with argument 0). Consider validating the retry config (e.g., defaulting non-positive backoff values) or skipping jitter when backoff <= 0 to avoid panics from misconfiguration.
cmd/cli/validate/validate.go
```go
if env.Spec.Auth.KeyName == "" {
	results = append(results, ValidationResult{
		Check:   "Auth.KeyName",
		Passed:  false,
		Message: "KeyName is required",
	})
} else {
	results = append(results, ValidationResult{
		Check:   "Auth.KeyName",
		Passed:  true,
		Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
	})
```
validateRequiredFields unconditionally requires Auth.KeyName, but SSH environments are valid without a key name (see pkg/testutil/fixtures.go:49-64 where ValidSSHEnvironment omits KeyName). This will cause holodeck validate to fail on otherwise valid SSH configurations; consider making KeyName required only for the AWS provider or relaxing this check for SSH.
Suggested change:

```go
if env.Spec.Provider == v1alpha1.ProviderAWS {
	if env.Spec.Auth.KeyName == "" {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  false,
			Message: "KeyName is required for AWS provider",
		})
	} else {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  true,
			Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
		})
	}
```
cmd/cli/validate/validate.go
```go
if strings.Contains(awsResult.Message, "warning") {
	hasWarnings = true
} else {
	hasErrors = true
}
```
In run, AWS credential failures are treated as warnings only if validateAWSCredentials returns a message containing the word "warning", but validateAWSCredentials never returns such messages. This makes the warning path effectively dead code and can confuse future maintainers; either remove the string check and consistently treat AWS credential issues as errors, or introduce explicit warning cases in validateAWSCredentials.
Suggested change:

```go
hasErrors = true
```
cmd/cli/validate/validate.go
```go
func (m *command) run() error {
	results := make([]ValidationResult, 0)
	hasErrors := false
	hasWarnings := false

	// 1. Validate environment file exists and is valid YAML
	env, err := m.validateEnvFile()
	if err != nil {
		results = append(results, ValidationResult{
			Check:   "Environment file",
			Passed:  false,
			Message: err.Error(),
		})
		hasErrors = true
		m.printResults(results)
		return fmt.Errorf("validation failed")
	}
	results = append(results, ValidationResult{
		Check:   "Environment file",
		Passed:  true,
		Message: "Valid YAML structure",
	})

	// 2. Validate required fields
	fieldResults := m.validateRequiredFields(env)
	for _, r := range fieldResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 3. Validate SSH keys
	keyResults := m.validateSSHKeys(env)
	for _, r := range keyResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 4. Validate AWS credentials (if AWS provider)
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		awsResult := m.validateAWSCredentials()
		results = append(results, awsResult)
		if !awsResult.Passed {
			if strings.Contains(awsResult.Message, "warning") {
				hasWarnings = true
			} else {
				hasErrors = true
			}
		}
	}

	// 5. Validate component configuration
	compResults := m.validateComponents(env)
	for _, r := range compResults {
		results = append(results, r)
		if !r.Passed {
			hasWarnings = true
		}
	}

	// Print results
	m.printResults(results)

	// Determine exit status
	if hasErrors {
		return fmt.Errorf("validation failed with errors")
	}
	if hasWarnings && m.strict {
		return fmt.Errorf("validation failed with warnings (strict mode)")
	}

	m.log.Info("\n✅ Validation passed")
	return nil
}
```
The new validate command implements non-trivial behavior (YAML parsing, required-field checks, SSH key checks, AWS credentials, and component dependency validation) but has no dedicated tests, while other CLI commands in this repo (e.g., cmd/cli/create and cmd/cli/dryrun) are covered by tests. Adding focused tests around run, validateRequiredFields, validateSSHKeys, and validateComponents would help prevent regressions in this validation logic.
cmd/cli/provision/provision.go
```go
func (m *command) runInstanceMode(instanceID string) error {
	// Get instance details
	manager := instances.NewManager(m.log, m.cachePath)
	instance, err := manager.GetInstance(instanceID)
	if err != nil {
		return fmt.Errorf("failed to get instance: %v", err)
	}

	// Load environment
	env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](instance.CacheFile)
	if err != nil {
		return fmt.Errorf("failed to read environment: %v", err)
	}

	m.log.Info("Provisioning instance %s...", instanceID)

	// Run provisioning based on instance type
	if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
		if err := m.runClusterProvision(&env); err != nil {
			return err
		}
	} else {
		if err := m.runSingleNodeProvision(&env); err != nil {
			return err
		}
	}

	// Update provisioned status
	env.Labels[instances.InstanceProvisionedLabelKey] = "true"
	data, err := jyaml.MarshalYAML(env)
	if err != nil {
		return fmt.Errorf("failed to marshal environment: %v", err)
	}
	if err := os.WriteFile(instance.CacheFile, data, 0600); err != nil {
		return fmt.Errorf("failed to update cache file: %v", err)
	}

	// Download kubeconfig if requested and Kubernetes is installed
	if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
		hostUrl, err := m.getHostURL(&env)
		if err != nil {
			m.log.Warning("Failed to get host URL for kubeconfig: %v", err)
		} else {
			if err := utils.GetKubeConfig(m.log, &env, hostUrl, m.kubeconfig); err != nil {
				m.log.Warning("Failed to download kubeconfig: %v", err)
			}
		}
	}

	m.log.Info("✅ Provisioning completed successfully")
	return nil
}
```
The new provision command introduces multiple code paths (instance mode vs. SSH mode, kubeconfig download, and host URL resolution) but currently lacks automated tests, whereas other CLI commands (e.g., create, dryrun) have coverage. To keep the CLI behavior stable, consider adding tests for runInstanceMode, runSSHMode, and getHostURL that exercise both AWS and SSH scenarios, as well as error paths (missing flags, missing nodes, unknown host URL).
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
Add a generic retry helper with exponential backoff for AWS API calls.

Changes

New Files
- `pkg/provider/aws/retry.go`
- `pkg/provider/aws/retry_test.go`

API

Configuration

Retryable Errors
- `Throttling`
- `RequestLimitExceeded`
- `ServiceUnavailable`
- `InternalError`
- `connection reset`
- `timeout`

Test Results
All 10 tests pass:

Test plan
- `go build ./pkg/provider/aws/...` — compiles
- `go test ./pkg/provider/aws/... -run TestRetry` — passes