
feat(aws): add retry logic with exponential backoff #616

Merged
ArangoGutierrez merged 1 commit into NVIDIA:main from ArangoGutierrez:feat/aws-retry-logic
Feb 6, 2026

Conversation

@ArangoGutierrez (Collaborator)

Summary

Add a generic retry helper with exponential backoff for AWS API calls.

Changes

New Files

| File | Description |
| --- | --- |
| pkg/provider/aws/retry.go | Retry helper implementation |
| pkg/provider/aws/retry_test.go | 10 comprehensive tests |

API

```go
// Usage
result, err := WithRetry(ctx, DefaultRetryConfig(), func() (*ec2.Output, error) {
	return client.CreateVpc(ctx, input)
})
```

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| MaxRetries | 3 | Maximum retry attempts |
| InitialBackoff | 1s | Initial backoff duration |
| MaxBackoff | 30s | Maximum backoff cap |

Retryable Errors

  • Throttling
  • RequestLimitExceeded
  • ServiceUnavailable
  • InternalError
  • connection reset
  • timeout

Test Results

All 10 tests pass:

  • Default config validation
  • Success on first attempt
  • Success after retries
  • Max retries exceeded
  • Non-retryable errors
  • Context cancellation
  • Error classification
  • Exponential backoff timing
  • Backoff capping
  • Generic type support

Test plan

  • `go build ./pkg/provider/aws/...` - compiles
  • `go test ./pkg/provider/aws/... -run TestRetry` - passes

Copilot AI review requested due to automatic review settings February 4, 2026 20:30
@coveralls

coveralls commented Feb 4, 2026

Pull Request Test Coverage Report for Build 21755174663

Details

  • 52 of 52 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.6%) to 46.557%

Totals:

  • Change from base Build 21753110877: 0.6%
  • Covered Lines: 2143
  • Relevant Lines: 4603

💛 - Coveralls


Copilot AI left a comment


Pull request overview

This PR introduces a generic exponential-backoff retry helper for AWS operations and extends the CLI with new validate and provision commands for managing environments and reprovisioning instances.

Changes:

  • Added WithRetry and RetryConfig in the AWS provider with exponential backoff and jitter, plus comprehensive unit tests for timing, error classification, and generic type support.
  • Added a validate CLI command to pre-validate environment YAML (YAML structure, required fields, SSH keys, AWS credentials, and component dependencies).
  • Added a provision CLI command to (re)run provisioning on existing instances or arbitrary SSH hosts, including optional kubeconfig retrieval, and wired new commands into the main CLI entrypoint.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/provider/aws/retry.go | Implements generic retry helper with configurable max retries, backoff, and AWS-style retryable error detection. |
| pkg/provider/aws/retry_test.go | Adds unit tests covering default config, retry behavior, context cancellation, backoff timing, capping, and generics. |
| cmd/cli/validate/validate.go | Introduces holodeck validate command with multiple validation stages (YAML, required fields, SSH keys, AWS creds, component config). |
| cmd/cli/provision/provision.go | Introduces holodeck provision command supporting instance-ID mode and SSH mode, cluster vs single-node provisioning, and kubeconfig download. |
| cmd/cli/main.go | Registers new CLI subcommands (describe, get, provision, scp, ssh, update, validate) and updates the top-level help/usage examples accordingly. |

Comment on lines 49 to 52:

```go
if attempt < cfg.MaxRetries {
	// Add jitter
	jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
	sleepDuration := backoff + jitter
```

Copilot AI Feb 4, 2026


WithRetry computes jitter using rand.Int63n(int64(backoff / 2)) without guarding against a zero backoff, so a configuration with InitialBackoff == 0 will cause a runtime panic (Int63n with argument 0). Consider validating the retry config (e.g., defaulting non-positive backoff values) or skipping jitter when backoff <= 0 to avoid panics from misconfiguration.

Comment on lines 211 to 222:

```go
if env.Spec.Auth.KeyName == "" {
	results = append(results, ValidationResult{
		Check:   "Auth.KeyName",
		Passed:  false,
		Message: "KeyName is required",
	})
} else {
	results = append(results, ValidationResult{
		Check:   "Auth.KeyName",
		Passed:  true,
		Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
	})
```

Copilot AI Feb 4, 2026


validateRequiredFields unconditionally requires Auth.KeyName, but SSH environments are valid without a key name (see pkg/testutil/fixtures.go:49-64 where ValidSSHEnvironment omits KeyName). This will cause holodeck validate to fail on otherwise valid SSH configurations; consider making KeyName required only for the AWS provider or relaxing this check for SSH.

Suggested change (scope the check to the AWS provider):

```go
if env.Spec.Provider == v1alpha1.ProviderAWS {
	if env.Spec.Auth.KeyName == "" {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  false,
			Message: "KeyName is required for AWS provider",
		})
	} else {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  true,
			Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
		})
	}
}
```
Comment on lines 143 to 147:

```go
if strings.Contains(awsResult.Message, "warning") {
	hasWarnings = true
} else {
	hasErrors = true
}
```

Copilot AI Feb 4, 2026


In run, AWS credential failures are treated as warnings only if validateAWSCredentials returns a message containing the word "warning", but validateAWSCredentials never returns such messages. This makes the warning path effectively dead code and can confuse future maintainers; either remove the string check and consistently treat AWS credential issues as errors, or introduce explicit warning cases in validateAWSCredentials.

Suggested change:

```go
hasErrors = true
```
Comment on lines 97 to 173:

```go
func (m *command) run() error {
	results := make([]ValidationResult, 0)
	hasErrors := false
	hasWarnings := false

	// 1. Validate environment file exists and is valid YAML
	env, err := m.validateEnvFile()
	if err != nil {
		results = append(results, ValidationResult{
			Check:   "Environment file",
			Passed:  false,
			Message: err.Error(),
		})
		hasErrors = true
		m.printResults(results)
		return fmt.Errorf("validation failed")
	}
	results = append(results, ValidationResult{
		Check:   "Environment file",
		Passed:  true,
		Message: "Valid YAML structure",
	})

	// 2. Validate required fields
	fieldResults := m.validateRequiredFields(env)
	for _, r := range fieldResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 3. Validate SSH keys
	keyResults := m.validateSSHKeys(env)
	for _, r := range keyResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 4. Validate AWS credentials (if AWS provider)
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		awsResult := m.validateAWSCredentials()
		results = append(results, awsResult)
		if !awsResult.Passed {
			if strings.Contains(awsResult.Message, "warning") {
				hasWarnings = true
			} else {
				hasErrors = true
			}
		}
	}

	// 5. Validate component configuration
	compResults := m.validateComponents(env)
	for _, r := range compResults {
		results = append(results, r)
		if !r.Passed {
			hasWarnings = true
		}
	}

	// Print results
	m.printResults(results)

	// Determine exit status
	if hasErrors {
		return fmt.Errorf("validation failed with errors")
	}
	if hasWarnings && m.strict {
		return fmt.Errorf("validation failed with warnings (strict mode)")
	}

	m.log.Info("\n✅ Validation passed")
	return nil
}
```

Copilot AI Feb 4, 2026


The new validate command implements non-trivial behavior (YAML parsing, required-field checks, SSH key checks, AWS credentials, and component dependency validation) but has no dedicated tests, while other CLI commands in this repo (e.g., cmd/cli/create and cmd/cli/dryrun) are covered by tests. Adding focused tests around run, validateRequiredFields, validateSSHKeys, and validateComponents would help prevent regressions in this validation logic.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 141 to 192:

```go
func (m *command) runInstanceMode(instanceID string) error {
	// Get instance details
	manager := instances.NewManager(m.log, m.cachePath)
	instance, err := manager.GetInstance(instanceID)
	if err != nil {
		return fmt.Errorf("failed to get instance: %v", err)
	}

	// Load environment
	env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](instance.CacheFile)
	if err != nil {
		return fmt.Errorf("failed to read environment: %v", err)
	}

	m.log.Info("Provisioning instance %s...", instanceID)

	// Run provisioning based on instance type
	if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
		if err := m.runClusterProvision(&env); err != nil {
			return err
		}
	} else {
		if err := m.runSingleNodeProvision(&env); err != nil {
			return err
		}
	}

	// Update provisioned status
	env.Labels[instances.InstanceProvisionedLabelKey] = "true"
	data, err := jyaml.MarshalYAML(env)
	if err != nil {
		return fmt.Errorf("failed to marshal environment: %v", err)
	}
	if err := os.WriteFile(instance.CacheFile, data, 0600); err != nil {
		return fmt.Errorf("failed to update cache file: %v", err)
	}

	// Download kubeconfig if requested and Kubernetes is installed
	if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
		hostUrl, err := m.getHostURL(&env)
		if err != nil {
			m.log.Warning("Failed to get host URL for kubeconfig: %v", err)
		} else {
			if err := utils.GetKubeConfig(m.log, &env, hostUrl, m.kubeconfig); err != nil {
				m.log.Warning("Failed to download kubeconfig: %v", err)
			}
		}
	}

	m.log.Info("✅ Provisioning completed successfully")
	return nil
}
```

Copilot AI Feb 4, 2026


The new provision command introduces multiple code paths (instance mode vs. SSH mode, kubeconfig download, and host URL resolution) but currently lacks automated tests, whereas other CLI commands (e.g., create, dryrun) have coverage. To keep the CLI behavior stable, consider adding tests for runInstanceMode, runSSHMode, and getHostURL that exercise both AWS and SSH scenarios, as well as error paths (missing flags, missing nodes, unknown host URL).

Copilot generated this review using guidance from repository custom instructions.
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez enabled auto-merge (squash) February 6, 2026 15:03
@ArangoGutierrez ArangoGutierrez merged commit a2e0de5 into NVIDIA:main Feb 6, 2026
19 checks passed
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 10, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 10, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 10, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 12, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 13, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 13, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

3 participants