feat(aws): add retry logic with exponential backoff #616
ArangoGutierrez merged 1 commit into NVIDIA:main
Conversation
Pull request overview
This PR introduces a generic exponential-backoff retry helper for AWS operations and extends the CLI with new validate and provision commands for managing environments and reprovisioning instances.
Changes:
- Added `WithRetry` and `RetryConfig` in the AWS provider with exponential backoff and jitter, plus comprehensive unit tests for timing, error classification, and generic type support.
- Added a `validate` CLI command to pre-validate environment YAML (YAML structure, required fields, SSH keys, AWS credentials, and component dependencies).
- Added a `provision` CLI command to (re)run provisioning on existing instances or arbitrary SSH hosts, including optional kubeconfig retrieval, and wired the new commands into the main CLI entrypoint.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `pkg/provider/aws/retry.go` | Implements generic retry helper with configurable max retries, backoff, and AWS-style retryable error detection. |
| `pkg/provider/aws/retry_test.go` | Adds unit tests covering default config, retry behavior, context cancellation, backoff timing, capping, and generics. |
| `cmd/cli/validate/validate.go` | Introduces `holodeck validate` command with multiple validation stages (YAML, required fields, SSH keys, AWS creds, component config). |
| `cmd/cli/provision/provision.go` | Introduces `holodeck provision` command supporting instance-ID mode and SSH mode, cluster vs single-node provisioning, and kubeconfig download. |
| `cmd/cli/main.go` | Registers new CLI subcommands (describe, get, provision, scp, ssh, update, validate) and updates the top-level help/usage examples accordingly. |
```go
if attempt < cfg.MaxRetries {
	// Add jitter
	jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
	sleepDuration := backoff + jitter
```
WithRetry computes jitter using rand.Int63n(int64(backoff / 2)) without guarding against a zero backoff, so a configuration with InitialBackoff == 0 will cause a runtime panic (Int63n with argument 0). Consider validating the retry config (e.g., defaulting non-positive backoff values) or skipping jitter when backoff <= 0 to avoid panics from misconfiguration.
cmd/cli/validate/validate.go
```go
if env.Spec.Auth.KeyName == "" {
	results = append(results, ValidationResult{
		Check:   "Auth.KeyName",
		Passed:  false,
		Message: "KeyName is required",
	})
} else {
	results = append(results, ValidationResult{
		Check:   "Auth.KeyName",
		Passed:  true,
		Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
	})
```
validateRequiredFields unconditionally requires Auth.KeyName, but SSH environments are valid without a key name (see pkg/testutil/fixtures.go:49-64 where ValidSSHEnvironment omits KeyName). This will cause holodeck validate to fail on otherwise valid SSH configurations; consider making KeyName required only for the AWS provider or relaxing this check for SSH.
Suggested change:

```go
if env.Spec.Provider == v1alpha1.ProviderAWS {
	if env.Spec.Auth.KeyName == "" {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  false,
			Message: "KeyName is required for AWS provider",
		})
	} else {
		results = append(results, ValidationResult{
			Check:   "Auth.KeyName",
			Passed:  true,
			Message: fmt.Sprintf("KeyName: %s", env.Spec.Auth.KeyName),
		})
	}
```
cmd/cli/validate/validate.go
```go
if strings.Contains(awsResult.Message, "warning") {
	hasWarnings = true
} else {
	hasErrors = true
}
```
In run, AWS credential failures are treated as warnings only if validateAWSCredentials returns a message containing the word "warning", but validateAWSCredentials never returns such messages. This makes the warning path effectively dead code and can confuse future maintainers; either remove the string check and consistently treat AWS credential issues as errors, or introduce explicit warning cases in validateAWSCredentials.
Suggested change:

```go
hasErrors = true
```
cmd/cli/validate/validate.go
```go
func (m *command) run() error {
	results := make([]ValidationResult, 0)
	hasErrors := false
	hasWarnings := false

	// 1. Validate environment file exists and is valid YAML
	env, err := m.validateEnvFile()
	if err != nil {
		results = append(results, ValidationResult{
			Check:   "Environment file",
			Passed:  false,
			Message: err.Error(),
		})
		hasErrors = true
		m.printResults(results)
		return fmt.Errorf("validation failed")
	}
	results = append(results, ValidationResult{
		Check:   "Environment file",
		Passed:  true,
		Message: "Valid YAML structure",
	})

	// 2. Validate required fields
	fieldResults := m.validateRequiredFields(env)
	for _, r := range fieldResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 3. Validate SSH keys
	keyResults := m.validateSSHKeys(env)
	for _, r := range keyResults {
		results = append(results, r)
		if !r.Passed {
			hasErrors = true
		}
	}

	// 4. Validate AWS credentials (if AWS provider)
	if env.Spec.Provider == v1alpha1.ProviderAWS {
		awsResult := m.validateAWSCredentials()
		results = append(results, awsResult)
		if !awsResult.Passed {
			if strings.Contains(awsResult.Message, "warning") {
				hasWarnings = true
			} else {
				hasErrors = true
			}
		}
	}

	// 5. Validate component configuration
	compResults := m.validateComponents(env)
	for _, r := range compResults {
		results = append(results, r)
		if !r.Passed {
			hasWarnings = true
		}
	}

	// Print results
	m.printResults(results)

	// Determine exit status
	if hasErrors {
		return fmt.Errorf("validation failed with errors")
	}
	if hasWarnings && m.strict {
		return fmt.Errorf("validation failed with warnings (strict mode)")
	}

	m.log.Info("\n✅ Validation passed")
	return nil
}
```
The new validate command implements non-trivial behavior (YAML parsing, required-field checks, SSH key checks, AWS credentials, and component dependency validation) but has no dedicated tests, while other CLI commands in this repo (e.g., cmd/cli/create and cmd/cli/dryrun) are covered by tests. Adding focused tests around run, validateRequiredFields, validateSSHKeys, and validateComponents would help prevent regressions in this validation logic.
cmd/cli/provision/provision.go
```go
func (m *command) runInstanceMode(instanceID string) error {
	// Get instance details
	manager := instances.NewManager(m.log, m.cachePath)
	instance, err := manager.GetInstance(instanceID)
	if err != nil {
		return fmt.Errorf("failed to get instance: %v", err)
	}

	// Load environment
	env, err := jyaml.UnmarshalFromFile[v1alpha1.Environment](instance.CacheFile)
	if err != nil {
		return fmt.Errorf("failed to read environment: %v", err)
	}

	m.log.Info("Provisioning instance %s...", instanceID)

	// Run provisioning based on instance type
	if env.Spec.Cluster != nil && env.Status.Cluster != nil && len(env.Status.Cluster.Nodes) > 0 {
		if err := m.runClusterProvision(&env); err != nil {
			return err
		}
	} else {
		if err := m.runSingleNodeProvision(&env); err != nil {
			return err
		}
	}

	// Update provisioned status
	env.Labels[instances.InstanceProvisionedLabelKey] = "true"
	data, err := jyaml.MarshalYAML(env)
	if err != nil {
		return fmt.Errorf("failed to marshal environment: %v", err)
	}
	if err := os.WriteFile(instance.CacheFile, data, 0600); err != nil {
		return fmt.Errorf("failed to update cache file: %v", err)
	}

	// Download kubeconfig if requested and Kubernetes is installed
	if m.kubeconfig != "" && env.Spec.Kubernetes.Install {
		hostUrl, err := m.getHostURL(&env)
		if err != nil {
			m.log.Warning("Failed to get host URL for kubeconfig: %v", err)
		} else {
			if err := utils.GetKubeConfig(m.log, &env, hostUrl, m.kubeconfig); err != nil {
				m.log.Warning("Failed to download kubeconfig: %v", err)
			}
		}
	}

	m.log.Info("✅ Provisioning completed successfully")
	return nil
}
```
The new provision command introduces multiple code paths (instance mode vs. SSH mode, kubeconfig download, and host URL resolution) but currently lacks automated tests, whereas other CLI commands (e.g., create, dryrun) have coverage. To keep the CLI behavior stable, consider adding tests for runInstanceMode, runSSHMode, and getHostURL that exercise both AWS and SSH scenarios, as well as error paths (missing flags, missing nodes, unknown host URL).
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
Add a generic retry helper with exponential backoff for AWS API calls.

Changes

New Files
- `pkg/provider/aws/retry.go`
- `pkg/provider/aws/retry_test.go`

API

Configuration

Retryable Errors
- `Throttling`
- `RequestLimitExceeded`
- `ServiceUnavailable`
- `InternalError`
- `connection reset`
- `timeout`

Test Results
All 10 tests pass:

Test plan
- `go build ./pkg/provider/aws/...` — compiles
- `go test ./pkg/provider/aws/... -run TestRetry` — passes