fix(aws): implement cleanup on partial creation failures #612

Merged
ArangoGutierrez merged 1 commit into NVIDIA:main from ArangoGutierrez:fix/cleanup-partial-failures
Feb 6, 2026

@ArangoGutierrez
Collaborator

Summary

Implement automatic rollback when Create() fails partway through, preventing orphaned AWS resources.

Changes

Cleanup Stack Pattern

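type cleanupFunc func() error
var cleanupStack []cleanupFunc

// err must be the named error result of Create(), so that this deferred
// closure observes whichever error the function ultimately returns.
defer func() {
    if err != nil {
        // Rollback on failure - LIFO order
        for i := len(cleanupStack) - 1; i >= 0; i-- {
            if cleanupErr := cleanupStack[i](); cleanupErr != nil {
                // A failed cleanup is logged, and the rollback continues.
                p.log.Warning("Cleanup failed: %v", cleanupErr)
            }
        }
    }
}()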
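
As a self-contained illustration of the pattern above, the following sketch simulates resource creation with an injected failure and records the order in which cleanups run. It is not the PR's code: `createAll`, `failAt`, and the resource names are hypothetical stand-ins, and the "resources" are plain strings rather than AWS calls.

```go
package main

import (
	"errors"
	"fmt"
)

// cleanupFunc undoes one previously successful creation step.
type cleanupFunc func() error

// createAll is a hypothetical stand-in for the provider's Create().
// It "creates" the named resources in order and fails at index failAt
// (pass -1 for no failure). steps records what happened, in order.
func createAll(resources []string, failAt int) (steps []string, err error) {
	var cleanupStack []cleanupFunc

	// err is a named result, so the deferred closure sees the final
	// error no matter which return statement produced it.
	defer func() {
		if err == nil {
			return
		}
		// Roll back in LIFO order: the newest resource goes first.
		for i := len(cleanupStack) - 1; i >= 0; i-- {
			if cleanupErr := cleanupStack[i](); cleanupErr != nil {
				steps = append(steps, "cleanup failed: "+cleanupErr.Error())
			}
		}
	}()

	for i, name := range resources {
		if i == failAt {
			return steps, fmt.Errorf("creating %s: %w", name, errors.New("injected failure"))
		}
		steps = append(steps, "create "+name)
		name := name // per-iteration copy for the closure (needed before Go 1.22)
		cleanupStack = append(cleanupStack, func() error {
			steps = append(steps, "delete "+name)
			return nil
		})
	}
	return steps, nil
}

func main() {
	steps, err := createAll([]string{"vpc", "subnet", "igw"}, 2)
	fmt.Println("error:", err)
	for _, s := range steps {
		fmt.Println(s)
	}
}
```

With the failure injected at the third resource, the recorded steps are the two creations followed by their deletions in reverse order (subnet, then vpc). Only steps that completed have a cleanup registered, so the failing step itself is never "deleted".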

Resources Covered

| Resource | Cleanup Method |
|---|---|
| VPC | deleteVPC() |
| Subnet | deleteSubnet() |
| Internet Gateway | deleteInternetGateway() |
| Route Table | deleteRouteTable() |
| Security Group | deleteSecurityGroups() |
| EC2 Instance | deleteEC2Instances() |

Impact

  • Before: Partial failures left orphaned resources requiring manual cleanup
  • After: All resources are automatically rolled back on failure

Test plan

  • go build ./pkg/provider/aws/... - compiles
  • Integration test with intentional failure

Copilot AI review requested due to automatic review settings February 4, 2026 16:49
@coveralls

coveralls commented Feb 4, 2026

Pull Request Test Coverage Report for Build 21756076629

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 49 of 50 (98.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+2.8%) to 49.356%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/provider/aws/create.go | 49 | 50 | 98.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 21755955021 | 2.8% |
| Covered Lines | 2300 |
| Relevant Lines | 4660 |

💛 - Coveralls

Copilot AI left a comment

Pull request overview

This pull request aims to implement automatic rollback for partial AWS resource creation failures by introducing a cleanup stack pattern. However, the PR also includes substantial unrelated changes adding new CLI commands (validate, provision, and registrations for describe, get, scp, ssh, update).

Changes:

  • Implements LIFO cleanup stack in AWS provider's Create() method to rollback resources on failure
  • Adds new validate command to pre-validate environment files before creation
  • Adds new provision command to provision or re-provision existing instances

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
|---|---|
| pkg/provider/aws/create.go | Adds cleanup stack pattern with defer to rollback created AWS resources on failure |
| cmd/cli/validate/validate.go | New command to validate environment files, SSH keys, AWS credentials, and component dependencies |
| cmd/cli/provision/provision.go | New command supporting instance-mode and SSH-mode provisioning with kubeconfig download |
| cmd/cli/main.go | Registers 7 new CLI commands (validate, provision, describe, get, scp, ssh, update) |

Comment on lines 214 to 216
env.Spec.HostUrl = m.host
env.Spec.PrivateKey = m.keyPath
env.Spec.Username = m.username
Copilot AI Feb 4, 2026

Bug: The field paths are incorrect. Based on the Environment type definition, these should be env.Spec.Instance.HostUrl (not env.Spec.HostUrl), env.Spec.Auth.PrivateKey (not env.Spec.PrivateKey), and env.Spec.Auth.Username (not env.Spec.Username). This code will not compile or will set wrong fields.

Suggested change
- env.Spec.HostUrl = m.host
- env.Spec.PrivateKey = m.keyPath
- env.Spec.Username = m.username
+ env.Spec.Instance.HostUrl = m.host
+ env.Spec.Auth.PrivateKey = m.keyPath
+ env.Spec.Auth.Username = m.username
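
To make the difference between the two field paths concrete, here is a minimal sketch. The struct shapes are assumptions that mirror only the paths named in the review comment; the real `Environment` type is not shown in this conversation and may differ. `sshEnvironment` is a hypothetical helper.

```go
package main

import "fmt"

// These types are assumptions modeled on the field paths quoted in the
// review comment; they are not the project's actual definitions.
type instance struct{ HostUrl string }

type auth struct {
	PrivateKey string
	Username   string
}

type spec struct {
	Instance instance
	Auth     auth
}

type environment struct{ Spec spec }

// sshEnvironment fills in an environment from CLI-style inputs using the
// nested paths. With these types, env.Spec.HostUrl would not compile:
// Go only promotes fields of embedded (anonymous) struct members, and
// Instance and Auth are named fields here.
func sshEnvironment(host, keyPath, username string) environment {
	var env environment
	env.Spec.Instance.HostUrl = host
	env.Spec.Auth.PrivateKey = keyPath
	env.Spec.Auth.Username = username
	return env
}

func main() {
	env := sshEnvironment("203.0.113.10", "/home/user/.ssh/id_ed25519", "ubuntu")
	fmt.Println(env.Spec.Instance.HostUrl, env.Spec.Auth.Username)
}
```

If the real type embedded these structs anonymously, the shorter paths would compile through field promotion, which is one way to read the comment's "will not compile or will set wrong fields".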

Copilot uses AI. Check for mistakes.
}

// Create provisioner and run
p, err := provisioner.New(m.log, env.Spec.PrivateKey, env.Spec.Username, hostUrl)
Copilot AI Feb 4, 2026

Bug: The field paths are incorrect. Based on the Environment type definition, these should be env.Spec.Auth.PrivateKey and env.Spec.Auth.Username (not env.Spec.PrivateKey and env.Spec.Username). This code will not compile or will access wrong fields.

Comment on lines 278 to 279
env.Spec.PrivateKey,
env.Spec.Username,
Copilot AI Feb 4, 2026

Bug: The field paths are incorrect. Based on the Environment type definition, these should be env.Spec.Auth.PrivateKey and env.Spec.Auth.Username (not env.Spec.PrivateKey and env.Spec.Username). This code will not compile or will access wrong fields.

}
}
} else if env.Spec.Provider == v1alpha1.ProviderSSH {
return env.Spec.HostUrl, nil
Copilot AI Feb 4, 2026

Bug: The field path is incorrect. Based on the Environment type definition, this should be env.Spec.Instance.HostUrl (not env.Spec.HostUrl). This code will not compile or will access the wrong field.

Suggested change
- return env.Spec.HostUrl, nil
+ return env.Spec.Instance.HostUrl, nil

cmd/cli/main.go Outdated
Comment on lines 116 to 126
describe.NewCommand(log),
dryrun.NewCommand(log),
get.NewCommand(log),
list.NewCommand(log),
oscmd.NewCommand(log),
provision.NewCommand(log),
scp.NewCommand(log),
ssh.NewCommand(log),
status.NewCommand(log),
update.NewCommand(log),
validate.NewCommand(log),
Copilot AI Feb 4, 2026

This PR includes changes to register multiple new CLI commands (describe, get, provision, scp, ssh, update, validate) that appear unrelated to the stated purpose of "implement cleanup on partial creation failures". Only the validate and provision commands have implementation files in this PR, but the others are registered without implementations. This makes the PR scope unclear and harder to review. Consider splitting these changes into separate PRs focused on specific features.

return p.deleteEC2Instances(cleanupCache)
})

// Save objects ID's into a cache file
Copilot AI Feb 4, 2026

Bug: If updateAvailableCondition fails after all resources are successfully created, the defer cleanup will be triggered and roll back all successfully created resources. This would leave the AWS resources in limbo - created but not tracked. Consider setting a flag to disable cleanup after all resources are successfully created, or clearing the cleanup stack before this call.

Suggested change
- // Save objects ID's into a cache file
+ // Save objects ID's into a cache file. At this point all AWS resources
+ // have been successfully created, so we must not run the deferred
+ // cleanup logic on failure, otherwise we would delete live resources
+ // that are no longer tracked. Clear the cleanup stack before updating
+ // the available condition.
+ cleanupStack = nil
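
The effect of clearing the stack can be seen in a small simulation. This is a sketch under stated assumptions, not the provider's code: `provision`, `failingSave`, and the `disarm` flag are hypothetical, and the bookkeeping step stands in for whatever runs after the last resource is created.

```go
package main

import (
	"errors"
	"fmt"
)

type cleanupFunc func() error

// failingSave simulates a bookkeeping step that fails after every
// resource has already been created.
func failingSave() error { return errors.New("disk full") }

// provision is a hypothetical stand-in for Create(). It registers one
// cleanup per resource, then runs saveState. When disarm is true, the
// cleanup stack is cleared before that final step.
func provision(saveState func() error, disarm bool) (deleted []string, err error) {
	var cleanupStack []cleanupFunc
	defer func() {
		if err == nil {
			return
		}
		for i := len(cleanupStack) - 1; i >= 0; i-- {
			_ = cleanupStack[i]() // errors ignored in this sketch
		}
	}()

	for _, name := range []string{"vpc", "instance"} {
		name := name // per-iteration copy for the closure
		cleanupStack = append(cleanupStack, func() error {
			deleted = append(deleted, name)
			return nil
		})
	}

	// All resources exist at this point. The deferred closure reads the
	// same cleanupStack variable, so setting it to nil turns the
	// rollback into a no-op for any later failure.
	if disarm {
		cleanupStack = nil
	}
	if saveErr := saveState(); saveErr != nil {
		return deleted, fmt.Errorf("saving state: %w", saveErr)
	}
	return deleted, nil
}

func main() {
	deleted, err := provision(failingSave, false)
	fmt.Println(deleted, err)
	deleted, err = provision(failingSave, true)
	fmt.Println(deleted, err)
}
```

Without disarming, the late failure deletes both resources in reverse order; with it, nothing is deleted and the caller only receives the error. Which behavior is preferable depends on whether the surviving resources can still be tracked by other means, which this sketch does not model.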

Comment on lines 310 to 320

// getKubeconfigPath returns the path to save kubeconfig
func getKubeconfigPath(instanceID string) string {
    homeDir, err := os.UserHomeDir()
    if err != nil {
        return fmt.Sprintf("kubeconfig-%s", instanceID)
    }
    kubeDir := filepath.Join(homeDir, ".kube")
    _ = os.MkdirAll(kubeDir, 0755)
    return filepath.Join(kubeDir, fmt.Sprintf("config-%s", instanceID))
}
Copilot AI Feb 4, 2026

Dead code: The function getKubeconfigPath is defined but never called. Consider removing it if it's not needed, or use it if it was intended to be used somewhere.

Suggested change
- // getKubeconfigPath returns the path to save kubeconfig
- func getKubeconfigPath(instanceID string) string {
-     homeDir, err := os.UserHomeDir()
-     if err != nil {
-         return fmt.Sprintf("kubeconfig-%s", instanceID)
-     }
-     kubeDir := filepath.Join(homeDir, ".kube")
-     _ = os.MkdirAll(kubeDir, 0755)
-     return filepath.Join(kubeDir, fmt.Sprintf("config-%s", instanceID))
- }

}
}

// Update provisioned status
Copilot AI Feb 4, 2026

Potential nil map write: If env.Labels is nil, assigning a key to it will panic. Consider initializing the map if it's nil before setting the key, e.g., if env.Labels == nil { env.Labels = make(map[string]string) }.

Suggested change
- // Update provisioned status
+ // Update provisioned status
+ if env.Labels == nil {
+     env.Labels = make(map[string]string)
+ }
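
The guard in the suggestion can be tried in isolation. The sketch below uses a pared-down, hypothetical `environment` type with only a `Labels` field, and the label key is made up for illustration.

```go
package main

import "fmt"

// environment is a hypothetical, pared-down stand-in for the real type;
// only the Labels field matters for this example.
type environment struct {
	Labels map[string]string
}

// setLabel writes key=value, allocating the map on first use. Assigning
// to a key of a nil map panics at run time, while reading from a nil
// map is safe and yields the zero value.
func setLabel(env *environment, key, value string) {
	if env.Labels == nil {
		env.Labels = make(map[string]string)
	}
	env.Labels[key] = value
}

func main() {
	env := &environment{} // Labels is nil here
	fmt.Println(env.Labels["provisioned"] == "") // reading a nil map is fine
	setLabel(env, "provisioned", "true")
	fmt.Println(env.Labels["provisioned"])
}
```

Wrapping the check in a helper keeps every write site safe, at the cost of one extra nil comparison per call.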

@ArangoGutierrez ArangoGutierrez force-pushed the fix/cleanup-partial-failures branch from 386ff08 to e9424ea Compare February 6, 2026 15:06
@ArangoGutierrez ArangoGutierrez enabled auto-merge (squash) February 6, 2026 15:09
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez force-pushed the fix/cleanup-partial-failures branch from e9424ea to 1851e25 Compare February 6, 2026 15:31
@ArangoGutierrez ArangoGutierrez merged commit 87e5ca0 into NVIDIA:main Feb 6, 2026
3 checks passed
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 10, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 10, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 10, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 12, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 13, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 13, 2026
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
3 participants