Improve error handling on upload failures #216

@stmcginnis

Coldsnap has built-in retries with increasing backoff delays when uploading blocks. It can be hard to tell what is happening during this time, since nothing is printed while the retries are in progress.

https://github.com/awslabs/coldsnap/blob/develop/src/upload.rs#L171-L188

It might be useful to add a --verbose flag to the command to give a little more insight into what is going on. Alternatively, coldsnap could emit a warning by default whenever a retry happens.

The number of retries also seems a little too high. SNAPSHOT_BLOCK_ATTEMPTS is currently set to 12. It seems likely that if the upload does not succeed after 3-5 attempts, it's not going to.
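To illustrate the two suggestions above, here is a minimal sketch of a retry loop that prints a warning before each retry and gives up after a smaller number of attempts. It uses only the standard library. The names `retry_with_warnings` and `MAX_ATTEMPTS`, the value 5, and the doubling delay are assumptions for illustration and are not taken from coldsnap's upload.rs.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical cap, lower than the current SNAPSHOT_BLOCK_ATTEMPTS of 12.
const MAX_ATTEMPTS: u32 = 5;

/// Calls `op` until it succeeds or `max_attempts` is reached.
/// Prints a warning to stderr before every retry so the user can see
/// that something is happening. Returns the last error on failure.
fn retry_with_warnings<T, E, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    E: std::fmt::Display,
    F: FnMut(u32) -> Result<T, E>,
{
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(err) => {
                // Delay doubles each time: base, 2*base, 4*base, ...
                let delay = base_delay * 2u32.pow(attempt - 1);
                eprintln!(
                    "warning: attempt {attempt}/{max_attempts} failed: {err}; retrying in {delay:?}"
                );
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap the block upload in the closure, for example `retry_with_warnings(MAX_ATTEMPTS, Duration::from_millis(200), |_| put_block())`, where `put_block` stands in for whatever performs the upload. A --verbose flag could gate the `eprintln!` instead of printing unconditionally.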

It would also be good if coldsnap recognized failures that are not transient and therefore not worth retrying. Errors like AccessDeniedException, which @grosser encountered in bottlerocket-os/bottlerocket#2667, should fail immediately:

Failed to put block 1551 for snapshot 'snap-0f48e9c316f6fa504': TransientError: connection closed before message completed
Failed to put block 1552 for snapshot 'snap-0f48e9c316f6fa504': AccessDeniedException: User: arn:aws:sts::589470546123:assumed-role/compute-arf/foo@bar.com is not authorized to perform: ebs:PutSnapshotBlock on resource: arn:aws:ec2:us-west-2::snapshot/snap-0f48e9c316f6fa504 because no identity-based policy allows the ebs:PutSnapshotBlock action
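One way to express this is to classify the error code before deciding whether to retry. The sketch below is hypothetical: `FailureKind`, `classify`, `should_retry`, and the list of permanent codes are assumptions for illustration, not coldsnap's actual types or behavior. Only AccessDeniedException comes from the log above; ValidationException is included as a guess at another non-transient case.

```rust
/// Hypothetical classification of an upload failure.
#[derive(Debug, PartialEq)]
enum FailureKind {
    /// Worth retrying, e.g. a dropped connection.
    Transient,
    /// Retrying cannot help, e.g. missing IAM permissions.
    Permanent,
}

/// Error codes treated as permanent in this sketch (an assumed list).
const PERMANENT_CODES: &[&str] = &["AccessDeniedException", "ValidationException"];

/// Maps an error code to a failure kind. Anything not listed as
/// permanent is treated as transient.
fn classify(code: &str) -> FailureKind {
    if PERMANENT_CODES.contains(&code) {
        FailureKind::Permanent
    } else {
        FailureKind::Transient
    }
}

/// Retry only transient failures, and only while attempts remain.
fn should_retry(code: &str, attempt: u32, max_attempts: u32) -> bool {
    classify(code) == FailureKind::Transient && attempt < max_attempts
}
```

With this check in the retry loop, the AccessDeniedException for block 1552 would fail on the first attempt, while the TransientError for block 1551 would still be retried.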
