Backup corruption on transient google storage errors #1291

@devoxel

Description

We have some large ClickHouse clusters that take a long time to back up (~17 hours). We use a Google Cloud Storage (GCS) backend.

We encountered restore failures that occur when transient errors happen during that long backup window.

For example, during backup we saw a log line like this:

warn gcs.PutFile: can't copy buffer: googleapi: got HTTP response code 503 with body: Service Unavailable

# on restore
clickhouse-backup returned status "error": one of restoreDataRegular go-routine return error: can't attach data parts for table 'XXX': code: 233, message: Detached part "YYY" not found

Cause

The bug appears to be caused by the Retrier simply reattempting the upload by calling PutFile again with the same file descriptor. That does not work once the file has been partially consumed by the destination implementation: the next f.Read() picks up where the failed attempt stopped, so the bytes already read are never sent again and the retried object is incomplete.

Either you need full buffering (or a rewind) so a retry re-sends the file from the start (see the sketch below), or the storage libraries need to fix this themselves.
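
A minimal sketch of the first option, assuming the source is an io.ReadSeeker. The names putFunc and putWithRetry are hypothetical and do not match clickhouse-backup's actual Retrier/PutFile signatures; the point is only that the reader is rewound before every attempt, so a retry never resumes mid-stream:

package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"time"
)

// putFunc stands in for a storage backend's PutFile-style upload
// (hypothetical signature, not clickhouse-backup's real interface).
type putFunc func(ctx context.Context, key string, r io.Reader) error

// putWithRetry rewinds the reader before every attempt, so a retried upload
// starts from byte 0 instead of resuming wherever the failed attempt stopped.
func putWithRetry(ctx context.Context, put putFunc, key string, r io.ReadSeeker, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if _, err := r.Seek(0, io.SeekStart); err != nil {
			return fmt.Errorf("cannot rewind %q before attempt %d: %w", key, i+1, err)
		}
		if lastErr = put(ctx, key, r); lastErr == nil {
			return nil
		}
		time.Sleep(time.Second << uint(i)) // crude exponential backoff
	}
	return fmt.Errorf("upload of %q failed after %d attempts: %w", key, attempts, lastErr)
}

func main() {
	f, err := os.Open("/tmp/part.bin") // *os.File satisfies io.ReadSeeker
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Fake uploader that just drains the reader; real code would stream to GCS here.
	upload := func(ctx context.Context, key string, r io.Reader) error {
		_, err := io.Copy(io.Discard, r)
		return err
	}
	if err := putWithRetry(context.Background(), upload, "backup/part.bin", f, 3); err != nil {
		panic(err)
	}
}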

The GCS library itself can also take care of retrying transient errors, but its default behaviour is to retry only idempotent operations. So for GCS specifically, a change like 2d62623 (from a fork I just created) makes the library itself retry transient errors on all operations.
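
For reference, the stock cloud.google.com/go/storage client exposes this knob without a fork. The sketch below is my rough approximation of what such a change amounts to (it is not the actual diff in 2d62623, and "my-backup-bucket" plus the backoff values are placeholders); it makes all operations on a bucket handle retry transient errors, at the cost of also retrying non-idempotent calls the library would normally leave alone:

package main

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

// newBucket returns a bucket handle whose operations retry transient errors
// (e.g. HTTP 503) even when they are not idempotent, instead of the library's
// default of retrying idempotent operations only.
func newBucket(ctx context.Context, name string) (*storage.BucketHandle, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, fmt.Errorf("storage.NewClient: %w", err)
	}
	bucket := client.Bucket(name).Retryer(
		storage.WithPolicy(storage.RetryAlways),
		storage.WithBackoff(gax.Backoff{
			Initial:    2 * time.Second,
			Max:        time.Minute,
			Multiplier: 2,
		}),
	)
	return bucket, nil
}

func main() {
	ctx := context.Background()
	if _, err := newBucket(ctx, "my-backup-bucket"); err != nil { // placeholder bucket name
		panic(err)
	}
}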

I am testing the latter approach today and will report back if issues continue to occur despite the Unavailable errors being retried.
