Backup corruption on transient google storage errors #1291

@devoxel

Description

We have some large ClickHouse clusters that take a long time to back up (~17 hours). We use a Google Cloud Storage (GCS) backend.

We encountered restore failures that occur when transient errors happen during that long backup window.

For example, during backup we saw a log line like this:

warn gcs.PutFile: can't copy buffer: googleapi: got HTTP response code 503 with body: Service Unavailable

# on restore
clickhouse-backup returned status "error": one of restoreDataRegular go-routine return error: can't attach data parts for table 'XXX': code: 233, message: Detached part "YYY" not found

Cause

The bug appears to be caused by the Retrier simply reattempting the upload by calling PutFile again with the same file descriptor. That does not work once the file has been partially consumed by the destination implementation: the next f.Read() picks up where the failed attempt stopped, so the bytes already read are never sent again and the retried object is incomplete.

Either you need full buffering (or a rewind) so a retry re-sends the file from the start (see the sketch below), or the storage libraries need to fix this themselves.
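
A minimal sketch of the first option, assuming the source is an io.ReadSeeker. The names putFunc and putWithRetry are hypothetical and do not match clickhouse-backup's actual Retrier/PutFile signatures; the point is only that the reader is rewound before every attempt, so a retry never resumes mid-stream:

package main

import (
	"context"
	"fmt"
	"io"
	"os"
	"time"
)

// putFunc stands in for a storage backend's PutFile-style upload
// (hypothetical signature, not clickhouse-backup's real interface).
type putFunc func(ctx context.Context, key string, r io.Reader) error

// putWithRetry rewinds the reader before every attempt, so a retried upload
// starts from byte 0 instead of resuming wherever the failed attempt stopped.
func putWithRetry(ctx context.Context, put putFunc, key string, r io.ReadSeeker, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if _, err := r.Seek(0, io.SeekStart); err != nil {
			return fmt.Errorf("cannot rewind %q before attempt %d: %w", key, i+1, err)
		}
		if lastErr = put(ctx, key, r); lastErr == nil {
			return nil
		}
		time.Sleep(time.Second << uint(i)) // crude exponential backoff
	}
	return fmt.Errorf("upload of %q failed after %d attempts: %w", key, attempts, lastErr)
}

func main() {
	f, err := os.Open("/tmp/part.bin") // *os.File satisfies io.ReadSeeker
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Fake uploader that just drains the reader; real code would stream to GCS here.
	upload := func(ctx context.Context, key string, r io.Reader) error {
		_, err := io.Copy(io.Discard, r)
		return err
	}
	if err := putWithRetry(context.Background(), upload, "backup/part.bin", f, 3); err != nil {
		panic(err)
	}
}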

The GCS library itself can also take care of retrying transient errors, but its default behaviour is to retry only idempotent operations. So for GCS specifically, a change like 2d62623 (from a fork I just created) makes the library itself retry transient errors on all operations.
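
For reference, the stock cloud.google.com/go/storage client exposes this knob without a fork. The sketch below is my rough approximation of what such a change amounts to (it is not the actual diff in 2d62623, and "my-backup-bucket" plus the backoff values are placeholders); it makes all operations on a bucket handle retry transient errors, at the cost of also retrying non-idempotent calls the library would normally leave alone:

package main

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

// newBucket returns a bucket handle whose operations retry transient errors
// (e.g. HTTP 503) even when they are not idempotent, instead of the library's
// default of retrying idempotent operations only.
func newBucket(ctx context.Context, name string) (*storage.BucketHandle, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, fmt.Errorf("storage.NewClient: %w", err)
	}
	bucket := client.Bucket(name).Retryer(
		storage.WithPolicy(storage.RetryAlways),
		storage.WithBackoff(gax.Backoff{
			Initial:    2 * time.Second,
			Max:        time.Minute,
			Multiplier: 2,
		}),
	)
	return bucket, nil
}

func main() {
	ctx := context.Background()
	if _, err := newBucket(ctx, "my-backup-bucket"); err != nil { // placeholder bucket name
		panic(err)
	}
}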

I am testing the latter approach today and will report back if issues continue to occur despite the Unavailable errors being retried.
