Description
We have some large ClickHouse clusters that take a long time to back up (~17 hours). We use a Google Cloud Storage backend.
We encountered restore failures that occur when transient errors happen during the long backup window.
For example, we saw a log like this during backup:
```
warn gcs.PutFile: can't copy buffer: googleapi: got HTTP response code 503 with body: Service Unavailable
```
and then on restore:
```
clickhouse-backup returned status "error": one of restoreDataRegular go-routine return error: can't attach data parts for table 'XXX': code: 233, message: Detached part "YYY" not found
```
Cause
The bug appears to be caused by the Retrier simply reattempting uploads by calling PutFile again with the same file descriptor. This cannot work once the destination implementation has already consumed the file: the next f.Read() will not return the bytes that were already read, so the retried upload is empty or truncated.
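A minimal sketch of the failure mode, using a hypothetical consumeAndFail stand-in for the real upload call (not the actual clickhouse-backup code):

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// consumeAndFail simulates a storage client's PutFile: it reads the
// whole stream, then reports a transient error (hypothetical stand-in
// for the real GCS upload path).
func consumeAndFail(r io.Reader) error {
	io.Copy(io.Discard, r) // the bytes are gone even though the call failed
	return errors.New("googleapi: got HTTP response code 503")
}

// bytesLeftAfterNaiveRetry shows what a retry that reuses the same
// reader sees on its second attempt.
func bytesLeftAfterNaiveRetry() int64 {
	f := bytes.NewReader([]byte("part data"))
	_ = consumeAndFail(f)          // first attempt fails after consuming f
	n, _ := io.Copy(io.Discard, f) // naive retry: read the same reader again
	return n
}

func main() {
	fmt.Println("bytes seen by the retry:", bytesLeftAfterNaiveRetry()) // prints 0
}
```

The retried attempt uploads zero bytes, which matches the "Detached part not found" symptom at restore time.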
Either the caller needs to fully buffer the file (or seek back to the start) before each retry, or the storage libraries need to handle retries internally, below the point where the stream is consumed.
The GCS client library can also retry transient errors itself, but its default behaviour is to retry only idempotent operations. So for GCS specifically, a change like 2d62623 (from a fork I just created) can make the library itself retry transient errors on all operations.
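For reference, recent versions of the official Go client expose this per handle rather than requiring a fork: storage.WithPolicy(storage.RetryAlways) tells the client to retry transient errors on all calls. A sketch, assuming cloud.google.com/go/storage with retry configuration support; the bucket and object names are placeholders:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// RetryAlways retries transient errors (such as HTTP 503) on every
	// operation, not just idempotent ones. This is only safe here
	// because each backup object is written exactly once.
	bucket := client.Bucket("my-backup-bucket").
		Retryer(storage.WithPolicy(storage.RetryAlways))

	w := bucket.Object("backup/part.bin").NewWriter(ctx)
	if _, err := w.Write([]byte("part data")); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```

Note that this still does not help if the caller reuses a consumed reader across its own retry loop; it only covers retries performed inside the library.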
I am testing the latter approach today and will report back if issues continue to occur despite the Unavailable errors.