-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Description
Current behavior
When a RUN command in a Dockerfile fails due to transient issues (e.g., network flakes during downloads or API calls), the build halts immediately. To recover, users must manually restart the entire build, relying on layer caching to skip prior successful steps. However, if users implement retries within the RUN command, the retry executes on a potentially "dirty" filesystem state if the command partially mutated the layer before failing.
There are a few scenarios where a dirty filesystem from the previous command execution attempt would cause failure. Some examples:
- The installation script wanted to make a copy of something, and the previous execution attempt already made the copy, so that the next installation attempt would either hang awaiting user confirmation to overwrite the file, or would fail immediately, saying the file already exists.
- The installation scripts will see that a certain artifact from the previous attempt is already present, without knowing that this existing artifact is half-written (relevant, for example, for some package managers with caches), and will try to use this malformed artifact.
Buildkit does not provide built-in retry mechanisms for individual commands, forcing users to handle this externally or with hacky in-command logic.
Desired behavior
Introduce configurable retry options for RUN commands:
- Add a Dockerfile directive or build flag, e.g.,
RUN --retry=3 --backoff=5s commandor a global--retry-policyfor the build, or maybe something to put intobuildkitd.toml - On failure, automatically retry the command up to the specified count, with optional exponential backoff.
- Crucially, each retry should reset the layer to its clean pre-
RUNstate (from the cached previous layer) to ensure idempotency and avoid running on a polluted filesystem. - Optionally customize retries based on exit codes (e.g., only retry on network-related errors like curl's 6 or 7).
- If all retries fail, the build fails as usual.
This would make builds more resilient to flaky externalities without requiring custom scripts that don't handle layer resets properly.
Use case example
New RUN flags are specifically needed in cases when these 2 conditions are met:
- The underlying command/script doesn't have built-in retry flags, in contrast to, for example, curl.
- Patching the underlying binary/script that can fail is not suitable.
In a Dockerfile:
RUN --retry=3 curl https://getmic.ro | bash
If internal commands for fetching data inside the installation script fail due to a temporary network issue, retry up to 3 times, resetting the layer each time, before failing the build. Keep in mind that setting curl's retry parameters only affects the way the script will be fetched, not the execution of the underlying commands that would fetch the micro binary itself; that's what makes RUN's flag necessary.
Alternatives considered
- Manual build restarts, which lead to throwing out into the trash half-baked results of other pending operations, hence wasting the compute
- Unreliable, inconsistent ad-hoc in-
RUNretry loops (e.g., bash for-loop with sleep), which don't reset layer state, leading to potential inconsistencies, and additional mental overhead. In-RUNretry loop example:moby/buildkit /Dockerfile - Helpers like kadwanev/retry, which you have to find a way to put into the image first, and then ensure they don't end up in your production runtime image, because these
retryhelpers are only needed at build-time - External CI tools (e.g., Buildkite job retries), which operate at a higher level, not granular to specific commands, and basically have the same consequences as manual build restarts, but with a bit better DevX