Feature Request: Automatic retries for Dockerfile `RUN` commands with layer resetting

## Current behavior

When a `RUN` command in a Dockerfile fails due to transient issues (e.g., network flakes during downloads or API calls), the build halts immediately. To recover, users must manually restart the entire build, relying on layer caching to skip prior successful steps. However, if users implement retries within the `RUN` command, the retry executes on a potentially "dirty" filesystem state if the command partially mutated the layer before failing.

There are a few scenarios where a dirty filesystem from the previous command execution attempt would cause failure. Some examples:

- The installation script wanted to make a copy of something, and the previous execution attempt already made the copy, so that the next installation attempt would either hang awaiting user confirmation to overwrite the file, or would fail immediately, saying the file already exists.
- The installation scripts will see that a certain artifact from the previous attempt is already present, without knowing that this existing artifact is half-written (relevant, for example, for some package managers with caches), and will try to use this malformed artifact.

Buildkit does not provide built-in retry mechanisms for individual commands, forcing users to handle this externally or with hacky in-command logic.

## Desired behavior

Introduce configurable retry options for `RUN` commands:
- Add a Dockerfile directive or build flag, e.g., `RUN --retry=3 --backoff=5s command` or a global `--retry-policy` for the build, or maybe something to put into `buildkitd.toml`
- On failure, automatically retry the command up to the specified count, with optional exponential backoff.
- Crucially, each retry should reset the layer to its clean pre-`RUN` state (from the cached previous layer) to ensure idempotency and avoid running on a polluted filesystem.
- Optionally customize retries based on exit codes (e.g., only retry on network-related errors like curl's 6 or 7).
- If all retries fail, the build fails as usual.

This would make builds more resilient to flaky externalities without requiring custom scripts that don't handle layer resets properly.

## Use case example

New `RUN` flags are specifically needed in cases when these 2 conditions are met:
1. The underlying command/script doesn't have built-in retry flags, in contrast to, for example, curl.
2. Patching the underlying binary/script that can fail is not suitable.

In a Dockerfile:
```
RUN --retry=3 curl https://getmic.ro | bash
```
If internal commands for fetching data inside the installation script fail due to a temporary network issue, retry up to 3 times, resetting the layer each time, before failing the build. Keep in mind that setting curl's retry parameters only affects the way the script will be fetched, not the execution of the underlying commands that would fetch the `micro` binary itself; that's what makes `RUN`'s flag necessary.

## Alternatives considered

- Manual build restarts, which lead to throwing out into the trash half-baked results of other pending operations, hence wasting the compute
- Unreliable, inconsistent ad-hoc in-`RUN` retry loops (e.g., bash for-loop with sleep), which don't reset layer state, leading to potential inconsistencies, and additional mental overhead. In-`RUN` retry loop example: [`moby/buildkit /Dockerfile`](https://github.com/moby/buildkit/blob/73d52dd274651723b1c156f0947805077942f4d4/Dockerfile#L114-L133)
- Helpers like [kadwanev/retry](https://github.com/kadwanev/retry), which you have to find a way to put into the image first, and then ensure they don't end up in your production runtime image, because these `retry` helpers are only needed at build-time
- External CI tools (e.g., Buildkite job retries), which operate at a higher level, not granular to specific commands, and basically have the same consequences as manual build restarts, but with a bit better DevX


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature Request: Automatic retries for Dockerfile `RUN` commands with layer resetting #6429

Current behavior

Desired behavior

Use case example

Alternatives considered

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Feature Request: Automatic retries for Dockerfile RUN commands with layer resetting #6429

Description

Current behavior

Desired behavior

Use case example

Alternatives considered

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Feature Request: Automatic retries for Dockerfile `RUN` commands with layer resetting #6429