multistage cache lookahead

For single-stage builds we perform a cache lookahead. This means that with a 100% cache hit rate we don't even need to download and unpack the files in order to create the image. The idea here is to extend this lookahead so that it works across stages too.

Currently, in multistage builds every stage is built (from cache), even if the stages that depend on it would have resolved to a 100% cache hit rate themselves. So even with a 100% cache hit rate, we still need to download and unpack all the files for every multistage ancestor. For long `FROM` chains this is partially mitigated by squashing the stages together, so that in effect we build a single stage. With `COPY --from` this is not possible, as we can't squash across forks and merges. If we opt into `--cache-copy-layers`, we don't use the files from the ancestor image built this way at all and instead load them from the cache directly. In that case the download and unpack were entirely in vain, and we can skip them as an optimization.

The difficulty here is that our cache key currently depends on the file contents. This is the safe option: even if an upstream image or a multistage ancestor changes, we still get a cache hit as long as the file contents stay the same. The reverse is true too: we don't need to detect upstream changes explicitly, because we notice them through the changed files. However, the lookahead needs to know the key a priori, so it only works if the files are guaranteed to be the same. This is the case, for example, if you reference images by their shasum (digest), because the files are then guaranteed to be identical after download. The same logic applies if you provide a checksum to `COPY`/`ADD`. That is not yet implemented in kaniko, but this would be a good incentive to add it.
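To make the "guaranteed to be the same" condition concrete, here is a minimal sketch of a check for digest-pinned image references. The helper name `isPinnedByDigest` is hypothetical and not taken from kaniko, and the check only looks at the textual form of the reference (a trailing `@sha256:` followed by 64 hex characters); a real implementation would more likely rely on a proper reference parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// digestSuffix matches a trailing "@sha256:<64 lowercase hex chars>".
var digestSuffix = regexp.MustCompile(`@sha256:[a-f0-9]{64}$`)

// isPinnedByDigest (hypothetical helper) reports whether ref names an image
// by content digest. Only then can the lookahead assume that the files are
// the same without downloading them.
func isPinnedByDigest(ref string) bool {
	return digestSuffix.MatchString(ref)
}

func main() {
	// Made-up digest, used for illustration only.
	pinned := "alpine@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
	fmt.Println(isPinnedByDigest(pinned))        // true
	fmt.Println(isPinnedByDigest("alpine:3.19")) // false: a tag can move
}
```

A tag such as `alpine:3.19` can be re-pushed with different contents, so under this sketch it would fall back to file hashing.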

We can know a priori whether a lookahead key will be stable, and we can fall back to file hashing when it is not. The only downside is that we would lose a cache hit if the a priori key changed but the file contents did not. I'm not yet sure whether we can remedy this case, as it would basically require two references to the same cache layer.
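The choice between the two key strategies could be sketched as follows. This is an illustration under assumptions, not kaniko's actual key derivation: `layerKey`, its parameters, and the way the inputs are combined are all hypothetical. The point it shows is that the file-hashing callback is only invoked when the key is not known to be stable.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// layerKey (hypothetical) derives a cache key for one instruction.
// If stable is true, the key is computed a priori from the parent key and
// the instruction text alone. Otherwise hashFiles is called and its result
// is mixed in, which requires the files to be present on disk.
func layerKey(parentKey, instruction string, stable bool, hashFiles func() (string, error)) (string, error) {
	h := sha256.New()
	h.Write([]byte(parentKey))
	h.Write([]byte{0}) // separator so that field boundaries stay unambiguous
	h.Write([]byte(instruction))
	if !stable {
		fileHash, err := hashFiles()
		if err != nil {
			return "", err
		}
		h.Write([]byte{0})
		h.Write([]byte(fileHash))
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	calls := 0
	hashFiles := func() (string, error) {
		calls++ // stands in for the expensive download, unpack and hash
		return "hash-of-file-contents", nil
	}

	if _, err := layerKey("parent", "RUN make", true, hashFiles); err != nil {
		panic(err)
	}
	fmt.Println("file hashing calls with stable key:", calls)

	if _, err := layerKey("parent", "COPY . /src", false, hashFiles); err != nil {
		panic(err)
	}
	fmt.Println("file hashing calls after unstable key:", calls)
}
```

The sketch also makes the stated downside visible: two builds with identical file contents but different a priori inputs produce different keys, so the second one misses the cache.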

The implementation is not straightforward: before we can decide whether we have a cache hit, a lot of transformations have to happen on the stages. These range from small things, like replacing named stage references with the stage index, to more significant ones, like unrolling `ONBUILD` instructions. Currently the control flow is `skip&squash -> transformations -> build`. To implement cache lookahead we have to flip that around to `transformations -> skip&squash -> build`. To avoid overloading the reviewers, I will split those changes into smaller, more digestible refactorings with that goal in mind.
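As an illustration of one of the smaller transformations, the following sketch replaces named stage references in `COPY --from` with stage indices. The `stage` type and `resolveStageRefs` are hypothetical and much simpler than kaniko's real data structures. The sketch assumes exact name matching and only resolves names of earlier stages; anything else (for example an external image) is left untouched.

```go
package main

import (
	"fmt"
	"strconv"
)

// stage is a simplified, hypothetical model of one build stage.
type stage struct {
	name     string   // optional name from "FROM ... AS name"
	copyFrom []string // values of COPY --from in this stage
}

// resolveStageRefs rewrites named references to earlier stages into their
// stage index, in place. References that match no earlier stage are kept.
func resolveStageRefs(stages []stage) {
	index := make(map[string]int)
	for i := range stages {
		for j, ref := range stages[i].copyFrom {
			if idx, ok := index[ref]; ok {
				stages[i].copyFrom[j] = strconv.Itoa(idx)
			}
		}
		// Register the name only after processing the stage itself,
		// so a stage can never refer to itself or to a later stage.
		if stages[i].name != "" {
			index[stages[i].name] = i
		}
	}
}

func main() {
	stages := []stage{
		{name: "builder"},
		{copyFrom: []string{"builder", "nginx:latest"}},
	}
	resolveStageRefs(stages)
	fmt.Println(stages[1].copyFrom) // [0 nginx:latest]
}
```

Running such transformations first means the skip and squash decisions operate on the same normalized stages that the build later sees.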

multistage cache lookahead #334

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

multistage cache lookahead #334

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions