afeldman-nm
Collaborator

afeldman-nm commented Sep 26, 2025

This PR attempts to speed up docker build operations during PR regression test runs. The speedup comes from exploiting layer caching in a remote registry cache (caching at image granularity already exists).

This PR should address vllm-project/vllm#25004 and contribute to addressing vllm-project/vllm#23588.

I am also hopeful that, by increasing layer reuse between builds, this PR will indirectly increase the utilization of worker-local layer caches during unit-test docker pull operations (a given worker should see many repeated layer pulls across consecutive unit tests), thereby lowering individual unit-test startup times as described in vllm-project/vllm#24779.

Key changes:

  • Use docker buildx build instead of docker build (to enable registry caching)
  • Create a docker buildx builder instance that uses the docker-container driver (to enable registry caching)
  • Use --cache-from and --cache-to for remote registry caching as shown below, with mode=max to ensure that layers from intermediate build stages are cached:
        --cache-from=type=registry,ref={{ docker_image_cache }}
        --cache-to=type=registry,ref={{ docker_image_cache }},mode=max
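
A minimal sketch of how the three changes above might fit together on the command line. The builder name, image tag, and cache reference are hypothetical placeholders (the cache reference stands in for {{ docker_image_cache }}), and the script only prints the commands unless DRY_RUN=0 is set, so this is an illustration under those assumptions rather than the exact CI invocation:

```shell
#!/usr/bin/env bash
# Sketch only: all names and refs below are hypothetical placeholders.
set -euo pipefail

BUILDER_NAME="${BUILDER_NAME:-ci-builder}"                           # hypothetical builder name
IMAGE_TAG="${IMAGE_TAG:-registry.example.com/vllm-ci:test}"          # hypothetical image tag
CACHE_REF="${CACHE_REF:-registry.example.com/vllm-ci:buildcache}"    # stands in for {{ docker_image_cache }}
DRY_RUN="${DRY_RUN:-1}"                                              # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

main() {
  # Create (and select) a builder backed by the docker-container driver.
  run docker buildx create --name "$BUILDER_NAME" --driver docker-container --use

  # Build with the registry as both cache source and cache destination.
  # mode=max also exports layers from intermediate build stages.
  run docker buildx build \
    --builder "$BUILDER_NAME" \
    --cache-from "type=registry,ref=$CACHE_REF" \
    --cache-to "type=registry,ref=$CACHE_REF,mode=max" \
    --tag "$IMAGE_TAG" \
    --push \
    .
}

main
```

Running the script as-is prints the two docker commands; setting DRY_RUN=0 would execute them against a real registry.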

Signed-off-by: Andrew Feldman <[email protected]>
Collaborator Author

afeldman-nm commented Sep 26, 2025

I have succeeded in getting docker buildx build to use the remote registry layer cache.

However, I have not yet demonstrated build-time savings.

With a completely empty remote cache, the build time was 46 min, about twice as long as the current build time, because of the time needed to push all of the layers to the remote cache. This is not necessarily bad or unexpected:

https://buildkite.com/vllm/ci/builds/32642#019986ef-3aa4-4865-aa56-5844310e119c

However, the subsequent build, in which a one-command modification to the vLLM source had been made, still took about 33 min, which is on the order of the typical build time.

https://buildkite.com/vllm/ci/builds/32648#01998725-e681-4b77-961b-563059c0413a

We would hope that, since only the vLLM source was modified, every image layer not associated with the vLLM source would be a cache hit. Instead, here is a breakdown, by build stage, of which layers did and did not hit the cache in the "subsequent" image build linked above (along with the time needed to build the layers that missed):

[base 1/11] - pull
[base 2/11] - CACHED
[base 3/11] - CACHED
[base 4/11] - CACHED
[base 5/11] - CACHED
[base 6/11] - CACHED
[base 7/11] - CACHED
[base 8/11] - CACHED
[base 9/11] - CACHED
[base 10/11] - CACHED
[base 11/11] - CACHED

[build 1/8] - CACHED
[build 2/8] - CACHED
[build 3/8] - performed COPY 48.2s
[build 4/8] - performed RUN 0.2s
[build 5/8] - performed RUN (compile) 78.3s
[build 6/8] - performed RUN 0.2s
[build 7/8] - performed COPY 0.0s
[build 8/8] - performed RUN 0.3s

[vllm-base 1/21] - performed FROM
[vllm-base 2/21] - CACHED
[vllm-base 3/21] - CACHED
[vllm-base 4/21] - CACHED
[vllm-base 5/21] - CACHED
[vllm-base 6/21] - CACHED
[vllm-base 7/21] - CACHED
[vllm-base 8/21] - performed RUN (dist/*.whl) 193.3s
[vllm-base 9/21] - performed RUN (/vllm-workspace?) 7.8s
[vllm-base 10/21] - performed COPY 0.0s 0.1s
[vllm-base 11/21] - performed COPY 0.0s
[vllm-base 12/21] - performed COPY 0.0s
[vllm-base 13/21] - 0.1s
[vllm-base 14/21] - 0.0s
[vllm-base 15/21] - 2.4s
[vllm-base 16/21] - 0.0s
[vllm-base 17/21] - 43.7s
[vllm-base 18/21] - 0.0s
[vllm-base 19/21] - 3.0s
[vllm-base 20/21] - 0.0s
[vllm-base 21/21] - 334s

[test 1/7] - performed ADD 0.7s
[test 2/7] - performed RUN 64.2s
[test 3/7] - performed RUN 1.1s
[test 4/7] - performed RUN 1.5s
[test 5/7] - performed COPY 0.0s
[test 6/7] - performed RUN 0.1s
[test 7/7] - performed RUN 0.1s

Note that test 5/7 through test 7/7 move the precompiled vLLM package into the image's Python install and then copy in the source. In principle, these are the only layers that should have had cache misses. Figuring out why that is not the case is still a TODO.

Together, the steps above take about 12 min.

Additionally, the following docker image build steps took a significant amount of time:

  • Exporting layers: 273.2s
  • Pushing layers: 313.3s
  • Writing cache: 206.8s
  • Total: 13.22min
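
As a quick check of the total above, the three export, push, and cache-write timings can be summed and converted to minutes. This is just a sketch of the arithmetic; the function name is a hypothetical helper, and the numbers are copied from the list:

```shell
#!/usr/bin/env bash
# Sum the export/push/cache-write timings (seconds) listed above
# and report the total in seconds and minutes.
overhead_total() {
  printf '%s\n' 273.2 313.3 206.8 |
    awk '{ total += $1 } END { printf "%.1f s = %.2f min\n", total, total / 60 }'
}

overhead_total   # prints: 793.3 s = 13.22 min
```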
