[Feature][P2]: Investigate Lazy Image Pull with Stargz/Nydus Snapshotters #28655

@rzabarazesh

🚀 The feature, motivation and pitch

Description

Docker pulls an entire image before starting a container, which can take several minutes for large images (30 GB). Lazy-pulling technologies such as stargz-snapshotter and nydus-snapshotter let containers start almost immediately by fetching only the data needed for startup and pulling the remaining content on demand or in the background. This could significantly reduce the time between "docker run" and the container actually starting, which is especially beneficial for test containers.

What You'll Do

  1. Research Phase:

    • Deep dive into stargz-snapshotter and nydus-snapshotter architectures
    • Compare performance characteristics, maturity, and maintenance status
    • Analyze compatibility with current setup (Docker/containerd, BuildKit)
    • Document security implications and production readiness
    • Check if vLLM's workload patterns benefit from lazy pull
  2. POC Phase:

    • Set up test environment with containerd + snapshotter
    • Convert vLLM test image to stargz/nydus format
    • Benchmark startup times: traditional pull vs lazy pull
    • Measure actual layer access patterns during test runs
    • Evaluate complexity vs benefit tradeoff
  3. Decision & Documentation:

    • Create recommendation: implement, defer, or skip
    • Document findings with benchmarks
    • If implementing: create detailed implementation plan
    • If skipping: document why and conditions for revisiting
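
The startup-time benchmark in the POC phase could be driven by a small timing harness. The following is a minimal sketch, not an agreed design: `time_command` and `summarize` are hypothetical helper names, the stand-in command only exists so the script runs anywhere, and the two commands in the trailing comment are untested assumptions with placeholder image names.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failed pull or run abort the benchmark loudly.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (median, min, max) so outliers stay visible."""
    return statistics.median(durations), min(durations), max(durations)


if __name__ == "__main__":
    # Stand-in command; replace with the real "pull + run" command for each
    # variant. Assumed examples (placeholders, not verified in this CI):
    #   ["docker", "run", "--rm", "<image>", "true"]
    #   ["nerdctl", "--snapshotter=stargz", "run", "--rm", "<estargz-image>", "true"]
    baseline = time_command([sys.executable, "-c", "pass"])
    print("median=%.3fs min=%.3fs max=%.3fs" % summarize(baseline))
```

For a fair comparison, each run would need to start from a cold image cache on the test instance; otherwise the traditional pull is measured against already-downloaded layers.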

Deliverables

  • Research document comparing stargz vs nydus
  • Architecture diagram: how it integrates with current CI
  • POC setup on test instances
  • Performance benchmarks (startup time, total pull time)
  • Layer access analysis (which layers accessed when)
  • Security and reliability assessment
  • Cost-benefit analysis
  • Go/No-Go recommendation with justification
  • Implementation plan (if Go) or deferral conditions (if No-Go)
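
The layer access analysis could be reduced to a per-layer summary once access events have been collected. The sketch below assumes a hypothetical record shape (`Access` with a timestamp, a layer label, and a byte count); neither snapshotter is known to emit exactly this format, so a real analysis would first need to map whatever logs or metrics are available onto it. The sample data is made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    t: float      # seconds since container start
    layer: str    # layer label (hypothetical identifier)
    nbytes: int   # bytes fetched by this access


def summarize_accesses(accesses):
    """Map each layer to (time of first access, total bytes fetched)."""
    first_seen = {}
    total = defaultdict(int)
    for a in sorted(accesses, key=lambda a: a.t):
        first_seen.setdefault(a.layer, a.t)
        total[a.layer] += a.nbytes
    return {layer: (first_seen[layer], total[layer]) for layer in first_seen}


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        Access(0.2, "base-os", 4096),
        Access(0.9, "python-deps", 65536),
        Access(1.4, "base-os", 8192),
    ]
    for layer, (t0, nbytes) in summarize_accesses(sample).items():
        print(f"{layer}: first access at {t0:.1f}s, {nbytes} bytes fetched")
```

Layers that never appear in the summary during a full test run are the ones lazy pulling would avoid downloading up front.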

Research Questions to Answer

Technical Feasibility:

  • Does BuildKit support building stargz/nydus images?
  • Can ECR store these formats natively?
  • How does it work with Docker-in-Docker (used in CI)?
  • What's the containerd version requirement?
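
As a starting point for the BuildKit question, the stargz-snapshotter documentation describes building eStargz images through `docker buildx` with an `estargz` compression option on the image exporter. The sketch below only assembles such a command line; `estargz_build_argv` is a hypothetical helper, the image reference is a placeholder, and the option names should be verified against the BuildKit version used in CI before relying on them. Nydus is not covered here.

```python
import shlex


def estargz_build_argv(image_ref, context_dir="."):
    """Build the argv for a buildx invocation that pushes an eStargz image."""
    output = ",".join([
        "type=registry",           # push directly to the registry
        "oci-mediatypes=true",     # emit OCI media types
        "compression=estargz",     # request eStargz layer compression
        "force-compression=true",  # recompress layers inherited from the base image
    ])
    return ["docker", "buildx", "build", "-t", image_ref, "-o", output, context_dir]


if __name__ == "__main__":
    # Placeholder image reference, not a real repository.
    print(shlex.join(estargz_build_argv("registry.example.com/vllm-test:esgz")))
```

Keeping the command in one helper makes it easy to diff against the existing build invocation when assessing whether Dockerfiles themselves need to change.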

Performance:

  • How much faster is container startup? (target: 50%+ improvement)
  • Does on-demand pulling slow down test execution?
  • What is the network bandwidth impact of pulling while tests are running?
  • Does FSx cache help lazy pull or make it redundant?
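
The 50%+ target can be checked with simple arithmetic once startup times are measured. In the sketch below, `improvement_pct` and `meets_target` are hypothetical helpers, and the 240 s and 90 s figures are invented purely to show the calculation; they are not measurements.

```python
def improvement_pct(baseline_s, lazy_s):
    """Percentage reduction in startup time relative to the traditional pull."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - lazy_s) / baseline_s * 100.0


def meets_target(baseline_s, lazy_s, target_pct=50.0):
    """True if the lazy-pull startup time meets the improvement target."""
    return improvement_pct(baseline_s, lazy_s) >= target_pct


if __name__ == "__main__":
    # Invented numbers: 240 s traditional startup vs 90 s with lazy pull.
    pct = improvement_pct(240.0, 90.0)
    print(f"improvement: {pct:.1f}% (target met: {meets_target(240.0, 90.0)})")
```

Startup time alone does not answer the second question in this list: total test duration should be compared as well, since on-demand fetches can shift the cost into test execution.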

Reliability:

  • What happens if network fails mid-pull?
  • How mature are these projects? (production-ready?)
  • Who maintains them? (Google, Alibaba Cloud, community)
  • Any known issues or limitations?

Complexity:

  • How much effort to implement and maintain?
  • Does it require changes to all Dockerfiles?
  • Impact on existing cache strategies (Tasks 2, 11, 12)?
  • How to roll back if issues arise?

Alternatives

No response

Additional context

No response
