🚀 The feature, motivation and pitch
Description
Docker pulls an entire image before starting a container, which can take several minutes for large images (e.g. 30 GB). Lazy-pulling snapshotters such as stargz-snapshotter and nydus-snapshotter let a container start before the image is fully downloaded: they fetch only the data needed for startup and retrieve the rest on demand or in the background. This could significantly reduce the time between "docker run" and the container actually starting, which is especially valuable for CI test containers.
What You'll Do
Research Phase:
- Deep dive into stargz-snapshotter and nydus-snapshotter architectures
- Compare performance characteristics, maturity, and maintenance status
- Analyze compatibility with current setup (Docker/containerd, BuildKit)
- Document security implications and production readiness
- Check if vLLM's workload patterns benefit from lazy pull
POC Phase:
- Set up test environment with containerd + snapshotter
- Convert vLLM test image to stargz/nydus format
- Benchmark startup times: traditional pull vs lazy pull
- Measure actual layer access patterns during test runs
- Evaluate complexity vs benefit tradeoff
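A minimal sketch of how the startup benchmark in the POC could be scripted. It only times arbitrary commands and reports the median over several cold runs; the function names (`time_command`, `benchmark`) and the idea of passing a separate reset command are assumptions for illustration, not part of any existing CI tooling.

```python
import statistics
import subprocess
import time
from typing import Optional, Sequence


def time_command(cmd: Sequence[str]) -> float:
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        cmd,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def benchmark(
    run_cmd: Sequence[str],
    reset_cmd: Optional[Sequence[str]] = None,
    repeats: int = 5,
) -> dict:
    """Time run_cmd `repeats` times, optionally resetting state before each run.

    reset_cmd is meant to remove the locally cached image so that every
    sample is a cold start; its failure is ignored (e.g. nothing to remove).
    """
    samples = []
    for _ in range(repeats):
        if reset_cmd is not None:
            subprocess.run(
                reset_cmd,
                check=False,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        samples.append(time_command(run_cmd))
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "samples": samples,
    }
```

For the traditional-pull baseline, `run_cmd` could be something like `["docker", "run", "--rm", IMAGE, "true"]` with `reset_cmd=["docker", "rmi", "-f", IMAGE]`, where `IMAGE` is a placeholder for the test image and `true` is assumed to exist inside it. The lazy-pull variant would substitute whichever client and snapshotter configuration the POC ends up using. Timing a trivial command approximates time to container start; a full test run should be timed separately to capture on-demand fetch overhead.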
Decision & Documentation:
- Create recommendation: implement, defer, or skip
- Document findings with benchmarks
- If implementing: create detailed implementation plan
- If skipping: document why and conditions for revisiting
Deliverables
- Research document comparing stargz vs nydus
- Architecture diagram: how it integrates with current CI
- POC setup on test instances
- Performance benchmarks (startup time, total pull time)
- Layer access analysis (which layers are accessed, and when)
- Security and reliability assessment
- Cost-benefit analysis
- Go/No-Go recommendation with justification
- Implementation plan (if Go) or deferral conditions (if No-Go)
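The layer access analysis deliverable could be produced by a small aggregation step like the sketch below. The `Access` record format (layer identifier, seconds since container start, bytes fetched) is hypothetical; how such records are obtained from a given snapshotter's logs or metrics is not covered here and would need to be worked out during the POC.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Access:
    """One on-demand fetch event (hypothetical record format)."""
    layer: str       # layer digest or label
    offset_s: float  # seconds since container start
    nbytes: int      # bytes fetched by this event


def summarize(accesses: Iterable[Access], layer_sizes: Dict[str, int]) -> dict:
    """Per layer: first access time, bytes fetched, and fraction of the layer fetched."""
    fetched: Dict[str, int] = defaultdict(int)
    first: Dict[str, float] = {}
    for a in accesses:
        fetched[a.layer] += a.nbytes
        first[a.layer] = min(first.get(a.layer, a.offset_s), a.offset_s)

    report = {}
    for layer, size in layer_sizes.items():
        got = fetched.get(layer, 0)
        first_access: Optional[float] = first.get(layer)  # None if never touched
        report[layer] = {
            "first_access_s": first_access,
            "bytes_fetched": got,
            "fraction_fetched": got / size if size else 0.0,
        }
    return report
```

Layers whose `first_access_s` is `None` were never touched during the run; a large share of such layers, or low `fraction_fetched` values, would suggest the workload is a good candidate for lazy pulling.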
Research Questions to Answer
Technical Feasibility:
- Does BuildKit support building stargz/nydus images?
- Can ECR store these formats natively?
- How does lazy pulling work with Docker-in-Docker (used in CI)?
- What's the containerd version requirement?
Performance:
- How much faster is container startup? (target: 50%+ improvement)
- Does on-demand pulling slow down test execution?
- What is the network bandwidth impact of pulling while tests are running?
- Does FSx cache help lazy pull or make it redundant?
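The 50%+ startup target above can be made concrete as a percentage reduction relative to the traditional-pull baseline. The sketch below shows one way to compute it; the function names are illustrative, and the numbers in the usage note are made-up inputs, not measurements.

```python
def startup_improvement(baseline_s: float, lazy_s: float) -> float:
    """Percentage reduction in startup time relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - lazy_s) / baseline_s * 100.0


def meets_target(baseline_s: float, lazy_s: float, target_pct: float = 50.0) -> bool:
    """True if lazy pulling reduces startup time by at least target_pct percent."""
    return startup_improvement(baseline_s, lazy_s) >= target_pct
```

With hypothetical inputs of a 240 s baseline and a 90 s lazy-pull startup, `startup_improvement(240, 90)` gives 62.5, which would meet the target. The same calculation applied to total test wall-clock time would show whether on-demand fetching cancels out the startup gain.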
Reliability:
- What happens if network fails mid-pull?
- How mature are these projects? (production-ready?)
- Who maintains them? (Google, Alibaba Cloud, community)
- Any known issues or limitations?
Complexity:
- How much effort to implement and maintain?
- Does it require changes to all Dockerfiles?
- What is the impact on existing cache strategies (Tasks 2, 11, 12)?
- How to roll back if issues arise?