Docker CUDA Image Bloat Checker #95

@mitulgarg

Summary

Add image size and bloat detection to the existing env-doctor dockerfile command. Flag oversized CUDA base images, suggest slimmer alternatives, and recommend multi-stage builds when compilation layers aren't needed at runtime.

Motivation

CUDA Docker images are notoriously bloated: nvidia/cuda:12.4.0-devel-ubuntu22.04 is ~4.5GB, while nvidia/cuda:12.4.0-runtime-ubuntu22.04 is ~1.2GB. Many users unknowingly ship devel images to production, or use full CUDA images when they only need the runtime. The current env-doctor dockerfile command validates correctness but doesn't flag optimization opportunities.

Current State

The existing Dockerfile validator (src/env_doctor/validators/dockerfile_validator.py) checks:

  • Base image CUDA version compatibility
  • Library version mismatches
  • Runtime vs devel mismatch (only when compilation packages need devel)
  • Driver installation errors
  • Deprecated packages

It does NOT check:

  • Base image size or bloat
  • Whether devel is used unnecessarily (without compilation)
  • Multi-stage build opportunities
  • Redundant CUDA component installation on top of CUDA base images
  • Slimmer alternative base images

Proposed Implementation

New Data: cuda_image_sizes.json

  • Estimated sizes for common base image variants:
    {
      "nvidia/cuda": {
        "12.4.0-devel-ubuntu22.04": { "size_mb": 4500, "variant": "devel" },
        "12.4.0-runtime-ubuntu22.04": { "size_mb": 1200, "variant": "runtime" },
        "12.4.0-base-ubuntu22.04": { "size_mb": 250, "variant": "base" }
      }
    }
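
A minimal sketch of how the lookup table could feed a savings estimate. The function names (lookup_size_mb, estimate_savings_mb) are hypothetical and not part of the existing validator; the table is inlined here instead of being read from cuda_image_sizes.json so the snippet stands alone, and the sizes are the estimates given above, not measured values.

```python
import json
from typing import Optional

# Inlined copy of the proposed cuda_image_sizes.json (estimated sizes in MB).
CUDA_IMAGE_SIZES = json.loads("""
{
  "nvidia/cuda": {
    "12.4.0-devel-ubuntu22.04": { "size_mb": 4500, "variant": "devel" },
    "12.4.0-runtime-ubuntu22.04": { "size_mb": 1200, "variant": "runtime" },
    "12.4.0-base-ubuntu22.04": { "size_mb": 250, "variant": "base" }
  }
}
""")


def lookup_size_mb(image: str) -> Optional[int]:
    """Return the estimated size for 'repo:tag', or None if it is not in the table."""
    repo, _, tag = image.partition(":")
    entry = CUDA_IMAGE_SIZES.get(repo, {}).get(tag)
    return entry["size_mb"] if entry else None


def estimate_savings_mb(image: str, target_variant: str) -> Optional[int]:
    """Estimate MB saved by switching 'image' to the same tag with another variant."""
    repo, _, tag = image.partition(":")
    entries = CUDA_IMAGE_SIZES.get(repo, {})
    current = entries.get(tag)
    if current is None:
        return None
    # Swap only the variant segment of the tag, e.g. "-devel-" -> "-runtime-".
    target_tag = tag.replace(f"-{current['variant']}-", f"-{target_variant}-")
    target = entries.get(target_tag)
    if target is None:
        return None
    return current["size_mb"] - target["size_mb"]
```

With the table above, estimate_savings_mb("nvidia/cuda:12.4.0-devel-ubuntu22.04", "runtime") returns 3300. Images missing from the table return None, so callers can skip the size estimate rather than guess.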
    

New Validation Methods in DockerfileValidator

  1. _validate_image_bloat()
     • If using a devel image without any compilation commands (nvcc, gcc, make, pip install from source), warn and suggest the runtime or base variant
     • Estimate base image size from the lookup table; flag if >2GB without justification
  2. _validate_redundant_cuda_installs()
     • Detect apt-get install cuda-* or conda install cudatoolkit on top of nvidia/cuda base images
     • These are redundant and add gigabytes unnecessarily
  3. _suggest_multi_stage_build()
     • If the devel image IS needed for compilation, suggest a multi-stage build pattern:

         # Build stage: compile with devel
         FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder
         RUN pip install flash-attn --no-build-isolation

         # Runtime stage: slim image
         FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
         COPY --from=builder /usr/local/lib/python3.11/...

     • Only trigger when both compilation packages AND devel image are detected
  4. _validate_unnecessary_layers()
     • Flag apt-get install without --no-install-recommends (common bloat source)
     • Flag missing rm -rf /var/lib/apt/lists/* after apt operations
     • Flag pip install without --no-cache-dir
Integration

  • Add new checks to existing DockerfileValidator.validate() pipeline
  • Results appear as WARNING or INFO level alongside existing checks
  • Include estimated size savings in recommendations
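
One possible shape for a finding that carries a severity level and an optional savings estimate is sketched below. BloatFinding and render() are hypothetical names; the existing validator's result type is not shown in this issue, so a real implementation would reuse whatever DockerfileValidator.validate() already returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BloatFinding:
    """Hypothetical result record for a single bloat check."""
    severity: str  # "WARNING" or "INFO", per the integration notes above
    message: str
    estimated_savings_mb: Optional[int] = None

    def render(self) -> str:
        """Format the finding as one output line, appending savings when known."""
        line = f"[{self.severity}] {self.message}"
        if self.estimated_savings_mb is not None:
            line += f" (estimated savings: ~{self.estimated_savings_mb / 1000:.1f} GB)"
        return line


finding = BloatFinding(
    severity="WARNING",
    message="devel image used without compilation; consider nvidia/cuda:12.4.0-runtime-ubuntu22.04",
    estimated_savings_mb=3300,
)
print(finding.render())
```

Keeping the savings as an optional numeric field, rather than baking it into the message, lets checks without a size estimate (for example a missing --no-cache-dir) use the same record.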

Acceptance Criteria

  • Detects unnecessary devel image usage and suggests runtime/base alternatives
  • Flags redundant CUDA package installations on CUDA base images
  • Suggests multi-stage builds when compilation is needed but runtime doesn't need devel
  • Flags common Dockerfile bloat patterns (missing --no-cache-dir, --no-install-recommends)
  • Estimated size savings shown in recommendations
  • Integrates cleanly into existing env-doctor dockerfile output
  • No false positives: devel image with compilation commands is NOT flagged as bloated
