Description
Consolidating some of the discussion @ngam had around using NVIDIA GPU Cloud (NGC) containers as the base image for pytorch-notebook and ml-notebook, and potentially cupy (#322)
- Add cupy to ml notebooks #322 (comment)
- Use cudatoolkit=11 in both tensorflow and pytorch images #320 (comment)
- Benchmarks: uploading images ngam/ngc-ext-pangeo#24 (comment)
Is your feature request related to a problem? Please describe.
For machine learning and data analytics work that relies on NVIDIA Graphics Processing Units (GPUs), there are several driver- and hardware-level optimizations that can speed up processing workflows. Currently, the pytorch-notebook and ml-notebook docker images rely on CUDA libraries from conda-forge, which are less optimized than those available on NGC.
Describe the solution you'd like
Refactor pytorch-notebook and ml-notebook to be based on NGC containers instead of the current base image. This might involve flipping the current installation pipeline from Pangeo-first/ML-second (base-notebook -> pangeo-notebook -> ml-notebook) to ML-first/Pangeo-second (ngc -> ml-notebook -> pangeo-notebook). One thing that can help with this is a pangeo-notebook metapackage #359
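The flipped build order could look roughly like the sketch below. This is illustrative only: the NGC image tag, the registry/image names, and the exact package lists are assumptions, not settled choices.

```dockerfile
# Hypothetical ngc-ml-notebook: start from an NGC PyTorch container
# (tag is illustrative; check the NGC catalog for current tags)
FROM nvcr.io/nvidia/pytorch:22.04-py3

# Add the Jupyter tooling that ml-notebook provides today;
# the real package list would mirror the current ml-notebook environment
RUN pip install --no-cache-dir jupyterlab

# --- pangeo-notebook would then build FROM the image above, e.g.: ---
# FROM pangeo/ngc-ml-notebook:latest
# RUN conda install -c conda-forge pangeo-notebook  # metapackage proposed in #359
```

The key design point is that the GPU-optimized layers come first, so downstream images inherit NGC's CUDA stack rather than replacing it with conda-forge builds.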
Describe alternatives you've considered
Spin things off into a different repository (pangeo-gpu-docker-images?), or have a separate build chain (ngc-pytorch-notebook, ngc-ml-notebook) from the current CI/CD infrastructure.
Additional context
One benefit of changing the build order to ML-first/Pangeo-second is that ML folks who don't need all of the heavy Climate/Ocean packages in pangeo-notebook can get a slimmer ml-notebook. For example, if they're deploying a model behind a server API, they can base their docker image on ngc-ml-notebook instead of the current heavy ml-notebook.
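As a concrete (hypothetical) sketch of that deployment use case, a model-serving image could skip the Pangeo layer entirely. The image name, file names, and serving stack below are all illustrative assumptions:

```dockerfile
# Serve a trained model from the slimmer ML base, skipping pangeo-notebook.
# "pangeo/ngc-ml-notebook" is a hypothetical name from the proposed build chain.
FROM pangeo/ngc-ml-notebook:latest

# model.pt and app.py are placeholders for a user's model and API code
COPY model.pt app.py /app/
WORKDIR /app
RUN pip install --no-cache-dir fastapi uvicorn

# Expose an inference API instead of launching a Jupyter server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```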
A disadvantage is that the refactoring will require some effort, and we need to be careful to ensure this doesn't break existing JupyterHub deployments.