Description
Consolidating some of the discussion @ngam had around using NVIDIA GPU Cloud (NGC) containers as the base image for pytorch-notebook and ml-notebook, and potentially cupy (#322)
- Add cupy to ml notebooks #322 (comment)
- Use cudatoolkit=11 in both tensorflow and pytorch images #320 (comment)
- Benchmarks: uploading images ngam/ngc-ext-pangeo#24 (comment)
Is your feature request related to a problem? Please describe.
For machine learning and data analytics work that relies on NVIDIA Graphics Processing Units (GPUs), there are several driver- and hardware-level optimizations that can speed up processing workflows. Currently, the pytorch-notebook and ml-notebook docker images rely on CUDA libraries from conda-forge, which are less optimized than those available on NGC.
Describe the solution you'd like
Refactor pytorch-notebook and ml-notebook to be based on NGC containers instead of the current base image. This might involve flipping the current installation pipeline from Pangeo-first/ML-second (base-notebook -> pangeo-notebook -> ml-notebook) to ML-first/Pangeo-second (ngc -> ml-notebook -> pangeo-notebook). One thing that can help with this is a pangeo-notebook metapackage #359
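The flipped build order could look roughly like the sketch below. This is illustrative only: the NGC image tag, the registry/image names, and the exact package lists are assumptions, not settled choices.

```dockerfile
# Hypothetical ngc-ml-notebook: start from an NGC PyTorch container
# (tag is illustrative; check the NGC catalog for current tags)
FROM nvcr.io/nvidia/pytorch:22.04-py3

# Add the Jupyter tooling that ml-notebook provides today;
# the real package list would mirror the current ml-notebook environment
RUN pip install --no-cache-dir jupyterlab

# --- pangeo-notebook would then build FROM the image above, e.g.: ---
# FROM pangeo/ngc-ml-notebook:latest
# RUN conda install -c conda-forge pangeo-notebook  # metapackage proposed in #359
```

The key design point is that the GPU-optimized layers come first, so downstream images inherit NGC's CUDA stack rather than replacing it with conda-forge builds.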
Describe alternatives you've considered
Spin things off into a different repository (pangeo-gpu-docker-images?), or have a separate build chain (ngc-pytorch-notebook, ngc-ml-notebook) from the current CI/CD infrastructure.
Additional context
One benefit of changing the build order to ML-first/Pangeo-second is that ML folks who don't need all of the heavy Climate/Ocean packages in pangeo-notebook can get a slimmer ml-notebook. For example, if they're deploying a model behind a server API, they can base their docker image on ngc-ml-notebook instead of the current heavy ml-notebook.
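As a concrete (hypothetical) sketch of that deployment use case, a model-serving image could skip the Pangeo layer entirely. The image name, file names, and serving stack below are all illustrative assumptions:

```dockerfile
# Serve a trained model from the slimmer ML base, skipping pangeo-notebook.
# "pangeo/ngc-ml-notebook" is a hypothetical name from the proposed build chain.
FROM pangeo/ngc-ml-notebook:latest

# model.pt and app.py are placeholders for a user's model and API code
COPY model.pt app.py /app/
WORKDIR /app
RUN pip install --no-cache-dir fastapi uvicorn

# Expose an inference API instead of launching a Jupyter server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```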
A disadvantage is that the refactoring will require some effort, and we need to be careful to ensure this doesn't break existing JupyterHub deployments.