
Supercomputing Conference 18 (SC18) - Dallas

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Deep Learning at Scale - Monday November 12 2018

GitHub Repo

What was covered:

  • Overview of machine learning / deep learning
  • Success stories of deep learning for scientific applications
  • How to train deep learning models with Keras
  • How to scale training to multiple nodes on HPCs
  • How to utilize HPC for hyper-parameter optimization
  • Data management for deep learning at scale
    • Data management and I/O at scale is still an open issue
    • Extreme-scale applications regularly have I/O issues on HPC
      • See Gordon Bell finalist on Deep Learning for Climate
      • See CosmoFlow
    • Some things which have worked
      • Burst-buffer filesystems
      • Pre-staging input files to RAM
      • Per-node input file chunking (no global shuffling); see the sketch after this list
    • General solutions don’t yet exist
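
A minimal sketch of the per-node chunking idea above, using tf.data; the file pattern, worker count, and rank here are illustrative (in practice the rank would come from Horovod or MPI):

import tensorflow as tf

num_workers, worker_rank = 4, 0   # illustrative; would normally come from hvd.size() / hvd.rank()

# Each worker lists the full set of files but keeps only its own shard (no global shuffle).
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=False)
files = files.shard(num_shards=num_workers, index=worker_rank)

dataset = (files
           .interleave(tf.data.TFRecordDataset, cycle_length=4)
           .shuffle(buffer_size=10_000)      # shuffle locally within the shard only
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))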

Introduction to Machine Learning, Deep Learning - Prabhat (from NERSC)

  • Presented classical machine learning algorithms
  • We shouldn't be afraid of non-convex functions if we have large enough data
  • Machine Learning (ML) workflow:
    • What is the problem
    • Data exploration
    • Evaluation metrics
    • Deploy
    • Large scale
    • Monitoring
  • Department of Energy exascale HPC projects are available to try out
  • History of deep learning and why it got so much attention in recent years: computing power and big data
  • Discussed overfitting and how to optimize a Neural Network (NN)
  • Bias vs. variance tradeoff
  • NERSC Big data stack
  • NERSC ML/DL webpage
  • Deep Learning for Science blog
  • Summary:
    • ML has had a tremendous impact in the commercial world
    • Analogs exist in scientific domains
    • Flavors of ML:
      • Supervised tasks: regression, classification
      • Unsupervised tasks: clustering
    • ML algorithms:
      • k-NN, k-means, linear regression, SVM, kernel SVM, deep learning, … (see the sketch after this list)
    • ML workflow
    • ML software is available at DOE HPC facilities, please try it out!
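
As a small illustration of the supervised and unsupervised flavors listed above, a sketch using scikit-learn (chosen here only as a convenient stand-in for whatever ML stack is installed on the facility):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_regression
from sklearn.linear_model import LinearRegression

# Unsupervised: clustering with k-means on synthetic blobs.
X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_blobs)

# Supervised: regression with ordinary least squares on synthetic data.
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("regression R^2:", round(reg.score(X_reg, y_reg), 3))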

Deep Learning for science - Debbie Bard (from NERSC)

  • Real examples of where they use deep learning models with very large data.
  • Building ML models and scaling are 2 different problems
  • Classification and Regression are most popular ML problems
  • Why DL at scale for science?
    • Scientific data is typically large and complex
    • Harder to find optimal hyperparameters
    • Need lots of prototyping and model evaluation
    • Key metric: time to scientific insight
    • Don’t want to wait for days to train a single model
    • Fast turnaround of ideas and exploration
  • ML/DL for climate science used as an example of an application of DL to science
  • Application of cutting-edge methods from machine learning, statistics, topology and applied math to:
    • Predict extreme weather change in the future
    • Improve mechanistic understanding of extreme events
    • Classification and object detection of various types of weather patterns
    • Segmentation (pixel-wise) of extreme weather patterns
    • Tracking of spatio-temporal extreme events, such as hurricanes
    • Generative Learning of complex systems using physics-based constraints
  • Most bottlenecks occur in I/O when scaling
  • For large scale we need HPC
  • KNL nodes on NERSC Cori

Introduction to NERSC - Steve Farrell (from NERSC)

  • NERSC overview:

    • National Energy Research Scientific Computing Center, established in 1974 as the first unclassified supercomputer center
    • Now the Mission HPC facility for the DOE Office of Science
      • 7000 users, 800 projects, 700 codes, 48 states, 40 countries, universities, and national labs
  • The Cori supercomputer:

    • 2,388 Intel Xeon ("Haswell") nodes
    • 9,688 Intel Xeon Phi ("KNL") nodes
    • 1.8 PB "burst buffer" (1,700 GB/s I/O bandwidth)
    • 30 PB Lustre (700 GB/s I/O bandwidth)
    • Entered Top500 list at #5 in Nov 2016, now #10
  • Perlmutter: next-gen system optimized for science

    • Cray Shasta system with 3-4x capability of Cori
    • GPU-accelerated (4x NVIDIA) nodes and CPU-only (AMD) nodes
    • Cray Slingshot high performance network
    • Single-tier All-Flash Lustre based file system
  • Deep Learning at NERSC:

    • Optimized DL software stack
      • Frameworks: TensorFlow, Keras, PyTorch, Caffe, …, backed by Intel MKL-DNN
      • Multi-node libraries: Cray PE DL plugin, Horovod, PyTorch distributed, …
    • Big Data Center collaborations
      • Intel optimizing TensorFlow and PyTorch for CPU with MKL-DNN
      • Cray optimizing DL workflows, distributed training with its Distributed Training Framework, and data management issues
  • Jupyter at NERSC:

    • JupyterHub deployed at NERSC to provide rich, interactive, notebook ecosystem on Cori
    • Jupyter servers run on a dedicated login node and can access the system to offload work.
    • Expansion of the service in progress.
    • Today you can optionally utilize Jupyter as an in-browser GUI for browsing files on Cori and for submitting jobs (via terminal)


Hands-on code walkthrough - Steve Farrell (from NERSC)

  • TensorFlow - Best for production
    • Most popular framework by several metrics
    • Backed by Google and a strong community
    • Moving towards more dynamic (“eager”) workflows
  • Keras - Best for prototyping
    • Usability first; “an API designed for human beings, not machines.”
    • Now included in TF; recommended high-level API (see the minimal example after this list)
  • PyTorch - Best for prototyping | Best for production
    • Rapidly growing in popularity, particularly in academic circles
    • Pythonic experience; easy to use and debug
    • Merging with Caffe2 for easier production deployment
  • Tutorial
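
A minimal tf.keras example in the spirit of the walkthrough; the architecture, dataset, and epoch count are illustrative rather than the tutorial's exact code:

import tensorflow as tf

# Load and normalize MNIST.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small fully connected classifier (illustrative architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))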

Hyper-parameter optimization - Mike Ringenburg (from Cray)

  • Goals of this Session

    • Introduce the concept of Hyperparameter Optimization (HPO)
      • What are hyperparameters?
      • Why are they important?
    • Introduce automatic distributed hyperparameter optimization
    • Using the Cray HPO library as an example
  • What are Hyperparameters?

    • Neural network model parameters – weights (connection, bias)
    • Neural network hyperparameters – any other settings, e.g.:
    • Topology:
      • Number of neurons in fully connected layers
      • Filters, kernel sizes, strides of convolutional / pooling layers
      • Activation function: logistic, ReLU, tanh
    • Training:
      • Learning rate, batch size, momentum
      • Dropout probability, batch normalization
      • Optimizers (SGD, Adam, RMSProp, AdaGrad, etc)
    • Significant impact on convergence
      • Accuracy and time-to-accuracy
      • Good choices especially important at scale
  • Search the space of possible hyperparameter sets:

    • Brute-force search of the entire space is intractable
    • There are more hyperparameter sets than atoms in the observable universe
    • Only evaluate a subspace (prune):
      • Grid or random search (see the sketch after this list)
      • Bayesian optimization based
      • Genetic / evolutionary algorithms (“population-based training”)
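
A minimal random search sketch; train_and_evaluate here is a hypothetical stand-in for a real training run that returns a validation score:

import math
import random

def train_and_evaluate(lr, batch_size):
    # Hypothetical stand-in: a real version would train a model and return a validation score.
    return -((math.log10(lr) + 3) ** 2) - ((batch_size - 128) / 256) ** 2

best_score, best_cfg = float("-inf"), None
for _ in range(50):
    lr = 10 ** random.uniform(-6, -1)                   # sample learning rate on a log scale
    batch_size = random.choice([32, 64, 128, 256, 512])
    score = train_and_evaluate(lr, batch_size)
    if score > best_score:
        best_score, best_cfg = score, (lr, batch_size)
print("best config:", best_cfg, "score:", round(best_score, 3))
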
  • Population Based Training

    • Optimize hyperparameters and parameters
    • Hyperparameters optimized as usual with GA / EA
    • Parameters optimized with checkpoint / restore:
      • At the end of each epoch, the population copies the best parameters
    • Creates a “training schedule” with customized epochs (see the toy sketch after this list)
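
A toy sketch of the population-based training loop described above (simplified so that every member except the best copies the best each epoch; the model, objective, and perturbation factors are placeholders):

import copy
import random

class Member:
    def __init__(self, lr):
        self.lr = lr                 # hyperparameter being optimized
        self.weights = [0.0]         # stand-in for model parameters
        self.score = float("-inf")

    def train_one_epoch(self):
        # Placeholder "training": move the single weight toward 1.0 at rate lr.
        self.weights[0] += self.lr * (1.0 - self.weights[0])

    def evaluate(self):
        self.score = -abs(1.0 - self.weights[0])   # higher is better
        return self.score

population = [Member(lr=random.uniform(1e-3, 1e-1)) for _ in range(8)]
for epoch in range(10):
    for m in population:
        m.train_one_epoch()
        m.evaluate()
    best = max(population, key=lambda m: m.score)
    for m in population:
        if m is not best:
            m.weights = copy.deepcopy(best.weights)      # exploit: copy the best checkpoint
            m.lr = best.lr * random.choice([0.8, 1.2])   # explore: perturb the hyperparameter

print("best learning rate found:", round(best.lr, 4))
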
  • Cray Hyperparameter Optimization

    • Cray integrated Hyperparameter Optimization (HPO) support with a Python interface and a Chapel backend
    • Supports distributed optimization as well as distributed training
      • E.g., 20 nodes, 5 HPO instances each training on 4 nodes
    • Simple steps to use:
      • Create a python wrapper script
      • Define optimizer and configuration
      • Provide parameters, search range and executable command
      • Run wrapper script to optimize
from crayai import hpo

evaluator = hpo.evaluator('python …')
params = [["--learning_rate", 0.01, (1e-6, 0.1)],
          ["--dropout_rate",  0.5,  (0.3, 0.7)],
          ...]

# Pick one of the genetic / grid / random optimizers, e.g.:
optimizer = hpo.genetic.optimizer(params, evaluator,
                                  launcher="urika",
                                  generations=5)
  • The Cray HPO package comes with Urika integration
    • Leverages deep learning and analytics resources of the Urika image
    • Allows seamless distribution with workload managers
  • Supports three techniques for HPO
    • Genetic optimization
    • Grid sweep
    • Random search
  • Supports Population Based Training
    • Learn a custom training parameter schedule while training
    • Trains a better model in less time
  • Examples: Grid
# Grid search example
from crayai import hpo

# HPs with bounds discretized at linear intervals
params = hpo.params([["-a",  1.0, (-1.0e3, 1.0e3)],
                     ["-b", -1.0, (-1.0e3, 1.0e3)]])

# Evaluator specifies how to evaluate a set of HPs
evaluator = hpo.evaluator('python ./model.py')

# Create a grid search optimizer
optimizer = hpo.grid_search.optimizer(evaluator, num_iters=1000)
optimizer.optimize(params)

  • Examples: Random
# Random search example
from crayai import hpo

# HPs with bounds discretized at linear intervals
params = hpo.params([["-a",  1.0, (-1.0e3, 1.0e3)],
                     ["-b", -1.0, (-1.0e3, 1.0e3)]])

# Evaluator specifies how to evaluate a set of HPs
evaluator = hpo.evaluator('python ./model.py')

# Create a random search optimizer
optimizer = hpo.random_search.optimizer(evaluator, num_iters=1000)
optimizer.optimize(params)

  • Examples: Genetic
from crayai import hpo

# HPs with bounds discretized at linear intervals
params = hpo.params([["-a",  1.0, (-1.0e3, 1.0e3)],
                     ["-b", -1.0, (-1.0e3, 1.0e3)]])

# Evaluator specifies how to evaluate a set of HPs
evaluator = hpo.evaluator('python ./model.py')

# Create a genetic optimizer
optimizer = hpo.genetic.optimizer(evaluator,
                  generations = 10,
                  num_demes = 4,
                  pop_size = 64,
                  mutation_rate = 0.005,
                  crossover_rate = 0.33)
optimizer.optimize(params)

  • At the end of each epoch, the best-performing candidates in the population are selected to carry forward

Scaling Deep Learning to multiple nodes: strategies for success - Thorsten Kurth (from NERSC)

  • Data-parallel Training

    • applies to Stochastic Gradient Descent-type algorithms
      • each node takes a data batch and computes model updates independently
      • these updates are then collectively summed and applied to the local model
  • Synchronous Update

    • all nodes compute gradients locally
    • gradients are summed across nodes
    • updates are propagated to all nodes (see the mpi4py sketch after this list)
    • pros:
      • stable convergence
    • cons:
      • scaling is not optimal because all nodes have to wait for reduction to complete (stragglers slow everyone down)
      • global (effective) batch size grows with number of nodes
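
A minimal sketch of the synchronous scheme with mpi4py on a toy linear model; the script name, data, and learning rate are illustrative (launch with e.g. mpirun -n 4 python sync_sgd.py):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank gets its own shard of a toy linear regression problem.
rng = np.random.default_rng(rank)
X = rng.normal(size=(256, 5))
y = X @ np.arange(5.0)

w = np.zeros(5)
lr = 0.1
for step in range(100):
    local_grad = 2 * X.T @ (X @ w - y) / len(y)          # gradient on the local batch
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum gradients across all ranks
    w -= lr * global_grad / size                         # average and apply the same update everywhere

if rank == 0:
    print("learned weights:", np.round(w, 3))
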
  • Asynchronous Update

    • all nodes compute gradient updates locally
    • gradients are sent to a parameter server
    • the parameter server incorporates gradients into the model as they arrive and sends the updated model back to the corresponding node (see the toy sketch after this list)
    • pros
      • no node waits for anybody (perfect scaling)
      • resilient
    • cons
      • the use of stale gradients can impact the convergence rate (depending on the number of workers)
      • the parameter server can become a bottleneck
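
A toy, single-process sketch of the asynchronous scheme, with Python threads standing in for workers and a shared object standing in for the parameter server (all names and the toy model are illustrative):

import threading
import numpy as np

# Toy linear regression: workers asynchronously push gradients to a "parameter server".
class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def push_and_pull(self, grad):
        # Apply the (possibly stale) gradient and hand back the updated model.
        with self.lock:
            self.w -= self.lr * grad
            return self.w.copy()

def worker(server, X, y, steps):
    w = server.push_and_pull(np.zeros_like(server.w))    # fetch the current model
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)            # gradient on this worker's shard
        w = server.push_and_pull(grad)                   # may be stale w.r.t. other workers

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(5.0)
server = ParameterServer(dim=5)
shards = np.array_split(np.arange(1000), 4)
threads = [threading.Thread(target=worker, args=(server, X[i], y[i], 200)) for i in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned weights:", np.round(server.w, 3))
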
  • Hybrid update

    • synchronous groups exchange information through a parameter server asynchronously

    • gradients are sent to the parameter server

    • the parameter server incorporates gradients into the model as they arrive and sends the updated model back to the corresponding group

    • pros

      • impact of stragglers reduced compared to fully synchronous mode
      • the negative impact on stochastic convergence is controlled
      • finer control on total batch size
    • cons

      • group size needs to be tuned
    • other methods: stale-synchronous/pipelining

  • Large batch training and scaling learning rate

    • In data parallelism with fully synchronous SGD, one uses N workers to process N batches of data per update step
    • The learning rate needs to be scaled accordingly to accelerate convergence
  • Challenges with large batch training: unstable early training and a generalization gap

  • Avoiding initial training instabilities using learning rate warm-up (a variable learning rate)

  • Large batch training takeaway

    • Training with a large batch size requires learning rate scaling; anything from square-root to linear scaling has been used in practice.

    • Training with large learning rates introduces instabilities in the initial stages of training; a gradual warm-up to the target learning rate works well.

    • Training with a large batch size shows a generalization gap relative to a small batch size.

    • Linear warm-up plus learning rate scaling works well for ~10x scaling of the batch size (see the scheduler sketch after this list).

    • Increasing the batch size is an alternative approach to decaying the learning rate.

    • Adaptive learning rate scaling methods such as LARC are another alternative.

    • These methods keep pushing the limit on the largest batch size usable for training, but they still do not eliminate the upper bound.

    • There is work that tries to explain the reasons behind the generalization gap, which was not reviewed here.
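
A minimal sketch of linear warm-up to a scaled target learning rate using a standard Keras LearningRateScheduler callback; base_lr, n_workers, and warmup_epochs are illustrative:

import tensorflow as tf

base_lr, n_workers, warmup_epochs = 0.1, 8, 5
target_lr = base_lr * n_workers          # linear scaling rule for the enlarged global batch

def schedule(epoch, lr):
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to target_lr over the warm-up epochs.
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)
# model.fit(..., callbacks=[lr_callback])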

Distributed training hands-on session - Thorsten Kurth (from NERSC)

  • Horovod for distributed training
    • good to consider Cray PE ML Plugin for scalable training on HPC systems
  • Horovod link
    • Enables distributed synchronous data-parallel training with minimal changes to user code
    • Uses ring all-reduce and MPI to collectively combine gradients across workers
    • Such approaches have been shown to scale better than parameter-server approaches (e.g. distributed TensorFlow with gRPC)
  • Initialize Horovod and MPI:
import horovod.keras as hvd   # assuming Horovod's Keras API, matching the Keras snippets below

hvd.init()

  • Wrap your optimizer in the Horovod distributed optimizer:
opt = keras.optimizers.SGD(lr=lr*hvd.size(), ...)
opt = hvd.DistributedOptimizer(opt)

  • Construct the variables broadcast callback:
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
   
  • Train model as usual; it should now synchronize at every mini-batch step:
model.fit(..., callbacks=callbacks)

  • Launch your script with MPI
mpirun -n NUM_RANKS … python train.py …
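
Putting the pieces together, a minimal end-to-end sketch; the model and dataset are placeholders rather than the tutorial's own code:

import keras
import horovod.keras as hvd

hvd.init()

# Placeholder data: each rank loads the full MNIST training set here for brevity.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
y_train = keras.utils.to_categorical(y_train, 10)

model = keras.models.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = keras.optimizers.SGD(lr=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(x_train, y_train, batch_size=64, epochs=2,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)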

  • Notes:
    • if the batch size changes, rescale the learning rate accordingly (and vice versa; see the learning rate scaling discussion above)
    • start small and scale up
    • small jumps in the loss/accuracy curves are typically seen at the end of each epoch
    • SGD + momentum is most used
    • LARC Optimizer link
    • Standardize data, batch normalization
    • GPUs are more efficient than a multi-node (CPU) setup? (not sure)
    • It might take longer to train because validation data is not distributed in chunks like training data

Some additional resources

Big Data Challenge - How to Engage with Large Scale Facilities? - Wednesday November 14 2018

  • Telescope data challenges
    • they were collecting 15 TB of data per night
    • nightly transfers to all 3 centers were needed
    • data available to anyone
    • 60 PB data volumes
  • Texas Advanced Computing Center (TACC)
    • They use Singularity and Docker - Less optimized but easy to use
    • High resolution 3D images
    • Interest in more container-based technology; life-cycle issues remain
    • Interest in GPUs to speed up processing

Applications of Deep Learning in Industry and Research - Wednesday November 14 2018

Matthew Grover - Walmart Inc

Alex Sutton - Microsoft Corporation

  • Great challenges in creating Data Pipelines
  • Continue model maintenance after a model reaches production, to detect model drift and trigger re-training
  • Look at benchmarking: how fast the model runs on different platforms and resources
  • Some key tools they mentioned:
    • Elastic File - check it out
    • Horovod from Uber link
    • Deep learning frameworks: Pytorch + Tensorflow
    • Nvidia RAPIDS link
    • ONNX from Facebook link
    • Pre-configured VMs
  • Recommended paper: TFX: A TensorFlow-Based Production-Scale Machine Learning Platform link