
Supercomputing Conference 18 (SC18) - Dallas

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Deep Learning at Scale - Monday November 12 2018

GitHub Repo

What was covered:

  • Overview of machine learning / deep learning
  • Success stories of deep learning for scientific applications
  • How to train deep learning models with Keras
  • How to scale training to multiple nodes on HPCs
  • How to utilize HPC for hyper-parameter optimization
  • Data management for deep learning at scale
    • Data management and I/O at scale is still an open issue
    • Extreme-scale applications regularly have I/O issues on HPC
      • See Gordon Bell finalist on Deep Learning for Climate
      • See CosmoFlow
    • Some things which have worked
      • Burst-buffer filesystems
      • Pre-staging input files to RAM
      • Per-node input file chunking (no global shuffling); see the sketch after this list
    • General solutions don’t yet exist
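
A minimal sketch of the per-node chunking idea above, using tf.data; the file pattern, worker count, and rank here are illustrative (in practice the rank would come from Horovod or MPI):

import tensorflow as tf

num_workers, worker_rank = 4, 0   # illustrative; would normally come from hvd.size() / hvd.rank()

# Each worker lists the full set of files but keeps only its own shard (no global shuffle).
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=False)
files = files.shard(num_shards=num_workers, index=worker_rank)

dataset = (files
           .interleave(tf.data.TFRecordDataset, cycle_length=4)
           .shuffle(buffer_size=10_000)      # shuffle locally within the shard only
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))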

Introduction to Machine Learning, Deep Learning - Prabhat (from NERSC)

  • Presented classical machine learning algorithms
  • We shouldn't be afraid of non-convex functions if we have large enough data
  • Machine Learning (ML) workflow:
    • What is the problem
    • Data exploration
    • Evaluation metrics
    • Deploy
    • Large scale
    • Monitoring
  • Department of Energy exascale HPC projects are available to try out
  • History of deep learning and why it got so much attention in recent years: computing power and big data
  • Discussed overfitting and how to optimize a Neural Network (NN)
  • Bias vs. variance tradeoff
  • NERSC Big data stack
  • NERSC ML/DL webpage
  • Deep Learning for Science blog
  • Summary:
    • ML has had a tremendous impact in the commercial world
    • Analogs exist in scientific domains
    • Flavors of ML:
      • Supervised tasks: regression, classification
      • Unsupervised tasks: clustering
    • ML algorithms:
      • k-NN, k-means, linear regression, SVM, kernel SVM, deep learning, … (see the sketch after this list)
    • ML workflow
    • ML software is available at DOE HPC facilities, please try it out!
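
As a small illustration of the supervised and unsupervised flavors listed above, a sketch using scikit-learn (chosen here only as a convenient stand-in for whatever ML stack is installed on the facility):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_regression
from sklearn.linear_model import LinearRegression

# Unsupervised: clustering with k-means on synthetic blobs.
X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X_blobs)

# Supervised: regression with ordinary least squares on synthetic data.
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("regression R^2:", round(reg.score(X_reg, y_reg), 3))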

Deep Learning for science - Debbie Bard (from NERSC)

  • Real examples of where they use deep learning models with very large data.
  • Building ML models and scaling are 2 different problems
  • Classification and Regression are most popular ML problems
  • Why DL at scale for science?
    • Scientific data is typically large and complex
    • Harder to find optimal hyperparameters
    • Need lots of prototyping and model evaluation
    • Key metric: time to scientific insight
    • Don’t want to wait for days to train a single model
    • Fast turnaround of ideas and exploration
  • ML/DL for climate science used as an example of an application of DL to science
  • Application of cutting-edge methods from machine learning, statistics, topology and applied math to:
    • Predict extreme weather change in the future
    • Improve mechanistic understanding of extreme events
    • Classification and object detection of various types of weather patterns
    • Segmentation (pixel-wise) of extreme weather patterns
    • Tracking of spatio-temporal extreme events, such as hurricanes
    • Generative Learning of complex systems using physics-based constraints
  • Most bottlenecks occur in I/O when scaling
  • For large scale we need HPC
  • KNL nodes on NERSC Cori

Introduction to NERSC - Steve Farrell (from NERSC)

  • NERSC overview:

    • National Energy Research Scientific Computing Center, established in 1974 as the first unclassified supercomputer center
    • Now the Mission HPC facility for the DOE Office of Science
      • 7000 users, 800 projects, 700 codes, 48 states, 40 countries, universities, and national labs
  • The Cori supercomputer:

    • 2,388 Intel Xeon ("Haswell") nodes
    • 9,688 Intel Xeon Phi ("KNL") nodes
    • 1.8 PB "burst buffer" (1,700 GB/s I/O bandwidth)
    • 30 PB Lustre (700 GB/s I/O bandwidth)
    • Entered Top500 list at #5 in Nov 2016, now #10
  • Perlmutter: next-gen system optimized for science

    • Cray Shasta system with 3-4x capability of Cori
    • GPU-accelerated (4x NVIDIA) nodes and CPU-only (AMD) nodes
    • Cray Slingshot high performance network
    • Single-tier All-Flash Lustre based file system
  • Deep Learning at NERSC:

    • Optimized DL software stack
      • Frameworks: TensorFlow, Keras, PyTorch, Caffe, …, backed by Intel MKL-DNN
      • Multi-node libraries: Cray PE DL plugin, Horovod, PyTorch distributed, …
    • Big Data Center collaborations
      • Intel optimizing TensorFlow and PyTorch for CPU with MKL-DNN
      • Cray optimizing DL workflows, distributed training with its Distributed Training Framework, and data management issues
  • Jupyter at NERSC:

    • JupyterHub deployed at NERSC to provide rich, interactive, notebook ecosystem on Cori
    • Jupyter servers run on a dedicated login node and can access the system to offload work.
    • Expansion of the service in progress.
    • Today you can optionally utilize Jupyter as an in-browser GUI for browsing files on Cori and for submitting jobs (via terminal)


Hands-on code walkthrough - Steve Farrell (from NERSC)

  • TensorFlow - Best for production
    • Most popular framework by several metrics
    • Backed by Google and a strong community
    • Moving towards more dynamic (“eager”) workflows
  • Keras - Best for prototyping
    • Usability first; “an API designed for human beings, not machines.”
    • Now included in TF; recommended high-level API (see the minimal example after this list)
  • PyTorch - Best for prototyping | Best for production
    • Rapidly growing in popularity, particularly in academic circles
    • Pythonic experience; easy to use and debug
    • Merging with Caffe2 for easier production deployment
  • Tutorial
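
A minimal tf.keras example in the spirit of the walkthrough; the architecture, dataset, and epoch count are illustrative rather than the tutorial's exact code:

import tensorflow as tf

# Load and normalize MNIST.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Small fully connected classifier (illustrative architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))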

Hyper-parameter optimization - Mike Ringenburg (from Cray)

  • Goals of this Session

    • Introduce the concept of Hyperparameter Optimization (HPO)
      • What are hyperparameters?
      • Why are they important?
    • Introduce automatic distributed hyperparameter optimization
    • Using the Cray HPO library as an example
  • What are Hyperparameters?

    • Neural network model parameters – weights (connection, bias)
    • Neural network hyperparameters – any other settings, e.g.:
    • Topology:
      • Number of neurons in fully connected layers
      • Filters, kernel sizes, strides of convolutional / pooling layers
      • Activation function: logistic, ReLU, tanh
    • Training:
      • Learning rate, batch size, momentum
      • Dropout probability, batch normalization
      • Optimizers (SGD, Adam, RMSProp, AdaGrad, etc)
    • Significant impact on convergence
      • Accuracy and time-to-accuracy
      • Good choices especially important at scale
  • Search the space of possible hyperparameter sets:

    • Brute-force search of the entire space is intractable
    • There are more hyperparameter sets than atoms in the observable universe
    • Only evaluate a subspace (prune):
      • Grid or random search (see the sketch after this list)
      • Bayesian optimization based
      • Genetic / evolutionary algorithms (“population-based training”)
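
A minimal random search sketch; train_and_evaluate here is a hypothetical stand-in for a real training run that returns a validation score:

import math
import random

def train_and_evaluate(lr, batch_size):
    # Hypothetical stand-in: a real version would train a model and return a validation score.
    return -((math.log10(lr) + 3) ** 2) - ((batch_size - 128) / 256) ** 2

best_score, best_cfg = float("-inf"), None
for _ in range(50):
    lr = 10 ** random.uniform(-6, -1)                   # sample learning rate on a log scale
    batch_size = random.choice([32, 64, 128, 256, 512])
    score = train_and_evaluate(lr, batch_size)
    if score > best_score:
        best_score, best_cfg = score, (lr, batch_size)
print("best config:", best_cfg, "score:", round(best_score, 3))
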
  • Population Based Training

    • Optimize hyperparameters and parameters
    • Hyperparameters optimized as usual with GA / EA
    • Parameters optimized with checkpoint / restore:
      • At the end of each epoch, the population copies the best parameters
    • Creates a “training schedule” with customized epochs (see the toy sketch after this list)
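
A toy sketch of the population-based training loop described above (simplified so that every member except the best copies the best each epoch; the model, objective, and perturbation factors are placeholders):

import copy
import random

class Member:
    def __init__(self, lr):
        self.lr = lr                 # hyperparameter being optimized
        self.weights = [0.0]         # stand-in for model parameters
        self.score = float("-inf")

    def train_one_epoch(self):
        # Placeholder "training": move the single weight toward 1.0 at rate lr.
        self.weights[0] += self.lr * (1.0 - self.weights[0])

    def evaluate(self):
        self.score = -abs(1.0 - self.weights[0])   # higher is better
        return self.score

population = [Member(lr=random.uniform(1e-3, 1e-1)) for _ in range(8)]
for epoch in range(10):
    for m in population:
        m.train_one_epoch()
        m.evaluate()
    best = max(population, key=lambda m: m.score)
    for m in population:
        if m is not best:
            m.weights = copy.deepcopy(best.weights)      # exploit: copy the best checkpoint
            m.lr = best.lr * random.choice([0.8, 1.2])   # explore: perturb the hyperparameter

print("best learning rate found:", round(best.lr, 4))
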
  • Cray Hyperparameter Optimization

    • Cray integrated Hyperparameter Optimization (HPO) support with a Python interface and a Chapel backend
    • Supports distributed optimization as well as distributed training
      • E.g., 20 nodes, 5 HPO instances each training on 4 nodes
    • Simple steps to use:
      • Create a python wrapper script
      • Define optimizer and configuration
      • Provide parameters, search range and executable command
      • Run wrapper script to optimize
from crayai import hpo

evaluator = hpo.evaluator('python …')
params = [["--learning_rate", 0.01, (1e-6, 0.1)],
          ["--dropout_rate",  0.5,  (0.3, 0.7)],
          ...]

# Pick one of the genetic / grid / random optimizers, e.g.:
optimizer = hpo.genetic.optimizer(params, evaluator,
                                  launcher="urika",
                                  generations=5)
  • The Cray HPO package comes with Urika integration
    • Leverages deep learning and analytics resources of the Urika image
    • Allows seamless distribution with workload managers
  • Supports three techniques for HPO
    • Genetic optimization
    • Grid sweep
    • Random search
  • Supports Population Based Training
    • Learn a custom training parameter schedule while training
    • Trains a better model in less time
  • Examples: Grid
# Grid search example
from crayai import hpo

# HPs with bounds discretized at linear intervals
params = hpo.params([["-a",  1.0, (-1.0e3, 1.0e3)],
                     ["-b", -1.0, (-1.0e3, 1.0e3)]])

# Evaluator specifies how to evaluate a set of HPs
evaluator = hpo.evaluator('python ./model.py')

# Create a grid search optimizer
optimizer = hpo.grid_search.optimizer(evaluator, num_iters=1000)
optimizer.optimize(params)

  • Examples: Random
# Random search example
from crayai import hpo

# HPs with bounds discretized at linear intervals
params = hpo.params([["-a",  1.0, (-1.0e3, 1.0e3)],
                     ["-b", -1.0, (-1.0e3, 1.0e3)]])

# Evaluator specifies how to evaluate a set of HPs
evaluator = hpo.evaluator('python ./model.py')

# Create a random search optimizer
optimizer = hpo.random_search.optimizer(evaluator, num_iters=1000)
optimizer.optimize(params)

  • Examples: Genetic
from crayai import hpo

# HPs with bounds discretized at linear intervals
params = hpo.params([["-a",  1.0, (-1.0e3, 1.0e3)],
                     ["-b", -1.0, (-1.0e3, 1.0e3)]])

# Evaluator specifies how to evaluate a set of HPs
evaluator = hpo.evaluator('python ./model.py')

# Create a genetic optimizer
optimizer = hpo.genetic.optimizer(evaluator,
                  generations = 10,
                  num_demes = 4,
                  pop_size = 64,
                  mutation_rate = 0.005,
                  crossover_rate = 0.33)
optimizer.optimize(params)

  • At the end of each epoch, the best-performing candidates in the population are selected to carry forward

Scaling Deep Learning to multiple nodes: strategies for success - Thorsten Kurth (from NERSC)

  • Data-parallel Training

    • applies to Stochastic Gradient Descent-type algorithms
      • each node takes a data batch and computes model updates independently
      • these updates are then collectively summed and applied to the local model
  • Synchronous Update

    • all nodes compute gradients locally
    • gradients are summed across nodes
    • updates are propagated to all nodes (see the mpi4py sketch after this list)
    • pros:
      • stable convergence
    • cons:
      • scaling is not optimal because all nodes have to wait for reduction to complete (stragglers slow everyone down)
      • global (effective) batch size grows with number of nodes
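
A minimal sketch of the synchronous scheme with mpi4py on a toy linear model; the script name, data, and learning rate are illustrative (launch with e.g. mpirun -n 4 python sync_sgd.py):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank gets its own shard of a toy linear regression problem.
rng = np.random.default_rng(rank)
X = rng.normal(size=(256, 5))
y = X @ np.arange(5.0)

w = np.zeros(5)
lr = 0.1
for step in range(100):
    local_grad = 2 * X.T @ (X @ w - y) / len(y)          # gradient on the local batch
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum gradients across all ranks
    w -= lr * global_grad / size                         # average and apply the same update everywhere

if rank == 0:
    print("learned weights:", np.round(w, 3))
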
  • Asynchronous Update

    • all nodes compute gradient updates locally
    • gradients are sent to a parameter server
    • the parameter server incorporates gradients into the model as they arrive and sends the updated model back to the corresponding node (see the toy sketch after this list)
    • pros
      • no node waits for anybody (perfect scaling)
      • resilient
    • cons
      • the use of stale gradients can impact the convergence rate (depending on the number of workers)
      • the parameter server can become a bottleneck
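
A toy, single-process sketch of the asynchronous scheme, with Python threads standing in for workers and a shared object standing in for the parameter server (all names and the toy model are illustrative):

import threading
import numpy as np

# Toy linear regression: workers asynchronously push gradients to a "parameter server".
class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def push_and_pull(self, grad):
        # Apply the (possibly stale) gradient and hand back the updated model.
        with self.lock:
            self.w -= self.lr * grad
            return self.w.copy()

def worker(server, X, y, steps):
    w = server.push_and_pull(np.zeros_like(server.w))    # fetch the current model
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)            # gradient on this worker's shard
        w = server.push_and_pull(grad)                   # may be stale w.r.t. other workers

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(5.0)
server = ParameterServer(dim=5)
shards = np.array_split(np.arange(1000), 4)
threads = [threading.Thread(target=worker, args=(server, X[i], y[i], 200)) for i in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned weights:", np.round(server.w, 3))
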
  • Hybrid update

    • synchronous groups exchange information through a parameter server asynchronously

    • gradients are sent to the parameter server

    • the parameter server incorporates gradients into the model as they arrive and sends the updated model back to the corresponding group

    • pros

      • impact of stragglers reduced compared to fully synchronous mode
      • the negative impact on stochastic convergence is controlled
      • finer control on total batch size
    • cons

      • group size needs to be tuned
    • other methods: stale-synchronous/pipelining

  • Large batch training and scaling learning rate

    • In data parallelism with fully synchronous SGD, one uses N workers to process N batches of data per update step
    • The learning rate needs to be scaled accordingly to accelerate convergence
  • Challenges with large batch training: unstable early training and a generalization gap

  • Avoiding initial training instabilities using learning rate warm-up (a variable learning rate)

  • Large batch training takeaway

    • Training with a large batch size requires learning rate scaling; anything from square-root to linear scaling has been used in practice.

    • Training with large learning rates introduces instabilities in the initial stages of training; a gradual warm-up to the target learning rate works well.

    • Training with a large batch size shows a generalization gap relative to a small batch size.

    • Linear warm-up plus learning rate scaling works well for ~10x scaling of the batch size (see the scheduler sketch after this list).

    • Increasing the batch size is an alternative approach to decaying the learning rate.

    • Adaptive learning rate scaling methods such as LARC are another alternative.

    • These methods keep pushing the limit on the largest batch size usable for training, but they still do not eliminate the upper bound.

    • There is work that tries to explain the reasons behind the generalization gap, which was not reviewed here.
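
A minimal sketch of linear warm-up to a scaled target learning rate using a standard Keras LearningRateScheduler callback; base_lr, n_workers, and warmup_epochs are illustrative:

import tensorflow as tf

base_lr, n_workers, warmup_epochs = 0.1, 8, 5
target_lr = base_lr * n_workers          # linear scaling rule for the enlarged global batch

def schedule(epoch, lr):
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to target_lr over the warm-up epochs.
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)
# model.fit(..., callbacks=[lr_callback])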

Distributed training hands-on session - Thorsten Kurth (from NERSC)

  • Horovod for distributed training
    • good to consider Cray PE ML Plugin for scalable training on HPC systems
  • Horovod link
    • Enables distributed synchronous data-parallel training with minimal changes to user code
    • Uses ring all-reduce and MPI to collectively combine gradients across workers
    • Such approaches have been shown to scale better than parameter-server approaches (e.g. distributed TensorFlow with gRPC)
  • Initialize Horovod and MPI:
import horovod.keras as hvd   # assuming Horovod's Keras API, matching the Keras snippets below

hvd.init()

  • Wrap your optimizer in the Horovod distributed optimizer:
opt = keras.optimizers.SGD(lr=lr*hvd.size(), ...)
opt = hvd.DistributedOptimizer(opt)

  • Construct the variables broadcast callback:
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
   
  • Train model as usual; it should now synchronize at every mini-batch step:
model.fit(..., callbacks=callbacks)

  • Launch your script with MPI
mpirun -n NUM_RANKS … python train.py …
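
Putting the pieces together, a minimal end-to-end sketch; the model and dataset are placeholders rather than the tutorial's own code:

import keras
import horovod.keras as hvd

hvd.init()

# Placeholder data: each rank loads the full MNIST training set here for brevity.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
y_train = keras.utils.to_categorical(y_train, 10)

model = keras.models.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = keras.optimizers.SGD(lr=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])

# Broadcast initial weights from rank 0 so all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(x_train, y_train, batch_size=64, epochs=2,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)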

  • Notes:
    • if the batch size changes, rescale the learning rate accordingly (and vice versa; see the learning rate scaling discussion above)
    • start small and scale up
    • small jumps in the loss/accuracy curves are typically seen at the end of each epoch
    • SGD + momentum is most used
    • LARC Optimizer link
    • Standardize data, batch normalization
    • GPUs are more efficient than a multi-node (CPU) setup? (not sure)
    • It might take longer to train because validation data is not distributed in chunks like training data

Some additional resources

Big Data Challenge - How to Engage with Large Scale Facilities? - Wednesday November 14 2018

  • Telescope data challenges
    • they were collecting 15 TB of data per night
    • nightly transfers to all 3 centers were needed
    • data available to anyone
    • 60 PB data volumes
  • Texas Advanced Computing Center (TACC)
    • They use Singularity and Docker - Less optimized but easy to use
    • High resolution 3D images
    • Interest in more container-based technology; life-cycle issues remain
    • Interest in GPUs to speed up processing

Applications of Deep Learning in Industry and Research - Wednesday November 14 2018

Matthew Grover - Walmart Inc

Alex Sutton - Microsoft Corporation

  • Great challenges in creating Data Pipelines
  • Continue model maintenance after a model reaches production, to detect model drift and trigger re-training
  • Look at benchmarking: how fast the model runs on different platforms and resources
  • Some key tools they mentioned:
    • Elastic File - check it out
    • Horovod from Uber link
    • Deep learning frameworks: Pytorch + Tensorflow
    • Nvidia RAPIDS link
    • ONNX from Facebook link
    • Pre-configured VMs
  • Recommended paper: TFX: A TensorFlow-Based Production-Scale Machine Learning Platform link