[doc] Updated Autoscaling section

mauri-melato · sean-smith · commit 59422fe50bba · 2018-11-12T17:00:10.000-06:00
Signed-off-by: Maurizio Melato &lt;mmelato@amazon.com&gt;
diff --git a/docs/source/autoscaling.rst b/docs/source/autoscaling.rst
@@ -4,28 +4,54 @@
 AWS ParallelCluster Auto Scaling
 ================================
 
-.. image:: images/as-basic-diagram.png
+The auto scaling strategy described here applies to HPC clusters deployed with one of the
+supported traditional job scheduler, either SGE, Slurm or Torque.
+In these cases AWS ParallelCluster directly implements the scaling capabilities by managing
+the `Auto Scaling Group`_ (ASG) of the compute nodes and changing the scheduler configuration
+accordingly.  
+For HPC clusters based on AWS Batch, ParallelCluster relies on the elastic scaling capabilities
+provided by the AWS-managed job scheduler.
 
 Clusters deployed with AWS ParallelCluster are elastic in several ways. The first is by
 simply setting the ``initial_queue_size`` and ``max_queue_size`` parameters of a cluster
-settings. The ``initial_queue_size`` sets minimum size value of the ComputeFleet
-`Auto Scaling Group`_ (ASG) and also the desired capacity value . The ``max_queue_size``
-sets maximum size value of the ComputeFleet ASG.
+settings. The ``initial_queue_size`` sets the minimum size value of the ComputeFleet ASG and also
+the desired capacity value.
+The ``max_queue_size`` sets the maximum size value of the ComputeFleet ASG.
+
+.. image:: images/as-basic-diagram.png
 
 Scaling Up
 ==========
 
 Every minute, a process called jobwatcher_ runs on the master instance and evaluates
-the current number of instances requested in the queue. If this number is greater than the
-current autoscaling desired, it adds more instances. If you submit more jobs,
-the queue will get re-evaluated and the ASG updated up to the ``max_queue_size``.
+the current number of instances required by the pending jobs in the queue.
+If the total number of busy nodes and requested nodes is greater than the current desired value in the ASG,
+it adds more instances.
+If you submit more jobs, the queue will get re-evaluated and the ASG updated up to the ``max_queue_size``.
+
+With SGE each job requires a number of slots to run (one slot corresponds to one processing unit, e.g. a vCPU).
+When evaluating the number of instances required to serve the currently pending jobs, the jobwatcher
+divides the total number of requested slots by the capacity of a single compute node.
+The capacity of a compute node that is the number of available vcpu depends on the EC2 instance type selected
+in the cluster configuration.
+
+With Slurm and Torque schedulers each job can require both a number of nodes and a number of slots per node.
+The jobwatcher takes into account the request of each job and determines the number of compute nodes to fulfill
+the new computational requirements.  
+For example, assuming a cluster with c5.2xlarge (8 vCPU) as compute instance type, and three queued pending jobs 
+with the following requirements: job1 2 nodes / 4 slots each, job2 3 nodes / 2 slots and job3 1 node / 4 slots,
+the jobwatcher will require three new compute instances to the ASG to serve the three jobs.
+
+*Current limitation*: the auto scale up logic does not consider partially loaded busy nodes, i.e. each node running
+a job is considered busy even if there are empty slots.
 
 Scaling Down
 ============
 
-On each compute node, a process called nodewatcher_ runs and evaluates the
-work left in the queue. If an instance has had no jobs for longer than ``scaledown_idletime``
-(which defaults to 10 minutes), the instance is terminated.
+On each compute node, a process called nodewatcher_ runs and evaluates the idle time of
+the node. If an instance had no jobs for longer than ``scaledown_idletime``
+(which defaults to 10 minutes) and currently there are no pending jobs in the cluster,
+the instance is terminated.
 
 It specifically calls the TerminateInstanceInAutoScalingGroup_ API call,
 which will remove an instance as long as the size of the ASG is at least the