Commit 2487742

First cut of SGE docs
1 parent fa0e42a commit 2487742

2 files changed

+76
-21
lines changed

docs/user/_sources/cloud.txt

Lines changed: 9 additions & 16 deletions
@@ -6,29 +6,23 @@ Introduction
66

77
The C-PAC team has released an Amazon Marketplace AMI, making it easier for researchers to use C-PAC in the cloud. You can use the AMI to either launch a single machine for basic runs or create a high performance computing (HPC) cluster using Starcluster. Clusters can be dynamically scaled up as your computational needs increase. Detailed explanations of cloud computing and HPC are beyond the scope of this documentation, but we will define a few key terms before we start. If these terms are familiar, you may skip them and proceed to later sections.
88

9-
* Amazon Machine Instance (AMI) - A disk image of an operating system and any additional installed software that can be used to create a virtual machine.
9+
* **Amazon Machine Image (AMI)** - A disk image of an operating system and any additional installed software that can be used to create a virtual machine.
1010

11-
* Instance - A single running virtual machine whose initial state is based on the AMI that it is launched from. Instances can be classified as spot instances or on-demand instances. On-demand instances are reliably created the moment they are requested for a fixed rate. Spot instances are created based on whether or not a bid that you set is accepted by Amazon. They can be significantly cheaper than on-demand instances, but are only created when Amazon accepts your bid.
11+
* **Instance** - A single running virtual machine whose initial state is based on the AMI that it is launched from. Instances can be classified as spot instances or on-demand instances. On-demand instances are reliably created the moment they are requested for a fixed rate. Spot instances are created based on whether or not a bid that you set is accepted by Amazon. They can be significantly cheaper than on-demand instances, but are only created when Amazon accepts your bid.
1212

13-
* Instance Type - The hardware specification for a given instance. A list of the instance types made available by Amazon may be found `here <http://aws.amazon.com/ec2/instance-types>`_.
13+
* **Instance Type** - The hardware specification for a given instance. A list of the instance types made available by Amazon may be found `here <http://aws.amazon.com/ec2/instance-types>`_.
1414

15-
* Terminated Instance - An instance is considered terminated when its resources have been completely freed up for use by others in the Amazon cloud. Any data on a terminated instance that is not relocated to persistent storage such as EBS (see below) will be completely discarded. Instance termination is the virtual equivalent of throwing out a physical server. When you have terminated an instance, you are no longer paying for it. Note that by default, instances do not have persistent storage attached to them- you will need to configure persistent storage when you set up the instance.
15+
* **Terminated Instance** - An instance is considered terminated when its resources have been completely freed up for use by others in the Amazon cloud. Any data on a terminated instance that is not relocated to persistent storage such as EBS (see below) will be completely discarded. Instance termination is the virtual equivalent of throwing out a physical server. When you have terminated an instance, you are no longer paying for it. Note that by default, instances do not have persistent storage attached to them- you will need to configure persistent storage when you set up the instance.
1616

17-
* Stopped Instance - An instance is considered stopped when it is not active, but its resources are still available for future use whenever you choose to reactivate it. Stopping an instance is the virtual equivalent of turning a computer off or putting it in hibernate mode. When you stop an instance, you continue to pay for the storage associated with it (i.e., the main and other volumes attached to it), but not for the instance itself. You should stop an instance when the analyses you are working on are not fully done and you would like to preserve the current state of a running instance.
17+
* **Stopped Instance** - An instance is considered stopped when it is not active, but its resources are still available for future use whenever you choose to reactivate it. Stopping an instance is the virtual equivalent of turning a computer off or putting it in hibernate mode. When you stop an instance, you continue to pay for the storage associated with it (i.e., the main and other volumes attached to it), but not for the instance itself. You should stop an instance when the analyses you are working on are not fully done and you would like to preserve the current state of a running instance.
1818

19-
* Simple Storage Service (S3) - A form of storage offered by Amazon. S3 is not intended to be directly attached to instances since it lacks a filesystem, but it can be used to archive large datasets. Amazon provides tools for uploading data to S3 'buckets' where it can be stored. It is less costly than EBS.
19+
* **Simple Storage Service (S3)** - A form of storage offered by Amazon. S3 is not intended to be directly attached to instances since it lacks a filesystem, but it can be used to archive large datasets. Amazon provides tools for uploading data to S3 'buckets' where it can be stored. It is less costly than EBS.
2020

21-
* Elastic Block Storage (EBS) - A form of persistent storage offered by Amazon for use with instances. When you have terminated an instance, items stored in an EBS volume can be accessed by any future instances that you start up.
21+
* **Elastic Block Store (EBS)** - A form of persistent storage offered by Amazon for use with instances. When you have terminated an instance, items stored in an EBS volume can be accessed by any future instances that you start up.
2222

23-
* EC2 Instance Store - A form of temporary storage that comes included with some instance types. Instance store volumes must be added manually before launching an instance, and all files stored on them will be lost when the instance is terminated. The instance store is typically mounted at ``/mnt``.
23+
* **EC2 Instance Store** - A form of temporary storage that comes included with some instance types. Instance store volumes must be added manually before launching an instance, and all files stored on them will be lost when the instance is terminated. The instance store is typically mounted at ``/mnt``.
2424

25-
* Head Node - The primary node of an HPC cluster, to which all other nodes are connected. The head node will run a job scheduler (such as Sun Grid Engine) to allocate jobs to the other nodes.
26-
27-
* Worker Node - A node in an HPC cluster to which tasks are delegated by the head node via a job scheduler.
28-
29-
* Job Scheduler - A program that can allocate computational resources in an HPC cluster to jobs based on availability and distribute jobs across nodes. The C-PAC AMI uses Sun Grid Engine (SGE) as its job scheduler.
30-
31-
* Job Submission Script - A shell script with a series of commands to be executed as part of the job. Submission scripts may also include flags that activate functionality specific to the scheduler.
25+
Lastly, it is important to review the terms related to :doc:`the Sun Grid Engine job scheduler <compute_config>`.
3226

3327
Creating AWS Access and Network Keys
3428
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -345,5 +339,4 @@ Additional Links
345339
^^^^^^^^^^^^^^^^
346340

347341
* `The StarCluster User Manual <http://star.mit.edu/cluster/docs/latest/manual/index.html>`_
348-
* `The Sun Grid Engine User Guide <http://www.csb.yale.edu/userguides/sysresource/batch/doc/UserGuide_6.1.pdf>`_
349342
* `Getting Started with AWS <http://docs.aws.amazon.com/gettingstarted/latest/awsgsg-intro/gsg-aws-intro.html>`_

docs/user/_sources/compute_config.txt

Lines changed: 67 additions & 5 deletions
@@ -6,11 +6,11 @@ Computer Settings
66

77
#. **FSL Path - [path]:** Full path to the FSL version to be used by CPAC. If you have specified an FSL path in your .bashrc file, this path will be set automatically.
88

9-
#. **Job Scheduler / Resource Manager - [SGE, PBS]:** Sun Grid Engine (SGE) or Portable Batch System (PBS). Only applies if you are running on a grid or compute cluster.
9+
#. **Job Scheduler / Resource Manager - [SGE, PBS]:** Sun Grid Engine (SGE) or Portable Batch System (PBS). Only applies if you are running on a grid or compute cluster. See the section below entitled `Setting up SGE` for more information on how to set up SGE.
1010

11-
#. **SGE Parallel Environment - [text]:** SGE Parallel Environment to use when running CPAC. Only applies when you are running on a grid or compute cluster using SGE.
11+
#. **SGE Parallel Environment - [text]:** SGE Parallel Environment to use when running CPAC. Only applies when you are running on a grid or compute cluster using SGE. See the section below entitled `Setting up SGE` for more information on how to set up SGE.
1212

13-
#. **SGE Queue - [text]:** SGE Queue to use when running CPAC. Only applies when you are running on a grid or compute cluster using SGE.
13+
#. **SGE Queue - [text]:** SGE Queue to use when running CPAC. Only applies when you are running on a grid or compute cluster using SGE. See the section below entitled `Setting up SGE` for more information on how to set up SGE.
1414

1515
#. **Number of Cores Per Subject - [integer]:** Number of cores (on a single machine) or slots on a node (cluster/grid) per subject. Slots are cores on a cluster/grid node. 'Number of Cores Per Subject' multiplied by 'Number of Subjects to Run Simultaneously' multiplied by 'Number of Cores for Anatomical Registration (ANTS)' must not be greater than the total number of cores.
1616

@@ -25,8 +25,7 @@ The following key/value pairs must be defined in your :doc:`pipeline configurati
2525

2626
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
2727
| Key | Description | Potential Values |
28-
+===============================+==============================================================================================================================+====================================================================================+
29-
| runOnGrid | Run using a Grid Resource Manager (such as Sun Grid Engine or HTCondor)? | True,False |
28+
+===============================+==============================================================================================================================+====================================================================================+
| runOnGrid | Run using a Grid Resource Manager (such as Sun Grid Engine or HTCondor)? | True,False |
3029
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
3130
| FSLDIR | The path to the FSL directory on your server/workstation. | A string that is a path (e.g., /usr/share/fsl/5.0) |
3231
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------+
@@ -53,3 +52,66 @@ The box below contains an example of what these parameters might look like when
5352
numCoresPerSubject : 1
5453
numSubjectsAtOnce : 1
5554
num_ants_threads : 1
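The core-count constraint described under 'Number of Cores Per Subject' can be checked before starting a run. The following is a minimal sketch, assuming you know the total number of cores available; ``check_core_budget`` is a hypothetical helper and not part of C-PAC. Its first three arguments correspond to ``numCoresPerSubject``, ``numSubjectsAtOnce``, and ``num_ants_threads``.

```shell
# Hypothetical helper (not part of C-PAC):
#   check_core_budget CORES_PER_SUBJECT SUBJECTS_AT_ONCE ANTS_THREADS TOTAL_CORES
# Succeeds when the product of the first three values fits within TOTAL_CORES.
check_core_budget() {
    needed=$(( $1 * $2 * $3 ))
    if [ "$needed" -le "$4" ]; then
        echo "ok: $needed of $4 cores requested"
    else
        echo "over budget: $needed cores requested, only $4 available"
        return 1
    fi
}
```

For example, ``check_core_budget 1 1 1 4`` reports that the settings shown above fit on a four-core machine, while ``check_core_budget 4 2 2 8`` fails because 16 cores would be requested.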
55+
56+
Setting up SGE
57+
"""""""""""""""
58+
59+
Preliminaries
60+
^^^^^^^^^^^^^
61+
62+
Before you configure Sun Grid Engine so that it works with C-PAC, you should understand the following concepts:
63+
64+
* **Job Scheduler** - A program that can allocate computational resources in an HPC cluster to jobs based on availability and distribute jobs across nodes. C-PAC can use Sun Grid Engine (SGE) as its job scheduler (and SGE comes pre-configured with C-PAC's :doc:`cloud image <cloud>`).
65+
66+
* **Parallel Environment** - A specification for how SGE parallelizes work. Parallel environments can have limits on the number of CPUs used, whitelists and blacklists that dictate who can use resources, and specific methods for balancing server load during distributed tasks.
67+
68+
* **The Job Queue** - A container for jobs that are waiting to run or are running on the cluster. A queue can be suspended, in which case the jobs running in it are paused until the queue is resumed.
69+
70+
* **Head Node** - The primary node of an HPC cluster, to which all other nodes are connected. The head node will run a job scheduler (such as Sun Grid Engine) to allocate jobs to the other nodes.
71+
72+
* **Worker Node** - A node in an HPC cluster to which tasks are delegated by the head node via a job scheduler.
73+
74+
* **Job Submission Script** - A shell script with a series of commands to be executed as part of the job. Submission scripts may also include flags that activate functionality specific to the scheduler.
75+
76+
Configuring A Parallel Environment
77+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
78+
79+
The specifics of configuring a parallel environment in SGE are beyond the scope of this guide (see `Oracle's blog <https://blogs.oracle.com/templedf/entry/configuring_a_new_parallel_environment>`_ for a good primer on how to prepare a parallel environment). Nevertheless, we will walk through a simple example: the `mpi_smp` environment used by the C-PAC cloud image. First, create a file named ``mpi_smp.conf`` with the following contents:
80+
81+
.. code-block:: bash
82+
pe_name mpi_smp
83+
slots 999
84+
user_lists NONE
85+
xuser_lists NONE
86+
start_proc_args NONE
87+
stop_proc_args NONE
88+
allocation_rule $pe_slots
89+
control_slaves TRUE
90+
job_is_first_task FALSE
91+
urgency_slots min
92+
accounting_summary TRUE
93+
94+
This configuration ensures that:
95+
* Up to 999 slots can be in use by this parallel environment at once.
96+
* No users are whitelisted or blacklisted and no special hooks or cleanup tasks occur before or after a job.
97+
* All job slots that a C-PAC job submission requests are on the same machine.
98+
* SGE has full control over the jobs submitted (in terms of resource scheduling).
99+
* The C-PAC run is not part of a parallel job that would require an awareness of which task was performed first.
100+
* An accounting record is written concerning how the job used resources.
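One way to create the file shown above from the command line is a quoted here-document, which keeps the literal ``$pe_slots`` from being expanded by the shell. This is a minimal sketch only; ``write_pe_conf`` and ``pe_conf_value`` are hypothetical helper names and are not part of SGE or C-PAC.

```shell
# Hypothetical helper: write the mpi_smp parallel environment definition
# to the path given as the first argument.
write_pe_conf() {
    # The quoted 'EOF' delimiter stops the shell from expanding $pe_slots.
    cat > "$1" <<'EOF'
pe_name            mpi_smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
EOF
}

# Hypothetical helper: print the value stored for a key (first argument)
# in a parallel environment definition file (second argument).
pe_conf_value() {
    awk -v key="$1" '$1 == key { print $2 }' "$2"
}
```

After running ``write_pe_conf /path/to/mpi_smp.conf``, a call such as ``pe_conf_value allocation_rule /path/to/mpi_smp.conf`` can be used to confirm that the allocation rule was written literally.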
101+
102+
To activate this parallel environment and tie it to a job queue named 'all.q', use the following commands:
103+
104+
.. code-block:: bash
105+
qconf -Ap /path/to/mpi_smp.conf
106+
qconf -mattr queue pe_list "mpi_smp" all.q
107+
108+
You would then set the SGE Parallel Environment to "mpi_smp" and the SGE queue to "all.q" in your pipeline configuration file before starting your C-PAC run.
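To illustrate how the parallel environment and queue names are used by the scheduler, the sketch below generates a small job submission script that requests slots from ``mpi_smp`` on ``all.q``. The helper name ``make_job_script`` is hypothetical, and the final ``echo`` line is a placeholder for whatever command the job should run; it is not the command C-PAC itself submits.

```shell
# Hypothetical helper: print an SGE submission script to standard output.
#   make_job_script PARALLEL_ENVIRONMENT QUEUE SLOTS
make_job_script() {
    printf '#!/bin/bash\n'
    # Lines beginning with "#$" are read by SGE as submission options.
    printf '#$ -S /bin/bash\n'
    printf '#$ -cwd\n'
    printf '#$ -pe %s %s\n' "$1" "$3"
    printf '#$ -q %s\n' "$2"
    # NSLOTS is set by SGE to the number of slots granted to the job.
    printf 'echo "Running with $NSLOTS slots"\n'
}
```

For example, ``make_job_script mpi_smp all.q 4 > test_job.sh`` followed by ``qsub test_job.sh`` would submit a four-slot test job. Before submitting, ``qconf -sp mpi_smp`` and ``qconf -sq all.q`` can be used to display the parallel environment and queue definitions and confirm that ``mpi_smp`` appears in the queue's ``pe_list``.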
109+
110+
Additional Links
111+
""""""""""""""""
112+
113+
* `The Sun Grid Engine User Guide <http://www.csb.yale.edu/userguides/sysresource/batch/doc/UserGuide_6.1.pdf>`_
114+
* `Starcluster's Sun Grid Engine Tutorial <http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html>`_
115+
* `Oracle's Parallel Environment Tutorial <https://blogs.oracle.com/templedf/entry/configuring_a_new_parallel_environment>`_
116+
* `University of Tennessee Knoxville's Guide to Using SGE <https://newton.utk.edu/doc/Documentation/UsingTheGridEngine/>`_
117+
