
Lesson 2: Submitting Jobs



Introduction

In Lesson 1 we submitted some simple jobs using the srun command. This is a handy way to run a one-off task; "one-off" meaning that you do not have a large batch of similar jobs you want to run at the same time. You may also want to run a batch job made up of a set of jobs that represent a group task. In that case, you use the sbatch command together with a submission file that tells sbatch about the jobs you want to run. In either case you may want to include options that help both you and the job scheduler, which in turn helps other users as well.
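
In rough outline, the two styles look like this (a sketch only; my_program and my_job.sub are placeholder names, and real examples follow throughout this lesson):

srun my_program arg1 arg2     # one-off job: runs on a compute node and prints its output back to your terminal
sbatch my_job.sub             # batch job: my_job.sub describes the work to run and any #SBATCH options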

For the following exercises you will use the files in the slurm_tutorial directory.

  • hello.sh
  • hello.sub
  • hello_to.sh
  • hello_to.sub
  • sleepytime.sh
  • sleepytime.sub

These programs are "shell scripts" written in the basic utility language of Linux computers known as the shell. In particular, we are using the bash flavor of shell. To use the cluster effectively, you will have to learn the Linux command line and bash or another shell.

One-Off Jobs

First we will run the script hello.sh on the head node by launching it directly from the command line. Then we will submit it to the cluster.

hello.sh

balter@clusthead1:~/slurm_tutorial$ cat hello.sh
#!/bin/bash

echo "Hello from $(hostname)"
echo "It is currently $(date)"
echo ""
echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "SLURM_JOBID: " $SLURM_JOBID
echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
echo "SLURM_ARRAY_JOB_ID: " $SLURM_ARRAY_JOB_ID

Now run the script on the head node.

balter@clusthead1:~/slurm_tutorial$ ./hello.sh
Hello from clusthead1
It is currently Thu Jun  1 21:37:54 PDT 2017

SLURM_JOB_NAME:
SLURM_JOBID:
SLURM_ARRAY_TASK_ID:
SLURM_ARRAY_JOB_ID:

Because we did not run this script through the scheduler, the SLURM* variables are absent.
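
You can check this directly with the same one-liner that appears (commented out) in sleepytime.sub later in this lesson. This is just a sketch, not one of the tutorial files:

# On the head node this prints nothing, because no SLURM_* variables are set
env | grep "^SLURM" | sort

# Run through the scheduler, the same command lists the variables SLURM sets for the job
srun bash -c 'env | grep "^SLURM" | sort'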

Now run it using srun.

balter@clusthead1:~/slurm_tutorial$ srun hello.sh
Hello from clustnode-3-56.local
It is currently Thu Jun  1 22:15:22 PDT 2017

SLURM_JOB_NAME: hello.sh
SLURM_JOBID:  2170
SLURM_ARRAY_TASK_ID:
SLURM_ARRAY_JOB_ID:

In this case, the SLURM_JOB_NAME variable was automatically set to the name of the executable (hello.sh). You can also customize this, as shown below. SLURM also gave the job a SLURM_JOBID. For now, don't worry about the other SLURM variables.
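
If you would rather pick the name yourself, srun accepts a --job-name (or -J) option. A minimal sketch, where "greeting" is just an example name:

# Set the job name explicitly instead of letting SLURM use the executable name
srun --job-name=greeting hello.sh

The output should then report SLURM_JOB_NAME as greeting rather than hello.sh.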

Command line arguments

Often we call programs with additional arguments or options. With srun, you simply put those arguments after the program name, exactly as you would on the command line. The script hello_to.sh takes two arguments: a first name and a last name. Then the script greets that person cheerfully.

hello_to.sh

balter@clusthead1:~/slurm_tutorial$ cat hello_to.sh
#!/bin/bash

# Things you type after the name of the program are arguments
# In this case two arguments will be captured, and these will
# be used for the first name and last name. You run this script
# like this:

# ./hello_to.sh Michael Jackson

firstname=$1
lastname=$2

echo "Hello to $firstname $lastname from $(hostname)"
echo "It is currently $(date)"
echo ""
echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "SLURM_JOBID: " $SLURM_JOBID
echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
echo "SLURM_ARRAY_JOB_ID: " $SLURM_ARRAY_JOB_ID

This time, submit the job to SLURM using the srun command. You can supply the command-line arguments after the script name, just as you would when running the script directly.

balter@clusthead1:~/slurm_tutorial$ srun hello_to.sh tracy jordan
Hello to tracy jordan from clustnode-3-56.local
It is currently Thu Jun  1 22:21:21 PDT 2017

SLURM_JOB_NAME: hello_to.sh
SLURM_JOBID:  2171
SLURM_ARRAY_TASK_ID:
SLURM_ARRAY_JOB_ID:

Batch jobs

A long running job

So far we have run jobs that complete almost instantaneously. Some jobs take a long time to run, and srun is not very useful for them because it ties up your terminal until the job finishes. To demonstrate, here is a simple script that uses the sleep command, which halts execution for a specified number of seconds. The script takes a single argument specifying how many seconds to sleep.

sleepytime.sh

balter@clusthead1:~/slurm_tutorial$ cat sleepytime.sh
#!/bin/bash
# file name: sleepytime.sh

# When you run this program as:
#       ./sleepytime.sh 10
# "$1" holds the value 10.
# In other words, $1 is the "argument passed to the command"

TIMETOWAIT=$1
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT
echo "I'm awake now!"

# remove the "#" on the next line to generate an error
#this_is_not_a_command

First, run the script on the head node to sleep for 5 seconds.

balter@clusthead1:~/slurm_tutorial$ ./sleepytime.sh 5
sleeping for 5 seconds
I'm awake now!

Now let's start it with srun, but for 30 seconds.

balter@clusthead1:~/slurm_tutorial$ srun sleepytime.sh 30
sleeping for 30 seconds

Well...ok...waiting...tic...toc...tic...I'm not getting any work done...

This is where batch jobs become useful. When you submit a batch job, it goes off to the cluster and runs, but you get your command line right back. You can use commands to monitor and control your jobs as they run (see Lesson 4). To start a batch job, you need to use a "submission script."
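
For example, while a batch job is running you can list your own jobs in the queue with squeue (the monitoring commands are covered properly in Lesson 4). A quick sketch:

# Show only your own jobs in the queue
squeue -u $USER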

SLURM Submission script

This is a sample submission script that is contained in the tutorial directory.

sleepytime.sub

balter@clusthead1:~/slurm_tutorial$ cat sleepytime.sub
#!/bin/sh

### --------  SLURM  ----------- ###
#SBATCH --job-name=sleepytime
##SBATCH --array=1-3
##SBATCH --output="sleep_%A_%a_%j.out"
##SBATCH --error="sleep_%A_%a_%j.err"
#SBATCH --output="sleep_%j.out"
#SBATCH --error="sleep_%j.err"
### -------------------------- ###

### Display all variables set by slurm
#env | grep "^SLURM" | sort

### All my commands for job will go here

echo "job name: $SLURM_JOB_NAME"
echo "SLURM_JOBID: $SLURM_JOBID"
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID"

./sleepytime.sh 10
#./sleepytime.sh $SLURM_ARRAY_TASK_ID

Things to notice:

  1. There are commented lines with the word SBATCH immediately after the hash symbol. These lines are interpreted by the SLURM scheduler.
  2. There are lines where SBATCH is double-commented with ##. These lines are ignored by the SLURM scheduler (a sketch of how they could be used follows this list).
  3. There are ordinary comment lines, with a single # not followed by SBATCH, that are ignored by both SLURM and bash.
  4. We have specified names for log files to capture output (stdout) and errors (stderr).
  5. These filenames have funny symbols in them. SLURM allows you to include filename patterns that can put useful information (such as job id) into the name of the output file. For instance, %j will be replaced by the SLURM_JOBID.
  6. We explicitly specified the job name "sleepytime" instead of letting SLURM default to the name of the submitted script ("sleepytime.sub").
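
As a sketch of how the double-commented lines could be used (this variant is not in the tutorial files as shipped; it simply swaps which lines are active): enabling --array turns the job into a three-task job array. In the output filenames, %A is replaced by the array job id and %a by the task index, and each task receives its own SLURM_ARRAY_TASK_ID.

#!/bin/sh

### --------  SLURM  ----------- ###
#SBATCH --job-name=sleepytime
#SBATCH --array=1-3
#SBATCH --output="sleep_%A_%a_%j.out"
#SBATCH --error="sleep_%A_%a_%j.err"
### -------------------------- ###

# Each array task gets its own SLURM_ARRAY_TASK_ID (1, 2, or 3),
# so each task sleeps for a different number of seconds and
# writes its own pair of log files.
./sleepytime.sh $SLURM_ARRAY_TASK_ID

For now, stick with sleepytime.sub as shipped, with the array lines double-commented.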

You can launch this batch job on the cluster using SLURM's sbatch command.

balter@clusthead1:~/slurm_tutorial$ sbatch sleepytime.sub
Submitted batch job 2179
balter@clusthead1:~/slurm_tutorial$

SLURM submitted my job and told me that the job id is 2179, and it returned me to my command line immediately. After 10 seconds or so, we can look in our directory (sorted by time in reverse order) to see the log files.

balter@clusthead1:~/slurm_tutorial$ ls -ltr
...
...
-rw-rw-r--. 1 balter HPCUsers     0 Jun  2 00:46 sleep_2179.err
-rw-rw-r--. 1 balter HPCUsers   122 Jun  2 00:46 sleep_2179.out

Two output files were created with the format we specified. Here is what they look like:

balter@clusthead1:~/slurm_tutorial$ cat sleep_2179.err
balter@clusthead1:~/slurm_tutorial$ cat sleep_2179.out
job name: sleepytime
SLURM_JOBID:  2179
SLURM_ARRAY_TASK_ID:
SLURM_ARRAY_JOB_ID:
sleeping for 10 seconds
I'm awake now!

Since everything went smoothly, the error file sleep_2179.err is empty.

Capturing errors

When we run a job and it fails, we need to know what happened. There could be an error in our program, or even a problem in the system such as a failed disk or a broken connection. By specifying an error log file we save all of this information, because anything that would go to stderr is written to that file.

To test this, we need to create a situation that produces an error. Edit the sleepytime.sh file and uncomment the last line, which reads #this_is_not_a_command. Then sbatch it again.

Now our output files look like:

listing

balter@clusthead1:~/slurm_tutorial$ ls -ltr
...
...
-rwxr-xr-x. 1 balter HPCUsers   362 Jun  2 00:49 sleepytime.sh
-rw-rw-r--. 1 balter HPCUsers   122 Jun  2 00:50 sleep_2180.out
-rw-rw-r--. 1 balter HPCUsers    67 Jun  2 00:50 sleep_2180.err

output log file

balter@clusthead1:~/slurm_tutorial$ cat sleep_2180.out
job name: sleepytime
SLURM_JOBID:  2180
SLURM_ARRAY_TASK_ID:
SLURM_ARRAY_JOB_ID:
sleeping for 10 seconds
I'm awake now!

error log file

balter@clusthead1:~/slurm_tutorial$ cat sleep_2180.err
./sleepytime.sh: line 15: this_is_not_a_command: command not found

Stderr was written to the .err file.

Interactive Jobs

Submitting a job to run while you have a cup of coffee or go home and sleep is one of the great things about SLURM. But before you launch your job you need to make sure it works. Because we want to limit our use of the head node as much as possible, it is best to develop and debug your code in an interactive session. An interactive session is also a good idea for simple but I/O-intensive operations you might do at the command line, like an awk or sed command working on a text file.

The simplest way to start an interactive job is to use srun. In this case, the job you run with srun is simply your choice of interactive shell. For example:

balter@clusthead1:~/slurm_tutorial$ srun --pty bash
balter@clustnode-3-56:~/slurm_tutorial$

In this case, SLURM dropped me into an interactive shell on node 3-56. The --pty option is necessary so that SLURM actually gives you a terminal, as opposed to just running bash. If you leave off --pty, the command bash will be submitted to the cluster and run there, but you will have no way to access it :-/

To leave the interactive session and drop back to the shell where you launched it, just type exit.

You don't have to launch yourself into a bash shell. You can just as easily launch right into Python, R, Ruby, or any other shell you want. Just remember to use --pty.

balter@clusthead1:~/slurm_tutorial$ srun --pty python
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Job Options

As you saw in the submission file, you can specify options that tell SLURM the minimum resources your job needs to run properly (number of CPUs, memory requirements, time needed to run, etc.). This also helps SLURM manage jobs on the cluster in the most efficient and equitable way possible, by making sure that each user gets just the resources they need and saving the rest for other jobs.
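
A hedged sketch of what such options look like in a submission script (these are standard sbatch options, but the values here are arbitrary placeholders; the details are covered in Lesson 3):

#SBATCH --cpus-per-task=4     # number of CPU cores for the job
#SBATCH --mem=8G              # memory required
#SBATCH --time=01:00:00       # maximum run time (HH:MM:SS)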

To learn about job submission options, proceed to Lesson 3: SLURM Options.
