> - `--label` labels standard output of tasks based on task ID from 0 to N.

So far we have seen that slurm decides _on which nodes our programs should run_ based on its scheduling decisions. However, it also gives us more control, such as explicitly specifying the `partition` on which our programs run.

A partition is a group of similar nodes, maintained as a list. For example, H100 nodes may be grouped together as a partition named `H100_Partition`. Whenever we submit a job requesting H100s, nodes from this partition are reserved and our job is scheduled on them.

You can check the list of **_all partitions_** and their **_compute node lists_** with the `sinfo` command. This will also show you more information about the partitions and their statuses:

```sh
sinfo
```
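If you are only interested in a single partition, `sinfo` also accepts a `--partition` (or `-p`) filter. A quick example (the partition name `cs` here simply mirrors the example below and may differ on your cluster):

```sh
# Restrict the listing to one partition
sinfo --partition=cs
```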
To specify a particular partition, you can use the `--partition` option as shown below:

```sh
srun --partition=cs --nodes=2 --tasks-per-node=1 --cpus-per-task=4 --mem=4GB --time=05:00 lua hello.lua
```
> **_(A) SLURM OVERVIEW_**
>
> - Users submit jobs on the cluster.
>
> - Slurm (more precisely the slurm controller), which runs exclusively on its own node, `queues` up the submitted jobs based on `priority` and schedules them across compute nodes based on the jobs' `compute requirements` and expected `execution time` (priority and backfill scheduling).
>
> - Once a job has been `scheduled` on compute node(s), it runs without interruption. The slurm controller continuously monitors the job's status throughout its life cycle and manages a `database` (e.g. MySQL) where it temporarily maintains the `status` of all running jobs across the cluster.
>
> - Whenever users run a slurm query such as the `squeue` command to check on the status of their jobs (or any other slurm command), the command issues a `Remote Procedure Call` (RPC for short) to the slurm controller, which fetches the job's status from its database for the user.
>
> - **Too many RPCs** to the slurm controller in a short span of time can overload operations on slurm's database, **_degrading slurm's performance_** (RPCs are usually not rate limited, for various reasons).
>
> - Hence it is **recommended** to take care and **avoid invoking slurm commands very frequently**, especially when calling them from a bash or python script.
>
> - **_Failing_** to follow this may result in the user account being **_suspended_**.

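If a script genuinely needs to wait for a job, poll the controller at a generous interval rather than calling `squeue` in a tight loop. A minimal sketch (the job id and the one-minute interval are only illustrative):

```sh
#!/bin/bash
# Wait for a job to leave the queue, checking only once per minute
# to keep the RPC load on the slurm controller low.
JOB_ID=55815161   # placeholder job id

while squeue -j "$JOB_ID" 2>/dev/null | grep -q "$JOB_ID"; do
    sleep 60
done
echo "Job $JOB_ID is no longer in the queue."
```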
> **_(B) IMPORTANT NOTE!_**
>
> It is **_crucial_** to understand everything up to this point, as it builds the foundation for the topics covered from here onwards.
> Please make sure you have covered all the topics so far in case you missed anything. It gets easier from here.

> **_RECAP:_**
>
> - So far we have learned what compute clusters are ...
> - How `srun` works ...
> - `squeue` ...
> - `scancel` ...
> - And more ...

## Submitting `Batch jobs`

Previously we saw how to submit individual `interactive jobs`, mostly to run individual programs. However, there is an issue with this method:

> **_What happens if we get disconnected from our ssh session while running our jobs?_**
>
> - To understand this, we need to understand how `ssh sessions` and `bash shells` are set up in our case.
>
> - First, when we ssh to `greene.hpc.nyu.edu`, we land on a `login node` running a `bash shell`; the console is our shell, where we execute Linux commands.
>
> - Then, when we submit a job with `srun`, our program runs **_within_** a `new bash sub-shell` belonging to that particular **_srun_**, in which slurm sets the necessary environment variables, such as the `SLURM_PROCID` environment variable we have seen before.
>
> - Therefore, the "hello, world" `output(s)` printed by this program executing on `compute node(s)` are `buffered` all the way from their `sub-shell(s)` to our `bash shell` running on the login node, and are displayed line after line on the console.
>
> - Hence, if our `ssh gets disconnected` for any reason, the `current bash shell is destroyed`, and the job currently being executed within this `sub-shell` is `cancelled`.

Therefore, instead of interactive jobs we make use of `slurm batch` scripts, also called `sbatch` scripts. They are simple bash scripts `with special directives` that we **_submit to slurm_** rather than running them interactively.

Within an `sbatch` script we either specify a single job by invoking `srun` once, or **_batch multiple jobs_** by invoking `srun` multiple times, and **_submit it to slurm_** under a single job id; hence it is called a `batch job`.

Once we submit a `batch job`, it is scheduled independently, regardless of what happens to our shell. We can safely disconnect from our ssh session and return later to check on the status of our submitted batch job.

> **_NOTE_**:
>
> **_Submitting Batch jobs is the preferred way of submitting jobs to slurm._**

A simple batch job can be written as:

```sh
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --time=00:05:00

srun /bin/bash -c "sleep 60; echo 'hello, world' "
```
As you can see, we provide the familiar slurm options in lines prefixed with `#SBATCH`; these are called slurm `directives` in our bash script.

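These directives act as defaults for the job: if the same option is also passed on the `sbatch` command line, the command-line value takes precedence, which is handy for one-off changes without editing the script. A hedged example (the script name is a placeholder):

```sh
# The command-line --time overrides the #SBATCH --time directive for this submission only
sbatch --time=00:10:00 my_job.sbatch
```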
Create a batch script like the one above, name it `hello.sbatch`, and submit it using the `sbatch` command:

```sh
sbatch hello.sbatch
```
Check the status of this job with:

```sh
squeue --me
```
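`squeue --me` lists all of your jobs. Since `sbatch` prints the assigned job id when you submit (a line of the form `Submitted batch job <Job_ID>`), you can also query that one job directly:

```sh
# Show only the job with this id (replace <Job_ID> with the number sbatch printed)
squeue -j <Job_ID>
```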
Once the job is done, notice that in the same directory from which you submitted it, a new file `slurm-55815161.out` has been created, where the number `55815161` is the **_job id_** in this example.

Check the contents of this file:

```sh
cat slurm-<Job_ID>.out
```
```sh
[pp2959@log-1 slurm_hello_world]$ cat slurm-55815161.out
hello, world
[pp2959@log-1 slurm_hello_world]$
```
This is the output of your job. By default, a new file named `slurm-<Job_ID>.out` is created and the output is written to it.

You can write the output to a custom file name, for example `hello.out`, using the directive `#SBATCH --output=hello.out`. Add this directive to your `hello.sbatch` file as shown below:
```sh
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --time=00:05:00

#SBATCH --output=hello.out

srun /bin/bash -c "echo 'hello, world' "
```
And re-submit your batch job:

```sh
sbatch hello.sbatch
```

You should notice that a new file `hello.out` is created, and your `hello, world` output is redirected to this file.

```sh
[pp2959@log-1 slurm_hello_world]$ cat hello.out
hello, world
[pp2959@log-1 slurm_hello_world]$
```
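If you also want the job id in a custom file name, the `--output` (and `--error`) directives accept filename patterns; `%j` expands to the job id. A small variation on the directive above:

```sh
# Writes to hello_<Job_ID>.out, e.g. hello_55815161.out
#SBATCH --output=hello_%j.out
```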
By default, **_error messages_** generated by your programs are redirected to the same output file, but you can also specify a dedicated file just for error messages, using the directive `#SBATCH --error=hello.err` in this example.

Modify `hello.sbatch` to include this directive, along with a modified program that prints `hello, world` and then exits with an **_error_** code of `1`:
```sh
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB

#SBATCH --output=hello.out
#SBATCH --error=hello.err

srun /bin/bash -c "echo 'hello, world'; exit 1"
```
Submit this job:

```sh
sbatch hello.sbatch
```
Once the job is done, check both the output and the error output of your program:

```sh
[pp2959@log-3 slurm_hello_world]$ cat hello.err
srun: error: cm013: task 0: Exited with exit code 1
srun: Terminating StepId=55815589.0
[pp2959@log-3 slurm_hello_world]$ cat hello.out
hello, world
[pp2959@log-3 slurm_hello_world]$
```
> The error messages are redirected to a separate file, `hello.err`.

In this example the error message tells us the following:

- In the first line, slurm tells us that task 0 of this particular `srun`, running on host `cm013` (a compute node), exited with exit code 1, since we used `exit 1` in our bash script. You may use any exit code from 1 to 255 for debugging purposes; exit code 0 means the program exited with no errors.
- The error message also includes a `StepId`, here `StepId=55815589.0`: this particular `srun` was assigned a step id of 0.
- Each invocation of `srun` in a `batch job` is also called a `job step`.

We can invoke multiple `job steps` within our `batch job` as shown below:
```sh
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB

#SBATCH --output=hello.out
#SBATCH --error=hello.err

srun --time=02:00 /bin/bash -c "echo '(step 0): hello, world'; "
srun --time=02:00 /bin/bash -c "echo '(step 1): hello, world'; "
```
Every `srun` declared in the `batch script` is a `job step` that gets its own `step id`, from 0 to N.

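If job accounting is enabled on your cluster (which is typically the case where `sacct` is available), you can list the individual job steps of a finished batch job along with their states and exit codes:

```sh
# One row per job step, plus a ".batch" step for the script itself
sacct -j <Job_ID> --format=JobID,JobName,State,ExitCode
```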
