diff --git a/mkdocs/docs/HPC/getting_started.md b/mkdocs/docs/HPC/getting_started.md index 91e15ceedd9..85e4c405ea7 100644 --- a/mkdocs/docs/HPC/getting_started.md +++ b/mkdocs/docs/HPC/getting_started.md @@ -1,111 +1,153 @@ +# Title + {% set exampleloc="mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist" %} + # Getting Started -Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the {{hpcinfra}} and submitting your very first job. We'll also walk you through the process step by step using a practical example. +Welcome to the "Getting Started" guide. This chapter will lead you through the +initial steps of logging into the {{hpcinfra}} and submitting your very first +job. We'll also walk you through the process step by step using a practical +example. -In addition to this chapter, you might find the [recording of the *Introduction to HPC-UGent* training session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a useful resource. +In addition to this chapter, you might find the [recording of the *Introduction +to HPC-UGent* training +session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a +useful resource. -Before proceeding, read [the introduction to HPC](introduction.md) to gain an understanding of the {{ hpcinfra }} and related terminology. +Before proceeding, read [the introduction to HPC](introduction.md) to gain an +understanding of the {{ hpcinfra }} and related terminology. -### Getting Access +## Getting Access -To get access to the {{hpcinfra}}, visit [Getting an HPC Account](account.md). +To get access to the {{hpcinfra}}, you need [an HPC account](account.md). -If you have not used Linux before, +If you have not used Linux before, {%- if site == 'Gent' %} now would be a good time to follow our [Linux Tutorial](linux-tutorial/index.md). {%- else %} -please learn some basics first before continuing. 
(see [Appendix C - Useful Linux Commands](useful_linux_commands.md)) +please learn some basics first before continuing +(see [Appendix C - Useful Linux Commands](useful_linux_commands.md)). {%- endif %} -#### A typical workflow looks like this: +### A typical workflow looks like this -1. Connect to the login nodes -2. Transfer your files to the {{hpcinfra}} -3. Optional: compile your code and test it -4. Create a job script and submit your job -5. Wait for job to be executed -6. Study the results generated by your jobs, either on the cluster or +1. Connect to the login nodes +2. Transfer your files to the {{hpcinfra}} +3. Optional: compile your code and test it +4. Create a job script and submit your job +5. Wait for job to be executed +6. Study the results generated by your jobs, either on the cluster or after downloading them locally. -We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using [TensorFlow](https://www.tensorflow.org/); +We will walk through an illustrative workload to get you started. In this +example, our objective is to train a deep learning model for recognizing +handwritten digits (MNIST dataset) using +[TensorFlow](https://www.tensorflow.org/); see the [example scripts](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}}). 
### Getting Connected There are two options to connect -- Using a terminal to connect via SSH (for power users) (see [First Time connection to the {{ hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure)) +- Using a terminal to connect via SSH (for power users) + (see [First Time connection to the + {{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure)) - [Using the web portal](web_portal.md) -Considering your operating system is **{{OS}}**, +Considering your operating system is **{{OS}}**, {%- if OS == linux %} -it is recommended to make use of the `ssh` command in a terminal to get the most flexibility. +it is recommended to make use of the `ssh` command in a terminal to get the +most flexibility. -Assuming you have already generated SSH keys in the previous step ([Getting Access](#getting-access)), and that they are in a default location, you should now be able to login by running the following command: +Assuming you have already generated SSH keys in the previous step ([Getting +Access](#getting-access)), and that they are in a default location, you should +now be able to login by running the following command: ```shell ssh {{userid}}@{{loginnode}} ``` -!!! Warning "User your own VSC account id" - - Replace **{{userid}}** with your VSC account id (see ) +!!! Warning "Use your own VSC account id" + +```text +Replace **{{userid}}** with your VSC account id (see +) +``` !!! Tip - You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access)) +```text +You can also still use the web portal (see [shell access on web +portal](web_portal.md#shell-access)) +``` {%- else %} {%- if OS == windows %} it is recommended to use the web portal. -{%- else %} it should be easy to make use of the `ssh` command in a terminal, but the web portal will work too. {%- endif %} +{%- else %} it should be easy to make use of the `ssh` command in a terminal, +but the web portal will work too. 
{%- endif %} -The [web portal](web_portal.md) offers a convenient way to upload files and gain shell access to the {{hpcinfra}} from a standard web browser (no software installation or configuration required). +The [web portal](web_portal.md) offers a convenient way to upload files and +gain shell access to the {{hpcinfra}} from a standard web browser (no software +installation or configuration required). See [shell access](web_portal.md#shell-access) when using the web portal, or -[connection to the {{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure) when using a terminal. +[connection to the +{{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure) +when using a terminal. -Make sure you can get to a shell access to the {{hpcinfra}} before proceeding with the next steps. +Make sure you have shell access to the {{hpcinfra}} before proceeding +with the next steps. {%- endif %} !!! Info - When having problems see the [connection issues section on the troubleshooting page](troubleshooting.md#sec:connecting-issues). - +```text +If you have problems, see the [connection issues section on the troubleshooting +page](troubleshooting.md#sec:connecting-issues). +``` ### Transfer your files -Now that you can login, it is time to transfer files from your local computer to your **home directory** on the {{hpcinfra}}. +Now that you can log in, it is time to transfer files from your local computer +to your **home directory** on the {{hpcinfra}}. 
Download the following example scripts to your computer: -- [tensorflow_mnist.py](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py) +- [tensorflow_mnist.py](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py) - [run.sh](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh) -You can also find the example scripts in our git repo: [https://github.com/hpcugent/vsc_user_docs/](https://github.com/hpcugent/vsc_user_docs/tree/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist). +You can also find the example scripts in our git repository: +[https://github.com/hpcugent/vsc_user_docs/](https://github.com/hpcugent/vsc_user_docs/tree/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist). {%- if OS == windows %} -The [HPC-UGent web portal](https://login.hpc.ugent.be) provides a file browser that allows uploading files. +The [HPC-UGent web portal](https://login.hpc.ugent.be) provides a file browser +that allows uploading files. For more information see the [file browser section](web_portal.md#file-browser). -Upload both files (`run.sh` and `tensorflow-mnist.py`) to your **home directory** and go back to your shell. +Upload both files (`run.sh` and `tensorflow_mnist.py`) to your **home +directory** and go back to your shell. !!! Info - As an alternative, you can use WinSCP (see [our section](connecting.md#winscp)) +```text +As an alternative, you can use WinSCP (see [our section](connecting.md#winscp)) +``` {%- else %} On your local machine you can run: + ```shell curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh ``` -Using the `scp` command, the files can be copied from your local host to your *home directory* (`~`) on the remote host (HPC). 
+Using the `scp` command, the files can be copied from your local host to your +*home directory* (`~`) on the remote host (HPC). + ```shell scp tensorflow_mnist.py run.sh {{userid}}@{{ loginnode }}:~ ``` @@ -115,30 +157,38 @@ ssh {{userid}}@{{ loginnode }} ``` !!! Warning "Use your own VSC account id" - - Replace **{{userid}}** with your VSC account id (see ) + +```text +Replace **{{userid}}** with your VSC account id (see ) +``` !!! Info - For more information about transfering files or `scp`, see [tranfer files from/to hpc](connecting.md#transfer-files-tofrom-the-hpc). +```text +For more information about transferring files or `scp`, see [transfer files +from/to hpc](connecting.md#transfer-files-tofrom-the-hpc). +``` {%- endif %} -When running `ls` in your session on the {{hpcinfra}}, you should see the two files listed in your home directory (`~`): +When running `ls` in your session on the {{hpcinfra}}, you should see the two +files listed in your home directory (`~`): -``` +```text $ ls ~ run.sh tensorflow_mnist.py ``` -When you do not see these files, make sure you uploaded the files to your **home directory**. +If you do not see these files, make sure you uploaded the files to your +**home directory**. ### Submitting a job -Jobs are submitted and executed using job scripts. In our case **run.sh** can be used as a (very minimal) job script. +Jobs are submitted and executed using job scripts. In our case **run.sh** can +be used as a (very minimal) job script. -A job script is a shell script, a text file that specifies the resources, -the software that is used (via `module load` statements), +A job script is a shell script, a text file that specifies the resources, +the software that is used (via `module load` statements), and the steps that should be executed to run the calculation. 
Our job script looks like this: @@ -150,69 +200,101 @@ module load TensorFlow/2.15.1-foss-2023a python tensorflow_mnist.py ``` -As you can see this job script will run the Python script named **tensorflow_mnist.py**. +As you can see this job script will run the Python script named +**tensorflow_mnist.py**. -The jobs you submit are per default executed on **cluser/{{defaultcluster}}**, you can swap to another cluster by issuing the following command. +The jobs you submit are by default executed on **cluster/{{defaultcluster}}**; +you can swap to another cluster by issuing the following command: ```shell module swap cluster/{{othercluster}} ``` !!! Tip - - When submitting jobs with limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`. + +```text +When submitting jobs with a limited amount of resources, it is recommended to use +the [debug/interactive +cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`. +``` {%- if site == 'Gent' %} - To get a list of all clusters and their hardware, see . +```text +To get a list of all clusters and their hardware, see +. +``` {%- endif %} -This job script can now be submitted to the cluster's job system for execution, using the qsub (**q**ueue **sub**mit) command: +This job script can now be submitted to the cluster's job system for execution, +using the `qsub` (**q**ueue **sub**mit) command: -``` +```text $ qsub run.sh {{jobid}} ``` -This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job. +This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is +a unique identifier for the job which can be used to monitor and manage your +job. !!! Warning "Make sure you understand what the `module` command does" - - Note that the module commands only modify environment variables. 
For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster, - but our active shell session is still running on the login node. - - It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands your are running are being executed: they will still be run on the login node you are on. - - When you submit a job script however, the commands ***in*** the job script will be run on a workernode of the cluster the job was submitted to (like `{{othercluster}}`). -For detailed information about `module` commands, read the [running batch jobs](running_batch_jobs.md) chapter. +```text +Note that the module commands only modify environment variables. For instance, +running `module swap cluster/{{othercluster}}` will update your shell +environment so that `qsub` submits a job to the `{{othercluster}}` cluster, +but your active shell session is still running on the login node. +``` + +```text +It is important to understand that while `module` commands affect your session +environment, they do ***not*** change where the commands you are running are +being executed: they will still be run on the login node you are on. +``` + +```text +When you submit a job script, however, the commands ***in*** the job script will +be run on a worker node of the cluster the job was submitted to (like +`{{othercluster}}`). +``` + +For detailed information about `module` commands, read the [running batch +jobs](running_batch_jobs.md) chapter. ### Wait for job to be executed -Your job is put into a queue before being executed, so it may take a while before it actually starts. -(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) for scheduling policy). +Your job is put into a queue before being executed, so it may take a while +before it actually starts. 
+(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) +for scheduling policy). You can get an overview of the active jobs using the `qstat` command: -``` + +```text $ qstat Job ID Name User Time Use S Queue ---------- ---------------- --------------- -------- - ------- {{jobid}} run.sh {{userid}} 0:00:00 Q {{othercluster}} ``` -Eventually, after entering `qstat` again you should see that your job has started running: -``` +Eventually, after entering `qstat` again you should see that your job has +started running: + +```text $ qstat Job ID Name User Time Use S Queue ---------- ---------------- --------------- -------- - ------- {{jobid}} run.sh {{userid}} 0:00:01 R {{othercluster}} ``` -If you don't see your job in the output of the `qstat` command anymore, your job has likely completed. +If you don't see your job in the output of the `qstat` command anymore, your +job has likely completed. -Read [this section](running_batch_jobs.md#monitoring-and-managing-your-jobs) on how to interpret the output. +Read [this section](running_batch_jobs.md#monitoring-and-managing-your-jobs) on +how to interpret the output. ### Inspect your results @@ -227,25 +309,34 @@ By default located in the directory where you issued `qsub`. !!! Info - For more information about the stdout and stderr output channels, see this [section](linux-tutorial/beyond_the_basics.md#inputoutput). +```text +For more information about the stdout and stderr output channels, see this +[section](linux-tutorial/beyond_the_basics.md#inputoutput). +``` {%- endif %} In our example when running `ls` in the current directory you should see 2 new files: - + - **run.sh.o{{jobid}}**, containing *normal output messages* produced by job {{jobid}}; - **run.sh.e{{jobid}}**, containing *errors and warnings* produced by job {{jobid}}. !!! Info - + run.sh.e{{jobid}} should be empty (no errors or warnings). !!! 
Warning "Use your own job ID" - Replace **{{jobid}}** with the jobid you got from the `qstat` command (see above) or simply look for added files in your current directory by running `ls`. - -When examining the contents of ``run.sh.o{{jobid}}`` you will see something like this: +```text +Replace **{{jobid}}** with the jobid you got from the `qstat` command (see +above) or simply look for added files in your current directory by running +`ls`. ``` + +When examining the contents of ``run.sh.o{{jobid}}`` you will see something +like this: + +```text Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz 11493376/11490434 [==============================] - 1s 0us/step Epoch 1/5 @@ -261,13 +352,19 @@ Epoch 5/5 313/313 - 0s - loss: 0.0782 - accuracy: 0.9764 ``` -Hurray 🎉, we trained a deep learning model and achieved 97,64 percent accuracy. +Hurrah 🎉, we trained a deep learning model and achieved 97.64 percent accuracy. !!! Warning - When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance, see [GPU clusters](gpu.md). +```text +When using TensorFlow specifically, you should actually submit jobs to a GPU +cluster for better performance; see [GPU clusters](gpu.md). +``` - For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster. +```text +For the purpose of this example, we are running a very small TensorFlow +workload on a CPU-only cluster. +``` ### Next steps @@ -276,4 +373,5 @@ Hurray 🎉, we trained a deep learning model and achieved 97,64 percent accurac - [Multi core jobs/Parallel Computing](multi_core_jobs.md) - [Interactive and debug cluster](interactive_debug.md#interactive-and-debug-cluster) -For more examples see [Program examples](program_examples.md) and [Job script examples](jobscript_examples.md) +For more examples see [Program examples](program_examples.md) and [Job script +examples](jobscript_examples.md).
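
The last two steps of the walkthrough (waiting for the job and inspecting its output) can also be scripted. The following is a minimal sketch and not part of the example scripts in the repository: the function names `wait_for_job` and `show_job_result` and the 30-second polling interval are assumptions made for illustration. It relies only on the behaviour described in the guide, namely that a finished job no longer appears in the `qstat` output, and that the results are written to `<script>.o<jobid>` and `<script>.e<jobid>` in the directory where `qsub` was issued.

```shell
#!/bin/bash

# Hypothetical helper: poll qstat until the given job ID no longer
# appears in its output. The match is a plain substring search, which
# is a simplification; the interval (seconds) is an arbitrary default.
wait_for_job() {
    local jobid="$1"
    local interval="${2:-30}"
    while qstat 2>/dev/null | grep -F -- "$jobid" >/dev/null; do
        sleep "$interval"
    done
}

# Hypothetical helper: print the last line of the job's output file and
# check the error file. Returns 1 if the output file is missing and 2 if
# the error file is not empty.
show_job_result() {
    local script="$1"
    local jobid="$2"
    local out="${script}.o${jobid}"
    local err="${script}.e${jobid}"

    if [ ! -f "$out" ]; then
        echo "output file $out not found" >&2
        return 1
    fi
    tail -n 1 "$out"

    if [ -s "$err" ]; then
        echo "warnings or errors were written to $err" >&2
        return 2
    fi
    return 0
}
```

After sourcing these functions in the directory where the job was submitted, something like `wait_for_job {{jobid}} && show_job_result run.sh {{jobid}}` would block until the job has left the queue and then print the final line of its output, which for this example is the line reporting the accuracy.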