mkdocs/docs/HPC/getting_started.md
82 additions & 45 deletions
@@ -1,4 +1,7 @@
+# Title
+
 {% set exampleloc="mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist" %}
+
 # Getting Started

 Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the {{hpcinfra}} and submitting your very first job. We'll also walk you through the process step by step using a practical example.
@@ -7,25 +10,25 @@ In addition to this chapter, you might find the [recording of the *Introduction

 Before proceeding, read [the introduction to HPC](introduction.md) to gain an understanding of the {{ hpcinfra }} and related terminology.

-###Getting Access
+## Getting Access

 To get access to the {{hpcinfra}}, visit [Getting an HPC Account](account.md).

-If you have not used Linux before,
+If you have not used Linux before,
 {%- if site == 'Gent' %}
 now would be a good time to follow our [Linux Tutorial](linux-tutorial/index.md).
 {%- else %}
 please learn some basics first before continuing. (see [Appendix C - Useful Linux Commands](useful_linux_commands.md))
 {%- endif %}

-####A typical workflow looks like this:
+### A typical workflow looks like this

-1.Connect to the login nodes
-2.Transfer your files to the {{hpcinfra}}
-3.Optional: compile your code and test it
-4.Create a job script and submit your job
-5.Wait for job to be executed
-6.Study the results generated by your jobs, either on the cluster or
+1. Connect to the login nodes
+2. Transfer your files to the {{hpcinfra}}
+3. Optional: compile your code and test it
+4. Create a job script and submit your job
+5. Wait for job to be executed
+6. Study the results generated by your jobs, either on the cluster or
 after downloading them locally.

 We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using [TensorFlow](https://www.tensorflow.org/);
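As an illustrative aside on the workflow documented in this hunk: the numbered steps map onto a handful of shell commands. The sketch below reuses the guide's own `{{userid}}`, `{{loginnode}}` and `{{jobid}}` placeholders; the actual values depend on your VSC account and site, and the file names are those of the MNIST example used later in the chapter.

```shell
# 1. Connect to a login node
ssh {{userid}}@{{loginnode}}

# 2. Transfer your files (run this on your local machine)
scp run.sh tensorflow-mnist.py {{userid}}@{{loginnode}}:~

# 4. Submit the job script; qsub prints a job identifier
qsub run.sh

# 5. Check whether the job is queued, running or finished
qstat

# 6. Inspect the results once the job has completed
cat run.sh.o{{jobid}}
```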
@@ -38,10 +41,10 @@ There are two options to connect
 - Using a terminal to connect via SSH (for power users) (see [First Time connection to the {{ hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure))
 - [Using the web portal](web_portal.md)

-Considering your operating system is **{{OS}}**,
+Considering your operating system is **{{OS}}**,

 {%- if OS == linux %}
-it is recommended to make use of the `ssh` command in a terminal to get the most flexibility.
+it is recommended to make use of the `ssh` command in a terminal to get the most flexibility.

 Assuming you have already generated SSH keys in the previous step ([Getting Access](#getting-access)), and that they are in a default location, you should now be able to login by running the following command:

@@ -50,12 +53,16 @@ ssh {{userid}}@{{loginnode}}
 ```

 !!! Warning "User your own VSC account id"
-
-Replace **{{userid}}** with your VSC account id (see <https://account.vscentrum.be>)
+
+```text
+Replace **{{userid}}** with your VSC account id (see <https://account.vscentrum.be>)
+```

 !!! Tip

-You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access))
+```text
+You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access))
+```

 {%- else %}
 {%- if OS == windows %} it is recommended to use the web portal.
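A small aside on the `ssh` login shown in this hunk: if your key pair is not in the default location, the standard `-i` option of `ssh` points at the private key explicitly. This is a sketch only; the key file name below is an assumption, not something prescribed by the documentation.

```shell
# Log in using an explicitly specified private key instead of the default ~/.ssh/id_rsa
ssh -i ~/.ssh/id_rsa_vsc {{userid}}@{{loginnode}}
```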
@@ -72,16 +79,17 @@ Make sure you can get to a shell access to the {{hpcinfra}} before proceeding wi

 !!! Info

-When having problems see the [connection issues section on the troubleshooting page](troubleshooting.md#sec:connecting-issues).
-
+```text
+When having problems see the [connection issues section on the troubleshooting page](troubleshooting.md#sec:connecting-issues).
+```

 ### Transfer your files

 Now that you can login, it is time to transfer files from your local computer to your **home directory** on the {{hpcinfra}}.

 Download following the example scripts to your computer:
 You can also find the example scripts in our git repo: [https://github.com/hpcugent/vsc_user_docs/](https://github.com/hpcugent/vsc_user_docs/tree/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist).
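For the file transfer covered in this hunk, `scp` from a terminal on your local machine is one option alongside WinSCP and the web portal. A minimal sketch using the guide's placeholders; the `~` target simply means your home directory on the cluster.

```shell
# Copy both example files into your home directory on the {{hpcinfra}}
scp run.sh tensorflow-mnist.py {{userid}}@{{loginnode}}:~

# Verify on the cluster that both files arrived
ssh {{userid}}@{{loginnode}} "ls -l run.sh tensorflow-mnist.py"
```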
@@ -95,17 +103,21 @@ Upload both files (`run.sh` and `tensorflow-mnist.py`) to your **home directory*

 !!! Info

-As an alternative, you can use WinSCP (see [our section](connecting.md#winscp))
+```text
+As an alternative, you can use WinSCP (see [our section](connecting.md#winscp))

-When submitting jobs with limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`.
+
+```text
+When submitting jobs with limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`.
+```

 {%- if site == 'Gent' %}

-To get a list of all clusters and their hardware, see <https://www.ugent.be/hpc/en/infrastructure>.
+```text
+To get a list of all clusters and their hardware, see <https://www.ugent.be/hpc/en/infrastructure>.
+```

 {%- endif %}

 This job script can now be submitted to the cluster's job system for execution, using the qsub (**q**ueue **sub**mit) command:

-```
+```text
 $ qsub run.sh
 {{jobid}}
 ```

 This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job.

 !!! Warning "Make sure you understand what the `module` command does"
-
-Note that the module commands only modify environment variables. For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster,
-but our active shell session is still running on the login node.
-
-It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands your are running are being executed: they will still be run on the login node you are on.
-
-When you submit a job script however, the commands ***in*** the job script will be run on a workernode of the cluster the job was submitted to (like `{{othercluster}}`).
+
+```text
+Note that the module commands only modify environment variables. For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster,
+but our active shell session is still running on the login node.
+```
+
+```text
+It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands your are running are being executed: they will still be run on the login node you are on.
+```
+
+```text
+When you submit a job script however, the commands ***in*** the job script will be run on a workernode of the cluster the job was submitted to (like `{{othercluster}}`).
+```

 For detailed information about `module` commands, read the [running batch jobs](running_batch_jobs.md) chapter.

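To make the `module` warning in this hunk concrete: swapping the cluster module only changes environment variables in your current shell on the login node; it is the subsequent `qsub` that causes the job script to run on a workernode of that cluster. A short sketch, reusing the `{{othercluster}}` placeholder from the documentation:

```shell
# Runs on the login node: only adjusts environment variables such as the submission target
module swap cluster/{{othercluster}}

# Also typed on the login node, but the commands inside run.sh will execute
# on a workernode of the {{othercluster}} cluster once the job starts
qsub run.sh
```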
@@ -195,15 +221,17 @@ Your job is put into a queue before being executed, so it may take a while befor
 (see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) for scheduling policy).

 You can get an overview of the active jobs using the `qstat` command:
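Typical monitoring commands look like the following sketch. The plain `qstat` overview is what the documentation describes; `qstat -f` for full per-job details is standard Torque/PBS behaviour and is mentioned here only as an assumption about the scheduler in use.

```shell
# Overview of your queued and running jobs
qstat

# Full details for a single job (use the id that qsub printed)
qstat -f {{jobid}}
```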
@@ -227,25 +255,30 @@ By default located in the directory where you issued `qsub`.

 !!! Info

-For more information about the stdout and stderr output channels, see this [section](linux-tutorial/beyond_the_basics.md#inputoutput).
+```text
+For more information about the stdout and stderr output channels, see this [section](linux-tutorial/beyond_the_basics.md#inputoutput).
+```

 {%- endif %}

 In our example when running `ls` in the current directory you should see 2 new files:
-
+
 - **run.sh.o{{jobid}}**, containing *normal output messages* produced by job {{jobid}};
 - **run.sh.e{{jobid}}**, containing *errors and warnings* produced by job {{jobid}}.

 !!! Info
-
+
 run.sh.e{{jobid}} should be empty (no errors or warnings).

 !!! Warning "Use your own job ID"

-Replace **{{jobid}}** with the jobid you got from the `qstat` command (see above) or simply look for added files in your current directory by running `ls`.
+```text
+Replace **{{jobid}}** with the jobid you got from the `qstat` command (see above) or simply look for added files in your current directory by running `ls`.
+```

 When examining the contents of ``run.sh.o{{jobid}}`` you will see something like this:
-```
+
+```text
 Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
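Once the job has finished, inspecting the two output files named in this hunk, from the directory where you ran `qsub`, looks like this (a sketch using the same `{{jobid}}` placeholder):

```shell
# Normal output of the job (downloads, training progress, final accuracy)
cat run.sh.o{{jobid}}

# Errors and warnings; ideally this file is empty
cat run.sh.e{{jobid}}
```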