Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the {{hpcinfra}} and submitting your very first job. We'll also walk you through the process step by step using a practical example.
7
+
Welcome to the "Getting Started" guide. This chapter will lead you through the
8
+
initial steps of logging into the {{hpcinfra}} and submitting your very first
9
+
job. We'll also walk you through the process step by step using a practical
10
+
example.
8
11
9
-
In addition to this chapter, you might find the [recording of the *Introduction to HPC-UGent* training session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a useful resource.
12
+
In addition to this chapter, you might find the [recording of the *Introduction
13
+
to HPC-UGent* training
14
+
session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a
15
+
useful resource.
10
16
11
-
Before proceeding, read [the introduction to HPC](introduction.md) to gain an understanding of the {{ hpcinfra }} and related terminology.
17
+
Before proceeding, read [the introduction to HPC](introduction.md) to gain an
18
+
understanding of the {{ hpcinfra }} and related terminology.
12
19
13
20
## Getting Access
14
21
@@ -18,7 +25,8 @@ If you have not used Linux before,
18
25
{%- if site == 'Gent' %}
19
26
now would be a good time to follow our [Linux Tutorial](linux-tutorial/index.md).
20
27
{%- else %}
21
-
please learn some basics first before continuing. (see [Appendix C - Useful Linux Commands](useful_linux_commands.md))
28
+
please learn some basics first before continuing. (see [Appendix C - Useful
29
+
Linux Commands](useful_linux_commands.md))
22
30
{%- endif %}
23
31
24
32
### A typical workflow looks like this
@@ -31,22 +39,30 @@ please learn some basics first before continuing. (see [Appendix C - Useful Linu
31
39
6. Study the results generated by your jobs, either on the cluster or
32
40
after downloading them locally.
33
41
34
-
We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using [TensorFlow](https://www.tensorflow.org/);
42
+
We will walk through an illustrative workload to get you started. In this
43
+
example, our objective is to train a deep learning model for recognizing
44
+
hand-written digits (MNIST dataset) using
45
+
[TensorFlow](https://www.tensorflow.org/);
35
46
see the [example scripts](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}}).
36
47
37
48
### Getting Connected
38
49
39
50
There are two options to connect
40
51
41
-
- Using a terminal to connect via SSH (for power users) (see [First Time connection to the {{ hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure))
52
+
- Using a terminal to connect via SSH (for power users)
it is recommended to make use of the `ssh` command in a terminal to get the most flexibility.
60
+
it is recommended to make use of the `ssh` command in a terminal to get the
61
+
most flexibility.
48
62
49
-
Assuming you have already generated SSH keys in the previous step ([Getting Access](#getting-access)), and that they are in a default location, you should now be able to log in by running the following command:
63
+
Assuming you have already generated SSH keys in the previous step ([Getting
64
+
Access](#getting-access)), and that they are in a default location, you should
65
+
now be able to log in by running the following command:
50
66
51
67
```shell
52
68
ssh {{userid}}@{{loginnode}}
@@ -55,51 +71,64 @@ ssh {{userid}}@{{loginnode}}
55
71
!!! Warning "Use your own VSC account id"
56
72
57
73
```text
58
-
Replace **{{userid}}** with your VSC account id (see <https://account.vscentrum.be>)
74
+
Replace **{{userid}}** with your VSC account id (see
75
+
<https://account.vscentrum.be>)
59
76
```
60
77
61
78
!!! Tip
62
79
63
80
```text
64
-
You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access))
81
+
You can also still use the web portal (see [shell access on web
82
+
portal](web_portal.md#shell-access))
65
83
```
66
84
67
85
{%- else %}
68
86
{%- if OS == windows %} it is recommended to use the web portal.
69
-
{%- else %} it should be easy to make use of the `ssh` command in a terminal, but the web portal will work too. {%- endif %}
87
+
{%- else %} it should be easy to make use of the `ssh` command in a terminal,
88
+
but the web portal will work too. {%- endif %}
70
89
71
-
The [web portal](web_portal.md) offers a convenient way to upload files and gain shell access to the {{hpcinfra}} from a standard web browser (no software installation or configuration required).
90
+
The [web portal](web_portal.md) offers a convenient way to upload files and
91
+
gain shell access to the {{hpcinfra}} from a standard web browser (no software
92
+
installation or configuration required).
72
93
73
94
See [shell access](web_portal.md#shell-access) when using the web portal, or
74
-
[connection to the {{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure) when using a terminal.
You can also find the example scripts in our git repo: [https://github.com/hpcugent/vsc_user_docs/](https://github.com/hpcugent/vsc_user_docs/tree/main/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist).
121
+
You can also find the example scripts in our git repository:
When submitting jobs with a limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`.
217
+
When submitting jobs with a limited amount of resources, it is recommended to use
To get a list of all clusters and their hardware, see <https://www.ugent.be/hpc/en/infrastructure>.
225
+
To get a list of all clusters and their hardware, see
226
+
<https://www.ugent.be/hpc/en/infrastructure>.
188
227
```
189
228
190
229
{%- endif %}
191
230
192
-
This job script can now be submitted to the cluster's job system for execution, using the qsub (**q**ueue **sub**mit) command:
231
+
This job script can now be submitted to the cluster's job system for execution,
232
+
using the qsub (**q**ueue **sub**mit) command:
193
233
194
234
```text
195
235
$ qsub run.sh
196
236
{{jobid}}
197
237
```
198
238
199
-
This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job.
239
+
This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is
240
+
a unique identifier for the job which can be used to monitor and manage your
241
+
job.
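
Because `qsub` prints the job identifier on standard output, it can be captured in a shell variable and reused. The sketch below assumes the Torque/PBS-style `qsub` and `qstat` commands used in this chapter. The function name `submit_and_track` is hypothetical and not part of the cluster tooling.

```shell
# Hypothetical helper (not part of the cluster tooling): submit a job script
# and keep the job ID that qsub prints on standard output.
submit_and_track() {
    jobid=$(qsub "$1") || return 1   # stop here if the submission fails
    echo "Submitted job: ${jobid}"
    qstat "${jobid}"                 # show the status of this job only
}
```

For example, `submit_and_track run.sh` would submit the job script from this chapter and then print its queue status.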
200
242
201
243
!!! Warning "Make sure you understand what the `module` command does"
202
244
203
245
```text
204
-
Note that the module commands only modify environment variables. For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster,
246
+
Note that the module commands only modify environment variables. For instance,
247
+
running `module swap cluster/{{othercluster}}` will update your shell
248
+
environment so that `qsub` submits a job to the `{{othercluster}}` cluster,
205
249
but our active shell session is still running on the login node.
206
250
```
207
251
208
252
```text
209
-
It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands you are running are being executed: they will still be run on the login node you are on.
253
+
It is important to understand that while `module` commands affect your session
254
+
environment, they do ***not*** change where the commands you are running are
255
+
being executed: they will still be run on the login node you are on.
210
256
```
211
257
212
258
```text
213
-
When you submit a job script however, the commands ***in*** the job script will be run on a workernode of the cluster the job was submitted to (like `{{othercluster}}`).
259
+
When you submit a job script however, the commands ***in*** the job script will
260
+
be run on a workernode of the cluster the job was submitted to (like
261
+
`{{othercluster}}`).
214
262
```
215
263
216
-
For detailed information about `module` commands, read the [running batch jobs](running_batch_jobs.md) chapter.
264
+
For detailed information about `module` commands, read the [running batch
265
+
jobs](running_batch_jobs.md) chapter.
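
The point made in the warning above can be checked directly by printing the host name before and after swapping the cluster module. This is a minimal sketch: `check_module_swap` is a hypothetical name, and the argument is whichever cluster you want to target.

```shell
# Hypothetical helper: show that `module swap` changes the environment,
# not the machine on which your commands run.
check_module_swap() {
    before=$(hostname)
    module swap "cluster/$1"     # only updates environment variables
    after=$(hostname)
    if [ "$before" = "$after" ]; then
        echo "still on ${after}"
    else
        echo "host changed from ${before} to ${after}"
    fi
}
```

Run on a login node, this should report that you are still on the same host, since only jobs submitted with `qsub` run on the workernodes of the selected cluster.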
217
266
218
267
### Wait for job to be executed
219
268
220
-
Your job is put into a queue before being executed, so it may take a while before it actually starts.
221
-
(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) for scheduling policy).
269
+
Your job is put into a queue before being executed, so it may take a while
270
+
before it actually starts.
271
+
(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start)
272
+
for scheduling policy).
222
273
223
274
You can get an overview of the active jobs using the `qstat` command:
224
275
@@ -229,7 +280,8 @@ Job ID Name User Time Use S Queue
@@ -273,10 +328,13 @@ In our example when running `ls` in the current directory you should see 2 new f
273
328
!!! Warning "Use your own job ID"
274
329
275
330
```text
276
-
Replace **{{jobid}}** with the jobid you got from the `qstat` command (see above) or simply look for added files in your current directory by running `ls`.
331
+
Replace **{{jobid}}** with the jobid you got from the `qstat` command (see
332
+
above) or simply look for added files in your current directory by running
333
+
`ls`.
277
334
```
278
335
279
-
When examining the contents of ``run.sh.o{{jobid}}`` you will see something like this:
336
+
When examining the contents of ``run.sh.o{{jobid}}`` you will see something
337
+
like this:
280
338
281
339
```text
282
340
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
@@ -294,16 +352,18 @@ Epoch 5/5
294
352
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764
295
353
```
296
354
297
-
Hurray 🎉, we trained a deep learning model and achieved 97.64 percent accuracy.
355
+
Hurrah 🎉, we trained a deep learning model and achieved 97.64 percent accuracy.
298
356
299
357
!!! Warning
300
358
301
359
```text
302
-
When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance, see [GPU clusters](gpu.md).
360
+
When using TensorFlow specifically, you should actually submit jobs to a GPU
361
+
cluster for better performance, see [GPU clusters](gpu.md).
303
362
```
304
363
305
364
```text
306
-
For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster.
365
+
For the purpose of this example, we are running a very small TensorFlow
366
+
workload on a CPU-only cluster.
307
367
```
308
368
309
369
### Next steps
@@ -313,4 +373,5 @@ For the purpose of this example, we are running a very small TensorFlow workload