cluster.md: 35 additions & 6 deletions
@@ -6,7 +6,7 @@ nav_order: 4
6
6
7
7
# Computing Cluster
8
8
9
-
(Last updated: Jul 23, 2024)
9
+
(Last updated: Mar 18, 2025)
10
10
11
11
This document was created by Yen-Chia Hsu while supervising students who need access to computing resources for their projects.
12
12
This document explains how to work with computer clusters in Dutch academia.
@@ -58,7 +58,10 @@ Below is an example of estimating SBUs. You can also find [another example on th
58
58
59
59
### <a name="snellius-run-script"></a>How to run scripts on the Snellius cluster?
60
60
61
-
The Snellius cluster uses the [Slurm manager](https://slurm.schedmd.com/overview.html). When you log into the cluster, you are typically on the head node. You will need to request a compute node to run scripts. DO NOT run scripts on the head node, as you will slow down the head node and cause trouble to other users when they log in.
61
+
The Snellius cluster uses the [Slurm manager](https://slurm.schedmd.com/overview.html). When you log into the cluster, you are typically on the head node. You will need to request a compute node to run scripts.
62
+
63
+
{: .important }
64
+
> DO NOT run scripts on the head node, as you will slow down the head node and cause trouble to other users when they log in.
62
65
63
66
The easiest way to run scripts on a cluster is to use interactive sessions, which work like a normal terminal where you can type commands. The following two links contain more information about running Slurm interactive sessions:
64
67
- [Interactive jobs (by SURF)](https://servicedesk.surf.nl/wiki/display/WIKI/Interactive+jobs)
@@ -104,7 +107,18 @@ Usually, a good practice is to use the `srun` command with interactive sessions
104
107
105
108
You can use the [Git](https://git-scm.com/) tool to manage code. There are many online Git services for hosting code, such as [GitHub](https://github.com/), [GitLab](https://about.gitlab.com/), and [Bitbucket](https://bitbucket.org/). In this way, you can keep the code in sync between your local machine (e.g., your personal Mac) and the remote server (e.g., your Snellius space). If you are new to Git, check the [Git guide](https://github.com/git-guides). Also, here is a [15-minute Git tutorial video](https://www.youtube.com/watch?v=USjZcfj8yxE).
106
109
107
-
For data management, according to this [SURF documentation on Snellius file systems](https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+filesystems), your home directory has default capacity quotas for storing data (check the mentioned SURF documentation for the size of the default storage). You can also request an extra project space quota (when applying for the Snellius account), which is separate from the default quota. You may need to move your data from your local machine to the remote server. In this case, you can use the [rsync tool](https://linux.die.net/man/1/rsync). Here is [a tutorial](https://www.digitalocean.com/community/tutorials/how-to-use-rsync-to-sync-local-and-remote-directories) about how to use it to sync the local and remote data. For example, if you are on the terminal of your local machine, you can use the following command to sync a file with path `/data/project-potions/healing_potion.py` to the `/var/www/project-potions/` folder on your remote server `snellius.surf.nl` with user name `hpotter`:
110
+
For data management, check the [SURF documentation on Snellius file systems](https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+filesystems) for the different types of storage. There are three types: the home, scratch, and project file systems. Your home directory has a default capacity quota for storing data (this is the home file system). The scratch file system is for storing temporary files that need fast access. You can also request an extra project space quota (when applying for the Snellius account), which is separate from the default quota and is typically larger (so that you can store large datasets). To see the quotas available to you, check [this documentation](https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/37388489/Checking+disk+usage), which describes the following commands:
111
+
- `myquota` (quota for all file systems)
112
+
- `home-quota` (quota for your home directory, such as `/home/USER_NAME`)
113
+
- `scratch-quota` (quota for fast temporary storage, such as `/scratch-local/USER_NAME`)
114
+
- `prjspc-quota` (quota for your project space, such as `/projects/0/PROJECT_NAME`)
115
+
116
+
The project space directory should be in the email that SURF sends you after you apply for the Snellius account. You can find your home directory by typing the following command in your terminal right after logging in (before changing to another directory):
117
+
```sh
pwd
```
120
+
121
+
You may need to move your data from your local machine to the remote server (e.g., your project space on Snellius). In this case, you can use the [rsync tool](https://linux.die.net/man/1/rsync). Here is [a tutorial](https://www.digitalocean.com/community/tutorials/how-to-use-rsync-to-sync-local-and-remote-directories) about how to use it to sync the local and remote data. For example, if you are on the terminal of your local machine, you can use the following command to sync a file with path `/data/project-potions/healing_potion.py` to the `/var/www/project-potions/` folder on your remote server `snellius.surf.nl` with user name `hpotter`:
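
As a rough sketch of what such a command could look like, the snippet below assembles it from the example values above and prints it, so that you can check it before running it. The `-a` (archive) and `-v` (verbose) flags and the variable names are assumptions for illustration, not requirements.

```sh
# Example values from the text above; replace them with your own.
SRC="/data/project-potions/healing_potion.py"
REMOTE="hpotter@snellius.surf.nl"
DEST="/var/www/project-potions/"

# -a keeps permissions and timestamps, -v lists the transferred files.
# The command is only printed here, not executed.
RSYNC_CMD="rsync -av ${SRC} ${REMOTE}:${DEST}"
echo "$RSYNC_CMD"
```

Copy the printed line into the terminal of your local machine to start the transfer. Adding `-n` (dry run) first shows what would be copied without copying anything.
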
In case you need help from someone else to move the data to your project space, the workflow is to put the other person's public SSH key in the `~/.ssh/authorized_keys` file. The `~` symbol means your home directory. If the `~/.ssh/` folder is not there, use the following commands to create the folder and the file:
142
+
```sh
cd ~                   # go to your home directory
mkdir -p .ssh          # create the folder (no error if it already exists)
cd .ssh
touch authorized_keys  # create the file if it does not exist yet
```
148
+
149
+
After that, copy the person's public SSH key and put it into the `authorized_keys` file. This person can also be yourself on another machine (e.g., your laptop). You can use text editors in the terminal, such as `vim` or `nano`, to do this. If you need to check whether the `~/.ssh/` folder exists, use the following command in your home directory to list all files and directories (including the hidden ones):
150
+
```sh
ls -lah
```
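
The steps above (create the folder and file, then add the key) can also be combined into a small helper. This is only a sketch: the function name `authorize_key` and the key file name in the usage note are hypothetical. The `chmod` calls are an addition, based on the usual OpenSSH expectation that `~/.ssh` and `authorized_keys` are not writable by other users.

```sh
# Hypothetical helper: add the public key stored in a file to authorized_keys.
# Usage: authorize_key PUBLIC_KEY_FILE [SSH_DIR]
authorize_key() {
    key="$(cat "$1")"             # the public key (a single line)
    ssh_dir="${2:-$HOME/.ssh}"    # defaults to ~/.ssh

    # Create the folder and file if needed, with restrictive permissions
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys" && chmod 600 "$ssh_dir/authorized_keys"

    # Append the key only if this exact line is not already in the file
    if ! grep -qxF -e "$key" "$ssh_dir/authorized_keys"; then
        printf '%s\n' "$key" >> "$ssh_dir/authorized_keys"
    fi
}
```

For example, after saving the other person's key as `friend_key.pub` on Snellius, you could run `authorize_key friend_key.pub`. Running it twice with the same key does not add a duplicate line.
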
153
+
154
+
If you need to create an SSH private and public key pair, check [this SURF documentation](https://servicedesk.surf.nl/wiki/spaces/WIKI/pages/62226987/Creating+and+using+an+SSH+key+pair+on+the+Terminal).
155
+
127
156
### <a name="troubleshooting"></a>Troubleshooting
128
157
129
158
Solutions for some problems can be found on the [Snellius documentation website](https://servicedesk.surf.nl/wiki/display/WIKI/Snellius) or the [Slurm documentation website](https://slurm.schedmd.com/). Below are some common problems and their potential solutions.
130
159
131
-
I got errors when running interactive sessions and sbatch jobs. What to do?
160
+
#### I got errors when running interactive sessions and sbatch jobs. What to do?
132
161
- It is possible that you ran out of your computing budget. Use `accinfo --product gpu` to check if you still have available budget. If not, contact the Snellius help desk.
133
162
134
-
It takes a very long time to request a computing code. What to do?
163
+
#### It takes a very long time to request a computing node. What to do?
135
164
- You can use the `sinfo` command to check the state of the nodes. Here is the [documentation of the meaning of node states](https://slurm.schedmd.com/sinfo.html#SECTION_NODE-STATE-CODES) (e.g., drained, down). Then, you can use the `--exclude` option in the `srun` command (for example, `--exclude=gcn42,gcn56`) to exclude the nodes that are drained.
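
The tip above can be sketched as a small helper that collects the unavailable nodes in the comma-separated form that `--exclude` expects. The function name `unavailable_nodes` is hypothetical, and the `srun` options in the usage comment are placeholders. The `sinfo` options used are `-h` (no header), `-N` (one node per line), `-t` (filter by state), and `-o "%N"` (print only the node name).

```sh
# Hypothetical helper: print a comma-separated list of nodes that are
# drained, draining, or down.
unavailable_nodes() {
    sinfo -h -N -t drain,down -o "%N" | sort -u | paste -sd, -
}

# Example usage (the srun options are placeholders):
# srun --exclude="$(unavailable_nodes)" --time=00:30:00 --pty bash
```

If no node is drained or down, the helper prints an empty line. In that case, leave out the `--exclude` option.
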
136
165
137
-
About half of my SBATCH jobs failed. It looks like the code just hangs there and does nothing after loading the modules. But if I request interactive sessions, the code always works. What should I do?
166
+
#### About half of my SBATCH jobs failed. It looks like the code just hangs there and does nothing after loading the modules. But if I request interactive sessions, the code always works. What should I do?
138
167
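- Consider using `srun python -u` (the `-u` option disables Python's output buffering) to check whether the job is really hanging or whether its output is just stuck in a buffer. However, do not keep `-u` for regular runs, because unbuffered output puts a lot of pressure on the network and file system.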
139
168
- Consider also adding the line `module purge` before loading any other modules to ensure that no previous modules affect the current job.
140
169
- A recommended solution is to add a line `sys.stdout.flush()` after important print statements.
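
As a sketch of how the first two suggestions could fit into a batch script, the snippet below writes a minimal `job.sh` and checks it for shell syntax errors. All `#SBATCH` values, the module names, and the script name `train.py` are placeholders, so replace them with what your project needs.

```sh
# Write a minimal batch script to job.sh (all values are placeholders).
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=debug-hang
#SBATCH --time=00:10:00
#SBATCH --output=job-%j.out

# Start from a clean environment so that earlier modules cannot interfere
module purge
# Hypothetical module names; load whatever your project needs
module load 2023
module load Python/3.11.3-GCCcore-12.3.0

# -u disables output buffering; use it only while debugging
srun python -u train.py
EOF

# Check the script for shell syntax errors before submitting it
bash -n job.sh && echo "job.sh has no syntax errors"
```

You can then submit the script with `sbatch job.sh`. Once the job runs as expected, remove `-u` and rely on `sys.stdout.flush()` after important print statements instead.
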