A cluster is like a team of computers or laptops that work together to accomplish tasks more efficiently than they could individually. Imagine you have a big project that would take a long time if you did it alone. If you gather a group of friends and each person works on a part of the project, you can get it done much faster. This is similar to how a cluster works: each computer, or node, in the cluster takes on part of the workload, and together they complete the task more quickly.
In your case, you're building a virtual cluster with three laptops:
Login Node (Laptop 1): This is like the team leader or the coordinator. It's where you log in to give commands and manage the cluster. It doesn't do much heavy lifting but is crucial for organizing the tasks.
Compute Nodes (Laptop 2 and Laptop 3): These are like the workers. They handle the actual processing and computation. When you give a command to the login node, it distributes the work to these compute nodes. We will call these compute nodes compute00 and compute01.
So, when all three laptops (one login node and two compute nodes) are connected and working together, they form a cluster that can handle larger and more complex tasks than any single laptop could on its own.
We will be using these notes to test out the virtual cluster:
https://carpentries-incubator.github.io/hpc-intro/10-hpc-intro/index.html
If you would like an in-depth, hands-on experience of deploying an OpenHPC 2.x cluster, there is a guide for building the OpenHPC 2.x virtual lab from scratch:
https://hpc-ecosystems.gitlab.io/training/openhpc-2.x-guide/2_virtual_lab_setup/
- Install VirtualBox
- Install Vagrant
- Install Git Bash (if using Windows)
Follow the guide here if you want more details on how to install VirtualBox and Vagrant:
https://hpc-ecosystems.gitlab.io/training/openhpc-2.x-guide/2_virtual_lab_setup/
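Before continuing, you can quickly confirm the tools are available from Git Bash (or your terminal of choice); the exact versions reported will differ from machine to machine:

```bash
# Sanity check: each command should print a version string.
vagrant --version
VBoxManage --version   # on Windows this may require adding the VirtualBox install folder to PATH
git --version
```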
What you will be building
- Open Git Bash and go to your desired projects/documents folder.
- Create a directory called `vcluster` to store your virtual cluster files.
- Navigate into the newly created directory `vcluster`.
- Clone the repository into the `vcluster` directory (the full command sequence is consolidated below):
git clone https://gitlab.com/hpc-ecosystems/training/openhpc-2.x-virtual-lab.git
cd openhpc-2.x-virtual-lab
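For reference, the directory setup and clone steps above condense to the following commands:

```bash
# Run from your projects/documents folder
mkdir vcluster
cd vcluster
git clone https://gitlab.com/hpc-ecosystems/training/openhpc-2.x-virtual-lab.git
cd openhpc-2.x-virtual-lab
```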
- If you do `ls` you should see the following files:
.gitignore
client00.box
compute-node.box
input.local.lab
openhpc-demo-client00.ova
openhpc-demo-client01.ova
README.md
slurm.conf.lab
Vagrantfile
- Delete the existing `Vagrantfile` with `rm Vagrantfile`, then download the new `Vagrantfile` and the `package.box` from this repo and copy them to your local .../openhpc-2.x-virtual-lab/ folder.
- Now go to the pre-packaged Vagrant box link, open the folder "OpenHPC2-vCluster files", download "openhpc2-smshost-20240724.box", and copy it to your local .../openhpc-2.x-virtual-lab/ folder.
- NOTE: If a password is required, please use `ohpc2template`.
- HINT: You can download the pre-packaged `.box` file to another location if you intend to build multiple machines from the packaged box.
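For example, if you keep the downloaded `.box` file outside the project folder, you can still register it with Vagrant by pointing at its location (the path below is purely illustrative):

```bash
# Hypothetical path; replace with wherever you stored the pre-packaged box
vagrant box add openhpc/ohpc2 file:///path/to/openhpc2-smshost-20240724.box
```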
You should then end up with:
.gitignore
client00.box
compute-node.box
input.local.lab
openhpc-demo-client00.ova
openhpc-demo-client01.ova
openhpc2-smshost-20240724.box
package.box
README.md
slurm.conf.lab
Vagrantfile
- Add the pre-built Vagrant box to the Vagrant environment. From `...openhpc-2.x-virtual-lab/`:
vagrant box add openhpc/ohpc2 file://openhpc2-smshost-20240724.box
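You can verify that the box was registered before booting anything:

```bash
vagrant box list   # should now include an entry for openhpc/ohpc2 (virtualbox)
```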
- Once complete (`==> box: Successfully added box 'openhpc/ohpc2' (v0) for 'virtualbox'!`), start the login node (referred to as the smshost):
vagrant up smshost
You should see something like this:
Bringing machine 'smshost' up with 'virtualbox' provider...
==> smshost: Importing base box 'openhpc/ohpc2'...
==> smshost: Matching MAC address for NAT networking...
==> smshost: Clearing any previously set network interfaces...
==> smshost: Preparing network interfaces based on configuration...
smshost: Adapter 1: nat
smshost: Adapter 2: intnet
smshost: Adapter 3: hostonly
==> smshost: Forwarding ports...
smshost: 22 (guest) => 2299 (host) (adapter 1)
==> smshost: Running 'pre-boot' VM customizations...
==> smshost: Booting VM...
==> smshost: Waiting for machine to boot. This may take a few minutes...
smshost: SSH address: 127.0.0.1:2299
smshost: SSH username: vagrant
smshost: SSH auth method: private key
==> smshost: Machine booted and ready!
...
- HINT: This should show a virtual machine in the VirtualBox GUI with the name `smshost_vcluster`.
- Now start the first compute node, compute00:
vagrant up compute00
You should see something like this:
Bringing machine 'compute00' up with 'virtualbox' provider...
==> compute00: Importing base box 'file://./package.box'...
==> compute00: Matching MAC address for NAT networking...
==> compute00: Setting the name of the VM: compute00_vcluster_20240724
==> compute00: Preparing network interfaces based on configuration...
compute00: Adapter 1: intnet
==> compute00: Forwarding ports...
compute00: 22 (guest) => 2222 (host) (adapter 1)
compute00: VirtualBox adapter #1 not configured as "NAT". Skipping port
compute00: forwards on this adapter.
==> compute00: Running 'pre-boot' VM customizations...
==> compute00: Booting VM...
...
...
...
If the box appears to be booting properly, you may want to increase
the timeout ("config.vm.boot_timeout") value.
- Now start the second compute node, compute01:
vagrant up compute01
You should see something like this:
Bringing machine 'compute01' up with 'virtualbox' provider...
==> compute01: Importing base box 'file://./package.box'...
==> compute01: Matching MAC address for NAT networking...
==> compute01: Setting the name of the VM: compute01_vcluster_20240724
==> compute01: Preparing network interfaces based on configuration...
compute01: Adapter 1: intnet
==> compute01: Forwarding ports...
compute01: 22 (guest) => 2222 (host) (adapter 1)
compute01: VirtualBox adapter #1 not configured as "NAT". Skipping port
compute01: forwards on this adapter.
==> compute01: Running 'pre-boot' VM customizations...
==> compute01: Booting VM...
...
...
...
If the box appears to be booting properly, you may want to increase
the timeout ("config.vm.boot_timeout") value.
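You can also confirm the state of all three machines from the host shell in .../openhpc-2.x-virtual-lab/:

```bash
vagrant status   # smshost, compute00 and compute01 should all report "running (virtualbox)"
```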
In VirtualBox you should now see the smshost and the two compute nodes listed as running.
- Test the nodes. From your host shell in .../openhpc-2.x-virtual-lab/:
vagrant ssh smshost
From [vagrant@smshost ~]$:
sudo su
From [root@smshost vagrant]#:
sinfo
Output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 1-00:00:00 2 down* compute[00-01]
At this stage, on first boot, both compute nodes will likely be marked as down. This is to be expected in the virtual environment, since the VMs have effectively been dormant since their original deployment days, weeks, or months ago. We need to restart the HPC services to resume normal HPC service.
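If you are curious why Slurm marked the nodes down before applying the fix, the following commands (run as root on the smshost) show the recorded state and reason; the exact text will vary:

```bash
sinfo -R                      # list down/drained nodes together with the reason Slurm recorded
scontrol show node compute00  # detailed state information for a single node
```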
The following steps remove an errant entry that reappears in each compute node's /etc/hosts file (it is not needed for this lab) and then restart the Slurm workload manager.
HINT: There is a script cluster_up.sh supplied in this repo that can be used for this process (a rough sketch of such a script appears after the manual steps below), or do the following:
From [root@smshost vagrant]#, ssh to the first node as root:
ssh compute00
From [root@compute00 ~]#:
sed -i 's/127.0.1.1 smshost smshost/#127.0.1.1 smshost smshost/' /etc/hosts
systemctl restart slurmd
exit
From [root@smshost vagrant]#, ssh to the second node as root:
ssh compute01
From [root@compute01 ~]#:
sed -i 's/127.0.1.1 smshost smshost/#127.0.1.1 smshost smshost/' /etc/hosts
systemctl restart slurmd
exit
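As noted in the hint above, these per-node fixes can be scripted. The repo's cluster_up.sh may differ, but a minimal sketch (assuming root SSH access from the smshost to the compute nodes, as used above) could look like this:

```bash
#!/bin/bash
# Minimal sketch: comment out the stray 127.0.1.1 entry and restart slurmd on each compute node.
for node in compute00 compute01; do
    ssh "$node" "sed -i 's/127.0.1.1 smshost smshost/#127.0.1.1 smshost smshost/' /etc/hosts \
        && systemctl restart slurmd"
done
```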
- You should now be back at the login node [root@smshost vagrant]#. Run:
sudo sinfo
and make sure you get the following:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 1-00:00:00 2 idle compute[00-01]
The STATE should show idle. If the STATE shows anything else (usually the alternative state is down), then the compute nodes must be brought back into service:
scontrol update nodename=compute0[0-1] state=resume
- Lastly, fix a small bug in /etc/hosts on the login node. From [root@smshost vagrant]#:
sudo sed -i '3d' /etc/hosts
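Here sed '3d' simply deletes the third line of /etc/hosts. If you want to check what that line contains before or after removing it, list the file with line numbers (the exact contents depend on the image):

```bash
cat -n /etc/hosts
```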
Now we can start using our cluster and submit jobs!
- First, let us log in as the user account called test. From [root@smshost vagrant]#:
sudo su - test
You should see:
[test@smshost ~]$
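As a quick smoke test before moving on to the Carpentries notes, you can ask Slurm to run a trivial command across both compute nodes from the test account (the exact output formatting may differ):

```bash
srun -N 2 -n 2 hostname   # should print compute00 and compute01, one per task
```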
From here on we will use the HPC Software Carpentry notes: https://carpentries-incubator.github.io/hpc-intro/10-hpc-intro/index.html
Note on the section: https://carpentries-incubator.github.io/hpc-intro/17-parallel/index.html
To install Amdahl, follow this process instead:
Exit to smshost root ([root@smshost vagrant]#) and install the following:
yum install python3-devel
and:
pip3 install --upgrade pip
Go back to the test user:
sudo su - test
and first create a virtual environment:
python3 -m venv amdahl-env
source amdahl-env/bin/activate
And do the following:
pip install amdahl
pip install numpy
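You can confirm the packages were installed into the virtual environment (version numbers will vary):

```bash
pip show amdahl numpy
```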
Now you can start with the section on "Running the Job on a Compute Node".
Script updates:
- Replace `#SBATCH -p cpubase_bycore_b1` with `#SBATCH -p normal`.
- Note: for the scripts you don't need to load any modules such as `module load Python` or `module load SciPy-bundle`.
- The node (-N) and task (-n) configurations can only be the following (a sketch of an adapted job script follows the table):
| Configuration | Description |
|---|---|
| -N 1 -n 1 | 1 Node, 1 Task |
| -N 1 -n 2 | 1 Node, 2 Tasks |
| -N 2 -n 2 | 2 Nodes, 2 Tasks |
| -N 2 -n 4 | 2 Nodes, 4 Tasks |
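Putting the script updates together, a job script for this cluster might look like the sketch below. The partition name, the allowed -N/-n combinations, and the amdahl-env virtual environment come from the steps above; the exact way amdahl is launched should follow the Carpentries lesson, and srun is used here only as an assumption.

```bash
#!/bin/bash
#SBATCH -J parallel-job
#SBATCH -p normal        # replaces cpubase_bycore_b1 from the Carpentries notes
#SBATCH -N 2             # one of the allowed combinations from the table above
#SBATCH -n 4

# No "module load" lines are needed on this cluster; activate the
# virtual environment created earlier instead (adjust the path if you
# created amdahl-env somewhere other than the test user's home directory).
source ~/amdahl-env/bin/activate

# Assumption: amdahl is launched via srun, as in the Carpentries lesson examples.
srun amdahl
```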