`docs/Documentation/Systems/Gila/index.md`
# About Gila
Gila is an OpenHPC-based cluster. The [nodes](./running.md#gila-compute-nodes) run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time.
---

`docs/Documentation/Systems/Gila/modules.md`
# Modules on Gila
On Gila, modules are deployed and organized slightly differently than on other NLR HPC systems.
While the basic concepts of using modules remain the same, there are important differences in how modules are structured, discovered, and loaded. These differences are intentional and designed to improve compatibility, reproducibility, and long-term maintainability. The upcoming sections of this document will walk through these differences step by step.
The module system used on this cluster is [Lmod](../../Environment/lmod.md).
When you log in to Gila, three modules are loaded automatically by default: `Core/25.05`, `DefApps`, and `gcc/14.2.0`.

!!! note
    The `DefApps` module is a convenience module that ensures both `Core` and `GCC` are loaded upon login or when you use `module restore`. It does not load additional software itself but guarantees that the essential environment is active.
## x86 vs ARM

Gila has two separate module stacks, one for each hardware architecture. The appropriate stack is automatically loaded based on which login node you use.

The two hardware stacks are almost identical in terms of available modules. However, some modules might be missing or have different versions depending on the architecture. For requests regarding module availability or version changes, please email [HPC-Help](mailto:[email protected]).

To ensure proper module compatibility, connect to the login node corresponding to your target compute architecture:

- **x86 architecture**: Use `gila-login-1`
- **ARM architecture**: Use `gila-hopper-login1` (Grace Hopper nodes)
!!! warning
    Do not submit jobs to Grace Hopper (ARM) compute nodes from the x86 login node, or vice versa. Doing so is not allowed and will cause module problems.
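Since the stack you get depends on the login node, a quick self-check can help before loading modules. This is a sketch: the hostname prefixes are assumptions inferred from the login node names listed above.

```bash
# Sketch: infer which module stack this session should see from the hostname.
# The hostname prefixes are assumptions based on the login node names above.
gila_stack() {
  case "$(hostname -s 2>/dev/null || hostname)" in
    gila-hopper-*) echo "ARM (Grace Hopper) stack" ;;
    gila-*)        echo "x86 stack" ;;
    *)             echo "unknown host: check which Gila login node you are on" ;;
  esac
}
gila_stack
```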
## Module Structure on Gila
This separation between Base and Core modules ensures:
* Reduced risk of mixing incompatible software
* A cleaner and more predictable module environment
## Module Commands: restore, avail, and spider

### module restore

The `module restore` command reloads the set of modules that were active at the start of your login session or at the last checkpoint. This is useful if you have unloaded or swapped modules and want to return to your original environment.

Example:

```bash
module restore
```

This will restore the default modules that were loaded at login, such as `Core/25.05`, `DefApps`, and `gcc/14.2.0`.

### module avail

The `module avail` command lists all modules that are **currently visible** in your environment. This includes modules that are compatible with the loaded compiler, MPI, or CUDA base modules.

Example:

```bash
module avail
```

You can also search for a specific software package:

```bash
module avail python
```

### module spider

The `module spider` command provides a **complete listing of all versions and configurations** of a software package, including those that are **not currently visible** with `module avail`. It also shows **which modules need to be loaded** to make a specific software configuration available.

Example:

```bash
module spider python/3.10
```

This output will indicate any prerequisite modules you need to load before the software becomes available.

!!! tip
    Use `module avail` for quick checks and `module spider` when you need full details or to resolve dependencies for specific versions.
## MPI-Enabled Software
MPI-enabled software modules are identified by a `-mpi` suffix at the end of the module name.

Similar to compiler modules, MPI-enabled software is **not visible by default**. These modules only appear after an MPI implementation is loaded. Supported MPI implementations include `openmpi`, `mpich`, and `intelmpi`.

Loading an MPI implementation makes MPI-enabled software built with that specific MPI stack available when running `module avail`.

This behavior ensures that only software built against the selected MPI implementation is exposed, helping users avoid mixing incompatible MPI libraries.

### Example: Finding and Loading MPI-Enabled HDF5

Use `module spider` to find all available variants of **HDF5**.
```bash
[USER@gila-login-1 ~]$ ml spider hdf5
  hdf5/1.14.5-mpi
```
Each version of **HDF5** requires dependency modules to be loaded before it becomes available. Please refer to the [module spider section](modules.md#module-spider) for more details.

To find the dependencies needed for `hdf5/1.14.5-mpi`:
```bash
[USER@gila-login-1 ~]$ ml spider hdf5/1.14.5-mpi
  oneapi/2025.1.3  openmpi/5.0.5
```
Before loading the dependencies:
```bash
[USER@gila-login-1 ~]$ ml avail hdf5
--------------- [ gcc/14.2.0 ] -------------
hdf5/1.14.5
```
This version of **HDF5** is not MPI-enabled.

After loading the dependencies, both versions are now visible:
```bash
[USER@gila-login-1 ~]$ ml gcc/14.2.0 openmpi/5.0.5
```
!!! tip
    To determine whether a software package is available on the cluster, use `module spider`. This command lists **all available versions and configurations** of a given software package, including those that are not currently visible with `module avail`.

    To find out which modules must be loaded in order to access a specific software configuration, run `module spider` using the **full module name**. This will show the required modules that need to be loaded to make that software available.
## Building on Gila
Building on Gila should be done on compute nodes and **NOT** login nodes.
Some important build tools are not available by default and require loading them from the module stack.
These build tools are:
- automake
- m4
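Because these tools are not on `PATH` until their modules are loaded, a quick check before configuring a build can save a failed `./configure` run. A sketch (the list above appears truncated on this page, so only the tools it shows are checked):

```bash
# Report which of the listed build tools are already on PATH.
# Tool names are taken from the (possibly truncated) list above.
check_build_tools() {
  for tool in automake m4; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing (try 'module load $tool')"
    fi
  done
}
check_build_tools
```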
Please see [here](./running.md#example-compiling-a-program-on-gila) for a full example of compiling a program on Gila.
## Frequently Asked Questions
??? note "Can I use Miniforge alongside the module system?"
    While it is technically possible, Miniforge is intended to provide an isolated environment separate from external modules. Be careful with the order in which modules are loaded, as this can impact your `PATH` and `LD_LIBRARY_PATH`.
??? note "What if I want a different CUDA version?"
    Other CUDA versions are available under **Core** modules. If you need additional versions, please reach out to [HPC-Help](mailto:[email protected]). Note that CUDA modules under **Core** do **not** automatically make CUDA-enabled software available; only CUDA modules under **Base** modules will load CUDA-enabled packages.
---

`docs/Documentation/Systems/Gila/running.md`
*Learn about compute nodes and job partitions on Gila.*
## Gila Compute Nodes
Gila compute nodes are not configured as exclusive and can be shared by multiple users or jobs. Be sure to request the resources that your job needs, including memory and cores. If you need exclusive use of a node, add the `--exclusive` flag to your job submission.
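If your job must not share its node, the flag can go directly in a submission script. A minimal sketch (the account name and resource values are placeholders; adjust them for your allocation):

```bash
# Sketch: write a batch script that requests a node exclusively.
# Account and time values below are placeholders, not Gila policy.
cat > exclusive_job.sh <<'EOF'
#!/bin/bash
#SBATCH --exclusive          # do not share the node with other jobs
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH --account=aurorahpc  # replace with your project handle
srun hostname
EOF
echo "submit with: sbatch exclusive_job.sh"
```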
### CPU Nodes
The CPU nodes in Gila are single-threaded virtualized nodes. Each compute node has two sockets and two NUMA nodes, with each socket containing 30 __AMD EPYC Milan__ (x86-64) cores. Each node has 220GB of usable RAM.
### GPU Nodes
GPU nodes in Gila have 8 NVIDIA A100 GPUs running on x86-64 __Intel Xeon Icelake CPUs__. There are 42 cores on a GPU node, with one socket and NUMA node. Each GPU node has 910GB of RAM, and each NVIDIA A100 GPU has 80GB of VRAM.
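Since GPU nodes are shared, a job should request only the GPUs it needs. A sketch of a single-GPU request (the generic `--gres=gpu:1` form and the account name are assumptions; confirm the exact GRES name on Gila with `sinfo -o "%N %G"`):

```bash
# Sketch: batch script requesting one of a GPU node's A100s.
# '--gres=gpu:1' is the generic Slurm form, assumed here, not confirmed for Gila.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:1         # one GPU on a shared GPU node
#SBATCH --mem=100G           # well under the node's 910GB
#SBATCH --time=01:00:00
#SBATCH --account=aurorahpc  # replace with your project handle
srun nvidia-smi
EOF
echo "submit with: sbatch gpu_job.sh"
```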
### Grace Hopper Nodes
Gila has 6 NVIDIA Grace Hopper nodes. To use the Grace Hopper nodes, submit your jobs to the `gh` partition from the `gila-hopper-login1.hpc.nrel.gov` login node. Each Grace Hopper node has a 72 core NVIDIA Grace CPU and an NVIDIA GH200 GPU, with 96GB of VRAM and 470GB of RAM. They have one socket and NUMA node.
Please note - the __NVIDIA Grace CPUs__ run on a different processing architecture (ARM64) than both the __Intel Xeon Icelake CPUs__ (x86-64) and the __AMD EPYC Milan__ (x86-64). Any application that is manually compiled by a user and intended to be used on the Grace Hopper nodes __MUST__ be compiled on the Grace Hopper nodes themselves.
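Because a binary built on x86-64 will not run on the Grace CPUs, checking the architecture before compiling is cheap insurance. `uname -m` reports `aarch64` on ARM64 hosts such as the Grace Hopper nodes and `x86_64` on the other Gila nodes:

```bash
# Print the CPU architecture so you know which nodes a build will run on.
arch="$(uname -m)"
case "$arch" in
  aarch64) echo "ARM64: binaries built here target the Grace Hopper nodes" ;;
  x86_64)  echo "x86-64: binaries built here target the CPU and A100 GPU nodes" ;;
  *)       echo "unexpected architecture: $arch" ;;
esac
```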
## Partitions
A list of partitions can be found by running the `sinfo` command. Here are the partitions on Gila:

| Partition Name | CPU | GPU | Qty | RAM | Cores/node |
|----------------|-----|-----|-----|-----|------------|
Gila is optimized for single-node workloads, and multi-node jobs may experience degraded performance. All MPI flavors work on Gila, with Intel MPI showing notably good performance. Gila nodes run one thread per core, so applications compiled to use multiple hardware threads per core cannot take advantage of the extra threads.
## Example: Compiling a Program on Gila
In this section we will describe how to compile an MPI-based application using an Intel toolchain from the module system. Please see the [Modules page](./modules.md) for additional information on the Gila module system.
### Requesting an interactive session
First, we will begin by requesting an interactive session. This will give us a compute node from where we can carry out our work. An example command for requesting such a session is as follows:
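The command block itself was truncated from this page; this is a sketch reconstructed from the description that follows (the flags are standard Slurm `salloc` options, and the partition is left to the scheduler's default, which is an assumption):

```bash
# Hypothetical interactive request matching the surrounding description:
# one node, 60 cores, 60GB of memory, one hour, on the aurorahpc account.
SALLOC_OPTS="--nodes=1 --ntasks=60 --mem=60G --time=01:00:00 --account=aurorahpc"
if command -v salloc >/dev/null 2>&1; then
  salloc $SALLOC_OPTS
else
  echo "salloc not found; on a Gila login node run: salloc $SALLOC_OPTS"
fi
```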
This will request a single node from the AMD partition with 60 cores and 60 GB of memory for one hour. We request this node using the ```aurorahpc``` account that is open to all NLR staff, but if you have an HPC allocation, please replace ```aurorahpc``` with the project handle.
### Loading necessary modules
Once we have an allocated node, we need to load the `oneapi` module for the Intel toolchain, and then the `intel-oneapi-mpi` module for Intel MPI. You can always check which modules are available with `module avail`, and which modules you have loaded with `module list`. The commands for loading the modules we need are as follows:
```bash
module load oneapi
module load intel-oneapi-mpi
```
### Copying program files
We now have the tools we need from the Intel toolchain to compile a program. First, create a directory called `program-compilation` under `/projects` or `/scratch`.
```bash
mkdir program-compilation
cd program-compilation
```
Now we are going to copy the `phostone.c` file from `/nopt/nrel/apps/210929a` to our `program-compilation` directory.
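The copy command itself was truncated from this page; a sketch using `rsync` flags that match the description that follows (progress display and attribute preservation), guarded so it only attempts the copy where the source file exists:

```bash
# Reconstructed copy command (flags assumed from the description):
# -a preserves file attributes, -v is verbose, -P shows transfer progress.
SRC=/nopt/nrel/apps/210929a/phostone.c
if [ -f "$SRC" ]; then
  rsync -avP "$SRC" .
else
  echo "source not found: run this on Gila ($SRC)"
fi
```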
`rsync` is a copy command that is commonly used for transferring files; the parameters we pass allow us to see the progress of the transfer and preserve important file characteristics.
### Program compilation
Once the file is copied, we can compile the program with the following command:
```bash
mpiicx -qopenmp phostone.c -o phost.intelmpi
```
The `mpiicx` command is the Intel MPI compiler wrapper provided by the `intel-oneapi-mpi` module, and the `-qopenmp` flag enables the OpenMP portions of the program. The `-o` flag names the output executable `phost.intelmpi`.
### Submitting a job
The following batch script requests two cores to run two MPI ranks on a single node. Save this script to a file such as `submit_intel.sh`, and submit it with `sbatch submit_intel.sh`. Again, if you have an HPC allocation, we request that you replace `aurorahpc` with the project handle.
??? example "Batch Submission Script - Intel MPI"

    ```bash
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=2
    #SBATCH --time=00:01:00
    #SBATCH --mem=20GB
    #SBATCH --account=aurorahpc

    module load oneapi
    module load intel-oneapi-mpi

    srun --cpus-per-task 2 -n 2 ./phost.intelmpi -F
    ```
Your output should look similar to the following:

```
MPI VERSION Intel(R) MPI Library 2021.14 for Linux* OS
```

We can now follow these steps using OpenMPI as well. First, we will unload the Intel modules from the Intel toolchain. We will then load GNU modules and OpenMPI using the `module load` command from earlier. The commands are as follows:

```bash
module unload intel-oneapi-mpi
module unload oneapi
module load gcc
module load openmpi
```

We can then compile the phost program again using the following command:

```bash
mpicc -fopenmp phostone.c -o phost.openmpi
```

Once the program has been compiled against OpenMPI, we can submit another batch script to test the program.
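The closing submission script was truncated from this page; this sketch mirrors the Intel MPI script above, with the module loads and executable name swapped for the OpenMPI build (the resource values are copied from that example and are assumptions here):

```bash
# Sketch: OpenMPI submission script mirroring the Intel example above.
# Resource values are copied from that example; adjust as needed.
cat > submit_openmpi.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --time=00:01:00
#SBATCH --mem=20GB
#SBATCH --account=aurorahpc

module load gcc
module load openmpi

srun --cpus-per-task 2 -n 2 ./phost.openmpi -F
EOF
echo "submit with: sbatch submit_openmpi.sh"
```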