From 4f185bbe4e94c21bd3eacf06a910a19a324f4228 Mon Sep 17 00:00:00 2001 From: hyandt Date: Fri, 13 Jun 2025 13:51:11 -0600 Subject: [PATCH 01/23] Start Gila section --- docs/Documentation/Systems/Gila/filesystem.md | 38 ++++ docs/Documentation/Systems/Gila/index.md | 41 +++++ docs/Documentation/Systems/Gila/modules.md | 1 + docs/Documentation/Systems/Gila/running.md | 169 ++++++++++++++++++ mkdocs.yml | 5 + 5 files changed, 254 insertions(+) create mode 100644 docs/Documentation/Systems/Gila/filesystem.md create mode 100644 docs/Documentation/Systems/Gila/index.md create mode 100644 docs/Documentation/Systems/Gila/modules.md create mode 100644 docs/Documentation/Systems/Gila/running.md diff --git a/docs/Documentation/Systems/Gila/filesystem.md b/docs/Documentation/Systems/Gila/filesystem.md new file mode 100644 index 000000000..0e6cecd58 --- /dev/null +++ b/docs/Documentation/Systems/Gila/filesystem.md @@ -0,0 +1,38 @@ + + +# Gila Filesystem Architecture Overview + + + +## Project Storage: /projects + + + +## Home Directories: /home + + + +## Scratch Space: /scratch/username and /scratch/username/jobid + + + + +## Temporary space: $TMPDIR + + + +There is no expectation of data longevity in scratch space, and it is subject to purging once the node is idle. If desired data is stored here during the job, please be sure to copy it to a /projects directory as part of the job script before the job finishes. + +## Mass Storage System + +There is no Mass Storage System for deep archive storage on Gila. + +## Backups and Snapshots + +There are no backups or snapshots of data on Gila. Though the system is protected from hardware failure by multiple layers of redundancy, please keep regular backups of important data on Gila, and consider using a Version Control System (such as Git) for important code. + + diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md new file mode 100644 index 000000000..bf19a9f77 --- /dev/null +++ b/docs/Documentation/Systems/Gila/index.md @@ -0,0 +1,41 @@ + +# About Gila + +Gila is an OpenHPC-based cluster running on . The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NREL workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. + +## Accessing Gila +Access to Gila requires an NREL HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. + +#### For NREL Employees: +To access Gila, log into the NREL network and connect via ssh: + + ssh + ssh + +#### For External Collaborators: + +There are currently no external-facing login nodes for Gila. There are two options to connect: + +1. Connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html) and log in with your username, password, and OTP code. Once connected, ssh to the login nodes as above. +1. Connect to the [HPC VPN](https://www.nrel.gov/hpc/vpn-connection.html) and ssh to the login nodes as above. + + +There are currently two login nodes. They share the same home directory so work done on one will appear on the other. They are: + + vs-login-1 + vs-login-2 + +You may connect directly to a login node, but they may be cycled in and out of the pool. 
If a node is unavailable, try connecting to another login node or the `vs.hpc.nrel.gov` round-robin option. + +## Get Help with Gila + +Please see the [Help and Support Page](../../help.md) for further information on how to seek assistance with Gila or your NREL HPC account. + +## Building code + + +Don't build or run code on a login node. Login nodes have limited CPU and memory available. Use a compute or GPU node instead. Simply start an interactive job on an appropriately provisioned node and partition for your work and do your builds there. Similarly, build your projects under `/projects/your_project_name/` as home directories are **limited to 5GB** per user. + + +--- + diff --git a/docs/Documentation/Systems/Gila/modules.md b/docs/Documentation/Systems/Gila/modules.md new file mode 100644 index 000000000..9fca2ba13 --- /dev/null +++ b/docs/Documentation/Systems/Gila/modules.md @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md new file mode 100644 index 000000000..1f4448f75 --- /dev/null +++ b/docs/Documentation/Systems/Gila/running.md @@ -0,0 +1,169 @@ +# Running on Gila + +*This page discusses the compute nodes, partitions, and gives some examples of building and running applications.* + + +## About Gila + +### Compute hosts + +Gila is a collection of physical nodes with each regular node containing Dual AMD EPYC 7532 Rome CPUs. However, each node is virtualized. That is it is split up into virtual nodes with each virtual node having a portion of the cores and memory of the physical node. Similar virtual nodes are then assigned slurm partitions as shown below. + + + +### Shared file systems + +Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There is also /scratch/$USER and /projects spaces seen across all nodes. + +### Partitions + +Partitions are flexible and fluid on Gila. A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 3/27/2025. + +| Partition Name | Qty | RAM | Cores/node | /var/scratch
1K-blocks | AU Charge Factor | +| :--: | :--: | :--: | :--: | :--: | :--: | +| gpu
*1 x NVIDIA Tesla A100* | 16 | 114 GB | 30 | 6,240,805,336| 12 | +| lg | 39 | 229 GB | 60 | 1,031,070,000| 7 | +| std | 60 | 114 GB | 30 | 515,010,816| 3.5 | +| sm | 28 | 61 GB | 16 | 256,981,000| 0.875 | +| t | 15 | 16 GB | 4 | 61,665,000| 0.4375 | + +### Allocation Unit (AU) Charges + +The equation for calculating the AU cost of a job on Gila is: + +```AU cost = (Walltime in hours * Number of Nodes * Charge Factor)``` + +The Walltime is the actual length of time that the job runs, in hours or fractions thereof. + +The **Charge Factor** for each partition is listed in the table above. + + + +### Operating Software + +The Gila HPC cluster runs fairly current versions of OpenHPC and SLURM on top of OpenStack. + + + diff --git a/mkdocs.yml b/mkdocs.yml index 840ef7403..65d1c426e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -42,6 +42,11 @@ nav: - Documentation/Systems/Swift/running.md - Documentation/Systems/Swift/modules.md - Documentation/Systems/Swift/filesystems.md + - Gila: + - Documentation/Systems/Gila/index.md + - Documentation/Systems/Gila/running.md + - Documentation/Systems/Gila/modules.md + - Documentation/Systems/Gila/filesystem.md - Vermilion: - Documentation/Systems/Vermilion/index.md - Documentation/Systems/Vermilion/running.md From dc5fd20629b0a209161534685e961b729a0944d6 Mon Sep 17 00:00:00 2001 From: hyandt Date: Fri, 13 Jun 2025 14:01:39 -0600 Subject: [PATCH 02/23] Add comments --- docs/Documentation/Systems/Gila/filesystem.md | 14 +++++++------- docs/Documentation/Systems/Gila/index.md | 9 +++++---- docs/Documentation/Systems/Gila/modules.md | 2 +- docs/Documentation/Systems/Gila/running.md | 7 ++++--- 4 files changed, 17 insertions(+), 15 deletions(-) diff --git a/docs/Documentation/Systems/Gila/filesystem.md b/docs/Documentation/Systems/Gila/filesystem.md index 0e6cecd58..ff3bf8df0 100644 --- a/docs/Documentation/Systems/Gila/filesystem.md +++ b/docs/Documentation/Systems/Gila/filesystem.md @@ -1,4 +1,4 @@ - +*Swift layout as an example* # Gila Filesystem Architecture Overview @@ -6,24 +6,24 @@ ## Project Storage: /projects - +*Quota usage can be viewed at any time by issuing a `cd` command into the project directory, and using the `df -h` command to view total, used, and remaining available space for the mounted project directory* ## Home Directories: /home - +*/home directories are mounted as `/home/`. Home directories are hosted under the user's initial /project directory. Quotas in /home are included as a part of the quota of that project's storage allocation* ## Scratch Space: /scratch/username and /scratch/username/jobid - +*The scratch directory on each Swift compute node is a 1.8TB spinning disk, and is accessible only on that node. The default writable path for scratch use is `/scratch/`. There is no global, network-accessible `/scratch` space. `/projects` and `/home` are both network-accessible, and may be used as /scratch-style working space instead.* ## Temporary space: $TMPDIR - +*When a job starts, the environment variable `$TMPDIR` is set to `/scratch//` for the duration of the job. This is temporary space only, and should be purged when your job is complete. Please be sure to use this path instead of /tmp for your tempfiles.* There is no expectation of data longevity in scratch space, and it is subject to purging once the node is idle. If desired data is stored here during the job, please be sure to copy it to a /projects directory as part of the job script before the job finishes. 
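
A minimal job-script sketch of this pattern is shown below. The account handle, paths, and application step are placeholders to adapt for your own work; the only assumption taken from this page is that `$TMPDIR` points at per-job scratch space and that results should land under `/projects` before the job ends.

```bash
#!/bin/bash
#SBATCH --account=<project-handle>
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# Run from the per-job temporary space provided by the scheduler
cd "$TMPDIR"

# ... run your application here, writing intermediate output to $TMPDIR ...

# Copy anything worth keeping back to project storage before the job ends,
# since the temporary space is purged once the job completes
cp -r results/ /projects/<your_project>/
```
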
diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index bf19a9f77..e95fe50ea 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -6,6 +6,7 @@ Gila is an OpenHPC-based cluster running on #### For External Collaborators: - +*Is this still true?* There are currently no external-facing login nodes for Gila. There are two options to connect: 1. Connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html) and log in with your username, password, and OTP code. Once connected, ssh to the login nodes as above. 1. Connect to the [HPC VPN](https://www.nrel.gov/hpc/vpn-connection.html) and ssh to the login nodes as above. - +*Need to update* There are currently two login nodes. They share the same home directory so work done on one will appear on the other. They are: vs-login-1 vs-login-2 -You may connect directly to a login node, but they may be cycled in and out of the pool. If a node is unavailable, try connecting to another login node or the `vs.hpc.nrel.gov` round-robin option. +You may connect directly to a login node, but they may be cycled in and out of the pool. If a node is unavailable, try connecting to another login node or the <>`vs.hpc.nrel.gov`> round-robin option. ## Get Help with Gila @@ -33,7 +34,7 @@ Please see the [Help and Support Page](../../help.md) for further information on ## Building code - +*Need to review* Don't build or run code on a login node. Login nodes have limited CPU and memory available. Use a compute or GPU node instead. Simply start an interactive job on an appropriately provisioned node and partition for your work and do your builds there. Similarly, build your projects under `/projects/your_project_name/` as home directories are **limited to 5GB** per user. diff --git a/docs/Documentation/Systems/Gila/modules.md b/docs/Documentation/Systems/Gila/modules.md index 9fca2ba13..8b7a419b4 100644 --- a/docs/Documentation/Systems/Gila/modules.md +++ b/docs/Documentation/Systems/Gila/modules.md @@ -1 +1 @@ - \ No newline at end of file +*For apps team to write* \ No newline at end of file diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index 1f4448f75..b4007cbf7 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -7,14 +7,15 @@ ### Compute hosts -Gila is a collection of physical nodes with each regular node containing Dual AMD EPYC 7532 Rome CPUs. However, each node is virtualized. That is it is split up into virtual nodes with each virtual node having a portion of the cores and memory of the physical node. Similar virtual nodes are then assigned slurm partitions as shown below. +Gila is a collection of physical nodes with each regular node containing . However, each node is virtualized. That is it is split up into virtual nodes with each virtual node having a portion of the cores and memory of the physical node. Similar virtual nodes are then assigned slurm partitions as shown below. - +*Move this info to filesystems page?* ### Shared file systems Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There is also /scratch/$USER and /projects spaces seen across all nodes. +*Need to update* ### Partitions Partitions are flexible and fluid on Gila. A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 3/27/2025. 
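
For a compact view of the partitions and their per-node resources, standard Slurm format fields can be passed to `sinfo`; this is only an illustrative invocation, not a Gila-specific requirement.

```bash
# Summarize partitions: name, node count, CPUs per node, memory (MB), and time limit
sinfo -o "%P %D %c %m %l"
```
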
@@ -37,7 +38,7 @@ The Walltime is the actual length of time that the job runs, in hours or fractio The **Charge Factor** for each partition is listed in the table above. - +*Add example AU calculation, like Swift page* ### Operating Software From 14bc094997fd7a5ea99949b8cdf1e27ad5c7a24c Mon Sep 17 00:00:00 2001 From: jonathancasco <4037484+jonathancasco@users.noreply.github.com> Date: Thu, 16 Oct 2025 13:03:53 -0600 Subject: [PATCH 03/23] Update filesystem.md --- docs/Documentation/Systems/Gila/filesystem.md | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/docs/Documentation/Systems/Gila/filesystem.md b/docs/Documentation/Systems/Gila/filesystem.md index ff3bf8df0..017987d0c 100644 --- a/docs/Documentation/Systems/Gila/filesystem.md +++ b/docs/Documentation/Systems/Gila/filesystem.md @@ -1,8 +1,8 @@ -*Swift layout as an example* - # Gila Filesystem Architecture Overview +## Home Directories: /home +*/home directories are mounted as `/home/`. Home directories are hosted under the user's initial /project directory. Quotas in /home are included as a part of the quota of that project's storage allocation* ## Project Storage: /projects @@ -10,22 +10,17 @@ *Quota usage can be viewed at any time by issuing a `cd` command into the project directory, and using the `df -h` command to view total, used, and remaining available space for the mounted project directory* -## Home Directories: /home - -*/home directories are mounted as `/home/`. Home directories are hosted under the user's initial /project directory. Quotas in /home are included as a part of the quota of that project's storage allocation* - -## Scratch Space: /scratch/username and /scratch/username/jobid - -*For users who also have Kestrel allocations, please be aware that scratch space on Swift behaves differently, so adjustments to job scripts may be necessary.* +## Scratch Storage: /scratch/username and /scratch/username/jobid -*The scratch directory on each Swift compute node is a 1.8TB spinning disk, and is accessible only on that node. The default writable path for scratch use is `/scratch/`. There is no global, network-accessible `/scratch` space. `/projects` and `/home` are both network-accessible, and may be used as /scratch-style working space instead.* +*For users who also have Kestrel allocations, please be aware that scratch space on Gila behaves differently, so adjustments to job scripts may be necessary.* +*The scratch filesystem on Gila compute node is a 79TB spinning disk Ceph filesystem, and is accessible from login and compute nodes. The default writable path for scratch use is `/scratch/`.* ## Temporary space: $TMPDIR *When a job starts, the environment variable `$TMPDIR` is set to `/scratch//` for the duration of the job. This is temporary space only, and should be purged when your job is complete. Please be sure to use this path instead of /tmp for your tempfiles.* -There is no expectation of data longevity in scratch space, and it is subject to purging once the node is idle. If desired data is stored here during the job, please be sure to copy it to a /projects directory as part of the job script before the job finishes. +There is no expectation of data longevity in the temporary space, and is purged once a job has completed. If desired data is stored here during the job, please be sure to copy it to a /projects directory as part of the job script before the job finishes. ## Mass Storage System @@ -34,5 +29,3 @@ There is no Mass Storage System for deep archive storage on Gila. 
## Backups and Snapshots There are no backups or snapshots of data on Gila. Though the system is protected from hardware failure by multiple layers of redundancy, please keep regular backups of important data on Gila, and consider using a Version Control System (such as Git) for important code. - - From c7b60f5d0a88cda235a6fe5ddcfc66df72b23e0d Mon Sep 17 00:00:00 2001 From: jonathancasco <4037484+jonathancasco@users.noreply.github.com> Date: Thu, 16 Oct 2025 13:04:12 -0600 Subject: [PATCH 04/23] Update index.md --- docs/Documentation/Systems/Gila/index.md | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index e95fe50ea..4a44a378c 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -1,32 +1,28 @@ # About Gila -Gila is an OpenHPC-based cluster running on . The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NREL workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. +Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVidia A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NREL workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. ## Accessing Gila Access to Gila requires an NREL HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. -*Need to update* #### For NREL Employees: To access Gila, log into the NREL network and connect via ssh: - ssh - ssh + ssh gila.hpc.nrel.gov #### For External Collaborators: -*Is this still true?* There are currently no external-facing login nodes for Gila. There are two options to connect: 1. Connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html) and log in with your username, password, and OTP code. Once connected, ssh to the login nodes as above. 1. Connect to the [HPC VPN](https://www.nrel.gov/hpc/vpn-connection.html) and ssh to the login nodes as above. -*Need to update* There are currently two login nodes. They share the same home directory so work done on one will appear on the other. They are: - vs-login-1 - vs-login-2 + gila-login-1 + gila-login-2 -You may connect directly to a login node, but they may be cycled in and out of the pool. If a node is unavailable, try connecting to another login node or the <>`vs.hpc.nrel.gov`> round-robin option. +You may connect directly to a login node, but they may be cycled in and out of the pool. If a node is unavailable, try connecting to another login node or the `gila.hpc.nrel.gov` round-robin option. ## Get Help with Gila @@ -34,9 +30,7 @@ Please see the [Help and Support Page](../../help.md) for further information on ## Building code -*Need to review* -Don't build or run code on a login node. Login nodes have limited CPU and memory available. Use a compute or GPU node instead. 
Simply start an interactive job on an appropriately provisioned node and partition for your work and do your builds there. Similarly, build your projects under `/projects/your_project_name/` as home directories are **limited to 5GB** per user. +Do not build or run code on login nodes. Login nodes have limited CPU and memory available. Use a compute or GPU node instead. Simply start an interactive job on an appropriately provisioned node and partition for your work and do your builds there. Similarly, build your projects under `/projects/your_project_name/` as home directories are **limited to 5GB** per user. --- - From 3111ab840605d28d73a321cc13c6e09cecf19f48 Mon Sep 17 00:00:00 2001 From: jonathancasco <4037484+jonathancasco@users.noreply.github.com> Date: Thu, 16 Oct 2025 13:04:34 -0600 Subject: [PATCH 05/23] Update running.md --- docs/Documentation/Systems/Gila/running.md | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index b4007cbf7..abeba1cf4 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -7,26 +7,24 @@ ### Compute hosts -Gila is a collection of physical nodes with each regular node containing . However, each node is virtualized. That is it is split up into virtual nodes with each virtual node having a portion of the cores and memory of the physical node. Similar virtual nodes are then assigned slurm partitions as shown below. +Compute nodes in Gila are virtualized nodes running on either __Dual AMD EPYC Milan CPUs__ or __Intel Xeon Icelake CPUs__. These nodes are not configured as exclusive and can be shared by multiple users or jobs. +### GPU hosts + +GPU nodes available in Gila have NVidia A100 GPU's running on __Intel Xeon Icelake CPUs__. -*Move this info to filesystems page?* ### Shared file systems Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There is also /scratch/$USER and /projects spaces seen across all nodes. -*Need to update* ### Partitions -Partitions are flexible and fluid on Gila. A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 3/27/2025. +A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 10/16/2025 -| Partition Name | Qty | RAM | Cores/node | /var/scratch
1K-blocks | AU Charge Factor | -| :--: | :--: | :--: | :--: | :--: | :--: | -| gpu
*1 x NVIDIA Tesla A100* | 16 | 114 GB | 30 | 6,240,805,336| 12 | -| lg | 39 | 229 GB | 60 | 1,031,070,000| 7 | -| std | 60 | 114 GB | 30 | 515,010,816| 3.5 | -| sm | 28 | 61 GB | 16 | 256,981,000| 0.875 | -| t | 15 | 16 GB | 4 | 61,665,000| 0.4375 | +| Partition Name | CPU | Qty | RAM | Cores/node | /var/scratch
1K-blocks | AU Charge Factor | +| :--: | :--: | :--: | :--: | :--: | :--: | :--: | +| gpu
*NVIDIA Tesla A100-40*
| Intel Xeon Icelake | 1 | 910 GB | 42 | 6,240,805,336| 12 | +| cpu-amd | AMD Epyc Milan | 36 | 220 GB | 120 | 1,031,070,000| 7 | ### Allocation Unit (AU) Charges @@ -42,7 +40,7 @@ The **Charge Factor** for each partition is listed in the table above. ### Operating Software -The Gila HPC cluster runs fairly current versions of OpenHPC and SLURM on top of OpenStack. +The Gila HPC cluster runs on Rocky Linux 9.5. ### Operating Software From 3f6fe5a3074e30304ed9099d4cf6a0512c8e59d3 Mon Sep 17 00:00:00 2001 From: jonathancasco <4037484+jonathancasco@users.noreply.github.com> Date: Mon, 15 Dec 2025 15:46:28 -0700 Subject: [PATCH 09/23] Update running.md --- docs/Documentation/Systems/Gila/running.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index ee1aa8e1c..99cde8421 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -24,7 +24,7 @@ A list of partitions can be found by running the `sinfo` command. Here are the | Partition Name | CPU | Qty | RAM | Cores/node | /var/scratch
1K-blocks | AU Charge Factor | | :--: | :--: | :--: | :--: | :--: | :--: | :--: | | gpu
*NVIDIA Tesla A100-40*
| Intel Xeon Icelake | 1 | 910 GB | 42 | 6,240,805,336| 12 | -| amd | AMD Epyc Milan | 36 | 220 GB | 60 | 1,031,070,000| 7 | +| amd | 2x 30 Core AMD Epyc Milan | 36 | 220 GB | 60 | 1,031,070,000| 7 | ### Allocation Unit (AU) Charges From 8002863fd5dc25cae6a808ea2aaf5f2e396eaaf1 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 30 Dec 2025 08:29:41 -0700 Subject: [PATCH 10/23] Add modules page --- docs/Documentation/Systems/Gila/modules.md | 146 ++++++++++++++++++++- 1 file changed, 145 insertions(+), 1 deletion(-) diff --git a/docs/Documentation/Systems/Gila/modules.md b/docs/Documentation/Systems/Gila/modules.md index 8b7a419b4..d4f3070bc 100644 --- a/docs/Documentation/Systems/Gila/modules.md +++ b/docs/Documentation/Systems/Gila/modules.md @@ -1 +1,145 @@ -*For apps team to write* \ No newline at end of file +# Modules on Gila + +On Gila, modules are deployed and organized slightly differently than on other NLR HPC systems. +While the basic concepts of using modules remain the same, there are important differences in how modules are structured, discovered, and loaded. These differences are intentional and are designed to improve compatibility, reproducibility, and long-term maintainability. The upcoming sections of this document will walk through these differences step by step. + +The module system used on this cluster is [Lmod](../../Environment/lmod.md). + +When you log in to Gila, three modules are loaded automatically by default: + +1. `Core/25.05` +2. `DefApps` +3. `gcc/14.2.0` + +!!! note + The `DefApps` module is a convenience module that ensures both `Core` and `GCC` are loaded upon login or when you use `module restore`. It does not load additional software itself but guarantees that the essential environment is active. + + +## Module Structure on Gila + +Modules on Gila are organized into two main categories: **Base Modules** and **Core Modules**. This structure is different from many traditional flat module trees and is designed to make software compatibility explicit and predictable. + +### Base Modules + +**Base modules** define the *software toolchain context* you are working in. Loading a base module changes which additional modules are visible and available. + +Base modules allow users to: + +* **Initiate a compiler toolchain** + * Loading a specific compiler (for example, `gcc` or `oneapi`) establishes a toolchain + * Once a compiler is loaded, only software built with and compatible with that compiler becomes visible when running `ml avail` + * This behavior applies to both **GCC** and **Intel oneAPI** toolchains + +* **Use Conda/Mamba environments** + * Loading `miniforge3` enables access to Conda and Mamba for managing user-level Python environments + +* **Access installed research applications** + * Loading the `application` module exposes centrally installed research applications + +* **Enable CUDA and GPU-enabled software** + * Loading the `cuda` module provides access to CUDA + * It also makes CUDA-enabled software visible in `module avail`, ensuring GPU-compatible applications are only shown when CUDA is loaded + +In short, **base modules control which families of software are visible** by establishing the appropriate environment and compatibility constraints. + +### Core Modules + +**Core modules** are independent of any specific compiler or toolchain. 
+ +They: + +* Do **not** rely on a particular compiler +* Contain essential utilities, libraries, and tools +* Are intended to work with **any toolchain** + +Core modules are typically always available and can be safely loaded regardless of which compiler, CUDA version, or toolchain is active. + +This separation between Base and Core modules ensures: + +* Clear compiler compatibility +* Reduced risk of mixing incompatible software +* A cleaner and more predictable module environment + + +## MPI-Enabled Software + +MPI-enabled software modules are identified by a `-mpi` suffix at the end of the module name. + +Similar to compiler modules, MPI-enabled software is **not visible by default**. These modules only appear after an MPI implementation is loaded. Supported MPI implementations include `openmpi`, `mpich`, and `intelmpi`. + +Loading an MPI implementation makes MPI-enabled software that was installed with that specific MPI stack available when running `module avail`. + +This behavior ensures that only software built against the selected MPI implementation is exposed, helping users avoid mixing incompatible MPI libraries. + +!!! note + To determine whether a software package is available on the cluster, use `module spider`. This command lists **all available versions and configurations** of a given software, including those that are not currently visible with `module avail`. + + To find out which modules must be loaded in order to access a specific software configuration, run `module spider` using the **full module name**. This will show the required modules that need to be loaded to make that software available. + + +## Containers + +Container tools such as **Apptainer** and **Podman** do not require module files on this cluster. They are available on the system **by default** and are already included in your `PATH`. + +This means you can use Apptainer and Podman at any time without loading a specific module, regardless of which compiler, MPI, or CUDA toolchain is currently active. + + +## Module Commands: restore, avail, and spider + +### module restore + +The `module restore` command reloads the set of modules that were active at the start of your login session or at the last checkpoint. This is useful if you have unloaded or swapped modules and want to return to your original environment. + +Example: + +```bash +module restore +``` + +This will restore the default modules that were loaded at login, such as `Core/25.05`, `DefApps`, and `gcc/14.2.0`. + +### module avail + +The `module avail` command lists all modules that are **currently visible** in your environment. This includes modules that are compatible with the loaded compiler, MPI, or CUDA base modules. + +Example: + +```bash +module avail +``` + +You can also search for a specific software: + +```bash +module avail python +``` + +### module spider + +The `module spider` command provides a **complete listing of all versions and configurations** of a software package, including those that are **not currently visible** with `module avail`. It also shows **which modules need to be loaded** to make a specific software configuration available. + +Example: + +```bash +module spider python/3.10 +``` + +This output will indicate any prerequisite modules you need to load before the software becomes available. + +!!! tip + Use `module avail` for quick checks and `module spider` when you need full details or to resolve dependencies for specific versions. + + +## Frequently Asked Questions + +??? note "I can't find the module I need." 
+ Please email [HPC-Help](mailto:HPC-Help@nrel.gov). The Apps team will get in touch with you to provide the module you need. + +??? note "I need to mix and match compilers and libraries/MPI. How can I do that?" + Modules on Gila do not support mixing and matching. For example, if `oneapi` is loaded, only software compiled with `oneapi` will appear. If you require a custom combination of software stacks, you are encouraged to use **Spack** to deploy your stack. Please contact [HPC-Help](mailto:HPC-Help@nrel.gov) to be matched with a Spack expert. + +??? note "Can I use Miniforge with other modules?" + While it is technically possible, Miniforge is intended to provide an isolated environment separate from external modules. Be careful with the order in which modules are loaded, as this can impact your `PATH` and `LD_LIBRARY_PATH`. + +??? note "What if I want a different CUDA version?" + Other CUDA versions are available under **CORE** modules. If you need additional versions, please reach out to [HPC-Help](mailto:HPC-Help@nrel.gov). Note that CUDA modules under CORE do **not** automatically make CUDA-enabled software available; only CUDA modules under **Base** modules will load CUDA-enabled packages. From 0de35b60476dd88adca84dc7b320d7cc467f9c55 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 30 Dec 2025 10:42:24 -0700 Subject: [PATCH 11/23] Replace nrel --- docs/Documentation/Systems/Gila/index.md | 10 +++++----- mkdocs.yml | 12 ++++++------ 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index a6b10a1ec..c41253ef1 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -1,13 +1,13 @@ # About Gila -Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVidia A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NREL workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. +Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVidia A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. ## Accessing Gila -Access to Gila requires an NREL HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. +Access to Gila requires an NLR HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. 
-#### For NREL Employees: -To access Gila, log into the NREL network and connect via ssh: +#### For NLR Employees: +To access Gila, log into the NLR network and connect via ssh: ssh gila.hpc.nrel.gov @@ -26,7 +26,7 @@ You may connect directly to a login node, but they may be cycled in and out of t ## Get Help with Gila -Please see the [Help and Support Page](../../help.md) for further information on how to seek assistance with Gila or your NREL HPC account. +Please see the [Help and Support Page](../../help.md) for further information on how to seek assistance with Gila or your NLR HPC account. ## Building code diff --git a/mkdocs.yml b/mkdocs.yml index 65d1c426e..2f0658c05 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -44,12 +44,12 @@ nav: - Documentation/Systems/Swift/filesystems.md - Gila: - Documentation/Systems/Gila/index.md - - Documentation/Systems/Gila/running.md - - Documentation/Systems/Gila/modules.md - - Documentation/Systems/Gila/filesystem.md - - Vermilion: - - Documentation/Systems/Vermilion/index.md - - Documentation/Systems/Vermilion/running.md + - Running on Gila: Documentation/Systems/Gila/running.md + - Modules: Documentation/Systems/Gila/modules.md + - Filesystems: Documentation/Systems/Gila/filesystem.md + # - Vermilion: + # - Documentation/Systems/Vermilion/index.md + # - Documentation/Systems/Vermilion/running.md - Slurm Job Scheduling: - Documentation/Slurm/index.md - Monitor and Control Commands: Documentation/Slurm/monitor_and_control.md From de94e8561fe2546d799ae9a4af5daa5f44914204 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 30 Dec 2025 10:43:45 -0700 Subject: [PATCH 12/23] fix comment --- docs/Documentation/Systems/Gila/index.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index c41253ef1..6e4acb150 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -3,6 +3,8 @@ Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVidia A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. +*TODO: Update information about the allocations (include aurorahpc allocation info)* + ## Accessing Gila Access to Gila requires an NLR HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. 
From f12970980ea50bdabaf5a8107424dbd0c7e2c0d0 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 30 Dec 2025 17:07:31 -0700 Subject: [PATCH 13/23] Updates --- docs/Documentation/Systems/Gila/filesystem.md | 2 -- docs/Documentation/Systems/Gila/index.md | 3 +- docs/Documentation/Systems/Gila/running.md | 29 +++++++++---------- 3 files changed, 16 insertions(+), 18 deletions(-) diff --git a/docs/Documentation/Systems/Gila/filesystem.md b/docs/Documentation/Systems/Gila/filesystem.md index d6a329848..4b566ad05 100644 --- a/docs/Documentation/Systems/Gila/filesystem.md +++ b/docs/Documentation/Systems/Gila/filesystem.md @@ -12,8 +12,6 @@ Quota usage can be viewed at any time by issuing a `cd` command into the project ## Scratch Storage: /scratch/username and /scratch/username/jobid -For users who also have Kestrel allocations, please be aware that scratch space on Gila behaves differently, so adjustments to job scripts may be necessary. - The scratch filesystem on Gila compute node is a 79TB spinning disk Ceph filesystem, and is accessible from login and compute nodes. The default writable path for scratch use is `/scratch/`. ## Temporary space: $TMPDIR diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index 6e4acb150..53653cdb5 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -6,7 +6,8 @@ Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and *TODO: Update information about the allocations (include aurorahpc allocation info)* ## Accessing Gila -Access to Gila requires an NLR HPC account and permission to join an existing allocation. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. +All NLR employees with an HPC account automatically have access to Gila. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. + #### For NLR Employees: To access Gila, log into the NLR network and connect via ssh: diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index 99cde8421..bfe9cd134 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -1,32 +1,31 @@ # Running on Gila -This page discusses the compute nodes, partitions, and gives some examples of building and running applications. +*Learn about compute nodes and job partitions on Gila.* -## About Gila - -### Compute hosts +## Compute hosts Compute nodes in Gila are virtualized nodes running on either __Dual AMD EPYC Milan CPUs__ or __Intel Xeon Icelake CPUs__. These nodes are not configured as exclusive and can be shared by multiple users or jobs. -### GPU hosts +## GPU hosts GPU nodes available in Gila have NVidia A100 GPU's running on __Intel Xeon Icelake CPUs__. -### Shared file systems +## Shared file systems -Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There is also /scratch/$USER and /projects spaces seen across all nodes. +Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There are also `/scratch/$USER` and `/projects` spaces seen across all nodes. -### Partitions +## Partitions -A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 10/16/2025 +A list of partitions can be found by running the `sinfo` command. 
Here are the partitions as of 12/30/2025 -| Partition Name | CPU | Qty | RAM | Cores/node | /var/scratch
1K-blocks | AU Charge Factor | -| :--: | :--: | :--: | :--: | :--: | :--: | :--: | -| gpu
*NVIDIA Tesla A100-40*
| Intel Xeon Icelake | 1 | 910 GB | 42 | 6,240,805,336| 12 | -| amd | 2x 30 Core AMD Epyc Milan | 36 | 220 GB | 60 | 1,031,070,000| 7 | +| Partition Name | CPU | Qty | RAM | Cores/node | AU Charge Factor | +| :--: | :--:| :--:| :--: | :--: | :--: | +| gpu
*NVIDIA Tesla A100-40*
| Intel Xeon Icelake | 1 | 910 GB | 42 | 12 | +| amd | 2x 30 Core AMD Epyc Milan | 36 | 220 GB | 60 | 7 | +| gh | | 5 | | | 7 | -### Allocation Unit (AU) Charges +## Allocation Unit (AU) Charges The equation for calculating the AU cost of a job on Gila is: @@ -38,7 +37,7 @@ The **Charge Factor** for each partition is listed in the table above. -### Operating Software +## Operating Software The Gila HPC cluster runs on Rocky Linux 9.5. From aef5ebd88dc029ffc4300d228e48550d9579e3ea Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 30 Dec 2025 17:15:21 -0700 Subject: [PATCH 14/23] add grace hopper node info --- docs/Documentation/Systems/Gila/running.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index bfe9cd134..61fc5da2c 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -23,7 +23,7 @@ A list of partitions can be found by running the `sinfo` command. Here are the | :--: | :--:| :--:| :--: | :--: | :--: | | gpu
*NVIDIA Tesla A100-40*
| Intel Xeon Icelake | 1 | 910 GB | 42 | 12 | | amd | 2x 30 Core AMD Epyc Milan | 36 | 220 GB | 60 | 7 | -| gh | | 5 | | | 7 | +| gh | | 5 | 470 GB | 72 | 7 | ## Allocation Unit (AU) Charges From 74a39306b3a9da9db81322ef6b692c64df140c2a Mon Sep 17 00:00:00 2001 From: hyandt Date: Wed, 31 Dec 2025 08:29:20 -0700 Subject: [PATCH 15/23] add hopper info --- docs/Documentation/Systems/Gila/running.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index 61fc5da2c..c18870aa1 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -19,11 +19,11 @@ Gila's home directories are shared across all nodes. Each user has a quota of 5 A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 12/30/2025 -| Partition Name | CPU | Qty | RAM | Cores/node | AU Charge Factor | -| :--: | :--:| :--:| :--: | :--: | :--: | -| gpu
*NVIDIA Tesla A100-40*
| Intel Xeon Icelake | 1 | 910 GB | 42 | 12 | -| amd | 2x 30 Core AMD Epyc Milan | 36 | 220 GB | 60 | 7 | -| gh | | 5 | 470 GB | 72 | 7 | +| Partition Name | CPU | GPU | Qty | RAM | Cores/node | AU Charge Factor | +| :--: | :--:| :--: | :--:| :--: | :--: | :--: | +| gpu | Intel Xeon Icelake | NVIDIA Tesla A100-80 | 1 | 910 GB | 42 | 12 | +| amd | 2x 30 Core AMD Epyc Milan | | 36 | 220 GB | 60 | 7 | +| gh | NVIDIA Grace | GH200 | 5 | 470 GB | 72 | 7 | ## Allocation Unit (AU) Charges From fbc2f227ffeec16807161df0df853d9845c2797b Mon Sep 17 00:00:00 2001 From: hyandt Date: Wed, 31 Dec 2025 08:37:14 -0700 Subject: [PATCH 16/23] typoe --- docs/Documentation/Systems/Gila/index.md | 2 +- docs/Documentation/Systems/Gila/running.md | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index 53653cdb5..8989889e3 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -1,7 +1,7 @@ # About Gila -Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVidia A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. +Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVIDIA A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. *TODO: Update information about the allocations (include aurorahpc allocation info)* diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index c18870aa1..c622da876 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -9,7 +9,9 @@ Compute nodes in Gila are virtualized nodes running on either __Dual AMD EPYC Mi ## GPU hosts -GPU nodes available in Gila have NVidia A100 GPU's running on __Intel Xeon Icelake CPUs__. +GPU nodes available in Gila have NVIDIA A100 GPU's running on __Intel Xeon Icelake CPUs__. + +There are also 5 NVIDIA Grace Hopper nodes. ## Shared file systems From 4b9b1847d35b0f1f5a3c383f61ab7cfd70f481b4 Mon Sep 17 00:00:00 2001 From: hyandt Date: Wed, 31 Dec 2025 08:39:38 -0700 Subject: [PATCH 17/23] add info --- docs/Documentation/Systems/Gila/running.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index c622da876..1db2cea4c 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -5,7 +5,7 @@ ## Compute hosts -Compute nodes in Gila are virtualized nodes running on either __Dual AMD EPYC Milan CPUs__ or __Intel Xeon Icelake CPUs__. These nodes are not configured as exclusive and can be shared by multiple users or jobs. +Compute nodes in Gila are virtualized nodes. 
These nodes are not configured as exclusive and can be shared by multiple users or jobs. ## GPU hosts From 30afe2813b512a73f6fae1535ba902b198f726b3 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 6 Jan 2026 14:06:33 -0700 Subject: [PATCH 18/23] revisions --- docs/Documentation/Systems/Gila/filesystem.md | 28 +++- docs/Documentation/Systems/Gila/index.md | 35 ++-- docs/Documentation/Systems/Gila/running.md | 157 ++---------------- 3 files changed, 47 insertions(+), 173 deletions(-) diff --git a/docs/Documentation/Systems/Gila/filesystem.md b/docs/Documentation/Systems/Gila/filesystem.md index 4b566ad05..088f67589 100644 --- a/docs/Documentation/Systems/Gila/filesystem.md +++ b/docs/Documentation/Systems/Gila/filesystem.md @@ -2,27 +2,39 @@ ## Home Directories: /home -/home directories are mounted as `/home/`. Home directories are hosted under the user's initial /project directory. Quotas in /home are included as a part of the quota of that project's storage allocation +`/home` directories are mounted as `/home/`. To check your usage in your /home directory, visit the [Gila Filesystem Dashboard](https://influx.hpc.nrel.gov/d/ch4vndd/ceph-filesystem-quotas?folderUid=fexgrdi5pt91ca&orgId=1&from=now-1h&to=now&timezone=browser&tab=queries). You can also check your home directory usage and quota by running the following commands: + +``` +# Check usage +getfattr -n ceph.dir.rbytes +# Check quota +getfattr -n ceph.quota.max_bytes +``` + +If you need a quota increase in your home directory, please contact [HPC-Help@nrel.gov](mailto:HPC-Help@nrel.gov). ## Project Storage: /projects -Each active project is granted a subdirectory under `/projects/`. This is where the bulk of data is expected to be, and where jobs should generally be run from. Storage quotas are based on the allocation award. +Each active project is granted a subdirectory under `/projects/`. There are currently no quotas on `/projects` directories. Please monitor your space usage at the [Gila Filesystem Dashboard](https://influx.hpc.nrel.gov/d/ch4vndd/ceph-filesystem-quotas?folderUid=fexgrdi5pt91ca&orgId=1&from=now-1h&to=now&timezone=browser&tab=queries). + +Note that there is currently no `/projects/aurorahpc` directory. Data can be kept in your `/home` directory. -Quota usage can be viewed at any time by issuing a `cd` command into the project directory, and using the `df -h` command to view total, used, and remaining available space for the mounted project directory +## Scratch Storage -## Scratch Storage: /scratch/username and /scratch/username/jobid +The scratch filesystem on Gila is a spinning disk Ceph filesystem, and is accessible from login and compute nodes. The default writable path for scratch use is `/scratch/`. -The scratch filesystem on Gila compute node is a 79TB spinning disk Ceph filesystem, and is accessible from login and compute nodes. The default writable path for scratch use is `/scratch/`. +!!! warning + Data in `/scratch` is subject to deletion after 28 days. It is recommended to store your important data, libraries, and programs in your project or home directory. ## Temporary space: $TMPDIR -When a job starts, the environment variable `$TMPDIR` is set to `/scratch//` for the duration of the job. This is temporary space only, and should be purged when your job is complete. Please be sure to use this path instead of /tmp for your tempfiles. +When a job starts, the environment variable `$TMPDIR` is set to `/scratch//` for the duration of the job. 
This is temporary space only, and should be purged when your job is complete. Please be sure to use this path instead of `/tmp` for your tempfiles. -There is no expectation of data longevity in the temporary space, and is purged once a job has completed. If desired data is stored here during the job, please be sure to copy it to a /projects directory as part of the job script before the job finishes. +There is no expectation of data longevity in the temporary space, and data is purged once a job has completed. If desired data is stored here during the job, please be sure to copy it to a project or home directory as part of the job script before the job finishes. ## Mass Storage System -There is no Mass Storage System for deep archive storage on Gila. +There is no Mass Storage System for deep archive storage from Gila. ## Backups and Snapshots diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index 8989889e3..1698c71a9 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -1,41 +1,36 @@ # About Gila -Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVIDIA A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly as the configuration and capabilities for Gila are augmented over time. +Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVIDIA A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time. -*TODO: Update information about the allocations (include aurorahpc allocation info)* +#TODO cover grace hopper nodes here. -## Accessing Gila -All NLR employees with an HPC account automatically have access to Gila. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations. +## Gila Access and Allocations + **A specific allocation is not needed for NLR employee use of Gila.** All NLR employees with an HPC account automatically have access to Gila and can use the *aurorahpc* allocation to run jobs. If you do not have an HPC account already and would like to use Gila, please see the [User Accounts](https://www.nrel.gov/hpc/user-accounts) page to request an account. + +The aurorahpc allocation does have limited resources allowed per job. These limits are dynamic, and can be found in the MOTD displayed when you log in to Gila. Please note that this allocation is a shared resource. If excessive usage reduces productivity for the broader user community, you may be contacted by HPC Operations staff. If you need to use more resources than allowed by the aurorahpc allocation, or work with external collaborators, you can request a specific allocation for your project. For more information on requesting an allocation, please see the [Resource Allocation Requests](https://www.nrel.gov/hpc/resource-allocation-requests) page. 
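
For example, a job can be charged to the shared allocation by naming it explicitly at submission time. This is a sketch using standard Slurm options; adjust the partition, walltime, and script name for your work, or substitute your own project handle if you have a dedicated allocation.

```bash
# Submit against the shared aurorahpc allocation on the amd partition
sbatch --account=aurorahpc --partition=amd --time=01:00:00 my_job.sh
```
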
#### For NLR Employees: -To access Gila, log into the NLR network and connect via ssh: +To access Gila, log in to the NLR network and connect via ssh to: + + gila.hpc.nrel.gov - ssh gila.hpc.nrel.gov +To use the Grace Hopper nodes, connect via ssh to: + + gila-hopper-login1.hpc.nrel.gov #### For External Collaborators: -There are currently no external-facing login nodes for Gila. There are two options to connect: +There are no external-facing login nodes for Gila. There are two options to connect: 1. Connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html) and log in with your username, password, and OTP code. Once connected, ssh to the login nodes as above. 1. Connect to the [HPC VPN](https://www.nrel.gov/hpc/vpn-connection.html) and ssh to the login nodes as above. -There are currently two login nodes. They share the same home directory so work done on one will appear on the other. They are: - - gila-login-1 - gila-login-2 - -You may connect directly to a login node, but they may be cycled in and out of the pool. If a node is unavailable, try connecting to another login node or the `gila.hpc.nrel.gov` round-robin option. - ## Get Help with Gila Please see the [Help and Support Page](../../help.md) for further information on how to seek assistance with Gila or your NLR HPC account. -## Building code - -Do not build or run code on login nodes. Login nodes have limited CPU and memory available. Use a compute or GPU node instead. Simply start an interactive job on an appropriately provisioned node and partition for your work and do your builds there. - -Similarly, build your projects under `/projects/your_project_name/` as home directories are **limited to 5GB** per user. +## Building Code +Do not build or run code on login nodes. Login nodes have limited CPU and memory available. Use a compute node or GPU node instead. Simply start an [interactive job](../../Slurm/interactive_jobs.md) on an appropriately provisioned node and partition for your work and do your builds there. ---- diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index 1db2cea4c..390f9439b 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -3,167 +3,34 @@ *Learn about compute nodes and job partitions on Gila.* -## Compute hosts +## Compute Nodes + +Compute nodes in Gila are virtualized nodes. **These nodes are not configured as exclusive and can be shared by multiple users or jobs.** Be sure to request the resources that your job needs, including memory and cores. -Compute nodes in Gila are virtualized nodes. These nodes are not configured as exclusive and can be shared by multiple users or jobs. ## GPU hosts -GPU nodes available in Gila have NVIDIA A100 GPU's running on __Intel Xeon Icelake CPUs__. +GPU nodes in Gila have NVIDIA A100 GPUs running on __Intel Xeon Icelake CPUs__. -There are also 5 NVIDIA Grace Hopper nodes. -## Shared file systems +There are also 5 NVIDIA Grace Hopper nodes. To use the Grace Hopper nodes, submit your jobs to the gh partition from the `gila-hopper-login1.hpc.nrel.gov` login node. -Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There are also `/scratch/$USER` and `/projects` spaces seen across all nodes. ## Partitions A list of partitions can be found by running the `sinfo` command. 
Here are the partitions as of 12/30/2025 -| Partition Name | CPU | GPU | Qty | RAM | Cores/node | AU Charge Factor | -| :--: | :--:| :--: | :--:| :--: | :--: | :--: | -| gpu | Intel Xeon Icelake | NVIDIA Tesla A100-80 | 1 | 910 GB | 42 | 12 | -| amd | 2x 30 Core AMD Epyc Milan | | 36 | 220 GB | 60 | 7 | -| gh | NVIDIA Grace | GH200 | 5 | 470 GB | 72 | 7 | - -## Allocation Unit (AU) Charges - -The equation for calculating the AU cost of a job on Gila is: - -```AU cost = (Walltime in hours * Number of Nodes * Charge Factor)``` - -The Walltime is the actual length of time that the job runs, in hours or fractions thereof. - -The **Charge Factor** for each partition is listed in the table above. - - - -## Operating Software - -The Gila HPC cluster runs on Rocky Linux 9.5. - - - From 78a9733cfc987689646e91998b37207e8f611869 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 6 Jan 2026 14:07:18 -0700 Subject: [PATCH 19/23] remove comment --- docs/Documentation/Systems/Gila/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index 1698c71a9..743fe5e77 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -3,7 +3,6 @@ Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVIDIA A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time. -#TODO cover grace hopper nodes here. ## Gila Access and Allocations From ae84e3a1703f5f3bf9b8555b2732a0c03b0d95ca Mon Sep 17 00:00:00 2001 From: hyandt Date: Fri, 9 Jan 2026 10:58:06 -0700 Subject: [PATCH 20/23] add gracehopper module changes --- docs/Documentation/Systems/Gila/modules.md | 71 ++++++++++++++++++++++ 1 file changed, 71 insertions(+) diff --git a/docs/Documentation/Systems/Gila/modules.md b/docs/Documentation/Systems/Gila/modules.md index d4f3070bc..760a054a3 100644 --- a/docs/Documentation/Systems/Gila/modules.md +++ b/docs/Documentation/Systems/Gila/modules.md @@ -14,6 +14,15 @@ When you log in to Gila, three modules are loaded automatically by default: !!! note The `DefApps` module is a convenience module that ensures both `Core` and `GCC` are loaded upon login or when you use `module restore`. It does not load additional software itself but guarantees that the essential environment is active. +## X86 VS ARM + +There are two module stacks on Gila, one for each hardware architecture and each stack is loaded depending on the hardware used. +The two hardware stacks are almost identical in terms of modules offered, however some modules might be missing and/or have different versions. Please email [HPC-Help](mailto:HPC-Help@nrel.gov) for any request regarding modules availability and/or versions change. +The recommended usage is to connect to the login node corresponding to the hardware intended to be used for the compute, e.g. `gila-login-1` for **x86** and `gila-hopper-login1` for **arm**. + +!!! warning + Usage of the GraceHopper computes from the x86 login node, or the usage of x86 computes from the GraceHopper login is not allowed and will cause module problems. 
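
As a quick check, you can confirm which architecture your current session is on before loading modules or submitting work. This uses the standard `uname` utility; the login node names are the ones listed above.

```bash
# Print the machine architecture of the current host
uname -m    # x86_64 on gila-login-1, aarch64 on gila-hopper-login1
```
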
+ ## Module Structure on Gila @@ -71,6 +80,54 @@ Loading an MPI implementation makes MPI-enabled software that was installed with This behavior ensures that only software built against the selected MPI implementation is exposed, helping users avoid mixing incompatible MPI libraries. +For example, using **module spider** to find all available variances of **HDF5**. + +```bash +[USER@gila-login-1 ~]$ ml spider hdf5 + hdf5: +-------------------------------------------- + Versions: + hdf5/1.14.5 + hdf5/1.14.5-mpi +``` + +Each version of **HDF5** requires dependency modules to be loaded so that they can be available to be used. +Please refer to the **module spider** section for more details. + +To find the dependencies needed for **hdf5/1.14.5-mpi** + +```bash +[USER@gila-login-1 ~]$ ml spider hdf5/1.14.5-mpi + + hdf5: +-------------------------------------------- + You will need to load all module(s) on one of the lines below before the 'hdf5/1.14.5-mpi' module is available to load. + gcc/14.2.0 openmpi/5.0.5 + oneapi/2025.1.3 oneapi/mpi-2021.14.0 + oneapi/2025.1.3 openmpi/5.0.5 +``` + +Without the dependencies and using **ml avail** + +```bash +[USER@gila-login-1 ~]$ ml avail hdf5 +--------------- [ gcc/14.2.0 ] ------------- + hdf5/1.14.5 +``` + +This version of **HDF5** is not *mpi* enabled. + +Now with the dependencies loaded + +```bash +[USER@gila-login-1 ~]$ ml avail hdf5 +--------------- [ gcc/14.2.0, openmpi/5.0.5 ] ------------- + hdf5/1.14.5-mpi +--------------- [ gcc/14.2.0 ] ------------- + hdf5/1.14.5 +``` + + !!! note To determine whether a software package is available on the cluster, use `module spider`. This command lists **all available versions and configurations** of a given software, including those that are not currently visible with `module avail`. @@ -84,6 +141,20 @@ Container tools such as **Apptainer** and **Podman** do not require module files This means you can use Apptainer and Podman at any time without loading a specific module, regardless of which compiler, MPI, or CUDA toolchain is currently active. +## Building on Gila + +Building on Gila should be done on compute nodes and **NOT** login nodes. +Some important build tools are not available by default and requires loading them from the module stack. + +These build tools are: + +- perl +- autoconf +- libtool +- automake +- m4 + + ## Module Commands: restore, avail, and spider ### module restore From dd482f4250e65bffa1121c5aeba9133548ce0e3d Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 13 Jan 2026 14:35:40 -0700 Subject: [PATCH 21/23] Edit modules and running --- docs/Documentation/Systems/Gila/index.md | 2 +- docs/Documentation/Systems/Gila/modules.md | 133 ++++++++++---------- docs/Documentation/Systems/Gila/running.md | 137 +++++++++++++++++++-- 3 files changed, 201 insertions(+), 71 deletions(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index 743fe5e77..fedd691c5 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -1,7 +1,7 @@ # About Gila -Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVIDIA A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time. +Gila is an OpenHPC-based cluster. 
The [nodes](./running.md#gila-compute-nodes) run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time. ## Gila Access and Allocations diff --git a/docs/Documentation/Systems/Gila/modules.md b/docs/Documentation/Systems/Gila/modules.md index 760a054a3..3a7d67f24 100644 --- a/docs/Documentation/Systems/Gila/modules.md +++ b/docs/Documentation/Systems/Gila/modules.md @@ -1,7 +1,7 @@ # Modules on Gila On Gila, modules are deployed and organized slightly differently than on other NLR HPC systems. -While the basic concepts of using modules remain the same, there are important differences in how modules are structured, discovered, and loaded. These differences are intentional and are designed to improve compatibility, reproducibility, and long-term maintainability. The upcoming sections of this document will walk through these differences step by step. +While the basic concepts of using modules remain the same, there are important differences in how modules are structured, discovered, and loaded. These differences are intentional and designed to improve compatibility, reproducibility, and long-term maintainability. The upcoming sections of this document will walk through these differences step by step. The module system used on this cluster is [Lmod](../../Environment/lmod.md). @@ -14,15 +14,19 @@ When you log in to Gila, three modules are loaded automatically by default: !!! note The `DefApps` module is a convenience module that ensures both `Core` and `GCC` are loaded upon login or when you use `module restore`. It does not load additional software itself but guarantees that the essential environment is active. -## X86 VS ARM +## x86 vs ARM -There are two module stacks on Gila, one for each hardware architecture and each stack is loaded depending on the hardware used. -The two hardware stacks are almost identical in terms of modules offered, however some modules might be missing and/or have different versions. Please email [HPC-Help](mailto:HPC-Help@nrel.gov) for any request regarding modules availability and/or versions change. -The recommended usage is to connect to the login node corresponding to the hardware intended to be used for the compute, e.g. `gila-login-1` for **x86** and `gila-hopper-login1` for **arm**. +Gila has two separate module stacks, one for each hardware architecture. The appropriate stack is automatically loaded based on which login node you use. +The two hardware stacks are almost identical in terms of available modules. However, some modules might be missing or have different versions depending on the architecture. For requests regarding module availability or version changes, please email [HPC-Help](mailto:HPC-Help@nrel.gov). + +To ensure proper module compatibility, connect to the login node corresponding to your target compute architecture: + +- **x86 architecture**: Use `gila-login-1` +- **ARM architecture**: Use `gila-hopper-login1` (Grace Hopper nodes) -!!! warning - Usage of the GraceHopper computes from the x86 login node, or the usage of x86 computes from the GraceHopper login is not allowed and will cause module problems. +!!! warning + Do not submit jobs to Grace Hopper (ARM) compute nodes from the x86 login node, or vice versa. 
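As a minimal illustration of the recommended workflow (hostnames and default modules as described above; replace `<username>` with your own account), start from the login node that matches the architecture you intend to compute on and confirm the expected default stack is loaded:

```bash
# x86 work: connect to the x86 login node and inspect the default stack
ssh <username>@gila-login-1
module list          # e.g. Core, DefApps, gcc for the x86 stack

# Grace Hopper (ARM) work: use the ARM login node instead
ssh <username>@gila-hopper-login1
module list          # ARM stack; module versions may differ from x86
```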
## Module Structure on Gila @@ -69,6 +73,50 @@ This separation between Base and Core modules ensures: * Reduced risk of mixing incompatible software * A cleaner and more predictable module environment +## Module Commands: restore, avail, and spider + +### module restore + +The `module restore` command reloads the set of modules that were active at the start of your login session or at the last checkpoint. This is useful if you have unloaded or swapped modules and want to return to your original environment. + +Example: + +```bash +module restore +``` + +This will restore the default modules that were loaded at login, such as `Core/25.05`, `DefApps`, and `gcc/14.2.0`. + +### module avail + +The `module avail` command lists all modules that are **currently visible** in your environment. This includes modules that are compatible with the loaded compiler, MPI, or CUDA base modules. + +Example: + +```bash +module avail +``` + +You can also search for a specific software: + +```bash +module avail python +``` + +### module spider + +The `module spider` command provides a **complete listing of all versions and configurations** of a software package, including those that are **not currently visible** with `module avail`. It also shows **which modules need to be loaded** to make a specific software configuration available. + +Example: + +```bash +module spider python/3.10 +``` + +This output will indicate any prerequisite modules you need to load before the software becomes available. + +!!! tip + Use `module avail` for quick checks and `module spider` when you need full details or to resolve dependencies for specific versions. ## MPI-Enabled Software @@ -76,11 +124,13 @@ MPI-enabled software modules are identified by a `-mpi` suffix at the end of the Similar to compiler modules, MPI-enabled software is **not visible by default**. These modules only appear after an MPI implementation is loaded. Supported MPI implementations include `openmpi`, `mpich`, and `intelmpi`. -Loading an MPI implementation makes MPI-enabled software that was installed with that specific MPI stack available when running `module avail`. +Loading an MPI implementation makes MPI-enabled software built with that specific MPI stack available when running `module avail`. This behavior ensures that only software built against the selected MPI implementation is exposed, helping users avoid mixing incompatible MPI libraries. -For example, using **module spider** to find all available variances of **HDF5**. +### Example: Finding and Loading MPI-Enabled HDF5 + +Use `module spider` to find all available variants of **HDF5**. ```bash [USER@gila-login-1 ~]$ ml spider hdf5 @@ -91,10 +141,10 @@ For example, using **module spider** to find all available variances of **HDF5** hdf5/1.14.5-mpi ``` -Each version of **HDF5** requires dependency modules to be loaded so that they can be available to be used. -Please refer to the **module spider** section for more details. +Each version of **HDF5** requires dependency modules to be loaded before it becomes available. +Please refer to the [module spider section](modules.md#module-spider) for more details. 
-To find the dependencies needed for **hdf5/1.14.5-mpi** +To find the dependencies needed for `hdf5/1.14.5-mpi`: ```bash [USER@gila-login-1 ~]$ ml spider hdf5/1.14.5-mpi @@ -107,7 +157,7 @@ To find the dependencies needed for **hdf5/1.14.5-mpi** oneapi/2025.1.3 openmpi/5.0.5 ``` -Without the dependencies and using **ml avail** +Before loading the dependencies: ```bash [USER@gila-login-1 ~]$ ml avail hdf5 @@ -115,11 +165,12 @@ Without the dependencies and using **ml avail** hdf5/1.14.5 ``` -This version of **HDF5** is not *mpi* enabled. +This version of **HDF5** is not MPI-enabled. -Now with the dependencies loaded +After loading the dependencies, both versions are now visible: ```bash +[USER@gila-login-1 ~]$ ml gcc/14.2.0 openmpi/5.0.5 [USER@gila-login-1 ~]$ ml avail hdf5 --------------- [ gcc/14.2.0, openmpi/5.0.5 ] ------------- hdf5/1.14.5-mpi @@ -128,7 +179,7 @@ Now with the dependencies loaded ``` -!!! note +!!! tip To determine whether a software package is available on the cluster, use `module spider`. This command lists **all available versions and configurations** of a given software, including those that are not currently visible with `module avail`. To find out which modules must be loaded in order to access a specific software configuration, run `module spider` using the **full module name**. This will show the required modules that need to be loaded to make that software available. @@ -144,7 +195,7 @@ This means you can use Apptainer and Podman at any time without loading a specif ## Building on Gila Building on Gila should be done on compute nodes and **NOT** login nodes. -Some important build tools are not available by default and requires loading them from the module stack. +Some important build tools are not available by default and require loading them from the module stack. These build tools are: @@ -154,51 +205,7 @@ These build tools are: - automake - m4 - -## Module Commands: restore, avail, and spider - -### module restore - -The `module restore` command reloads the set of modules that were active at the start of your login session or at the last checkpoint. This is useful if you have unloaded or swapped modules and want to return to your original environment. - -Example: - -```bash -module restore -``` - -This will restore the default modules that were loaded at login, such as `Core/25.05`, `DefApps`, and `gcc/14.2.0`. - -### module avail - -The `module avail` command lists all modules that are **currently visible** in your environment. This includes modules that are compatible with the loaded compiler, MPI, or CUDA base modules. - -Example: - -```bash -module avail -``` - -You can also search for a specific software: - -```bash -module avail python -``` - -### module spider - -The `module spider` command provides a **complete listing of all versions and configurations** of a software package, including those that are **not currently visible** with `module avail`. It also shows **which modules need to be loaded** to make a specific software configuration available. - -Example: - -```bash -module spider python/3.10 -``` - -This output will indicate any prerequisite modules you need to load before the software becomes available. - -!!! tip - Use `module avail` for quick checks and `module spider` when you need full details or to resolve dependencies for specific versions. +Please see [here](./running.md#example-compiling-a-program-on-gila) for a full example of compiling a program on Gila. 
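As a sketch of how this typically looks in practice (module names follow the list above, but exact names and versions should be confirmed with `module spider`), an autotools-based build on a compute node might start with:

```bash
# Load the build tooling that is not available by default
module load perl autoconf automake libtool m4

# Typical autotools workflow, installing into a project directory
./configure --prefix=/projects/<your_project>/software/mypkg
make -j "$(nproc)"
make install
```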
## Frequently Asked Questions @@ -213,4 +220,4 @@ This output will indicate any prerequisite modules you need to load before the s While it is technically possible, Miniforge is intended to provide an isolated environment separate from external modules. Be careful with the order in which modules are loaded, as this can impact your `PATH` and `LD_LIBRARY_PATH`. ??? note "What if I want a different CUDA version?" - Other CUDA versions are available under **CORE** modules. If you need additional versions, please reach out to [HPC-Help](mailto:HPC-Help@nrel.gov). Note that CUDA modules under CORE do **not** automatically make CUDA-enabled software available; only CUDA modules under **Base** modules will load CUDA-enabled packages. + Other CUDA versions are available under **Core** modules. If you need additional versions, please reach out to [HPC-Help](mailto:HPC-Help@nrel.gov). Note that CUDA modules under CORE do **not** automatically make CUDA-enabled software available; only CUDA modules under **Base** modules will load CUDA-enabled packages. diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index 390f9439b..97e212c4a 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -3,17 +3,25 @@ *Learn about compute nodes and job partitions on Gila.* -## Compute Nodes +## Gila Compute Nodes -Compute nodes in Gila are virtualized nodes. **These nodes are not configured as exclusive and can be shared by multiple users or jobs.** Be sure to request the resources that your job needs, including memory and cores. +Gila compute nodes are not configured as exclusive and can be shared by multiple users or jobs. Be sure to request the resources that your job needs, including memory and cores. If you need exclusive use of a node, add the `--exclusive` flag to your job submission. +### CPU Nodes -## GPU hosts +The CPU nodes in Gila are single-threaded virtualized nodes. There are two sockets and NUMA nodes per compute node, with each socket containing 30 __AMD EPYC Milan__ (x86-64) cores. Each node has 220GB of RAM that can be used. -GPU nodes in Gila have NVIDIA A100 GPUs running on __Intel Xeon Icelake CPUs__. +### GPU Nodes + +GPU nodes in Gila have 8 NVIDIA A100 GPUs running on x86-64 __Intel Xeon Icelake CPUs__. There are 42 cores on a GPU node, with one socket and NUMA node. Each GPU node has 910GB of RAM, and each NVIDIA A100 GPU has 80GB of VRAM. + +### Grace Hopper Nodes + +Gila has 6 NVIDIA Grace Hopper nodes. To use the Grace Hopper nodes, submit your jobs to the `gh` partition from the `gila-hopper-login1.hpc.nrel.gov` login node. Each Grace Hopper node has a 72 core NVIDIA Grace CPU and an NVIDIA GH200 GPU, with 96GB of VRAM and 470GB of RAM. They have one socket and NUMA node. + +Please note - the __NVIDIA Grace CPUs__ run on a different processing architecture (ARM64) than both the __Intel Xeon Icelake CPUs__ (x86-64) and the __AMD EPYC Milan__ (x86-64). Any application that is manually compiled by a user and intended to be used on the Grace Hopper nodes __MUST__ be compiled on the Grace Hopper nodes themselves. -There are also 5 NVIDIA Grace Hopper nodes. To use the Grace Hopper nodes, submit your jobs to the gh partition from the `gila-hopper-login1.hpc.nrel.gov` login node. ## Partitions @@ -23,13 +31,128 @@ A list of partitions can be found by running the `sinfo` command. 
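Because the nodes are shared rather than exclusive, it helps to be explicit about the cores and memory a job needs when requesting them. A minimal sketch (replace `<your_allocation>` with your own project handle):

```bash
# Shared use: ask for exactly the cores and memory the job needs
salloc -N 1 -n 8 --mem=32G --partition=amd --account=<your_allocation> --time=01:00:00

# Whole-node use: add --exclusive to avoid sharing the node with other jobs
salloc -N 1 --exclusive --partition=amd --account=<your_allocation> --time=01:00:00
```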
Here are the | Partition Name | CPU | GPU | Qty | RAM | Cores/node | | :--: | :--:| :--: | :--:| :--: | :--: | | gpu | Intel Xeon Icelake | NVIDIA Tesla A100-80 | 1 | 910 GB | 42 | -| amd | 2x 30 Core AMD Epyc Milan | | 36 | 220 GB | 60 | +| amd | 2x 30 Core AMD Epyc Milan | N/A | 36 | 220 GB | 60 | | gh | NVIDIA Grace | GH200 | 5 | 470 GB | 72 | ## Performance Recommendations -Gila is optmized for single-node workloads. Multi-node jobs may experience degraded performance. +Gila is optimized for single-node workloads. Multi-node jobs may experience degraded performance. All MPI distribution flavors work on Gila, with noted performance from Intel-MPI. Gila is single-threaded, and applications that are compiled to make use of multiple threads will not be able to take advantage of this. + +## Example: Compiling a Program on Gila + +In this section we will describe how to compile an MPI based application using an Intel toolchain from the module system. Please see the [Modules page](./modules.md) for additional information on the Gila module system. + + +### Requesting an interactive session +First, we will begin by requesting an interactive session. This will give us a compute node from where we can carry out our work. An example command for requesting such a session is as follows: + +```salloc -N 1 -n 60 --mem 60GB --partition=amd --account=aurorahpc --time=01:00:00``` + +This will request a single node from the AMD partition with 60 cores and 60 GB of memory for one hour. We request this node using the ```aurorahpc``` account that is open to all NLR staff, but if you have an HPC allocation, please replace ```aurorahpc``` with the project handle. + +### Loading necessary modules + +Once we have an allocated node, we will need to load the initial Intel module for the toolchain `oneapi`. This will give us access to the Intel toolchain, and we will we now load the module ```intel-oneapi-mpi``` to give us access to Intel MPI. Please note, you can always check what modules are available to you by using the command ```module avail``` and you can also check what modules you have loaded by using the command ```module list```. The commands for loading the modules that we need are as follows: + +```bash +module load oneapi +module load intel-oneapi-mpi +``` + +### Copying program files + +We now have access to the tools we need from the Intel toolchain in order to be able to compile a program! First, create a directory called `program-compilation` under `/projects` or `/scratch`. + +```bash +mkdir program-compilation +cd program-compilation +``` +Now we are going to copy the `phostone.c` file from `/nopt/nrel/apps/210929a` to our `program-compilation` directory. + +```rsync -avP /nopt/nrel/apps/210929a/example/phostone.c .``` + +`rsync` is a copy command that is commonly used for transferring files, and the parameters that we put into the command allow for us to see the progress of the file transfer and preserve important file characteristics. + +### Program compilation + +Once the file is copied, we can compile the program. The command we need to use in order to compile the program is as follows: + +```bash +mpiicx -qopenmp phostone.c -o phost.intelmpi +``` + +The command ```mpiicx``` is the Intel MPI compiler that was loaded from the module ```intel-oneapi-mpi```, and we added the flag of ```-qopenmp``` to make sure that the OpenMP compiled portions of the program are able to be loaded. We then specified the file name as `phost.intelmpi` using the ```-o``` flag. 
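If the compile step fails or appears to pick up the wrong compiler, it is worth confirming that the wrapper on your `PATH` comes from the loaded modules. On most Intel MPI installations the `-show` option prints the underlying compile command without running it:

```bash
# Verify the wrapper comes from the loaded intel-oneapi-mpi module
which mpiicx

# Print the underlying icx command the wrapper would invoke (nothing is compiled)
mpiicx -show
```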
+ +### Submitting a job + +The following batch script requests two cores to use two MPI ranks on a single node, with a run time of up to an hour. Save this script to a file such as `submit_intel.sh`, and submit using `sbatch submit_intel.sh`. Again, if you have an HPC allocation, we request that you replace ```aurorahpc``` with the project handle. + +??? example "Batch Submission Script - Intel MPI" + + ```bash + #!/bin/bash + #SBATCH --nodes=1 + #SBATCH --ntasks=2 + #SBATCH --cpus-per-task=2 + #SBATCH --time=00:01:00 + #SBATCH --mem=20GB + #SBATCH --account=aurorahpc + + module load oneapi + module load intel-oneapi-mpi + + srun --cpus-per-task 2 -n 2 ./phost.intelmpi -F + ``` + +Your output should look similar to the following + +``` +MPI VERSION Intel(R) MPI Library 2021.14 for Linux* OS +task thread node name first task # on node core +0000 0000 gila-compute-36.novalocal 0000 0000 0001 +0000 0001 gila-compute-36.novalocal 0000 0000 0000 +0001 0000 gila-compute-36.novalocal 0000 0001 0031 +0001 0001 gila-compute-36.novalocal 0000 0001 0030 +``` + + +### Compiling with OpenMPI + +We can now follow these steps using OpenMPI as well! First, we will unload the Intel modules from the Intel toolchain. We will then load GNU modules and OpenMPI using the `module load` command from earlier. The commands are as follows: + +```bash +module unload intel-oneapi-mpi +module unload oneapi +module load gcc +module load openmpi +``` + +We can then compile the phost program again by using the following commands: + +```bash +mpicc -fopenmp phostone.c -o phost.openmpi +``` + +Once the program has been compiled against OpenMPI, we can go ahead and submit another batch script to test the program: + + +??? example "Batch Submission Script - OpenMPI" + + ```bash + #!/bin/bash + #SBATCH --nodes=1 + #SBATCH --ntasks=2 + #SBATCH --cpus-per-task=2 + #SBATCH --time=00:01:00 + #SBATCH --mem=20GB + #SBATCH --account=aurorahpc + + module load gcc + module load openmpi + + srun --cpus-per-task 2 -n 2 ./phost.openmpi -F + ``` From 6b786f0e876232a1dbaf28b67695d9a8c90d5a76 Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 13 Jan 2026 14:52:16 -0700 Subject: [PATCH 22/23] Change hostnames --- docs/Documentation/Systems/Gila/index.md | 2 +- docs/Documentation/Systems/Gila/modules.md | 4 ++-- docs/Documentation/Systems/Gila/running.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index fedd691c5..16079908f 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -17,7 +17,7 @@ To access Gila, log in to the NLR network and connect via ssh to: To use the Grace Hopper nodes, connect via ssh to: - gila-hopper-login1.hpc.nrel.gov + gila-arm.hpc.nrel.gov #### For External Collaborators: There are no external-facing login nodes for Gila. There are two options to connect: diff --git a/docs/Documentation/Systems/Gila/modules.md b/docs/Documentation/Systems/Gila/modules.md index 3a7d67f24..3cd946317 100644 --- a/docs/Documentation/Systems/Gila/modules.md +++ b/docs/Documentation/Systems/Gila/modules.md @@ -21,8 +21,8 @@ The two hardware stacks are almost identical in terms of available modules. 
Howe To ensure proper module compatibility, connect to the login node corresponding to your target compute architecture: -- **x86 architecture**: Use `gila-login-1` -- **ARM architecture**: Use `gila-hopper-login1` (Grace Hopper nodes) +- **x86 architecture**: Use `gila.hpc.nrel.gov` +- **ARM architecture**: Use `gila-arm.hpc.nrel.gov` (Grace Hopper nodes) !!! warning diff --git a/docs/Documentation/Systems/Gila/running.md b/docs/Documentation/Systems/Gila/running.md index 97e212c4a..2b292f023 100644 --- a/docs/Documentation/Systems/Gila/running.md +++ b/docs/Documentation/Systems/Gila/running.md @@ -18,7 +18,7 @@ GPU nodes in Gila have 8 NVIDIA A100 GPUs running on x86-64 __Intel Xeon Icelake ### Grace Hopper Nodes -Gila has 6 NVIDIA Grace Hopper nodes. To use the Grace Hopper nodes, submit your jobs to the `gh` partition from the `gila-hopper-login1.hpc.nrel.gov` login node. Each Grace Hopper node has a 72 core NVIDIA Grace CPU and an NVIDIA GH200 GPU, with 96GB of VRAM and 470GB of RAM. They have one socket and NUMA node. +Gila has 6 NVIDIA Grace Hopper nodes. To use the Grace Hopper nodes, submit your jobs to the `gh` partition from the `gila-arm.hpc.nrel.gov` login node. Each Grace Hopper node has a 72 core NVIDIA Grace CPU and an NVIDIA GH200 GPU, with 96GB of VRAM and 470GB of RAM. They have one socket and NUMA node. Please note - the __NVIDIA Grace CPUs__ run on a different processing architecture (ARM64) than both the __Intel Xeon Icelake CPUs__ (x86-64) and the __AMD EPYC Milan__ (x86-64). Any application that is manually compiled by a user and intended to be used on the Grace Hopper nodes __MUST__ be compiled on the Grace Hopper nodes themselves. From ac395de86fcecb5e194b24f4e5842e4e547e5e2e Mon Sep 17 00:00:00 2001 From: hyandt Date: Tue, 13 Jan 2026 16:14:34 -0700 Subject: [PATCH 23/23] remove funding specification --- docs/Documentation/Systems/Gila/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/Documentation/Systems/Gila/index.md b/docs/Documentation/Systems/Gila/index.md index 16079908f..5c1f79c4c 100644 --- a/docs/Documentation/Systems/Gila/index.md +++ b/docs/Documentation/Systems/Gila/index.md @@ -1,7 +1,7 @@ # About Gila -Gila is an OpenHPC-based cluster. The [nodes](./running.md#gila-compute-nodes) run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and intended for LDRD, SPP or Office of Science workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time. +Gila is an OpenHPC-based cluster. Most [nodes](./running.md#gila-compute-nodes) run as virtual machines, with the exception of the Grace Hopper nodes, in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads. Check back regularly as the configuration and capabilities for Gila are augmented over time. ## Gila Access and Allocations