# Gila Filesystem Architecture Overview

## Home Directories: /home

Home directories are mounted as `/home/<username>` and are hosted under the user's initial `/projects` directory. Quotas in /home are counted as part of the quota of that project's storage allocation.

## Project Storage: /projects

Each active project is granted a subdirectory under `/projects/<projectname>`. This is where the bulk of data is expected to reside, and where jobs should generally be run from. Storage quotas are based on the allocation award.
Quota usage can be viewed at any time by changing into the project directory with `cd` and running `df -h`, which shows the total, used, and remaining available space for the mounted project directory.
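For example, a minimal sketch (the project name is a placeholder):

```bash
cd /projects/<projectname>
df -h .        # report usage for the filesystem backing the current directory
```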
## Scratch Storage: /scratch/username and /scratch/username/jobid

The scratch filesystem on Gila is a 79TB spinning-disk Ceph filesystem, accessible from both login and compute nodes. The default writable path for scratch use is `/scratch/<username>`.
## Temporary space: $TMPDIR
When a job starts, the environment variable `$TMPDIR` is set to `/scratch/<username>/<jobid>` for the duration of the job. This is temporary space only and is purged when your job completes. Please be sure to use this path instead of /tmp for your temporary files.

There is no expectation of data longevity in the temporary space. If desired data is stored here during the job, please be sure to copy it to a /projects directory as part of the job script before the job finishes.
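As a sketch, the end of a job script might stage results back to project storage before exiting (the program and project names below are hypothetical):

```bash
cd $TMPDIR
./my_simulation > results.out          # hypothetical application writing to $TMPDIR

# copy anything worth keeping to /projects before the job (and $TMPDIR) ends
cp results.out /projects/<projectname>/results/
```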
## Mass Storage System

There is no Mass Storage System for deep archive storage on Gila.
## Backups and Snapshots

There are no backups or snapshots of data on Gila. Though the system is protected from hardware failure by multiple layers of redundancy, please keep regular backups of important data stored on Gila, and consider using a version control system (such as Git) for important code.
# About Gila
Gila is an OpenHPC-based cluster running on __Dual AMD EPYC 7532 Rome CPUs__ and __Intel Xeon Icelake CPUs with NVIDIA A100 GPUs__. The nodes run as virtual machines in a local virtual private cloud (OpenStack). Gila is allocated for NLR workloads and is intended for LDRD, SPP, or Office of Science work. Allocation decisions are made by the IACAC through the annual allocation request process. Check back regularly, as the configuration and capabilities of Gila are augmented over time.

*TODO: Update information about the allocations (include aurorahpc allocation info)*
## Accessing Gila

All NLR employees with an HPC account automatically have access to Gila. Please see the [System Access](https://www.nrel.gov/hpc/system-access.html) page for more information on accounts and allocations.
#### For NLR Employees:

To access Gila, log into the NLR network and connect via ssh:

```bash
ssh gila.hpc.nrel.gov
```
#### For External Collaborators:

There are currently no external-facing login nodes for Gila. There are two options to connect:

1. Connect to the [SSH gateway host](https://www.nrel.gov/hpc/ssh-gateway-connection.html) and log in with your username, password, and OTP code. Once connected, ssh to the login nodes as above.
1. Connect to the [HPC VPN](https://www.nrel.gov/hpc/vpn-connection.html) and ssh to the login nodes as above.
There are currently two login nodes. They share the same home directory, so work done on one will appear on the other. They are:

```
gila-login-1
gila-login-2
```

You may connect directly to a login node, but nodes may be cycled in and out of the pool. If a node is unavailable, try connecting to another login node or use the `gila.hpc.nrel.gov` round-robin address.
## Get Help with Gila

Please see the [Help and Support Page](../../help.md) for further information on how to seek assistance with Gila or your NLR HPC account.
## Building code

Do not build or run code on login nodes. Login nodes have limited CPU and memory available. Instead, start an interactive job on an appropriately provisioned compute or GPU node and partition for your work, and do your builds there (see the sketch below).
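A minimal sketch of requesting such an interactive session with Slurm (the account is a placeholder; partition names are listed on the Running on Gila page):

```bash
salloc --account=<myaccount> --partition=amd --nodes=1 --time=01:00:00
```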
Similarly, build your projects under `/projects/your_project_name/`, as home directories are **limited to 5GB** per user.

---
# Modules on Gila

On Gila, modules are deployed and organized slightly differently than on other NLR HPC systems. While the basic concepts of using modules remain the same, there are important differences in how modules are structured, discovered, and loaded. These differences are intentional and are designed to improve compatibility, reproducibility, and long-term maintainability. The sections below walk through these differences step by step.

The module system used on this cluster is [Lmod](../../Environment/lmod.md).
When you log in to Gila, three modules are loaded automatically by default:

1. `Core/25.05`
2. `DefApps`
3. `gcc/14.2.0`

!!! note
    The `DefApps` module is a convenience module that ensures both `Core` and `GCC` are loaded upon login or when you use `module restore`. It does not load additional software itself but guarantees that the essential environment is active.
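To confirm what is loaded at any point, the standard Lmod listing command applies:

```bash
module list
```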
## Module Structure on Gila

Modules on Gila are organized into two main categories: **Base Modules** and **Core Modules**. This structure is different from many traditional flat module trees and is designed to make software compatibility explicit and predictable.
### Base Modules

**Base modules** define the *software toolchain context* you are working in. Loading a base module changes which additional modules are visible and available.

Base modules allow users to:

* **Initiate a compiler toolchain**
    * Loading a specific compiler (for example, `gcc` or `oneapi`) establishes a toolchain
    * Once a compiler is loaded, only software built with and compatible with that compiler becomes visible when running `ml avail`
    * This behavior applies to both **GCC** and **Intel oneAPI** toolchains

* **Use Conda/Mamba environments**
    * Loading `miniforge3` enables access to Conda and Mamba for managing user-level Python environments

* **Access installed research applications**
    * Loading the `application` module exposes centrally installed research applications

* **Enable CUDA and GPU-enabled software**
    * Loading the `cuda` module provides access to CUDA
    * It also makes CUDA-enabled software visible in `module avail`, ensuring GPU-compatible applications are only shown when CUDA is loaded

In short, **base modules control which families of software are visible** by establishing the appropriate environment and compatibility constraints, as the sketch below illustrates.
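A minimal sketch of this behavior, using the `oneapi` base module named above:

```bash
module avail          # only Core modules and base modules are listed
module load oneapi    # establish the Intel oneAPI toolchain
module avail          # software built with oneapi is now visible as well
```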
### Core Modules

**Core modules** are independent of any specific compiler or toolchain.

They:

* Do **not** rely on a particular compiler
* Contain essential utilities, libraries, and tools
* Are intended to work with **any toolchain**

Core modules are typically always available and can be safely loaded regardless of which compiler, CUDA version, or toolchain is active.

This separation between Base and Core modules ensures:

* Clear compiler compatibility
* Reduced risk of mixing incompatible software
* A cleaner and more predictable module environment
## MPI-Enabled Software

MPI-enabled software modules are identified by a `-mpi` suffix at the end of the module name.

Similar to compiler modules, MPI-enabled software is **not visible by default**. These modules only appear after an MPI implementation is loaded. Supported MPI implementations include `openmpi`, `mpich`, and `intelmpi`.

Loading an MPI implementation makes the MPI-enabled software that was installed with that specific MPI stack available when running `module avail`.

This behavior ensures that only software built against the selected MPI implementation is exposed, helping users avoid mixing incompatible MPI libraries. A short sketch follows.
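A minimal sketch, assuming a hypothetical package `foo` built against Open MPI:

```bash
module load openmpi     # select an MPI implementation
module avail            # modules such as foo-mpi built with openmpi now appear
```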
!!! note
    To determine whether a software package is available on the cluster, use `module spider`. This command lists **all available versions and configurations** of a given software package, including those that are not currently visible with `module avail`.

    To find out which modules must be loaded in order to access a specific software configuration, run `module spider` with the **full module name**. This will show the required modules that need to be loaded to make that software available.
## Containers

Container tools such as **Apptainer** and **Podman** do not require module files on this cluster. They are available on the system **by default** and are already included in your `PATH`.

This means you can use Apptainer and Podman at any time without loading a specific module, regardless of which compiler, MPI, or CUDA toolchain is currently active.
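For example, a quick sanity check (the container image here is an arbitrary public one, not something provided by the cluster):

```bash
apptainer --version
apptainer exec docker://ubuntu:22.04 cat /etc/os-release
```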
## Module Commands: restore, avail, and spider

### module restore

The `module restore` command reloads the set of modules that were active at the start of your login session or at the last checkpoint. This is useful if you have unloaded or swapped modules and want to return to your original environment.

Example:

```bash
module restore
```

This will restore the default modules that were loaded at login, such as `Core/25.05`, `DefApps`, and `gcc/14.2.0`.
### module avail

The `module avail` command lists all modules that are **currently visible** in your environment. This includes modules that are compatible with the loaded compiler, MPI, or CUDA base modules.

Example:

```bash
module avail
```

You can also search for a specific software package:

```bash
module avail python
```
### module spider

The `module spider` command provides a **complete listing of all versions and configurations** of a software package, including those that are **not currently visible** with `module avail`. It also shows **which modules need to be loaded** to make a specific software configuration available.

Example:

```bash
module spider python/3.10
```

The output will indicate any prerequisite modules you need to load before the software becomes available.

!!! tip
    Use `module avail` for quick checks and `module spider` when you need full details or to resolve dependencies for specific versions.
## Frequently Asked Questions

??? note "I can't find the module I need."
    Please email [HPC-Help](mailto:HPC-Help@nrel.gov). The Apps team will get in touch with you to provide the module you need.

??? note "I need to mix and match compilers and libraries/MPI. How can I do that?"
    Modules on Gila do not support mixing and matching. For example, if `oneapi` is loaded, only software compiled with `oneapi` will appear. If you require a custom combination of software stacks, you are encouraged to use **Spack** to deploy your stack. Please contact [HPC-Help](mailto:HPC-Help@nrel.gov) to be matched with a Spack expert.

??? note "Can I use Miniforge with other modules?"
    While it is technically possible, Miniforge is intended to provide an isolated environment separate from external modules. Be careful with the order in which modules are loaded, as this can impact your `PATH` and `LD_LIBRARY_PATH`.

??? note "What if I want a different CUDA version?"
    Other CUDA versions are available under **Core** modules. If you need additional versions, please reach out to [HPC-Help](mailto:HPC-Help@nrel.gov). Note that CUDA modules under Core do **not** automatically make CUDA-enabled software available; only CUDA modules under **Base** modules will load CUDA-enabled packages.
# Running on Gila

*Learn about compute nodes and job partitions on Gila.*
## Compute hosts

Compute nodes in Gila are virtualized. They are not configured as exclusive and can be shared by multiple users or jobs.
## GPU hosts

GPU nodes available in Gila have NVIDIA A100 GPUs running on __Intel Xeon Icelake CPUs__.

There are also 5 NVIDIA Grace Hopper nodes.
## Shared file systems

Gila's home directories are shared across all nodes. Each user has a quota of 5 GB. There are also `/scratch/$USER` and `/projects` spaces available across all nodes.
## Partitions

A list of partitions can be found by running the `sinfo` command. Here are the partitions as of 12/30/2025:

| Partition Name | CPU | GPU | Qty | RAM | Cores/node | AU Charge Factor |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| gpu | Intel Xeon Icelake | NVIDIA Tesla A100-80 | 1 | 910 GB | 42 | 12 |
| amd | 2x 30 Core AMD Epyc Milan | | 36 | 220 GB | 60 | 7 |
| gh | NVIDIA Grace | GH200 | 5 | 470 GB | 72 | 7 |
## Allocation Unit (AU) Charges

The equation for calculating the AU cost of a job on Gila is:

```
AU cost = (Walltime in hours * Number of Nodes * Charge Factor)
```

The walltime is the actual length of time that the job runs, in hours or fractions thereof.

The **Charge Factor** for each partition is listed in the table above.
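As a worked example (hypothetical job parameters): a job that runs for 2.5 hours on 4 nodes of the `amd` partition, which has a charge factor of 7, costs 2.5 * 4 * 7 = 70 AUs.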
## Operating Software

The Gila HPC cluster runs Rocky Linux 9.5.
<!-- Docs from Vermilion page: -->
<!-- ## Examples: Build and run simple applications

This section discusses how to compile and run a simple MPI application, as well as how to link against the Intel MKL library.

In the directory **/nopt/nrel/apps/210929a** you will see a subdirectory **example**. This contains a makefile for a simple hello world program written in both Fortran and C and several run scripts. The README.md file contains additional information, some of which is replicated here.

We will begin by creating a new directory and copying the source for a simple MPI test program. More details about the test program are available in the README.md file that accompanies it. Run the following commands to create a new directory and make a copy of the source code:

```bash
mkdir example
cd example
cp /nopt/nrel/apps/210929a/example/phostone.c .
```

### Compile and run with Intel MPI

First we will look at how to compile and run the application using Intel MPI. To build the application, we load the necessary Intel modules. Execute the following commands to load the modules and build the application, naming the output `phost.intelmpi`. Note that this application uses OpenMP as well as MPI, so we provide the `-fopenmp` flag to link against the OpenMP libraries.

```bash
ml intel-oneapi-mpi intel-oneapi-compilers
mpiicc -fopenmp phostone.c -o phost.intelmpi
```

The following batch script is an example that runs the job using two MPI ranks on a single node with two threads per rank. Save this script to a file such as `submit_intel.sh`, replace `<myaccount>` with the appropriate account, and submit using `sbatch submit_intel.sh`. Feel free to experiment with different numbers of tasks and threads. Note that multi-node jobs on Vermilion can be finicky, and applications may not scale as well as they do on other systems. At this time, it is not expected that multi-node jobs will always run successfully.

??? example "Intel MPI submission script"

    ```bash
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --exclusive
    #SBATCH --time=00:01:00
    #SBATCH --account=<myaccount>

    ml intel-oneapi-mpi intel-oneapi-compilers

    export OMP_NUM_THREADS=2
    export I_MPI_OFI_PROVIDER=tcp
    srun --mpi=pmi2 --cpus-per-task 2 -n 2 ./phost.intelmpi -F
    ```

Your output should look similar to the following:

```
MPI VERSION Intel(R) MPI Library 2021.9 for Linux* OS

task thread node name first task # on node core
0000 0000 vs-std-0044 0000 0000 0001
0000 0001 vs-std-0044 0000 0000 0000
0001 0000 vs-std-0044 0000 0001 0003
0001 0001 vs-std-0044 0000 0001 0002
```

### Link Intel's MKL library

The `intel-oneapi-mkl` module is available for linking against Intel's MKL library. To build against MKL using the Intel compilers icc or ifort, you normally just need to add the flag `-qmkl`. There are examples in the directory `/nopt/nrel/apps/210929a/example/mkl`, and there is a Readme.md file that explains in a bit more detail.

To compile a simple test program that links against MKL, run:

```bash
cp /nopt/nrel/apps/210929a/example/mkl/mkl.c .

ml intel-oneapi-mkl intel-oneapi-compilers
icc -O3 -qmkl mkl.c -o mkl
```

An example submission script is:

??? example "Intel MKL submission script"

    ```bash
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --exclusive
    #SBATCH --time=00:01:00
    #SBATCH --account=<myaccount>

    source /nopt/nrel/apps/210929a/myenv.2110041605
    ml intel-oneapi-mkl intel-oneapi-compilers gcc

    ./mkl
    ```

### Compile and run with Open MPI

!!! warning

    Please note that multi-node jobs are not currently supported with Open MPI.

Use the following commands to load the Open MPI modules and compile the test program into an executable named `phost.openmpi`:

```bash
ml gcc openmpi
mpicc -fopenmp phostone.c -o phost.openmpi
```

The following is an example script that runs two tasks on a single node, with two threads per task:

??? example "Open MPI submission script"

    ```bash
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --exclusive
    #SBATCH --time=00:01:00
    #SBATCH --account=<myaccount>

    ml gcc openmpi

    export OMP_NUM_THREADS=2
    mpirun -np 2 --map-by socket:PE=2 ./phost.openmpi -F
    ```

## Running VASP on Vermilion

Please see the [VASP page](../../Applications/vasp.md) for detailed information and recommendations for running VASP on Vermilion. -->