Merged
Commits
28 commits
e4774ce
first pass on available storage and data transfers
RobJY Feb 20, 2025
ee20363
added globus page and images
RobJY Feb 21, 2025
14f1a26
first pass of storage best practices
RobJY Feb 21, 2025
2a5954f
added first pass of system status page
RobJY Feb 21, 2025
778d9c3
added first pass of storage: rps, data management and intro
RobJY Feb 24, 2025
359d0a6
added first pass of datasets and sharing data on hpc pages
RobJY Feb 24, 2025
bc7937c
first pass of large number of small files page
RobJY Feb 24, 2025
507b884
first pass of transferring cloud storage data with rclone
RobJY Feb 24, 2025
2068f85
first pass of software on greene
RobJY Feb 25, 2025
2ae4fdc
first pass of singularity: run custom...
RobJY Feb 25, 2025
bc5e0cf
moved a couple files as mentioned in ticket
RobJY Feb 25, 2025
20d1e57
changed embedded iframe to link for current hpc rps stakeholders
RobJY Feb 26, 2025
168940c
fixes from issue ticket
RobJY Feb 26, 2025
ad4d913
lint fixes
RobJY Feb 26, 2025
5119a7a
more lint fixes
RobJY Feb 26, 2025
2eb04d4
more lint fixes
RobJY Feb 26, 2025
7227908
renamed to pass CI and fixed note
RobJY Feb 26, 2025
db4e2e5
replace removed category json file to fix merge conflict
RobJY Feb 26, 2025
0bdd3de
fixed spacing
RobJY Feb 26, 2025
12d9620
Merge branch 'main' into storage
RobJY Feb 26, 2025
c8f16e0
fixed broken link
RobJY Feb 26, 2025
c6cb6f2
fixed more broken links
RobJY Feb 26, 2025
946628c
proposed change to data management doc
RobJY Feb 28, 2025
2dfabce
removed navigation links from bottom of data management page
RobJY Feb 28, 2025
7a3dc32
update storage intro with suggestions from PR
RobJY Feb 28, 2025
3155134
updates from PR suggestions
RobJY Feb 28, 2025
48200ab
changes suggested in PR
RobJY Mar 3, 2025
ebcc77e
bug fix
RobJY Mar 3, 2025
14 changes: 7 additions & 7 deletions docs/hpc/02_connecting_to_hpc/01_connecting_to_hpc.md
@@ -10,9 +10,9 @@ The following sections will outline basic ways to connect to the Greene cluster.

If you are connecting from a remote location that is not on the NYU network (your home for example), you have two options:

1. **VPN Option:** [Set up your computer to use the NYU VPN][nyu vpn link]. Once you've created a VPN connection, you can proceed as if you were connected to the NYU network.

2. **Gateway Option:** Go through our gateway servers (example below). Gateways are designed to support only a very minimal set of commands, and their only purpose is to let users connect to HPC systems without needing to first connect to the VPN.

You do not need to use the NYU VPN or gateways if you are connected to the NYU network (a wired connection in your office or WiFi) or if you already have a VPN connection initiated. In this case you can SSH directly to the clusters.
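
For reference, here is a minimal sketch of both options from a terminal; `<NetID>` is a placeholder for your NYU NetID, and the hostnames are the ones used elsewhere on this page:

```sh
# On the NYU network or over the VPN: connect directly to the cluster
ssh <NetID>@greene.hpc.nyu.edu

# Off campus without the VPN: log in to a gateway first, then hop to the cluster
ssh <NetID>@gw.hpc.nyu.edu
ssh <NetID>@greene.hpc.nyu.edu   # run this second command from the gateway session
```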

@@ -52,16 +52,16 @@ Instructions on WSL installation can be found here: [https://docs.microsoft.com/

Instead of typing your password every time you need to log in, you can also specify an ssh key.

- Only do this on a computer you trust

- Generate an SSH key pair (terminal in Linux/Mac or cmd/WSL in Windows):
[https://www.ssh.com/ssh/keygen/][ssh instructions keygen link]

- Note the path to the SSH key files. Don't share key files with anybody - anybody with this key file can log in to your account

- Log into the cluster using your regular login/password and then append the contents of the generated public key file (the one with `.pub`) to `$HOME/.ssh/authorized_keys` on the cluster

- The next time you log into the cluster, no password will be required (a command sketch follows below)
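
A minimal sketch of these steps, assuming OpenSSH on your local machine; `ed25519` is just one common key type, and `ssh-copy-id` is an optional shortcut for appending the public key:

```sh
# On your local machine: generate a key pair (accept the default path; a passphrase is recommended)
ssh-keygen -t ed25519

# Append the public key to $HOME/.ssh/authorized_keys on the cluster
# (ssh-copy-id does this for you; you can also paste the .pub contents in manually)
ssh-copy-id -i ~/.ssh/id_ed25519.pub <NetID>@greene.hpc.nyu.edu
```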

For additional recommendations on how to configure your SSH sessions, see the \[ssh configuring and x11 forwarding page].

@@ -129,17 +129,17 @@ This is the equivalent to running "ssh hpcgwtunnel" in the reusable tunnel instructions above.

### Creating the tunnel

1. First open Putty and prepare to log in to gw.hpc.nyu.edu. If you saved your session during that process, you can load it by selecting from the "Saved Sessions" box and hitting "Load". Don't hit "Open" yet!

2. Under "Connection" -> "SSH", just below "X11", select "Tunnels
2. Under "Connection" -> "SSH", just below "X11", select "Tunnels

3. Write "8026" (the port number) in the "Source port" box, and "greene.hpc.nyu.edu:22" (the machine you wish to tunnel to - 22 is the port that ssh listens on) in the "Destination" box
3. Write "8026" (the port number) in the "Source port" box, and "greene.hpc.nyu.edu:22" (the machine you wish to tunnel to - 22 is the port that ssh listens on) in the "Destination" box

4. Click "Add". You can repeat step 3 with a different port number and a different destination. If you do this you will create multiple tunnels, one to each destination
4. Click "Add". You can repeat step 3 with a different port number and a different destination. If you do this you will create multiple tunnels, one to each destination

5. Before hitting "Open", go back to the "Sessions" page, give the session a name ("hpcgw_tunnel") and hit "Save". Then next time you need not do all this again, just load the saved session
5. Before hitting "Open", go back to the "Sessions" page, give the session a name ("hpcgw_tunnel") and hit "Save". Then next time you need not do all this again, just load the saved session

6. Hit "Open" to login in to gw.hpc.nyu.edu and create the tunnel. A terminal window will appear, asking for your login name (NYU NetID) and password (NYU password). Windows may also ask you to allow certain connections through its firewall - this is so you can ssh to port 8026 on your workstation - the entrance to the tunnel
6. Hit "Open" to login in to gw.hpc.nyu.edu and create the tunnel. A terminal window will appear, asking for your login name (NYU NetID) and password (NYU password). Windows may also ask you to allow certain connections through its firewall - this is so you can ssh to port 8026 on your workstation - the entrance to the tunnel


:::note
@@ -150,19 +150,19 @@ Using your SSH tunnel: To log in via the tunnel, first the tunnel must be open. You will make an ssh connection to port 8026 on your own workstation, and the tunnel will take care of delivering the connection to Greene.

Starting the tunnel: During a session, you need only do this once - as long as the tunnel is open, new connections will go over it.

1. Start Putty.exe (again, if necessary), and load the session you saved during the procedure above

2. Hit "Open", and log in to the bastion host with your NYU NetID and password. This will create the tunnel.
2. Hit "Open", and log in to the bastion host with your NYU NetID and password. This will create the tunnel.

### Logging in via your SSH tunnel

1. Start the second Putty.exe. In the "Host Name" box, write "localhost" and in the "Port" box, write "8026" (or whichever port number you specified when you set up the tunnel in the procedure above). We use "localhost" because the entrance of the tunnel is actually on this workstation, at port 8026

2. Go to "Connections" -> "SSH" -> "X11" and check "Enable X11 forwarding"
2. Go to "Connections" -> "SSH" -> "X11" and check "Enable X11 forwarding"

3. Optionally, give this session a name (in "Saved Sessions") and hit "Save" to save it. Then next time instead of steps 1 and 2 you can simply load this saved session

4. Hit "Open". You will again get a terminal window asking for your login (NYU NetID) and password (NYU password). You are now logged in to the HPC cluster!
4. Hit "Open". You will again get a terminal window asking for your login (NYU NetID) and password (NYU password). You are now logged in to the HPC cluster!

## X11 Forwarding

1 change: 0 additions & 1 deletion docs/hpc/03_storage/01_intro.md

This file was deleted.

131 changes: 131 additions & 0 deletions docs/hpc/03_storage/01_intro_and_data_management.mdx
@@ -0,0 +1,131 @@
# HPC Storage

The NYU HPC clusters are served by a General Parallel File System (GPFS) cluster and an all-flash VAST storage cluster.

The NYU HPC team supports data storage, transfer, and archival needs on the HPC clusters, as well as collaborative research services like the [Research Project Space (RPS)](./05_research_project_space.md).

## Highlights
- 9.5 PB Total GPFS Storage
- Up to 78 GB per second read speeds
- Up to 650k input/output operations per second (IOPS)
- Research Project Space (RPS): RPS volumes provide working spaces for sharing data and code amongst project or lab members

## Introduction to HPC Data Management
The NYU HPC Environment provides access to a number of ***file systems*** to better serve the needs of researchers managing data during the various stages of the research data lifecycle (data capture, analysis, archiving, etc.). Each HPC file system comes with different features, policies, and availability.

In addition, a number of ***data management tools*** are available for data transfers and data sharing, along with recommended best practices and various scenarios and use cases for managing data in the HPC Environment.

Multiple ***public data sets*** are available to all users of the HPC environment, such as a subset of The Cancer Genome Atlas (TCGA), the Million Song Database, ImageNet, and Reference Genomes.

Below is a list of file systems with their characteristics and a summary table. Reviewing the list of available file systems and the various scenarios/use cases presented below can help you select the right file systems for a research project. As always, if you have any questions about data storage in the HPC environment, you can request a consultation with the HPC team by sending email to [[email protected]](mailto:[email protected]).

### Data Security Warning
::::warning
#### Moderate Risk Data - HPC Approved
- The HPC Environment has been approved for storing and analyzing **Moderate Risk research data**, as defined in the [NYU Electronic Data and System Risk Classification Policy](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/electronic-data-and-system-risk-classification.html).
- **High Risk** research data, such as those that include Personally Identifiable Information (**PII**), electronic Protected Health Information (**ePHI**), or Controlled Unclassified Information (**CUI**), **should NOT be stored in the HPC Environment**.
:::note
Only the Office of Sponsored Projects (OSP) and the Global Office of Information Security (GOIS) are empowered to classify the risk categories of data.
:::
:::tip
#### High Risk Data - Secure Research Data Environments (SRDE) Approved
Because the HPC system is not approved for High Risk data, we recommend using an approved system like the [Secure Research Data Environments (SRDE)](../../srde/01_getting_started/01_intro.md).
:::
::::

### Data Storage options in the HPC Environment
#### User Home Directories
Every individual user has a home directory (under **`/home/$USER`**, environment variable **`$HOME`**) for permanently storing code and important configuration files. Home directories provide limited storage space (**50 GB**) and a limit of **30,000** inodes (files) per user. Users can check their quota utilization using the [myquota](http://www.info-ren.org/projects/ckp/tech/software/version/myquota.html) command.
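
For example (a minimal sketch; the exact output columns are not shown here):

```sh
# Run on a Greene login node to see your disk and inode usage on each file system
myquota
```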

User home directories are backed up daily and old files under **`$HOME`** are not purged.

The user home directories are available on all HPC clusters (Greene) and on every cluster node (login nodes, compute nodes) as well as the Data Transfer Nodes (gDTN).

:::warning
Avoid changing file and directory permissions in your home directory to allow other users to access files.
:::
User home directories are not ideal for sharing files and folders with other users. HPC Scratch or Research Project Space (RPS) are better file systems for sharing data.

:::warning
**One of the most common issues that users report regarding their home directories is running out of inodes,** i.e. the number of files stored under their home directory exceeds the inode limit, which by default is set to 30,000 files. This typically occurs when users install software under their home directories, for example when working with Conda and Julia environments, which involve many small files.
:::

:::tip
- To find out the current space and inode quota utilization and the distribution of files under your home directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
- **Working with Conda environments:** To avoid running out of inodes in your home directory, the HPC team recommends **setting up conda environments with Singularity overlay images**.
:::

#### HPC Scratch
The HPC scratch file system is where most users store research data needed during the analysis phase of their research projects. The scratch file system provides ***temporary*** storage for datasets needed for running jobs.

Files stored in the HPC scratch file system are subject to the <ins>**HPC Scratch old file purging policy:** Files on the /scratch file system that have not been accessed for 60 or more days will be purged.</ins>

Every user has a dedicated scratch directory (**`/scratch/$USER`**) with a **5 TB** disk quota and a limit of **1,000,000 inodes** (files) per user.

The scratch file system is available on all nodes (compute, login, etc.) on Greene as well as the Data Transfer Nodes (gDTN).

:::warning
There are **NO backups** of the scratch file system. ***Files that are deleted accidentally or removed due to storage system failures CANNOT be recovered.***
:::

:::tip

- Since there are ***no backups of the HPC Scratch file system***, users should not put important source code, scripts, libraries, or executables in `/scratch`. These important files should be stored in file systems that are backed up, such as `/home` or [Research Project Space (RPS)](./05_research_project_space.md). Code can also be stored in a ***git*** repository.
- ***Old file purging policy on HPC Scratch:*** <ins>All files on the HPC Scratch file system that have not been accessed ***for more than 60 days*** will be removed. It is a policy violation to use scripts to change the file access time.</ins> Any user found to be violating this policy will have their HPC account locked. A second violation may result in the account being deactivated.
- To find out your current disk space and inode quota utilization and the distribution of files under your scratch directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
- Once a research project completes, users should archive their important files in the [HPC Archive file system](./01_intro_and_data_management.mdx#hpc-archive).
:::

#### HPC Vast
The HPC Vast all-flash file system is an HPC file system where users store research data needed during the analysis phase of their research projects, particularly high-I/O data that can bottleneck on the scratch file system. The Vast file system provides ***temporary*** storage for datasets needed for running jobs.

Files stored in the HPC vast file system are subject to the <ins>***HPC Vast old file purging policy:*** Files on the `/vast` file system that have not been accessed for **60 or more days** will be purged.</ins>

Every user has a dedicated vast directory (**`/vast/$USER`**) with a **2 TB** disk quota and a limit of **5,000,000 inodes** (files) per user.

The vast file system is available on all nodes (compute, login, etc.) on Greene as well as the Data Transfer Nodes (gDTN).

:::warning
There are **NO backups** of the vast file system. ***Files that are deleted accidentally or removed due to storage system failures CANNOT be recovered.***
:::

:::tip
- Since there are ***no backups of the HPC Vast file system***, users should not put important source code, scripts, libraries, or executables in `/vast`. These important files should be stored in file systems that are backed up, such as `/home` or [Research Project Space (RPS)](./05_research_project_space.md). Code can also be stored in a ***git*** repository.
- ***Old file purging policy on HPC Vast:*** <ins>All files on the HPC Vast file system that have not been accessed ***for more than 60 days will be removed.*** It is a policy violation to use scripts to change the file access time.</ins> Any user found to be violating this policy will have their HPC account locked. A second violation may result in the account being deactivated.
- To find out your current disk space and inode quota utilization and the distribution of files under your vast directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
- Once a research project completes, users should archive their important files in the [HPC Archive file system](./01_intro_and_data_management.mdx#hpc-archive).
:::

#### HPC Research Project Space
The HPC Research Project Space (RPS) provides data storage for research projects that is easily shared amongst collaborators, ***backed up***, and ***not subject to the old file purging policy***. HPC RPS was introduced to ease data management in the HPC environment and to eliminate the need to frequently copy files between the Scratch and Archive file systems by keeping all project files under one area. ***These benefits of the HPC RPS come at a cost***. The cost is determined by the allocated disk space and the number of files (inodes).
- For detailed information about RPS see: [HPC Research Project Space](./05_research_project_space.md)

#### HPC Work
The HPC team makes available a number of public data sets that are commonly used in analysis jobs. The data sets are available read-only under **`/scratch/work/public`**.
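
For example, to browse what is available (a minimal sketch):

```sh
# List the read-only public data sets from any Greene node
ls /scratch/work/public
```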

For some of the datasets, users must provide a signed usage agreement before access is granted.

Public datasets available on the HPC clusters can be viewed on the [Datasets page](../01_getting_started/01_intro.md).

#### HPC Archive
Once the Analysis stage of the research data lifecycle has completed, <ins>_HPC users should **tar** their data and code into a single tar.gz file and then copy the file to their archive directory (**`/archive/$USER`**)_.</ins> The HPC Archive file system is not accessible by running jobs; it is suitable for long-term data storage. Each user has a default disk quota of **2 TB** and a limit of ***20,000 inodes (files)***. The rather low limit on the number of inodes per user is intentional. The archive file system is available only ***on login nodes*** of Greene. The archive file system is backed up daily.

- Here is an example ***tar*** command that combines the data in a directory named ***my_run_dir*** under ***`$SCRATCH`*** and outputs the tar file in the user's ***`$ARCHIVE`***:
```sh
# to archive `$SCRATCH/my_run_dir`
tar cvf $ARCHIVE/simulation_01.tar -C $SCRATCH my_run_dir
```
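
For completeness, here is a compressed variant matching the tar.gz suggestion above, plus the corresponding extraction command (the file names are only examples):

```sh
# create a compressed archive of $SCRATCH/my_run_dir
tar czvf $ARCHIVE/simulation_01.tar.gz -C $SCRATCH my_run_dir

# later, restore it back into scratch
tar xzvf $ARCHIVE/simulation_01.tar.gz -C $SCRATCH
```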

#### NYU (Google) Drive
Google Drive ([NYU Drive](https://www.nyu.edu/life/information-technology/communication-and-collaboration/document-collaboration-and-sharing/nyu-drive.html)) is accessible from the NYU HPC environment and provides an option to users who wish to archive data or share data with external collaborators who do not have access to the NYU HPC environment.

Currently (March 2021) there is no limit on the amount of data a user can store on Google Drive and there is no cost associated with storing data on Google Drive (although we hear rumors that free storage on Google Drive may be ending soon).

However, there are limits on the data transfer rate when moving data to/from Google Drive, so transferring many small files is not efficient.

Please read the [Instructions on how to use cloud storage within the NYU HPC Environment](./09_transferring_cloud_storage_data_with_rclone.md).
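
As a rough illustration (a hypothetical sketch; the remote name `nyudrive` is only an example and must first be configured with `rclone config` as described on the linked page):

```sh
# Copy an archived result set from scratch to Google Drive
rclone copy $SCRATCH/simulation_01.tar.gz nyudrive:hpc-archive/
```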

#### HPC Storage Mounts Comparison Table
<iframe width="100%" height="300em" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vT-q0rRueYg1Be_gcWSghB-GGFDonP8DaXNnm8Qi036w-Vi_l7CCOav4IPxi1yZy8TSnTRFy7S5dNTJ/pubhtml?widget=true&amp;headers=false"></iframe>

Please see the next page for best practices for data management on NYU HPC systems.