Merged
Commits
28 commits
e4774ce
first pass on available storage and data transfers
RobJY Feb 20, 2025
ee20363
added globus page and images
RobJY Feb 21, 2025
14f1a26
first pass of storage best practices
RobJY Feb 21, 2025
2a5954f
added first pass of system status page
RobJY Feb 21, 2025
778d9c3
added first pass of storage: rps, data management and intro
RobJY Feb 24, 2025
359d0a6
added first pass of datasets and sharing data on hpc pages
RobJY Feb 24, 2025
bc7937c
first pass of large number of small files page
RobJY Feb 24, 2025
507b884
first pass of transferring cloud storage data with rclone
RobJY Feb 24, 2025
2068f85
first pass of software on greene
RobJY Feb 25, 2025
2ae4fdc
first pass of singularity: run custom...
RobJY Feb 25, 2025
bc5e0cf
moved a couple files as mentioned in ticket
RobJY Feb 25, 2025
20d1e57
changed embedded iframe to link for current hpc rps stakeholders
RobJY Feb 26, 2025
168940c
fixes from issue ticket
RobJY Feb 26, 2025
ad4d913
lint fixes
RobJY Feb 26, 2025
5119a7a
more lint fixes
RobJY Feb 26, 2025
2eb04d4
more lint fixes
RobJY Feb 26, 2025
7227908
renamed to pass CI and fixed note
RobJY Feb 26, 2025
db4e2e5
replace removed category json file to fix merge conflict
RobJY Feb 26, 2025
0bdd3de
fixed spacing
RobJY Feb 26, 2025
12d9620
Merge branch 'main' into storage
RobJY Feb 26, 2025
c8f16e0
fixed broken link
RobJY Feb 26, 2025
c6cb6f2
fixed more broken links
RobJY Feb 26, 2025
946628c
proposed change to data management doc
RobJY Feb 28, 2025
2dfabce
removed navigation links from bottom of data management page
RobJY Feb 28, 2025
7a3dc32
update storage intro with suggestions from PR
RobJY Feb 28, 2025
3155134
updates from PR suggestions
RobJY Feb 28, 2025
48200ab
changes suggested in PR
RobJY Mar 3, 2025
ebcc77e
bug fix
RobJY Mar 3, 2025
14 changes: 7 additions & 7 deletions docs/hpc/02_connecting_to_hpc/01_connecting_to_hpc.md
@@ -10,9 +10,9 @@ The following sections will outline basic ways to connect to the Greene cluster.

If you are connecting from a remote location that is not on the NYU network (your home for example), you have two options:

1. **VPN Option:** [Set up your computer to use the NYU VPN][nyu vpn link]. Once you've created a VPN connection, you can proceed as if you were connected to the NYU network.

2. **Gateway Option:** Go through our gateway servers (example below). Gateways are designed to support only a very minimal set of commands, and their only purpose is to let users connect to HPC systems without needing to first connect to the VPN.

You do not need to use the NYU VPN or gateways if you are connected to the NYU network (a wired connection in your office or WiFi) or if you already have a VPN connection initiated. In this case you can SSH directly to the clusters.
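
For reference, here is a minimal sketch of both options from a terminal; `<NetID>` is a placeholder for your NYU NetID, and the hostnames are the ones used elsewhere on this page:

```sh
# On the NYU network or over the VPN: connect directly to the cluster
ssh <NetID>@greene.hpc.nyu.edu

# Off campus without the VPN: log in to a gateway first, then hop to the cluster
ssh <NetID>@gw.hpc.nyu.edu
ssh <NetID>@greene.hpc.nyu.edu   # run this second command from the gateway session
```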

@@ -52,16 +52,16 @@ Instructions on WSL installation can be found here: [https://docs.microsoft.com/

Instead of typing your password every time you need to log in, you can also specify an ssh key.

- Only do this on a computer you trust

- Generate an SSH key pair (terminal in Linux/Mac or cmd/WSL in Windows):
[https://www.ssh.com/ssh/keygen/][ssh instructions keygen link]

- Note the path to the SSH key files. Don't share key files with anybody - anybody with this key file can log in to your account

- Log into the cluster using your regular login/password and then append the contents of the generated public key file (the one with `.pub`) to `$HOME/.ssh/authorized_keys` on the cluster

- The next time you log into the cluster, no password will be required (a command sketch follows below)
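
A minimal sketch of these steps, assuming OpenSSH on your local machine; `ed25519` is just one common key type, and `ssh-copy-id` is an optional shortcut for appending the public key:

```sh
# On your local machine: generate a key pair (accept the default path; a passphrase is recommended)
ssh-keygen -t ed25519

# Append the public key to $HOME/.ssh/authorized_keys on the cluster
# (ssh-copy-id does this for you; you can also paste the .pub contents in manually)
ssh-copy-id -i ~/.ssh/id_ed25519.pub <NetID>@greene.hpc.nyu.edu
```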

For additional recommendations on how to configure your SSH sessions, see the \[ssh configuring and x11 forwarding page].

@@ -129,17 +129,17 @@ This is the equivalent to running "ssh hpcgwtunnel" in the reusable tunnel instructions above.

### Creating the tunnel

1. First open Putty and prepare to log in to gw.hpc.nyu.edu. If you saved your session during that process, you can load it by selecting from the "Saved Sessions" box and hitting "Load". Don't hit "Open" yet!

2. Under "Connection" -> "SSH", just below "X11", select "Tunnels
2. Under "Connection" -> "SSH", just below "X11", select "Tunnels

3. Write "8026" (the port number) in the "Source port" box, and "greene.hpc.nyu.edu:22" (the machine you wish to tunnel to - 22 is the port that ssh listens on) in the "Destination" box
3. Write "8026" (the port number) in the "Source port" box, and "greene.hpc.nyu.edu:22" (the machine you wish to tunnel to - 22 is the port that ssh listens on) in the "Destination" box

4. Click "Add". You can repeat step 3 with a different port number and a different destination. If you do this you will create multiple tunnels, one to each destination
4. Click "Add". You can repeat step 3 with a different port number and a different destination. If you do this you will create multiple tunnels, one to each destination

5. Before hitting "Open", go back to the "Sessions" page, give the session a name ("hpcgw_tunnel") and hit "Save". Then next time you need not do all this again, just load the saved session
5. Before hitting "Open", go back to the "Sessions" page, give the session a name ("hpcgw_tunnel") and hit "Save". Then next time you need not do all this again, just load the saved session

6. Hit "Open" to login in to gw.hpc.nyu.edu and create the tunnel. A terminal window will appear, asking for your login name (NYU NetID) and password (NYU password). Windows may also ask you to allow certain connections through its firewall - this is so you can ssh to port 8026 on your workstation - the entrance to the tunnel
6. Hit "Open" to login in to gw.hpc.nyu.edu and create the tunnel. A terminal window will appear, asking for your login name (NYU NetID) and password (NYU password). Windows may also ask you to allow certain connections through its firewall - this is so you can ssh to port 8026 on your workstation - the entrance to the tunnel


:::note
@@ -150,19 +150,19 @@ Using your SSH tunnel: To log in via the tunnel, first the tunnel must be open. You will make an ssh connection to port 8026 on your own workstation, and the tunnel will take care of delivering the connection to Greene.

Starting the tunnel: During a session, you need only do this once - as long as the tunnel is open, new connections will go over it.

1. Start Putty.exe (again, if necessary), and load the session you saved during the procedure above

2. Hit "Open", and log in to the bastion host with your NYU NetID and password. This will create the tunnel.
2. Hit "Open", and log in to the bastion host with your NYU NetID and password. This will create the tunnel.

### Logging in via your SSH tunnel

1. Start the second Putty.exe. In the "Host Name" box, write "localhost" and in the "Port" box, write "8026" (or whichever port number you specified when you set up the tunnel in the procedure above). We use "localhost" because the entrance of the tunnel is actually on this workstation, at port 8026

2. Go to "Connections" -> "SSH" -> "X11" and check "Enable X11 forwarding"
2. Go to "Connections" -> "SSH" -> "X11" and check "Enable X11 forwarding"

3. Optionally, give this session a name (in "Saved Sessions") and hit "Save" to save it. Then next time instead of steps 1 and 2 you can simply load this saved session

4. Hit "Open". You will again get a terminal window asking for your login (NYU NetID) and password (NYU password). You are now logged in to the HPC cluster!
4. Hit "Open". You will again get a terminal window asking for your login (NYU NetID) and password (NYU password). You are now logged in to the HPC cluster!

## X11 Forwarding

1 change: 0 additions & 1 deletion docs/hpc/03_storage/01_intro.md

This file was deleted.

131 changes: 131 additions & 0 deletions docs/hpc/03_storage/01_intro_and_data_management.mdx
@@ -0,0 +1,131 @@
# HPC Storage

The NYU HPC clusters are served by a General Parallel File System (GPFS) cluster and an all-flash VAST storage cluster.

The NYU HPC team supports data storage, transfer, and archival needs on the HPC clusters, as well as collaborative research services like the [Research Project Space (RPS)](./05_research_project_space.md).

## Highlights
- 9.5 PB Total GPFS Storage
- Up to 78 GB per second read speeds
- Up to 650k input/output operations per second (IOPS)
- Research Project Space (RPS): RPS volumes provide working spaces for sharing data and code amongst project or lab members

## Introduction to HPC Data Management
The NYU HPC Environment provides access to a number of ***file systems*** to better serve the needs of researchers managing data during the various stages of the research data lifecycle (data capture, analysis, archiving, etc.). Each HPC file system comes with different features, policies, and availability.

In addition, a number of ***data management tools*** are available for data transfers and data sharing, along with recommended best practices and various scenarios and use cases for managing data in the HPC Environment.

Multiple ***public data sets*** are available to all users of the HPC environment, such as a subset of The Cancer Genome Atlas (TCGA), the Million Song Database, ImageNet, and Reference Genomes.

Below is a list of file systems with their characteristics and a summary table. Reviewing the list of available file systems and the various scenarios/use cases presented below can help you select the right file systems for a research project. As always, if you have any questions about data storage in the HPC environment, you can request a consultation with the HPC team by sending email to [[email protected]](mailto:[email protected]).

### Data Security Warning
::::warning
#### Moderate Risk Data - HPC Approved
- The HPC Environment has been approved for storing and analyzing **Moderate Risk research data**, as defined in the [NYU Electronic Data and System Risk Classification Policy](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/electronic-data-and-system-risk-classification.html).
- **High Risk** research data, such as those that include Personally Identifiable Information (**PII**), electronic Protected Health Information (**ePHI**), or Controlled Unclassified Information (**CUI**), **should NOT be stored in the HPC Environment**.
:::note
Only the Office of Sponsored Projects (OSP) and the Global Office of Information Security (GOIS) are empowered to classify the risk categories of data.
:::
:::tip
#### High Risk Data - Secure Research Data Environments (SRDE) Approved
Because the HPC system is not approved for High Risk data, we recommend using an approved system like the [Secure Research Data Environments (SRDE)](../../srde/01_getting_started/01_intro.md).
:::
::::

### Data Storage options in the HPC Environment
#### User Home Directories
Every individual user has a home directory (under **`/home/$USER`**, environment variable **`$HOME`**) for permanently storing code and important configuration files. Home directories provide limited storage space (**50 GB**) and a limit of **30,000** inodes (files) per user. Users can check their quota utilization using the [myquota](http://www.info-ren.org/projects/ckp/tech/software/version/myquota.html) command.
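
For example (a minimal sketch; the exact output columns are not shown here):

```sh
# Run on a Greene login node to see your disk and inode usage on each file system
myquota
```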

User home directories are backed up daily and old files under **`$HOME`** are not purged.

The user home directories are available on all HPC clusters (Greene) and on every cluster node (login nodes, compute nodes) as well as the Data Transfer Nodes (gDTN).

:::warning
Avoid changing file and directory permissions in your home directory to allow other users to access files.
:::
User home directories are not ideal for sharing files and folders with other users. HPC Scratch or Research Project Space (RPS) are better file systems for sharing data.

:::warning
**One of the most common issues that users report regarding their home directories is running out of inodes,** i.e. the number of files stored under their home directory exceeds the inode limit, which by default is set to 30,000 files. This typically occurs when users install software under their home directories, for example when working with Conda and Julia environments, which involve many small files.
:::

:::tip
- To find out the current space and inode quota utilization and the distribution of files under your home directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
- **Working with Conda environments:** To avoid running out of inodes in your home directory, the HPC team recommends **setting up conda environments with Singularity overlay images**.
:::

#### HPC Scratch
The HPC scratch file system is where most users store research data needed during the analysis phase of their research projects. The scratch file system provides ***temporary*** storage for datasets needed for running jobs.

Files stored in the HPC scratch file system are subject to the <ins>**HPC Scratch old file purging policy:** Files on the /scratch file system that have not been accessed for 60 or more days will be purged.</ins>

Every user has a dedicated scratch directory (**`/scratch/$USER`**) with a **5 TB** disk quota and a limit of **1,000,000 inodes** (files) per user.

The scratch file system is available on all nodes (compute, login, etc.) on Greene as well as the Data Transfer Nodes (gDTN).

:::warning
There are **NO backups** of the scratch file system. ***Files that are deleted accidentally or removed due to storage system failures CANNOT be recovered.***
:::

:::tip

- Since there are ***no backups of the HPC Scratch file system***, users should not put important source code, scripts, libraries, or executables in `/scratch`. These important files should be stored in file systems that are backed up, such as `/home` or [Research Project Space (RPS)](./05_research_project_space.md). Code can also be stored in a ***git*** repository.
- ***Old file purging policy on HPC Scratch:*** <ins>All files on the HPC Scratch file system that have not been accessed ***for more than 60 days*** will be removed. It is a policy violation to use scripts to change the file access time.</ins> Any user found to be violating this policy will have their HPC account locked. A second violation may result in the account being deactivated.
- To find out your current disk space and inode quota utilization and the distribution of files under your scratch directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
- Once a research project completes, users should archive their important files in the [HPC Archive file system](./01_intro_and_data_management.mdx#hpc-archive).
:::

#### HPC Vast
The HPC Vast all-flash file system is an HPC file system where users store research data needed during the analysis phase of their research projects, particularly high-I/O data that can bottleneck on the scratch file system. The Vast file system provides ***temporary*** storage for datasets needed for running jobs.

Files stored in the HPC vast file system are subject to the <ins>***HPC Vast old file purging policy:*** Files on the `/vast` file system that have not been accessed for **60 or more days** will be purged.</ins>

Every user has a dedicated vast directory (**`/vast/$USER`**) with a **2 TB** disk quota and a limit of **5,000,000 inodes** (files) per user.

The vast file system is available on all nodes (compute, login, etc.) on Greene as well as the Data Transfer Nodes (gDTN).

:::warning
There are **NO backups** of the vast file system. ***Files that are deleted accidentally or removed due to storage system failures CANNOT be recovered.***
:::

:::tip
- Since there are ***no backups of the HPC Vast file system***, users should not put important source code, scripts, libraries, or executables in `/vast`. These important files should be stored in file systems that are backed up, such as `/home` or [Research Project Space (RPS)](./05_research_project_space.md). Code can also be stored in a ***git*** repository.
- ***Old file purging policy on HPC Vast:*** <ins>All files on the HPC Vast file system that have not been accessed ***for more than 60 days will be removed.*** It is a policy violation to use scripts to change the file access time.</ins> Any user found to be violating this policy will have their HPC account locked. A second violation may result in the account being deactivated.
- To find out your current disk space and inode quota utilization and the distribution of files under your vast directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
- Once a research project completes, users should archive their important files in the [HPC Archive file system](./01_intro_and_data_management.mdx#hpc-archive).
:::

#### HPC Research Project Space
The HPC Research Project Space (RPS) provides data storage for research projects that is easily shared amongst collaborators, ***backed up***, and ***not subject to the old file purging policy***. HPC RPS was introduced to ease data management in the HPC environment and to eliminate the need to frequently copy files between the Scratch and Archive file systems by keeping all project files under one area. ***These benefits of the HPC RPS come at a cost***. The cost is determined by the allocated disk space and the number of files (inodes).
- For detailed information about RPS see: [HPC Research Project Space](./05_research_project_space.md)

#### HPC Work
The HPC team makes available a number of public data sets that are commonly used in analysis jobs. The data sets are available read-only under **`/scratch/work/public`**.
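
For example, to browse what is available (a minimal sketch):

```sh
# List the read-only public data sets from any Greene node
ls /scratch/work/public
```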

For some of the datasets, users must provide a signed usage agreement before access is granted.

Public datasets available on the HPC clusters can be viewed on the [Datasets page](../01_getting_started/01_intro.md).

#### HPC Archive
Once the Analysis stage of the research data lifecycle has completed, <ins>_HPC users should **tar** their data and code into a single tar.gz file and then copy the file to their archive directory (**`/archive/$USER`**)_.</ins> The HPC Archive file system is not accessible by running jobs; it is suitable for long-term data storage. Each user has a default disk quota of **2 TB** and a limit of ***20,000 inodes (files)***. The rather low limit on the number of inodes per user is intentional. The archive file system is available only ***on login nodes*** of Greene. The archive file system is backed up daily.

- Here is an example ***tar*** command that combines the data in a directory named ***my_run_dir*** under ***`$SCRATCH`*** and outputs the tar file in the user's ***`$ARCHIVE`***:
```sh
# to archive `$SCRATCH/my_run_dir`
tar cvf $ARCHIVE/simulation_01.tar -C $SCRATCH my_run_dir
```
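
For completeness, here is a compressed variant matching the tar.gz suggestion above, plus the corresponding extraction command (the file names are only examples):

```sh
# create a compressed archive of $SCRATCH/my_run_dir
tar czvf $ARCHIVE/simulation_01.tar.gz -C $SCRATCH my_run_dir

# later, restore it back into scratch
tar xzvf $ARCHIVE/simulation_01.tar.gz -C $SCRATCH
```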

#### NYU (Google) Drive
Google Drive ([NYU Drive](https://www.nyu.edu/life/information-technology/communication-and-collaboration/document-collaboration-and-sharing/nyu-drive.html)) is accessible from the NYU HPC environment and provides an option to users who wish to archive data or share data with external collaborators who do not have access to the NYU HPC environment.

Currently (March 2021) there is no limit on the amount of data a user can store on Google Drive and there is no cost associated with storing data on Google Drive (although we hear rumors that free storage on Google Drive may be ending soon).

However, there are limits on the data transfer rate when moving data to/from Google Drive, so transferring many small files is not efficient.

Please read the [Instructions on how to use cloud storage within the NYU HPC Environment](./09_transferring_cloud_storage_data_with_rclone.md).
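
As a rough illustration (a hypothetical sketch; the remote name `nyudrive` is only an example and must first be configured with `rclone config` as described on the linked page):

```sh
# Copy an archived result set from scratch to Google Drive
rclone copy $SCRATCH/simulation_01.tar.gz nyudrive:hpc-archive/
```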

#### HPC Storage Mounts Comparison Table
<iframe width="100%" height="300em" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vT-q0rRueYg1Be_gcWSghB-GGFDonP8DaXNnm8Qi036w-Vi_l7CCOav4IPxi1yZy8TSnTRFy7S5dNTJ/pubhtml?widget=true&amp;headers=false"></iframe>

Please see the next page for best practices for data management on NYU HPC systems.