diff --git a/docs/hpc/02_connecting_to_hpc/01_connecting_to_hpc.md b/docs/hpc/02_connecting_to_hpc/01_connecting_to_hpc.md
index 31951be890..f9c23939ad 100644
--- a/docs/hpc/02_connecting_to_hpc/01_connecting_to_hpc.md
+++ b/docs/hpc/02_connecting_to_hpc/01_connecting_to_hpc.md
@@ -10,9 +10,9 @@ The following sections will outline basic ways to connect to the Greene cluster.
If you are connecting from a remote location that is not on the NYU network (your home for example), you have two options:
-1. **VPN Option:** [Set up your computer to use the NYU VPN][nyu vpn link]. Once you've created a VPN connection, you can proceed as if you were connected to the NYU net
+1. **VPN Option:** [Set up your computer to use the NYU VPN][nyu vpn link]. Once you've created a VPN connection, you can proceed as if you were connected to the NYU network.
-2. **Gateway Option:** Go through our gateway servers (example below). Gateways are designed to support only a very minimal set of commands and their only purpose is to let users connect HPC systems without needing to first connect to the VPN
+2. **Gateway Option:** Go through our gateway servers (example below). Gateways are designed to support only a very minimal set of commands and their only purpose is to let users connect to HPC systems without needing to first connect to the VPN.
You do not need to use the NYU VPN or gateways if you are connected to the NYU network (wired connection in your office or WiFi) or if you have VPN connection initiated. In this case you can ssh directly to the clusters.
@@ -52,16 +52,16 @@ Instructions on WSL installation can be found here: [https://docs.microsoft.com/
Instead of typing your password every time you need to log in, you can also specify an ssh key.
-- Only do that on the computer you trust
+- Only do this on a computer you trust
-- Generate ssh key pair (terminal in Linux/Mac or cmd/WSL in Windows):
+- Generate ssh key pair (terminal in Linux/Mac or cmd/WSL in Windows):
[https://www.ssh.com/ssh/keygen/][ssh instructions keygen link]
-- Note the path to ssh key files. Don't share key files with anybody - anybody with this key file can login to your account
+- Note the path to the ssh key files. Don't share key files with anybody - anybody with this key file can log in to your account
-- Log into cluster using regular login/password and then add the content of generated public key file (the one with .pub) to `$HOME/.ssh/authorized_keys` on cluster
+- Log into the cluster using your regular login/password and then add the content of the generated public key file (the one with .pub) to `$HOME/.ssh/authorized_keys` on the cluster
-- Next time you will log into cluster no password will be required
+- The next time you log into the cluster, no password will be required (see the sketch below)
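+
+For reference, a minimal sketch of the steps above from a Linux/Mac terminal (the key type and the NetID `jdoe55` are placeholders):
+```sh
+# generate a key pair on the computer you trust (accept the default path or choose your own)
+ssh-keygen -t ed25519
+# append the public key to ~/.ssh/authorized_keys on the cluster (you will be asked for your password once)
+ssh-copy-id -i ~/.ssh/id_ed25519.pub jdoe55@greene.hpc.nyu.edu
+```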
For additional recommendations on how to configure your SSH sessions, see the \[ssh configuring and x11 forwarding page].
diff --git a/docs/hpc/02_connecting_to_hpc/02_ssh_tunneling_and_x11_forwarding.md b/docs/hpc/02_connecting_to_hpc/02_ssh_tunneling_and_x11_forwarding.md
index d3a4416f96..8616116bf1 100644
--- a/docs/hpc/02_connecting_to_hpc/02_ssh_tunneling_and_x11_forwarding.md
+++ b/docs/hpc/02_connecting_to_hpc/02_ssh_tunneling_and_x11_forwarding.md
@@ -129,17 +129,17 @@ This is the equivalent to running "ssh hpcgwtunnel" in the reusable tunnel instr
### Creating the tunnel
-1. First open Putty and prepare to log in to gw.hpc.nyu.edu. If you saved your session during that process, you can load it by selecting from the "Saved Sessions" box and hitting "Load". Don't hit "Open" yet!
+1. First open Putty and prepare to log in to gw.hpc.nyu.edu. If you saved your session during that process, you can load it by selecting from the "Saved Sessions" box and hitting "Load". Don't hit "Open" yet!
-2. Under "Connection" -> "SSH", just below "X11", select "Tunnels
+2. Under "Connection" -> "SSH", just below "X11", select "Tunnels
-3. Write "8026" (the port number) in the "Source port" box, and "greene.hpc.nyu.edu:22" (the machine you wish to tunnel to - 22 is the port that ssh listens on) in the "Destination" box
+3. Write "8026" (the port number) in the "Source port" box, and "greene.hpc.nyu.edu:22" (the machine you wish to tunnel to - 22 is the port that ssh listens on) in the "Destination" box
-4. Click "Add". You can repeat step 3 with a different port number and a different destination. If you do this you will create multiple tunnels, one to each destination
+4. Click "Add". You can repeat step 3 with a different port number and a different destination. If you do this you will create multiple tunnels, one to each destination
-5. Before hitting "Open", go back to the "Sessions" page, give the session a name ("hpcgw_tunnel") and hit "Save". Then next time you need not do all this again, just load the saved session
+5. Before hitting "Open", go back to the "Sessions" page, give the session a name ("hpcgw_tunnel") and hit "Save". Then next time you need not do all this again, just load the saved session
-6. Hit "Open" to login in to gw.hpc.nyu.edu and create the tunnel. A terminal window will appear, asking for your login name (NYU NetID) and password (NYU password). Windows may also ask you to allow certain connections through its firewall - this is so you can ssh to port 8026 on your workstation - the entrance to the tunnel
+6. Hit "Open" to login in to gw.hpc.nyu.edu and create the tunnel. A terminal window will appear, asking for your login name (NYU NetID) and password (NYU password). Windows may also ask you to allow certain connections through its firewall - this is so you can ssh to port 8026 on your workstation - the entrance to the tunnel
:::note
@@ -150,19 +150,19 @@ Using your SSH tunnel: To log in via the tunnel, first the tunnel must be open.
Starting the tunnel: During a session, you need only do this once - as long as the tunnel is open, new connections will go over it.
-1. Start Putty.exe (again, if necessary), and load the session you saved in settings during procedure above
+1. Start Putty.exe (again, if necessary), and load the session you saved during the procedure above
-2. Hit "Open", and log in to the bastion host with your NYU NetID and password. This will create the tunnel.
+2. Hit "Open", and log in to the bastion host with your NYU NetID and password. This will create the tunnel.
### Logging in via your SSH tunnel
-1. Start the second Putty.exe. In the "Host Name" box, write "localhost" and in the "Port" box, write "8026" (or whichever port number you specified when you set up the tunnel in the procedure above). We use "localhost" because the entrance of the tunnel is actually on this workstation, at port 8026
+1. Start the second Putty.exe. In the "Host Name" box, write "localhost" and in the "Port" box, write "8026" (or whichever port number you specified when you set up the tunnel in the procedure above). We use "localhost" because the entrance of the tunnel is actually on this workstation, at port 8026
-2. Go to "Connections" -> "SSH" -> "X11" and check "Enable X11 forwarding"
+2. Go to "Connections" -> "SSH" -> "X11" and check "Enable X11 forwarding"
-3. Optionally, give this session a name (in "Saved Sessions") and hit "Save" to save it. Then next time instead of steps 1 and 2 you can simply load this saved session
+3. Optionally, give this session a name (in "Saved Sessions") and hit "Save" to save it. Then next time instead of steps 1 and 2 you can simply load this saved session
-4. Hit "Open". You will again get a terminal window asking for your login (NYU NetID) and password (NYU password). You are now logged in to the HPC cluster!
+4. Hit "Open". You will again get a terminal window asking for your login (NYU NetID) and password (NYU password). You are now logged in to the HPC cluster!
## X11 Forwarding
diff --git a/docs/hpc/03_storage/01_intro.md b/docs/hpc/03_storage/01_intro.md
deleted file mode 100644
index a7e03d4d74..0000000000
--- a/docs/hpc/03_storage/01_intro.md
+++ /dev/null
@@ -1 +0,0 @@
-# Available storage systems
diff --git a/docs/hpc/03_storage/01_intro_and_data_management.mdx b/docs/hpc/03_storage/01_intro_and_data_management.mdx
new file mode 100644
index 0000000000..f086cb7005
--- /dev/null
+++ b/docs/hpc/03_storage/01_intro_and_data_management.mdx
@@ -0,0 +1,131 @@
+# HPC Storage
+
+The NYU HPC clusters are served by a General Parallel File System (GPFS) cluster and an all-flash VAST storage cluster.
+
+The NYU HPC team supports data storage, transfer, and archival needs on the HPC clusters, as well as collaborative research services like the [Research Project Space (RPS)](./05_research_project_space.md).
+
+## Highlights
+- 9.5 PB Total GPFS Storage
+ - Up to 78 GB per second read speeds
+ - Up to 650k input/output operations per second (IOPS)
+- Research Project Space (RPS): RPS volumes provide working spaces for sharing data and code amongst project or lab members
+
+## Introduction to HPC Data Management
+The NYU HPC Environment provides access to a number of ***file systems*** to better serve the needs of researchers managing data during the various stages of the research data lifecycle (data capture, analysis, archiving, etc.). Each HPC file system comes with different features, policies, and availability.
+
+In addition, a number of ***data management tools*** are available that enable data transfers and data sharing, recommended best practices, and various scenarios and use cases of managing data in the HPC Environment.
+
+Multiple ***public data sets*** are available to all users of the HPC environment, such as a subset of The Cancer Genome Atlas (TCGA), the Million Song Database, ImageNet, and Reference Genomes.
+
+Below is a list of file systems with their characteristics and a summary table. Reviewing the list of available file systems and the various scenarios/use cases presented below can help you select the right file systems for a research project. As always, if you have any questions about data storage in the HPC environment, you can request a consultation with the HPC team by sending email to [hpc@nyu.edu](mailto:hpc@nyu.edu).
+
+### Data Security Warning
+::::warning
+#### Moderate Risk Data - HPC Approved
+- The HPC Environment has been approved for storing and analyzing **Moderate Risk research data**, as defined in the [NYU Electronic Data and System Risk Classification Policy](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/electronic-data-and-system-risk-classification.html).
+- **High Risk** research data, such as those that include Personal Identifiable Information (**PII**) or electronic Protected Health Information (**ePHI**) or Controlled Unclassified Information (**CUI**) **should NOT be stored in the HPC Environment**.
+:::note
+Only the Office of Sponsored Projects (OSP) and Global Office of Information Security (GOIS) are empowered to classify the risk categories of data.
+:::
+:::tip
+#### High Risk Data - Secure Research Data Environments (SRDE) Approved
+Because the HPC system is not approved for High Risk data, we recommend using an approved system like the [Secure Research Data Environments (SRDE)](../../srde/01_getting_started/01_intro.md).
+:::
+::::
+
+### Data Storage options in the HPC Environment
+#### User Home Directories
+Every individual user has a home directory (under **`/home/$USER`**, environment variable **`$HOME`**) for permanently storing code and important configuration files. Home directories provide limited storage space (**50 GB**) and **30,000** inodes (files) per user. Users can check their quota utilization using the [myquota](./06_best_practices.md#user-quota-limits-and-the-myquota-command) command.
+
+User home directories are backed up daily and old files under **`$HOME`** are not purged.
+
+The user home directories are available on all HPC clusters (Greene) and on every cluster node (login nodes, compute nodes), as well as the Data Transfer Nodes (gDTN).
+
+:::warning
+Avoid changing file and directory permissions in your home directory to allow other users to access files.
+:::
+User home directories are not ideal for sharing files and folders with other users. HPC Scratch or Research Project Space (RPS) are better file systems for sharing data.
+
+:::warning
+**One of the common issues that users report regarding their home directories is running out of inodes,** i.e. the number of files stored under their home exceeds the inode limit, which by default is set to 30,000 files. This typically occurs when users install software under their home directories, for example, when working with Conda and Julia environments, that involve many small files.
+:::
+
+:::tip
+- To find out the current space and inode quota utilization and the distribution of files under your home directory, please see: [Understanding user quota limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
+- **Working with Conda environments:** To avoid running out of inode limits in home directories, the HPC team recommends **setting up conda environments with Singularity overlay images**
+:::
+
+#### HPC Scratch
+The HPC scratch file system is where most users store the research data needed during the analysis phase of their research projects. The scratch file system provides ***temporary*** storage for datasets needed for running jobs.
+
+Files stored in the HPC scratch file system are subject to the **HPC Scratch old file purging policy:** Files on the /scratch file system that have not been accessed for 60 or more days will be purged.
+
+Every user has a dedicated scratch directory (**/scratch/$USER**) with **5 TB** disk quota and **1,000,000 inodes** (files) limit per user.
+
+The scratch file system is available on all nodes (compute, login, etc.) on Greene as well as the Data Transfer Nodes (gDTN).
+
+:::warning
+There are **no backups of the scratch file system.** ***Files that were deleted accidentally or removed due to storage system failures CANNOT be recovered.***
+:::
+
+:::tip
+
+- Since there are ***no back ups of HPC Scratch file system***, users should not put important source code, scripts, libraries, executables in `/scratch`. These important files should be stored in file systems that are backed up, such as `/home` or [Research Project Space (RPS)](./05_research_project_space.md). Code can also be stored in a ***git*** repository.
+- ***Old file purging policy on HPC Scratch:*** All files on the HPC Scratch file system that have not been accessed ***for more than 60 days*** will be removed. It is a policy violation to use scripts to change the file access time. Any user found to be violating this policy will have their HPC account locked. A second violation may result in your HPC account being turned off.
+- To find out the user's current disk space and inode quota utilization and the distribution of files under your scratch directory, please see: [Understanding user quota Limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
+- Once a research project completes, users should archive their important files in the [HPC Archive file system](./01_intro_and_data_management.mdx#hpc-archive).
+:::
+
+#### HPC Vast
+The HPC Vast all-flash file system is the HPC file system where users store research data needed during the analysis phase of their research projects, particularly for high-I/O workloads that can bottleneck on the scratch file system. The Vast file system provides ***temporary*** storage for datasets needed for running jobs.
+
+Files stored in the HPC vast file system are subject to the ***HPC Vast old file purging policy:*** Files on the `/vast` file system that have not been accessed for **60 or more days** will be purged.
+
+Every user has a dedicated vast directory (**`/vast/$USER`**) with **2 TB** disk quota and **5,000,000 inodes** (files) limit per user.
+
+The vast file system is available on all nodes (compute, login, etc.) on Greene as well as the Data Transfer Nodes (gDTN).
+
+:::warning
+There are **no backups** of the vast file system. ***Files that were deleted accidentally or removed due to storage system failures CANNOT be recovered.***
+:::
+
+:::tip
+- Since there are ***no back ups of HPC Vast file system***, users should not put important source code, scripts, libraries, executables in `/vast`. These important files should be stored in file systems that are backed up, such as `/home` or [Research Project Space (RPS)](./05_research_project_space.md). Code can also be stored in a ***git*** repository.
+- ***Old file purging policy on HPC Vast:*** All files on the HPC Vast file system that have not been accessed ***for more than 60 days will be removed.*** It is a policy violation to use scripts to change the file access time. Any user found to be violating this policy will have their HPC account locked. A second violation may result in your HPC account being turned off.
+- To find out the user's current disk space and inode quota utilization and the distribution of files under your vast directory, please see: [Understanding user quota Limits and the myquota command.](./06_best_practices.md#user-quota-limits-and-the-myquota-command)
+- Once a research project completes, users should archive their important files in the [HPC Archive file system](./01_intro_and_data_management.mdx#hpc-archive).
+:::
+
+#### HPC Research Project Space
+The HPC Research Project Space (RPS) provides data storage space for research projects that is easily shared amongst collaborators, ***backed up***, and ***not subject to the old file purging policy***. HPC RPS was introduced to ease data management in the HPC environment and eliminate the need to frequently copy files between the Scratch and Archive file systems by keeping all project files in one area. ***These benefits of the HPC RPS come at a cost***. The cost is determined by the allocated disk space and the number of files (inodes).
+- For detailed information about RPS see: [HPC Research Project Space](./05_research_project_space.md)
+
+#### HPC Work
+The HPC team makes available a number of public data sets that are commonly used in analysis jobs. The data sets are available read-only under **`/scratch/work/public`**.
+
+For some of the datasets users must provide a signed usage agreement before accessing.
+
+Public datasets available on the HPC clusters can be viewed on the [Datasets page](../01_getting_started/01_intro.md).
+
+#### HPC Archive
+Once the Analysis stage of the research data lifecycle has completed, _HPC users should **tar** their data and code into a single tar.gz file and then copy the file to their archive directory (**`/archive/$USER`**)_. The HPC Archive file system is not accessible by running jobs; it is suitable for long-term data storage. Each user has a default disk quota of **2TB** and a ***20,000 inode (files) limit***. The rather low limit on the number of inodes per user is intentional. The archive file system is available only ***on login nodes*** of Greene. The archive file system is backed up daily.
+
+- Here is an example ***tar*** command that combines the data in a directory named ***my_run_dir*** under ***`$SCRATCH`*** and outputs the tar file in the user's ***`$ARCHIVE`***:
+```sh
+# to archive `$SCRATCH/my_run_dir`
+tar cvf $ARCHIVE/simulation_01.tar -C $SCRATCH my_run_dir
+```
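+- To restore the archive later, reverse the operation (a sketch; extracts back into `$SCRATCH`):
+```sh
+# to extract the archive back into $SCRATCH
+tar xvf $ARCHIVE/simulation_01.tar -C $SCRATCH
+```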
+
+#### NYU (Google) Drive
+Google Drive ([NYU Drive](https://www.nyu.edu/life/information-technology/communication-and-collaboration/document-collaboration-and-sharing/nyu-drive.html)) is accessible from the NYU HPC environment and provides an option to users who wish to archive data or share data with external collaborators who do not have access to the NYU HPC environment.
+
+Currently (March 2021) there is no limit on the amount of data a user can store on Google Drive and there is no cost associated with storing data on Google Drive (although we hear rumors that free storage on Google Drive may be ending soon).
+
+However, there are limits to the data transfer rate in moving to/from Google Drive. Thus, moving many small files to Google Drive is not going to be efficient.
+
+Please read the [Instructions on how to use cloud storage within the NYU HPC Environment](./09_transferring_cloud_storage_data_with_rclone.md).
+
+#### HPC Storage Mounts Comparison Table
+
+
+Please see the next page for best practices for data management on NYU HPC systems.
diff --git a/docs/hpc/03_storage/02_available_storage_systems.md b/docs/hpc/03_storage/02_available_storage_systems.md
index a7e03d4d74..9378df77c5 100644
--- a/docs/hpc/03_storage/02_available_storage_systems.md
+++ b/docs/hpc/03_storage/02_available_storage_systems.md
@@ -1 +1,42 @@
# Available storage systems
+
+The NYU HPC clusters are served by a General Parallel File System (GPFS) storage cluster. GPFS is a high-performance clustered file system software developed by IBM that provides concurrent high-speed file access to applications executing on multiple nodes of clusters.
+
+## GPFS
+### Configuration
+The NYU HPC cluster storage runs on Lenovo Distributed Storage Solution DSS-G hardware:
+- 2x DSS-G 202
+ - 116 Solid State Drives (SSDs)
+ - 464TB raw storage
+- 2x DSS-G 240
+ - 668 Hard Disk Drives (HDDs)
+ - 9.1PB raw storage
+
+### Performance
+- Read Speed: 78 GB per second read speeds
+- Write Speed: 42 GB per second write speeds
+- I/O Performance: up to 650k input/output operations per second (IOPS)
+
+## Flash Tier Storage (VAST)
+- An all-flash file system, using [VAST Flash storage](https://www.vastdata.com/), is now available on Greene. Flash storage is optimal for computational workloads with high I/O rates. For example, if you have jobs that read a huge number of tiny files, VAST may be a good candidate. If you and your lab members are interested, please reach out to [hpc@nyu.edu](mailto:hpc@nyu.edu) for more information.
+- NVMe interface
+- Total size: 778 TB
+- Access: /vast is available for all users to read and available to approved users to write data.
+
+## Research Project Space (RPS)
+- Research Project Space (RPS) volumes provide working spaces for sharing data and code amongst project or lab members.
+- RPS directories are available on the Greene HPC cluster.
+- There is no old-file purging policy on RPS.
+- RPS is backed up.
+- There is a cost per TB per year and inodes per year for RPS volumes.
+
+More information is available on the [Research Project Space page](./05_research_project_space.md).
+
+## Data Transfer Nodes Specs (gDTN)
+- Node type: Lenovo SR630
+- Number of nodes: 2
+- CPU: 2x Intel Xeon Gold 6244 8C 150W 3.6GHz Processor
+- Memory: 192GB (total) - 12x 16GB DDR4, 2933MHz
+- Local disk: 1x 1.92TB SSD
+- Infiniband interconnect: 1x Mellanox ConnectX-6 HDR100 /100GbE VPI 1-Port x16 PCIe 3.0 HCA
+- Ethernet connectivity to the NYU High-Speed Research Network (HSRN): 200Gbit - 1x Mellanox ConnectX-5 EDR IB/100GbE VPI Dual-Port x16 PCIe 3.0 HCA
diff --git a/docs/hpc/03_storage/03_data_transfers.md b/docs/hpc/03_storage/03_data_transfers.md
index 5f3d37cdb9..34cf2b5ee9 100644
--- a/docs/hpc/03_storage/03_data_transfers.md
+++ b/docs/hpc/03_storage/03_data_transfers.md
@@ -1 +1,76 @@
-# Data transfers
+# Data Transfers
+
+## Introduction
+The main tools to transfer data to/from the HPC systems are:
+
+- Linux tools like scp and rsync
+  - Please use the Data Transfer Nodes
+:::note
+While one can transfer data on the login nodes, it is considered bad practice
+:::
+- Globus
+- rclone to/from cloud storage like NYU (Google) Drive
+- OpenOnDemand
+- Other tools
+
+## Data-Transfer nodes
+Attached to the NYU HPC cluster Greene, the Greene Data Transfer Nodes (gDTN) are nodes optimized for transferring data between cluster file systems (e.g. scratch) and other endpoints outside the NYU HPC clusters, including user laptops and desktops. The gDTNs have 100-Gb/s Ethernet connections to the High Speed Research Network (HSRN) and are connected to the HDR Infiniband fabric of the HPC clusters.
+
+The HPC cluster file systems, including `/home`, `/scratch`, `/archive`, and the [HPC Research Project Space](./05_research_project_space.md), are available on the gDTN.
+
+The Data-Transfer Nodes (DTN) can be accessed in a variety of ways:
+- From NYU-net and the High Speed Research Network: use SSH to the DTN hostname gdtn.hpc.nyu.edu
+- From the Greene cluster (e.g., the login nodes): the hostname can be shortened to gdtn
+- For example, to log in to a DTN from the Greene cluster, to carry out some copy operation, and to log back out, you can use a command sequence like:
+```sh
+ssh gdtn
+rsync ...
+logout
+```
+- Via specific tools like Globus (see below)
+
+## Linux & Mac Tools
+### scp and rsync
+Please transfer data using Data-Transfer nodes
+
+Sometimes these two tools are convenient for transferring small files. Using the DTNs does not require setting up an SSH tunnel; use the hostname dtn.hpc.nyu.edu for one-step copying. See below for examples of commands invoked on the command line on a laptop running a Unix-like operating system:
+```sh
+scp HMLHWBGX7_n01_HK16.fastq.gz jdoe55@dtn.hpc.nyu.edu:/scratch/jdoe55/
+rsync -av HMLHWBGX7_n01_HK16.fastq.gz jdoe55@dtn.hpc.nyu.edu:/scratch/jdoe55/
+```
+In particular, rsync can also be used on the DTNs to copy directories recursively between filesystems, e.g. (assuming that you are logged in to a DTN),
+```sh
+rsync -av /scratch/username/project1 /rw/sharename/
+```
+where username would be your user name, project1 a directory to be copied to the Research Workspace, and sharename the name of a share on the Research Workspace (either your NetID or the name of a project you're a member of).
+
+## Windows Tools
+### File Transfer Clients
+Windows 10 machines may have the Linux Subsystem installed, which will allow for the use of Linux tools, as listed above, but generally it is recommended to use a client such as [WinSCP](https://winscp.net/eng/docs/tunneling) or [FileZilla](https://filezilla-project.org/) to transfer data. Additionally, Windows users may also take advantage of [Globus](./04_globus.md) to transfer files.
+
+### Tunneling
+[Read the detailed instructions for setting up tunnels.](../02_connecting_to_hpc/02_ssh_tunneling_and_x11_forwarding.md)
+
+## Globus
+Globus is the recommended tool for large-volume data transfers. It features automatic performance tuning and automatic retries in case of file-transfer failures. Data-transfer tasks can be submitted via a web portal, and the Globus service takes care of the rest, making sure files are copied efficiently, reliably, and securely. Globus is also a tool for sharing data with collaborators, for whom you only need to provide email addresses.
+
+The Globus endpoint for Greene is available at `nyu#greene`. The endpoint `nyu#prince` has been retired.
+
+[Please see detailed instructions](./04_globus.md)
+
+## rclone
+rclone ("rsync for cloud storage") is a command-line program to sync files and directories to and from cloud storage systems such as Google Drive, Amazon Drive, S3, B2, etc. rclone is available on the DTNs.
+
+[Please see the documentation for how to use it.](https://rclone.org/)
+
+## Open OnDemand
+One can use the Open OnDemand interface to upload data.
+However, please use it only for small files!
+
+:::tip
+Please use the Data Transfer Nodes when moving large amounts of data
+:::
+
+### FDT
+FDT stands for "Fast Data Transfer". It is a command line application written in Java. With the plugin mechanism, FDT allows users to load user-defined classes for Pre- and Post-Processing of file transfers. Users can start their own server processes. If you have use cases for FDT, visit the download page to get `fdt.jar` to start. Please contact [hpc@nyu.edu](mailto:hpc@nyu.edu) for any questions.
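+
+As a rough sketch (FDT options vary by version - consult the FDT documentation; the hostname and paths below are placeholders), a transfer typically involves an FDT server on one side and a client on the other:
+```sh
+# on the receiving host: start an FDT server
+java -jar fdt.jar
+# on the sending host: send a file into a directory on the receiving host
+java -jar fdt.jar -c receiver.example.edu -d /data/incoming bigfile.dat
+```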
+
diff --git a/docs/hpc/03_storage/04_globus.md b/docs/hpc/03_storage/04_globus.md
index 020956aab9..118f48acfb 100644
--- a/docs/hpc/03_storage/04_globus.md
+++ b/docs/hpc/03_storage/04_globus.md
@@ -1 +1,74 @@
# Globus
+
+The Globus project aims to provide powerful tools for scientific data management, helping researchers focus on their domain subjects and solve data-intensive research problems. Globus has matured to enable grid computing by connecting computing resources distributed globally across organizational boundaries. Universities, national laboratories, and computing facilities use Globus services.
+
+## Transferring data between endpoints
+### Endpoint
+
+A globus **Endpoint** is a data transfer location, a location where data can be moved to or from using Globus transfer, sync and sharing service. An endpoint can either be a ***personal endpoint*** (on a user’s personal computer) or a ***server endpoint*** (located on a server, for use by multiple users). [Please read for details.](https://www.globus.org/data-transfer)
+
+### Collection
+
+A collection is a named set of files (or blobs), hierarchically organized in folders.
+
+### Data Sharing
+
+How to share data using Globus is described: [https://docs.globus.org/how-to/share-files/](https://docs.globus.org/how-to/share-files/)
+
+The first step in transferring data is to get a Globus account at [https://www.globus.org/](https://www.globus.org/). Click on "Log in" at upper right corner. Select "New York University" from the pull-down menu and click on "Continue".
+
+
+
+Enter your NYU NetID and password in the familiar screen, and hit "LOGIN" then go through the Multi-Factor Authentication.
+
+
+
+The "File Manager" panel should come up as the following image. In order to be able to transfer files, you will need to specify two Collections. A collection is defined on top of an endpoint. We can search for a collection using an endpoint name. The **Server Endpoint** on the NYU HPC storage is **nyu#greene** .
+
+
+
+
+
+
+
+## Server and Personal Endpoints
+:::note
+The NYU HPC Server Endpoint: nyu#greene
+:::
+
+**Globus Connect Server** is already installed on the NYU HPC cluster creating a ***Server Endpoint*** named **nyu#greene**, that is available to authorized users (users with a valid HPC account) using Globus. If you want to move data to or from your computer and the NYU HPC cluster, you need to install **Globus Connect Personal** on your computer, thus creating a ***Personal Endpoint*** on your computer.
+
+### Moving data between Server Endpoints
+
+If you plan to transfer data between ***Server Endpoints***, such as between the NYU server endpoint **nyu#greene** and a server endpoint at another institution, you do not need to install Globus Connect Personal on your computer.
+
+#### Creating a Personal Endpoint on your computer
+
+This needs to be done only once on your personal computer.
+
+After clicking "Transfer or Sync to...", click "Search" on the upper right side. Then follow the link "Install Globus Connect Personal".
+
+More information about **Globus Connect Personal** and download links for Linux, Mac and Windows can be found at: [https://www.globus.org/globus-connect-personal](https://www.globus.org/globus-connect-personal)
+
+
+
+### Transfer files between your Personal Endpoint and NYU nyu#greene
+To transfer files you need to specify two collections (endpoints). Specify one of them as **Greene scratch directory**, or **Greene archive directory** or **Greene home directory**. The other endpoint is the one created for your personal computer (e.g. My Mac Laptop) if it is involved in the transfer. When you first use the Greene directory collection, authentication/consent is required for the Globus web app to manage collections on this endpoint on your behalf.
+
+
+
+When writing to your Greene archive directory, please note that there is a default inode limit of 20K per user.
+
+When the second endpoint is chosen to be your personal computer, your computer's home directory content will show up. Now select directories and files (you may select multiple files by clicking on file names while holding down the "shift" key) and click one of the two blue Start buttons to indicate the transfer direction. After clicking the blue Start button, you should see a message indicating that a transfer request has been submitted successfully, and a transfer ID is generated. The Globus file transfer service takes care of the actual copying.
+
+When the transfer is done, you should receive an email notification. Click "ACTIVITY" on the Globus portal and select the transfer you want to check; a finished transfer should look like the following:
+
+
+
+### Small file download from web browsers
+
+Globus supports HTTPS access to data. To download a small file from your web browser, select a file, right-click, and choose 'Download' from the popup menu.
+
+
+
+Additional info can be found at this page: [https://docs.globus.org/how-to/get-started/](https://docs.globus.org/how-to/get-started/). Feel free to send us any questions. Good luck!
diff --git a/docs/hpc/03_storage/05_research_project_space.md b/docs/hpc/03_storage/05_research_project_space.md
index e796d096d8..060d6c3c9e 100644
--- a/docs/hpc/03_storage/05_research_project_space.md
+++ b/docs/hpc/03_storage/05_research_project_space.md
@@ -1,2 +1,67 @@
# Research Project Space (RPS)
+## Description
+Research Project Space (RPS) volumes provide working space for sharing data and code amongst project or lab members. RPS directories are built on the same parallel file system (GPFS) as HPC Scratch. They are mounted on the cluster compute nodes, and thus they can be accessed by running jobs. RPS directories are backed up and there is no old-file purging policy. These features of RPS simplify the management of data in the HPC environment, as users of the HPC cluster can store their data and code in RPS directories and do not need to move data between the HPC Scratch and HPC Archive file systems.
+
+Due to limitations of the underlying parallel file system, ***the total number of RPS volumes that can be created is limited***. There is an annual cost associated with RPS. The disk space and inode usage in RPS directories do not count towards quota limits in other HPC file systems (Home, Scratch, and Archive).
+
+## Calculating RPS Costs
+The PI should estimate the cost of the RPS volume by taking into account storage size and number of inodes (files). The cost is calculated annually. Costs are divided into the total space, in terabytes, and the number of inodes, in blocks of 200,000.
+
+- 1 TB of Storage Cost: $100
+- 200,000 inodes Cost: $100
+
+An initial RPS volume request must include both storage space and inodes. Modifications of existing RPS volumes can include just Storage or just inode adjustments.
+
+An initial request includes 1TB and 200,000 inodes for an annual cost of $200.
+
+### Example RPS Requests
+Requests can include more storage or files, as needed, such as 1TB and 400,000 inodes or 2TB and 200,000 inodes. Both of the previous examples would cost $300, since they are requesting an incremental increase of storage or inodes, respectively.
+
+This would be the breakdown of the examples listed above:
+
+- 1 TB ($100) + 400,000 inodes ($200) = $300
+- 2 TB ($200) + 200,000 inodes ($100) = $300
+
+## Submitting an RPS volume Request or Modification
+### Step 1: Decide the size (in TB) and number of inodes (files) that is needed for one year
+
+The minimum size of an RPS request (to create a new RPS volume or extend an existing one) is 1TB of space.
+
+If this is a new/first request, you must purchase both storage and inodes. ***A typical request includes 200,000 inodes per TB of storage.***
+
+Before submitting an RPS request (request for a new RPS volume or extending the size of an existing volume) PIs should estimate the growth of their data (in terms of storage space and number of files) during the entire year, rather than submitting a request based on their data storage needs at the time of the request.
+
+### Step 2: Determine the cost of the request
+
+Determine the total annual cost of the request and the contact info of the School/Department/Center finance person.
+
+### Step 3: Verify that the project PI has a valid HPC user account
+
+The PI administers the top level RPS directory and grants access to other users. Thus ***the PI must have a valid HPC user account*** at the time of request. Please note that the HPC user account of NYU faculty never expires and thus does not need to be renewed every year. If the PI does not have an HPC account, please [request one here](../01_getting_started/02_getting_and_renewing_an_account.md).
+
+### Step 4: The PI submits the request to the HPC team via email
+
+Only PIs can submit RPS requests by contacting the HPC team via email ([hpc@nyu.edu](mailto:hpc@nyu.edu)). Please include in the request the size (TB and number of inodes), and the contact information of the School/Department/Center finance person. The HPC team will review the request and will contact the PI with any questions. If the request is approved, the HPC team will create (or adjust) the RPS volume with the PI's HPC user account as the owner of the RPS directory. An invoice will be generated by the IT finance team.
+
+## Current HPC RPS Stakeholders
+[HPC RPS Stakeholders](https://docs.google.com/spreadsheets/d/1NYH5y8yif7UpwGVtmdowMAy5NCIWtE1a2B697RNF_zU/edit?usp=sharing)
+
+## FAQs
+### Data Retention and Backups
+- How long can I keep the lab data in RPS?
+ - For as long as the lab pays for the RPS resources. Even if the current HPC cluster retires, the RPS volumes will be transferred to the next cluster
+- How can I find out how much storage and how many inodes I have used in my lab RPS volume?
+ - Please contact [HPC support](mailto:hpc@nyu.edu)
+- What kind of backups are provided?
+ - Backups are done once a day (daily incremental). Backups are kept for 30 days. This means that if something was deleted more than 30 days ago, it won't be in the backups and thus won't be recoverable.
+- Where are backups stored?
+ - RPS backups are stored on public cloud (AWS S3 Storage buckets).
+
+### Billing and Payments
+- What happens if I do not pay my bill?
+ - If the invoice is not paid for more than 60 days, the lab RPS directory will be 'tar'-ed and copied to an archival area. If 60 more days pass and the invoice is still not paid the tar files will be deleted.
+- Can I pay for RPS using a credit card?
+ - Unfortunately we're unable to process credit card payments
+- Can I pay for multiple years instead of paying every year?
+ - Yes, we can arrange for multiyear agreement
diff --git a/docs/hpc/03_storage/06_best_practices.md b/docs/hpc/03_storage/06_best_practices.md
new file mode 100644
index 0000000000..7c6fb89141
--- /dev/null
+++ b/docs/hpc/03_storage/06_best_practices.md
@@ -0,0 +1,50 @@
+# Best Practices on HPC Storage
+## User Quota Limits and the myquota command
+All users have quota limits set on the HPC file systems. There are several types of quota limits, such as limits on the amount of disk space (disk quota), the number of files (inode quota), etc. The default user quota limits on HPC file systems are listed [on our Data Management page](./01_intro_and_data_management.mdx#hpc-storage-mounts-comparison-table).
+
+Running out of quota causes a variety of issues such as running user jobs being interrupted or users being unable to finish the installation of packages under their home directory.
+
+_One of the common issues users report is running out of inodes in their home directory._ This usually occurs during software installation, for example, installing a conda environment under their home directory. Users can check their current quota utilization using the `myquota` command. The `myquota` command provides a report of the current quota limits on mounted file systems, the user's quota utilization, and the percentage of quota utilization.
+
+In the following example, the user who executes the `myquota` command is out of inodes in their home directory. The user inode quota limit on the `/home` file system is **30.0K inodes** and the user has **33,000 inodes**, i.e. **110%** of the inode quota limit.
+```sh
+$ myquota
+Hostname: log-1 at Sun Mar 21 21:59:08 EDT 2021
+Filesystem Environment Backed up? Allocation Current Usage
+Space Variable /Flushed? Space / Files Space(%) / Files(%)
+/home $HOME Yes/No 50.0GB/30.0K 8.96GB(17.91%)/33000(110.00%)
+/scratch $SCRATCH No/Yes 5.0TB/1.0M 811.09GB(15.84%)/2437(0.24%)
+/archive $ARCHIVE Yes/No 2.0TB/20.0K 0.00GB(0.00%)/1(0.00%)
+/vast $VAST No/Yes 2.0TB/5.0M 0.00GB(0.00%)/1(0.00%)
+```
+Users can find out the number of inodes (files) used per subdirectory under their home directory (`$HOME`), by running the following commands:
+```sh
+$ cd $HOME
+$ for d in $(find $(pwd) -maxdepth 1 -mindepth 1 -type d | sort -u); do n_files=$(find $d | wc -l); echo $d $n_files; done
+/home/netid/.cache 1507
+/home/netid/.conda 2
+/home/netid/.config 2
+/home/netid/.ipython 11
+/home/netid/.jupyter 2
+/home/netid/.keras 2
+/home/netid/.local 24185
+/home/netid/.nv 2
+/home/netid/.sacrebleu 46
+/home/netid/.singularity 1
+/home/netid/.ssh 5
+/home/netid/.vscode-server 7216
+```
+
+## Large number of small files
+If your dataset or workflow requires a large number of small files, this can create a bottleneck due to read/write rates.
+
+Please refer to our page on working with a [large number of files](./07_large_number_of_small_files.md) to learn about some of the options we recommend considering.
+
+## Installing Python packages
+Your home directory has a relatively small number of inodes.
+If you create a conda or Python environment in your home directory, it can quickly use up all of those inodes.
+
+Please review best practices for managing packages under the Package Management section of the [Greene Software Page](../06_tools_and_software/05_software_on_greene.md).
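+
+As a stopgap (a sketch only - the recommended approach on Greene is the Singularity overlay method mentioned on the Data Management page), you can place an environment outside your home directory, for example on `/scratch`; keep in mind that `/scratch` is not backed up and is subject to old-file purging:
+```sh
+# hypothetical environment path; /scratch is purged and not backed up
+conda create --prefix /scratch/$USER/envs/myenv python=3.10
+conda activate /scratch/$USER/envs/myenv
+```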
+
diff --git a/docs/hpc/03_storage/06_data_management.md b/docs/hpc/03_storage/06_data_management.md
deleted file mode 100644
index 3e49b6aa4c..0000000000
--- a/docs/hpc/03_storage/06_data_management.md
+++ /dev/null
@@ -1 +0,0 @@
-# Data Management
diff --git a/docs/hpc/03_storage/07_large_number_of_small_files.md b/docs/hpc/03_storage/07_large_number_of_small_files.md
new file mode 100644
index 0000000000..c2e98c3614
--- /dev/null
+++ b/docs/hpc/03_storage/07_large_number_of_small_files.md
@@ -0,0 +1,97 @@
+# Large Number of Small Files
+
+## Motivation
+Many datasets contain a large number of files (for example, [ImageNet](https://en.wikipedia.org/wiki/ImageNet) contains 14 million images, totaling ~150 GB). How do you deal with this data? How do you store it? How do you use it in computations? Long-term storage is not an issue - an archive like tar.gz handles this pretty well. However, when you want to use the data in computations, performance may depend on how you handle the data on disk.
+
+Here are some ideas you can try and evaluate for performance in your own project.
+
+## Squash file system to be used with Singularity
+Please read [here](../07_containers/04_squash_file_system_and_singularity.md)
+
+## Use jpg/png files on disk
+One option is to store image files (like png or jpg) on the disk and read from disk directly.
+
+An issue with this approach is that a Linux file system can hold only a limited number of files.
+```sh
+# One can open greene cluster and run the following command
+$ df -ih /scratch/
+Filesystem Inodes IUsed IFree IUse% Mounted on
+10.0.0.40@o2ib:10.0.0.41@o2ib:/scratch1 1.6G 209M 1.4G 14% /scratch
+```
+This shows that the total number of files `/scratch` can currently hold is about 1.6 billion. That looks like a large number, but translated into datasets the size of ImageNet (14 million images), roughly 100 such datasets would occupy almost all of the available file slots. This is a problem!
+
+And even if you can ignore this on your own PC, on HPC there is a limit on the number of files each user can put on `/scratch` to prevent such problems.
+
+This is the reason why you simply can't extract all those files into `/scratch`.
+
+## SLURM_TMPDIR
+Another option would be to start a SLURM job and extract everything into `$SLURM_TMPDIR`. This can work, but it requires un-tarring the data every time you run a SLURM job.
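+
+A rough sketch of this pattern inside a batch script (the archive path, resource requests, and analysis command are placeholders):
+```sh
+#!/bin/bash
+#SBATCH --job-name=untar-example
+#SBATCH --cpus-per-task=4
+#SBATCH --time=04:00:00
+
+# extract the dataset into the node-local temporary directory for this job
+tar -xf /scratch/$USER/dataset.tar -C $SLURM_TMPDIR
+
+# run the analysis against the fast local copy
+python train.py --data-dir $SLURM_TMPDIR/dataset
+```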
+
+## SLURM_RAM_TMPDIR
+You can also use the custom-made RAM-mapped disk using `#SLURM_RAM_TMPDIR` while submitting the job. In this case, when you start a job, you first un-tar your files to `$SLURM_RAM_TMPDIR` and then read from there. This basically requires 2x the size of the data in RAM just to hold the data.
+
+## Binary files (pickle, etc)
+Store data in some binary file (say, a pickle in Python) which you load fully when you start the SLURM job.
+
+This option may require a lot of RAM - thus you may have to wait a long time for the scheduler to find resources for your job. Also, this approach will not work on a regular PC without that much RAM, so your scripts are not portable.
+
+## Container files, one-file databases
+Special container formats allow you to either load data quickly in full or access selected elements without loading the whole dataset into RAM.
+
+### SQLite
+If you have structured data, a good option would be to use SQLite. Please refer to this page for more information.
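+
+A minimal Python sketch (the database path, table, and fields are illustrative only):
+```python
+import sqlite3
+
+# a single file on disk holds many records, so no per-record inodes are consumed
+con = sqlite3.connect("/scratch/netid/dataset.db")
+cur = con.cursor()
+cur.execute("CREATE TABLE IF NOT EXISTS samples (id INTEGER PRIMARY KEY, label TEXT, data BLOB)")
+cur.execute("INSERT INTO samples (label, data) VALUES (?, ?)", ("cat", b"\x00\x01"))
+con.commit()
+
+# read back a single record without loading the whole dataset
+row = cur.execute("SELECT label, data FROM samples WHERE id = ?", (1,)).fetchone()
+con.close()
+```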
+
+### HDF5
+One can think of an HDF5 file as a "container file" (a database of sorts) that holds many objects inside.
+
+HDF5 files do not have a file size limitation and can hold a huge number of objects, providing fast read/write access to them.
+
+It is easy to learn how to subset data and load into RAM only those data objects that you need.
+
+More info:
+- [Developers website](https://www.hdfgroup.org/)
+ - [book (free with NYU email)](https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/)
+- [hdf5 in Python](https://www.h5py.org/)
+- [hdf5 in R](https://www.bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html)
+
+HDF5 supports reading and writing in parallel, so you can use several nodes reading from the same file.
+
+More info: [Documentation](https://support.hdfgroup.org/documentation/index.html), [Tutorial](https://github.com/HDFGroup/hdf5-tutorial), [Help Desk](https://hdfgroup.atlassian.net/servicedesk/customer/portal/6/user/login?destination=portal%2F6)
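+
+A minimal h5py sketch (file path and dataset names are illustrative):
+```python
+import numpy as np
+import h5py
+
+# write many arrays into a single container file
+with h5py.File("/scratch/netid/images.h5", "w") as f:
+    f.create_dataset("train/images", data=np.zeros((1000, 64, 64, 3), dtype="uint8"))
+    f.create_dataset("train/labels", data=np.zeros(1000, dtype="int64"))
+
+# later, read only the slice you need without loading the whole file into RAM
+with h5py.File("/scratch/netid/images.h5", "r") as f:
+    batch = f["train/images"][0:32]
+```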
+
+### LMDB
+LMDB (Lightning Memory-Mapped Database) is a light-weight, high-speed embedded database for key-value data.
+
+Essentially, this is a large file sitting on the disk that contains a lot of smaller objects inside.
+
+This is a memory-mapped database, meaning the file can be larger than RAM. The OS is responsible for managing the pages (e.g., caching frequently used pages).
+
+In practical terms: say you have 10 GB of RAM and an LMDB file of 100 GB. When you connect to this file, the OS may decide to load 5 GB into RAM, and the remaining 95 GB will be attached as virtual memory. Prince did not have a limit for virtual memory. Of course, if your RAM is larger than the LMDB file, the database will perform best, as the OS will have enough resources to keep what is needed directly in RAM.
+
+:::note
+when you write key-value pairs to LMDB they have to be byte-encoded. For example, if you use Python you can use: for string `st.encode()`, for np.array use `ar.tobytes()`, or in general `pickle.dumps()`
+:::
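+
+A minimal Python sketch of writing and reading byte-encoded values (path, key, and map_size are illustrative):
+```python
+import pickle
+import lmdb
+import numpy as np
+
+# map_size is the maximum size the database may grow to (here ~10 GB)
+env = lmdb.open("/scratch/netid/dataset.lmdb", map_size=10 * 1024**3)
+
+# keys and values must be bytes
+with env.begin(write=True) as txn:
+    txn.put(b"sample-0", pickle.dumps(np.zeros((64, 64, 3), dtype="uint8")))
+
+# read back a single element without loading the whole database
+with env.begin() as txn:
+    arr = pickle.loads(txn.get(b"sample-0"))
+env.close()
+```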
+
+Problem with large numbers of files: LMDB uses a [B-tree](https://en.wikipedia.org/wiki/B-tree), which has O(log n) complexity for search.
+
+Thus, when the number of elements in LMDB becomes very large, searching for a specific element slows down considerably.
+
+More info:
+- [Developer website](https://www.symas.com/mdb)
+- [Python package for lmdb](https://lmdb.readthedocs.io/en/release/)
+- [R package for lmdb](https://github.com/richfitz/thor)
+- Deep Learning
+ - [Tensorflow with LMDB example](https://stackoverflow.com/questions/37337523/how-do-you-load-an-lmdb-file-into-tensorflow)
+ - [Pytorch with LMDB example](https://discuss.pytorch.org/t/whats-the-best-way-to-load-large-data/2977/2)
+
+LMDB supports many readers and many parallel threads reading from the same file
+
+#### Formats inside HDF5/LMDB: binary, numpy, and others
+One can store data in different ways inside LMDB or HDF5. For example, we can store the binary representation of a jpeg, or we can store a Python numpy array. In the first case the file can be read from any language; in the second, only from Python. We can also store objects from other languages - for example, a tibble in R.
+
+#### Other formats
+There are other formats like [Bcolz](http://bcolz.blosc.org/), [Zarr](https://github.com/alimanfoo/zarr-python), and others. Some examples can be found [here](https://alimanfoo.github.io/2016/04/14/to-hdf5-and-beyond.html).
+
+## Example Code
+- A benchmarking of various ways of reading data was performed on now retired Prince HPC cluster. You can find the code used to perform that benchmarking and the results [at this repository](https://github.com/nyuhpc/public_ml/tree/master/Data_read_benchmarking).
+- For those of you interested in using multiple cores for data reading, [this code example may be useful](https://github.com/nyuhpc/public_ml/blob/master/Data_read_benchmarking/TextImages/read_benchmarks/read_parallel.py).
+ - Multiple cores on the same node are used. Parallelization is based on the `joblib` Python module
diff --git a/docs/hpc/03_storage/08_system_status.md b/docs/hpc/03_storage/08_system_status.md
new file mode 100644
index 0000000000..138aa22b28
--- /dev/null
+++ b/docs/hpc/03_storage/08_system_status.md
@@ -0,0 +1,13 @@
+# HPC Storage System Status
+
+## Allocation and Utilization data
+Below you can find data for the following file system mounts:
+
+- GPFS file system: /home, /scratch, /archive
+- VAST file system: /vast
+- HDFS file system of Hadoop cluster Peel
+
+:::note
+To be able to see the panels below you need to be within the NYU network.
+Use VPN if you are not on campus. You can find details about connecting to the VPN on the [Connecting to the HPC Cluster page](../02_connecting_to_hpc/01_connecting_to_hpc.md).
+:::
diff --git a/docs/hpc/03_storage/09_transferring_cloud_storage_data_with_rclone.md b/docs/hpc/03_storage/09_transferring_cloud_storage_data_with_rclone.md
new file mode 100644
index 0000000000..d0cbf02625
--- /dev/null
+++ b/docs/hpc/03_storage/09_transferring_cloud_storage_data_with_rclone.md
@@ -0,0 +1,199 @@
+# Transferring Cloud Storage Data with rclone
+
+## Transferring files to and from Google Drive with RCLONE
+Having access to Google Drive from the HPC environment provides an option to archive data and even share data with collaborators who have no access to the NYU HPC environment. Other options for archiving data include the HPC Archive file system and using Globus to share data with collaborators.
+
+Access to Google Drive is provided by [rclone](https://rclone.org/drive/) - rsync for cloud storage - a command-line program to sync files and directories to and from cloud storage systems such as Google Drive, Amazon Drive, S3, B2, etc. [rclone](https://rclone.org/drive/) is available on the Greene cluster as a module; the current module (as of November 2022) is **rclone/1.60.1**.
+
+For more details on how to use rclone to sync files to Google Drive, please see: [https://rclone.org/drive/](https://rclone.org/drive/)
+
+rclone can be invoked in one of the three modes:
+- [Copy](https://rclone.org/commands/rclone_copy/) mode to just copy new/changed files
+- [Sync](https://rclone.org/commands/rclone_sync/) (one way) mode to make a directory identical
+- [Check](https://rclone.org/commands/rclone_check/) mode to check for file hash equality
+
+Please try with these options:
+```sh
+rclone --transfers=32 --checkers=16 --drive-chunk-size=16384k --drive-upload-cutoff=16384k copy source:sourcepath dest:destpath
+```
+
+These options work well for file sizes from about 1GB up to 250GB. Keep in mind that there is a rate limit of 2 files/sec for uploads into Google Drive, so transfers of many small files do not work that well. If you have many small files, please tar their parent directory, split the tar file into 100GB chunks, and upload the chunks into Google Drive, as sketched below.
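+
+A possible way to create those chunks (archive name and chunk size are just examples):
+```sh
+# combine many small files into a single archive and split it into 100GB chunks
+tar cf - my_small_files_dir | split -b 100G - my_small_files.tar.part_
+# the chunks can later be reassembled and extracted with:
+# cat my_small_files.tar.part_* | tar xf -
+```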
+
+## rclone Configuration
+You need to configure rclone before you can move files between the HPC environment and Google Drive.
+
+There are specific instructions on the rclone web site: [https://rclone.org/drive/](https://rclone.org/drive/)
+
+### Step 1: Login to Greene:
+
+Follow [instructions](../02_connecting_to_hpc/01_connecting_to_hpc.md) to log into the Greene HPC cluster.
+
+### Step 2: Load the rclone module
+```sh
+$ module load rclone/1.60.1
+```
+
+### Step 3: Configure rclone
+
+Configure rclone and set up remote access to your Google Drive using the command:
+```sh
+$ rclone config
+```
+
+This will try to open the config file and you will see the content below:
+
+You can select one of the options (here we show how to set up a new remote)
+```sh
+2021/03/23 18:10:29 NOTICE: Config file "/home/netid/.config/rclone/rclone.conf" not found - using defaults
+No remotes found - make a new one
+n) New remote
+s) Set configuration password
+q) Quit config
+n/s/q> n
+name> remote1
+Type of storage to configure.
+Enter a string value. Press Enter for the default ("").
+Choose a number from below, or type in your own value
+ 1 / 1Fichier
+ \ "fichier"
+ 2 / Alias for an existing remote
+ \ "alias"
+ 3 / Amazon Drive
+ \ "amazon cloud drive"
+ 4 / Amazon S3 Compliant Storage Provider (AWS, Alibaba, Ceph, Digital Ocean, Dreamhost, IBM COS, Minio, Tencent COS, etc)
+ \ "s3"
+ 5 / Backblaze B2
+ \ "b2"
+ 6 / Box
+ \ "box"
+ 7 / Cache a remote
+ \ "cache"
+ 8 / Citrix Sharefile
+ \ "sharefile"
+ 9 / Dropbox
+ \ "dropbox"
+10 / Encrypt/Decrypt a remote
+ \ "crypt"
+11 / FTP Connection
+ \ "ftp"
+12 / Google Cloud Storage (this is not Google Drive)
+ \ "google cloud storage"
+13 / Google Drive
+ \ "drive"
+14 / Google Photos
+ \ "google photos"
+....
+....
+....
+37 / premiumize.me
+ \ "premiumizeme"
+38 / seafile
+ \ "seafile"
+Storage> 13
+** See help for drive backend at: https://rclone.org/drive/ **
+Google Application Client Id
+Setting your own is recommended.
+See https://rclone.org/drive/#making-your-own-client-id for how to create your own.
+If you leave this blank, it will use an internal key which is low performance.
+Enter a string value. Press Enter for the default ("").
+client_id> Just Hit Enter
+OAuth Client Secret
+Leave blank normally.
+Enter a string value. Press Enter for the default ("").
+client_secret> Just Hit Enter
+Scope that rclone should use when requesting access from drive.
+Enter a string value. Press Enter for the default ("").
+Choose a number from below, or type in your own value
+ 1 / Full access all files, excluding Application Data Folder.
+ \ "drive"
+ 2 / Read-only access to file metadata and file contents.
+ \ "drive.readonly"
+ / Access to files created by rclone only.
+ 3 | These are visible in the drive website.
+ | File authorization is revoked when the user deauthorizes the app.
+ \ "drive.file"
+ / Allows read and write access to the Application Data folder.
+ 4 | This is not visible in the drive website.
+ \ "drive.appfolder"
+ / Allows read-only access to file metadata but
+ 5 | does not allow any access to read or download file content.
+ \ "drive.metadata.readonly"
+scope> 1
+ID of the root folder
+Leave blank normally.
+Fill in to access "Computers" folders (see docs), or for rclone to use
+a non root folder as its starting point.
+Enter a string value. Press Enter for the default ("").
+root_folder_id> Just Hit Enter
+Service Account Credentials JSON file path
+Leave blank normally.
+Needed only if you want use SA instead of interactive login.
+Leading `~` will be expanded in the file name as will environment variables such as `${RCLONE_CONFIG_DIR}`.
+Enter a string value. Press Enter for the default ("").
+service_account_file> Just Hit Enter
+Edit advanced config? (y/n)
+y) Yes
+n) No (default)
+y/n> n
+Remote config
+Use auto config?
+ * Say Y if not sure
+ * Say N if you are working on a remote or headless machine
+y) Yes (default)
+n) No
+y/n> n
+Please go to the following link: https://accounts.google.com/o/oauth2/auth?access_type=offline&client_id=
+ CUT AND PASTE The URL ABOVE INTO A BROWSER ON YOUR LAPTOP/DESKTOP
+Log in and authorize rclone for access
+Enter verification code> ENTER VERIFICATION CODE HERE
+Configure this as a team drive?
+y) Yes
+n) No (default)
+y/n> n
+--------------------
+[remote1]
+type = drive
+scope = drive
+token = {"access_token":", removed "}
+--------------------
+y) Yes this is OK (default)
+e) Edit this remote
+d) Delete this remote
+y/e/d> y
+Current remotes:
+Name Type
+==== ====
+remote1 drive
+e) Edit existing remote
+n) New remote
+d) Delete remote
+r) Rename remote
+c) Copy remote
+s) Set configuration password
+q) Quit config
+e/n/d/r/c/s/q> q
+```
+### Step 4: Transfer files
+
+A sample command to list the top-level directories on the remote:
+```sh
+$ rclone lsd remote1:
+```
+
+Transfer files to Google Drive using the command below, replacing the placeholders with your source path, remote name, and destination folder:
+```sh
+$ rclone copy <source_path> <remote_name>:<destination_folder>
+```
+
+For example:
+```sh
+$ rclone copy /home/user1 remote1:backup_home_user1
+```
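+
+To confirm the transfer, you can list the destination folder afterwards; adding `-P` (`--progress`) to `rclone copy` also shows live progress. A minimal sketch, reusing the remote and folder names from the example above:
+```sh
+# Optional: show transfer progress while copying
+$ rclone copy -P /home/user1 remote1:backup_home_user1
+
+# List the files that ended up in the destination folder
+$ rclone ls remote1:backup_home_user1
+```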
+
+### Step 5:
+
+The files are transferred and you can now find them on your Google Drive.
+
+:::note
+Rclone only copies new files or files that differ from those already present on Google Drive.
+:::
diff --git a/docs/hpc/03_storage/10_sharing_data_on_hpc.md b/docs/hpc/03_storage/10_sharing_data_on_hpc.md
new file mode 100644
index 0000000000..1a256c7bf9
--- /dev/null
+++ b/docs/hpc/03_storage/10_sharing_data_on_hpc.md
@@ -0,0 +1,163 @@
+# Sharing Data on HPC
+
+## Introduction
+To share files on the cluster with other users, we recommend using file access control lists (FACL). The FACL mechanism allows fine-grained control of access to any file by any user or group of users. We discourage users from setting '777' permissions with `chmod`, because this can lead to data loss (whether by a malicious user or by accident). The Linux commands `getfacl` and `setfacl` are used to view and set access.
+
+The ACL mechanism, just like regular POSIX permissions, provides three levels of access control:
+
+- Read (r) - the permission to see the contents of a file
+- Write (w) - the permission to edit a file
+- eXecute (X) - the permission to run a file or enter a directory (we use X instead of x because X grants execute only on directories and on files that already have an execute bit set, and not all files need to be executable)
+
+These levels of access can be granted to:
+
+- user (owner of the file)
+- group (owner group)
+- other (everyone else)
+
+ACLs allow you to grant these types of access without modifying file ownership or changing the POSIX permissions.
+
+## Viewing ACL
+Use `getfacl` to retrieve access permissions for a file.
+```sh
+$ getfacl myfile.txt
+# file: myfile.txt
+# owner: ab123
+# group: users
+user::rw-
+group::---
+other::---
+```
+
+The example above illustrates that in most cases the ACL looks just like the chmod-based permissions: the owner of the file has read and write permission, while members of the group and everyone else have no permissions at all.
+
+## Setting ACL
+### Modify Access Permissions
+Use `setfacl`:
+
+```sh
+# general syntax:
+$ setfacl [option] [action/specification] file
+
+# most important options are
+# -m to modify ACL
+# -x to remove ACL
+# -R to apply the action recursively (apply to everything inside the directory)
+
+# To set permissions for a user (user is either the user name or ID):
+$ setfacl -m "u:user:permissions"
+
+# To set permissions for a group (group is either the group name or ID):
+$ setfacl -m "g:group:permissions"
+
+# To set permissions for others:
+$ setfacl -m "other:permissions"
+
+# To allow all newly created files or directories to inherit entries from the parent directory (this will not affect files which will be copied into the directory afterwards):
+$ setfacl -dm "entry"
+
+# To remove a specific entry:
+$ setfacl -x "entry"
+
+# To remove the default entries:
+$ setfacl -k
+
+# To remove all entries (entries of the owner, group and others are retained):
+$ setfacl -b
+```
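+
+For instance, to give an entire group read access to a directory tree (the group name and path here are hypothetical; remember that parent directories also need at least r-X access, as described below):
+```sh
+$ setfacl -Rm "g:mylab:r-X" /path/to/shared_dir
+$ getfacl /path/to/shared_dir    # confirm the new group entry
+```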
+
+### Important: Give Access to Parent Directories in the Path
+When you want to set an ACL on, say, `/a/b/c/example.out`, you also need to set appropriate ACLs on all the parent directories in the path. If you want to give read/write/execute permissions for the file `/a/b/c/example.out`, you would also need to give at least r-x permissions on the directories `/a`, `/a/b`, and `/a/b/c`.
+
+### Remove All ACL Entries
+```sh
+$ setfacl -b abc
+```
+
+### Check ACLs
+```sh
+$ getfacl abc
+# file: abc
+# owner: someone
+# group: someone
+user::rw-
+group::r--
+other::r--
+```
+
+You can see with `ls -l` if a file has extended permissions set with setfacl: the `+` in the last column of the permissions field indicates that this file has detailed access permissions via ACLs:
+```sh
+$ ls -la
+total 304
+drwxr-x---+ 18 ab123 users 4096 Apr 3 14:32 .
+drwxr-xr-x 1361 root root 0 Apr 3 09:35 ..
+-rw------- 1 ab123 users 4502 Mar 28 22:27 my_private_file
+-rw-r-xr--+ 1 ab123 users 29 Feb 11 23:18 dummy.txt
+```
+
+### Flags
+Please read `man setfacl` for all possible flags. For example:
+
+- '-m' - modify
+- '-x' - remove
+- '-R' - recursive (apply ACL to all content inside a directory)
+- '-d' - default (set the given entries as defaults - useful for a directory, so that all new content created inside it will inherit the given ACL)
+
+## Examples
+### File ACL Example
+Set read, write, and execute (rwX) permissions for user johnny to file named abc:
+```sh
+$ setfacl -m "u:johnny:rwX" abc
+```
+
+:::note
+We recommend using a capital 'X' for the execute permission, since a lowercase 'x' would make all files executable:
+
+Check permissions:
+```sh
+$ getfacl abc
+# file: abc
+# owner: someone
+# group: someone
+user::rw-
+user:johnny:rwX
+group::r--
+mask::rwX
+other::r--
+```
+
+Change permissions for user johnny:
+```sh
+$ setfacl -m "u:johnny:r-X" abc
+```
+
+Check permissions:
+```sh
+$ getfacl abc
+# file: abc
+# owner: someone
+# group: someone
+user::rw-
+user:johnny:r-X
+group::r--
+mask::r-X
+other::r--
+```
+:::
+
+### Directory ACL Example
+Let's say alice123 wants to share the directory `/scratch/alice123/shared/researchGroup/group1` with user `bob123`:
+```sh
+# Read/execute access to /scratch/alice123
+setfacl -m u:bob123:r-X /scratch/alice123
+# Read/execute access to /scratch/alice123/shared
+setfacl -m u:bob123:r-X /scratch/alice123/shared
+# Read/execute access to /scratch/alice123/shared/researchGroup
+setfacl -m u:bob123:r-X /scratch/alice123/shared/researchGroup
+# Now alice123 can finally give access to the directory /scratch/alice123/shared/researchGroup/group1
+setfacl -Rm u:bob123:rwX /scratch/alice123/shared/researchGroup/group1
+```
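+
+If new files will keep appearing in the shared directory, a default ACL can make them inherit bob123's access automatically. A minimal sketch, reusing the hypothetical paths from the example above:
+```sh
+# Default entries are inherited by files and directories created inside afterwards
+setfacl -dm u:bob123:rwX /scratch/alice123/shared/researchGroup/group1
+# Verify: inherited entries show up as "default:user:bob123:rwx"
+getfacl /scratch/alice123/shared/researchGroup/group1
+```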
+:::note
+User bob123 will be able to see the contents of the following directories:
+
+- `/scratch/alice123/`
+- `/scratch/alice123/shared`
+- `/scratch/alice123/shared/researchGroup/`
+- `/scratch/alice123/shared/researchGroup/group1`
+:::
diff --git a/docs/hpc/03_storage/static/disk_drive_image.jpg b/docs/hpc/03_storage/static/disk_drive_image.jpg
new file mode 100644
index 0000000000..27c1660d04
Binary files /dev/null and b/docs/hpc/03_storage/static/disk_drive_image.jpg differ
diff --git a/docs/hpc/03_storage/static/globus_collections.png b/docs/hpc/03_storage/static/globus_collections.png
new file mode 100644
index 0000000000..df83f230f7
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_collections.png differ
diff --git a/docs/hpc/03_storage/static/globus_connect_personal.png b/docs/hpc/03_storage/static/globus_connect_personal.png
new file mode 100644
index 0000000000..df1fc51c6c
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_connect_personal.png differ
diff --git a/docs/hpc/03_storage/static/globus_download.png b/docs/hpc/03_storage/static/globus_download.png
new file mode 100644
index 0000000000..c9942d2a3a
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_download.png differ
diff --git a/docs/hpc/03_storage/static/globus_filemanager.png b/docs/hpc/03_storage/static/globus_filemanager.png
new file mode 100644
index 0000000000..59ab303acf
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_filemanager.png differ
diff --git a/docs/hpc/03_storage/static/globus_login.png b/docs/hpc/03_storage/static/globus_login.png
new file mode 100644
index 0000000000..6a1eb06efc
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_login.png differ
diff --git a/docs/hpc/03_storage/static/globus_login_mfa.png b/docs/hpc/03_storage/static/globus_login_mfa.png
new file mode 100644
index 0000000000..0bcc3567b4
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_login_mfa.png differ
diff --git a/docs/hpc/03_storage/static/globus_start_transfer.png b/docs/hpc/03_storage/static/globus_start_transfer.png
new file mode 100644
index 0000000000..f4a783c01d
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_start_transfer.png differ
diff --git a/docs/hpc/03_storage/static/globus_success.png b/docs/hpc/03_storage/static/globus_success.png
new file mode 100644
index 0000000000..e1d82563b2
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_success.png differ
diff --git a/docs/hpc/03_storage/static/globus_transfer.png b/docs/hpc/03_storage/static/globus_transfer.png
new file mode 100644
index 0000000000..c50bd1c0d4
Binary files /dev/null and b/docs/hpc/03_storage/static/globus_transfer.png differ
diff --git a/docs/hpc/04_datasets/01_intro.md b/docs/hpc/04_datasets/01_intro.md
index e4e87bfaf2..5184be7825 100644
--- a/docs/hpc/04_datasets/01_intro.md
+++ b/docs/hpc/04_datasets/01_intro.md
@@ -1 +1,165 @@
-# Datasets available
+# Datasets Available
+
+## General
+The HPC team makes available a number of public datasets that are commonly used in analysis jobs. The datasets are available read-only under:
+- `/scratch/work/public/ml-datasets/`
+- `/vast/work/public/ml-datasets/`
+
+We recommend using the version stored on `/vast` (when available) for better read performance.
+
+:::note
+For some of the datasets, users must provide a signed usage agreement before accessing them.
+:::
+
+## Format
+Many datasets are available in the form of '.sqf' (SquashFS) files, which can be used as read-only overlays with Singularity.
+For example, in order to use the COCO dataset, one can run the following commands:
+```sh
+$ singularity exec \
+ --overlay //pytorch1.8.0-cuda11.1.ext3:ro \
+ --overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
+ --overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
+ --overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
+ /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash
+
+$ singularity exec \
+ --overlay //pytorch1.8.0-cuda11.1.ext3:ro \
+ --overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
+ --overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
+ --overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
+ /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif find /coco | wc -l
+
+532896
+```
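+
+To see what is currently provided, you can simply list the public dataset directories:
+```sh
+$ ls /vast/work/public/ml-datasets/
+$ ls /scratch/work/public/ml-datasets/
+```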
+
+## Data Sets
+### COCO Dataset
+*About data set*: [https://cocodataset.org/](https://cocodataset.org/#home)
+
+Common Objects in Context (COCO) is a large-scale object detection, segmentation, and captioning dataset.
+
+*Dataset is available under*
+`/scratch`
+- `/scratch/work/public/ml-datasets/coco/coco-2014.sqf`
+- `/scratch/work/public/ml-datasets/coco/coco-2015.sqf`
+- `/scratch/work/public/ml-datasets/coco/coco-2017.sqf`
+
+`/vast`
+- `/vast/work/public/ml-datasets/coco/coco-2014.sqf`
+- `/vast/work/public/ml-datasets/coco/coco-2015.sqf`
+- `/vast/work/public/ml-datasets/coco/coco-2017.sqf`
+
+### ImageNet and ILSVRC
+About data set: [ImageNet (image-net.org)](https://image-net.org/)
+
+ImageNet is an image dataset organized according to the [WordNet](https://wordnet.princeton.edu/) hierarchy (Miller, 1995). Each concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. ImageNet populates 21,841 synsets of WordNet with an average of 650 manually verified and full resolution images. As a result, ImageNet contains 14,197,122 annotated images organized by the semantic hierarchy of WordNet (as of August 2014). ImageNet is larger in scale and diversity than the other image classification datasets ([https://arxiv.org/abs/1409.0575](https://arxiv.org/abs/1409.0575)).
+
+:::note
+WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept ([https://wordnet.princeton.edu/](https://wordnet.princeton.edu/))
+:::
+
+#### ILSVRC (subset of ImageNet)
+
+ILSVRC uses a subset of ImageNet images for training the algorithms and some of ImageNet’s image collection protocols for annotating additional images for testing the algorithms ([https://arxiv.org/abs/1409.0575](https://arxiv.org/abs/1409.0575)). The name comes from 'ImageNet Large Scale Visual Recognition Challenge ([ILSVRC](https://image-net.org/challenges/LSVRC/2017/))'. The competition has since moved to Kaggle ([http://image-net.org/challenges/LSVRC/2017/](http://image-net.org/challenges/LSVRC/2017/))
+
+*What is included* ([https://arxiv.org/abs/1409.0575](https://arxiv.org/abs/1409.0575)).
+- 1000 object classes
+- approximately 1.2 million training images
+- 50 thousand validation images
+- 100 thousand test images
+- Size of data is about 150 GB (for train and validation)
+
+*Dataset is available under*
+- `/scratch/work/public/ml-datasets/imagenet`
+- `/vast/work/public/ml-datasets/imagenet`
+
+##### Get access to Data
+
+New York University does not own this dataset.
+
+Please open the ImageNet site, find the terms of use ([http://image-net.org/download](http://image-net.org/download)), copy them, replace the needed parts with your name, and send us an email including the terms with your name, thereby confirming that you agree to these terms. Once you do this, we can grant you access to the copy of the dataset on the cluster.
+
+### Million Song Dataset
+*About data set*: [https://labrosa.ee.columbia.edu/millionsong/](https://labrosa.ee.columbia.edu/millionsong/)
+
+*Dataset is available under*
+
+- `/scratch/work/public/MillionSongDataset`
+- `/vast/work/public/ml-datasets/millionsongdataset/`
+
+### Twitter Decahose
+*About data set*: [https://developer.twitter.com/en/docs/twitter-api/enterprise/decahose-api/overview/decahose](https://developer.twitter.com/en/docs/twitter-api/enterprise/decahose-api/overview/decahose)
+
+NYU has a subscription to the Twitter Decahose - a 10% random sample of the realtime Twitter Firehose, delivered through a streaming connection.
+
+*Data are stored* in GCP cloud (BigQuery) and on HPC clusters Greene and Peel (Parquet format).
+
+Please contact Megan Brown at [The Center for Social Media & Politics](https://csmapnyu.org/) to get access to data and learn the tools available to work with it.
+
+*On the cluster, the dataset is available under (given that you have permissions)*
+- `/scratch/work/twitter_decahose/`
+
+### ProQuest Congressional Record
+About data set: [ProQuest Congressional Record](https://guides.nyu.edu/tdm/proquest-congressional-record-tdm-guide)
+
+The ProQuest Congressional Record text-as-data collection consists of machine-readable files capturing the full text and a small number of metadata fields for a full run of the Congressional Record between 1789 and 2005. Metadata fields include the date of publication, subjects (for issues for which such information exists in the ProQuest system), and URLs linking the full text to the canonical online record for that issue on the ProQuest Congressional platform. A total of 31,952 issues are available.
+
+*Dataset is available under*:
+- `/scratch/work/public/proquest/`
+
+### C4
+*About data set*: [c4 | TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/c4)
+
+A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: [https://commoncrawl.org](https://commoncrawl.org)
+
+*Dataset is available under*
+- `/scratch/work/public/ml-datasets/c4`
+- `/vast/work/public/ml-datasets/c4`
+
+### GQA
+*About data set*: [GQA: Visual Reasoning in the Real World (stanford.edu)](https://cs.stanford.edu/people/dorarad/gqa/index.html)
+
+Question Answering on Image Scene Graphs
+
+*Dataset is available under*
+- `/scratch/work/public/ml-datasets/gqa`
+- `/vast/work/public/ml-datasets/gqa`
+
+### MJSynth
+*About data set*: [Visual Geometry Group - University of Oxford](https://www.robots.ox.ac.uk/~vgg/data/text/)
+
+This is a synthetically generated dataset which has been found to be sufficient for training text recognition on real-world images.
+
+This dataset consists of 9 million images covering 90k English words, and includes the training, validation and test splits used in the authors' work (the archived dataset is about 10 GB).
+
+*Dataset is available under*
+- `/vast/work/public/ml-datasets/mjsynth`
+
+### open-images-dataset
+*About data set*: [Open Images Dataset – opensource.google](https://storage.googleapis.com/openimages/web/index.html)
+
+A dataset of ~9 million varied images with rich annotations
+
+The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). The dataset contains image-level label annotations, object bounding boxes, object segmentations, visual relationships, localized narratives, and more.
+
+*Dataset is available under*
+- `/scratch/work/public/ml-datasets/open-images-dataset`
+- `/vast/work/public/ml-datasets/open-images-dataset`
+
+### Pile
+*About data set*: [The Pile (eleuther.ai)](https://pile.eleuther.ai/)
+
+The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.
+
+*Dataset is available under*
+- `/scratch/work/public/ml-datasets/pile`
+- `/vast/work/public/ml-datasets/pile`
+
+### Waymo open dataset
+*About data set*: [Open Dataset – Waymo](https://waymo.com/open/)
+
+The field of machine learning is changing rapidly. Waymo is in a unique position to contribute to the research community with some of the largest and most diverse autonomous driving datasets ever released.
+
+*Dataset is available under*
+- `/vast/work/public/ml-datasets/waymo_open_dataset_v_1_2_0_individual_files`
+
diff --git a/docs/hpc/04_datasets/_category_.json b/docs/hpc/04_datasets/_category_.json
index 015b2ae8a0..a8dc2c44a4 100644
--- a/docs/hpc/04_datasets/_category_.json
+++ b/docs/hpc/04_datasets/_category_.json
@@ -1,3 +1,3 @@
{
"label": "Datasets"
-}
+}
\ No newline at end of file
diff --git a/docs/hpc/05_submitting_jobs/01_slurm_submitting_jobs.md b/docs/hpc/05_submitting_jobs/01_slurm_submitting_jobs.md
index 9c4bcbf0d0..012df381fe 100644
--- a/docs/hpc/05_submitting_jobs/01_slurm_submitting_jobs.md
+++ b/docs/hpc/05_submitting_jobs/01_slurm_submitting_jobs.md
@@ -3,10 +3,10 @@
## Batch vs Interactive Jobs
-- HPC workloads are usually better suited to *batch processing* than *interactive* working.
-- A batch job is sent to the system when submitted with an ~sbatch~ command.
-- The working pattern we are all familiar with is *interactive* - where we type ( or click ) something interactively, and the computer performs the associated action. Then we type ( or click ) the next thing.
-- Comments at the start of the script, which match a special pattern ( #SBATCH ) are read as Slurm options.
+- HPC workloads are usually better suited to *batch processing* than *interactive* working.
+- A batch job is sent to the system when submitted with an `sbatch` command.
+- The working pattern we are all familiar with is *interactive* - where we type ( or click ) something interactively, and the computer performs the associated action. Then we type ( or click ) the next thing.
+- Comments at the start of the script, which match a special pattern ( #SBATCH ) are read as Slurm options.
### The trouble with interactive environments
@@ -14,24 +14,24 @@ There is a reason why GUIs are less common in HPC environments: **point-and-clic
The job might not start immediately, and might take hours or days, so we prefer a *batch* approach:
-- Plan the sequence of commands which will perform the actions we need
- - Write them into a script.
+- Plan the sequence of commands which will perform the actions we need
+ - Write them into a script.
I can now run the script interactively, which is a great way to save effort if i frequently use the same workflow, or ...
-- Submit the script to a batch system, to run on dedicated resources when they become available.
+- Submit the script to a batch system, to run on dedicated resources when they become available.
### Where does the output go ?
-- The batch system writes stdout and stderr from a job to a file named for example *"slurm-12345.out"*
- - You can change either stdout or stderr using sbatch options.
-- While a job is running, it caches the stdout an stderr in the job working directory.
-- You can use redirection to send output of a specific command into a file.
+- The batch system writes stdout and stderr from a job to a file named for example *"slurm-12345.out"*
+ - You can change either stdout or stderr using sbatch options.
+- While a job is running, it caches the stdout and stderr in the job working directory.
+- You can use redirection to send output of a specific command into a file.
### Writing and Submitting a Job
There are two aspects to a batch job script:
-- A set of *SBATCH* directives describing the resources required and other information about the job.
-- The script itself, comprised of commands to set up and perform the computations without additional user interaction.
+- A set of *SBATCH* directives describing the resources required and other information about the job.
+- The script itself, comprised of commands to set up and perform the computations without additional user interaction.
### A simple example
@@ -222,93 +222,93 @@ sbatch --nodes=2 --ntasks-per-node=4 my_script.sh
### Options to manage job output
-- `-J jobname`
- - Give the job a name. The default is the filename of the job script. Within the job, `$SLURM_JOB_NAME` expands to the job name.
+- `-J jobname`
+ - Give the job a name. The default is the filename of the job script. Within the job, `$SLURM_JOB_NAME` expands to the job name.
-- `-o path/for/stdout`
- - Send `stdout` to `path/for/stdout`. The default filename is slurm-`${SLURM_JOB_ID}.out`, e.g. slurm-`12345.out`, in the directory from which the job was submitted.
+- `-o path/for/stdout`
+  - Send `stdout` to `path/for/stdout`. The default filename is `slurm-${SLURM_JOB_ID}.out`, e.g. `slurm-12345.out`, in the directory from which the job was submitted.
-- `-e path/for/stderr`
- - Send `stderr` to `path/for/stderr`.
+- `-e path/for/stderr`
+ - Send `stderr` to `path/for/stderr`.
-- `--mail-user=my_email_address@nyu.edu`
- - Send mail to my_email_address@nyu.edu when certain events occur.
+- `--mail-user=my_email_address@nyu.edu`
+ - Send mail to my_email_address@nyu.edu when certain events occur.
-- `--mail-type=type`
- - Valid type values are NONE, BEGIN, END, FAIL, REQUIRE, ALL.
+- `--mail-type=type`
+  - Valid type values are NONE, BEGIN, END, FAIL, REQUEUE, ALL.
### Options to set the job environment:
-- `--export=VAR1,VAR2="some value",VAR3`
- - Pass variables to the job, either with a specific value (the `VAR=` form) or from the submitting environment ( without "`=`" )
+- `--export=VAR1,VAR2="some value",VAR3`
+ - Pass variables to the job, either with a specific value (the `VAR=` form) or from the submitting environment ( without "`=`" )
- - `--get-user-env`\[=timeout]\[mode]
- - Run something like "su `-` \ -c /usr/bin/env" and parse the output. Default timeout is 8 seconds. The mode value can be "S", or "L" in which case "su" is executed with "`-`" option.
+ - `--get-user-env`\[=timeout]\[mode]
+ - Run something like "su `-` \ -c /usr/bin/env" and parse the output. Default timeout is 8 seconds. The mode value can be "S", or "L" in which case "su" is executed with "`-`" option.
### Options to request compute resources
-- `-t, --time=time`
- - `Set a limit on the total run time. Acceptable formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds"`.
+- `-t, --time=time`
+  - Set a limit on the total run time. Acceptable formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
-- `--mem=MB`
- - Maximum memory per node the job will need in MegaBytes
+- `--mem=MB`
+ - Maximum memory per node the job will need in MegaBytes
-- `--mem-per-cpu=MB`
- - `Memory required per allocated CPU in MegaBytes`
+- `--mem-per-cpu=MB`
+  - Memory required per allocated CPU in MegaBytes
-- `-N, --node=num`
- - Number of nodes are required. Default is 1 node.
- - `-n, --ntasks=num`
- - Maximum number tasks will be launched. Default is one task per node.
- - `--ntasks-per-node=ntasks`
- - Request that ntasks be invoked on each node.
- - `-c, --cpus-per-task=ncpus`
- - Require ncpus number of CPU cores per task. Without this option, allocate one core per task.
- - Requesting the resources you need, as accurately as possible, allows your job to be started at the earliest opportunity as well as helping the system to schedule work efficiently to everyone's benefit.
+- `-N, --nodes=num`
+  - Number of nodes required. Default is 1 node.
+
+- `-n, --ntasks=num`
+  - Maximum number of tasks that will be launched. Default is one task per node.
+
+- `--ntasks-per-node=ntasks`
+  - Request that ntasks be invoked on each node.
+
+- `-c, --cpus-per-task=ncpus`
+  - Require ncpus CPU cores per task. Without this option, one core is allocated per task.
+
+- Requesting the resources you need, as accurately as possible, allows your job to be started at the earliest opportunity as well as helping the system to schedule work efficiently to everyone's benefit.
### Options for running interactively on the compute nodes with srun
-- `-nnum`
- - `Specify the number of tasks to run, eg. -n4. Default is one CPU core per task.` Don't just submit the job, but also wait for it to start and connect `stdout`, `stderr`and `stdin` to the current terminal.
+- `-nnum`
+  - Specify the number of tasks to run, eg. -n4. Default is one CPU core per task. Don't just submit the job, but also wait for it to start and connect `stdout`, `stderr` and `stdin` to the current terminal.
-- `-ttime`
- - Request job running duration, eg. `-t1:30:00`
+- `-ttime`
+ - Request job running duration, eg. `-t1:30:00`
-- `--mem=MB`
- - Specify the real memory required per node in MegaBytes, eg. `--mem=4000`
- - `--pty`
- - Execute the first task in pseudo terminal mode, eg. `--pty /bin/bash`, to start a bash command shell
+- `--mem=MB`
+  - Specify the real memory required per node in MegaBytes, eg. `--mem=4000`
+
+- `--pty`
+  - Execute the first task in pseudo terminal mode, eg. `--pty /bin/bash`, to start a bash command shell
-- `--x11`
- - Enable X forwarding, so programs using a GUI can be used during the session (provided you have X forwarding to your workstation set up)
- - To leave an interactive batch session, type `exit` at the command prompt
+- `--x11`
+  - Enable X forwarding, so programs using a GUI can be used during the session (provided you have X forwarding to your workstation set up)
+
+- To leave an interactive batch session, type `exit` at the command prompt
### Options for delaying starting a job
-- `--begin=time`
- - Delay starting this job until after the specified date and time, eg. `--begin=9:42:00`, to start the job at 9:42:00 am
-
-- `-d, --dependency=dependency_list`
- - (More info here [https://slurm.schedmd.com/sbatch.html](https://slurm.schedmd.com/sbatch.html))
- - Example 1
- - `--dependency=afterok:12345`, to delay starting this job until the job 12345 has completed successfully
- - Example 2
- - Let us say job 1 uses sbatch file job1.sh, and job 2 uses job2.sh
- - Inside the batch file of the second job (job2.sh) add
- - `#SBATCH --dependency=afterok:$job1`
- - Start the first job and get id of the job
- - `job1=$(echo $(sbatch job1.sh) | grep -Eo "[0-9]+")`
- - Schedule second jobs to start when the first one ends
- - `sbatch job2.sh`
+- `--begin=time`
+ - Delay starting this job until after the specified date and time, eg. `--begin=9:42:00`, to start the job at 9:42:00 am
+
+- `-d, --dependency=dependency_list`
+ - (More info here [https://slurm.schedmd.com/sbatch.html](https://slurm.schedmd.com/sbatch.html))
+ - Example 1
+ - `--dependency=afterok:12345`, to delay starting this job until the job 12345 has completed successfully
+  - Example 2 (put together in the sketch after this list)
+    - Let us say job 1 uses sbatch file job1.sh, and job 2 uses job2.sh
+    - Start the first job and capture its job id
+      - `job1=$(sbatch job1.sh | grep -Eo "[0-9]+")`
+    - Schedule the second job so that it starts only after the first one completes successfully. Note that `$job1` cannot be expanded inside an `#SBATCH` directive, so the dependency is passed on the `sbatch` command line
+      - `sbatch --dependency=afterok:$job1 job2.sh`
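+
+Put together, the two submissions from Example 2 look like this (a minimal sketch; job1.sh and job2.sh are your own batch scripts):
+```sh
+# Submit the first job and capture its numeric job id
+job1=$(sbatch job1.sh | grep -Eo "[0-9]+")
+
+# Submit the second job so that it starts only if the first one completes successfully
+sbatch --dependency=afterok:$job1 job2.sh
+```
+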
### Options for running many similar jobs
-- `-a, --array=indexes`
- - Submit an array of jobs with array ids as specified. Array ids can be specified as a numerical range, a comma-seperated list of numbers, or as some combination of the two. Each job instance will have an environment variable `SLURM_ARRAY_JOB_ID` and `SLURM_ARRAY_TASK_ID`. For example:
- - `--array=1-11`, to start an array job with index from 1 to 11
- - `--array=1-7:2`, to submit an array job with index step size 2
- - `--array=1-9%4`, to submit an array job with simultaneously running job elements set to 4
- - The srun command is similar to `pbsdsh`. It launches tasks on allocated resources
+- `-a, --array=indexes`
+  - Submit an array of jobs with array ids as specified. Array ids can be specified as a numerical range, a comma-separated list of numbers, or as some combination of the two. Each job instance will have the environment variables `SLURM_ARRAY_JOB_ID` and `SLURM_ARRAY_TASK_ID` (see the sketch after this list). For example:
+ - `--array=1-11`, to start an array job with index from 1 to 11
+ - `--array=1-7:2`, to submit an array job with index step size 2
+ - `--array=1-9%4`, to submit an array job with simultaneously running job elements set to 4
+ - The srun command is similar to `pbsdsh`. It launches tasks on allocated resources
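+
+As a minimal sketch (the job name, resources and index range are placeholders), a job array submission script could look like this:
+```sh
+#!/bin/bash
+#SBATCH --job-name=array-example
+#SBATCH --array=1-11
+#SBATCH --time=00:10:00
+#SBATCH --mem=2GB
+
+# Each array element sees its own index in SLURM_ARRAY_TASK_ID
+echo "This is array task ${SLURM_ARRAY_TASK_ID} of array job ${SLURM_ARRAY_JOB_ID}"
+```
+Submitting this script once with `sbatch` schedules eleven independent job elements.
+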
## R Job Example
@@ -431,24 +431,24 @@ There are three NVIDIA GPU types and one AMD GPU type that can be used.
**To request NVIDIA GPUs**
-- RTX8000
+- RTX8000
```sh
#SBATCH --gres=gpu:rtx8000:1
```
-- V100
+- V100
```sh
#SBATCH --gres=gpu:v100:1
```
-- A100
+- A100
```sh
#SBATCH --gres=gpu:a100:1
```
**To request AMD GPUs**
-- MI50
+- MI50
```sh
#SBATCH --gres=gpu:mi50:1
```
@@ -501,15 +501,15 @@ The majority of the jobs on the NYU HPC cluster are submitted with the sbatch co
There are cases when users need to run applications interactively ( interactive jobs ). Interactive jobs allow the users to enter commands and data on the command line (or in a graphical interface ), providing an experience similar to working on a desktop or laptop. Examples of common interactive tasks are:
-- Editing files
+- Editing files
-- Compiling and debugging code
+- Compiling and debugging code
-- Exploring data, to insights
+- Exploring data to gain insights
-- A graphical window to run visualization
+- A graphical window to run visualization
-- etc
+- etc
To support interactive use in a batch environment, Slurm allows for interactive batch jobs.
@@ -549,23 +549,23 @@ cd $SLURM_SUBMIT_DIR
(Don't just submit the job, but also wait for it to start and connect `stdout`, `stderr` and `stdin` to the current terminal)
-- `-nnum`
- - Specify the number of tasks to run, eg. -n4. Default is one CPU core per task
+- `-nnum`
+ - Specify the number of tasks to run, eg. -n4. Default is one CPU core per task
-- `-ttime`
- - Request job running duration, eg. `-t1:30:00`
+- `-ttime`
+ - Request job running duration, eg. `-t1:30:00`
-- `--mem=MB`
- - Specify the real memory required per node in MegaBytes, eg. `--mem=4000`
- - `--pty`
- - Execute the first task in pseudo terminal mode, eg. `--pty /bin/bash`, to start a bash command shell
+- `--mem=MB`
+  - Specify the real memory required per node in MegaBytes, eg. `--mem=4000`
+
+- `--pty`
+  - Execute the first task in pseudo terminal mode, eg. `--pty /bin/bash`, to start a bash command shell
-- `--gres=gpu:N`
- - To request `N` number of GPUs
+- `--gres=gpu:N`
+ - To request `N` number of GPUs
-- `--x11`
- - Enable X forwarding, so programs using a GUI can be used during the session (provided you have X forwarding to your workstation set up)
- - To leave an interactive batch session, type `exit` at the command prompt
+- `--x11`
+  - Enable X forwarding, so programs using a GUI can be used during the session (provided you have X forwarding to your workstation set up)
+
+- To leave an interactive batch session, type `exit` at the command prompt
Certain tasks need user iteraction - such as debugging and some GUI-based applications. However the HPC clusters rely on batch job scheduling to efficiently allocate resources. Interactive batch jobs allow these apparently conflicting requirements to be met.
diff --git a/docs/hpc/05_submitting_jobs/02_slurm_tutorial.md b/docs/hpc/05_submitting_jobs/02_slurm_tutorial.md
index 5ecf220ad5..56db8a198c 100644
--- a/docs/hpc/05_submitting_jobs/02_slurm_tutorial.md
+++ b/docs/hpc/05_submitting_jobs/02_slurm_tutorial.md
@@ -12,11 +12,11 @@ In NYU HPC clusters the users coming from many departments with various discipli
The Slurm software system is a resource manager and a job scheduler, which is designed to allocate resources and schedule jobs. Slurm is an open-source software, with a large user community, and has been installed on many top 500 supercomputers.
-- This tutorial assumes you have a NYU HPC account. If not, you may find the steps to apply for an account on the [Getting and renewing an account page][getting and renewing an account page].
+- This tutorial assumes you have a NYU HPC account. If not, you may find the steps to apply for an account on the [Getting and renewing an account page](../01_getting_started/02_getting_and_renewing_an_account.md).
-- It also assumes you are comfortable with Linux command-line environment. To learn about linux please read \[Tutorial 1].
+- It also assumes you are comfortable with Linux command-line environment. To learn about linux please read \[Tutorial 1].
-- Please review the \[Hardware Specs page] for more information on Greene's hardware specifications.
+- Please review the \[Hardware Specs page] for more information on Greene's hardware specifications.
## Slurm Commands
@@ -30,23 +30,23 @@ To use a given software package, you load the corresponding module. Unloading th
Below is a list of modules and their associated functions:
-- `module load ` : loads a module
- - For example : `module load python3`
+- `module load ` : loads a module
+ - For example : `module load python3`
-- `module unload ` : unloads a module
- - For example : `module unload python3`
+- `module unload ` : unloads a module
+ - For example : `module unload python3`
-- `module show ` : see exactly what effect loading a module will have with
+- `module show ` : see exactly what effect loading a module will have with
-- `module purge` : remove all loaded modules from your environment
+- `module purge` : remove all loaded modules from your environment
-- `module whatis ` : Find out more about a software package
+- `module whatis ` : Find out more about a software package
-- `module list` : check which modules are currently loaded in your environment
+- `module list` : check which modules are currently loaded in your environment
-- `module avail` : check what software packages are available
+- `module avail` : check what software packages are available
-- `module help ` : A module file may include more detailed help for software package
+- `module help ` : A module file may include more detailed help for software package
## Batch Job Example
@@ -153,15 +153,15 @@ cat slurm-.out
While the majority of the jobs on the cluster are submitted with the `sbatch` command, and executed in the background, there are also methods to run applications interactively throughthe `srun` command. Interactive jobs allow the users to enter commands and data on the command line (or in a graphical interface), providing an experience similar to working on a desktop or laptop. Examples of common interactive tasks are:
-- Editing files
+- Editing files
-- Compiling and debugging code
+- Compiling and debugging code
-- Exploring data, to obtain a rough idea of characteristics on the topic
+- Exploring data, to obtain a rough idea of its characteristics
-- Getting graphical windows to run visualization
+- Getting graphical windows to run visualization
-- Running software tools in interactive sessions
+- Running software tools in interactive sessions
Interactive jobs also help avoid issues with the login nodes. If you are working on a login node and your job is too IO intensive, it may be removed without notice. Running interactive jobs on compute nodes does not impact many users and in addition provides access to resources that are not available on the login nodes, such as interactive access to GPUs, high memory, exclusive access to all the resources of a compute node, etc.
diff --git a/docs/hpc/05_submitting_jobs/03_slurm_main_commands.md b/docs/hpc/05_submitting_jobs/03_slurm_main_commands.md
index 95effc1134..74f3768261 100644
--- a/docs/hpc/05_submitting_jobs/03_slurm_main_commands.md
+++ b/docs/hpc/05_submitting_jobs/03_slurm_main_commands.md
@@ -16,9 +16,9 @@ Slurm offers many utility commands to work with, some of the most popularly used
Run a parallel job on cluster managed by Slurm, can be used:
-1. Individual job submission where resources are allocated.
-2. In `sbatch` batch scripts as `job steps` making use of the allocated resource pool.
-3. within `salloc` instance making use of the resource pool.
+1. Individual job submission where resources are allocated.
+2. In `sbatch` batch scripts as `job steps` making use of the allocated resource pool.
+3. Within an `salloc` instance, making use of the resource pool.
```sh
man srun # for more information
diff --git a/docs/hpc/06_tools_and_software/02_conda_environments.md b/docs/hpc/06_tools_and_software/02_conda_environments.md
index 04ec42bffd..e19dea3932 100644
--- a/docs/hpc/06_tools_and_software/02_conda_environments.md
+++ b/docs/hpc/06_tools_and_software/02_conda_environments.md
@@ -34,7 +34,7 @@ module load anaconda3/2020.07
Conda init can create problems with package installation, so we suggest using `source activate` instead of `conda activate`, even though conda activate is considered a best practice by the Anaconda developers.
### Automatic deletion of your files
-This page describes the installation of packages on /scratch. One has to remember, though, that files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the /scratch file system that have not been accessed for 60 or more days will be purged (read more about [Data Management](../03_storage/06_data_management.md).
+This page describes the installation of packages on /scratch. One has to remember, though, that files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the /scratch file system that have not been accessed for 60 or more days will be purged (read more about [Data Management](../03_storage/01_intro_and_data_management.mdx)).
Thus you can consider the following options
diff --git a/docs/hpc/06_tools_and_software/03_python_packages_with_virtual_environments.md b/docs/hpc/06_tools_and_software/03_python_packages_with_virtual_environments.md
index 5fb013a0be..3728855a09 100644
--- a/docs/hpc/06_tools_and_software/03_python_packages_with_virtual_environments.md
+++ b/docs/hpc/06_tools_and_software/03_python_packages_with_virtual_environments.md
@@ -16,7 +16,7 @@ module load python/intel/3.8.6
```
## Automatic deletion of your files
-This page describes the installation of packages on /scratch. One has to remember, though, that files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the /scratch file system that have not been accessed for 60 or more days will be purged (read [more](../03_storage/06_data_management.md)).
+This page describes the installation of packages on /scratch. One has to remember, though, that files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the /scratch file system that have not been accessed for 60 or more days will be purged (read [more](../03_storage/01_intro_and_data_management.mdx)).
Thus you can consider the following options
diff --git a/docs/hpc/06_tools_and_software/04_r_packages_with_renv.md b/docs/hpc/06_tools_and_software/04_r_packages_with_renv.md
index e0987d72f7..177689ac8f 100644
--- a/docs/hpc/06_tools_and_software/04_r_packages_with_renv.md
+++ b/docs/hpc/06_tools_and_software/04_r_packages_with_renv.md
@@ -13,7 +13,7 @@ R
```
### Automatic deletion of your files
-This page describes the installation of packages on /scratch. One has to remember, though, that files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the `/scratch` file system that have not been accessed for 60 or more days will be purged (read [more](../03_storage/06_data_management.md)).
+This page describes the installation of packages on /scratch. One has to remember, though, that files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the `/scratch` file system that have not been accessed for 60 or more days will be purged (read [more](../03_storage/01_intro_and_data_management.mdx)).
Thus you can consider the following options:
diff --git a/docs/hpc/06_tools_and_software/05_software_on_greene.md b/docs/hpc/06_tools_and_software/05_software_on_greene.md
new file mode 100644
index 0000000000..8d09f1b89d
--- /dev/null
+++ b/docs/hpc/06_tools_and_software/05_software_on_greene.md
@@ -0,0 +1,327 @@
+# Software on Greene
+
+## Software Overview
+There are different types of software packages available:
+
+- Use `module avail` command to see preinstalled software.
+ - This includes the licensed software listed below
+- Singularity Containers
+ - You can find those already built and ready to use, at location `/scratch/work/public/singularity/`
+ - For more information on running software with Singularity, [click here](../06_tools_and_software/06_singularity_run_custom_applications_with_containers.md).
+- Python/R/Julia packages can be installed by a user
+
+If you need another Linux program installed, please contact us at [hpc@nyu.edu](mailto:hpc@nyu.edu).
+
+## Software and Environment Modules
+Lmod, an Environment Module system, is a tool for managing multiple versions and configurations of software packages and is used by many HPC centers around the world. With Environment Modules, software packages are installed away from the base system directories, and for each package, an associated modulefile describes what must be altered in a user's shell environment - such as the $PATH environment variable - in order to use the software package. The modulefile also describes dependencies and conflicts between this software package and other packages and versions.
+
+To use a given software package, you load the corresponding module. Unloading the module afterwards cleanly undoes the changes that loading the module made to your environment, thus freeing you to use other software packages that might have conflicted with the first one.
+
+Below is a list of module commands and their associated functions:
+
+| Command | Function |
+|-----------------------------------|-----------------------------------------------------------------------|
+| module unload `` | unload a module |
+| module show `` | see exactly what effect loading the module will have with |
+| module purge | remove all loaded modules from your environment |
+| module load `` | load a module |
+| module whatis `` | find out more about a software package |
+| module list | check which modules are currently loaded in your environment |
+| module avail | check what software packages are available |
+| module help `` | A module file may include more detailed help for the software package |
+
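+For example, a typical workflow is to start from a clean environment, load what you need, and confirm what is loaded (the module version shown is only an illustration; run `module avail` to see what is actually installed):
+
+```sh
+module purge                 # start from a clean environment
+module avail                 # see which packages and versions are available
+module load matlab/2020b     # load a specific version (example)
+module list                  # confirm which modules are loaded
+```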
+
+## Package Management for R, Python, Julia, and Conda in general
+- [Conda environments (Python, R)](../06_tools_and_software/02_conda_environments.md)
+- [Using virtual environments for Python](../06_tools_and_software/03_python_packages_with_virtual_environments.md)
+- [Managing R packages with renv](../06_tools_and_software/04_r_packages_with_renv.md)
+- [Singularity with Miniconda](../07_containers/03_singularity_with_conda.md)
+
+## Examples of software usage on Greene
+Examples can be found under `/scratch/work/public/examples/` and include the following:
+
+| | | |
+|-------------------|-----------------------|-----------------------|
+| alphafold | knitro | Singularity |
+| amd GPUs | lammps | slurm |
+| comsol | matlab | spark |
+| c-sharp | mathematica | stata |
+| crystal17 | namd | squashfs |
+| fluent | orca | trinity |
+| gaussian | quantum-espresso | vnc |
+| hadoop-streaming | R | vscode |
+| julia | sas | xvfb |
+| jupyter notebooks | schrodinger | |
+
+## Accessing Datasets with Singularity
+- [Singularity for Datasets](../07_containers/04_squash_file_system_and_singularity.md)
+
+## Licensed Software
+### SCHRODINGER
+Schrödinger provides a complete suite of software solutions with the latest advances in pharmaceutical research and computational chemistry. The NYU New York campus has a limited number of licenses for the Biologics Suite (ConfGen, Epik, Jaguar, Jaguar pKa, MacroModel, Prime, QSite, SiteMap), BioLuminate and the Basic Docking Suite.
+
+:::note
+Schrödinger can be used for non-commercial, academic purposes ONLY.
+:::
+
+#### Using SCHRODINGER on HPC Cluster
+
+To load the Schrodinger module, execute:
+```sh
+$ module load schrodinger/2021-1
+```
+#### Using SCHRODINGER on NYU Lab Computers
+
+1. Request your account at: [https://www.schrodinger.com/request-account](https://www.schrodinger.com/request-account)
+2. Download the software at: [https://www.schrodinger.com/downloads/releases](https://www.schrodinger.com/downloads/releases)
+3. [Contact NYU-HPC team](mailto:hpc@nyu.edu) to request your license file.
+
+The license servers are accessible from the NYU subnet.
+
+Please see the following links for installation of the license file:
+- [https://www.schrodinger.com/kb/377238](https://www.schrodinger.com/kb/377238)
+- [https://www.schrodinger.com/license-installation-instructions](https://www.schrodinger.com/license-installation-instructions)
+
+To check the license status:
+```sh
+# module load schrodinger/2021-1 # load schrodinger if not already loaded
+# licadmin STAT
+# licutil -jobs
+
+## For example:
+
+[wang@cs001 ~]$ licutil -jobs
+######## Server /share/apps/schrodinger/schrodinger.lic
+Product & job type Jobs
+BIOLUMINATE 10
+BIOLUMINATE, Docking 1
+BIOLUMINATE, Shared 10
+CANVAS 50
+COMBIGLIDE, Grid Generation 11
+COMBIGLIDE, Library Generation 50
+COMBIGLIDE, Protein Prep 11
+COMBIGLIDE, Reagent Prep 1
+EPIK 11
+GLIDE, Grid Generation 11
+GLIDE, Protein Prep 11
+GLIDE, SP Docking 1
+GLIDE, XP Descriptors 1
+GLIDE, XP Docking 1
+IMPACT 11
+JAGUAR 5
+JAGUAR, PKA 5
+KNIME 50
+LIGPREP, Desalter 1
+LIGPREP, Ionizer 3511
+LIGPREP, Ligparse 1
+LIGPREP, Neutralizer 1
+LIGPREP, Premin Bmin 1
+LIGPREP, Ring Conf 1
+LIGPREP, Stereoizer 1
+LIGPREP, Tautomerizer 1
+MACROMODEL 5
+MACROMODEL, Autoref 5
+MACROMODEL, Confgen 5
+MACROMODEL, Csearch Mbae 5
+MAESTRO, Unix 1000
+MMLIBS 3511
+PHASE, CL Phasedb Confsites 1
+PHASE, CL Phasedb Convert 1
+PHASE, CL Phasedb Manage 1
+PHASE, DPM Ligprep Clean Structures 1
+PHASE, DPM Ligprep Generate Conformers 5
+PHASE, MD Create sites 1
+PRIME, CM Build Membrane 2
+PRIME, CM Build Structure 2
+PRIME, CM Edit Alignment 2
+PRIME, CM Struct Align 18
+PRIME, Threading Search 2
+QSITE 5
+SITEMAP 10
+```
+
+#### Schrodinger Example Files
+Example SBATCH jobs and outputs are available to review here:
+```sh
+/scratch/work/public/examples/schrodinger/
+```
+
+### COMSOL
+COMSOL is a problem-solving simulation environment in which enforced compatibility guarantees consistent multiphysics models. COMSOL Multiphysics is a general-purpose software platform, based on advanced numerical methods, for modeling and simulating physics-based problems. The package is cross-platform (Windows, Mac, Linux). The COMSOL Desktop helps you organize your simulation by presenting a clear overview of your model at any point. It uses functional form, structure, and aesthetics as the means to achieve simplicity for modeling complex realities.
+
+:::note
+This license is for academic use only and is a Floating Network License, i.e., authorized users are allowed to use the software on their desktops. Please contact [hpc@nyu.edu](mailto:hpc@nyu.edu) for the license. However, COMSOL is also available on the NYU HPC cluster Greene.
+:::
+
+In order to check what COMSOL licenses are available on Greene, use the `comsol_licenses` command in your terminal session.
+
+Several versions of COMSOL are available on the HPC cluster. To use COMSOL on the Greene HPC cluster, please
+load the relevant module in your batch job submission script:
+```sh
+module load comsol/5.6.0.280
+```
+To submit a COMSOL job in a parallel fashion, running on multiple processing cores, follow the steps below:
+1. Create a directory on "scratch" as given below.
+```sh
+mkdir /scratch//example
+cd /scratch//example
+```
+2. Copy example files to your newly created directory
+```sh
+cp /scratch/work/public/examples/comsol/run-comsol.sbatch /scratch//example/
+cp /scratch/work/public/examples/comsol/test-input.mph /scratch//example/
+```
+3. Edit the slurm batch script file (run-comsol.sbatch) to match your case (for example, change the location of the run directory).
+4. Once the slurm batch script file is ready, it can be submitted to the job scheduler using sbatch. After successful completion of the job, check the output log file for detailed output information.
+```sh
+sbatch run-comsol.sbatch
+```
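+
+After submission, you can check on the job and, once it has finished, inspect the Slurm output file (the job id below is just a placeholder):
+```sh
+squeue -u $USER          # is the job pending or running?
+cat slurm-<jobid>.out    # inspect the output log after completion
+```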
+
+### MATHEMATICA
+Mathematica is a general computing environment with organizing algorithmic, visualization, and user interface capabilities. The many mathematical algorithms included in Mathematica make computation easy and fast.
+
+To run Mathematica on the Greene HPC cluster, please load the relevant module in your batch job submission script:
+```sh
+module load mathematica/12.1.1
+```
+:::note
+In the example below the module is loaded already in the sbatch script.
+:::
+
+To submit a batch Mathematica job running in parallel mode on multiple processing cores, follow the steps below:
+1. Create a directory on "scratch" as given below.
+```sh
+mkdir /scratch//example
+cd /scratch//example
+```
+2. Copy example files to your newly created directory.
+```sh
+cp /scratch/work/public/examples/mathematica/basic/example.m /scratch//example/
+cp /scratch/work/public/examples/mathematica/basic/run-mathematica.sbatch /scratch//example
+```
+3. Edit the slurm batch script file (run-mathematica.sbatch) to match your case (for example, change the location of the run directory).
+4. Once the sbatch script file is ready, it can be submitted to the job scheduler using sbatch. After successful completion of the job, verify the generated output log file.
+```sh
+sbatch run-mathematica.sbatch
+```
+
+### SAS
+SAS is a software package which enables programmers to perform many tasks, including:
+- Information retrieval
+- Data management
+- Report writing & graphics
+- Statistical analysis and data mining
+- Business planning
+- Forecasting and decision support
+- Operations research and project management
+- Quality improvement
+- Applications development
+- Data warehousing (extract, transform, load)
+- Platform independent and remote computing.
+
+There are licenses for 2 CPUs on the HPC Cluster.
+
+#### Running a parallel SAS job on HPC cluster (Greene):
+
+To submit a SAS job for running on multiple processing elements, follow the steps below:
+
+1. Create a directory on "scratch":
+```sh
+mkdir /scratch//example
+cd /scratch//example
+```
+2. Copy example files to your newly created directory.
+```sh
+cp /scratch/work/public/examples/sas/test.sas /scratch//example/
+cp /scratch/work/public/examples/sas/test2.sas /scratch//example/
+cp /scratch/work/public/examples/sas/run-sas.sbatch /scratch//example/
+```
+3. Submit as shown below. After successful completion of the job, verify the generated output log file.
+```sh
+sbatch run-sas.sbatch
+```
+
+### MATLAB
+[MATLAB](https://www.mathworks.com/products/matlab.html) is a technical computing environment for high performance numeric computation and visualization. MATLAB integrates numerical analysis, matrix computation, signal processing, and graphics in an easy to use environment without using traditional programming.
+
+#### MATLAB on personal computers and laptops
+
+NYU has a Total Academic Headcount (TAH) license which provides campus-wide access to MATLAB, Simulink, and a variety of add-on products. All faculty, researchers, and students (on any NYU campus) can use MATLAB on their personal computers and laptops and may go to the following site to download the NYU site license software free of charge.
+
+[https://www.mathworks.com/academia/tah-portal/new-york-university-618777.html](https://www.mathworks.com/academia/tah-portal/new-york-university-618777.html)
+
+MATLAB can be used for non-commercial, academic purposes.
+
+There are several versions of MATLAB available on the cluster; load the relevant version:
+```sh
+module load matlab/2020b
+module load matlab/2021a
+```
+In order to run MATLAB interactively on the cluster, [start an interactive slurm job](../05_submitting_jobs/01_slurm_submitting_jobs.md), load the matlab module and launch an interactive matlab session in the terminal.
+
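+As a minimal sketch (the resource numbers and MATLAB version are placeholders; adjust them to your needs), an interactive MATLAB session could look like this:
+```sh
+# Request an interactive session on a compute node (example resources)
+srun --cpus-per-task=4 --mem=8GB --time=2:00:00 --pty /bin/bash
+
+# On the compute node: load MATLAB and start it without the desktop GUI
+module load matlab/2020b
+matlab -nodisplay
+```
+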
+Mathworks has provided a [Greene Matlab User Guide](https://drive.google.com/file/d/1lNNzf4lsFuH9a4bbsO18roCGhT3DwUq2/view) that presents useful tips and practices for using Matlab on the cluster.
+
+### STATA
+Stata is a command and menu-driven software package for statistical analysis. It is available for Windows, Mac, and Linux operating systems. Most of its users work in research. Stata's capabilities include data management, statistical analysis, graphics, simulations, regression and custom programming.
+
+#### Running a parallel STATA job on HPC cluster (Greene):
+
+To submit a STATA job for running on multiple processing elements, follow the steps below.
+
+1. Create a directory on "scratch":
+```sh
+mkdir /scratch//example
+cd /scratch//example
+```
+2. Copy example files to your newly created directory.
+```sh
+cp /scratch/work/public/examples/stata/run-stata.sbatch /scratch//example/
+cp /scratch/work/public/examples/stata/stata-test.do /scratch//example/
+```
+3. Submit using sbatch. After successful completion of the job, verify the generated output log file.
+```sh
+sbatch run-stata.sbatch
+```
+
+### GAUSSIAN
+Gaussian provides electronic structure programs based on basic quantum mechanics. This software is capable of handling proteins and large molecules using semi-empirical, ab initio molecular orbital (MO), density functional, and molecular mechanics calculations.
+
+The NYU Gaussian license only covers PIs at the Washington Square Park campus. We will grant access to you after verifying your WSP affiliation. For access, please email [hpc@nyu.edu](mailto:hpc@nyu.edu).
+
+#### Running a parallel Gaussian job on HPC cluster (Greene):
+
+To submit a Gaussian job for running on multiple processing elements, follow the steps below.
+
+1. Create a directory on "scratch":
+```sh
+mkdir /scratch//example
+cd /scratch//example  # Copy example files to your newly created directory
+cp /scratch/work/public/examples/gaussian/basic/test435.com /scratch//example/
+cp /scratch/work/public/examples/gaussian/basic/run-gaussian.sbatch /scratch//example/
+```
+2. Once the sbatch script file is ready, it can be submitted to the job scheduler using sbatch. After successful completion of the job, verify the generated output log file.
+```sh
+sbatch run-gaussian.sbatch
+```
+
+### Knitro
+Knitro is a commercial software package for solving large scale mathematical optimization problems. Knitro is specialized for nonlinear optimization, but also solves linear programming problems, quadratic programming problems, systems of nonlinear equations, and problems with equilibrium constraints. The unknowns in these problems must be continuous variables in continuous functions; however, functions can be convex or nonconvex. Knitro computes a numerical solution to the problem—it does not find a symbolic mathematical solution. Knitro versions 9.0.1 and 10.1.1 are available.
+
+#### Running a parallel Knitro job on HPC cluster (Greene):
+
+To submit a Knitro job for running on multiple processing elements, follow the steps below.
+
+1. Create a directory on "scratch":
+```sh
+mkdir /scratch//example
+cd /scratch//example
+```
+2. Copy example files to your newly created directory.
+```sh
+cp /scratch/work/public/examples/knitro/knitro.py /scratch//example/
+```
+3. There is no sample sbatch script available for knitro.
+4. After creating your own sbatch script you can execute it as follows:
+```sh
+sbatch