Commit 82da759

Merge pull request #302039 from Padmalathas/StorageOptions
Adding Storage Options for HPC
2 parents c196e30 + 5ec7482 commit 82da759

2 files changed

+122
-3
lines changed

articles/high-performance-computing/TOC.yml

Lines changed: 5 additions & 3 deletions
@@ -62,9 +62,11 @@
62 62
- name: HPC Performance and Benchmarking
63 63
expanded: true
64 64
items:
65-
- name: Performance and Benchmarking Overview
65+
- name: Performance and benchmarking overview
66 66
href: ./performance-benchmarking/overview.md
67-
- name: HPC Performance and Benchmarking Applications
67+
- name: HPC workload best practices and storage solutions
68+
href: ./performance-benchmarking/hpc-storage-options.md
69+
- name: HPC performance and benchmarking applications
68 70
href: ./performance-benchmarking/high-performance-computing-performance-benchmarking-applications.md
69-
- name: Performance Optimization for HPC and AI VMs
71+
- name: Performance optimization for HPC and AI VMs
70 72
href: ./performance-benchmarking/optimize-performance.md
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
1+
---
2+
title: "High-Performance Computing (HPC) workload best practices and storage options"
3+
description: A comprehensive guide to choosing a storage solution best suited to your HPC workloads.
4+
author: christinechen2
5+
ms.author: padmalathas
6+
ms.reviewer: normesta
7+
ms.date: 06/25/2025
8+
ms.service: azure-virtual-machines
9+
ms.subservice: hpc
10+
ms.topic: concept-article
11+
# Customer intent: "As a cloud architect or HPC administrator, I want to evaluate and select the most suitable Azure HPC storage solution based on performance, scalability, protocol support, and workload alignment for AI, HPC, and data-intensive applications."
12+
---
13+
14+
# High-performance computing (HPC) workload best practices and storage options guide
15+
16+
<!-- [!INCLUDE[appliesto-sqlvm](../../includes/appliesto-sqlvm.md)] -->
17+
18+
This guide provides best practices, guidelines, a detailed comparison, and technical specifications to help you choose the storage solution best suited to your HPC workload on Azure VMs. It includes performance metrics, protocol support, cost tiers, and use case alignment for each storage type. There's typically a trade-off between optimizing for cost and optimizing for performance. If your workload is less demanding, you might not require every recommended optimization. Consider your performance needs, costs, and workload patterns as you evaluate these recommendations.
19+
20+
## Overview
21+
22+
Storage for HPC workloads consists of core storage and, in some cases, an accelerator. Core storage acts as the permanent home for your data. It contains rich data management features and is durable, available, scalable, elastic, and secure. An accelerator enhances core storage by providing high-performance data access. It can be provisioned on demand and gives your computational workload much faster access to data.
23+
24+
## Storage Services Comparison
25+
26+
| Feature | Standard Blob | Premium Blob | Premium Files | Azure NetApp Files | Azure Managed Lustre |
27+
|----------------|---------------|--------------|----------------|---------------------|-----------------------|
28+
| **Capacity** | 20+ PiB | 20+ PiB | 100 TiB | 500 TiB | 1 PiB |
29+
| **Bandwidth** | 15 GB/s | 15 GB/s | 10 GB/s | 10 GiB/s | Up to 512 GB/s |
30+
| **IOPS** | 20,000 | 20,000 | 100,000 | 800,000 | >100,000 |
31+
| **Latency** | <100 ms | <10 ms | 2–4 ms | <1 ms | <2 ms |
32+
| **Protocols** | REST, HDFS, NFSv3, SFTP, FUSE, CSI | Same | REST, NFSv4.1, SMB3, CSI | NFSv3/4.1, SMB3, CSI | Lustre, CSI |
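
As a rough illustration of how the comparison table can be used to shortlist services, the following Python sketch encodes the IOPS and latency rows and filters on them. The names `SERVICES` and `shortlist` are hypothetical, and treating the table's "<" and ">" figures as inclusive bounds is an assumption made for this example. Check current service limits before relying on the numbers.

```python
# Figures copied from the IOPS and Latency rows of the comparison table.
# Assumption: "<N ms" is stored as an upper bound N, ">100,000" as a lower bound.
SERVICES = {
    "Standard Blob": {"iops": 20_000, "latency_ms": 100},
    "Premium Blob": {"iops": 20_000, "latency_ms": 10},
    "Premium Files": {"iops": 100_000, "latency_ms": 4},
    "Azure NetApp Files": {"iops": 800_000, "latency_ms": 1},
    "Azure Managed Lustre": {"iops": 100_000, "latency_ms": 2},
}


def shortlist(min_iops, max_latency_ms):
    """Return the services whose table figures meet both requirements."""
    return [
        name
        for name, spec in SERVICES.items()
        if spec["iops"] >= min_iops and spec["latency_ms"] <= max_latency_ms
    ]


if __name__ == "__main__":
    # Example: a workload that needs at least 50,000 IOPS at 5 ms or better.
    print(shortlist(min_iops=50_000, max_latency_ms=5))
```

A filter like this only narrows the field. Capacity, bandwidth, protocol support, and cost still need to be weighed separately.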
33+
34+
## Initial consideration
35+
36+
If you are starting from scratch, see [Understand data store models](/azure/architecture/guide/technology-choices/data-store-overview) to choose a data store and [Choose an Azure storage service](/azure/architecture/guide/technology-choices/storage-options) or [Introduction to Azure Storage](/azure/storage/common/storage-introduction) to get an idea of your storage service options.
37+
38+
## At a glance
39+
40+
Start with the amount of data that you plan to store. Then, consider the number of CPU cores used by your workload and the size of your files. These factors help you to narrow down which core storage service best suits your workload and whether to use an accelerator to enhance performance.
41+
42+
|Data capacity |CPU cores |File size |Core storage recommendation |Accelerator recommendation |
43+
|---------|---------|---------|---------|---------|
44+
|Under 50 TiB |N/A |N/A | [Azure Files](/azure/storage/files/) or [Azure NetApp Files](/azure/azure-netapp-files/). |No accelerator |
45+
|50 TiB - 5,000 TiB |Less than 500 |N/A|[Azure Files](/azure/storage/files/) or [Azure NetApp Files](/azure/azure-netapp-files/). |No accelerator |
46+
|50 TiB - 5,000 TiB |Over 500 |1 MiB and larger| [Azure Standard Blob](/azure/storage/blobs/). It’s supported by all accelerators, supports many protocols, and is cost-effective. | [Azure Managed Lustre](/azure/azure-managed-lustre/). |
47+
|50 TiB - 5,000 TiB |Over 500 |Smaller than 1 MiB| [Azure Premium Blob](/azure/storage/blobs/storage-blob-block-blob-premium) or [Azure Standard Blob](/azure/storage/blobs/). | [Azure Managed Lustre](/azure/azure-managed-lustre/). |
48+
|50 TiB - 5,000 TiB |Over 500 |Smaller than 512 KiB| [Azure NetApp Files](/azure/azure-netapp-files/). |No accelerator |
49+
|Over 5,000 TiB |N/A |N/A| |Talk to your field or account team. |
50+
<!---| |[Use ZRS disks when sharing disks between VMs](#use-zrs-disks-when-sharing-disks-between-vms). |Prevents a shared disk from becoming a single point of failure. | --->
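
The at-a-glance table can also be read as a small decision procedure. The following Python sketch is one possible encoding of it, not an official tool. The function name `recommend_storage` and its parameters are hypothetical. Two boundary choices are assumptions, because the table doesn't specify them: a workload with exactly 500 cores is treated as "over 500", and files smaller than 512 KiB take the Azure NetApp Files row rather than the "smaller than 1 MiB" row.

```python
def recommend_storage(capacity_tib, cpu_cores=None, file_size_kib=None):
    """Return (core_storage, accelerator) following the at-a-glance table.

    cpu_cores and file_size_kib are only needed when the table uses them.
    """
    files_or_anf = "Azure Files or Azure NetApp Files"

    if capacity_tib < 50:
        return (files_or_anf, "No accelerator")
    if capacity_tib > 5_000:
        # The table lists no core storage recommendation for this range.
        return (None, "Talk to your field or account team")

    if cpu_cores is None:
        raise ValueError("cpu_cores is required between 50 and 5,000 TiB")
    if cpu_cores < 500:
        return (files_or_anf, "No accelerator")

    if file_size_kib is None:
        raise ValueError("file_size_kib is required for 500 or more cores")
    if file_size_kib < 512:  # assumption: this row wins over "smaller than 1 MiB"
        return ("Azure NetApp Files", "No accelerator")
    if file_size_kib < 1024:
        return ("Azure Premium Blob or Azure Standard Blob", "Azure Managed Lustre")
    return ("Azure Standard Blob", "Azure Managed Lustre")


if __name__ == "__main__":
    # Example: 800 TiB of data, 2,000 cores, files of about 4 MiB.
    print(recommend_storage(800, cpu_cores=2_000, file_size_kib=4_096))
```

Raising an error for missing inputs, instead of guessing a default, keeps the sketch honest about which rows of the table need more information.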
51+
52+
---
53+
54+
## Solution details
55+
56+
If you're still undecided after reviewing the at-a-glance table, here are more details for each solution:
57+
58+
|Solution |Optimal Performance & Scale |Data Access (Access Protocol) |Billing Model |Core Storage or Accelerator |
59+
|---|---|---|---|---|
60+
| [**Azure Standard Blob**](/azure/storage/blobs/) | * Good for large-file, bandwidth-intensive workloads.<br> * Designed for unstructured data. <br> * Supports high-throughput workloads. | * Good for traditional (file) and cloud-native (REST) HPC apps. <br>* Easy to access, share, and manage datasets.<br> * Works with all accelerators. | Pay for what you use. | Core Storage. |
61+
| [**Azure Premium Blob**](/azure/storage/blobs/storage-blob-block-blob-premium) | * IOPS and latency better than Standard Blob. <br> * Good for datasets with many medium-sized files and mixed file sizes. | * Good for traditional (file) and cloud-native (REST) HPC apps. <br> * Easy to access, share, and manage datasets. <br> * Works with all accelerators.| Pay for what you use. | Core Storage. |
62+
| [**Azure Premium Files**](/azure/storage/files/) | * Capacity and bandwidth suited for smaller scale (<1k cores). <br> * IOPS and latency good for medium-sized files (>512 KiB). <br> * Offers premium (low latency, high IOPS) SKUs. <br> * Hybrid access via Azure File Sync. | Easy integration with Linux (NFS) and Windows (SMB), but you can't use both NFS and SMB to access the same data. | Pay for what you provision. | Core Storage. |
63+
| [**Azure NetApp Files**](/azure/azure-netapp-files/) | * Capacity and bandwidth good for midrange jobs (1k-10k cores). <br> * IOPS and latency good for small-file datasets (<512 KiB). <br> * Excellent for small, many-file workloads. <br> * Enterprise-grade file storage with ONTAP technology. <br> * Dynamic performance scaling across Standard, Premium, and Ultra tiers. | Easy to integrate with Linux and Windows. Supports multiprotocol access for workflows that use both. | Pay for what you provision. | Either. |
64+
| [**Azure Managed Lustre**](/azure/azure-managed-lustre/) | * Bandwidth to support all job sizes (1k - >10k cores). <br> * IOPS and latency good for thousands of medium-sized files (>512 KiB). <br> * Best for bandwidth-intensive read and write workloads. <br> * Parallel file system optimized for HPC/AI.<br> * Seamless integration with Azure Blob for tiered storage. | Lustre, CSI. | Pay for what you provision. | Durable enough to run as standalone (core) storage, but most cost-effective as an accelerator. |
65+
66+
---
67+
68+
## Specialized Storage Solutions
69+
Azure offers a range of storage services tailored to meet the demanding needs of HPC workloads. Each solution is optimized for different performance characteristics, access patterns, and cost profiles. The following table gives an overview of the most relevant storage options and what they're best suited for in HPC scenarios.
70+
71+
| Storage Solution | Use Cases | Performance Benchmarks | Scalability Options | Integration with Other Azure Services |
72+
|------|------|-----|-----|-----|
73+
| Azure Blob Storage | * Data Analytics <br> * Content Distribution <br> * Backup and Archival | Throughput up to 30 GB/s with BlobFuse2 | * Storage Accounts up to 5 PiB per account <br> * Unlimited number of containers per account | * Azure AI <br>* AKS <br> * Azure Data Lake |
74+
||||||
75+
| Azure Files | * DevOps <br> * Backups <br> * Remote Work | Encryption in Transit (TLS 1.3 for NFS shares) | * File Shares up to 100 TiB per share (Standard) <br> * IOPS up to 100,000 (Premium) | * Azure Backup <br> * Azure Monitor <br> * Microsoft Entra ID |
76+
||||||
77+
| Azure NetApp Files | * Databases <br> * VDI <br> * HPC | IOPS and Throughput measured using FIO | * Capacity Pools up to 100 TiB per pool <br> * Volumes up to 100 TiB per volume | * AKS <br> * Azure Backup <br> * Azure Monitor |
78+
||||||
79+
| Azure Managed Lustre | * Large-scale simulations <br> * Genomics <br> * Scientific Workloads | Throughput up to 30 GB/s with the 250 MB/s/TiB performance tier | * File Systems up to 1.5 PB capacity<br> * Throughput up to 375 GB/s | * Azure Blob Storage <br> * AKS <br> * Azure Monitor |
80+
||||||
81+
82+
---
83+
84+
## AI and RAG Workload Storage Requirements
85+
86+
The storage requirements for AI and retrieval-augmented generation (RAG) workloads vary by stage. Training requires high throughput, checkpointing, local caching, and the ability to load large models. Inference requires fast model access, low latency, and concurrent GPU access. RAG requires secure unstructured storage, vector database integration, data freshness, and low latency.
87+
88+
---
89+
90+
## Partner Solutions
91+
92+
| Partner | Protocols | Scale | Unique Features |
93+
|-------------------|---------------------|---------------|------------------------------------------------------|
94+
| Qumulo | NFS, SMB, S3 | 200 PiB | Azure-native SaaS, global namespace, cost-effective |
95+
| Dell APEX | NFS, SMB, S3, HDFS | 5.6 PiB | On-prem parity, policy-based tiering |
96+
| Nasuni | NFS, SMB, S3 || File locking, blob as primary tier |
97+
| Hammerspace | NFS, SMB, S3, pNFS || Global namespace, caching alternative |
98+
| Weka | NFS, SMB, S3 | 14 EB | High IOPS, low latency, linear scale-out |
99+
| IBM Spectrum Scale | GPFS, NFS, SMB || Full GPFS stack |
100+
| DDN EXAScaler | Lustre, NFS, SMB | Petabytes | Full DDN Lustre stack |
101+
102+
---
103+
104+
## Performance Optimization Tips
105+
- Size volumes based on performance, not just capacity.
106+
- Use Availability Zones to control latency.
107+
- Use the large volumes feature in Azure NetApp Files for maximum bandwidth.
108+
- Consider caching and tiering strategies for cost efficiency.
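
To make the first tip concrete, here's a minimal Python sketch of sizing by performance for a service whose throughput scales with provisioned capacity, such as the 250 MB/s/TiB Azure Managed Lustre tier mentioned earlier. The function name `capacity_to_provision_tib` and its parameters are hypothetical. The sketch assumes throughput scales linearly with capacity and ignores provisioning increments and service maximums.

```python
import math


def capacity_to_provision_tib(data_tib, target_mb_per_s, mb_per_s_per_tib):
    """Return the capacity (TiB) that satisfies both data size and throughput.

    Assumes throughput scales linearly with provisioned capacity.
    """
    needed_for_data = math.ceil(data_tib)
    needed_for_throughput = math.ceil(target_mb_per_s / mb_per_s_per_tib)
    # Provision whichever requirement is larger.
    return max(needed_for_data, needed_for_throughput)


if __name__ == "__main__":
    # Example: 40 TiB of data that must be read at 30 GB/s (30,000 MB/s)
    # on a 250 MB/s/TiB tier. Throughput, not data size, drives the result.
    print(capacity_to_provision_tib(40, 30_000, 250))
```

In this example the dataset would fit in 40 TiB, but reaching the throughput target requires provisioning considerably more, which is why capacity alone is a poor sizing guide.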
109+
110+
## Core storage price comparison
111+
112+
In order of most to least expensive, the core storage option prices are:
113+
- Azure NetApp Files
114+
- Azure Premium Blob and Azure Premium Files
115+
- Azure Standard Blob
116+
117+
For more information about pricing, see [Azure product pricing](https://azure.microsoft.com/pricing/#product-pricing).
