Commit b9a4166

Merge pull request #16092 from pauljewellmsft/amlfs-troubleshoot
[AMLFS] Create troubleshooting article for cluster deployment failures
2 parents 10975e6 + 0076ccc commit b9a4166

2 files changed: +82, -0 lines changed

azure-managed-lustre/TOC.yml

Lines changed: 4 additions & 0 deletions
@@ -54,3 +54,7 @@
 items:
 - name: Recover from a regional outage
 href: amlfs-region-recovery.md
+- name: Troubleshooting
+items:
+- name: Troubleshoot cluster deployment failures
+href: troubleshoot-deployment.md
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: Troubleshoot Azure Managed Lustre cluster deployment issues
description: Learn how to troubleshoot common cluster deployment issues in Azure Managed Lustre
author: pauljewellmsft
ms.author: pauljewell
ms.service: azure-managed-lustre
ms.topic: troubleshooting-general
ms.date: 11/01/2024

---

# Troubleshoot Azure Managed Lustre deployment issues

In this article, you learn how to troubleshoot common issues that you might encounter when deploying an Azure Managed Lustre file system.

## Cluster deployment fails due to incorrect network configuration

In this section, we cover the following causes:

- [Cause 1: Network ports are blocked](#cause-1-network-ports-are-blocked)
- [Cause 2: Resources within the subnet are incompatible](#cause-2-resources-within-the-subnet-are-incompatible)
- [Cause 3: Network security group rules aren't configured correctly](#cause-3-network-security-group-rules-arent-configured-correctly)

### Cause 1: Network ports are blocked

Port 988 and port 22 must be open within the subnet for the cluster to communicate with the Azure Managed Lustre service. If either port is blocked, the deployment fails.

### Solution: Verify the network configuration

Allow inbound and outbound access between hosts within the Azure Managed Lustre subnet. For example, access to TCP port 22 (SSH) is necessary for cluster deployment.

Your network security group (NSG) must allow inbound and outbound access on port 988 and ports 1019-1023. No other services can reserve or use these ports on your Lustre clients. If you use the `ypbind` daemon on your clients to maintain Network Information Service (NIS) binding information, you must ensure that `ypbind` doesn't reserve port 988.

Make sure that the virtual network, subnet, and NSG meet the requirements for Azure Managed Lustre. To learn more, see [Network prerequisites](amlfs-prerequisites.md#network-prerequisites).

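One way to check whether another service on a Lustre client already holds port 988 or ports 1019-1023 is to try binding each port locally. The following Python code is a minimal sketch, not an official diagnostic: `port_in_use` and `find_conflicts` are hypothetical helper names, the check covers TCP only, and binding ports below 1024 typically requires root privileges.

```python
import errno
import socket

# Ports named in this article that must stay free on Lustre clients.
LUSTRE_CLIENT_PORTS = [988, *range(1019, 1024)]


def port_in_use(port, host="0.0.0.0"):
    """Return True if another process already holds the TCP port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            # Any other failure (for example, PermissionError when the
            # script isn't run as root) is passed on to the caller.
            raise
        return False


def find_conflicts(ports, host="0.0.0.0"):
    """Return the subset of ports that are already taken."""
    return [port for port in ports if port_in_use(port, host)]
```

Run as root on a client before mounting the file system, `find_conflicts(LUSTRE_CLIENT_PORTS)` returns the ports that some other process has reserved. An empty list suggests that no TCP listener conflicts with the required ports at that moment.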
### Cause 2: Resources within the subnet are incompatible

Azure Managed Lustre and Azure NetApp Files resources can't share a subnet. The deployment fails if you try to create an Azure Managed Lustre file system in a subnet that currently contains, or has previously contained, Azure NetApp Files resources.

### Solution: Verify the subnet configuration

If you use the Azure NetApp Files service, you must create your Azure Managed Lustre file system in a separate subnet. To learn more, see [Network prerequisites](amlfs-prerequisites.md#network-prerequisites).

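A subnet that is delegated to Azure NetApp Files is a strong hint that it can't host an Azure Managed Lustre file system. The following Python code is a rough sketch that inspects a subnet description for the `Microsoft.NetApp/volumes` delegation. It assumes the subnet is available as JSON with a `delegations` list whose entries carry a `serviceName` field, as in the output of `az network vnet subnet show`; the sample values are made up. This check can't detect a subnet that previously contained Azure NetApp Files resources.

```python
import json

# Delegation that Azure NetApp Files places on its subnet.
NETAPP_DELEGATION = "Microsoft.NetApp/volumes"


def has_netapp_delegation(subnet):
    """Return True if the subnet JSON lists an Azure NetApp Files delegation."""
    delegations = subnet.get("delegations") or []
    return any(
        (entry.get("serviceName") or "").lower() == NETAPP_DELEGATION.lower()
        for entry in delegations
    )


# Hypothetical sample; the field layout is an assumption.
sample_subnet = json.loads("""
{
  "name": "storage-subnet",
  "addressPrefix": "10.0.1.0/24",
  "delegations": [
    {"name": "netapp", "serviceName": "Microsoft.NetApp/volumes"}
  ]
}
""")

if has_netapp_delegation(sample_subnet):
    print("Choose a different subnet for Azure Managed Lustre.")
```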
### Cause 3: Network security group rules aren't configured correctly

If you're using a network security group to filter network traffic between Azure resources in an Azure virtual network, the security rules that allow or deny inbound and outbound network traffic must be properly configured. If the network security group rules aren't correctly configured for Azure Managed Lustre file system support, the deployment fails.

### Solution: Verify the network security group configuration

For detailed guidance about configuring inbound and outbound security rules to support Azure Managed Lustre file systems, see [Configure network security group rules](configure-network-security-group.md#configure-network-security-group-rules).

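NSG rules are processed in priority order, lowest number first, and the first matching rule decides the outcome. As a result, a broad deny rule can override an allow rule that has a higher priority number. The following Python code is a simplified sketch of that behavior for the ports named in this article. The `Rule` class is a hypothetical stand-in that ignores direction, protocol, and address prefixes, and treating unmatched traffic as denied is an assumption of the sketch.

```python
from dataclasses import dataclass

# Ports named in this article: SSH, Lustre, and the 1019-1023 range.
REQUIRED_PORTS = [22, 988, *range(1019, 1024)]


@dataclass(frozen=True)
class Rule:
    """Simplified, hypothetical model of one NSG security rule."""
    priority: int    # lower number is evaluated first
    access: str      # "Allow" or "Deny"
    port_range: str  # "988", "1019-1023", or "*"


def matches(port_range, port):
    """Return True if the port falls inside the rule's port range."""
    if port_range == "*":
        return True
    low, _, high = port_range.partition("-")
    return int(low) <= port <= int(high or low)


def is_allowed(rules, port):
    """Evaluate rules by ascending priority; the first match wins."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule.port_range, port):
            return rule.access == "Allow"
    return False  # assumption: nothing matched, so treat as denied


def blocked_ports(rules):
    """Return the required ports that the rule set doesn't allow."""
    return [port for port in REQUIRED_PORTS if not is_allowed(rules, port)]


# Made-up rule set: the deny at priority 100 shadows the allow at 200.
example_rules = [
    Rule(priority=200, access="Allow", port_range="988"),
    Rule(priority=100, access="Deny", port_range="900-1000"),
    Rule(priority=300, access="Allow", port_range="*"),
]
print(blocked_ports(example_rules))
```

In this example, port 988 is reported as blocked even though an explicit allow rule exists, because the deny rule has the lower priority number.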
## Cluster deployment fails due to incorrect blob container configuration

In this section, we cover the following causes:

- [Cause 1: Blob container allows public access](#cause-1-blob-container-allows-public-access)
- [Cause 2: Blob container can't be accessed by the file system](#cause-2-blob-container-cant-be-accessed-by-the-file-system)

### Cause 1: Blob container allows public access

To comply with security requirements, the blob container anonymous access level must be set to private. If the blob container is set to public, the deployment fails.

### Solution: Set the blob container access level to private

Configure the blob container to allow private access only. You can disallow public access at the storage account level, or you can configure access at the container level. To learn more, see [About anonymous read access](/azure/storage/blobs/anonymous-read-access-configure#about-anonymous-read-access).

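The two levels described above can be combined into one check: the container is effectively private if the storage account disallows anonymous access, or if the container's own access level is unset. The following Python code is a minimal sketch of that logic. It assumes the account and container are available as JSON with an `allowBlobPublicAccess` field on the account and a `properties.publicAccess` field on the container; treat both field names and the sample values as assumptions to confirm against your own tooling output.

```python
def container_is_private(account, container):
    """Return True if anonymous access is off at the account or container level."""
    # An account-level setting of False disables anonymous access
    # for every container in the account.
    if account.get("allowBlobPublicAccess") is False:
        return True
    # Assumption: a missing or null access level means private.
    level = (container.get("properties") or {}).get("publicAccess")
    return level is None


# Hypothetical samples.
locked_account = {"name": "examplestorage", "allowBlobPublicAccess": False}
open_account = {"name": "examplestorage", "allowBlobPublicAccess": True}
public_container = {"name": "lustre-data", "properties": {"publicAccess": "container"}}
private_container = {"name": "lustre-data", "properties": {"publicAccess": None}}

if not container_is_private(open_account, public_container):
    print("Set the container access level to private before deploying.")
```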
### Cause 2: Blob container can't be accessed by the file system

If the file system can't access the blob container, the deployment fails. You must add role assignments at the storage account scope or higher to allow the file system to access the container.

### Solution: Authorize access to the storage account

To authorize access to the storage account, add the following role assignments to the service principal **HPC Cache Resource Provider**:

- [Storage Account Contributor](/azure/role-based-access-control/built-in-roles#storage-account-contributor)
- [Storage Blob Data Contributor](/azure/role-based-access-control/built-in-roles#storage-blob-data-contributor)

To learn more, see [Access roles for blob integration](amlfs-prerequisites.md#access-roles-for-blob-integration).
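
To confirm that both roles are granted at the storage account scope or higher, you can compare the service principal's role assignments against the required set. The following Python code is a sketch under stated assumptions: the assignments are already filtered to the **HPC Cache Resource Provider** service principal, and each entry carries `roleDefinitionName` and `scope` fields, as in the output of `az role assignment list`. The resource IDs are placeholders.

```python
REQUIRED_ROLES = {
    "Storage Account Contributor",
    "Storage Blob Data Contributor",
}


def scope_covers(scope, resource_id):
    """Return True if scope is the resource itself or one of its ancestors."""
    scope = scope.rstrip("/").lower()
    resource_id = resource_id.lower()
    return resource_id == scope or resource_id.startswith(scope + "/")


def missing_roles(assignments, storage_account_id):
    """Return required roles that aren't granted at the account scope or higher."""
    granted = {
        entry.get("roleDefinitionName")
        for entry in assignments
        if scope_covers(entry.get("scope") or "", storage_account_id)
    }
    return REQUIRED_ROLES - granted


# Placeholder IDs and assignments for illustration only.
resource_group_id = "/subscriptions/0000/resourceGroups/example-rg"
account_id = resource_group_id + "/providers/Microsoft.Storage/storageAccounts/examplestorage"
container_id = account_id + "/blobServices/default/containers/lustre-data"

example_assignments = [
    {"roleDefinitionName": "Storage Account Contributor", "scope": resource_group_id},
    # Granted only on the container, which is below the storage account scope.
    {"roleDefinitionName": "Storage Blob Data Contributor", "scope": container_id},
]
print(missing_roles(example_assignments, account_id))
```

Here the sketch reports Storage Blob Data Contributor as missing, because that assignment is scoped to a single container rather than to the storage account or a parent scope.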
