Commit bc5c289

Merge pull request #104897 from jswoodward/ekpgh-hpc-add-troubleshooting
hpc-cache: add NAS troubleshooting article
2 parents be59beb + 8944a46 commit bc5c289

4 files changed

+155
-3
lines changed

articles/hpc-cache/hpc-cache-prereqs.md

Lines changed: 11 additions & 3 deletions
@@ -4,7 +4,7 @@ description: Prerequisites for using Azure HPC Cache
44
author: ekpgh
55
ms.service: hpc-cache
66
ms.topic: conceptual
7-
ms.date: 02/12/2020
7+
ms.date: 02/20/2020
88
ms.author: rohogue
99
---
1010

@@ -90,7 +90,9 @@ If using an NFS storage system (for example, an on-premises hardware NAS system)
9090
> [!NOTE]
9191
> Storage target creation will fail if the cache has insufficient access to the NFS storage system.
9292
93-
* **Network connectivity:** The Azure HPC Cache needs high-bandwidth network access between the cache subnet and the NFS system's data center. [ExpressRoute](https://docs.microsoft.com/azure/expressroute/) or similar access is recommended. If using a VPN, you might need to configure it to clamp TCP MSS at 1350 to make sure large packets are not blocked.
93+
More information is included in [Troubleshoot NAS configuration and NFS storage target issues](troubleshoot-nas.md).
94+
95+
* **Network connectivity:** The Azure HPC Cache needs high-bandwidth network access between the cache subnet and the NFS system's data center. [ExpressRoute](https://docs.microsoft.com/azure/expressroute/) or similar access is recommended. If using a VPN, you might need to configure it to clamp TCP MSS at 1350 to make sure large packets are not blocked. Read [VPN packet size restrictions](troubleshoot-nas.md#adjust-vpn-packet-size-restrictions) for additional help troubleshooting VPN settings.
9496

9597
* **Port access:** The cache needs access to specific TCP/UDP ports on your storage system. Different types of storage have different port requirements.
9698

@@ -104,6 +106,8 @@ If using an NFS storage system (for example, an on-premises hardware NAS system)
104106
rpcinfo -p <storage_IP> |egrep "100000\s+4\s+tcp|100005\s+3\s+tcp|100003\s+3\s+tcp|100024\s+1\s+tcp|100021\s+4\s+tcp"| awk '{print $4 "/" $3 " " $5}'|column -t
105107
```
106108

109+
Make sure that all of the ports returned by the ``rpcinfo`` query allow unrestricted traffic from the Azure HPC Cache's subnet.
110+
107111
* In addition to the ports returned by the `rpcinfo` command, make sure that these commonly used ports allow inbound and outbound traffic:
108112
109113
| Protocol | Port | Service |
@@ -116,17 +120,21 @@ If using an NFS storage system (for example, an on-premises hardware NAS system)
116120
117121
* Check firewall settings to be sure that they allow traffic on all of these required ports. Be sure to check firewalls used in Azure as well as on-premises firewalls in your data center.
118122
119-
* **Directory access:** Enable the `showmount` command on the storage system. Azure HPC Cache uses this command to check that your storage target configuration points to a valid export, and also to make sure that multiple mounts don't access the same subdirectories (which risks file collisions).
123+
* **Directory access:** Enable the `showmount` command on the storage system. Azure HPC Cache uses this command to check that your storage target configuration points to a valid export, and also to make sure that multiple mounts don't access the same subdirectories (a risk for file collision).
120124

121125
> [!NOTE]
122126
> If your NFS storage system uses NetApp's ONTAP 9.2 operating system, **do not enable `showmount`**. [Contact Microsoft Service and Support](hpc-cache-support-ticket.md) for help.
123127
128+
Learn more about directory listing access in the NFS storage target [troubleshooting article](troubleshoot-nas.md#enable-export-listing).
129+
124130
* **Root access:** The cache connects to the back-end system as user ID 0. Check these settings on your storage system:
125131
126132
* Enable `no_root_squash`. This option ensures that the remote root user can access files owned by root.
127133
128134
* Check export policies to make sure they do not include restrictions on root access from the cache's subnet.
129135

136+
* If your storage has any exports that are subdirectories of another export, make sure the cache has root access to the lowest segment of the path. Read [Root access on directory paths](troubleshoot-nas.md#allow-root-access-on-directory-paths) in the NFS storage target troubleshooting article for details.
137+
130138
* NFS back-end storage must be a compatible hardware/software platform. Contact the Azure HPC Cache team for details.
131139

132140
## Next steps

articles/hpc-cache/index.yml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,11 @@ landingContent:
5959
url: hpc-cache-edit-storage.md
6060
- text: Work around firewall settings to create Blob storage targets
6161
url: hpc-cache-blob-firewall-fix.md
62+
- text: Troubleshoot NFS storage target creation
63+
url: troubleshoot-nas.md
6264
- linkListType: concept
6365
links:
6466
- text: Recover from a regional outage
6567
url: hpc-region-recovery.md
68+
- text: Use Azure NetApp Files storage targets
69+
url: hpc-cache-netapp.md

articles/hpc-cache/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,8 @@
4646
href: hpc-cache-support-ticket.md
4747
- name: Work around Blob storage account firewall settings
4848
href: hpc-cache-blob-firewall-fix.md
49+
- name: Troubleshoot NFS storage target creation
50+
href: troubleshoot-nas.md
4951
- name: Recover from a regional outage
5052
href: hpc-region-recovery.md
5153
- name: Use Azure NetApp Files with Azure HPC Cache
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
---
2+
title: Troubleshoot Azure HPC Cache NFS storage targets
3+
description: Tips to avoid and fix configuration errors and other problems that can cause failure when creating an NFS storage target
4+
author: ekpgh
5+
ms.service: hpc-cache
6+
ms.topic: conceptual
7+
ms.date: 02/20/2020
8+
ms.author: rohogue
9+
---
10+
11+
# Troubleshoot NAS configuration and NFS storage target issues
12+
13+
This article gives solutions for some common configuration errors and other issues that could prevent Azure HPC Cache from adding an NFS storage system as a storage target.
14+
15+
This article includes details about how to check ports and how to enable root access to a NAS system. It also includes detailed information about less common issues that might cause NFS storage target creation to fail.
16+
17+
> [!TIP]
18+
> Before using this guide, read [prerequisites for NFS storage targets](hpc-cache-prereqs.md#nfs-storage-requirements).
19+
20+
If the solution to your problem is not included here, please [open a support ticket](hpc-cache-support-ticket.md) so that Microsoft Service and Support can work with you to investigate and solve the problem.
21+
22+
## Check port settings
23+
24+
Azure HPC Cache needs read/write access to several UDP/TCP ports on the back-end NAS storage system. Make sure these ports are accessible on the NAS system and also that traffic is permitted to these ports through any firewalls between the storage system and the cache subnet. You might need to work with firewall and network administrators for your data center to verify this configuration.
25+
26+
The ports are different for storage systems from different vendors, so check your system's requirements when setting up a storage target.
27+
28+
In general, the cache needs access to these ports:
29+
30+
| Protocol | Port | Service |
31+
|----------|-------|----------|
32+
| TCP/UDP | 111 | rpcbind |
33+
| TCP/UDP | 2049 | NFS |
34+
| TCP/UDP | 4045 | nlockmgr |
35+
| TCP/UDP | 4046 | mountd |
36+
| TCP/UDP | 4047 | status |
37+
38+
To learn the specific ports needed for your system, use the following ``rpcinfo`` command. It lists the ports and formats the relevant results in a table. (Use your storage system's IP address in place of the *<storage_IP>* term.)
39+
40+
You can issue this command from any Linux client that has NFS infrastructure installed. If you use a client inside the cluster subnet, it also can help verify connectivity between the subnet and the storage system.
41+
42+
```bash
43+
rpcinfo -p <storage_IP> |egrep "100000\s+4\s+tcp|100005\s+3\s+tcp|100003\s+3\s+tcp|100024\s+1\s+tcp|100021\s+4\s+tcp"| awk '{print $4 "/" $3 " " $5}'|column -t
44+
```
45+
46+
Make sure that all of the ports returned by the ``rpcinfo`` query allow unrestricted traffic from the Azure HPC Cache's subnet.
47+
48+
Check these settings both on the NAS itself and on any firewalls between the storage system and the cache subnet.
49+
50+
## Check root access
51+
52+
Azure HPC Cache needs access to your storage system's exports to create the storage target. Specifically, it mounts the exports as user ID 0.
53+
54+
Different storage systems use different methods to enable this access:
55+
56+
* Linux servers generally add ``no_root_squash`` to the exported path in ``/etc/exports``.
57+
* NetApp and EMC systems typically control access with export rules that are tied to specific IP addresses or networks.
58+
59+
If using export rules, remember that the cache can use multiple different IP addresses from the cache subnet. Allow access from the full range of possible subnet IP addresses.
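
For a Linux NFS server, one way to spot exports that still squash root is to scan ``/etc/exports`` for client entries that lack ``no_root_squash``. The sketch below is illustrative only: the function name is hypothetical, it assumes the common one-export-per-line layout (no backslash line continuations or default-option entries), and it does not apply to NetApp or EMC export rules.

```bash
# Hypothetical helper: reads /etc/exports-style text on stdin and prints
# "<export path> <client entry>" for every client entry that does not
# include no_root_squash. Assumes one export per line.
list_root_squashed_exports() {
  awk '
    /^[[:space:]]*(#|$)/ { next }   # skip comments and blank lines
    {
      for (i = 2; i <= NF; i++)
        if ($i !~ /no_root_squash/) print $1, $i
    }
  '
}

# Usage on the NFS server:
#   list_root_squashed_exports < /etc/exports
```

Any line it prints is a candidate to review; remember that the client entry must also cover the full range of cache subnet addresses.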
60+
61+
Work with your NAS storage vendor to enable the right level of access for the cache.
62+
63+
### Allow root access on directory paths
64+
<!-- linked in prereqs article -->
65+
66+
For NAS systems that export hierarchical directories, Azure HPC Cache needs root access to each export level.
67+
68+
For example, a system might show three exports like these:
69+
70+
* ``/ifs``
71+
* ``/ifs/accounting``
72+
* ``/ifs/accounting/payroll``
73+
74+
The export ``/ifs/accounting/payroll`` is a child of ``/ifs/accounting``, and ``/ifs/accounting`` is itself a child of ``/ifs``.
75+
76+
If you add the ``payroll`` export as an Azure HPC Cache storage target, the cache actually mounts ``/ifs`` and accesses the payroll directory from there. So Azure HPC Cache needs root access to ``/ifs`` in order to access the ``/ifs/accounting/payroll`` export.
77+
78+
This requirement is related to the way the cache indexes files and avoids file collisions, using file handles that the storage system provides.
79+
80+
A NAS system with hierarchical exports can give different file handles for the same file if the file is retrieved from different exports. For example, a client could mount ``/ifs/accounting`` and access the file ``payroll/2011.txt``. Another client mounts ``/ifs/accounting/payroll`` and accesses the file ``2011.txt``. Depending on how the storage system assigns file handles, these two clients might receive the same file with different file handles (one for ``<mount2>/payroll/2011.txt`` and one for ``<mount3>/2011.txt``).
81+
82+
The back-end storage system keeps internal aliases for file handles, but Azure HPC Cache cannot tell which file handles in its index reference the same item. As a result, the cache might hold different cached writes for the same file and apply the changes incorrectly, because it does not know that the handles refer to the same file.
83+
84+
To avoid this possible file collision for files in multiple exports, Azure HPC Cache automatically mounts the shallowest available export in the path (``/ifs`` in the example) and uses the file handle given from that export. If multiple exports use the same base path, Azure HPC Cache needs root access to that path.
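
To work out which export needs root access for a given storage target, you can apply the same "shallowest export in the path" rule by hand. The sketch below only illustrates that rule; it is not the cache's actual implementation, and the function name is hypothetical. It reads a list of export paths on standard input and prints the shortest one that contains the target path.

```bash
# Hypothetical helper: prints the shallowest export (from stdin, one per
# line) that is equal to, or a parent of, the target path given as $1.
# Prints an empty line if no export contains the target.
shallowest_export() {
  local target=$1 candidate best=""
  while IFS= read -r candidate; do
    [ -n "$candidate" ] || continue
    # Compare with trailing slashes so /ifs does not match /ifsother.
    case "$target/" in
      "${candidate%/}/"*)
        if [ -z "$best" ] || [ "${#candidate}" -lt "${#best}" ]; then
          best=$candidate
        fi
        ;;
    esac
  done
  printf '%s\n' "$best"
}

# Usage with the example exports from this section:
#   printf '%s\n' /ifs /ifs/accounting /ifs/accounting/payroll |
#     shallowest_export /ifs/accounting/payroll
```

With the three example exports, the result is ``/ifs``, which is the path where the cache needs root access.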
85+
86+
## Enable export listing
87+
<!-- link in prereqs article -->
88+
89+
The NAS must list its exports when the Azure HPC Cache queries it.
90+
91+
On most NFS storage systems, you can test this by sending the following query from a Linux client: ``showmount -e <storage_IP>``
92+
93+
Use a Linux client from the same virtual network as your cache, if possible.
94+
95+
If that command doesn't list the exports, the cache will have trouble connecting to your storage system. Work with your NAS vendor to enable export listing.
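
If the export list is long, a small wrapper can confirm that a specific export appears in it. This is a sketch under the assumption that your system prints the usual Linux ``showmount -e`` layout (one header line, then one export per line with the path in the first column); the function name is hypothetical.

```bash
# Hypothetical helper: reads `showmount -e` output on stdin and succeeds
# (exit status 0) only if the export path given as $1 is listed.
export_is_listed() {
  awk -v path="$1" '
    NR > 1 && $1 == path { found = 1 }   # NR > 1 skips the header line
    END { exit !found }
  '
}

# Usage (substitute your storage system address and export path):
#   showmount -e "$STORAGE_IP" | export_is_listed /ifs/accounting &&
#     echo "export is listed"
```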
96+
97+
## Adjust VPN packet size restrictions
98+
<!-- link in prereqs article -->
99+
100+
If you have a VPN between the cache and your NAS device, the VPN might block full-sized 1500-byte Ethernet packets. You might have this problem if large exchanges between the NAS and the Azure HPC Cache instance do not complete, but smaller updates work as expected.
101+
102+
There isn't a simple way to tell whether or not your system has this problem unless you know the details of your VPN configuration. Here are a few methods that can help you check for this issue.
103+
104+
* Use packet sniffers on both sides of the VPN to detect which packets transfer successfully.
105+
* If your VPN allows ping commands, you can test sending a full-sized packet.
106+
107+
Run a ping command over the VPN to the NAS with these options. (Use your storage system's IP address in place of the *<storage_IP>* value.)
108+
109+
```bash
110+
ping -M do -s 1472 -c 1 <storage_IP>
111+
```
112+
113+
These are the options in the command:
114+
115+
* ``-M do`` - Do not fragment
116+
* ``-c 1`` - Send only one packet
117+
* ``-s 1472`` - Set the size of the payload to 1472 bytes. This is the maximum size payload for a 1500-byte packet after accounting for the 20-byte IPv4 header and the 8-byte ICMP header.
118+
119+
A successful response looks like this:
120+
121+
```output
122+
PING 10.54.54.11 (10.54.54.11) 1472(1500) bytes of data.
123+
1480 bytes from 10.54.54.11: icmp_seq=1 ttl=64 time=2.06 ms
124+
```
125+
126+
If the ping fails with 1472 bytes, you might need to configure MSS clamping on the VPN to make the remote system properly detect the maximum frame size. Read the [VPN Gateway IPsec/IKE parameters documentation](../vpn-gateway/vpn-gateway-about-vpn-devices.md#ipsec) to learn more.
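
The sizes involved follow from fixed header lengths: an IPv4 header is 20 bytes, an ICMP echo header is 8 bytes, and a TCP header without options is 20 bytes. The sketch below shows that arithmetic and a simple probe loop built on the same ``ping`` options. The function names and the list of candidate MTU values are assumptions chosen for illustration, and the probe requires the Linux (iputils) ``ping``.

```bash
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}

# Largest TCP segment (MSS) for a given MTU:
# MTU minus 20 bytes (IPv4 header) minus 20 bytes (TCP header, no options).
mss_for_mtu() {
  echo $(( $1 - 20 - 20 ))
}

# Hypothetical probe: tries a few candidate MTU values (an arbitrary list)
# and prints the first one that gets through with "do not fragment" set.
probe_mtu() {
  local host=$1 mtu
  for mtu in 1500 1450 1400 1350 1300; do
    if ping -M do -s "$(icmp_payload_for_mtu "$mtu")" -c 1 -W 2 "$host" \
        >/dev/null 2>&1; then
      echo "$mtu"
      return 0
    fi
  done
  return 1
}

# Usage (substitute your storage system address):
#   probe_mtu "$STORAGE_IP"
```

For a 1500-byte MTU the helpers give a 1472-byte ping payload and a 1460-byte MSS. If only a smaller MTU passes, an MSS clamp at or below ``mss_for_mtu`` of that value is a reasonable starting point to discuss with your VPN administrator.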
127+
128+
## Check for ACL security style
129+
130+
Some NAS systems use a hybrid security style that combines access control lists (ACLs) with traditional POSIX or UNIX security.
131+
132+
If your system reports its security style as UNIX or POSIX without including the acronym "ACL", this issue does not affect you.
133+
134+
For systems that use ACLs, Azure HPC Cache needs to track additional user-specific values in order to control file access. This is done by enabling an access cache. There isn't a user-facing control to turn on the access cache, but you can open a support ticket to request that it be enabled for the affected storage targets on your cache system.
135+
136+
## Next steps
137+
138+
If you have a problem that was not addressed in this article, [open a support ticket](hpc-cache-support-ticket.md) to get expert help.
