Add docs on troubleshooting NFS repos (#97601) (#97812)

DaveCTurner · web-flow · commit 8aa461beb06a · 2023-07-19T09:24:09.000-04:00
Spell out a bit more clearly that ES works through the OS's filesystem
abstraction, giving advice about how to reproduce problems outside of
ES.
diff --git a/docs/reference/snapshot-restore/apis/verify-repo-api.asciidoc b/docs/reference/snapshot-restore/apis/verify-repo-api.asciidoc
@@ -4,7 +4,7 @@
 <titleabbrev>Verify snapshot repository</titleabbrev>
 ++++
 
-Verifies that a snapshot repository is functional. See
+Checks for common misconfigurations in a snapshot repository. See
 <<snapshots-repository-verification>>.
 
 ////
diff --git a/docs/reference/snapshot-restore/repository-shared-file-system.asciidoc b/docs/reference/snapshot-restore/repository-shared-file-system.asciidoc
@@ -3,20 +3,14 @@
 
 include::{es-repo-dir}/snapshot-restore/on-prem-repo-type.asciidoc[]
 
-Use a shared file system repository to store snapshots on a
-shared file system.
+Use a shared file system repository to store snapshots on a shared file system.
 
 To register a shared file system repository, first mount the file system to the
-same location on all master and data nodes. Then add the file system's
-path or parent directory to the `path.repo` setting in `elasticsearch.yml` for
-each master and data node. For running clusters, this requires a
+same location on all master and data nodes. Then add the file system's path or
+parent directory to the `path.repo` setting in `elasticsearch.yml` for each
+master and data node. For running clusters, this requires a
 <<restart-cluster-rolling,rolling restart>> of each node.
 
-IMPORTANT: By default, a network file system (NFS) uses user IDs (UIDs) and
-group IDs (GIDs) to match accounts across nodes. If your shared file system is
-an NFS and your nodes don't use the same UIDs and GIDs, update your NFS
-configuration to account for this.
-
 Supported `path.repo` values vary by platform:
 
 include::{es-repo-dir}/tab-widgets/register-fs-repo-widget.asciidoc[]
@@ -47,3 +41,46 @@ Maximum number of snapshots the repository can contain.
 Defaults to `Integer.MAX_VALUE`, which is `2^31-1` or `2147483647`.
 
 include::repository-shared-settings.asciidoc[]
+
+==== Troubleshooting a shared file system repository
+
+{es} interacts with a shared file system repository using the file system
+abstraction in your operating system. This means that every {es} node must be
+able to perform operations within the repository path, such as creating,
+opening, and renaming files, and creating and listing directories. In
+addition, operations performed by one node must be visible to other nodes as
+soon as they complete.
+
+Check for common misconfigurations using the <<verify-snapshot-repo-api>> API
+and the <<repo-analysis-api>> API. When the repository is properly configured,
+these APIs will complete successfully. If the verify repository or repository
+analysis APIs report a problem then you will be able to reproduce this problem
+outside {es} by performing similar operations on the file system directly.
+
+If the verify repository or repository analysis APIs fail with an error
+indicating insufficient permissions then adjust the configuration of the
+repository within your operating system to give {es} an appropriate level of
+access. To reproduce such problems directly, perform the same operations as
+{es} in the same security context as the one in which {es} is running. For
+example, on Linux, use a command such as `su` to switch to the user that {es}
+runs as.
+
+If the verify repository or repository analysis APIs fail with an error
+indicating that operations on one node are not immediately visible on another
+node then adjust the configuration of the repository within your operating
+system to address this problem. If your repository cannot be configured with
+strong enough visibility guarantees then it is not suitable for use as an {es}
+snapshot repository.
+
+The verify repository and repository analysis APIs will also fail if the
+operating system returns any other kind of I/O error when accessing the
+repository. If this happens, address the cause of the I/O error reported by the
+operating system.
+
+TIP: Many NFS implementations match accounts across nodes using their _numeric_
+user IDs (UIDs) and group IDs (GIDs) rather than their names. It is possible
+for {es} to run under an account with the same name (often `elasticsearch`) on
+each node, but for these accounts to have different numeric user or group IDs.
+If your shared file system uses NFS then ensure that every node is running with
+the same numeric UID and GID, or else update your NFS configuration to account
+for the variance in numeric IDs across nodes.
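
The troubleshooting section added above advises reproducing problems outside ES by performing similar operations on the file system directly. A minimal sketch of such a check follows. It is not the procedure ES itself uses, only an approximation of the operations the docs name (creating, opening, and renaming files, and creating and listing directories). The function name `check_repo_path` and the probe file names are hypothetical, and the script should be run as the user that ES runs as.

```python
import os
import shutil
import uuid


def check_repo_path(repo_path):
    """Try the basic file operations named in the docs inside repo_path.

    Returns a dict describing the account in use and the outcome of each
    step. Raises OSError if the operating system rejects any operation.
    """
    result = {}
    # Numeric IDs matter on NFS; os.getuid/os.getgid exist only on POSIX.
    if hasattr(os, "getuid"):
        result["uid"] = os.getuid()
        result["gid"] = os.getgid()

    # Work in a uniquely named subdirectory so snapshot data is untouched.
    # The "fs-check-" prefix is a hypothetical name chosen for this sketch.
    test_dir = os.path.join(repo_path, "fs-check-" + uuid.uuid4().hex)
    os.mkdir(test_dir)  # create a directory
    try:
        src = os.path.join(test_dir, "probe.tmp")
        dst = os.path.join(test_dir, "probe.dat")
        with open(src, "wb") as f:  # create and write a file
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())
        os.rename(src, dst)  # rename the file
        with open(dst, "rb") as f:  # reopen the file and read it back
            result["read_back_ok"] = f.read() == b"probe"
        result["listing"] = sorted(os.listdir(test_dir))  # list the directory
        os.remove(dst)
    finally:
        # Remove the probe directory even if an earlier step failed.
        shutil.rmtree(test_dir, ignore_errors=True)
    return result
```

Calling `check_repo_path` with the repository's mount point on each node surfaces permission errors as `OSError`, and comparing the reported `uid` and `gid` across nodes shows the numeric ID mismatch described in the TIP. This sketch runs on a single node, so it does not test whether operations on one node are immediately visible on another. To check that, create a file on one node and list the directory from a different node.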