Commit ee32626

feat: Enhance GCS connector docs for distcp with HNS
This PR improves the Cloud Storage connector documentation to better support users performing `distcp` operations with Hierarchical Namespace (HNS) enabled buckets in self-managed Hadoop environments. This change directly addresses customer issues observed in Salesforce case [500Kf00000Y8MmwIAF] and Buganizer report [389061732], where users experienced intermittent `distcp` failures, often manifesting as `DEADLINE_EXCEEDED` errors or generic `SSH operator error: exit status = 25`.

Key changes include:

- **`gcs/CONFIGURATION.md`**:
  - Clarified guidance on `fs.gs.http.read-timeout` and `fs.gs.hierarchical.namespace.folders.enable` to address `DEADLINE_EXCEEDED` errors and ensure proper HNS interaction.
  - Added troubleshooting tips for generic exit codes and recommendations for using shaded JARs to resolve dependency conflicts.
- **`gcs/INSTALL.md`**:
  - Expanded the "Troubleshooting the installation" section with more detailed advice on diagnosing dependency conflicts and enabling verbose logging, specifically highlighting its utility for `DEADLINE_EXCEEDED` errors.
- **`gcs/README.md`**:
  - Updated the "Configuring the connector" section to prominently guide users facing `distcp` and HNS issues, including `DEADLINE_EXCEEDED` errors, to the more detailed `CONFIGURATION.md`.

These updates aim to provide clearer instructions and troubleshooting steps, reducing the need for support engagement for these common problems in non-Dataproc Hadoop deployments.

Related CL: ...
Addresses support issue: go/sf/55915396
Fixes: b/389061732
1 parent 7f5f6bb commit ee32626

3 files changed: +62 -4 lines changed

gcs/CONFIGURATION.md

Lines changed: 31 additions & 2 deletions
@@ -140,6 +140,32 @@
 fs.gs.storage.http.headers.another-custom-header=another_custom_value
 ```
 
+* `fs.gs.hierarchical.namespace.folders.enable` (default: `false`)
+
+Whether to create objects for the parent directories of objects with `/` in
+their path e.g. creating `gs://bucket/foo/` upon deleting or renaming
+`gs://bucket/foo/bar`. When using `distcp` with Hierarchical Namespace (HNS)
+enabled buckets, ensure this property is set to `true` in your
+`core-site.xml` for proper interaction with the hierarchical structure of
+the bucket. Conversely, if your bucket does not have HNS enabled, this
+property should be `false`. Incorrect configuration can lead to
+`DEADLINE_EXCEEDED` errors or generic SSH operator failures
+(e.g., `exit status = 25`).
+
+* **Dependency Conflicts and Shaded JARs:** In non-Dataproc Hadoop environments,
+you might encounter dependency conflicts (e.g.,
+`java.lang.NoSuchMethodError` or `java.lang.ClassNotFoundException`). To
+resolve these, consider using a shaded version of the `gcs-connector` JAR
+(e.g., `gcs-connector-hadoop3-*-shaded.jar`), which bundles its dependencies
+to avoid conflicts.
+
+* **Troubleshooting Generic Exit Codes:** Generic exit codes (like `exit status
+= 25`) from `distcp` or other Hadoop operations often indicate an underlying
+issue that requires examining more detailed Hadoop and GCS connector logs.
+Increasing the logging verbosity for the GCS connector can provide crucial
+diagnostic information. Refer to `gcs/INSTALL.md` for guidance on enabling
+verbose logging.
+
 ### Encryption ([CSEK](https://cloud.google.com/storage/docs/encryption/customer-supplied-keys))
 
 * `fs.gs.encryption.algorithm` (not set by default)
@@ -401,10 +427,13 @@ Knobs configure the vectoredRead API
 
 Timeout to establish a connection. Use `0` for an infinite timeout.
 
-* `fs.gs.http.read-timeout` (default: `5s`)
+* `fs.gs.http.read-timeout` (default: `5s`)
 
 Timeout to read from an established connection. Use `0` for an infinite
-timeout.
+timeout. For `distcp` operations, especially with Hierarchical Namespace (HNS)
+enabled buckets or large transfers, increasing this timeout can help
+mitigate intermittent failures caused by network instability and prevent
+`DEADLINE_EXCEEDED` errors.
 
 ### API client configuration
 
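As a rough illustration of the two `CONFIGURATION.md` properties touched by this change, a `core-site.xml` fragment for a `distcp` job against an HNS-enabled bucket might look like the sketch below. The property names come from the diff; the `60s` value is only an assumed example (the diff states the default is `5s` and does not recommend a specific value), and the duration suffix format is assumed to match the one shown for the default.

```xml
<configuration>
  <!-- Set to true only when the target bucket has Hierarchical Namespace enabled;
       per the diff, leave it false for non-HNS buckets. -->
  <property>
    <name>fs.gs.hierarchical.namespace.folders.enable</name>
    <value>true</value>
  </property>

  <!-- Read timeout on an established connection. Default is 5s;
       60s here is an illustrative assumption, not a value from the source. -->
  <property>
    <name>fs.gs.http.read-timeout</name>
    <value>60s</value>
  </property>
</configuration>
```

Raising the timeout gradually and checking whether the `DEADLINE_EXCEEDED` errors subside is likely safer than setting `0` (infinite), since an infinite timeout can hide a stalled transfer.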

gcs/INSTALL.md

Lines changed: 28 additions & 0 deletions
@@ -137,6 +137,34 @@ the installation.
 gs://<some-bucket>`), and that the credentials in your configuration are
 correct.
 
+* **Dependency Conflicts:** In non-Dataproc Hadoop environments, unexpected
+errors such as `java.lang.NoSuchMethodError` or
+`java.lang.ClassNotFoundException` can occur due to version conflicts
+with libraries already present in your Hadoop classpath. To mitigate
+these, consider using a shaded version of the GCS connector JAR
+(e.g., `gcs-connector-hadoop3-*-shaded.jar`). These JARs package all of
+the connector's dependencies, reducing the chance of conflicts.
+
+* **Enabling Verbose Logging:** For more detailed diagnostics of issues
+beyond basic installation, such as `DEADLINE_EXCEEDED` errors, enable
+verbose logging for the GCS connector. This can provide crucial
+information for troubleshooting `distcp` failures or other unexpected
+behavior. Add the following to your `hadoop-env.sh` file:
+
+```bash
+export HADOOP_CLIENT_OPTS="-Djava.util.logging.config.file=/tmp/gcs-connector-logging.properties"
+```
+
+Then, create a file named `/tmp/gcs-connector-logging.properties` with the
+following content:
+
+```
+handlers = java.util.logging.ConsoleHandler
+java.util.logging.ConsoleHandler.level = ALL
+com.google.level = FINE
+sun.net.www.protocol.http.HttpURLConnection.level = ALL
+```
+
 * To troubleshoot other issues run `hadoop fs` command with debug logs:
 
 ```

gcs/README.md

Lines changed: 3 additions & 2 deletions
@@ -45,10 +45,11 @@ installed automatically.
 When you set up a Hadoop cluster by following the directions in `INSTALL.md`,
 the cluster is automatically configured for optimal use with the connector.
 Typically, there is no need for further configuration.
-
 To customize the connector, specify configuration values in `core-site.xml` in
 the Hadoop configuration directory on the machine on which the connector is
-installed.
+installed. For `distcp`-specific tuning and troubleshooting in self-managed
+Hadoop environments with HNS, especially when encountering
+`DEADLINE_EXCEEDED` errors, refer to `CONFIGURATION.md`.
 
 For a complete list of configuration keys and their default values see
 [CONFIGURATION.md](/gcs/CONFIGURATION.md).
