33 changes: 31 additions & 2 deletions gcs/CONFIGURATION.md
@@ -140,6 +140,32 @@
fs.gs.storage.http.headers.another-custom-header=another_custom_value
```

* `fs.gs.hierarchical.namespace.folders.enable` (default: `false`)

Whether to create objects for the parent directories of objects with `/` in
their path, e.g. creating `gs://bucket/foo/` upon deleting or renaming
`gs://bucket/foo/bar`. When using `distcp` with Hierarchical Namespace
(HNS)-enabled buckets, ensure this property is set to `true` in your
`core-site.xml` for proper interaction with the hierarchical structure of
the bucket. Conversely, if your bucket does not have HNS enabled, this
property should be `false`. Incorrect configuration can lead to
`DEADLINE_EXCEEDED` errors or generic SSH operator failures
(e.g., `exit status = 25`).
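
  For example, enabling the property for an HNS-enabled bucket might look
  like the following `core-site.xml` fragment (a sketch; only the property
  name comes from this document, the surrounding layout is standard Hadoop
  configuration):

  ```xml
  <configuration>
    <!-- Required when the target bucket has Hierarchical Namespace enabled;
         leave unset (false) for non-HNS buckets. -->
    <property>
      <name>fs.gs.hierarchical.namespace.folders.enable</name>
      <value>true</value>
    </property>
  </configuration>
  ```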

* **Dependency Conflicts and Shaded JARs:** In non-Dataproc Hadoop environments,
you might encounter dependency conflicts (e.g.,
`java.lang.NoSuchMethodError` or `java.lang.ClassNotFoundException`). To
resolve these, consider using a shaded version of the `gcs-connector` JAR
(e.g., `gcs-connector-hadoop3-*-shaded.jar`), which bundles its dependencies
to avoid conflicts.
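
  A sketch of passing the shaded JAR to a single `distcp` job (the JAR path
  and bucket name are illustrative; `distcp` accepts the standard Hadoop
  generic options, including `-libjars`):

  ```shell
  # Use the shaded artifact you actually downloaded; the path is an example.
  hadoop distcp \
      -libjars /opt/lib/gcs-connector-hadoop3-shaded.jar \
      hdfs:///data/src gs://example-bucket/dest
  ```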

* **Troubleshooting Generic Exit Codes:** Generic exit codes (like `exit status
= 25`) from `distcp` or other Hadoop operations often indicate an underlying
issue that requires examining more detailed Hadoop and GCS connector logs.
Increasing the logging verbosity for the GCS connector can provide crucial
diagnostic information. Refer to `gcs/INSTALL.md` for guidance on enabling
verbose logging.

### Encryption ([CSEK](https://cloud.google.com/storage/docs/encryption/customer-supplied-keys))

* `fs.gs.encryption.algorithm` (not set by default)
@@ -401,10 +427,13 @@ Knobs configure the vectoredRead API

Timeout to establish a connection. Use `0` for an infinite timeout.

* `fs.gs.http.read-timeout` (default: `5s`)

Timeout to read from an established connection. Use `0` for an infinite
timeout. For `distcp` operations, especially with Hierarchical Namespace (HNS)
enabled buckets or large transfers, increasing this timeout can help
mitigate intermittent failures caused by network instability and prevent
`DEADLINE_EXCEEDED` errors.
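
  As an illustration, the timeout can be raised in `core-site.xml` (the
  `30s` value is an arbitrary example, not a recommendation from this
  document):

  ```xml
  <property>
    <name>fs.gs.http.read-timeout</name>
    <value>30s</value>
  </property>
  ```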

### API client configuration

28 changes: 28 additions & 0 deletions gcs/INSTALL.md
@@ -137,6 +137,34 @@ the installation.
gs://<some-bucket>`), and that the credentials in your configuration are
correct.

* **Dependency Conflicts:** In non-Dataproc Hadoop environments, unexpected
errors such as `java.lang.NoSuchMethodError` or
`java.lang.ClassNotFoundException` can occur due to version conflicts
with libraries already present in your Hadoop classpath. To mitigate
these, consider using a shaded version of the GCS connector JAR
(e.g., `gcs-connector-hadoop3-*-shaded.jar`). These JARs package all of
the connector's dependencies, reducing the chance of conflicts.

* **Enabling Verbose Logging:** For more detailed diagnostics of issues
beyond basic installation, such as `DEADLINE_EXCEEDED` errors, enable
verbose logging for the GCS connector. This can provide crucial
information for troubleshooting `distcp` failures or other unexpected
behavior. Add the following to your `hadoop-env.sh` file:

```bash
export HADOOP_CLIENT_OPTS="-Djava.util.logging.config.file=/tmp/gcs-connector-logging.properties"
```

Then, create a file named `/tmp/gcs-connector-logging.properties` with the
following content:

  ```properties
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = ALL
com.google.level = FINE
sun.net.www.protocol.http.HttpURLConnection.level = ALL
```
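
  With both pieces in place, subsequent Hadoop commands emit the connector's
  `FINE`-level logs to the console; a simple way to capture them for
  inspection (the paths and bucket name are examples, not from this
  document):

  ```shell
  hadoop fs -ls gs://example-bucket/ 2> /tmp/gcs-debug.log
  ```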

* To troubleshoot other issues, run the `hadoop fs` command with debug logs:

```
5 changes: 3 additions & 2 deletions gcs/README.md
@@ -45,10 +45,11 @@ installed automatically.
When you set up a Hadoop cluster by following the directions in `INSTALL.md`,
the cluster is automatically configured for optimal use with the connector.
Typically, there is no need for further configuration.

To customize the connector, specify configuration values in `core-site.xml` in
the Hadoop configuration directory on the machine on which the connector is
installed. For `distcp`-specific tuning and troubleshooting in self-managed
Hadoop environments with HNS, especially when encountering
`DEADLINE_EXCEEDED` errors, refer to `CONFIGURATION.md`.

For a complete list of configuration keys and their default values see
[CONFIGURATION.md](/gcs/CONFIGURATION.md).