[DOC] docs for distcp with HNS in self-managed Hadoop are sparse #1375

@cjac

Description

Customers running distcp from a self-managed Hadoop cluster to a Google Cloud Storage (GCS) bucket with Hierarchical Namespace (HNS) enabled have hit intermittent failures, typically surfacing as DEADLINE_EXCEEDED errors or as a generic SSH operator error: exit status = 25. The likely cause is that the documentation does not clearly state which GCS connector settings these scenarios require.

Impact:

Without clear guidance, customers face difficulty troubleshooting and resolving these distcp failures, leading to inefficient data transfers and increased support engagement.

Proposed Documentation Changes:

  • gcs/CONFIGURATION.md:

    • Clarified guidance on fs.gs.http.read-timeout and fs.gs.hierarchical.namespace.folders.enable: Added specific instructions to set fs.gs.hierarchical.namespace.folders.enable to true for HNS-enabled buckets and raised the recommended fs.gs.http.read-timeout to mitigate DEADLINE_EXCEEDED errors. Added an explicit warning that incorrect configuration can lead to DEADLINE_EXCEEDED or exit status = 25 errors.
    • Added troubleshooting for generic exit codes: Advises examining the detailed Hadoop and GCS connector logs when a job fails with a generic exit code such as exit status = 25.
    • Included recommendations for shaded JARs: Suggests using shaded GCS connector JARs (gcs-connector-hadoop3-*-shaded.jar) to resolve dependency conflicts, which can manifest as NoSuchMethodError or ClassNotFoundException.
  • gcs/INSTALL.md:

    • Expanded "Troubleshooting the installation": Added more detailed advice on diagnosing dependency conflicts and on enabling verbose logging for the GCS connector, noting that verbose logging is particularly useful for DEADLINE_EXCEEDED errors and general distcp troubleshooting.
  • gcs/README.md:

    • Updated "Configuring the connector": Now prominently directs users who hit distcp and HNS issues, including DEADLINE_EXCEEDED errors, to CONFIGURATION.md for detailed tuning and troubleshooting guidance.
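
To make the CONFIGURATION.md change concrete, here is a minimal sketch of how the two properties named above could be rendered, either as a core-site.xml fragment or as -D overrides for a single distcp run. The property names come from this issue. The timeout value is a placeholder assumption, not a recommended setting, and its accepted format may differ between connector versions. The helper names render_core_site and distcp_overrides are hypothetical.

```python
import xml.etree.ElementTree as ET

# Property names are taken from this issue. The timeout VALUE is a placeholder
# assumption; check CONFIGURATION.md for the format your connector version accepts.
HNS_DISTCP_PROPERTIES = {
    "fs.gs.hierarchical.namespace.folders.enable": "true",
    "fs.gs.http.read-timeout": "60000",  # hypothetical value, not a recommendation
}


def render_core_site(properties):
    """Return a core-site.xml style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


def distcp_overrides(properties):
    """Return the same settings as -D generic options for one distcp invocation."""
    return [f"-D{name}={value}" for name, value in properties.items()]
```

Printing render_core_site(HNS_DISTCP_PROPERTIES) yields a fragment that could be merged into an existing core-site.xml, while " ".join(distcp_overrides(HNS_DISTCP_PROPERTIES)) yields flags to place between hadoop distcp and the source and destination paths. The -D form is convenient for testing a setting on one job before changing cluster-wide configuration.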

These updates aim to provide clearer instructions and troubleshooting steps for distcp operations with HNS-enabled buckets in non-Dataproc Hadoop environments, thereby reducing the need for support engagement for these common problems.
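
The troubleshooting advice above (read the detailed logs when only a generic exit code is reported) could be supported by a small triage script. The sketch below counts the error signatures mentioned in this issue in a collected log file and maps each to the corresponding documentation hint. The function names, the hint wording, and the idea of a single combined log file are assumptions for illustration, not part of the connector.

```python
from collections import Counter

# Signatures are the strings named in this issue; the hints paraphrase the
# proposed documentation changes and are illustrative only.
SIGNATURES = {
    "DEADLINE_EXCEEDED": "review fs.gs.http.read-timeout and the HNS folders setting",
    "NoSuchMethodError": "possible dependency conflict; consider the shaded connector JAR",
    "ClassNotFoundException": "possible dependency conflict; consider the shaded connector JAR",
    "exit status = 25": "generic exit code; inspect detailed Hadoop and GCS connector logs",
}


def triage(lines):
    """Count how many lines contain each known signature."""
    counts = Counter()
    for line in lines:
        for signature in SIGNATURES:
            if signature in line:
                counts[signature] += 1
    return counts


def triage_file(path):
    """Run triage over a log file and return (signature, count, hint) tuples."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        counts = triage(handle)
    return [(sig, n, SIGNATURES[sig]) for sig, n in counts.most_common()]
```

Calling triage_file on a log gathered from the failing job lists the most frequent signature first, which gives a starting point for deciding whether to look at timeouts or at classpath conflicts.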

Reference:

Self link: go/ghgcd/hadoop-connectors/issues/1375

Addresses customer support issue:
go/sf/55915396 (case)
go/sf/56459963 (consult)

Issue discussion: b/389061732
