
Pattern 4 - ADLS Gen2 token authentication on multinode is missing Hadoop configuration for RDDs #9


Description

@dmuijen

Hello,

Thanks for the extensive comparison between setups in relation to ADLS and ADB, really helpful. In setting up an environment using Pattern 4 - Cluster scoped Service principal, there is a bit of configuration missing, which produces errors in multi-node processing with RDDs, e.g. when reading files directly from ADLS using sc.binaryFiles(). The worker nodes do not seem to be able to obtain the ADLS Gen2 token, causing errors on a multi-node cluster when reading through abfss://, whereas with the Spark DataFrame API, or on a single-node (driver only) cluster, there are no issues.
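
For illustration, a minimal sketch of the behaviour in a Databricks notebook (where `spark` and `sc` are predefined); the container, storage account, and path below are placeholders, not from my actual setup:

```python
# Placeholder ADLS Gen2 path; <container> and <storage-account> are hypothetical.
path = "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/"

# DataFrame API: reads fine on a multi-node cluster with the session-scoped
# OAuth settings for the service principal in place.
df = spark.read.format("binaryFile").load(path)
df.count()

# RDD API: fails on a multi-node cluster, because the executors cannot
# obtain the ADLS Gen2 token without the additional Hadoop configuration.
rdd = sc.binaryFiles(path)
rdd.count()
```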

It turns out that working with RDDs requires additional cluster-level Hadoop configuration, see https://www.data-engineering.wiki/docs/spark/accessing-adls-gen-2-with-rdd/

If this bit of configuration, or another method to achieve the same, could be added to the guidance on working with RDDs, that would be great:

spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id <service-principal-application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret {{secrets/<your scope name>/<secret name>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<tenant id>/oauth2/token
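
As a sketch of the usage once the fix is in place: assuming the spark.hadoop.* settings above have been added to the cluster's Spark config (with your own secret scope, secret name, and service principal values), the same RDD read then works on a multi-node cluster:

```python
# Assumes the cluster-scoped spark.hadoop.* settings above have been added to
# the cluster's Spark config; the abfss:// path components are placeholders.
path = "abfss://<container>@<storage-account>.dfs.core.windows.net/raw/"

# sc.binaryFiles returns (filename, bytes) pairs; with the cluster-level OAuth
# config the executors can now authenticate against ADLS Gen2.
rdd = sc.binaryFiles(path)
print(rdd.map(lambda kv: kv[0]).take(5))
```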
