Description
Hello,
Thanks for the extensive comparison between setups in relation to ADLS and ADB, really helpful. When setting up an environment using Pattern 4 - Cluster scoped Service principal, a bit of configuration is missing, which produces errors in multi-node processing with RDDs, e.g. when reading files directly from ADLS using sc.binaryFiles(). The worker nodes do not seem to be able to access the ADLS Gen2 token, causing errors on a multi-node setup through abfss://, whereas with a Spark DataFrame or a single-node (driver-only) cluster there are no issues.
It turns out that working with RDDs requires additional cluster configuration related to the Hadoop config, see https://www.data-engineering.wiki/docs/spark/accessing-adls-gen-2-with-rdd/
If this bit, or another method that achieves the same, could be added in relation to working with RDDs, that'd be great.
spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id <service-principal-application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret {{secrets/<your scope name>/<secret name>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<tenant id>/oauth2/token
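The config entries above can also be assembled programmatically, e.g. when provisioning clusters via the API. A minimal sketch, assuming hypothetical placeholder values for the service principal ID, secret scope/name, and tenant ID (the function name is illustrative, not part of any library):

```python
# Sketch: build the cluster-scoped spark.hadoop.* entries listed above as a
# dict, so they can be passed as the Spark conf of a cluster definition.
# All argument values below are hypothetical placeholders.

def build_adls_oauth_conf(client_id, secret_scope, secret_name, tenant_id):
    """Return the spark.hadoop.* settings that let executors (not just the
    driver) authenticate to ADLS Gen2, which RDD APIs require."""
    return {
        "spark.hadoop.fs.azure.account.auth.type": "OAuth",
        "spark.hadoop.fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "spark.hadoop.fs.azure.account.oauth2.client.id": client_id,
        # Databricks resolves this {{secrets/...}} reference at cluster start.
        "spark.hadoop.fs.azure.account.oauth2.client.secret":
            f"{{{{secrets/{secret_scope}/{secret_name}}}}}",
        "spark.hadoop.fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

conf = build_adls_oauth_conf("app-id", "my-scope", "sp-secret", "my-tenant")
# With these set as cluster-level Spark config, an RDD read such as
# sc.binaryFiles("abfss://<container>@<account>.dfs.core.windows.net/<path>")
# should then work on worker nodes as well as on the driver.
```

Setting these at the cluster level (rather than via spark.conf.set in a notebook) is what makes the token available to the executors.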