How to access data in S3 from a Flyte spark task running locally? #3229
xshen8888
started this conversation in
Deployment Tips & Tricks
When executing a Flyte workflow locally on a developer's machine, a Spark task that reads data from AWS S3 needs extra configuration that is not currently documented by the Flyte community.
The Spark task code looks like this:

```python
spark = flytekit.current_context().spark_session
spark_df = spark.read.parquet("s3a://bucket/key_to_parquet_data")
```
The solution is to:

a) Add the following properties to the Flyte task's `spark_conf` section (only needed when you run the Spark task locally), e.g.:

```python
@task(
    task_config=Spark(
        spark_conf={
            "spark.jars.packages": "org.apache.hadoop:hadoop-aws:?.?.?",
            "spark.hadoop.fs.s3a.access.key": "xxx",
            "spark.hadoop.fs.s3a.secret.key": "yyy",
            "spark.hadoop.fs.s3a.session.token": "zzz",
        }
    )
)
```

Note: the hadoop-aws version must match the Hadoop version your PySpark installation expects.
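One way to find the Hadoop version your PySpark install expects is to look at the `hadoop-client-*` jar names under pyspark's `jars/` directory. A minimal sketch, assuming a pip-installed PySpark (the jar filename below is a hypothetical example of what you would find there):

```python
import re

# In a real venv, list the actual jars like this (commented out here):
#   import glob, os, pyspark
#   jars = glob.glob(os.path.join(os.path.dirname(pyspark.__file__),
#                                 "jars", "hadoop-client-api-*.jar"))
jar = "hadoop-client-api-3.3.2.jar"  # hypothetical filename from pyspark/jars/

# Extract the version and build the matching hadoop-aws Maven coordinate.
match = re.match(r"hadoop-client-api-(\d+(?:\.\d+)*)\.jar", jar)
hadoop_version = match.group(1)
coordinate = f"org.apache.hadoop:hadoop-aws:{hadoop_version}"
print(coordinate)  # org.apache.hadoop:hadoop-aws:3.3.2
```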
b) Alternatively, to avoid adding the above Spark properties to every Spark task, create a `conf` folder in your Flyte venv's pyspark installation and add a `spark-defaults.conf` file there with the following content, e.g.:

```
spark.jars.packages org.apache.hadoop:hadoop-aws:3.3.2
spark.hadoop.fs.s3a.access.key xxx
spark.hadoop.fs.s3a.secret.key yyy
spark.hadoop.fs.s3a.session.token zzz
```
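The `conf` directory typically does not exist in a pip-installed PySpark, so it has to be created first. A sketch of writing the file programmatically, using a temp directory as a stand-in for the real pyspark install path:

```python
import os
import tempfile

# Stand-in for the pyspark package directory; in a real venv derive it with:
#   import pyspark; pyspark_home = os.path.dirname(pyspark.__file__)
pyspark_home = tempfile.mkdtemp()

# Create the conf/ folder and write spark-defaults.conf into it.
conf_dir = os.path.join(pyspark_home, "conf")
os.makedirs(conf_dir, exist_ok=True)

defaults = "\n".join([
    "spark.jars.packages org.apache.hadoop:hadoop-aws:3.3.2",
    "spark.hadoop.fs.s3a.access.key xxx",
    "spark.hadoop.fs.s3a.secret.key yyy",
    "spark.hadoop.fs.s3a.session.token zzz",
])
defaults_path = os.path.join(conf_dir, "spark-defaults.conf")
with open(defaults_path, "w") as f:
    f.write(defaults + "\n")
```

Every local Spark session started from that venv then picks up the S3A settings without per-task `spark_conf` entries.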
c) Set an environment variable for your terminal session:

```
export SPARK_LOCAL_IP="127.0.0.1"
```

Alternatively, add the variable to your bash or zsh profile.
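If you prefer to keep everything in Python, the same variable can also be set in-process, as long as it happens before the Spark session (and its JVM) is created; a minimal sketch:

```python
import os

# Equivalent to `export SPARK_LOCAL_IP="127.0.0.1"`; must run before
# the SparkSession starts so the JVM inherits it.
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"
```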