Okay, I'll get into some of the architecture here to try to explain things. Reading data into Spark usually involves a few operations; note that the order of these operations, and the machine they run on, can change depending on the input format. There is a short sketch of these steps below the list.

  1. Predicate push down/metadata calculations - figuring out which data actually needs to be read, so we avoid reading more than necessary.
  2. Data transfer - copying the data from wherever it is stored to the Spark node so it can be processed further.
  3. Data decoding - translating the data into the internal format that Spark wants.

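To make those three steps concrete, here is a minimal plain-Spark sketch (the bucket path and column names are made up); it is not specific to the accelerator, it just shows where each step happens for a Parquet read:

```scala
import org.apache.spark.sql.SparkSession

object ReadPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-pipeline-sketch").getOrCreate()

    // Step 1 (predicate push down / metadata): the filter and the column pruning
    // from select() are pushed into the Parquet reader, so footers and row-group
    // statistics decide which bytes get read at all.
    val df = spark.read
      .parquet("s3a://my-bucket/events/")          // hypothetical location
      .filter("event_date = '2022-04-01'")
      .select("user_id", "event_type")

    // Steps 2 and 3 (transfer + decode) only happen once an action runs: the
    // surviving row groups are copied from the blob store to the executors and
    // then decoded into Spark's internal format.
    println(df.count())

    // The physical plan shows the pushed filters and pruned schema from step 1.
    df.explain()

    spark.stop()
  }
}
```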
For file formats like Parquet and ORC stored in a blob store such as S3, we can only really accelerate the data decoding. …
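
For completeness, a sketch (not official setup instructions) of how that decode step is typically handed to the GPU; it assumes the rapids-4-spark jar is already on the classpath and that the configuration keys below match your plugin version:

```scala
import org.apache.spark.sql.SparkSession

object GpuDecodeSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the rapids-4-spark jar is on the driver/executor classpath and
    // a GPU is available. Only step 3 (decode) moves to the GPU; steps 1 and 2
    // (metadata work and the copy out of S3) behave as in stock Spark.
    val spark = SparkSession.builder()
      .appName("gpu-decode-sketch")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin") // load the RAPIDS Accelerator
      .config("spark.rapids.sql.enabled", "true")            // keep SQL acceleration on
      .getOrCreate()

    // If the scan was placed on the GPU, the physical plan shows a GPU scan
    // operator where the CPU file scan would otherwise be.
    spark.read.parquet("s3a://my-bucket/events/").explain()  // hypothetical path

    spark.stop()
  }
}
```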
