Spark vs Trino Parquet reads #19158
Replies: 3 comments
-
Trino has a pipelined execution model with no barriers between stages. This means output can start appearing for the user as soon as some rows are processed; there is no need to wait for the entire dataset to be processed. The other thing is that the Trino devs have spent a lot of time implementing and optimizing a custom Parquet reader instead of using the parquet-mr reader for the whole thing. How are you measuring the execution time, however? If you are using the trino-cli, note that it paginates the results, so unless you print all results (by redirecting output to some file or ...
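A toy sketch of the pipelined-vs-barrier point above (this is plain Python, not Trino or Spark code; the names are made up for illustration): a generator-based pipeline hands back the first row after touching only one input row, whereas a barrier-style engine must materialize everything before the first row is available.

```python
def transform(source, counter):
    """Pipelined operator: processes one row at a time and yields it immediately."""
    for row in source:
        counter[0] += 1  # count how many input rows were actually processed
        yield row * 2

counter = [0]
stream = transform(range(1_000_000), counter)

# Pipelined ("Trino-like"): pulling the first result touches only 1 input row.
first = next(stream)

# Barrier style ("stage boundary"): everything must be processed up front
# before any result is visible, e.g.:
#   all_rows = list(transform(range(1_000_000), [0]))
```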
-
Thanks @hashhar. Regarding:
Can you give me source code that points to this, please?
-
My comparison is based on query runs against the same Parquet data source with the same(ish) query. Yes, I took some liberties in saying same(ish), because for Spark we do a ton of extra work (dataframe processing, registering the data source as temp tables) before we run the queries, but the times are accurate by our measurement against the same data.
-
I am new to the Trino world and absolutely amazed at how fast its Parquet reads are. For example:
A Parquet dataset with 150+ columns and ~1 TB of data returns in about 40 s on Trino, while on Spark (leaving aside some headaches) the same thing takes about an hour to read and return.
Queries are
I am trying to understand what makes this happen. Predicate pushdown exists in both systems, yet Trino is so much faster. How? Can anyone point to the code where this happens?
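For anyone else puzzling over what predicate pushdown buys a Parquet reader, here is a toy sketch (plain Python, not Trino code; the data layout and function are invented for illustration) of how per-row-group min/max statistics let a reader skip whole row groups without decoding them:

```python
# Each Parquet row group carries min/max statistics for each column.
# For a predicate like `value > 900`, groups whose max <= 900 can be
# skipped entirely -- no decompression, no decoding.
row_groups = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 900, "max": 999, "values": list(range(900, 1000))},
]

def scan_greater_than(groups, threshold):
    matched, groups_read = [], 0
    for g in groups:
        if g["max"] <= threshold:  # statistics prove no row can match: skip
            continue
        groups_read += 1           # only now do we "decode" the group
        matched.extend(v for v in g["values"] if v > threshold)
    return matched, groups_read

result, read = scan_greater_than(row_groups, 900)
```

Here only one of the three row groups is actually read; on a 1 TB table with selective predicates, that skipping is a large part of where the time goes.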