Spark vs Trino Parquet reads #19158
Replies: 3 comments
-
Trino has a pipelined execution model with no barriers between stages. This means output can start appearing for the user as soon as some rows are processed; there is no need to wait for the entire dataset to be processed. The other thing is that the Trino devs have spent a lot of time implementing and optimizing a custom Parquet reader instead of using the parquet-mr reader for the whole thing. How are you measuring the execution time, however? If you are using the trino-cli, note that it paginates the results, so unless you print all results (by redirecting output to some file or ...
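A toy sketch of the pipelined-vs-barrier point above (this is plain Python, not Trino or Spark code; the names are made up for illustration): a generator-based pipeline hands back the first row after touching only one input row, whereas a barrier-style engine must materialize everything before the first row is available.

```python
def transform(source, counter):
    """Pipelined operator: processes one row at a time and yields it immediately."""
    for row in source:
        counter[0] += 1  # count how many input rows were actually processed
        yield row * 2

counter = [0]
stream = transform(range(1_000_000), counter)

# Pipelined ("Trino-like"): pulling the first result touches only 1 input row.
first = next(stream)

# Barrier style ("stage boundary"): everything must be processed up front
# before any result is visible, e.g.:
#   all_rows = list(transform(range(1_000_000), [0]))
```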
-
Thanks @hashhar. Regarding:
Can you give me source code that points to this, please?
-
My comparison is based on query runs against the same Parquet data source with the same(ish) query. Yes, I took some liberties in saying same(ish), because for Spark we do a ton of extra work (dataframe processing, registering the data source as temp tables) before we run the queries, but the times are accurate by our measurement against the same data.
-
I am new to the Trino world and absolutely amazed at how fast its Parquet reads are. For example:
A Parquet dataset with 150+ columns and ~1 TB of data returns in about 40 s on Trino, while on Spark (leaving aside some headaches) the same thing takes about an hour to read and return.
Queries are
I am trying to understand what makes this happen. Predicate pushdown exists in both systems, yet Trino is so much faster. How? Can anyone point to the code where this happens?
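For anyone else puzzling over what predicate pushdown buys a Parquet reader, here is a toy sketch (plain Python, not Trino code; the data layout and function are invented for illustration) of how per-row-group min/max statistics let a reader skip whole row groups without decoding them:

```python
# Each Parquet row group carries min/max statistics for each column.
# For a predicate like `value > 900`, groups whose max <= 900 can be
# skipped entirely -- no decompression, no decoding.
row_groups = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 900, "max": 999, "values": list(range(900, 1000))},
]

def scan_greater_than(groups, threshold):
    matched, groups_read = [], 0
    for g in groups:
        if g["max"] <= threshold:  # statistics prove no row can match: skip
            continue
        groups_read += 1           # only now do we "decode" the group
        matched.extend(v for v in g["values"] if v > threshold)
    return matched, groups_read

result, read = scan_greater_than(row_groups, 900)
```

Here only one of the three row groups is actually read; on a 1 TB table with selective predicates, that skipping is a large part of where the time goes.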