refactor: (WIP) experimental parquet reader #18498

dantengsky · 2025-08-07T15:22:40Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces an experimental Parquet deserialization component aimed at improving the efficiency of Parquet data deserialization for Fuse tables.

This PR is the first part of a series of improvements. Many additional features and capabilities still need to be implemented (TODO: list the outstanding items).

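Initial testing on TPC-H SF 1000 shows that the optimizations are effective: depending on the column, the experimental reader takes about 25% to 88% of the production reader's scan time.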

Comparison of scanning each column of table lineitem:

using select col from lineitem ignore_result
data fully cached on local NVMe SSD
warm-up before each measurement (columns are almost entirely in the page cache for each measured run)

Column (Type)	experimental_reader (s)	production_reader (s)	experimental / production
l_orderkey (BIGINT)	4.479	5.719	78.32%
l_partkey (BIGINT)	5.424	7.17	75.65%
l_suppkey (BIGINT)	4.855	6.738	72.05%
l_linenumber (BIGINT)	4.264	4.824	88.39%
l_quantity (DECIMAL)	4.16	16.563	25.12%
l_extendedprice (DECIMAL)	4.74	18.316	25.88%
l_discount (DECIMAL)	4.598	15.849	29.01%
l_tax (DECIMAL)	4.514	15.813	28.55%
l_returnflag (VARCHAR)	4.641	11.36	40.85%
l_linestatus (VARCHAR)	4.495	11.833	37.99%
l_shipdate (DATE)	0.446	1.392	32.04%
l_commitdate (DATE)	2.24	2.758	81.22%
l_receiptdate (DATE)	2.17	2.561	84.73%
l_shipinstruct (VARCHAR)	9.01	17.387	51.82%
l_shipmode (VARCHAR)	6.059	13.922	43.52%
l_comment (VARCHAR)	20.381	38.873	52.43%
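
The last column is the experimental time divided by the production time, so lower is better. A minimal Python sketch that recomputes that ratio, and the speedup it implies, for a few rows copied from the table above; the names TIMINGS, ratio_percent, and speedup are illustrative and not part of this PR:

```python
# Timings in seconds copied from the table above: (experimental, production).
# Only a few representative rows are included.
TIMINGS = {
    "l_orderkey": (4.479, 5.719),
    "l_quantity": (4.16, 16.563),
    "l_shipdate": (0.446, 1.392),
    "l_comment": (20.381, 38.873),
}


def ratio_percent(experimental: float, production: float) -> float:
    """Experimental time as a percentage of production time (lower is better)."""
    return round(experimental / production * 100, 2)


def speedup(experimental: float, production: float) -> float:
    """How many times faster the experimental reader is than the production one."""
    return round(production / experimental, 2)


for column, (exp_s, prod_s) in TIMINGS.items():
    print(
        f"{column}: {ratio_percent(exp_s, prod_s)}% of production time, "
        f"{speedup(exp_s, prod_s)}x speedup"
    )
```

For l_quantity this reproduces the 25.12% in the table, which corresponds to a speedup of just under 4x.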

To enable the experimental reader, set use_parquet2 = 1 (this will be changed to a different setting later).

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):

github-actions · 2025-08-14T15:47:12Z

Docker Image for PR

tag: pr-18498-9d2251f-1755186352

note: this image tag is only available for internal use.

github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Aug 7, 2025

dantengsky force-pushed the refactor-experiment-reader-cut-2 branch 2 times, most recently from 94d48cf to b24cb26 Compare August 7, 2025 16:04

dantengsky and others added 13 commits August 11, 2025 13:17

refactor: exprimental parquet reader

4eefb2b

revert string iterator( performance degrades)

078e3b2

fix missing license header

7c401f6

revert configs

d9c8fb6

taplo fmt

d0a3197

toml format

dfa5af9

format toml, again

485aa61

fix: revert arrow converters

585c9b2

fix bitmap len mismatch

b8e1e76

tweak parquet physical type mappings

2f0b44a

align lz4_flex version

75ee589

rename switch settings name to use_experimental_parquet_reader

c9e6f2c

re-enable dictionary decoding for string column

06c7e5d

dantengsky force-pushed the refactor-experiment-reader-cut-2 branch from c62d3ec to 06c7e5d Compare August 11, 2025 05:29

dantengsky added 7 commits August 13, 2025 23:39

optimize plain encoding

004a557

try optimize small string rle path

3b72afe

optimize rle decode

7c3f6d0

optimize string view

ff1f7a5

optimize ptr access

b4639d3

use u8 for small string len

42ab20e

try unflold loop

7291750

dantengsky added the ci-cloud label (Build docker image for cloud test) Aug 14, 2025
