
@dantengsky dantengsky commented Aug 7, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces an experimental Parquet deserialization component aimed at improving the efficiency of Parquet data deserialization for Fuse tables.

This PR is the first part of a series of improvements.

However, many additional features and capabilities still need to be implemented (TODO: add a detailed list of the remaining items).

Initial testing on TPC-H SF 1000 shows that the optimizations are effective:

Comparison of scanning columns of table lineitem

  • using select col from lineitem ignore_result
  • data fully cached on local NVMe SSD
  • warm-up before each measurement (columns almost entirely in the page cache for each measured run)

| Column (Type) | experimental_reader (s) | production_reader (s) | experimental / production |
|---|---|---|---|
| l_orderkey (BIGINT) | 4.479 | 5.719 | 78.32% |
| l_partkey (BIGINT) | 5.424 | 7.17 | 75.65% |
| l_suppkey (BIGINT) | 4.855 | 6.738 | 72.05% |
| l_linenumber (BIGINT) | 4.264 | 4.824 | 88.39% |
| l_quantity (DECIMAL) | 4.16 | 16.563 | 25.12% |
| l_extendedprice (DECIMAL) | 4.74 | 18.316 | 25.88% |
| l_discount (DECIMAL) | 4.598 | 15.849 | 29.01% |
| l_tax (DECIMAL) | 4.514 | 15.813 | 28.55% |
| l_returnflag (VARCHAR) | 4.641 | 11.36 | 40.85% |
| l_linestatus (VARCHAR) | 4.495 | 11.833 | 37.99% |
| l_shipdate (DATE) | 0.446 | 1.392 | 32.04% |
| l_commitdate (DATE) | 2.24 | 2.758 | 81.22% |
| l_receiptdate (DATE) | 2.17 | 2.561 | 84.73% |
| l_shipinstruct (VARCHAR) | 9.01 | 17.387 | 51.82% |
| l_shipmode (VARCHAR) | 6.059 | 13.922 | 43.52% |
| l_comment (VARCHAR) | 20.381 | 38.873 | 52.43% |
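
For reading the last column: it appears to be the experimental reader's time divided by the production reader's time, so lower is better. As a worked example using the l_quantity row (the "speedup" figure is derived here and is not reported in the measurements themselves):

```math
\frac{t_{\text{experimental}}}{t_{\text{production}}} = \frac{4.16}{16.563} \approx 25.12\%,
\qquad
\text{speedup} = \frac{t_{\text{production}}}{t_{\text{experimental}}} = \frac{16.563}{4.16} \approx 3.98\times
```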

To enable the experimental reader, set use_parquet2 = 1 (this will be changed to a different setting later).
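
A measurement run in the style described above might look like the following sketch. It assumes the setting is toggled at session level with SET, that use_parquet2 = 0 selects the production reader, and it uses l_quantity only as an example column; none of these details are spelled out in this PR.

```sql
-- Assumption: the setting is applied per session; the PR only names the setting.
SET use_parquet2 = 1;

-- Warm-up run so the column data is in the page cache (not measured).
SELECT l_quantity FROM lineitem IGNORE_RESULT;

-- Measured run with the experimental reader.
SELECT l_quantity FROM lineitem IGNORE_RESULT;

-- Assumption: 0 switches back to the production reader for the comparison run.
SET use_parquet2 = 0;
SELECT l_quantity FROM lineitem IGNORE_RESULT;
```

IGNORE_RESULT discards the query output, so the timings mainly reflect the scan and deserialization work rather than result transfer.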

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Aug 7, 2025
@dantengsky dantengsky force-pushed the refactor-experiment-reader-cut-2 branch 2 times, most recently from 94d48cf to b24cb26 Compare August 7, 2025 16:04
@dantengsky dantengsky force-pushed the refactor-experiment-reader-cut-2 branch from c62d3ec to 06c7e5d Compare August 11, 2025 05:29
@dantengsky dantengsky added the ci-cloud Build docker image for cloud test label Aug 14, 2025

Docker Image for PR

  • tag: pr-18498-9d2251f-1755186352

note: this image tag is only available for internal use.
