refactor: (WIP) experimental parquet reader #18498

Draft
wants to merge 13 commits into
base: main

dantengsky
Member

@dantengsky dantengsky commented Aug 7, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces an experimental Parquet deserialization component aimed at improving the efficiency of Parquet data deserialization for Fuse tables.

This PR is the first part of a series of improvements.

However, many additional features and capabilities still need to be implemented (TODO: add a detailed list of to-do items).

Initial testing on TPC-H SF 1000 shows that the optimizations are effective:

Comparison of single-column scan times on table lineitem:

  • each query has the form select col from lineitem ignore_result
  • data is fully cached on local NVMe SSD
  • a warm-up run precedes each measurement (the columns are almost entirely in the page cache for each measured run)
| Column | Type | experimental_reader (s) | production_reader (s) | experimental / production |
|---|---|---|---|---|
| l_orderkey | BIGINT | 4.479 | 5.719 | 78.32% |
| l_partkey | BIGINT | 5.424 | 7.17 | 75.65% |
| l_suppkey | BIGINT | 4.855 | 6.738 | 72.05% |
| l_linenumber | BIGINT | 4.264 | 4.824 | 88.39% |
| l_quantity | DECIMAL | 4.16 | 16.563 | 25.12% |
| l_extendedprice | DECIMAL | 4.74 | 18.316 | 25.88% |
| l_discount | DECIMAL | 4.598 | 15.849 | 29.01% |
| l_tax | DECIMAL | 4.514 | 15.813 | 28.55% |
| l_returnflag | VARCHAR | 4.641 | 11.36 | 40.85% |
| l_linestatus | VARCHAR | 4.495 | 11.833 | 37.99% |
| l_shipdate | DATE | 0.446 | 1.392 | 32.04% |
| l_commitdate | DATE | 2.24 | 2.758 | 81.22% |
| l_receiptdate | DATE | 2.17 | 2.561 | 84.73% |
| l_shipinstruct | VARCHAR | 9.01 | 17.387 | 51.82% |
| l_shipmode | VARCHAR | 6.059 | 13.922 | 43.52% |
| l_comment | VARCHAR | 20.381 | 38.873 | 52.43% |
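
For readers checking the numbers: the last column of the table above is the experimental reader's time divided by the production reader's time, so lower is better. The following is a small sketch that recomputes that ratio and the equivalent speedup factor for a few rows. The helper names `time_ratio` and `speedup` are illustrative and not part of the PR.

```python
# Timings in seconds copied from a few rows of the table above:
# column -> (experimental_reader, production_reader).
TIMINGS = {
    "l_orderkey": (4.479, 5.719),
    "l_quantity": (4.16, 16.563),
    "l_shipdate": (0.446, 1.392),
    "l_comment": (20.381, 38.873),
}


def time_ratio(experimental_s, production_s):
    """Experimental time as a percentage of production time (the table's last column)."""
    return 100.0 * experimental_s / production_s


def speedup(experimental_s, production_s):
    """How many times faster the experimental reader is than the production reader."""
    return production_s / experimental_s


for column, (exp_s, prod_s) in TIMINGS.items():
    print(
        f"{column}: {time_ratio(exp_s, prod_s):.2f}% of production time, "
        f"{speedup(exp_s, prod_s):.2f}x speedup"
    )
```

Read this way, the DECIMAL columns show the largest gain (roughly a quarter of the production time), while l_linenumber and the l_commitdate and l_receiptdate columns show the smallest.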

To enable the experimental reader, set use_parquet2 = 1 (this will be moved to a different setting later).
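
As a rough illustration of how one measurement described in this PR could be scripted, the sketch below switches readers through the use_parquet2 setting, performs warm-up runs, and then times a single scan. It is not the harness used for the numbers in this PR. The `execute` parameter is a hypothetical callable that runs one SQL statement on an open session, and the assumption that use_parquet2 = 0 selects the production reader is mine, not stated in the description.

```python
import time


def measure_column_scan(execute, column, use_experimental, warmup_runs=1):
    """Time one full-column scan of lineitem after the given number of warm-up runs.

    `execute` is a hypothetical callable that runs a single SQL statement;
    it stands in for whatever client is actually used.
    """
    # Setting name taken from the PR description: 1 enables the experimental reader.
    # Assumption: 0 falls back to the production reader.
    execute(f"SET use_parquet2 = {1 if use_experimental else 0}")

    # ignore_result keeps result transfer out of the measurement.
    query = f"SELECT {column} FROM lineitem IGNORE_RESULT"

    # Warm-up runs bring the column data into the page cache.
    for _ in range(warmup_runs):
        execute(query)

    start = time.perf_counter()
    execute(query)
    return time.perf_counter() - start
```

Passing the executor in as a parameter keeps the sketch independent of any particular client library; calling it once with `use_experimental=True` and once with `False` for the same column gives the two timings that a table row compares.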

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-refactor this PR changes the code base without new features or bugfix label Aug 7, 2025
@dantengsky dantengsky force-pushed the refactor-experiment-reader-cut-2 branch 2 times, most recently from 94d48cf to b24cb26 Compare August 7, 2025 16:04
@dantengsky dantengsky force-pushed the refactor-experiment-reader-cut-2 branch from c62d3ec to 06c7e5d Compare August 11, 2025 05:29
Contributor

🤖 Smart Auto-retry Analysis

Workflow: 16871687290

📊 Summary

  • Total Jobs: 22
  • Failed Jobs: 1
  • Retryable: 0
  • Code Issues: 1

NO RETRY NEEDED

All failures appear to be code/test issues requiring manual fixes.

🔍 Job Details

  • linux / check: Not retryable (Code/Test)

🤖 About

Automated analysis using job annotations to distinguish infrastructure issues (auto-retried) from code/test issues (manual fixes needed).
