feat: primitive parquet reader with page pruning #3244
base: main
Conversation
```diff
-      cache: "poetry"
-      cache-dependency-path: |
-        ${{ inputs.working-directory }}/poetry.lock
+      # cache: "poetry"
```
I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.
```dockerfile
# Build with: docker build --target <service_name> -t <tag> .

ARG PYTHON_VERSION=3.12.11
FROM python:${PYTHON_VERSION}-slim AS viewer
```
Building the Rust-based libviewer as a wheel so that the compiler toolchains are not included in the final Docker images.
```toml
optional = true

[tool.poetry.group.libviewer.dependencies]
libviewer = { path = "../libviewer", develop = true }
```
Originally I added it as a mandatory dependency, but I wasn't able to convince poetry to skip installing it from source in the Docker image and use a prebuilt wheel instead, see https://github.com/huggingface/dataset-viewer/pull/3244/files#r2432505025.
Apparently path dependencies don't work well with compiled extension modules. Ideally we would build wheels for all the internal libs (libviewer, libcommon, libapi), but the dependency versions pinned in the pyproject files are looser than what we have in the poetry lockfiles, and some of the builds/tests are sensitive to those dependencies.
So I chose to define libviewer as an optional dependency: we install it only in the relevant services, using prebuilt wheels in the containers, and with `--with libviewer` during local development.
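One practical consequence of an optional dependency is that shared code has to tolerate libviewer being absent. A minimal sketch of such an import guard, assuming nothing about the actual integration (the flag name is illustrative):

```python
# Optional dependency: installed from prebuilt wheels in the containers,
# or with `poetry install --with libviewer` during local development.
try:
    import libviewer

    LIBVIEWER_AVAILABLE = True
except ImportError:
    libviewer = None  # type: ignore[assignment]
    LIBVIEWER_AVAILABLE = False
```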
lhoestq
left a comment
Looks pretty good! The main point to address is raising TooBigRows when possible, to avoid OOMing the /rows worker.
```python
# pa_table, truncated_columns = rows_index.query_truncated_binary(
#     offset=offset, length=length
# )
pa_table = rows_index.query_with_page_pruning(offset=offset, length=length)
```
We can see later about truncating binary data (sometimes users have a column with very long binary data, and we truncate it when reading to avoid OOMing).
What we will need right away, though, is to raise TooBigRows if the resulting record batches are likely to cause an OOM (if they can use >300MB of RAM). We can use a simple heuristic based on the average row size in the row group to know whether it's safe to run the query.
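A minimal sketch of that heuristic, assuming access to the cached parquet metadata via pyarrow and a byte budget like the `max_scan_size` used elsewhere in this PR (the function name and the local `TooBigRows` definition are illustrative):

```python
import pyarrow.parquet as pq


class TooBigRows(Exception):
    """Stand-in for the TooBigRows error mentioned in this review."""


def raise_if_scan_too_big(
    metadata: pq.FileMetaData, offset: int, length: int, max_scan_size: int
) -> None:
    """Estimate the RAM needed for rows [offset, offset + length) and
    raise TooBigRows if it exceeds max_scan_size.

    Heuristic: each row in a row group costs the group's average row size
    (total_byte_size / num_rows); sum over the groups the query touches.
    """
    estimated = 0
    remaining_offset, remaining = offset, length
    for i in range(metadata.num_row_groups):
        group = metadata.row_group(i)
        if remaining_offset >= group.num_rows:
            # The whole row group lies before the requested offset.
            remaining_offset -= group.num_rows
            continue
        rows_here = min(group.num_rows - remaining_offset, remaining)
        estimated += rows_here * group.total_byte_size // group.num_rows
        remaining_offset = 0
        remaining -= rows_here
        if remaining <= 0:
            break
    if estimated > max_scan_size:
        raise TooBigRows(f"estimated scan of {estimated} bytes exceeds {max_scan_size}")
```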
Force-pushed from 8a13593 to d4d0259
```python
YAML_FIELDS_TO_CHECK = ["dataset_info", "configs", "viewer", "language"]

USE_LIBVIEWER_FOR_DATASETS = True
```
Here we can either enable libviewer globally or provide a set of dataset names to selectively enable libviewer for them.
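For illustration, the per-dataset variant could look something like this (only `USE_LIBVIEWER_FOR_DATASETS` comes from the diff; the helper and the set-typed variant are hypothetical):

```python
from typing import Union

# True/False toggles libviewer globally; a set of dataset names would
# enable it selectively, as described in the comment above.
USE_LIBVIEWER_FOR_DATASETS: Union[bool, set[str]] = True


def should_use_libviewer(dataset: str) -> bool:
    if isinstance(USE_LIBVIEWER_FOR_DATASETS, bool):
        return USE_LIBVIEWER_FOR_DATASETS
    return dataset in USE_LIBVIEWER_FOR_DATASETS
```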
```python
parquet_metadata_directory: StrPath,
max_arrow_data_in_memory: int,
max_scan_size: int,
hf_token: Optional[str] = None,
```
hf_token is needed for libviewer but not for the old index, since httpfs already contains it. I'm not unifying them because the session handling for httpfs is a little hard to follow, and it would also complicate the testing. If libviewer turns out to be stable, we can remove httpfs entirely along with the old indexer.
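A rough sketch of the split described here; the constructor parameters match the diff above, while the libviewer constructor name and its keyword arguments are assumptions (they are not shown on this page):

```python
from typing import Optional

import libviewer  # the Rust extension built in this PR

StrPath = str  # simplified stand-in for libcommon's StrPath alias


class RowsIndex:
    def __init__(
        self,
        parquet_metadata_directory: StrPath,
        max_arrow_data_in_memory: int,
        max_scan_size: int,
        hf_token: Optional[str] = None,
    ) -> None:
        self.max_scan_size = max_scan_size
        # The old pyarrow index reads over httpfs, whose session already
        # carries the token, so hf_token is not threaded through it. Only
        # the libviewer index receives it explicitly (hypothetical
        # constructor name and kwargs).
        self.viewer_index = libviewer.Dataset(
            metadata_dir=str(parquet_metadata_directory),
            hf_token=hf_token,
        )
```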
```python
def _init_dataset_info(self, parquet_metadata_directory: StrPath) -> None:
    # get the list of parquet files and features
    with StepProfiler(method="rows_index._get_dataset_metadata", step="all"):
        response = get_previous_step_or_raise(
```
This queries the dataset information from mongo. Note that we should switch to an asynchronous approach here to avoid blocking the event loop: libviewer supports async queries, so RowsIndex can become async-first in the future.
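Until that refactor, the blocking call could at least be moved off the event loop. A sketch, assuming the `sync_scan` signature quoted later in this thread (everything else is illustrative):

```python
import asyncio


async def scan_async(viewer_index, offset: int, length: int, max_scan_size: int):
    # Run the blocking libviewer scan in a worker thread so the event loop
    # stays responsive; a native async libviewer query could replace this
    # once RowsIndex is async-first.
    return await asyncio.to_thread(
        viewer_index.sync_scan,
        offset=offset,
        limit=length,
        scan_size_limit=max_scan_size,
    )
```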
Force-pushed from 6a05418 to 77b78a2
Force-pushed from bc63468 to cf46d41
Commits:
- …dex_is_partial()`
- chore: add .env.debug configuration
- chore: add .env.debug
- feat: primitive parquet reader with page pruning
- add poetry build for libviewer
- add libviewer to rows
- refactor: only extract metadata and don't try to calculate offset index
- ci: update dockerfiles to include the rust toolchain and libviewer
- chore: pin python to 3.12.11 in libviewer and update lockfile
- feat: use PageIndexPolicy to optionally read offset index
- feat: support querying RowsIndex with page pruning
- build: add libviewer as a dependency to libcommon
- style: ruff format libcommon changes
- chore: use query_with_page_pruning from the rows endpoint
- chore: fix mypy errors
- style: import Sequence from collections.abc
- build: don't use libviewer as an editable dependency
- build: try to configure poetry to properly install libviewer
- ci: temporarily disable poetry cache
- style: fixx ruff check errors
- build: relock projects depending on libcommon
- build: add rust toolchain to more dockerfiles
- build: copy the entire libviewer directory in dockerfiles because poetry install is called at the build phase
- build: turn libviewer an optional dependency due to build difficulties
- chore: missing api stage from dockerfile
- ci: install libviewer extra in the libcommon build
- style: fix ruff check error in parquet utils
- ci: disable poetry cache
- feat: raise TooBigRows exceptions if the scan size would exceed a limit
- feat: implement binary truncation for page pruning reader
- style: ignore variable shadowing ruff check
- ci: install libviewer in the worker image
- feat: pass hf_token to the opendal store
- chore: remove files_to_index estimation
- chore: poetry lock worker service
- chore: remove reduntand gitignore entries from libviewer
- ci: install libviewer in the worker build
- style: fix mypy ignore
- chore: cleanup the libviewer python code
- style: try to please mypy due to missing import
- style: make token optional
- test: make the mocking compatible with the page pruning reader in test_first_rows
Looks good to me overall! 🚀
Any idea why we get this in the e2e CI?
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 561, in query_libviewer_index
batches = self.viewer_index.sync_scan(offset=offset, limit=length, scan_size_limit=self.max_scan_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
libviewer.PyDatasetError: External: Generic LocalFileSystem error: Requested range was invalid
And for the worker CI, you can fix it with this change, I believe:

```diff
- assert len(mock_http_file.call_args_list) == 1
+ assert len(mock_http_file.call_args_list) == 0  # it uses libviewer, not pyarrow + HTTPFile
```

```yaml
uses: ./.github/workflows/_unit-tests-python.yml
with:
  working-directory: libs/libviewer
  poetry-args: "--with dev"
secrets: inherit
```
Check warning — Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Copilot Autofix (AI, 3 days ago)
To fix the problem, an explicit permissions block should be added to the workflow file. In this case, since no job appears to need write access (all jobs do quality checks and run unit tests, based on their names), it is safest to set contents: read at the workflow level, unless any reusable workflow requires something broader in its own documentation (not shown here).

Steps to fix:
- Add a permissions: section at the root of .github/workflows/l-libcommon.yml (e.g., after name: and before on:).
- Use contents: read as the default minimal permissions required.
- If later you discover a job or called reusable workflow requires more (such as pull-requests: write), you can add those specifically.

Regions to change:
- Insert the following after line 4 (after name: libs/libcommon):

  permissions:
    contents: read

This does not require any new methods, imports, or definitions, as it is a YAML configuration change.
```diff
@@ -2,6 +2,8 @@
 # Copyright 2022 The HuggingFace Authors.

 name: libs/libcommon
+permissions:
+  contents: read
 on:
   workflow_dispatch:
   push:
```
Prototype implementation of an arrow-rs based page-pruning parquet reader for low-latency limit/offset queries.
It is a standalone library for now; it hasn't been integrated into the viewer yet.
Install

```shell
cd libs/libviewer
pip install maturin
maturin develop -r
```

Index Dataset

This uses huggingface_hub to download and cache the dataset files, then creates a metadata file for each parquet file in the dataset with the offset index included.
Remove --use-cache to directly download the files from the hub.

Execute a limit/offset query
This will query the dataset using the local metadata index files. The scanner only reads the necessary parquet pages to minimize network traffic.
Remove --use-cache to directly query data from the hub.
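For a sense of the query surface, the call that libcommon ends up making looks roughly like this; only `sync_scan` and its keyword arguments are taken from the traceback quoted in the review comments above, and the index construction is elided because its API isn't shown on this page:

```python
# `viewer_index` stands for an already-constructed libviewer dataset index;
# its constructor is not shown in this thread. sync_scan() and its kwargs
# match the traceback quoted in the review above.
batches = viewer_index.sync_scan(
    offset=0,                           # first row of the requested page
    limit=100,                          # number of rows to return
    scan_size_limit=300 * 1024 * 1024,  # bail out if the estimated scan is larger
)
```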
Integration and testing

Before covering it with tests, it would be nice to see the necessary API for integration.
Supersedes #3199