@kszucs kszucs commented Oct 14, 2025

Prototype implementation of an arrow-rs-based, page-pruning Parquet reader for low-latency limit/offset queries.

It is a standalone library for now; it hasn't been integrated into the viewer yet.

Install

cd libs/libviewer
pip install maturin
maturin develop -r  

Index Dataset

dv --use-cache nvidia/OpenCodeReasoning index

This uses huggingface_hub to download and cache the dataset files.
It then creates a metadata file for each parquet file in the dataset,
with the offset index included.

Remove --use-cache to directly download the files from the hub.

Execute a limit/offset query

dv --use-cache nvidia/OpenCodeReasoning query --limit 10 --offset 0

This will query the dataset using the local metadata index files.
The scanner reads only the necessary parquet pages to minimize
network traffic.

Remove --use-cache to directly query data from the hub.
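The pruning idea can be sketched in a few lines (a simplified illustration, not the actual arrow-rs scanner): given the per-chunk row counts recorded in the index, only the chunks that overlap the requested window need to be fetched.

```python
def prune(row_counts, offset, limit):
    """Illustrative sketch: return indices of the chunks (row groups or
    pages) that overlap the requested window [offset, offset + limit)."""
    selected = []
    start, end = 0, offset + limit
    for i, n in enumerate(row_counts):
        stop = start + n  # this chunk covers rows [start, stop)
        if start < end and stop > offset:
            selected.append(i)
        start = stop
    return selected
```

With page-level row counts from the offset index, the same logic selects individual pages instead of whole row groups, which is what keeps the transferred byte count small.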

Integration and testing

Before covering it with tests, it would be good to settle on the API needed for integration.

Supersedes #3199

cache: "poetry"
cache-dependency-path: |
${{ inputs.working-directory }}/poetry.lock
# cache: "poetry"
kszucs (Member Author) commented:

I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.

# Build with: docker build --target <service_name> -t <tag> .

ARG PYTHON_VERSION=3.12.11
FROM python:${PYTHON_VERSION}-slim AS viewer
kszucs (Member Author) commented:

Building the Rust-based libviewer as a wheel so that the compiler toolchains aren't included in the final Docker images.

optional = true

[tool.poetry.group.libviewer.dependencies]
libviewer = { path = "../libviewer", develop = true }
kszucs (Member Author) commented:

Originally I added it as a mandatory dependency, but I wasn't able to convince poetry to use a prebuilt wheel in the docker image rather than installing it from source, see https://github.com/huggingface/dataset-viewer/pull/3244/files#r2432505025.

Apparently path dependencies don't work well with compiled extension modules. Ideally we would build wheels for all the internal libs (libviewer, libcommon, libapi), but the dependency versions pinned in the pyproject files are looser than those in the poetry lockfiles, and some of the builds/tests are sensitive to those dependencies.

So I chose to define libviewer as an optional dependency, which we install only into the relevant services using prebuilt wheels in the containers, and with --with libviewer during local development.

@lhoestq (Member) left a comment:

Looks pretty good! The main point to address is raising TooBigRows when possible, to avoid OOMing the /rows worker.

Comment on lines 103 to 106
# pa_table, truncated_columns = rows_index.query_truncated_binary(
# offset=offset, length=length
# )
pa_table = rows_index.query_with_page_pruning(offset=offset, length=length)

We can see later to truncate binary data (sometimes users have a column with very long binary data, and we truncate it when reading to avoid OOM)

Though what we will need right away is to raise TooBigRows if the resulting record batches are likely to cause an OOM (if they can use >300MB of RAM). We can use a simple heuristic based on the average row size in the row group to know if it's safe to run the query or not
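A minimal sketch of such a heuristic, assuming per-row-group byte sizes and row counts are available from the parquet metadata (the exception name mirrors the discussion; the ~300MB limit and the helper's signature are illustrative, not the actual implementation):

```python
MAX_ARROW_DATA_IN_MEMORY = 300 * 1024 * 1024  # assumed ~300MB budget

class TooBigRows(Exception):
    """Raised when a query would likely exceed the memory budget."""

def check_scan_size(group_byte_sizes, group_row_counts, offset, length,
                    max_bytes=MAX_ARROW_DATA_IN_MEMORY):
    """Estimate the bytes a [offset, offset + length) window would load,
    using the average row size of each overlapping row group."""
    estimated, start, end = 0.0, 0, offset + length
    for size, rows in zip(group_byte_sizes, group_row_counts):
        stop = start + rows  # this row group covers rows [start, stop)
        overlap = max(0, min(stop, end) - max(start, offset))
        if overlap and rows:
            estimated += size * overlap / rows  # average-row-size heuristic
        start = stop
    if estimated > max_bytes:
        raise TooBigRows(f"estimated {estimated:.0f} bytes exceeds {max_bytes}")
    return estimated
```

The heuristic is deliberately conservative: it checks metadata only, so the decision costs nothing compared to actually materializing the record batches.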


YAML_FIELDS_TO_CHECK = ["dataset_info", "configs", "viewer", "language"]

USE_LIBVIEWER_FOR_DATASETS = True
kszucs (Member Author) commented:

Here we can either enable libviewer globally or provide a set of dataset names to selectively enable libviewer for them.
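One way to support both modes behind a single flag (a hypothetical sketch; the real config handling may differ):

```python
def libviewer_enabled(dataset, flag):
    """Illustrative helper: `flag` is either a bool (global toggle) or a
    collection of dataset names to enable selectively."""
    if isinstance(flag, bool):
        return flag
    return dataset in flag
```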

parquet_metadata_directory: StrPath,
max_arrow_data_in_memory: int,
max_scan_size: int,
hf_token: Optional[str] = None,
kszucs (Member Author) commented:

hf_token is needed for libviewer but not for the old index, since httpfs already contains it. I'm not unifying them because the session handling for httpfs is a little hard to follow, and it would also complicate testing. If libviewer turns out to be stable we can remove httpfs entirely along with the old indexer.

def _init_dataset_info(self, parquet_metadata_directory: StrPath) -> None:
# get the list of parquet files and features
with StepProfiler(method="rows_index._get_dataset_metadata", step="all"):
response = get_previous_step_or_raise(
kszucs (Member Author) commented:

Query the dataset information from mongo here. Note that we should switch to an asynchronous approach to avoid blocking the event loop. Libviewer supports async queries, so RowsIndex can be async-first in the future.
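As an interim step, the blocking lookup could be pushed off the event loop with an executor (a generic asyncio sketch, not the actual RowsIndex code; `blocking_lookup` is a stand-in for the real mongo query function):

```python
import asyncio

async def fetch_dataset_info(blocking_lookup, dataset):
    """Run a blocking metadata lookup in a worker thread so it does not
    block the event loop; `blocking_lookup` stands in for the real query."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_lookup, dataset)
```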

@kszucs kszucs force-pushed the libviewer2 branch 2 times, most recently from 6a05418 to 77b78a2 Compare November 5, 2025 09:35
@kszucs kszucs requested a review from lhoestq November 5, 2025 09:35
@kszucs kszucs force-pushed the libviewer2 branch 2 times, most recently from bc63468 to cf46d41 Compare November 5, 2025 10:06
chore: add .env.debug configuration

chore: add .env.debug

feat: primitive parquet reader with page pruning

add poetry build for libviewer

add libviewer to rows

refactor: only extract metadata and don't try to calculate offset index

ci: update dockerfiles to include the rust toolchain and libviewer

chore: pin python to 3.12.11 in libviewer and update lockfile

feat: use PageIndexPolicy to optionally read offset index

feat: support querying RowsIndex with page pruning

build: add libviewer as a dependency to libcommon

style: ruff format libcommon changes

chore: use query_with_page_pruning from the rows endpoint

chore: fix mypy errors

style: import Sequence from collections.abc

build: don't use libviewer as an editable dependency

build: try to configure poetry to properly install libviewer

ci: temporarily disable poetry cache

style: fix ruff check errors

build: relock projects depending on libcommon

build: add rust toolchain to more dockerfiles

build: copy the entire libviewer directory in dockerfiles because poetry install is called at the build phase

build: turn libviewer into an optional dependency due to build difficulties

chore: missing api stage from dockerfile

ci: install libviewer extra in the libcommon build

style: fix ruff check error in parquet utils

ci: disable poetry cache

feat: raise TooBigRows exceptions if the scan size would exceed a limit

feat: implement binary truncation for page pruning reader

style: ignore variable shadowing ruff check

ci: install libviewer in the worker image

feat: pass hf_token to the opendal store

chore: remove files_to_index estimation

chore: poetry lock worker service

chore: remove redundant gitignore entries from libviewer

ci: install libviewer in the worker build

style: fix mypy ignore

chore: cleanup the libviewer python code

style: try to please mypy due to missing import

style: make token optional

test: make the mocking compatible with the page pruning reader in test_first_rows
@lhoestq (Member) left a comment:

Looks good to me overall ! 🚀

Any idea why we get this in the e2e CI ?

File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 561, in query_libviewer_index
    batches = self.viewer_index.sync_scan(offset=offset, limit=length, scan_size_limit=self.max_scan_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
libviewer.PyDatasetError: External: Generic LocalFileSystem error: Requested range was invalid

and for the worker CI, you can fix this with this change I believe

- assert len(mock_http_file.call_args_list) == 1
+ assert len(mock_http_file.call_args_list) == 0  #  it uses libviewer, not pyarrow + HTTPFile

Comment on lines +37 to +41
uses: ./.github/workflows/_unit-tests-python.yml
with:
working-directory: libs/libviewer
poetry-args: "--with dev"
secrets: inherit

Check warning (Code scanning / CodeQL): Workflow does not contain permissions (Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix (AI):

To fix the problem, an explicit permissions block should be added to the workflow file. In this case, since no job appears to need write access (all jobs do quality checks and run unit tests, based on their names), it is safest to set contents: read at the workflow level, unless any reusable workflow requires something broader in its own documentation (not shown here).

Steps to fix:

  • Add a permissions: section at the root of .github/workflows/l-libcommon.yml (e.g., after name: and before on:).
  • Use contents: read as the default minimal permissions required.
  • If later you discover a job or called reusable workflow requires more (such as pull-requests: write), you can add those specifically.

Regions to change:

  • Insert the following after line 4 (after name: libs/libcommon):
    permissions:
      contents: read
    

This does not require any new methods, imports, or definitions, as it is a YAML configuration change.


Suggested changeset 1: .github/workflows/l-libcommon.yml

Autofix patch. Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/.github/workflows/l-libcommon.yml b/.github/workflows/l-libcommon.yml
--- a/.github/workflows/l-libcommon.yml
+++ b/.github/workflows/l-libcommon.yml
@@ -2,6 +2,8 @@
 # Copyright 2022 The HuggingFace Authors.
 
 name: libs/libcommon
+permissions:
+  contents: read
 on:
   workflow_dispatch:
   push:
EOF
