feat: primitive parquet reader with page pruning #3244
base: main
Conversation
```diff
-      cache: "poetry"
-      cache-dependency-path: |
-        ${{ inputs.working-directory }}/poetry.lock
+      # cache: "poetry"
```
I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.
```dockerfile
# Build with: docker build --target <service_name> -t <tag> .

ARG PYTHON_VERSION=3.12.11
FROM python:${PYTHON_VERSION}-slim AS viewer
```
Building the Rust-based libviewer as a wheel so that the compiler toolchains are not included in the final Docker images.
```toml
optional = true

[tool.poetry.group.libviewer.dependencies]
libviewer = { path = "../libviewer", develop = true }
```
Originally I added it as a mandatory dependency, but I wasn't able to convince poetry to skip installing it from source in the Docker image and use a prebuilt wheel instead, see https://github.com/huggingface/dataset-viewer/pull/3244/files#r2432505025.
Apparently path dependencies don't work well with compiled extension modules. Ideally we would build wheels for all the internal libs (libviewer, libcommon, libapi), but the dependency versions pinned in the pyproject files are looser than what we have in the poetry lockfiles, and some of the builds/tests are sensitive to those dependencies.
So I chose to define libviewer as an optional dependency: we install it only in the relevant services, using prebuilt wheels in the containers, and with `--with libviewer` during local development.
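One practical consequence of an optional dependency is that shared code has to tolerate libviewer being absent. A minimal sketch of such an import guard, assuming nothing about the actual integration (the flag name is illustrative):

```python
# Optional dependency: installed from prebuilt wheels in the containers,
# or with `poetry install --with libviewer` during local development.
try:
    import libviewer

    LIBVIEWER_AVAILABLE = True
except ImportError:
    libviewer = None  # type: ignore[assignment]
    LIBVIEWER_AVAILABLE = False
```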
lhoestq
left a comment
Looks pretty good! The main point to address is raising TooBigRows when possible, to avoid OOMing the /rows worker.
```python
# pa_table, truncated_columns = rows_index.query_truncated_binary(
#     offset=offset, length=length
# )
pa_table = rows_index.query_with_page_pruning(offset=offset, length=length)
```
We can see later about truncating binary data (sometimes users have a column with very long binary data, and we truncate it when reading to avoid OOMing).
What we will need right away, though, is to raise TooBigRows if the resulting record batches are likely to cause an OOM (if they can use >300MB of RAM). We can use a simple heuristic based on the average row size in the row group to know whether it's safe to run the query.
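A minimal sketch of that heuristic, assuming access to the cached parquet metadata via pyarrow and a byte budget like the `max_scan_size` used elsewhere in this PR (the function name and the local `TooBigRows` definition are illustrative):

```python
import pyarrow.parquet as pq


class TooBigRows(Exception):
    """Stand-in for the TooBigRows error mentioned in this review."""


def raise_if_scan_too_big(
    metadata: pq.FileMetaData, offset: int, length: int, max_scan_size: int
) -> None:
    """Estimate the RAM needed for rows [offset, offset + length) and
    raise TooBigRows if it exceeds max_scan_size.

    Heuristic: each row in a row group costs the group's average row size
    (total_byte_size / num_rows); sum over the groups the query touches.
    """
    estimated = 0
    remaining_offset, remaining = offset, length
    for i in range(metadata.num_row_groups):
        group = metadata.row_group(i)
        if remaining_offset >= group.num_rows:
            # The whole row group lies before the requested offset.
            remaining_offset -= group.num_rows
            continue
        rows_here = min(group.num_rows - remaining_offset, remaining)
        estimated += rows_here * group.total_byte_size // group.num_rows
        remaining_offset = 0
        remaining -= rows_here
        if remaining <= 0:
            break
    if estimated > max_scan_size:
        raise TooBigRows(f"estimated scan of {estimated} bytes exceeds {max_scan_size}")
```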
Force-pushed from 8a13593 to d4d0259
```python
YAML_FIELDS_TO_CHECK = ["dataset_info", "configs", "viewer", "language"]

USE_LIBVIEWER_FOR_DATASETS = True
```
Here we can either enable libviewer globally or provide a set of dataset names to selectively enable libviewer for them.
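For illustration, the per-dataset variant could look something like this (only `USE_LIBVIEWER_FOR_DATASETS` comes from the diff; the helper and the set-typed variant are hypothetical):

```python
from typing import Union

# True/False toggles libviewer globally; a set of dataset names would
# enable it selectively, as described in the comment above.
USE_LIBVIEWER_FOR_DATASETS: Union[bool, set[str]] = True


def should_use_libviewer(dataset: str) -> bool:
    if isinstance(USE_LIBVIEWER_FOR_DATASETS, bool):
        return USE_LIBVIEWER_FOR_DATASETS
    return dataset in USE_LIBVIEWER_FOR_DATASETS
```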
```python
parquet_metadata_directory: StrPath,
max_arrow_data_in_memory: int,
max_scan_size: int,
hf_token: Optional[str] = None,
```
hf_token is needed for libviewer but not for the old index, since httpfs already contains it. I'm not unifying them because the session handling for httpfs is a little hard to follow, and it would also complicate the testing. If libviewer turns out to be stable, we can remove httpfs entirely along with the old indexer.
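A rough sketch of the split described here; the constructor parameters match the diff above, while the libviewer constructor name and its keyword arguments are assumptions (they are not shown on this page):

```python
from typing import Optional

import libviewer  # the Rust extension built in this PR

StrPath = str  # simplified stand-in for libcommon's StrPath alias


class RowsIndex:
    def __init__(
        self,
        parquet_metadata_directory: StrPath,
        max_arrow_data_in_memory: int,
        max_scan_size: int,
        hf_token: Optional[str] = None,
    ) -> None:
        self.max_scan_size = max_scan_size
        # The old pyarrow index reads over httpfs, whose session already
        # carries the token, so hf_token is not threaded through it. Only
        # the libviewer index receives it explicitly (hypothetical
        # constructor name and kwargs).
        self.viewer_index = libviewer.Dataset(
            metadata_dir=str(parquet_metadata_directory),
            hf_token=hf_token,
        )
```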
```python
def _init_dataset_info(self, parquet_metadata_directory: StrPath) -> None:
    # get the list of parquet files and features
    with StepProfiler(method="rows_index._get_dataset_metadata", step="all"):
        response = get_previous_step_or_raise(
```
This queries the dataset information from mongo. Note that we should switch to an asynchronous approach here to avoid blocking the event loop: libviewer supports async queries, so RowsIndex can become async-first in the future.
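Until that refactor, the blocking call could at least be moved off the event loop. A sketch, assuming the `sync_scan` signature quoted later in this thread (everything else is illustrative):

```python
import asyncio


async def scan_async(viewer_index, offset: int, length: int, max_scan_size: int):
    # Run the blocking libviewer scan in a worker thread so the event loop
    # stays responsive; a native async libviewer query could replace this
    # once RowsIndex is async-first.
    return await asyncio.to_thread(
        viewer_index.sync_scan,
        offset=offset,
        limit=length,
        scan_size_limit=max_scan_size,
    )
```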
Force-pushed from 6a05418 to 77b78a2
Force-pushed from bc63468 to cf46d41
Commits:
- …dex_is_partial()`
- chore: add .env.debug configuration
- chore: add .env.debug
- feat: primitive parquet reader with page pruning
- add poetry build for libviewer
- add libviewer to rows
- refactor: only extract metadata and don't try to calculate offset index
- ci: update dockerfiles to include the rust toolchain and libviewer
- chore: pin python to 3.12.11 in libviewer and update lockfile
- feat: use PageIndexPolicy to optionally read offset index
- feat: support querying RowsIndex with page pruning
- build: add libviewer as a dependency to libcommon
- style: ruff format libcommon changes
- chore: use query_with_page_pruning from the rows endpoint
- chore: fix mypy errors
- style: import Sequence from collections.abc
- build: don't use libviewer as an editable dependency
- build: try to configure poetry to properly install libviewer
- ci: temporarily disable poetry cache
- style: fixx ruff check errors
- build: relock projects depending on libcommon
- build: add rust toolchain to more dockerfiles
- build: copy the entire libviewer directory in dockerfiles because poetry install is called at the build phase
- build: turn libviewer an optional dependency due to build difficulties
- chore: missing api stage from dockerfile
- ci: install libviewer extra in the libcommon build
- style: fix ruff check error in parquet utils
- ci: disable poetry cache
- feat: raise TooBigRows exceptions if the scan size would exceed a limit
- feat: implement binary truncation for page pruning reader
- style: ignore variable shadowing ruff check
- ci: install libviewer in the worker image
- feat: pass hf_token to the opendal store
- chore: remove files_to_index estimation
- chore: poetry lock worker service
- chore: remove reduntand gitignore entries from libviewer
- ci: install libviewer in the worker build
- style: fix mypy ignore
- chore: cleanup the libviewer python code
- style: try to please mypy due to missing import
- style: make token optional
- test: make the mocking compatible with the page pruning reader in test_first_rows
Looks good to me overall! 🚀
Any idea why we get this in the e2e CI?
File "/src/libs/libcommon/src/libcommon/parquet_utils.py", line 561, in query_libviewer_index
batches = self.viewer_index.sync_scan(offset=offset, limit=length, scan_size_limit=self.max_scan_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
libviewer.PyDatasetError: External: Generic LocalFileSystem error: Requested range was invalid
And for the worker CI, you can fix it with this change, I believe:

```diff
- assert len(mock_http_file.call_args_list) == 1
+ assert len(mock_http_file.call_args_list) == 0  # it uses libviewer, not pyarrow + HTTPFile
```

```yaml
uses: ./.github/workflows/_unit-tests-python.yml
with:
  working-directory: libs/libviewer
  poetry-args: "--with dev"
secrets: inherit
```
Check warning — Code scanning / CodeQL: Workflow does not contain permissions (Medium)

Copilot Autofix (AI, 3 days ago)
To fix the problem, an explicit permissions block should be added to the workflow file. In this case, since no job appears to need write access (all jobs do quality checks and run unit tests, based on their names), it is safest to set contents: read at the workflow level, unless any reusable workflow requires something broader in its own documentation (not shown here).

Steps to fix:
- Add a permissions: section at the root of .github/workflows/l-libcommon.yml (e.g., after name: and before on:).
- Use contents: read as the default minimal permissions required.
- If later you discover a job or called reusable workflow requires more (such as pull-requests: write), you can add those specifically.

Regions to change:
- Insert the following after line 4 (after name: libs/libcommon):

  permissions:
    contents: read

This does not require any new methods, imports, or definitions, as it is a YAML configuration change.
```diff
@@ -2,6 +2,8 @@
 # Copyright 2022 The HuggingFace Authors.

 name: libs/libcommon
+permissions:
+  contents: read
 on:
   workflow_dispatch:
   push:
```
Prototype implementation of an arrow-rs based page-pruning parquet reader for low-latency limit/offset queries.
It is a standalone library for now; it hasn't been integrated into the viewer yet.
Install

```shell
cd libs/libviewer
pip install maturin
maturin develop -r
```

Index Dataset

This uses huggingface_hub to download and cache the dataset files, then creates a metadata file for each parquet file in the dataset with the offset index included.
Remove --use-cache to directly download the files from the hub.

Execute a limit/offset query
This will query the dataset using the local metadata index files. The scanner only reads the necessary parquet pages to minimize network traffic.
Remove --use-cache to directly query data from the hub.
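For a sense of the query surface, the call that libcommon ends up making looks roughly like this; only `sync_scan` and its keyword arguments are taken from the traceback quoted in the review comments above, and the index construction is elided because its API isn't shown on this page:

```python
# `viewer_index` stands for an already-constructed libviewer dataset index;
# its constructor is not shown in this thread. sync_scan() and its kwargs
# match the traceback quoted in the review above.
batches = viewer_index.sync_scan(
    offset=0,                           # first row of the requested page
    limit=100,                          # number of rows to return
    scan_size_limit=300 * 1024 * 1024,  # bail out if the estimated scan is larger
)
```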
Integration and testing

Before covering it with tests, it would be nice to see the necessary API for integration.
Supersedes #3199