feat: primitive parquet reader with page pruning #3244
Open: kszucs wants to merge 23 commits into main from libviewer2
Changes from 8 commits
23 commits
- 2b070fe refactor(libcommon): remove unused `RowsIndex.partial` and `duckdb_in… (kszucs)
- c48352e chore: restore RowsIndex.partial (kszucs)
- 68970f4 build: use a single compose file with .env file (kszucs)
- 4d59701 test(libviewer): add a generic test case to exercise sync scanning (kszucs)
- dadb0dc ci(libviewer): try to add a github actions job for libviewer (kszucs)
- 647ad08 chore(libviewer): relock poetry (kszucs)
- 0e4b1e8 chore(libviewer): add and install pytest as a dev dependency (kszucs)
- 2615158 ci(libviewer): add style build for libviewer (kszucs)
- 7eb69c3 ci(libviewer): remove style build (kszucs)
- 97bed9a ci(libviewer): don't inherit secrets in the libviewer tests (kszucs)
- c8e0fd3 chore: debug (kszucs)
- 216e1de chore: debug (kszucs)
- 677f826 chore: debug (kszucs)
- efc25e3 chore: debug (kszucs)
- 817f0fe chore: debug (kszucs)
- 7226448 chore: temp disable libviewer (kszucs)
- 5f6bd66 chore: don't pass file size to read_metadata (kszucs)
- 2414f91 chore: check that the metadata file exists (kszucs)
- 7667b24 chore: capture backtrace (kszucs)
- 083d26c chore: capture backtrace (kszucs)
- 92d6f09 chore: force capture backtrace (kszucs)
- 1a21ceb chore: force dev profile (kszucs)
- 694e08b chore: try not to load index (kszucs)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,120 +1,140 @@ | ||
| # Multi-stage Dockerfile for all dataset-viewer services and jobs | ||
| # Build with: docker build --target <service_name> -t <tag> . | ||
| | ||
| ARG PYTHON_VERSION=3.12.11 | ||
| FROM python:${PYTHON_VERSION}-slim AS viewer | ||
| Review comment (Member, Author): Building the Rust-based libviewer as a wheel so that the compiler toolchains are not included in the final Docker images. | ||
| | ||
| # Install Rust and minimal build deps | ||
| RUN apt-get update \ | ||
| && apt-get install -y --no-install-recommends curl build-essential \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
| | ||
| # Install Rust toolchain and maturin | ||
| RUN curl https://sh.rustup.rs -sSf | sh -s -- -y \ | ||
| && . $HOME/.cargo/env \ | ||
| && pip install maturin \ | ||
| && rustc --version \ | ||
| && cargo --version | ||
| # Add cargo bin dir to PATH (so maturin + cargo available globally) | ||
| ENV PATH="/root/.cargo/bin:${PATH}" | ||
| | ||
| # Build libviewer | ||
| COPY libs/libviewer /src/libs/libviewer | ||
| WORKDIR /src/libs/libviewer | ||
| RUN maturin build --release --strip --out /tmp/dist | ||
| | ||
| # Base stage with shared setup | ||
| FROM python:3.12.11-slim AS common | ||
| FROM python:${PYTHON_VERSION}-slim AS common | ||
| | ||
| # System dependencies | ||
| RUN apt-get update \ | ||
| && apt-get install -y unzip wget procps htop ffmpeg libavcodec-extra libsndfile1 \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
| | ||
| # Common environment variables | ||
| ARG POETRY_VERSION=2.1.4 | ||
| ENV PYTHONFAULTHANDLER=1 \ | ||
| PYTHONUNBUFFERED=1 \ | ||
| PYTHONHASHSEED=random \ | ||
| PIP_NO_CACHE_DIR=1 \ | ||
| PIP_DISABLE_PIP_VERSION_CHECK=on \ | ||
| PIP_DEFAULT_TIMEOUT=100 \ | ||
| POETRY_NO_INTERACTION=1 \ | ||
| POETRY_VERSION=2.1.4 \ | ||
| POETRY_VIRTUALENVS_CREATE=false \ | ||
| PATH="$PATH:/root/.local/bin" | ||
| | ||
| # Install pip and poetry | ||
| RUN pip install -U pip && pip install "poetry==$POETRY_VERSION" | ||
| RUN pip install -U pip && pip install "poetry==${POETRY_VERSION}" | ||
| | ||
| # Install libcommon's dependencies but not libcommon itself | ||
| COPY libs/libcommon/poetry.lock \ | ||
| libs/libcommon/pyproject.toml \ | ||
| /src/libs/libcommon/ | ||
| RUN poetry install --no-cache --no-root --no-directory -P /src/libs/libcommon | ||
| | ||
| # Base image for services including libapi's dependencies | ||
| FROM common AS service | ||
| COPY libs/libapi/poetry.lock \ | ||
| libs/libapi/pyproject.toml \ | ||
| /src/libs/libapi/ | ||
| RUN poetry install --no-cache --no-root --no-directory -P /src/libs/libapi | ||
| WORKDIR /src/libs/libcommon | ||
| RUN poetry install --no-cache --no-root | ||
| | ||
| # Below are the actual API services which depend on libapi and libcommon. | ||
| # Since the majority of the dependencies are already installed in the `api` | ||
| # we let poetry to actually install the `libs` and the specific service. | ||
| # Since the majority of the dependencies are already installed in the | ||
| # `common` stage we let poetry to handle the rest. | ||
| | ||
| # API service | ||
| FROM service AS api | ||
| FROM common AS api | ||
| COPY libs /src/libs | ||
| COPY services/api /src/services/api | ||
| RUN poetry install --no-cache -P /src/services/api | ||
| WORKDIR /src/services/api/ | ||
| WORKDIR /src/services/api | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/api/main.py"] | ||
| | ||
| # Admin service | ||
| FROM service AS admin | ||
| FROM common AS admin | ||
| COPY libs /src/libs | ||
| COPY services/admin /src/services/admin | ||
| RUN poetry install --no-cache -P /src/services/admin | ||
| WORKDIR /src/services/admin/ | ||
| WORKDIR /src/services/admin | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/admin/main.py"] | ||
| | ||
| # Rows service | ||
| FROM service AS rows | ||
| FROM common AS rows | ||
| COPY --from=viewer /tmp/dist /tmp/dist | ||
| RUN pip install /tmp/dist/libviewer-*.whl | ||
| COPY libs /src/libs | ||
| COPY services/rows /src/services/rows | ||
| RUN poetry install --no-cache -P /src/services/rows | ||
| WORKDIR /src/services/rows/ | ||
| WORKDIR /src/services/rows | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/rows/main.py"] | ||
| | ||
| # Search service | ||
| FROM service AS search | ||
| FROM common AS search | ||
| COPY libs /src/libs | ||
| COPY services/search /src/services/search | ||
| RUN poetry install --no-cache -P /src/services/search | ||
| WORKDIR /src/services/search/ | ||
| WORKDIR /src/services/search | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/search/main.py"] | ||
| | ||
| # SSE API service | ||
| FROM service AS sse-api | ||
| FROM common AS sse-api | ||
| COPY libs /src/libs | ||
| COPY services/sse-api /src/services/sse-api | ||
| RUN poetry install --no-cache -P /src/services/sse-api | ||
| WORKDIR /src/services/sse-api/ | ||
| WORKDIR /src/services/sse-api | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/sse_api/main.py"] | ||
| | ||
| # Webhook service | ||
| FROM service AS webhook | ||
| FROM common AS webhook | ||
| COPY libs /src/libs | ||
| COPY services/webhook /src/services/webhook | ||
| RUN poetry install --no-cache -P /src/services/webhook | ||
| WORKDIR /src/services/webhook/ | ||
| WORKDIR /src/services/webhook | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/webhook/main.py"] | ||
| | ||
| # Worker service | ||
| FROM common AS worker | ||
| COPY --from=viewer /tmp/dist /tmp/dist | ||
| RUN pip install /tmp/dist/libviewer-*.whl | ||
| COPY libs /src/libs | ||
| COPY services/worker /src/services/worker | ||
| WORKDIR /src/services/worker | ||
| # presidio-analyzer > spacy > thinc doesn't ship aarch64 wheels so need to compile | ||
| RUN if [ "$(uname -m)" = "aarch64" ]; then \ | ||
| apt-get update && apt-get install -y build-essential && \ | ||
| rm -rf /var/lib/apt/lists/*; \ | ||
| fi | ||
| RUN poetry install --no-cache -P /src/services/worker | ||
| RUN poetry install --no-cache | ||
| RUN python -m spacy download en_core_web_lg | ||
| WORKDIR /src/services/worker/ | ||
| ENTRYPOINT ["poetry", "run", "python", "src/worker/main.py"] | ||
| | ||
| # Cache maintenance job | ||
| FROM common AS cache_maintenance | ||
| COPY libs /src/libs | ||
| COPY jobs/cache_maintenance /src/jobs/cache_maintenance | ||
| RUN poetry install --no-cache -P /src/jobs/cache_maintenance | ||
| WORKDIR /src/jobs/cache_maintenance/ | ||
| WORKDIR /src/jobs/cache_maintenance | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/cache_maintenance/main.py"] | ||
| | ||
| # MongoDB migration job | ||
| FROM common AS mongodb_migration | ||
| COPY libs /src/libs | ||
| COPY jobs/mongodb_migration /src/jobs/mongodb_migration | ||
| RUN poetry install --no-cache -P /src/jobs/mongodb_migration | ||
| WORKDIR /src/jobs/mongodb_migration/ | ||
| ENTRYPOINT ["poetry", "run", "python", "src/mongodb_migration/main.py"] | ||
| WORKDIR /src/jobs/mongodb_migration | ||
| RUN poetry install --no-cache | ||
| ENTRYPOINT ["poetry", "run", "python", "src/mongodb_migration/main.py"] | ||
I temporarily turned it off because it wasn't installing the optional libviewer dependency properly.
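The comment does not say what was turned off or how the optional dependency is declared. As a hedged illustration only, one way to confirm whether an optional package ended up in an environment is to probe for it without importing it. The helper name `has_optional_module` is hypothetical, and the import name `libviewer` is an assumption based on the wheel file name in the Dockerfile above.

```python
import importlib.util


def has_optional_module(name: str) -> bool:
    """Return True if a top-level module called `name` is importable here."""
    # find_spec locates the module without executing it, so a broken
    # native extension would not crash this check.
    return importlib.util.find_spec(name) is not None


# "libviewer" as the import name is an assumption based on the wheel file name.
LIBVIEWER_AVAILABLE = has_optional_module("libviewer")

if LIBVIEWER_AVAILABLE:
    print("libviewer is installed")
else:
    print("libviewer is missing; the optional dependency was not installed")
```

A check like this could run as a quick smoke test in CI or inside a built image, so that a silently skipped optional install shows up before any test depends on it.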