Pin pyarrow build to tagged release for s390x reproducibility #2433

@coderabbitai

Problem Description

The current s390x wheel-builder stage in jupyter/datascience/ubi9-python-3.12/Dockerfile.cpu clones the Apache Arrow repository at the tip of its default branch (HEAD), which causes reproducibility and compatibility problems:

  1. Reproducibility Risk: Each build picks up whatever commit is at HEAD at build time, so two builds of the same Dockerfile can produce different wheels.
  2. Version Mismatch: The pylock.toml specifies pyarrow 20.0.0 (with s390x exclusion), but a build from HEAD may produce a different version.
  3. Build Instability: HEAD may include breaking changes or unstable features that cause build failures.

Affected Files:

  • jupyter/datascience/ubi9-python-3.12/Dockerfile.cpu (lines ~122-173)
  • Related: jupyter/datascience/ubi9-python-3.12/pylock.toml (pyarrow version specification)

Current Implementation

git clone --depth 1 https://github.com/apache/arrow.git && \

Proposed Solution

Pin the pyarrow build to match the version specified in pylock.toml:

ARG PYARROW_TAG=apache-arrow-20.0.0
# Build pyarrow optimized for s390x
RUN --mount=type=cache,target=/root/.cache/pip \
    --mount=type=cache,target=/root/.cache/dnf \
    if [ "$TARGETARCH" = "s390x" ]; then \
        # Install build dependencies (shared for pyarrow and onnx)
        dnf install -y cmake make gcc-c++ pybind11-devel wget && \
        dnf clean all && \
        # Build and collect the pyarrow wheel from the pinned tag
        git clone --depth 1 --branch "${PYARROW_TAG}" https://github.com/apache/arrow.git && \
        # ... rest of build process ...
        # Keep the version in this glob in sync with PYARROW_TAG
        cp dist/pyarrow-20.*.whl /tmp/wheels/ && \

Additional Considerations

Consider enabling common codecs (LZ4, Zstd, Snappy) if image size permits, or document the feature limitations when these codecs are disabled.
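
As a rough sketch of what toggling those codecs could look like: Arrow's C++ build exposes the CMake options ARROW_WITH_LZ4, ARROW_WITH_ZSTD and ARROW_WITH_SNAPPY. The helper name `arrow_codec_flags` below is hypothetical, and the Dockerfile's actual cmake invocation is not shown in this issue, so where the flags would be spliced in is an assumption.

```shell
# Hypothetical helper (not part of the Dockerfile): prints the Arrow C++
# CMake codec options with all three codecs switched ON or OFF together.
arrow_codec_flags() {
    state="$1"
    case "$state" in
        ON|OFF) ;;
        *) echo "usage: arrow_codec_flags ON|OFF" >&2; return 1 ;;
    esac
    printf '%s\n' "-DARROW_WITH_LZ4=$state -DARROW_WITH_ZSTD=$state -DARROW_WITH_SNAPPY=$state"
}
```

The output would be passed unquoted to the Arrow C++ configure step, for example `cmake $(arrow_codec_flags ON) ...`. Enabling a codec likely also requires the corresponding library to be installed or built as a bundled dependency, which is what affects image size.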

Acceptance Criteria

  • pyarrow build is pinned to apache-arrow-20.0.0 tag
  • Build process uses consistent Arrow version across builds
  • Wheel filename pattern matches expected version (pyarrow-20.*.whl)
  • s390x builds remain functional
  • Build reproducibility is improved
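
The wheel-filename criterion could be checked mechanically with something like the sketch below. The function name `has_pyarrow_wheel` is hypothetical; the check assumes the standard wheel naming scheme, where the file name starts with `pyarrow-<version>-`.

```shell
# Hypothetical check: succeeds only if DIR contains at least one pyarrow
# wheel whose file name carries exactly VERSION.
has_pyarrow_wheel() {
    dir="$1"
    version="$2"
    for wheel in "$dir"/pyarrow-"$version"-*.whl; do
        # If the glob matched nothing, the literal pattern is returned
        # and the -e test fails.
        [ -e "$wheel" ] && return 0
    done
    return 1
}
```

For example, `has_pyarrow_wheel /tmp/wheels 20.0.0 || exit 1` at the end of the build step would fail the build if the clone produced a wheel for a different version.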

Implementation Notes

  • Add ARG PYARROW_TAG=apache-arrow-20.0.0 in the wheel-builder stage, before the build RUN command (an ARG is only visible in the stage that declares it)
  • Update git clone to use --branch ${PYARROW_TAG}
  • Update the wheel copy pattern to the version-specific name (pyarrow-20.*.whl)
  • Consider adding inline documentation about codec limitations
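
One way to keep the clone tag and the wheel copy pattern from drifting apart is to derive the version from the tag instead of hardcoding it twice. The sketch below assumes tags of the form apache-arrow-X.Y.Z; the function name `pyarrow_version_from_tag` is hypothetical.

```shell
# Hypothetical helper: strips the "apache-arrow-" prefix from a release tag
# and prints the remaining version string. Rejects anything else.
pyarrow_version_from_tag() {
    tag="$1"
    case "$tag" in
        apache-arrow-*) printf '%s\n' "${tag#apache-arrow-}" ;;
        *) echo "unexpected tag: $tag" >&2; return 1 ;;
    esac
}
```

The copy step could then read `cp dist/pyarrow-"$(pyarrow_version_from_tag "$PYARROW_TAG")"-*.whl /tmp/wheels/`, so bumping PYARROW_TAG updates both places at once.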

Context

PR: #2432 - s390x(jupyter/datascience): make image buildable on s390x
Review Comment: #2432 (comment)

This issue addresses build reproducibility concerns identified during the s390x architecture support implementation.

Projects

Status

✅Done
