Enable core Arrow codecs in s390x pyarrow build for datascience runtime #2305

@coderabbitai

Problem Description

The s390x pyarrow build in runtimes/datascience/ubi9-python-3.12/Dockerfile.cpu (lines 99-103) explicitly disables core Arrow codecs:

-DARROW_WITH_LZ4=OFF \
-DARROW_WITH_ZSTD=OFF \
-DARROW_WITH_SNAPPY=OFF \

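With these flags off, pyarrow cannot read Parquet files compressed with LZ4, Zstd, or Snappy, nor Arrow IPC data compressed with LZ4 or Zstd. Snappy is the default compression in pyarrow's Parquet writer, so many real-world datasets are affected, which significantly limits the datascience runtime on the s390x architecture.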

Impact Analysis

  • Data Compatibility: Users cannot read Parquet files compressed with LZ4, Zstd, or Snappy (the most common compression formats)
  • Runtime Failures: Applications attempting to read compressed datasets will fail with codec-related errors
  • User Experience: The s390x datascience runtime is significantly less capable than the same runtime on other architectures
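
One way to confirm the impact on a given image is to ask pyarrow which codecs were compiled in. The following is a minimal sketch, not part of the repository: the helper name `missing_codecs` is hypothetical, and the availability check is passed in as a callable so the helper itself has no pyarrow dependency.

```python
from typing import Callable, Iterable, List

# Codecs named in this issue, using pyarrow's lowercase codec names.
CORE_CODECS = ("lz4", "zstd", "snappy")


def missing_codecs(is_available: Callable[[str], bool],
                   codecs: Iterable[str] = CORE_CODECS) -> List[str]:
    """Return the codecs that the given availability check reports as absent."""
    return [name for name in codecs if not is_available(name)]
```

Inside the runtime image this would be called as `missing_codecs(pyarrow.Codec.is_available)`, assuming pyarrow imports successfully there. With the current flags the expectation is that all three names come back; after the fix the list should be empty.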

Root Cause

The codecs were likely disabled to avoid build complexity or extra dependencies. However, -DARROW_DEPENDENCY_SOURCE=BUNDLED is already set, so the Arrow build should compile the required codec libraries from source as part of the build, without requiring additional system packages.

Solution

Enable the core codecs in the Arrow build configuration:

# Change from:
-DARROW_WITH_LZ4=OFF \
-DARROW_WITH_ZSTD=OFF \
-DARROW_WITH_SNAPPY=OFF \

# To:
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_SNAPPY=ON \
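
The edit is mechanical, so it can also be expressed as a text substitution. This is a sketch using only the Python standard library; the function name `enable_core_codecs` is hypothetical, and it assumes the flags appear in the Dockerfile exactly as quoted above.

```python
import re

# Matches the three flags quoted in this issue when they are set to OFF.
_CODEC_OFF = re.compile(r"(-DARROW_WITH_(?:LZ4|ZSTD|SNAPPY))=OFF\b")


def enable_core_codecs(dockerfile_text: str) -> str:
    """Return the text with the LZ4, Zstd and Snappy flags switched to ON."""
    return _CODEC_OFF.sub(r"\1=ON", dockerfile_text)
```

Other `-DARROW_WITH_*` flags are left untouched, and running the function on already-fixed text is a no-op.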

Acceptance Criteria

  • Core Arrow codecs (LZ4, Zstd, Snappy) are enabled in s390x pyarrow build
  • s390x datascience runtime can successfully read Parquet files compressed with these formats
  • Build time impact is acceptable (should be minimal with BUNDLED dependencies)
  • No regression in build success rate for s390x architecture
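
The second criterion can be checked with a Parquet round trip through each codec. This is a minimal sketch, assuming pyarrow is installed in the image under test; the function name `roundtrip_ok` and the sample data are illustrative only.

```python
import io

# Codecs named in this issue, using the names pyarrow accepts for Parquet.
CORE_CODECS = ("lz4", "zstd", "snappy")


def roundtrip_ok(codec: str) -> bool:
    """Write a small table as Parquet with `codec`, read it back, and compare."""
    # Imported lazily so this module still loads where pyarrow is absent.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "label": ["a", "b", "c"]})
    buffer = io.BytesIO()
    pq.write_table(table, buffer, compression=codec)
    buffer.seek(0)
    return pq.read_table(buffer).equals(table)
```

On a build that lacks a codec, the write is expected to raise an Arrow error rather than return False, so a smoke test would treat any exception from `roundtrip_ok` as a failure for that codec.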

Files Affected

  • runtimes/datascience/ubi9-python-3.12/Dockerfile.cpu

Context

Identified during PR #1513 review: #1513 (comment)

The current implementation prioritizes build simplicity over runtime functionality, but enabling these codecs should not introduce significant complexity given the bundled dependency strategy.

Status

📋 Backlog
