Add support for reading/writing parquet files with pyarrow#252

Merged
huddlej merged 1 commit into master from add-pyarrow
Jul 3, 2025
huddlej commented May 22, 2025

Description of proposed changes

We need support for reading and writing parquet files to prepare submissions to the SARS-CoV-2 variant hub [1]. pyarrow is one of the two parquet engines supported by pandas [2], along with fastparquet. pyarrow provides a more comprehensive set of tools for the Arrow spec [3], while fastparquet is designed to be a minimal library for the parquet format. I've opted for the larger pyarrow library here, since it will eventually be a required dependency for pandas [4].

[1] nextstrain/forecasts-ncov#132
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
[3] https://arrow.apache.org/docs/cpp/user_guide.html
[4] https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html
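
As a rough illustration of the engine choice described above, the following sketch checks which of the two parquet engines can be imported in the current environment. The helper name `available_parquet_engines` is hypothetical (it is not part of pandas or this PR), and the check uses only the standard library.

```python
from importlib.util import find_spec

# The two parquet engines that pandas documents support for.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines():
    """Return the parquet engines importable here (hypothetical helper)."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in PARQUET_ENGINES if find_spec(name) is not None]


if __name__ == "__main__":
    print(available_parquet_engines())
```

With the default `engine="auto"`, pandas tries pyarrow first and falls back to fastparquet, so an image that installs only pyarrow should not need an explicit `engine` argument.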

Related issue(s)

nextstrain/forecasts-ncov#132

Checklist

  • Checks pass

@huddlej huddlej changed the title Add support for reading/writing parquet files Add support for reading/writing parquet files with pyarrow May 23, 2025
huddlej commented May 23, 2025

Closed in favor of #253

@huddlej huddlej closed this May 23, 2025
@huddlej huddlej deleted the add-pyarrow branch May 23, 2025 18:08
@huddlej huddlej restored the add-pyarrow branch July 2, 2025 22:37
@huddlej huddlej reopened this Jul 2, 2025
We need support for reading/writing parquet files to prepare submissions
to the SARS-CoV-2 variant hub [1]. The pyarrow library is one of two
supported by pandas [2] along with fastparquet. The pyarrow library provides
a more comprehensive set of tools for the Arrow spec [3], while
fastparquet is designed to provide a minimal library for the parquet
format. We need to switch to the larger pyarrow library here, because it
supports the parquet DATE data type that we need for our SARS-CoV-2
nowcast submissions.

[1] nextstrain/forecasts-ncov#132
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-parquet
[3] https://arrow.apache.org/docs/cpp/user_guide.html
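
To make the DATE requirement above concrete, here is a minimal sketch that writes a column of `datetime.date` values with the pyarrow engine and reads the stored schema back. It assumes pandas and pyarrow are installed. The function name, column names, and values are hypothetical placeholders, not the hub's actual submission schema.

```python
import datetime
import os
import tempfile


def roundtrip_date_column(path):
    """Write a date column with the pyarrow engine and return its Arrow type name.

    Assumes pandas and pyarrow are installed; they are imported lazily
    because both are optional in a plain Python environment.
    """
    import pandas as pd
    import pyarrow.parquet as pq

    # Hypothetical columns; a real submission would follow the hub's schema.
    df = pd.DataFrame({
        "target_date": [datetime.date(2025, 7, 3), datetime.date(2025, 7, 4)],
        "value": [0.25, 0.75],
    })
    df.to_parquet(path, engine="pyarrow", index=False)

    # Read only the schema back to inspect how the column was stored.
    return str(pq.read_schema(path).field("target_date").type)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(roundtrip_date_column(os.path.join(tmp, "example.parquet")))
```

Under these assumptions the column should come back as an Arrow `date32` type, which pyarrow stores using the parquet DATE logical type rather than as a timestamp or string.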

@victorlin victorlin left a comment

Looks like it installed successfully in the build logs:

#36 [linux/amd64 builder-target-platform 12/21] RUN pip3 install pyarrow==20.0.0
#36 0.621 Collecting pyarrow==20.0.0
#36 0.663   Downloading pyarrow-20.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (42.3 MB)
#36 0.950      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.3/42.3 MB 86.2 MB/s eta 0:00:00
#36 1.487 Installing collected packages: pyarrow
#36 2.581 Successfully installed pyarrow-20.0.0

#35 [linux/arm64 builder-target-platform 12/21] RUN pip3 install pyarrow==20.0.0
#35 27.49 Successfully installed pyarrow-20.0.0

@huddlej huddlej merged commit 9270fb3 into master Jul 3, 2025
61 checks passed
@huddlej huddlej deleted the add-pyarrow branch July 3, 2025 00:24
Comment on lines 262 to +266
  # Install openpyxl for pandas in GenoFLU
  RUN pip3 install openpyxl==3.1.0

- # Install fastparquet for pandas to support parquet files.
- RUN pip3 install fastparquet==2024.11.0
+ # Install pyarrow for pandas to support parquet files.
+ RUN pip3 install pyarrow==20.0.0

I was thinking that, given the way this comment is worded, it might make more sense to pip install "pandas[parquet]", which lets pandas resolve a compatible version of pyarrow. But then I realized that pandas is not directly installed in the Dockerfile – it's installed as a dependency of TreeTime and Augur.

If/when Augur declares a dependency on pandas[parquet], we can remove the separate command that installs pyarrow.

Similarly, openpyxl is included in pip install "pandas[excel]", but there's an image size argument to be made for installing openpyxl separately since pandas[excel] includes other unneeded dependencies.
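
One way to see what an extra such as `pandas[parquet]` or `pandas[excel]` would pull in is to read the requirements that the installed pandas declares in its package metadata. The sketch below does this with the standard library's `importlib.metadata`. Both function names are hypothetical, and the marker matching is a simple substring check, not a full parser for environment markers.

```python
from importlib.metadata import PackageNotFoundError, requires


def requirements_for_extra(requirement_strings, extra):
    """Return the requirement strings whose marker mentions the given extra."""
    wanted = f'extra == "{extra}"'
    # Normalize quote style, since markers may use single or double quotes.
    return [r for r in requirement_strings if wanted in r.replace("'", '"')]


def pandas_extra_requirements(extra):
    """List what the installed pandas declares for an extra; empty if pandas is absent."""
    try:
        # requires() returns None when a package declares no requirements.
        declared = requires("pandas") or []
    except PackageNotFoundError:
        return []
    return requirements_for_extra(declared, extra)


if __name__ == "__main__":
    for extra in ("parquet", "excel"):
        print(extra, pandas_extra_requirements(extra))
```

Comparing the two lists inside the image would show how many additional packages each extra brings in, which bears on the image size argument above.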

2 participants