Merged
Changes from 1 commit
18 changes: 13 additions & 5 deletions python/datafusion/dataframe.py
@@ -620,16 +620,24 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
     def write_parquet(
         self,
         path: str | pathlib.Path,
-        compression: str = "uncompressed",
+        compression: str = "ZSTD",
         compression_level: int | None = None,
     ) -> None:
         """Execute the :py:class:`DataFrame` and write the results to a Parquet file.

         Args:
-            path: Path of the Parquet file to write.
-            compression: Compression type to use.
-            compression_level: Compression level to use.
-        """
+            path (str | pathlib.Path): The file path to write the Parquet file.
+            compression (str): The compression algorithm to use. Default is "ZSTD".
+            compression_level (int | None): The compression level to use. For ZSTD, the
Member

We should document that the compression level is different per algorithm. It's only zstd that has a 1-22 range IIRC.

Contributor Author

Do you mean something like this?

compression_level (int | None): The compression level to use. For ZSTD, the
            recommended range is 1 to 22, with the default being 3. Higher levels
            provide better compression but slower speed.
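
The per-algorithm difference raised above could also be captured in a small lookup, so the docstring and any Python-side check share one source. This is only a sketch: `LEVEL_RANGES` and `check_level` are hypothetical names, only the ZSTD range of 1 to 22 comes from this thread, and the gzip and brotli bounds are assumptions that should be checked against the Rust parquet crate.

```python
from __future__ import annotations

# Hypothetical table of valid compression-level ranges per codec.
# Only the ZSTD range (1 to 22) is stated in this thread; the gzip and
# brotli bounds are assumptions, not confirmed by the PR.
LEVEL_RANGES: dict[str, tuple[int, int]] = {
    "zstd": (1, 22),
    "gzip": (0, 10),    # assumption
    "brotli": (0, 11),  # assumption
}


def check_level(compression: str, level: int | None) -> int | None:
    """Return ``level`` unchanged if it is valid for ``compression``.

    Codecs without an entry in LEVEL_RANGES, and a level of None, are
    passed through untouched so the backend can decide what to do.
    """
    codec = compression.lower()
    if level is None or codec not in LEVEL_RANGES:
        return level
    low, high = LEVEL_RANGES[codec]
    if not low <= level <= high:
        raise ValueError(
            f"Compression level for {codec} must be between {low} and {high}"
        )
    return level
```

With this table, `check_level("zstd", 23)` raises `ValueError`, while `check_level("snappy", 5)` passes the value through for the backend to validate.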

+                recommended range is 1 to 22, with the default being 3. Higher levels
+                provide better compression but slower speed.
+        """
+        # default compression level to 3 for ZSTD
+        if compression == "ZSTD":
Contributor

Contributor Author @kosiew, Jan 8, 2025

Thanks @ion-elgreco,

I added the Compression Enum but omitted check_valid_levels, because level validation is already implemented in the Rust DataFrame, e.g.:

"zstd" => Compression::ZSTD(
    ZstdLevel::try_new(verify_compression_level(compression_level)? as i32)
        .map_err(|e| PyValueError::new_err(format!("{e}")))?,
),

Compression levels are tested in:

@pytest.mark.parametrize(
    "compression, compression_level",
    [("gzip", 12), ("brotli", 15), ("zstd", 23), ("wrong", 12)],
)
def test_write_compressed_parquet_wrong_compression_level(
    df, tmp_path, compression, compression_level
):
    path = tmp_path
    with pytest.raises(ValueError):
        df.write_parquet(
            str(path),
            compression=compression,
            compression_level=compression_level,
        )

+            if compression_level is None:
+                compression_level = 3
Member

3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.

Member

It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files

Contributor Author @kosiew, Dec 27, 2024

> 3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.

I used the default compression level in the manual from Facebook (author of zstd) - https://facebook.github.io/zstd/zstd_manual.html

I could not find a default in DuckDB's documentation.

Contributor Author

Hi @kylebarron,

Shall we adopt delta-rs' default, and use 4 as the default ZSTD compression level?

Member

Sure, that sounds good to me.

Contributor Author @kosiew, Jan 7, 2025

Thanks.
I have amended the default to 4.
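
For readers following the thread, the agreed behaviour (ZSTD defaults to level 4 when no level is given) could be sketched roughly as below. This is not the merged code: the thread only says that a Compression Enum was added and that the default became 4, so the members shown and the names `from_str`, `default_level`, and `resolve_level` are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class Compression(Enum):
    """Hypothetical subset of supported codecs, for illustration only."""

    UNCOMPRESSED = "uncompressed"
    ZSTD = "zstd"

    @classmethod
    def from_str(cls, value: str) -> Compression:
        # Accept any casing ("ZSTD", "zstd") and reject unknown names.
        try:
            return cls(value.lower())
        except ValueError as err:
            raise ValueError(f"Unsupported compression: {value!r}") from err

    def default_level(self) -> int | None:
        # 4 is the ZSTD default agreed in this thread (following delta-rs);
        # codecs without a level concept return None.
        return 4 if self is Compression.ZSTD else None


def resolve_level(compression: str, level: int | None) -> int | None:
    """Fill in the codec default when the caller passes no level."""
    codec = Compression.from_str(compression)
    return codec.default_level() if level is None else level
```

Range checking is deliberately left out of this sketch, since the thread notes that the Rust side already validates levels.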

+            elif not (1 <= compression_level <= 22):
+                raise ValueError("Compression level for ZSTD must be between 1 and 22")
         self.df.write_parquet(str(path), compression, compression_level)

     def write_json(self, path: str | pathlib.Path) -> None: