Default to ZSTD compression when writing Parquet #981
    @@ -620,16 +620,24 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
         def write_parquet(
             self,
             path: str | pathlib.Path,
    -        compression: str = "uncompressed",
    +        compression: str = "ZSTD",
             compression_level: int | None = None,
         ) -> None:
             """Execute the :py:class:`DataFrame` and write the results to a Parquet file.

             Args:
    -            path: Path of the Parquet file to write.
    -            compression: Compression type to use.
    -            compression_level: Compression level to use.
    -        """
    +            path (str | pathlib.Path): The file path to write the Parquet file.
    +            compression (str): The compression algorithm to use. Default is "ZSTD".
    +            compression_level (int | None): The compression level to use. For ZSTD, the
    +                recommended range is 1 to 22, with the default being 3. Higher levels
    +                provide better compression but slower speed.
    +        """
    +        # default compression level to 3 for ZSTD
    +        if compression == "ZSTD":
    "zstd" => Compression::ZSTD(
        ZstdLevel::try_new(verify_compression_level(compression_level)? as i32)
            .map_err(|e| PyValueError::new_err(format!("{e}")))?,
    ),
Compression levels are tested in:
datafusion-python/python/tests/test_dataframe.py
Lines 1093 to 1106 in 63b13da
    @pytest.mark.parametrize(
        "compression, compression_level",
        [("gzip", 12), ("brotli", 15), ("zstd", 23), ("wrong", 12)],
    )
    def test_write_compressed_parquet_wrong_compression_level(
        df, tmp_path, compression, compression_level
    ):
        path = tmp_path
        with pytest.raises(ValueError):
            df.write_parquet(
                str(path),
                compression=compression,
                compression_level=compression_level,
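To make the expectation behind that test concrete, here is a minimal pure-Python sketch of a level check that would reject each of the four parametrized cases. It is not the project's implementation (the real check lives in the Rust binding). The helper `check_compression_level` and the table `LEVEL_RANGES` are hypothetical names, and the gzip and brotli ranges are assumptions chosen to be consistent with the rejected values. Only the ZSTD range of 1 to 22 comes from the docstring in this PR.

```python
# Inclusive level ranges per codec. The ZSTD range (1 to 22) is quoted in the
# new docstring; the gzip and brotli ranges are assumptions that are merely
# consistent with the values the test expects to be rejected.
LEVEL_RANGES = {
    "gzip": (0, 10),
    "brotli": (0, 11),
    "zstd": (1, 22),
}


def check_compression_level(compression, compression_level):
    """Return the level unchanged if valid for the codec, else raise ValueError."""
    codec = compression.lower()
    if codec not in LEVEL_RANGES:
        raise ValueError(f"unsupported compression type: {compression!r}")
    low, high = LEVEL_RANGES[codec]
    if not low <= compression_level <= high:
        raise ValueError(
            f"{codec} compression level must be between {low} and {high}, "
            f"got {compression_level}"
        )
    return compression_level


if __name__ == "__main__":
    # The same four cases as the parametrized test; each should be rejected.
    for codec, level in [("gzip", 12), ("brotli", 15), ("zstd", 23), ("wrong", 12)]:
        try:
            check_compression_level(codec, level)
        except ValueError as exc:
            print(f"{codec}/{level}: {exc}")
```

Raising `ValueError` for both an unknown codec and an out-of-range level matches what the test asserts, since it treats `("wrong", 12)` the same way as the out-of-range cases.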
        
          
              
3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
It might be nice to dig into what DuckDB's defaults are: https://duckdb.org/docs/data/parquet/overview.html#writing-to-parquet-files
> 3 seems like an awfully low compression default. We should evaluate what other libraries use as the default compression setting.
I used the default compression level given in the zstd manual from Facebook (the author of zstd): https://facebook.github.io/zstd/zstd_manual.html
I could not find a default in DuckDB's documentation.
Hi @kylebarron,
Shall we adopt delta-rs' default and use 4 as the default ZSTD compression level?
Sure, that sounds good to me.
Thanks.
I have amended the default to 4.
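For readers following the thread, the agreed behaviour can be sketched roughly as below. This is an illustration, not the code merged in the PR. `DEFAULT_ZSTD_LEVEL` and `resolve_compression_level` are hypothetical names, and the case-insensitive comparison is an assumption (the diff compares against `"ZSTD"` while the Rust binding matches `"zstd"`).

```python
from __future__ import annotations

# Hypothetical constant: the thread settles on 4, following delta-rs.
DEFAULT_ZSTD_LEVEL = 4


def resolve_compression_level(
    compression: str, compression_level: int | None
) -> int | None:
    """Fill in the ZSTD default when the caller did not pick a level.

    Any explicit level, and any other codec, is passed through untouched.
    """
    # Assumption: treat "ZSTD" and "zstd" as the same codec name.
    if compression.lower() == "zstd" and compression_level is None:
        return DEFAULT_ZSTD_LEVEL
    return compression_level


if __name__ == "__main__":
    print(resolve_compression_level("ZSTD", None))  # default applied
    print(resolve_compression_level("zstd", 10))    # explicit level kept
    print(resolve_compression_level("gzip", None))  # other codecs untouched
```

Keeping the default in one named constant means a later change, such as the move from 3 to 4 discussed here, touches a single line.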