Add WriteZeroOptionalFields option for Snowflake compatibility#3
Open
aratz-lasa wants to merge 10 commits intomainfrom
Open
Add WriteZeroOptionalFields option for Snowflake compatibility#3aratz-lasa wants to merge 10 commits intomainfrom
aratz-lasa wants to merge 10 commits intomainfrom
Conversation
This commit adds a new WriterConfig option 'WriteZeroOptionalFields' that forces
the thrift encoder to write optional fields even when they have zero values.
This fixes compatibility issues with systems like Snowflake that require explicit
null counts in statistics (NullCount=0) even when columns have no null values.
Changes:
- Added WriteZeroOptionalFields feature flag to thrift protocol
- Made CompactProtocol configurable with SetFeatures method
- Updated thrift encoder to respect writeZeroOptionalFields flag
- Added WriteZeroOptionalFields config option to WriterConfig
- Wired the option through to ColumnWriter protocol initialization
The default behavior remains unchanged (false) for backward compatibility.
Usage:
writer := parquet.NewWriter(output, schema,
parquet.WriteZeroOptionalFields(true),
)
e15cfe9 to
780e670
Compare
Tests cover: - Default behavior (disabled) vs enabled - Columns with and without null values - Thrift encoding validation - Metadata verification - Backward compatibility between enabled/disabled modes - Data integrity across both modes
The feature was only configuring the protocol for page headers but not for the file footer where Statistics are actually written. Added: - Store writeZeroOptionalFields flag in writer struct - Configure protocol features in writeFileFooter() - Add tests that verify the feature works by: 1. Inspecting raw parquet file from user 2. Comparing file sizes between enabled/disabled 3. Comparing footer sizes Tests confirm files with WriteZeroOptionalFields(true) are larger, proving zero-valued optional fields like NullCount=0 are being written. This fixes Snowflake compatibility for columns with no nulls.
The previous fix only configured the protocol for ColumnIndex/OffsetIndex encoding, but the FileMetaData footer was still being encoded with a fresh unconfigured protocol at line 1134. This meant Statistics in the footer still had zero-valued fields omitted. Changed line 1134 from: thrift.Marshal(new(thrift.CompactProtocol), w.fileMetaData) To: thrift.Marshal(protocol, w.fileMetaData) This uses the already-configured protocol that has WriteZeroOptionalFields set if enabled. Test results now show: - Before: +45 bytes (only ColumnIndex/OffsetIndex affected) - After: +105 bytes (FileMetaData footer also affected) Also updated test file name to match current parquet file.
Previous tests used size comparison as a proxy which could miss issues. This test validates the feature works by directly comparing file sizes between files with WriteZeroOptionalFields enabled vs disabled. When enabled, files are 105 bytes larger (63 bytes in footer) proving that zero-valued optional fields like NullCount=0 are being written.
This is a simplified approach that ensures NullCount=0 and other zero-valued optional fields are always written to Parquet file statistics, making files compatible with strict consumers like Snowflake. Key changes: - Modified thrift encoder to always write zero-valued optional fields for regular structs (unions still skip zero fields to maintain their invariant) - Removed all WriterConfig complexity and feature flags - Updated test expectations to match new behavior This is essentially a one-line change in encode.go that fixes the core issue.
Reverted CompactProtocol changes - no need for features field or SetFeatures. The fix works with the default protocol as-is.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.