Fix nested struct and LIST encoding support in strict_schema: false mode
#672
Problem
The strict_schema: false mode was unusable for parquet files with nested structures or non-standard LIST encoding due to:
- Nested fields such as cloud.account.uid were converted to flat strings instead of preserved as nested map structures.
- Non-standard array wrappers (.array) were not properly unwrapped.
- Standard list.element wrapper patterns resulted in nested objects instead of arrays.
- [...] list.element)
This caused type conversion errors, schema mismatch failures, and 100% row failure rates when processing files with these characteristics.
Solution
Implemented comprehensive nested struct and array handling in strict_schema: false mode:
1. Nested Struct Reconstruction
Added path-based column mapping to reconstruct the full nested map hierarchy from flat parquet row values:
- getColumnPath(): Maps column indices to full dot-notation paths
- findColumnPath(): Recursive helper to traverse the schema tree
- setNestedValue(): Builds the nested map hierarchy from paths
- For example, cloud.account.uid → {"cloud": {"account": {"uid": "..."}}}
2. Iterative Array Unwrapping
Loop-based unwrapping to handle multi-level wrappers:
- parent.field.array (1 level)
- parent.field.list.element (2 levels)
- Array fields are detected from the schema with buildArrayFieldMap() for reliable identification
3. Array Type Preservation
- isArrayField tracking throughout the pipeline
- Single-element arrays stay arrays: ["cloud"] not "cloud"
- [...] null
Changes
Modified: internal/impl/parquet/common.go
- buildArrayFieldMap(): New function to detect array fields from schema structure
- getColumnPath(): New function to map column indices to full dot-notation paths
- findColumnPath(): New recursive helper to traverse the schema tree
- countLeafColumns(): New helper to count leaf columns in nested structures
- buildNestedStructure(): Complete rewrite to reconstruct nested maps and unwrap arrays
- setNestedValue(): New function to build the nested map hierarchy from paths
- convertParquetValues(): Added isArrayField parameter to preserve array types
- readLenient(): Updated to call buildArrayFieldMap() and pass the result to buildNestedStructure()
Modified:
internal/impl/parquet/processor_decode.go
- Updated for the buildNestedStructure() signature with array field detection
Added:
internal/impl/parquet/strict_schema_test.go
- TestNestedStructSupport(): Tests nested struct reconstruction at various depths
- TestStandardParquetGoListEncoding(): Tests standard .list.element unwrapping
- TestAWSSecurityLakeArrayEncoding(): Tests non-standard .array unwrapping
Testing
All tests pass.
Backward Compatibility
✅ Fully backward compatible
- strict_schema: true behavior unchanged (default mode)
- strict_schema: false now works correctly (previously broken for nested structs and non-standard LIST encoding)
Performance
Examples
Before (broken)
{
  "cloud.account.uid": "353785743975",
  "metadata": {
    "profiles": {"list": {"element": ["cloud", "security_control"]}}
  }
}
After (fixed)
{
  "cloud": {
    "account": {
      "uid": "353785743975"
    }
  },
  "metadata": {
    "profiles": ["cloud", "security_control"]
  }
}
Use Cases
This fix enables strict_schema: false to correctly process:
- Parquet files with non-standard array encoding (.array)
- Parquet files with standard .list.element encoding
Related Issues
Fixes #671
Resolves issues with: