@triddell
Contributor

Problem

The strict_schema: false mode was unusable for parquet files with nested structures or non-standard LIST encoding due to:

  1. Nested structs being flattened - Fields like cloud.account.uid were emitted as flat dot-notation keys instead of being preserved as nested map structures
  2. Non-standard LIST encoding not unwrapped - Files using non-standard wrappers (e.g., .array) were not properly unwrapped
  3. Standard parquet-go LIST encoding not unwrapped - Files with .list.element wrapper patterns resulted in nested objects instead of arrays
  4. Single-element arrays converted to scalars - Arrays with one element were incorrectly converted to scalar values
  5. Only single-level unwrapping - The code could only unwrap one level of wrapper, but standard parquet-go uses two levels (.list.element)

This caused type conversion errors, schema mismatch failures, and 100% row failure rates when processing files with these characteristics.

Solution

Implemented comprehensive nested struct and array handling in strict_schema: false mode:

1. Nested Struct Reconstruction

Added path-based column mapping to reconstruct full nested map hierarchy from flat parquet row values:

  • getColumnPath() - Maps column indices to full dot-notation paths
  • findColumnPath() - Recursive helper to traverse schema tree
  • setNestedValue() - Builds nested map hierarchy from paths
  • Properly handles arbitrary nesting depth (e.g., cloud.account.uid → {"cloud": {"account": {"uid": "..."}}})

2. Iterative Array Unwrapping

Loop-based unwrapping to handle multi-level wrappers:

  • Supports non-standard format: parent.field.array (1 level)
  • Supports standard parquet-go format: parent.field.list.element (2 levels)
  • Schema-based detection via buildArrayFieldMap() for reliable identification
  • Continues unwrapping until no more wrapper keywords found

3. Array Type Preservation

  • Added isArrayField tracking throughout the pipeline
  • Single-element arrays stay as arrays: ["cloud"] not "cloud"
  • Empty arrays properly converted to null
  • Consistent type handling regardless of array size

Changes

Modified: internal/impl/parquet/common.go

  • buildArrayFieldMap() - New function to detect array fields from schema structure
  • getColumnPath() - New function to map column indices to full dot-notation paths
  • findColumnPath() - New recursive helper to traverse schema tree
  • countLeafColumns() - New helper to count leaf columns in nested structures
  • buildNestedStructure() - Complete rewrite to reconstruct nested maps and unwrap arrays
  • setNestedValue() - New function to build nested map hierarchy from paths
  • convertParquetValues() - Added isArrayField parameter to preserve array types
  • readLenient() - Updated to call buildArrayFieldMap() and pass to buildNestedStructure()

Modified: internal/impl/parquet/processor_decode.go

  • Updated to use new buildNestedStructure() signature with array field detection

Added: internal/impl/parquet/strict_schema_test.go

  • TestNestedStructSupport() - Tests nested struct reconstruction at various depths
    • Simple 2-level nested struct
    • 3-level nested struct with multiple fields
    • Complex multi-level nested structs
    • Nested struct with arrays
  • TestStandardParquetGoListEncoding() - Tests standard .list.element unwrapping
    • Multiple element arrays
    • Single element arrays (preserved as arrays)
    • Empty arrays (converted to null)
    • Nested structs with standard list encoding
  • TestAWSSecurityLakeArrayEncoding() - Tests non-standard .array unwrapping
    • Non-standard array wrapper patterns

Testing

All tests pass:

=== RUN   TestNestedStructSupport
=== RUN   TestNestedStructSupport/simple_2-level_nested_struct
=== RUN   TestNestedStructSupport/3-level_nested_struct_with_multiple_fields
=== RUN   TestNestedStructSupport/complex_multi-level_nested_structs
=== RUN   TestNestedStructSupport/nested_struct_with_arrays
--- PASS: TestNestedStructSupport (0.00s)

=== RUN   TestStandardParquetGoListEncoding
=== RUN   TestStandardParquetGoListEncoding/standard_list_encoding_with_multiple_elements
=== RUN   TestStandardParquetGoListEncoding/single_element_array_preserved
=== RUN   TestStandardParquetGoListEncoding/empty_array_becomes_null
=== RUN   TestStandardParquetGoListEncoding/nested_struct_with_standard_list_encoding
--- PASS: TestStandardParquetGoListEncoding (0.00s)

=== RUN   TestAWSSecurityLakeArrayEncoding
=== RUN   TestAWSSecurityLakeArrayEncoding/non-standard_array_wrapper
--- PASS: TestAWSSecurityLakeArrayEncoding (0.00s)

PASS
ok      github.com/warpstreamlabs/bento/internal/impl/parquet

Backward Compatibility

Fully backward compatible

  • strict_schema: true behavior unchanged (default mode)
  • strict_schema: false now works correctly (previously broken for nested structs and non-standard LIST encoding)
  • Standard parquet files produce identical output with both modes
  • No breaking changes to existing functionality

Performance

  • No performance regression
  • Schema analysis done once per file (not per row)
  • Nested struct reconstruction is O(n) where n = number of columns
  • Array field detection is O(f) where f = number of fields in schema

Examples

Before (broken)

{
  "cloud.account.uid": "353785743975",
  "metadata": {
    "profiles": {"list": {"element": ["cloud", "security_control"]}}
  }
}

After (fixed)

{
  "cloud": {
    "account": {
      "uid": "353785743975"
    }
  },
  "metadata": {
    "profiles": ["cloud", "security_control"]
  }
}

Use Cases

This fix enables strict_schema: false to correctly process:

  • Parquet files with deeply nested OCSF structures
  • Files using non-standard LIST wrappers (e.g., .array)
  • Standard parquet-go files with .list.element encoding
  • Mixed array sizes requiring consistent type handling
  • Downstream processing with Bloblang that accesses nested fields

Related Issues

Fixes #671

Resolves issues with:

  • Nested struct access in Bloblang processors
  • Array field type mismatches in parquet output
  • OCSF schema compatibility
  • Type conversion errors when processing files with non-standard LIST encoding

Fixes strict_schema: false to properly handle nested structs and non-standard
LIST encoding formats.

- Preserves nested struct hierarchy instead of flattening to dot-notation
- Implements iterative unwrapping for multi-level LIST wrappers (e.g., .list.element, .array)
- Maintains array types for single-element arrays
- Adds unit tests for nested structs and various LIST encoding patterns

Resolves issues with parquet files using non-standard LIST wrappers and deeply nested OCSF structures.
@triddell
Contributor Author

CI Fixes

Fixed formatting issue in strict_schema_test.go - all tests and lint checks now pass.

Development

Successfully merging this pull request may close these issues.

Bug Report: strict_schema: false doesn't handle nested structs or non-standard LIST encoding
