Add aws_s3_parquet_stream output for memory-efficient Parquet streaming to S3 #661
Conversation
Adds a new output plugin that writes Parquet files to S3 using multipart uploads with incremental row group streaming, reducing memory usage by 88% compared to the batch approach. Key features:
- Streams row groups directly to S3 (no full file buffering)
- partition_by parameter for correct partition routing
- schema_file parameter for external schema definitions
- Supports all Parquet compression types (snappy, zstd, gzip, etc.)
- Added S3API interface to enable mock-based testing
- Updated StreamingParquetWriter to use S3API instead of concrete *s3.Client
- Added 8 new unit tests covering the S3 multipart upload lifecycle:
  - Initialization and double-init protection
  - Write lifecycle (before init, after close)
  - Part upload tracking
  - CompleteMultipartUpload on Close
  - Multiple parts with correct numbering
  - AbortMultipartUpload on errors
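The interface itself isn't shown here; a plausible minimal shape, assuming it mirrors the AWS SDK for Go v2 multipart-upload method signatures, would be:

```go
// Illustrative sketch only; package name and interface layout may differ from the PR.
package parquetstream

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// S3API captures just the multipart-upload calls the streaming writer uses,
// so unit tests can substitute a mock for a concrete *s3.Client.
type S3API interface {
	CreateMultipartUpload(ctx context.Context, in *s3.CreateMultipartUploadInput, optFns ...func(*s3.Options)) (*s3.CreateMultipartUploadOutput, error)
	UploadPart(ctx context.Context, in *s3.UploadPartInput, optFns ...func(*s3.Options)) (*s3.UploadPartOutput, error)
	CompleteMultipartUpload(ctx context.Context, in *s3.CompleteMultipartUploadInput, optFns ...func(*s3.Options)) (*s3.CompleteMultipartUploadOutput, error)
	AbortMultipartUpload(ctx context.Context, in *s3.AbortMultipartUploadInput, optFns ...func(*s3.Options)) (*s3.AbortMultipartUploadOutput, error)
}

// The real client satisfies the interface, so production code passes *s3.Client
// while tests pass a mock that records the multipart-upload calls.
var _ S3API = (*s3.Client)(nil)
```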
Fixes "insufficient definition levels" errors by: 1. Using consistent schema/message type generation (both with struct tags + pointers) 2. Preserving actual column metadata from temp files instead of approximating
Fixed Nested Optional Structs
Found and fixed the "insufficient definition levels" bug for nested optional STRUCT fields.
Root causes: the schema and the message type were not generated consistently, and column metadata was approximated instead of taken from the temporary files.
Fix: generate both the schema and the message type the same way (struct tags + pointer fields) and preserve the actual column metadata from the temporary files.
Tests verify column metadata extraction from temp files and proper definition/repetition level encoding in streaming parquet output.
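For illustration, the "struct tags + pointers" convention mentioned above might look like the following (type and field names are invented; the tags assume the parquet-go struct-tag syntax):

```go
// Package and type names are illustrative, not taken from the PR.
package schema

// Device is an optional nested STRUCT: the parent field below points to it.
type Device struct {
	Hostname *string `parquet:"hostname,optional"`
	IP       *string `parquet:"ip,optional"`
}

// Event shows the "struct tags + pointers" convention: optional fields are
// pointers, and the same tagged type is used to generate both the schema and
// the message type, so definition levels stay consistent for nested structs.
type Event struct {
	Time    int64   `parquet:"time"`
	Message string  `parquet:"message"`
	Device  *Device `parquet:"device,optional"`
}
```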
- Fix documentation typo: timestamp().format() → now().ts_format()
- Replace fmt.Errorf with errors.New for static error strings
- Update AWS SDK endpoint resolver to use BaseEndpoint
- Update testify assertions to use specific assertion methods
- Remove unused test struct fields
- Regenerate documentation
CI Fixes
Fixed all linting errors and CI check failures. All tests and lint checks now pass. I will review the generated documentation output more closely for any additional improvements needed.
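The BaseEndpoint change referenced above follows the current AWS SDK for Go v2 pattern; a minimal sketch (the helper name and LocalStack-style endpoint are illustrative, not taken from the PR):

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// newS3Client shows the BaseEndpoint style of endpoint override, which replaces
// the deprecated custom EndpointResolver approach. The endpoint value is an example.
func newS3Client(ctx context.Context, endpoint string) (*s3.Client, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, err
	}
	return s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String(endpoint) // e.g. "http://localhost:4566" for LocalStack
		o.UsePathStyle = true                 // path-style addressing is typically needed for local S3 emulators
	}), nil
}
```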
Preserve ColumnIndex and OffsetIndex from temporary parquet files to enable query engine optimization through page-level pruning and predicate pushdown.
- Extract and store page index data during row group flush
- Write in correct Parquet format: all ColumnIndex, then all OffsetIndex
- Track filePosition separately from uploadSize for accurate offset calculation
- Add comprehensive unit tests for page index preservation
Page Index Preservation Fix
Issue
Further in-depth testing revealed that while page index metadata pointers (ColumnIndexOffset, OffsetIndexOffset, etc.) were being copied from temporary files, the actual ColumnIndex and OffsetIndex binary data was not preserved. This prevented query engines from utilizing page-level pruning.
Solution
Page index data is now extracted during each row group flush and written back in the correct Parquet order (all ColumnIndex entries, then all OffsetIndex entries), with filePosition tracked separately from uploadSize for accurate offsets, as described in the commit above.
Overview
This PR adds a new aws_s3_parquet_stream output plugin that streams Parquet files directly to S3 using multipart uploads, addressing memory limitations and partition routing issues in the current batch approach.
Resolves: #660
Related: #659 (partition routing bug)
Motivation
Problem 1: High Memory Usage
The current approach (aws_s3 + parquet_encode processor) buffers entire Parquet files in memory before uploading to S3.
Memory usage for 100K events (minimal schema):
For production workloads with complex schemas (OCSF) or larger datasets, memory usage can exceed 10 GB, causing OOM errors in constrained environments.
Problem 2: Partition Routing
As documented in #659, the batch approach evaluates path expressions once per batch, causing incorrect partition routing:
With 100K events across 2 accounts:
This breaks Iceberg/Hive/Delta table partitioning and causes silent data corruption.
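For illustration only (field names and values are assumptions, not the exact configuration from #659), the batch-style pipeline has roughly this shape, and the interpolated path is resolved once for the whole batch rather than per message:

```yaml
# Illustrative sketch of the batch approach described above. The path
# interpolation is resolved once per batch, so a batch containing events
# from two accounts is written entirely under whichever account_id the
# expression happened to resolve to.
output:
  aws_s3:
    bucket: my-data-lake                                            # example value
    path: 'account_id=${! json("account_id") }/${! uuid_v4() }.parquet'
    batching:
      count: 100000
      processors:
        - parquet_encode: {}   # schema omitted in this sketch
```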
Solution
This PR introduces aws_s3_parquet_stream, which streams row groups to S3 through multipart uploads and routes each message using partition_by expressions.
Architecture
Memory profile per writer: ~30-60 MB (independent of dataset size)
Key Features
1. Partition Routing with partition_by
Each message is routed to its partition using partition_by expressions, as sketched below.
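A sketch of what such expressions could look like; only the partition_by parameter name comes from this PR, while the list shape, bucket field, and expression values are assumptions:

```yaml
# Sketch only: field shapes and values are illustrative.
output:
  aws_s3_parquet_stream:
    bucket: my-data-lake
    partition_by:
      - 'event_date=${! json("time").ts_format("2006-01-02") }'
      - 'account_id=${! json("account_id") }'
```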
2. Schema Loading
Supports both inline and external schema definitions:
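For example, an external schema might be referenced like this (only the schema_file parameter name comes from the PR; the path and file format are illustrative):

```yaml
# Sketch: external schema definition via schema_file.
output:
  aws_s3_parquet_stream:
    schema_file: ./schemas/ocsf_event.json
```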
3. Compression Support
All Parquet compression types are supported: uncompressed, snappy (default), gzip, zstd, brotli, lz4raw.
4. Configurable Row Groups
Implementation Details
Files Added
- output_aws_s3_parquet_stream.go (543 lines)
- parquet_streaming_writer.go (623 lines)
- output_aws_s3_parquet_stream_test.go (327 lines)
- parquet_streaming_writer_test.go (216 lines)
- output_aws_s3_parquet_stream_localstack_test.go (266 lines)
Files Modified
- internal/impl/parquet/convert.go (+17 lines): json.Number type support for int32/int64 conversion
- internal/impl/parquet/schema.go (+26 lines): SchemaOpts struct for use by output plugins
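As an aside, the json.Number support added in convert.go concerns values decoded with UseNumber(); a hypothetical helper illustrating the kind of conversion involved (not the actual code in the PR):

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// coerceInt64 is a hypothetical helper: JSON decoded with UseNumber() yields
// json.Number values, which must be converted before being written to Parquet
// INT64/INT32 columns (and which preserve precision that float64 would lose).
func coerceInt64(v any) (int64, error) {
	switch n := v.(type) {
	case json.Number:
		return n.Int64()
	case int64:
		return n, nil
	case float64:
		if n != math.Trunc(n) {
			return 0, fmt.Errorf("value %v is not an integer", n)
		}
		return int64(n), nil
	default:
		return 0, fmt.Errorf("unsupported type %T for an INT64 column", v)
	}
}

func main() {
	n, err := coerceInt64(json.Number("9007199254740993")) // exact, unlike float64
	fmt.Println(n, err)
}
```

Testing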
Unit Tests
Integration Tests
- partition_by with multiple partitions
Manual Testing
Extensive testing is documented in the project, including:
Performance
Test: 100K events with minimal schema
Test: Multiple partitions (2 dates × 2 accounts)
Backwards Compatibility
- Only additive changes to the existing parquet package (no breaking changes)
Configuration Example
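An illustrative configuration sketch: partition_by, schema_file, and the compression options are named in this PR, while the remaining field names (bucket, path, max_row_group_bytes) and all values are assumptions:

```yaml
output:
  aws_s3_parquet_stream:
    bucket: my-data-lake                        # assumed field name / example value
    path: 'events/${! uuid_v4() }.parquet'      # assumed field name
    partition_by:                               # named in this PR; list shape assumed
      - 'event_date=${! json("time").ts_format("2006-01-02") }'
      - 'account_id=${! json("account_id") }'
    schema_file: ./schemas/events.json          # named in this PR; path illustrative
    compression: zstd                           # snappy (default), gzip, zstd, brotli, lz4raw, uncompressed
    max_row_group_bytes: 134217728              # assumed name for the configurable row group size (128 MiB)
```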
Use Cases
This output is particularly valuable for:
- Memory-constrained environments where buffering entire Parquet files causes OOM errors
- Partitioned data lakes (Iceberg/Hive/Delta) that require correct per-message partition routing
- Large datasets and complex schemas (e.g. OCSF) where the batch approach exceeds available memory