Add aws_s3_stream output #664
Overview
This pull request introduces `aws_s3_stream`, a new output plugin designed to stream generic text-based files (JSON, NDJSON, CSV, logs) directly to Amazon S3 using multipart uploads. The implementation addresses two critical issues with the existing batch approach: excessive memory consumption and incorrect partition routing.

Key Problems Addressed
Memory Efficiency: The current `aws_s3` output with batching buffers entire batches in memory before writing, consuming approximately 3.31 GB for 100K events. This solution reduces memory usage to ~180 MB, a 95% improvement. For 500K events, the batch approach fails completely after 15+ minutes, while streaming succeeds in 38.56 seconds with only 170 MB of memory.

Partition Routing: The batch approach evaluates path expressions once per batch rather than per message (a sketch of such a path expression follows below). When batch boundaries cross partition boundaries, events are written to the wrong partitions. Testing with 10K interleaved events showed ~50% of events ending up in incorrect partitions, causing silent data corruption for Iceberg/Hive/Delta tables and multi-tenant data leakage.
Small File Problem: Batch processing creates many small files instead of optimal large files per partition. For example, 500K events produce 13+ incomplete files with the batch approach versus a single optimal 788 MB file with streaming, negatively impacting query performance and S3 costs.
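For illustration, the sketch below shows the kind of per-message partitioned path expression the routing problem concerns. The bucket name, the `tenant_id` field, and the specific interpolation functions are assumptions for the example, not values taken from this PR.

```yaml
# Sketch only: a path expression that partitions objects per tenant.
# With output-level batching this expression is resolved once per batch,
# so a mixed-tenant batch is written to a single (wrong) partition.
output:
  aws_s3:
    bucket: my-data-lake                                          # assumed
    path: 'events/tenant=${! json("tenant_id") }/${! uuid_v4() }.ndjson'
```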
Solution Architecture
The implementation uses:
- `partition_by` expressions for per-message partition routing

Memory footprint: 170–260 MB (constant regardless of dataset size).
Files Added
Core implementation (5 files):
- `output_aws_s3_stream.go` (438 lines): Plugin configuration and partition routing logic
- `s3_streaming_writer.go` (397 lines): Generic S3 streaming writer with multipart upload management
- `output_aws_s3_stream_test.go` (373 lines): Unit tests for output configuration and partition logic
- `output_aws_s3_stream_integration_test.go` (371 lines): LocalStack integration tests
- `s3_streaming_writer_test.go` (405 lines): Unit tests for streaming writer with mock S3 client

Files Modified
None. This PR only adds new files without modifying any existing code.
Configuration Example
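The example config block did not survive the copy into this description, so the following is a minimal sketch of what an `aws_s3_stream` configuration might look like. Only the `partition_by` parameter is named in this PR; the other field names (`bucket`, `region`), the single-string expression form, and the `tenant_id`/`event_date` fields are assumptions, borrowed from the existing `aws_s3` output for illustration.

```yaml
# Minimal sketch, not the exact config from this PR.
output:
  aws_s3_stream:
    bucket: my-data-lake                                # assumed field name
    region: us-east-1                                   # assumed field name
    # partition_by is evaluated per message, so each event is routed to the
    # correct partition regardless of how messages are interleaved.
    partition_by: 'tenant=${! json("tenant_id") }/date=${! json("event_date") }'
```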
With Compression
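The compression example is likewise missing; the sketch below follows the pipeline-level approach described in the "Why No Built-in Compression?" section, using the `compress` processor with gzip. The metadata step is an assumption added so the partition key can still be read after the payload has been compressed.

```yaml
# Sketch only: compress each message in the pipeline, not in the output.
pipeline:
  processors:
    # Capture the partition key as metadata before compressing, since the
    # payload is no longer parseable JSON afterwards.
    - mapping: |
        meta tenant = this.tenant_id
        root = this
    - compress:
        algorithm: gzip
output:
  aws_s3_stream:
    bucket: my-data-lake                                # assumed field name
    partition_by: 'tenant=${! meta("tenant") }'
```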
Testing
The PR includes 30 unit tests and 4 LocalStack-based integration tests validating:
Unit tests (18 tests for output plugin):
Unit tests (12 tests for streaming writer):
Integration tests (4 tests with LocalStack):
Manual testing with OCSF security event data (complex nested JSON):
Performance Results
At small scales, batch is faster. At production scales, batch fails completely while streaming succeeds with constant memory.
Why No Built-in Compression?
Unlike `aws_s3`, this output does not provide a `compress` parameter. S3 multipart uploads require non-final parts to be ≥ 5 MB. After compressing a 5 MB buffer, the compressed size is unpredictable (often < 5 MB), causing S3 to reject the upload.

Solution: Use pipeline-level compression via the `compress` processor (see example above). This works reliably because concatenated gzip streams are valid and each compressed message can be decompressed independently.

Migration Path
Before (Broken)
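The original "before" snippet is not reproduced here, so this is a hedged reconstruction of the batched pattern the PR describes (`aws_s3` with output-level batching and an `archive: lines` batch processor); all concrete values are placeholders.

```yaml
# Reconstruction of the batch approach: the whole batch is buffered in
# memory and the path expression is resolved once per batch.
output:
  aws_s3:
    bucket: my-data-lake
    path: 'events/tenant=${! json("tenant_id") }/${! uuid_v4() }.ndjson'
    batching:
      count: 100000
      period: 60s
      processors:
        - archive:
            format: lines
```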
After (Fixed)
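And a hedged sketch of the streaming replacement: the newline mapping and `partition_by` follow the key-changes list below, while the remaining field names are assumed.

```yaml
# Sketch of the streaming approach: no output-level batching; each message
# gets a trailing newline in the pipeline and is routed individually.
pipeline:
  processors:
    - mapping: 'root = content().string() + "\n"'
output:
  aws_s3_stream:
    bucket: my-data-lake                                # assumed field name
    partition_by: 'tenant=${! json("tenant_id") }'
```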
Key changes:
- Move the `archive: lines` logic into the pipeline (append `\n` to each message)
- Add the `partition_by` parameter for correct routing
- Remove the `batching` section from the output
- Use `aws_s3_stream` instead of `aws_s3`
No breaking changes. This PR introduces a new plugin (`aws_s3_stream`) alongside the existing `aws_s3` output. Users can migrate at their own pace. All code is new, with zero modifications to existing files.

Related Work
This PR follows the same pattern as PR #661 (`aws_s3_parquet_stream`), which addresses identical partition routing and memory issues for the Parquet format. Both implementations use:

Documentation
The plugin includes comprehensive inline documentation:
- `aws_s3`