Bug Report
Describe the bug
When using the S3 output plugin with `compression parquet`, records larger than ~1MB cause the error:

```
[error] [aws][compress] Failed to parse JSON into Arrow Table for Parquet conversion
[error] [output:s3:s3.0] Failed to compress data
```
The root cause is in `src/aws/compression/arrow/compress.c`. The `parse_json()` function creates `GArrowJSONReadOptions` with default settings, which inherit Apache Arrow's default block size of 1MB. When a single NDJSON line exceeds 1MB, Arrow's JSON reader fails with "straddling object straddles two block boundaries".
To Reproduce
- Configure S3 output with Parquet compression:

```yaml
pipeline:
  inputs:
    - name: tail
      path: /data/*.log
      tag: logs
      parser: json
  outputs:
    - name: s3
      match: "*"
      bucket: my-bucket
      region: us-east-1
      use_put_object: true
      compression: parquet
      s3_key_format: /$TAG/%Y-%m-%d/$UUID.parquet
```

- Feed a JSON record where a single field (e.g. an API response body stored as a string) exceeds 1MB.
- Observe the error when Fluent Bit attempts to upload.
Records under 1MB work correctly. The issue is reproducible with amazon/aws-for-fluent-bit:3.2.3 (Fluent Bit v4.2.2).
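For step 2, an oversized record can be generated with a short script; the file name below is illustrative and should be placed under the tail path from the config (e.g. `/data/`):

```python
import json
import os

# Write one NDJSON line whose single "body" field is ~2MB,
# comfortably above Arrow's 1MB default block size.
record = {"body": "x" * (2 * 1024 * 1024)}
with open("big.log", "w") as f:
    f.write(json.dumps(record) + "\n")

print("bytes written:", os.path.getsize("big.log"))
```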
Expected behavior
S3 Parquet compression should handle records of any size, or at least provide a configuration option to adjust the Arrow JSON reader block size.
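If such an option were added, it might be surfaced alongside the existing compression setting; the option name below is purely hypothetical and does not exist in Fluent Bit today:

```yaml
    - name: s3
      match: "*"
      compression: parquet
      # hypothetical option, not currently supported:
      parquet_json_block_size: 64M
```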
Your Environment
- Fluent Bit version: 4.2.2
- Docker image: amazon/aws-for-fluent-bit:3.2.3
- Operating System: Amazon Linux 2023 (container)
- Host: macOS (Apple Silicon)
Additional context
I'm not a C developer. The source code analysis and root cause conclusion were derived with the assistance of Claude Code. If there are any inaccuracies in the C-level analysis, please point them out.
- apache/arrow#7835 — Original Arrow JSON reader block size issue
- apache/arrow#39433 — GLib bindings block size support