
S3 Parquet compression fails for records larger than 1MB due to Arrow JSON reader default block size #11578

@yankewei

Description


Bug Report

Describe the bug
When using the S3 output plugin with `compression parquet`, records larger than ~1MB cause the error:

[error] [aws][compress] Failed to parse JSON into Arrow Table for Parquet conversion
[error] [output:s3:s3.0] Failed to compress data

The root cause is in src/aws/compression/arrow/compress.c: the parse_json() function creates GArrowJSONReadOptions with default settings, so Apache Arrow's JSON reader uses its default block size of 1MB. When a single NDJSON line exceeds 1MB, the reader fails with "straddling object straddles two block boundaries".

To Reproduce

  1. Configure S3 output with Parquet compression:
```yaml
pipeline:
  inputs:
    - name: tail
      path: /data/*.log
      tag: logs
      parser: json

  outputs:
    - name: s3
      match: "*"
      bucket: my-bucket
      region: us-east-1
      use_put_object: true
      compression: parquet
      s3_key_format: /$TAG/%Y-%m-%d/$UUID.parquet
```
  2. Feed a JSON record where a single field (e.g. an API response body stored as a string) exceeds 1MB.
  3. Observe the error when Fluent Bit attempts to upload.

Records under 1MB work correctly. The issue is reproducible with amazon/aws-for-fluent-bit:3.2.3 (Fluent Bit v4.2.2).
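An oversized test record for the reproduction above can be generated with a short script (the output path and field names are arbitrary; place the file wherever the tail input's path pattern will pick it up):

```python
# Write one NDJSON line whose "body" field alone is ~2MB, mimicking a large
# API response stored as a string.
import json

record = {"level": "info", "body": "x" * (2 * 1024 * 1024)}

# Hypothetical filename; move under /data/ to match the tail input's pattern.
with open("big.log", "w") as f:
    f.write(json.dumps(record) + "\n")
```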

Expected behavior
S3 Parquet compression should handle records of any size, or at least provide a configuration option to adjust the Arrow JSON reader block size.

Your Environment

  • Fluent Bit version: 4.2.2
  • Docker image: amazon/aws-for-fluent-bit:3.2.3
  • Operating System: Amazon Linux 2023 (container)
  • Host: macOS (Apple Silicon)

Additional context
I'm not a C developer. The source code analysis and root cause conclusion were derived with the assistance of Claude Code. If there are any inaccuracies in the C-level analysis, please point them out.
