Bug Report
Describe the bug
When using the S3 output plugin with `compression parquet`, records larger than ~1MB cause the error:

```
[error] [aws][compress] Failed to parse JSON into Arrow Table for Parquet conversion
[error] [output:s3:s3.0] Failed to compress data
```
The root cause is in `src/aws/compression/arrow/compress.c`. The `parse_json()` function creates `GArrowJSONReadOptions` with default settings, which inherit Apache Arrow's default block size of 1MB. When a single NDJSON line exceeds 1MB, Arrow's JSON reader fails with "straddling object straddles two block boundaries".
To Reproduce
- Configure S3 output with Parquet compression:

```yaml
pipeline:
  inputs:
    - name: tail
      path: /data/*.log
      tag: logs
      parser: json
  outputs:
    - name: s3
      match: "*"
      bucket: my-bucket
      region: us-east-1
      use_put_object: true
      compression: parquet
      s3_key_format: /$TAG/%Y-%m-%d/$UUID.parquet
```

- Feed a JSON record where a single field (e.g. an API response body stored as a string) exceeds 1MB.
- Observe the error when Fluent Bit attempts to upload.
Records under 1MB work correctly. The issue is reproducible with amazon/aws-for-fluent-bit:3.2.3 (Fluent Bit v4.2.2).
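For step 2, an oversized record can be generated with a short script; the file name below is illustrative and should be placed under the tail path from the config (e.g. `/data/`):

```python
import json
import os

# Write one NDJSON line whose single "body" field is ~2MB,
# comfortably above Arrow's 1MB default block size.
record = {"body": "x" * (2 * 1024 * 1024)}
with open("big.log", "w") as f:
    f.write(json.dumps(record) + "\n")

print("bytes written:", os.path.getsize("big.log"))
```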
Expected behavior
S3 Parquet compression should handle records of any size, or at least provide a configuration option to adjust the Arrow JSON reader block size.
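If such an option were added, it might be surfaced alongside the existing compression setting; the option name below is purely hypothetical and does not exist in Fluent Bit today:

```yaml
    - name: s3
      match: "*"
      compression: parquet
      # hypothetical option, not currently supported:
      parquet_json_block_size: 64M
```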
Your Environment
- Fluent Bit version: 4.2.2
- Docker image: amazon/aws-for-fluent-bit:3.2.3
- Operating System: Amazon Linux 2023 (container)
- Host: macOS (Apple Silicon)
Additional context
I'm not a C developer. The source code analysis and root cause conclusion were derived with the assistance of Claude Code. If there are any inaccuracies in the C-level analysis, please point them out.
- apache/arrow#7835 — Original Arrow JSON reader block size issue
- apache/arrow#39433 — GLib bindings block size support