Commit bd4e1df

out_s3: Add an instruction for enabling parquet compression
Signed-off-by: Hiroshi Hatake <[email protected]>
1 parent e8f4a93 commit bd4e1df

1 file changed: pipeline/outputs/s3.md (+54, −1)
@@ -45,7 +45,8 @@ The [Prometheus success/retry/error metrics values](../../administration/monitor

  | `sts_endpoint` | Custom endpoint for the STS API. | _none_ |
  | `profile` | Option to specify an AWS Profile for credentials. | `default` |
  | `canned_acl` | [Predefined Canned ACL policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl) for S3 objects. | _none_ |
- | `compression` | Compression type for S3 objects. `gzip` is currently the only supported value by default. If Apache Arrow support was enabled at compile time, you can use `arrow`. For gzip compression, the Content-Encoding HTTP Header will be set to `gzip`. Gzip compression can be enabled when `use_put_object` is `on` or `off` (`PutObject` and Multipart). Arrow compression can only be enabled with `use_put_object On`. | _none_ |
+ | `compression` | Compression/format for S3 objects. Supported values: `gzip` (always available) and `parquet` (requires an Arrow-enabled build). For `gzip`, the `Content-Encoding` HTTP header is set to `gzip`. `parquet` is available only when Fluent Bit is built with `-DFLB_ARROW=On` and Arrow GLib/Parquet GLib are installed. Parquet is typically used with `use_put_object On`. See the example after this table. | _none_ |
  | `content_type` | A standard MIME type for the S3 object, set as the Content-Type HTTP header. | _none_ |
  | `send_content_md5` | Send the Content-MD5 header with `PutObject` and UploadPart requests, as is required when Object Lock is enabled. | `false` |
  | `auto_retry_requests` | Immediately retry failed requests to AWS services once. This option doesn't affect the normal Fluent Bit retry mechanism with backoff. Instead, it enables an immediate retry with no delay for networking errors, which can help improve throughput during transient network issues. | `true` |
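
To make the new `compression` values concrete, here is a minimal sketch of an S3 output using `gzip` with multipart uploads, in the YAML format used later on this page; the region and bucket are placeholders, not part of the commit:

```yaml
pipeline:
  outputs:
    - name: s3
      match: '*'
      region: us-east-1          # placeholder
      bucket: my-example-bucket  # placeholder
      use_put_object: Off        # gzip works with multipart uploads as well as PutObject
      compression: gzip          # uploaded objects get Content-Encoding: gzip
```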
@@ -649,3 +650,55 @@ The following example uses `pyarrow` to analyze the uploaded data:

  3 2021-04-27T09:33:56.539430Z 0.0 0.0 0.0 0.0 0.0 0.0
  4 2021-04-27T09:33:57.539803Z 0.0 0.0 0.0 0.0 0.0 0.0
  ```

## Enable Parquet support

### Build requirements for Parquet

To enable Parquet, build Fluent Bit with Apache Arrow support and install Arrow GLib/Parquet GLib:
```bash
# Ubuntu/Debian example: add the Apache Arrow APT repository,
# then install the Arrow GLib and Parquet GLib development packages.
sudo apt-get update
sudo apt-get install -y -V ca-certificates lsb-release wget
wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt-get install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt-get update
sudo apt-get install -y -V libarrow-glib-dev libparquet-glib-dev

# Build Fluent Bit with Arrow support enabled (run from the Fluent Bit source tree):
cd build/
cmake -DFLB_ARROW=On ..
cmake --build .
```
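
Not part of the commit, but a quick sanity check before running CMake: Arrow GLib and Parquet GLib ship `pkg-config` metadata, so you can verify the development packages are discoverable. A minimal sketch, assuming a standard package install:

```bash
# Prints one version per module if the -dev packages are visible to pkg-config
pkg-config --modversion arrow-glib parquet-glib
```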

### Testing Parquet compression

Example configuration:
```yaml
service:
  flush: 5
  daemon: Off
  log_level: debug
  http_server: Off

pipeline:
  inputs:
    - name: dummy
      tag: dummy.local
      # The JSON must be quoted so YAML doesn't parse it as a mapping
      dummy: '{"boolean": false, "int": 1, "long": 1, "float": 1.1, "double": 1.1, "bytes": "foo", "string": "foo"}'

  outputs:
    - name: s3
      match: 'dummy*'
      region: us-east-2
      bucket: <your_testing_bucket>
      use_put_object: On
      compression: parquet
      # other parameters
```
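
Not part of the commit, but to confirm that the objects produced by this configuration really are Parquet, you can download one and read it back with `pyarrow`, mirroring the earlier analysis example. A minimal sketch; the bucket and object key are placeholders, and `boto3`/`pyarrow` are assumed to be installed:

```python
import io

import boto3                  # assumed: pip install boto3
import pyarrow.parquet as pq  # assumed: pip install pyarrow

s3 = boto3.client("s3", region_name="us-east-2")

# Placeholders: use your testing bucket and an object key the s3 output created.
obj = s3.get_object(Bucket="<your_testing_bucket>", Key="<uploaded_object_key>")
table = pq.read_table(io.BytesIO(obj["Body"].read()))

# Expect one column per field of the dummy record: boolean, int, long, float, ...
print(table.schema)
print(table.num_rows)
```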
