Commit 60e1aaa

mfernest, claude, and github-actions[bot] authored
docs(DOC-1853): document aws_s3 limitations and improve prose (#390)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 0342ffa commit 60e1aaa

5 files changed, +99 -23 lines changed

docs-data/overrides.json

Lines changed: 14 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -2920,6 +2920,20 @@
29202920
{
29212921
"name": "tcp",
29222922
"$ref": "#/definitions/tcp"
2923+
},
2924+
{
2925+
"name": "batching",
2926+
"$ref": "#/definitions/batching",
2927+
"children": [
2928+
{
2929+
"name": "byte_size",
2930+
"$ref": "#/definitions/byte_size"
2931+
},
2932+
{
2933+
"name": "count",
2934+
"$ref": "#/definitions/count"
2935+
}
2936+
]
29232937
}
29242938
]
29252939
}

modules/components/attachments/connect-4.81.0.json

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -26118,7 +26118,7 @@
2611826118
"name": "batching",
2611926119
"type": "object",
2612026120
"kind": "",
26121-
"description": "\nAllows you to configure a xref:configuration:batching.adoc[batching policy].",
26121+
"description": "Configure a xref:configuration:batching.adoc[batching policy].",
2612226122
"examples": [
2612326123
{
2612426124
"byte_size": 5000,
@@ -26140,14 +26140,14 @@
2614026140
"name": "count",
2614126141
"type": "int",
2614226142
"kind": "scalar",
26143-
"description": "A number of messages at which the batch should be flushed. If `0` disables count based batching.",
26143+
"description": "The number of messages after which the batch is flushed. Set to `0` to disable count-based batching.",
2614426144
"default": 0
2614526145
},
2614626146
{
2614726147
"name": "byte_size",
2614826148
"type": "int",
2614926149
"kind": "scalar",
26150-
"description": "An amount of bytes at which the batch should be flushed. If `0` disables size based batching.",
26150+
"description": "The number of bytes at which the batch is flushed. Set to `0` to disable size-based batching.",
2615126151
"default": 0
2615226152
},
2615326153
{

modules/components/pages/inputs/aws_s3.adoc

Lines changed: 46 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,7 @@
1212
component_type_dropdown::[]
1313

1414

15-
Downloads objects within an Amazon S3 bucket, optionally filtered by a prefix, either by walking the items in the bucket or by streaming upload notifications in realtime.
15+
Downloads objects within an Amazon S3 bucket, optionally filtered by a prefix, either by walking the items in the bucket or by streaming upload notifications in real time.
1616

1717

1818
[tabs]
@@ -39,21 +39,59 @@ include::components:example$advanced/inputs/aws_s3.yaml[]
3939

4040
== Stream objects on upload with SQS
4141

42-
A common pattern for consuming S3 objects is to emit upload notification events from the bucket either directly to an SQS queue, or to an SNS topic that is consumed by an SQS queue, and then have your consumer listen for events which prompt it to download the newly uploaded objects. More information about this pattern and how to set it up can be found at in the https://docs.aws.amazon.com/AmazonS3/latest/dev/ways-to-add-notification-config-to-bucket.html[Amazon S3 docs].
42+
A common pattern for consuming S3 objects is to emit upload notification events from the bucket either directly to an SQS queue, or to an SNS topic that is consumed by an SQS queue, and then have your consumer listen for events that prompt it to download the newly uploaded objects. More information about this pattern and how to set it up can be found in the https://docs.aws.amazon.com/AmazonS3/latest/dev/ways-to-add-notification-config-to-bucket.html[Amazon S3 docs].
4343

44-
Redpanda Connect is able to follow this pattern when you configure an `sqs.url`, where it consumes events from SQS and only downloads object keys received within those events. In order for this to work Redpanda Connect needs to know where within the event the key and bucket names can be found, specified as xref:configuration:field_paths.adoc[dot paths] with the fields `sqs.key_path` and `sqs.bucket_path`. The default values for these fields should already be correct when following the guide above.
44+
Redpanda Connect is able to follow this pattern when you configure an `sqs.url`, where it consumes events from SQS and downloads only the object keys contained in those events. For this to work, Redpanda Connect needs to know where within the event the key and bucket names can be found, specified as xref:configuration:field_paths.adoc[dot paths] with the fields `sqs.key_path` and `sqs.bucket_path`. The default values for these fields should already be correct when following the guide above.
4545

46-
If your notification events are being routed to SQS via an SNS topic then the events will be enveloped by SNS, in which case you also need to specify the field `sqs.envelope_path`, which in the case of SNS to SQS will usually be `Message`.
46+
If your notification events are being routed to SQS via an SNS topic, the events are enveloped by SNS, in which case you also need to specify the field `sqs.envelope_path`, which in the case of SNS to SQS will usually be `Message`.
4747

48-
When using SQS please make sure you have sensible values for `sqs.max_messages` and also the visibility timeout of the queue itself. When Redpanda Connect consumes an S3 object the SQS message that triggered it is not deleted until the S3 object has been sent onwards. This ensures at-least-once crash resiliency, but also means that if the S3 object takes longer to process than the visibility timeout of your queue then the same objects might be processed multiple times.
48+
When using SQS, make sure you have sensible values for `sqs.max_messages` and also the visibility timeout of the queue itself. When Redpanda Connect consumes an S3 object the SQS message that triggered it is not deleted until the S3 object has been sent onwards. This ensures at-least-once crash resiliency, but also means that if the S3 object takes longer to process than the visibility timeout of your queue, then the same objects might be processed multiple times.
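A minimal sketch, not part of this commit, of how the SQS fields described in this section fit together. The queue URL and the `max_messages` value are placeholders, and the `key_path` and `bucket_path` values are assumed to match the defaults for standard S3 event notifications. `envelope_path` applies only when events reach SQS through an SNS topic.

```yaml
input:
  aws_s3:
    sqs:
      # Placeholder queue URL; replace with your own.
      url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
      # Assumed default dot paths for standard S3 event notifications.
      key_path: Records.*.s3.object.key
      bucket_path: Records.*.s3.bucket.name
      # Only needed when SNS wraps the S3 event.
      envelope_path: Message
      # Illustrative value; tune it together with the queue's visibility timeout.
      max_messages: 10
```

If objects take a long time to process, raising the queue's visibility timeout (configured in AWS, not in this file) is likely the more direct way to reduce duplicate processing.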
4949

5050
== Download large files
5151

52-
When downloading large files it's often necessary to process it in streamed parts in order to avoid loading the entire file in memory at a given time. In order to do this a <<scanner, `scanner`>> can be specified that determines how to break the input into smaller individual messages.
52+
When downloading large files, process them in streamed parts to avoid loading the entire file into memory at once. To do this, specify a <<scanner, `scanner`>> that determines how to break the input into smaller individual messages.
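A minimal sketch, not part of this commit, of a scanner configuration for large newline-delimited objects. The bucket and prefix are placeholders, and the choice of the `lines` scanner is an assumption about the data format.

```yaml
input:
  aws_s3:
    bucket: my-bucket   # placeholder bucket name
    prefix: exports/    # placeholder prefix
    scanner:
      # Emit one message per line rather than one message per object.
      lines: {}
```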
53+
54+
== Bucket and prefix
55+
56+
The `bucket` field accepts a bucket name only, not an ARN. For example, use `my-bucket`, not `arn:aws:s3:::my-bucket`.
57+
58+
The `prefix` field accepts a single string. To consume from multiple prefixes in the same bucket, use multiple `aws_s3` inputs in a xref:components:inputs/broker.adoc[`broker` input]:
59+
60+
```yaml
61+
input:
62+
  broker:
63+
    inputs:
64+
      - aws_s3:
65+
          bucket: my-bucket
66+
          prefix: logs/app1/
67+
      - aws_s3:
68+
          bucket: my-bucket
69+
          prefix: logs/app2/
70+
```
5371

5472
== Credentials
5573

56-
By default Redpanda Connect will use a shared credentials file when connecting to AWS services. It's also possible to set them explicitly at the component level, allowing you to transfer data across accounts. You can find out more in xref:guides:cloud/aws.adoc[].
74+
By default, Redpanda Connect uses a shared credentials file when connecting to AWS services. You can also set credentials explicitly at the component level to transfer data across accounts. You can find out more in xref:guides:cloud/aws.adoc[AWS credentials].
75+
76+
== S3-compatible storage
77+
78+
The `endpoint` and `force_path_style_urls` fields let you connect to S3-compatible storage services such as Cloudflare R2, MinIO, or DigitalOcean Spaces.
79+
80+
For Cloudflare R2, set `endpoint` to your account endpoint URL and enable `force_path_style_urls`:
81+
82+
```yaml
83+
input:
84+
  aws_s3:
85+
    bucket: r2-bucket
86+
    endpoint: https://<account-id>.r2.cloudflarestorage.com
87+
    force_path_style_urls: true
88+
    region: auto
89+
    credentials:
90+
      id: <r2-access-key-id>
91+
      secret: <r2-secret-access-key>
92+
```
93+
94+
Find your account ID in the Cloudflare dashboard under *R2 > Overview > Account Details*. Generate API credentials under *R2 > Manage R2 API Tokens*.
5795

5896
== Metadata
5997

@@ -68,7 +106,7 @@ This input adds the following metadata fields to each message:
68106
- s3_version_id
69107
- All user defined metadata
70108

71-
You can access these metadata fields using xref:configuration:interpolation.adoc#bloblang-queries[function interpolation]. Note that user defined metadata is case insensitive within AWS, and it is likely that the keys will be received in a capitalized form, if you wish to make them consistent you can map all metadata keys to lower or uppercase using a Bloblang mapping such as `meta = meta().map_each_key(key -> key.lowercase())`.
109+
You can access these metadata fields using xref:configuration:interpolation.adoc#bloblang-queries[function interpolation]. User-defined metadata is case insensitive in AWS, so keys are often received in capitalized form. To normalize them, map all metadata keys to lowercase or uppercase using a Bloblang mapping such as `meta = meta().map_each_key(key -> key.lowercase())`.
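A minimal sketch, not part of this commit, of where the mapping quoted above could sit in a pipeline. It assumes a `mapping` processor placed directly after the input, and reuses the mapping expression from the text unchanged.

```yaml
pipeline:
  processors:
    - mapping: |
        # Rewrite every metadata key in lowercase so later steps can rely on one form.
        meta = meta().map_each_key(key -> key.lowercase())
```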
72110

73111
include::redpanda-connect:components:partial$fields/inputs/aws_s3.adoc[]
74112

modules/components/pages/outputs/aws_s3.adoc

Lines changed: 33 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -12,7 +12,7 @@
1212
component_type_dropdown::[]
1313

1414

15-
Sends message parts as objects to an Amazon S3 bucket. Each object is uploaded with the path specified with the `path` field.
15+
Uploads messages to an Amazon S3 bucket as objects, using the path specified in the `path` field.
1616

1717
ifndef::env-cloud[]
1818
Introduced in version 3.36.0.
@@ -40,15 +40,15 @@ include::components:example$advanced/outputs/aws_s3.yaml[]
4040
--
4141
======
4242

43-
In order to have a different path for each object you should use function interpolations described in xref:configuration:interpolation.adoc#bloblang-queries[Bloblang queries], which are calculated per message of a batch.
43+
To use a different path for each object, use xref:configuration:interpolation.adoc#bloblang-queries[function interpolation], which is evaluated for each message in a batch.
4444

4545
== Metadata
4646

47-
Metadata fields on messages will be sent as headers, in order to mutate these values (or remove them) check out the xref:configuration:metadata.adoc[metadata docs].
47+
Redpanda Connect sends metadata fields as headers. To mutate or remove these values, see the xref:configuration:metadata.adoc[metadata docs].
4848

4949
== Tags
5050

51-
The tags field allows you to specify key/value pairs to attach to objects as tags, where the values support xref:configuration:interpolation.adoc#bloblang-queries[interpolation functions]:
51+
The `tags` field accepts key/value pairs to attach to objects as tags, and the values support xref:configuration:interpolation.adoc#bloblang-queries[interpolation functions]:
5252

5353
```yaml
5454
output:
@@ -60,15 +60,15 @@ output:
6060
Timestamp: ${!meta("Timestamp")}
6161
```
6262

63-
=== Credentials
63+
== Credentials
6464

65-
By default Redpanda Connect will use a shared credentials file when connecting to AWS services. It's also possible to set them explicitly at the component level, allowing you to transfer data across accounts. You can find out more in xref:guides:cloud/aws.adoc[].
65+
By default, Redpanda Connect uses a shared credentials file when connecting to AWS services. You can also set credentials explicitly at the component level to transfer data across accounts. You can find out more in xref:guides:cloud/aws.adoc[AWS credentials].
6666

6767
== Batching
6868

6969
It's common to want to upload messages to S3 as batched archives. The easiest way to do this is to batch your messages at the output level and join the batch of messages with an xref:components:processors/archive.adoc[`archive`] or xref:components:processors/compress.adoc[`compress`] processor.
7070

71-
For example, the following configuration uploads messages as a .tar.gz archive of documents:
71+
For example, the following configuration uploads messages as a `.tar.gz` archive of documents:
7272

7373
```yaml
7474
output:
@@ -85,7 +85,7 @@ output:
8585
algorithm: gzip
8686
```
8787

88-
Alternatively, this configuration uploads JSON documents as a single large document containing an array of objects:
88+
This configuration uploads JSON documents as a single large document containing an array of objects:
8989

9090
```yaml
9191
output:
@@ -99,6 +99,31 @@ output:
9999
format: json_array
100100
```
101101

102+
== Bucket name format
103+
104+
The `bucket` field accepts a bucket name only, not an ARN. For example, use `my-bucket`, not `arn:aws:s3:::my-bucket`.
105+
106+
== S3-compatible storage
107+
108+
The `endpoint` and `force_path_style_urls` fields let you connect to S3-compatible storage services such as Cloudflare R2, MinIO, or DigitalOcean Spaces.
109+
110+
For Cloudflare R2, set `endpoint` to your account endpoint URL and enable `force_path_style_urls`:
111+
112+
```yaml
113+
output:
114+
  aws_s3:
115+
    bucket: r2-bucket
116+
    path: ${!uuid_v4()}.json
117+
    endpoint: https://<account-id>.r2.cloudflarestorage.com
118+
    force_path_style_urls: true
119+
    region: auto
120+
    credentials:
121+
      id: <r2-access-key-id>
122+
      secret: <r2-secret-access-key>
123+
```
124+
125+
Find your account ID in the Cloudflare dashboard under *R2 > Overview > Account Details*. Generate API credentials under *R2 > Manage R2 API Tokens*.
126+
102127
== Performance
103128

104129
This output benefits from sending multiple messages in flight in parallel for improved performance. You can tune the max number of in flight messages (or message batches) with the field `max_in_flight`.
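A minimal sketch, not part of this commit, of setting `max_in_flight` on the output. The bucket name and the value `16` are placeholders for illustration, not recommendations.

```yaml
output:
  aws_s3:
    bucket: my-bucket          # placeholder bucket name
    path: ${!uuid_v4()}.json   # unique key per uploaded object
    max_in_flight: 16          # illustrative cap on parallel uploads
```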

modules/components/partials/fields/outputs/aws_s3.adoc

Lines changed: 3 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -4,8 +4,7 @@
44

55
=== `batching`
66

7-
8-
Allows you to configure a xref:configuration:batching.adoc[batching policy].
7+
Configure a xref:configuration:batching.adoc[batching policy].
98

109
*Type*: `object`
1110

@@ -33,7 +32,7 @@ batching:
3332

3433
=== `batching.byte_size`
3534

36-
An amount of bytes at which the batch should be flushed. If `0` disables size based batching.
35+
The number of bytes at which the batch is flushed. Set to `0` to disable size-based batching.
3736

3837
*Type*: `int`
3938

@@ -55,7 +54,7 @@ check: this.type == "end_of_transaction"
5554

5655
=== `batching.count`
5756

58-
A number of messages at which the batch should be flushed. If `0` disables count based batching.
57+
The number of messages after which the batch is flushed. Set to `0` to disable count-based batching.
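A minimal sketch, not part of this commit, combining the two thresholds described in this file. The bucket name and both threshold values are placeholders; a batch is flushed as soon as either threshold is reached. The `archive` and `compress` processors follow the `.tar.gz` pattern described on the output page.

```yaml
output:
  aws_s3:
    bucket: my-bucket             # placeholder bucket name
    path: ${!uuid_v4()}.tar.gz
    batching:
      count: 100                  # flush after 100 messages
      byte_size: 5000000          # or after roughly 5 MB, whichever comes first
      processors:
        - archive:
            format: tar           # join the batch into a single tar archive
        - compress:
            algorithm: gzip       # then gzip the archive
```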
5958

6059
*Type*: `int`
6160
