Commit f237dea

Improved output settings
1 parent be960d7 commit f237dea

1 file changed

+31
-29
lines changed

src/content/docs/pipelines/build-with-pipelines/output-settings.mdx

Lines changed: 31 additions & 29 deletions
@@ -13,43 +13,31 @@ import { Render, PackageManagers } from "~/components";
1313
Pipelines convert a stream of records into output files and deliver the files to an R2 bucket in your account. This guide details how to change the output destination and how to customize batch settings to generate query-ready files.
1414

1515
## Configure an R2 bucket as a destination
16-
To create or update a Pipeline using Wrangler, run the following command in a terminal:
16+
To create or update a pipeline using Wrangler, run the following command in a terminal:
1717

1818
```sh
1919
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME]
2020
```
2121

22-
After running this command, you'll be prompted to authorize Cloudflare Workers Pipelines to create R2 API tokens on your behalf. These tokens are required by your Pipeline. Your Pipeline uses the tokens when loading data into your bucket. You can approve the request through the browser link which will open automatically.
22+
After running this command, you'll be prompted to authorize Cloudflare Workers Pipelines to create an R2 API token on your behalf. Your pipeline uses the R2 API token to load data into your bucket. You can approve the request through the browser link, which opens automatically.
2323

24-
If you prefer not to authenticate this way, you may pass your [R2 API Tokens](/r2/api/s3/tokens/) to Wrangler:
24+
If you prefer not to authenticate this way, you may pass your [R2 API Token](/r2/api/s3/tokens/) to Wrangler:
2525
```sh
2626
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --r2-access-key-id [ACCESS-KEY-ID] --r2-secret-access-key [SECRET-ACCESS-KEY]
2727
```
2828

29-
## Deliver data to a prefix
30-
You can specify an optional prefix for all the output files stored in your specified R2 bucket.
31-
32-
To modify the prefix for an existing Pipeline using Wrangler:
33-
```sh
34-
npx wrangler pipelines update <pipeline-name> --r2-prefix "test"
35-
```
36-
37-
All the output records generated by your pipeline will be stored under the prefix "test", and will look like this:
38-
```sh
39-
- test/event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
40-
- test/event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
41-
```
42-
4329
## File format and compression
4430
Output files are generated as Newline Delimited JSON files (`ndjson`). Each line in an output file maps to a single record.
4531

4632
By default, output files are compressed in the `gzip` format. Compression can be turned off using the `--compression` flag:
4733
```sh
48-
npx wrangler pipelines update <pipeline-name> --compression none
34+
npx wrangler pipelines update [PIPELINE-NAME] --compression none
4935
```
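As a rough illustration of the format described above, the following Python sketch reads an output file as gzip-compressed newline-delimited JSON. The helper name `read_records` and the sample field `event` are hypothetical and only serve the example. It assumes one JSON object per line.

```python
import gzip
import json


def read_records(data: bytes, compressed: bool = True) -> list:
    """Parse the bytes of one output file into a list of records.

    Assumes newline-delimited JSON, gzip-compressed unless the pipeline
    was configured with `--compression none`.
    """
    raw = gzip.decompress(data) if compressed else data
    # Each non-empty line maps to a single record.
    return [json.loads(line) for line in raw.decode("utf-8").splitlines() if line.strip()]


# Hypothetical sample payload standing in for a downloaded output file.
sample = gzip.compress(b'{"event": "a"}\n{"event": "b"}\n')
records = read_records(sample)
print(len(records), records[0]["event"])
```

Pass `compressed=False` for files written by a pipeline whose compression is turned off.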
5036

37+
Output files are named using a [ULID](https://github.com/ulid/spec) slug, followed by an extension.
38+
5139
## Customize batch behavior
52-
When configuring your Pipeline, you can define how records are batched before they are delivered to R2. Batches of records are written out to a single output file.
40+
When configuring your pipeline, you can define how records are batched before they are delivered to R2. Batches of records are written out to a single output file.
5341

5442
Batching can:
5543
1. Reduce the number of output files written to R2, and thus reduce the [cost of writing data to R2](/r2/pricing/#class-a-operations)
@@ -61,22 +49,24 @@ There are three ways to define how ingested data is batched:
6149
2. `batch-max-rows`: The maximum number of rows or events in a batch before data is written. Default, and maximum, is 10,000 rows.
6250
3. `batch-max-seconds`: The maximum duration of a batch before data is written, in seconds. Default is 15 seconds, maximum is 300 seconds.
6351

64-
Pipelines batch definitions are hints. A pipeline will follow these hints closely, but batches will not be exact.
52+
Batch definitions are hints. A pipeline will follow these hints closely, but batches might not be exact.
6553

6654
All three batch definitions work together. Whichever limit is reached first triggers the delivery of a batch.
6755

68-
For example, a `batch-max-mb` = 100 MB and a `batch-max-seconds` = 100 means that if 100 MB of events are posted to the Pipeline, the batch will be delivered. However, if it takes longer than 100 seconds for 100 MB of events to be posted, a batch of all the messages that were posted during those 100 seconds will be created.
56+
For example, with `batch-max-mb` set to 100 and `batch-max-seconds` set to 100, a batch is delivered as soon as 100 MB of events have been posted to the pipeline. If 100 seconds pass before 100 MB of events arrive, the pipeline instead delivers a batch containing all the messages that were posted during those 100 seconds.
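The "whichever limit is reached first" rule can be modeled with a small Python sketch. This is an illustrative model only, not how Pipelines is implemented; the names `BatchLimits` and `should_deliver` are hypothetical, and real batches are hints rather than exact cut-offs.

```python
from dataclasses import dataclass


@dataclass
class BatchLimits:
    # Hypothetical container mirroring the three batch settings.
    max_mb: float       # corresponds to batch-max-mb
    max_rows: int       # corresponds to batch-max-rows
    max_seconds: float  # corresponds to batch-max-seconds


def should_deliver(limits: BatchLimits, size_mb: float, rows: int, age_seconds: float) -> bool:
    """Return True as soon as any one of the three limits is reached."""
    return (
        size_mb >= limits.max_mb
        or rows >= limits.max_rows
        or age_seconds >= limits.max_seconds
    )


limits = BatchLimits(max_mb=100, max_rows=10_000, max_seconds=100)

# 100 MB arrived after 12 seconds: the size limit triggers delivery.
print(should_deliver(limits, size_mb=100, rows=500, age_seconds=12))
# Only 40 MB arrived, but 100 seconds have passed: the time limit triggers delivery.
print(should_deliver(limits, size_mb=40, rows=500, age_seconds=100))
# No limit reached yet: the batch stays open.
print(should_deliver(limits, size_mb=40, rows=500, age_seconds=12))
```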
6957

7058
### Defining batch settings using Wrangler
71-
To update the batch settings for an existing Pipeline using Wrangler, run the following command in a terminal
59+
You can use the following batch settings flags while creating or updating a pipeline:
60+
* `--batch-max-mb`
61+
* `--batch-max-rows`
62+
* `--batch-max-seconds`
7263

64+
For example:
7365
```sh
7466
npx wrangler pipelines update [PIPELINE-NAME] --batch-max-mb 100 --batch-max-rows 10000 --batch-max-seconds 300
7567
```
7668

77-
### Batch settings
78-
79-
You can configure the following batch-level settings to adjust how Pipelines create a batch:
69+
#### Batch size limits
8070

8171
| Setting | Default | Minimum | Maximum |
8272
| ----------------------------------------- | ----------- | --------- | ----------- |
@@ -86,16 +76,28 @@ You can configure the following batch-level settings to adjust how Pipelines cre
8676

8777

8878
## Deliver partitioned data
89-
Partitioning organizes data into directories based on specific fields to improve query performance. It helps by reducing the amount of data scanned for queries, enabling faster reads.
79+
Partitioning organizes data into directories based on specific fields to improve query performance. Partitions reduce the amount of data scanned for queries, enabling faster reads.
9080

9181
:::note
92-
By default, all Pipelines partition data by event date and time. This will be customizable in the future.
82+
By default, Pipelines partition data by event date and time. This will be customizable in the future.
9383
:::
9484

9585
Output files are prefixed with event date and hour. For example, the output from a Pipeline in your R2 bucket might look like this:
9686
```sh
97-
- event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
98-
- event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
87+
- event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz
88+
- event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz
9989
```
10090

91+
## Deliver data to a prefix
92+
You can specify an optional prefix for all the output files stored in your specified R2 bucket, using the flag `--r2-prefix`.
93+
94+
For example:
95+
```sh
96+
npx wrangler pipelines update [PIPELINE-NAME] --r2-prefix test
97+
```
10198

99+
After running the above command, the output files generated by your pipeline will be stored under the prefix "test". Files will remain partitioned. Your output will look like this:
100+
```sh
101+
- test/event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz
102+
- test/event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz
103+
```
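To make the object key layout concrete, here is a minimal Python sketch that rebuilds keys in the shape shown above: an optional prefix, the `event_date` and `hr` partitions, then the ULID file name and extension. The function `output_key` is hypothetical, and the use of UTC and a zero-padded hour are assumptions that the examples above do not confirm.

```python
from datetime import datetime, timezone


def output_key(event_time: datetime, file_id: str, prefix: str = "", extension: str = "json.gz") -> str:
    """Build an object key shaped like the examples above (illustrative only)."""
    # Partition directories by event date and hour (zero-padded hour is an assumption).
    partition = f"event_date={event_time:%Y-%m-%d}/hr={event_time:%H}"
    key = f"{partition}/{file_id}.{extension}"
    # Normalize the prefix so stray slashes do not produce empty path segments.
    prefix = prefix.strip("/")
    return f"{prefix}/{key}" if prefix else key


when = datetime(2025, 4, 1, 15, 0, tzinfo=timezone.utc)
print(output_key(when, "01JQWBZCZBAQZ7RJNZHN38JQ7V"))
print(output_key(when, "01JQWBZCZBAQZ7RJNZHN38JQ7V", prefix="test"))
```

A helper like this can be handy when listing or filtering objects for a given day and hour, since all files for that hour share the same key prefix.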
