Commit 924796d

Improved output settings
1 parent cff26ee commit 924796d

File tree

6 files changed: +113 −108 lines changed


src/content/docs/pipelines/build-with-pipelines/batching.mdx

Lines changed: 0 additions & 39 deletions
This file was deleted.

src/content/docs/pipelines/build-with-pipelines/http.mdx

Lines changed: 3 additions & 3 deletions

@@ -1,16 +1,16 @@
 ---
-title: Source - HTTP
+title: Configure HTTP Endpoint
 pcx_content_type: concept
 sidebar:
   order: 1
 head:
   - tag: title
-    content: Cloudflare Pipelines Source - HTTP
+    content: Configure HTTP Endpoint
 ---
 
 import { Render, PackageManagers } from "~/components";
 
-Pipelines support ingesting data via HTTP. When you create a new Pipeline, you'll receive an HTTP endpoint that you can make post requests to.
+Pipelines support data ingestion over HTTP. When you create a new Pipeline, you'll receive a globally scalable ingestion endpoint. To ingest data, make HTTP POST requests to the endpoint.
 
 
 ```sh
src/content/docs/pipelines/build-with-pipelines/output-settings.mdx

Lines changed: 101 additions & 0 deletions
This file was added.

---
title: Customize output settings
pcx_content_type: concept
sidebar:
  order: 3
head:
  - tag: title
    content: Customize output settings
---

import { Render, PackageManagers } from "~/components";

Pipelines convert a stream of records into output files, and deliver the files to an R2 bucket in your account. This guide details how to change the output destination, and how to customize batch settings to generate query-ready files.

## Configure an R2 bucket as a destination

To create or update a Pipeline using Wrangler, run the following command in a terminal:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME]
```

After running this command, you'll be prompted to authorize Cloudflare Workers Pipelines to create R2 API tokens on your behalf. Your Pipeline uses these tokens when loading data into your bucket. You can approve the request through the browser link, which will open automatically.

If you prefer not to authenticate this way, you may pass your [R2 API Tokens](/r2/api/s3/tokens/) to Wrangler:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --r2-access-key-id [ACCESS-KEY-ID] --r2-secret-access-key [SECRET-ACCESS-KEY]
```

## Deliver data to a prefix

You can specify an optional prefix for all the output files stored in your specified R2 bucket.

To modify the prefix for an existing Pipeline using Wrangler:

```sh
npx wrangler pipelines update <pipeline-name> --r2-prefix "test"
```

All the output records generated by your Pipeline will be stored under the prefix "test", and will look like this:

```sh
- test/event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
- test/event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
```

## File format and compression

Output files are generated as Newline Delimited JSON files (`ndjson`). Each line in an output file maps to a single record.
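For example, an (uncompressed) output file holding two records is simply two JSON objects separated by a newline. The field names below are illustrative, not a required schema:

```json
{"event": "click", "user_id": "u_123", "ts": "2024-09-06T15:01:12Z"}
{"event": "pageview", "user_id": "u_456", "ts": "2024-09-06T15:01:13Z"}
```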
By default, output files are compressed in the `gzip` format. Compression can be turned off using the `--compression` flag:

```sh
npx wrangler pipelines update <pipeline-name> --compression none
```

## Customize batch behavior

When configuring your Pipeline, you can define how records are batched before they are delivered to R2. Each batch of records is written out to a single output file.

Batching can:

1. Reduce the number of output files written to R2, and thus reduce the [cost of writing data to R2](/r2/pricing/#class-a-operations)
2. Increase the size of output files, making them more efficient to query

There are three ways to define how ingested data is batched:

1. `batch-max-mb`: The maximum amount of data that will be batched, in megabytes. Default is 10 MB, maximum is 100 MB.
2. `batch-max-rows`: The maximum number of rows or events in a batch before data is written. Default, and maximum, is 10,000 rows.
3. `batch-max-seconds`: The maximum duration of a batch before data is written, in seconds. Default is 15 seconds, maximum is 300 seconds.

Pipelines batch definitions are hints. A Pipeline will follow these hints closely, but batches will not be exact.

All three batch definitions work together. Whichever limit is reached first triggers the delivery of a batch.

For example, with `batch-max-mb` set to 100 MB and `batch-max-seconds` set to 100, a batch is delivered as soon as 100 MB of events are posted to the Pipeline. If it takes longer than 100 seconds for 100 MB of events to arrive, a batch of all the messages posted during those 100 seconds is delivered instead.
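The "whichever limit is reached first" rule can be sketched as a small decision function. This is an illustrative model of the documented behavior, not Pipelines source code; the helper and field names are assumptions:

```typescript
// Illustrative model of batch flushing: a batch is delivered as soon as
// any one of the three configured limits is reached.
interface BatchLimits {
  maxMb: number; // batch-max-mb
  maxRows: number; // batch-max-rows
  maxSeconds: number; // batch-max-seconds
}

function shouldFlush(
  limits: BatchLimits,
  currentMb: number,
  currentRows: number,
  elapsedSeconds: number,
): boolean {
  return (
    currentMb >= limits.maxMb ||
    currentRows >= limits.maxRows ||
    elapsedSeconds >= limits.maxSeconds
  );
}

const limits: BatchLimits = { maxMb: 100, maxRows: 10_000, maxSeconds: 100 };

// 100 MB arrives before 100 seconds pass: the size limit triggers delivery.
console.log(shouldFlush(limits, 100, 5_000, 42)); // true

// Only 3 MB arrived, but 100 seconds elapsed: the timeout triggers delivery.
console.log(shouldFlush(limits, 3, 1_200, 100)); // true
```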
### Defining batch settings using Wrangler

To update the batch settings for an existing Pipeline using Wrangler, run the following command in a terminal:

```sh
npx wrangler pipelines update [PIPELINE-NAME] --batch-max-mb 100 --batch-max-rows 10000 --batch-max-seconds 300
```

### Batch settings

You can configure the following batch-level settings to adjust how Pipelines create a batch:

| Setting                                   | Default     | Minimum   | Maximum     |
| ----------------------------------------- | ----------- | --------- | ----------- |
| Maximum Batch Size `batch-max-mb`         | 10 MB       | 0.001 MB  | 100 MB      |
| Maximum Batch Timeout `batch-max-seconds` | 15 seconds  | 0 seconds | 300 seconds |
| Maximum Batch Rows `batch-max-rows`       | 10,000 rows | 1 row     | 10,000 rows |

## Deliver partitioned data

Partitioning organizes data into directories based on specific fields to improve query performance. It helps by reducing the amount of data scanned for queries, enabling faster reads.

:::note
By default, all Pipelines partition data by event date and time. This will be customizable in the future.
:::

Output files are prefixed with event date and hour. For example, the output from a Pipeline in your R2 bucket might look like this:

```sh
- event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
- event_date=2024-09-06/hr=15/37db9289-15ba-4e8b-9231-538dc7c72c1e-15.json.gz
```
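As a rough sketch, the `event_date=YYYY-MM-DD/hr=HH` partition prefix shown above can be derived from an event timestamp like this. The helper is hypothetical and shown only to illustrate the layout; Pipelines generates these paths for you:

```typescript
// Derive the `event_date=YYYY-MM-DD/hr=HH` partition prefix that precedes
// each output file name, from a given event time (UTC).
function partitionPrefix(eventTime: Date): string {
  const date = eventTime.toISOString().slice(0, 10); // YYYY-MM-DD
  const hour = String(eventTime.getUTCHours()).padStart(2, "0");
  return `event_date=${date}/hr=${hour}`;
}

console.log(partitionPrefix(new Date("2024-09-06T15:42:00Z")));
// → event_date=2024-09-06/hr=15
```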

src/content/docs/pipelines/build-with-pipelines/r2.mdx

Lines changed: 0 additions & 57 deletions
This file was deleted.

src/content/docs/pipelines/build-with-pipelines/workers-apis.mdx

Lines changed: 7 additions & 7 deletions

@@ -1,11 +1,11 @@
 ---
-title: Source - Workers
+title: Workers API
 pcx_content_type: concept
 sidebar:
   order: 2
 head:
   - tag: title
-    content: Cloudflare Pipelines - Workers APIs
+    content: Workers API
 
 ---
 
@@ -28,15 +28,11 @@ compatibility_date = "2025-04-01"
 
 [[pipelines]]
 pipeline = "<MY-PIPELINE-NAME>" # The name of your Pipeline
-binding = "MY_PIPELINE" # The binding name, accessed using env.MY_PIPELINT
+binding = "MY_PIPELINE" # The binding name, accessed using env.MY_PIPELINE
 ```
 
 </WranglerConfig>
 
-:::note
-When running your Worker locally, Pipelines are partially simulated. Worker code which sends data to a Pipeline will execute successfully. However, the full Pipeline, including batching & writing to R2, will not be executed locally.
-:::
-
 
 ## `Pipeline`
 A binding which allows a Worker to send messages to a Pipeline.
@@ -51,3 +47,7 @@ interface Pipeline<PipelineRecord> {
 
 * Sends a message to the Pipeline. The body must be an array of objects supported by the [structured clone algorithm](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types).
 * When the promise resolves, the message is confirmed to be stored by the Pipeline.
+
+:::note
+When running your Worker locally, Pipelines are partially simulated. Worker code which sends data to a Pipeline will execute successfully. However, the full Pipeline, including batching & writing to R2, will not be executed locally.
+:::
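The `send()` method in this diff can be used from a Worker along these lines. This is a minimal sketch, assuming the `MY_PIPELINE` binding name from the Wrangler config above; the record fields are illustrative, not a required schema:

```typescript
// Minimal Worker that forwards a request-metadata record to a Pipeline.
// `MY_PIPELINE` matches the binding name in the Wrangler config above.
interface Pipeline<PipelineRecord> {
  send(records: PipelineRecord[]): Promise<void>;
}

interface Env {
  MY_PIPELINE: Pipeline<Record<string, unknown>>;
}

const worker = {
  async fetch(request: Request, env: Env): Promise<Response> {
    // The promise resolves once the Pipeline has accepted the records.
    await env.MY_PIPELINE.send([
      {
        url: request.url,
        method: request.method,
        receivedAt: new Date().toISOString(),
      },
    ]);
    return new Response("Accepted", { status: 202 });
  },
};

export default worker;
```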

src/content/docs/pipelines/index.mdx

Lines changed: 2 additions & 2 deletions

@@ -46,8 +46,8 @@ Refer to the [getting started guide](/pipelines/getting-started) to start buildi
 Each Pipeline generates a globally scalable HTTP endpoint, which supports authentication and CORS settings.
 </Feature>
 
-<Feature header="Batch settings" href="/pipelines/build-with-pipelines/batching">
-Leverage batch sizes to generate large, and query-efficient, output files.
+<Feature header="Customize batch behavior, deliver to a prefix, and receive partitioned data" href="/pipelines/build-with-pipelines/output-settings">
+Customize Pipeline settings to generate output files that are efficient to query.
 </Feature>
 
 ***
