# Initial pipelines docs #18595
---
title: Configure HTTP Endpoint
pcx_content_type: concept
sidebar:
  order: 1
head:
  - tag: title
    content: Configure HTTP Endpoint
---

import { Render, PackageManagers } from "~/components";

Pipelines support data ingestion over HTTP. When you create a new pipeline, you'll receive a globally scalable ingestion endpoint. To ingest data, make HTTP POST requests to the endpoint.

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME]

🌀 Creating pipeline named "[PIPELINE-NAME]"
✅ Successfully created pipeline [PIPELINE-NAME] with ID [PIPELINE-ID]

You can now send data to your pipeline with:
  curl "https://<PIPELINE-ID>.pipelines.cloudflare.com/" -d '[{ "foo":"bar" }]'
```

## Accepted data formats

Pipelines accept arrays of valid JSON objects. You can send multiple objects in a single request, provided the total data volume is within the [documented limits](/pipelines/platform/limits). Sending data in a different format will result in an error.

For example, you can send data to your pipeline using a curl command like this:

```sh
curl -X POST https://<PIPELINE-ID>.pipelines.cloudflare.com \
  -H "Content-Type: application/json" \
  -d '[{"foo":"bar"}, {"foo":"bar"}, {"foo":"bar"}]'

{"success":true,"result":{"committed":3}}
```
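
If you are posting events from application code rather than the command line, the same request can be made with `fetch`. The following is a minimal sketch, not an official client; the placeholder `PIPELINE_URL` and the `sendEvents` helper are illustrative only:

```ts
// Hypothetical helper: POST a batch of JSON objects to a pipeline endpoint.
// Replace the placeholder URL with your pipeline's ingestion endpoint.
const PIPELINE_URL = "https://<PIPELINE-ID>.pipelines.cloudflare.com";

async function sendEvents(events: object[]): Promise<void> {
  const response = await fetch(PIPELINE_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The body must be a JSON array of objects, even for a single event.
    body: JSON.stringify(events),
  });
  if (!response.ok) {
    throw new Error(`Ingestion failed with status ${response.status}`);
  }
}

await sendEvents([{ foo: "bar" }, { foo: "bar" }]);
```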

## Turning HTTP ingestion off

By default, ingestion via HTTP is turned on. You can turn it off by setting `--enable-http false` when creating or updating a pipeline.

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --enable-http false
```

Ingestion URLs are tied to your pipeline ID. Turning HTTP off, and then turning it back on, will not change the URL.

## Authentication

You can secure your HTTP ingestion endpoint using Cloudflare API tokens. By default, authentication is turned off. To enable authentication, use `--require-http-auth true` while creating or updating a pipeline.

```sh
$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --require-http-auth true
```

Once authentication is turned on, you will need to include a Cloudflare API token in your request headers.

### Get API token

1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com) and select your account.
2. Navigate to your [API tokens page](https://dash.cloudflare.com/profile/api-tokens).
3. Select *Create Token*.
4. Choose the template for Workers Pipelines. Select *Continue to summary*, then *Create Token*. Make sure to copy the API token and save it securely.

### Making authenticated requests

Include the API token you created in the previous step in the headers for your request:

```sh
curl https://<PIPELINE-ID>.pipelines.cloudflare.com \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -d '[{"foo":"bar"}, {"foo":"bar"}, {"foo":"bar"}]'
```

## Specifying CORS settings

If you want to use your pipeline to ingest client-side data, such as website clicks, you'll need to configure your [Cross-Origin Resource Sharing (CORS) settings](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS).

Without CORS settings, browsers will restrict requests made to your pipeline endpoint. For example, if your website domain is `https://my-website.com`, and you want to post client-side data to your pipeline at `https://<PIPELINE-ID>.pipelines.cloudflare.com`, the request will fail unless CORS settings are in place.

To fix this, configure your pipeline to accept requests from `https://my-website.com`. You can do so while creating or updating a pipeline, using the `--cors-origins` flag. You can specify multiple domains separated by a space.

```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins https://mydomain.com http://localhost:8787
```

You can also specify that all cross-origin requests are accepted. We recommend only using this option in development, and not for production use cases.

```sh
$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins "*"
```

After `--cors-origins` has been set on your pipeline, it will respond to preflight requests and POST requests with the appropriate `Access-Control-Allow-Origin` headers set.
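
As an illustration, once your website's origin is included in `--cors-origins`, client-side code served from that origin can post events directly to the endpoint. A minimal sketch (the endpoint URL and event shape are placeholders, not a prescribed schema):

```ts
// Client-side sketch: report click events to a pipeline endpoint.
// Because this runs in the browser, the request is subject to the
// CORS policy configured with --cors-origins.
document.addEventListener("click", async (event) => {
  await fetch("https://<PIPELINE-ID>.pipelines.cloudflare.com", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify([
      { type: "click", x: event.clientX, y: event.clientY, ts: Date.now() },
    ]),
  });
});
```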

---
title: Build with Pipelines
pcx_content_type: navigation
sidebar:
  order: 3
  group:
    hideIndex: true
---

---
title: Customize output settings
pcx_content_type: concept
sidebar:
  order: 3
head:
  - tag: title
    content: Customize output settings
---

import { Render, PackageManagers } from "~/components";

Pipelines convert a stream of records into output files and deliver the files to an R2 bucket in your account. This guide details how you can change the output destination, and how to customize batch settings to generate query-ready files.

## Configure an R2 bucket as a destination

To create or update a pipeline using Wrangler, run the following command in a terminal:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME]
```

After running this command, you'll be prompted to authorize Cloudflare Workers Pipelines to create an R2 API token on your behalf. Your pipeline uses the R2 API token to load data into your bucket. You can approve the request through the browser link, which will open automatically.

If you prefer not to authenticate this way, you may pass your [R2 API Token](/r2/api/s3/tokens/) to Wrangler:

```sh
npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --r2-access-key-id [ACCESS-KEY-ID] --r2-secret-access-key [SECRET-ACCESS-KEY]
```

## File format and compression

Output files are generated as Newline Delimited JSON files (`ndjson`). Each line in an output file maps to a single record.

By default, output files are compressed in the `gzip` format. Compression can be turned off using the `--compression` flag:

```sh
npx wrangler pipelines update [PIPELINE-NAME] --compression none
```

Output files are named using a [ULID](https://github.com/ulid/spec) slug, followed by an extension.

## Customize batch behavior

When configuring your pipeline, you can define how records are batched before they are delivered to R2. Batches of records are written out to a single output file.

Batching can:

1. Reduce the number of output files written to R2, and thus reduce the [cost of writing data to R2](/r2/pricing/#class-a-operations).
2. Increase the size of output files, making them more efficient to query.

There are three ways to define how ingested data is batched:

1. `batch-max-mb`: The maximum amount of data that will be batched, in megabytes. Default is 10 MB, maximum is 100 MB.
2. `batch-max-rows`: The maximum number of rows or events in a batch before data is written. Default, and maximum, is 10,000 rows.
3. `batch-max-seconds`: The maximum duration of a batch before data is written, in seconds. Default is 15 seconds, maximum is 300 seconds.

All three batch definitions work together. Whichever limit is reached first triggers the delivery of a batch.

### Batch size limits

After running the above command, the output files generated by your pipeline will be stored under the prefix `test`. Files will remain partitioned. Your output will look like this:

---
pcx_content_type: concept
title: Shards
sidebar:
  order: 11
---

TODO

---
title: Workers API
pcx_content_type: concept
sidebar:
  order: 2
head:
  - tag: title
    content: Workers API
---

This guide details the Pipelines API within Cloudflare Workers.

## Send data to a Pipeline from a Worker

Pipelines exposes an API directly to your Workers scripts via the [bindings](/workers/runtime-apis/bindings/#what-is-a-binding) concept. Bindings allow you to securely send data to a Pipeline without having to manage API keys or clients.

You can bind to a Pipeline by defining a `[[pipelines]]` binding within your Wrangler configuration. For example:

import { WranglerConfig } from "~/components";

<WranglerConfig>

```toml title="wrangler.toml"
#:schema node_modules/wrangler/config-schema.json
name = "pipeline-starter"
main = "src/index.ts"
compatibility_date = "2025-04-01"

[[pipelines]]
pipeline = "<MY-PIPELINE-NAME>" # The name of your Pipeline
binding = "MY_PIPELINE" # The binding name, accessed using env.MY_PIPELINE
```

</WranglerConfig>

## `Pipeline`

A binding which allows a Worker to send messages to a Pipeline.

```ts
interface Pipeline<PipelineRecord> {
  send(records: PipelineRecord[]): Promise<void>;
}
```

* `send(records)`: `Promise<void>`

  * Sends an array of records to the Pipeline. The body must be an array of objects supported by the [structured clone algorithm](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm#supported_types).
  * When the promise resolves, the records are confirmed to be stored by the Pipeline.
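
For example, a Worker could forward each incoming request body to the pipeline. This is a minimal sketch built on the `MY_PIPELINE` binding from the configuration above; the `Env` interface and the 202 response are illustrative choices, not required:

```ts
interface Env {
  MY_PIPELINE: Pipeline<Record<string, unknown>>;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // send() expects an array of records, so wrap the single event.
    const event = (await request.json()) as Record<string, unknown>;
    await env.MY_PIPELINE.send([event]);
    // The resolved promise confirms the Pipeline has stored the records.
    return new Response("Accepted", { status: 202 });
  },
};
```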

:::note
When running your Worker locally, Pipelines are partially simulated. Worker code which sends data to a Pipeline will execute successfully. However, the full Pipeline, including batching & writing to R2, will not be executed locally.
:::

---
pcx_content_type: concept
title: How Pipelines Work
sidebar:
  order: 1
---

Cloudflare Pipelines let you ingest data from a source and deliver it to a sink. Pipelines are built for high volume, real time data streams. Each pipeline can ingest up to 100 MB/s of data, via HTTP or from a Worker, and load the data as files in an R2 bucket.

Pipelines supports delivering data into [R2 Object Storage](/r2/). Ingested data is delivered as newline delimited JSON files (`ndjson`) with optional compression. Multiple pipelines can be configured to deliver data to the same R2 bucket.

If you send too much data, the pipeline will communicate backpressure by returning a 429 response to HTTP requests, or throwing an error if using the Workers API. Refer to the [limits](/pipelines/platform/limits) to learn how much volume a single pipeline can support. You might see 429 responses if you are sending too many requests or sending too much data.

If you are hitting these limits, you can:

* Increase the [shard count](/pipelines/build-with-pipelines/shards) to increase the maximum throughput of your pipeline.
* Send data to a second pipeline if you receive an error. You can set up multiple pipelines to write to the same R2 bucket, as sketched below.
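
For instance, a Worker using the Workers API could catch the error thrown by a saturated pipeline and retry against a second one. A minimal sketch, assuming two hypothetical bindings `PRIMARY_PIPELINE` and `BACKUP_PIPELINE` configured to deliver to the same bucket:

```ts
interface Env {
  // Hypothetical pipeline bindings that write to the same R2 bucket.
  PRIMARY_PIPELINE: Pipeline<Record<string, unknown>>;
  BACKUP_PIPELINE: Pipeline<Record<string, unknown>>;
}

async function sendWithFallback(
  env: Env,
  records: Record<string, unknown>[],
): Promise<void> {
  try {
    await env.PRIMARY_PIPELINE.send(records);
  } catch {
    // The primary pipeline signaled backpressure (or another error);
    // fall back to the second pipeline.
    await env.BACKUP_PIPELINE.send(records);
  }
}
```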

> **Collaborator:** This isn't used but I would expect a clearer guide on How Pipelines Works (w/ diagrams) - e.g. how it scales, how it batches, and how the system works. We need to explain our data systems.
>
> **Author:** for sure - will add that

---
title: Concepts
pcx_content_type: navigation
sidebar:
  order: 3
  group:
    hideIndex: true
---

---
title: Examples
pcx_content_type: navigation
sidebar:
  order: 6
  group:
    hideIndex: false
---

import { DirectoryListing } from "~/components"

<DirectoryListing />