src/content/docs/pipelines/concepts/how-pipelines-work.mdx
33 additions & 2 deletions
@@ -5,8 +5,39 @@ sidebar:
order: 1
---

- Cloudflare Pipelines lets you ingest high volumes of real time data, and load the data into R2. It's a useful tool to build a data lake of clickstream data, or load logs from a service to query later. Pipelines let you do this without managing any of the underlying infrastructure; you can just create a Pipeline and send data.
+ Cloudflare Pipelines lets you ingest data from a source and deliver it to a destination. It is built for high-volume, real-time data streams. Each pipeline can ingest up to 100 MB/s of data, via HTTP or from a Worker, and load the data as files into an R2 bucket.

Pipelines supports ingestion via [HTTP](/pipelines/build-with-pipelines/http) or from a [Cloudflare Worker](/workers/) using the [Pipelines Workers API](/pipelines/build-with-pipelines/workers-apis).
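
As an illustration, a Worker with a Pipeline binding can forward records like this. This is a minimal sketch: the binding name `MY_PIPELINE`, the structural type used for it, and the record shape are assumptions for the example, not values defined by this page.

```ts
// Minimal sketch of ingesting from a Worker, assuming a Pipeline binding named
// MY_PIPELINE is configured for this Worker. The record fields are illustrative;
// any JSON-serializable objects can be sent.
export interface Env {
  MY_PIPELINE: { send(records: object[]): Promise<void> };
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const records = [
      { event: "pageview", url: "/pricing", ts: Date.now() },
      { event: "click", element: "signup-button", ts: Date.now() },
    ];

    // send() resolves once the pipeline has accepted the records.
    await env.MY_PIPELINE.send(records);

    return new Response("Accepted", { status: 202 });
  },
};
```
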
A pipeline can ingest JSON-serializable records.
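
For HTTP ingestion, records are sent as a JSON array in the request body. A sketch, with the endpoint URL below as a placeholder for the HTTP endpoint assigned to your pipeline:

```ts
// Illustrative only: replace the placeholder URL with your pipeline's HTTP
// ingestion endpoint. The request body is a JSON array of records.
const response = await fetch("https://<your-pipeline-endpoint>", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify([
    { event: "pageview", url: "/pricing", ts: Date.now() },
    { event: "purchase", amount: 29.99, ts: Date.now() },
  ]),
});

if (!response.ok) {
  // A 429 status signals backpressure; see the backpressure section below.
  console.error(`Ingestion failed with status ${response.status}`);
}
```
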
Finally, Pipelines supports R2 as a sink. Ingested data is written to output files, compressed, and delivered to an R2 bucket. Output files are generated as newline-delimited JSON files (`ndjson`). The filename of each output file is prefixed with the event date and hour to make querying the data more efficient. For example, an output file might be named `event_date=2025-04-03/hr=15/01JQY361X75TMYSQZGWC6ZDMR2.json.gz`. Each line in an output file maps to a single record ingested by the pipeline.
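
Because filenames carry the event date and hour as a prefix, a consumer can read just the files for a time window rather than scanning the whole bucket. Below is a sketch of listing one hour of output from a Worker with an R2 binding; the binding name `MY_BUCKET` and the prefix are assumptions based on the example filename above.

```ts
// Sketch: list one hour of pipeline output files from R2, assuming an R2
// binding named MY_BUCKET points at the pipeline's destination bucket.
export interface Env {
  MY_BUCKET: R2Bucket;
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    // Listing by the date/hour prefix narrows the scan to a single hour of data.
    const listed = await env.MY_BUCKET.list({
      prefix: "event_date=2025-04-03/hr=15/",
    });

    const keys = listed.objects.map((object) => object.key);
    return Response.json({ files: keys });
  },
};
```
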
We plan to support more sources, data formats, and sinks in the future.

## Data durability and the lifecycle of a request

If you make a request to send data to a pipeline and receive a successful response, the data is guaranteed to be delivered.

Any data sent to a pipeline is durably committed to storage. Pipelines use [SQLite-backed Durable Objects](/durable-objects/best-practices/access-durable-objects-storage/#sqlite-storage-backend) as a buffer for ingested records. A pipeline only returns a response after the data has been successfully stored.
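
Conceptually, the ingest path is "store first, acknowledge second". The sketch below illustrates that ordering with a SQLite-backed Durable Object; it is not Cloudflare's actual implementation, and the class and table names are made up for the example.

```ts
import { DurableObject } from "cloudflare:workers";

// A simplified illustration of the "store first, acknowledge second" ordering,
// using SQLite-backed Durable Object storage. Not Pipelines' real implementation.
export class BufferShard extends DurableObject {
  async ingest(records: object[]): Promise<number> {
    this.ctx.storage.sql.exec(
      "CREATE TABLE IF NOT EXISTS buffer (record TEXT NOT NULL)"
    );

    for (const record of records) {
      // Each record is written to durable storage before we return.
      this.ctx.storage.sql.exec(
        "INSERT INTO buffer (record) VALUES (?)",
        JSON.stringify(record)
      );
    }

    // Only after the writes have been committed does the caller see success.
    return records.length;
  }
}
```
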
Ingested data is buffered until a sufficiently large batch has accumulated. Batching reduces the number of output files written to R2. [Batch sizes are customizable](/pipelines/build-with-pipelines/output-settings/#customize-batch-behavior) in terms of data volume, number of rows, or time.
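
The batching rule can be pictured as "flush when any threshold is crossed". A conceptual sketch; the field names below are illustrative, not Pipelines configuration keys:

```ts
// Conceptual sketch of the batching behavior described above: a buffered batch
// is flushed once any configured threshold is reached.
interface BatchSettings {
  maxBytes: number;   // data volume threshold
  maxRows: number;    // row count threshold
  maxSeconds: number; // age threshold for the oldest buffered record
}

function shouldFlush(
  bufferedBytes: number,
  bufferedRows: number,
  secondsSinceFirstRecord: number,
  settings: BatchSettings
): boolean {
  return (
    bufferedBytes >= settings.maxBytes ||
    bufferedRows >= settings.maxRows ||
    secondsSinceFirstRecord >= settings.maxSeconds
  );
}
```
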
Finally, the batch of data is converted into output files, which are compressed and delivered to the configured R2 bucket. Transient failures, such as network errors, are retried automatically.

Once files have been successfully written to R2, the buffers are flushed.

## How a Pipeline handles updates

Data delivery is guaranteed even while you update an existing pipeline. Updating a pipeline effectively creates a new deployment that includes all of your previously configured options. Requests are gracefully re-routed to the new pipeline, while the old pipeline continues to write data to your destination. The old pipeline is spun down only after all of its data has been written out, so you won't lose data while updating a pipeline.

## How Pipelines scale

Pipelines are organized into shards. You can [customize the number of shards](/pipelines/build-with-pipelines/shards) to increase maximum throughput or to reduce the number of output files generated.

Each shard consists of layers of Durable Objects. Shards are stateless, so a pipeline can scale horizontally by increasing the number of shards.

## What if I send too much data? Do Pipelines communicate backpressure?

If you send too much data, the pipeline communicates backpressure by returning a 429 response to HTTP requests, or by throwing an error when you use the Workers API. Refer to the [limits](/pipelines/platform/limits) to learn how much volume a single pipeline can support.
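
For HTTP ingestion, one way to respond to backpressure is to back off and retry when a 429 is returned. A sketch, where the endpoint URL is a placeholder and the backoff schedule is just an example:

```ts
// Sketch: retry HTTP ingestion with exponential backoff when the pipeline
// signals backpressure with a 429. The endpoint URL is a placeholder.
async function sendWithBackoff(records: object[], maxAttempts = 5): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch("https://<your-pipeline-endpoint>", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(records),
    });

    if (response.ok) return;

    if (response.status === 429) {
      // Back off for 1s, 2s, 4s, ... before retrying.
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
      continue;
    }

    throw new Error(`Ingestion failed with status ${response.status}`);
  }

  throw new Error("Ingestion failed: retry attempts exhausted due to backpressure");
}
```
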