diff --git a/src/content/changelog/pipelines/2025-09-25-pipelines-sql.mdx b/src/content/changelog/pipelines/2025-09-25-pipelines-sql.mdx new file mode 100644 index 000000000000000..6582e979305b539 --- /dev/null +++ b/src/content/changelog/pipelines/2025-09-25-pipelines-sql.mdx @@ -0,0 +1,39 @@ +--- +title: Pipelines now supports SQL transformations and Apache Iceberg +description: Transform streaming data with SQL and write to Apache Iceberg tables in R2 +date: 2025-09-25T13:00:00 +products: + - pipelines +hidden: true +--- + +import { LinkCard } from "~/components"; + +Today, we're launching the new [Cloudflare Pipelines](/pipelines/): a streaming data platform that ingests events, transforms them with [SQL](/pipelines/sql-reference/select-statements/), and writes to [R2](/r2/) as [Apache Iceberg](https://iceberg.apache.org/) tables or Parquet files. + +Pipelines can receive events via [HTTP endpoints](/pipelines/streams/writing-to-streams/#send-via-http) or [Worker bindings](/pipelines/streams/writing-to-streams/#send-via-workers), transform them with SQL, and deliver to R2 with exactly-once guarantees. This makes it easy to build analytics-ready warehouses for server logs, mobile application events, IoT telemetry, or clickstream data without managing streaming infrastructure. + +For example, here's a pipeline that ingests clickstream events and filters out bot traffic while extracting domain information: + +```sql +INSERT into events_table +SELECT + user_id, + lower(event) AS event_type, + to_timestamp_micros(ts_us) AS event_time, + regexp_match(url, '^https?://([^/]+)')[1] AS domain, + url, + referrer, + user_agent +FROM events_json +WHERE event = 'page_view' + AND NOT regexp_like(user_agent, '(?i)bot|spider'); +``` + +Get started by creating a pipeline in the dashboard or running a single command in [Wrangler](/workers/wrangler/): + +```bash +npx wrangler pipelines setup +``` + +Check out our [getting started guide](/pipelines/getting-started/) to learn how to create a pipeline that delivers events to an [Iceberg table](/r2/data-catalog/) you can query with R2 SQL. Read more about today's announcement in our [blog post](https://blog.cloudflare.com/cloudflare-data-platform). 
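+
+For reference, events matching the query above could be posted to the stream's HTTP ingest endpoint with a request like the following. This is a minimal sketch: the stream ID in the URL is a placeholder (the exact endpoint is shown when you create the pipeline), and the field values are illustrative.
+
+```bash
+curl -X POST https://<STREAM-ID>.ingest.cloudflare.com \
+  -H "Content-Type: application/json" \
+  -d '[
+    {
+      "user_id": "user_123",
+      "event": "page_view",
+      "ts_us": 1758805200000000,
+      "url": "https://example.com/products",
+      "referrer": "https://www.google.com/",
+      "user_agent": "Mozilla/5.0"
+    }
+  ]'
+```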
diff --git a/src/content/dash-routes/index.json b/src/content/dash-routes/index.json index c9037feda8e9040..7cba51353839fc6 100644 --- a/src/content/dash-routes/index.json +++ b/src/content/dash-routes/index.json @@ -261,7 +261,7 @@ }, { "name": "Pipelines", - "deeplink": "/?to=/:account/workers/pipelines", + "deeplink": "/?to=/:account/pipelines", "parent": ["Storage & Databases"] }, { diff --git a/src/content/docs/pipelines/build-with-pipelines/index.mdx b/src/content/docs/pipelines/build-with-pipelines/index.mdx deleted file mode 100644 index 947b417cf1628ca..000000000000000 --- a/src/content/docs/pipelines/build-with-pipelines/index.mdx +++ /dev/null @@ -1,8 +0,0 @@ ---- -title: Build with Pipelines -pcx_content_type: navigation -sidebar: - order: 3 - group: - hideIndex: true ---- \ No newline at end of file diff --git a/src/content/docs/pipelines/build-with-pipelines/output-settings.mdx b/src/content/docs/pipelines/build-with-pipelines/output-settings.mdx deleted file mode 100644 index 3e899f5d1f545bf..000000000000000 --- a/src/content/docs/pipelines/build-with-pipelines/output-settings.mdx +++ /dev/null @@ -1,103 +0,0 @@ ---- -title: Configure output settings -pcx_content_type: how-to -sidebar: - order: 3 -head: - - tag: title - content: Configure output settings ---- - -import { Render, PackageManagers } from "~/components"; - -Pipelines convert a stream of records into output files and deliver the files to an R2 bucket in your account. This guide details how you can change the output destination and customize batch settings to generate query ready files. - -## Configure an R2 bucket as a destination -To create or update a pipeline using Wrangler, run the following command in a terminal: - -```sh -npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] -``` - -After running this command, you will be prompted to authorize Cloudflare Workers Pipelines to create an R2 API token on your behalf. Your pipeline uses the R2 API token to load data into your bucket. You can approve the request through the browser link which will open automatically. - -If you prefer not to authenticate this way, you can pass your [R2 API Token](/r2/api/tokens/) to Wrangler: -```sh -npx wrangler pipelines create [PIPELINE-NAME] --r2 [R2-BUCKET-NAME] --r2-access-key-id [ACCESS-KEY-ID] --r2-secret-access-key [SECRET-ACCESS-KEY] -``` - -## File format and compression -Output files are generated as Newline Delimited JSON files (`ndjson`). Each line in an output file maps to a single record. - -By default, output files are compressed in the `gzip` format. Compression can be turned off using the `--compression` flag: -```sh -npx wrangler pipelines update [PIPELINE-NAME] --compression none -``` - -Output files are named using a [ULID](https://github.com/ulid/spec) slug, followed by an extension. - -## Customize batch behavior -When configuring your pipeline, you can define how records are batched before they are delivered to R2. Batches of records are written out to a single output file. - -Batching can: -- Reduce the number of output files written to R2 and thus reduce the [cost of writing data to R2](/r2/pricing/#class-a-operations). -- Increase the size of output files making them more efficient to query. - -There are three ways to define how ingested data is batched: - -1. `batch-max-mb`: The maximum amount of data that will be batched in megabytes. Default, and maximum, is `100 MB`. -2. `batch-max-rows`: The maximum number of rows or events in a batch before data is written. 
Default, and maximum, is `10,000,000` rows. -3. `batch-max-seconds`: The maximum duration of a batch before data is written in seconds. Default, and maximum, is `300 seconds`. - -Batch definitions are hints. A pipeline will follow these hints closely, but batches might not be exact. - -All three batch definitions work together and whichever limit is reached first triggers the delivery of a batch. - -For example, a `batch-max-mb` = 100 MB and a `batch-max-seconds` = 100 means that if 100 MB of events are posted to the pipeline, the batch will be delivered. However, if it takes longer than 100 seconds for 100 MB of events to be posted, a batch of all the messages that were posted during those 100 seconds will be created. - -### Defining batch settings using Wrangler -You can use the following batch settings flags while creating or updating a pipeline: -* `--batch-max-mb` -* `--batch-max-rows` -* `--batch-max-seconds` - -For example: -```sh -npx wrangler pipelines update [PIPELINE-NAME] --batch-max-mb 100 --batch-max-rows 10000 --batch-max-seconds 300 -``` - -### Batch size limits - -| Setting | Default | Minimum | Maximum | -| ----------------------------------------- | ----------------| --------- | ----------- | -| Maximum Batch Size `batch-max-mb` | 100 MB | 1 MB | 100 MB | -| Maximum Batch Timeout `batch-max-seconds` | 300 seconds | 1 second | 300 seconds | -| Maximum Batch Rows `batch-max-rows` | 10,000,000 rows | 1 row | 10,000,000 rows | - - -## Deliver partitioned data -Partitioning organizes data into directories based on specific fields to improve query performance. Partitions reduce the amount of data scanned for queries, enabling faster reads. - -:::note -By default, Pipelines partition data by event date and time. This will be customizable in the future. -::: - -Output files are prefixed with event date and hour. For example, the output from a Pipeline in your R2 bucket might look like this: -```sh -- event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz -- event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz -``` - -## Deliver data to a prefix -You can specify an optional prefix for all the output files stored in your specified R2 bucket, using the flag `--r2-prefix`. - -For example: -```sh -npx wrangler pipelines update [PIPELINE-NAME] --r2-prefix test -``` - -After running the above command, the output files generated by your pipeline will be stored under the prefix `test`. Files will remain partitioned. Your output will look like this: -```sh -- test/event_date=2025-04-01/hr=15/01JQWBZCZBAQZ7RJNZHN38JQ7V.json.gz -- test/event_date=2025-04-01/hr=15/01JQWC16FXGP845EFHMG1C0XNW.json.gz -``` diff --git a/src/content/docs/pipelines/build-with-pipelines/shards.mdx b/src/content/docs/pipelines/build-with-pipelines/shards.mdx deleted file mode 100644 index 09833dd3cb15c58..000000000000000 --- a/src/content/docs/pipelines/build-with-pipelines/shards.mdx +++ /dev/null @@ -1,56 +0,0 @@ ---- -pcx_content_type: concept -title: Increase pipeline throughput -sidebar: - order: 11 ---- - -import { Render, PackageManagers } from "~/components"; - -A pipeline's maximum throughput can be increased by increasing the shard count. A single shard can handle approximately 7,000 requests per second, or can ingest 7 MB/s of data. - -By default, each pipeline is configured with two shards. 
To set the shard count, use the `--shard-count` flag while creating or updating a pipeline: -```sh -$ npx wrangler pipelines update [PIPELINE-NAME] --shard-count 10 -``` - -:::note -The default shard count will be set to `auto` in the future, with support for automatic horizontal scaling. -::: - -## How shards work -![Pipeline shards](~/assets/images/pipelines/shards.png) - -Each pipeline is composed of stateless, independent shards. These shards are spun up when a pipeline is created. Each shard is composed of layers of [Durable Objects](/durable-objects). The Durable Objects buffer data, replicate for durability, handle compression, and delivery to R2. - -When a record is sent to a pipeline: -1. The Pipelines [Worker](/workers) receives the record. -2. The record is routed to to one of the shards. -3. The record is handled by a set of Durable Objects, which commit the record to storage and replicate for durability. -4. Records accumulate until the [batch definitions](/pipelines/build-with-pipelines/output-settings/#customize-batch-behavior) are met. -5. The batch is written to an output file and optionally compressed. -6. The output file is delivered to the configured R2 bucket. - -Increasing the number of shards will increase the maximum throughput of a pipeline, as well as the number of output files created. - -### Example -Your workload might require making 5,000 requests per second to a pipeline. If you create a pipeline with a single shard, all 5,000 requests will be routed to the same shard. If your pipeline has been configured with a maximum batch duration of 1 second, every second, all 5,000 requests will be batched, and a single file will be delivered. - -Increasing the shard count to 2 will double the number of output files. The 5,000 requests will be split into 2,500 requests to each shard. Every second, each shard will create a batch of data, and deliver to R2. - -## Considerations while increasing the shard count -Increasing the shard count also increases the number of output files that your pipeline generates. This in turn increases the [cost of writing data to R2](/r2/pricing/#class-a-operations), as each file written to R2 counts as a single class A operation. Additionally, smaller files are slower, and more expensive, to query. Rather than setting the maximum, choose a shard count based on your workload needs. - -## Determine the right number of shards -Choose a shard count based on these factors: -* The number of requests per second you will make to your pipeline -* The amount of data per second you will send to your pipeline - -Each shard is capable of handling approximately 7,000 requests per second, or ingesting 7 MB/s of data. Either factor might act as the bottleneck, so choose the shard count based on the higher number. - -For example, if you estimate that you will ingest 70 MB/s, making 70,000 requests per second, setup a pipeline with 10 shards. However, if you estimate that you will ingest 70 MB/s while making 100,000 requests per second, setup a pipeline with 15 shards. 
- -## Limits -| Setting | Default | Minimum | Maximum | -| ----------------------------------------- | ----------- | --------- | ----------- | -| Shards per pipeline `shard-count` | 2 | 1 | 15 | diff --git a/src/content/docs/pipelines/build-with-pipelines/sources/http.mdx b/src/content/docs/pipelines/build-with-pipelines/sources/http.mdx deleted file mode 100644 index 3a44825656122ab..000000000000000 --- a/src/content/docs/pipelines/build-with-pipelines/sources/http.mdx +++ /dev/null @@ -1,90 +0,0 @@ ---- -title: Configure HTTP endpoint -pcx_content_type: concept -sidebar: - order: 1 -head: - - tag: title - content: Configure HTTP endpoint ---- - -import { Render, PackageManagers, DashButton } from "~/components"; - -Pipelines support data ingestion over HTTP. When you create a new pipeline using the default settings you will receive a globally scalable ingestion endpoint. To ingest data, make HTTP POST requests to the endpoint. - -```sh -$ npx wrangler@latest pipelines create my-clickstream-pipeline --r2-bucket my-bucket - -πŸŒ€ Authorizing R2 bucket "my-bucket" -πŸŒ€ Creating pipeline named "my-clickstream-pipeline" -βœ… Successfully created pipeline my-clickstream-pipeline - -Id: 0e00c5ff09b34d018152af98d06f5a1xvc -Name: my-clickstream-pipeline -Sources: - HTTP: - Endpoint: https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/ - Authentication: off - Format: JSON - Worker: - Format: JSON -Destination: - Type: R2 - Bucket: my-bucket - Format: newline-delimited JSON - Compression: GZIP -Batch hints: - Max bytes: 100 MB - Max duration: 300 seconds - Max records: 100,000 - -πŸŽ‰ You can now send data to your pipeline! - -Send data to your pipeline's HTTP endpoint: -curl "https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... }]' -``` - -## Authentication -You can secure your HTTP ingestion endpoint using Cloudflare API tokens. By default, authentication is turned off. To configure authentication, use the `--require-http-auth` flag while creating or updating a pipeline. - -```sh -$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --require-http-auth true -``` - -Once authentication is turned on, you will need to include a Cloudflare API token in your request headers. - -### Get API token - -1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com). -2. Go to **My Profile** > [API Tokens](https://dash.cloudflare.com/profile/api-tokens). -3. Select **Create Token**. -4. Choose the template for Workers Pipelines. Select **Continue to summary** > **Create token**. Make sure to copy the API token and save it securely. - -### Making authenticated requests - -Include the API token you created in the previous step in the headers for your request: - -```sh -curl https://.pipelines.cloudflare.com - -H "Content-Type: application/json" \ - -H "Authorization: Bearer ${API_TOKEN}" \ - -d '[{"foo":"bar"}, {"foo":"bar"}, {"foo":"bar"}]' -``` - -## Specifying CORS Settings -If you want to use your pipeline to ingest client side data, such as website clicks, you will need to configure your [Cross-Origin Resource Sharing (CORS) settings](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS). - -Without setting your CORS settings, browsers will restrict requests made to your pipeline endpoint. For example, if your website domain is `https://my-website.com`, and you want to post client side data to your pipeline at `https://.pipelines.cloudflare.com`, without CORS settings, the request will fail. 
- -To fix this, you need to configure your pipeline to accept requests from `https://my-website.com`. You can do so while creating or updating a pipeline, using the flag `--cors-origins`. You can specify multiple domains separated by a space. - -```sh -$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins https://mydomain.com http://localhost:8787 -``` - -You can specify that all cross origin requests are accepted. We recommend only using this option in development, and not for production use cases. -```sh -$ npx wrangler pipelines update [PIPELINE-NAME] --cors-origins "*" -``` - -After the `--cors-origins` have been set on your pipeline, your pipeline will respond to preflight requests and `POST` requests with the appropriate `Access-Control-Allow-Origin` headers set. diff --git a/src/content/docs/pipelines/build-with-pipelines/sources/index.mdx b/src/content/docs/pipelines/build-with-pipelines/sources/index.mdx deleted file mode 100644 index d2e0f6db11a9f70..000000000000000 --- a/src/content/docs/pipelines/build-with-pipelines/sources/index.mdx +++ /dev/null @@ -1,27 +0,0 @@ ---- -title: Sources -pcx_content_type: concept -sidebar: - order: 1 - group: - hideIndex: false ---- - -Pipelines let you ingest data from the following sources: -* [HTTP Clients](/pipelines/build-with-pipelines/sources/http), with optional authentication and CORS settings -* [Cloudflare Workers](/workers/), using the [Pipelines Workers API](/pipelines/build-with-pipelines/sources/workers-apis) - -Multiple sources can be active on a single pipeline simultaneously. For example, you can create a pipeline which accepts data from Workers and via HTTP. There is no limit to the number of source clients. Multiple Workers can be configured to send data to the same pipeline. - -Each pipeline can ingest up to 100β€―MB/s of data or accept up to 100,000 requests per second, aggregated across all sources. - -## Configuring allowed sources -By default, ingestion via HTTP and from Workers is turned on. You can configure the allowed sources by using the `--source` flag while creating or updating a pipeline. - -For example, to create a pipeline which only accepts data via a Worker, you can run this command: -```sh -$ npx wrangler pipelines create [PIPELINE-NAME] --r2-bucket [R2-BUCKET-NAME] --source worker -``` - -## Accepted data formats -Pipelines accept arrays of valid JSON objects. You can send multiple objects in a single request, provided the total data volume is within the [documented limits](/pipelines/platform/limits). Sending data in a different format will result in an error. diff --git a/src/content/docs/pipelines/build-with-pipelines/sources/workers-apis.mdx b/src/content/docs/pipelines/build-with-pipelines/sources/workers-apis.mdx deleted file mode 100644 index ba53067fc394c41..000000000000000 --- a/src/content/docs/pipelines/build-with-pipelines/sources/workers-apis.mdx +++ /dev/null @@ -1,71 +0,0 @@ ---- -title: Workers API -pcx_content_type: concept -sidebar: - order: 2 -head: - - tag: title - content: Workers API - ---- - -import { Render, PackageManagers, WranglerConfig } from "~/components"; - -Pipelines exposes an API directly to your [Workers](/workers) scripts via the [bindings](/workers/runtime-apis/bindings/#what-is-a-binding) concept. Bindings allow you to securely send data to a pipeline without having to manage API keys or clients. Sending data via a Worker is enabled by default. 
- -## Send data from a Worker -### Setup a binding -Bind to a pipeline by defining a `pipelines` binding within your Wrangler configuration. For example: - - - -```toml title="wrangler.toml" -#:schema node_modules/wrangler/config-schema.json -name = "pipeline-starter" -main = "src/index.ts" -compatibility_date = "2025-04-01" - -[[pipelines]] -pipeline = "" # The name of your Pipeline -binding = "PIPELINE" # The binding name, accessed using env.MY_PIPELINE -``` - - - -You can bind multiple pipelines to a Worker. - -### Send data -The Pipelines binding exposes a `send()` method. For example, to log inbound HTTP requests to your Worker: - -```ts -export default { - async fetch(request, env, ctx): Promise { - let log = { - url: request.url, - method: request.method, - headers: Object.fromEntries(request.headers), - }; - await env.PIPELINE.send([log]); - return new Response('Hello World!'); - }, -} satisfies ExportedHandler; -``` - -## Workers API -### `Pipeline` -A binding which allows a Worker to send messages to a pipeline. - -```ts -interface Pipeline { - send(records: PipelineRecord[]): Promise; -} -``` - -* `send(records)`: `Promise` - - * Sends records to the pipeline. The body must be an array of objects which are JSON serializable. - * When the promise resolves, the records are confirmed to be ingested. - -:::note -When running your Worker locally, pipelines are partially simulated. Worker code which sends data to a pipeline will execute successfully. However, the full pipeline, including batching & writing to R2, will not be executed locally. -::: \ No newline at end of file diff --git a/src/content/docs/pipelines/concepts/how-pipelines-work.mdx b/src/content/docs/pipelines/concepts/how-pipelines-work.mdx deleted file mode 100644 index 93e1a03015a48bb..000000000000000 --- a/src/content/docs/pipelines/concepts/how-pipelines-work.mdx +++ /dev/null @@ -1,48 +0,0 @@ ---- -pcx_content_type: concept -title: How Pipelines work -sidebar: - order: 1 ---- - -Cloudflare Pipelines let you ingest data from a source and deliver to a sink. It is built for high volume, real time data streams. Each pipeline can ingest up to 100 MB/s of data, via HTTP or a Worker, and load the data as files in an R2 bucket. - -![Pipelines Architecture](~/assets/images/pipelines/architecture.png) - -## Supported sources, data formats, and sinks - -### Sources -Pipelines supports the following sources: -* [HTTP Clients](/pipelines/build-with-pipelines/sources/http), with optional authentication and CORS settings -* [Cloudflare Workers](/workers/), using the [Pipelines Workers API](/pipelines/build-with-pipelines/sources/workers-apis) - -Multiple sources can be active on a single pipeline simultaneously. For example, you can create a pipeline which accepts data from Workers and via HTTP. Multiple workers can be configured to send data to the same pipeline. There is no limit to the number of source clients. - -### Data format -Pipelines can ingest JSON serializable records. - -### Sinks -Pipelines supports delivering data into [R2 Object Storage](/r2/). Ingested data is delivered as newline delimited JSON files (`ndjson`) with optional compression. Multiple pipelines can be configured to deliver data to the same R2 bucket. - -## Data durability -Pipelines are designed to be reliable. Any data which is successfully ingested will be delivered, at least once, to the configured R2 bucket, provided that the [R2 API credentials associated with a pipeline](/r2/api/tokens/) remain valid. Ordering of records is best effort. 
- -Each pipeline maintains a storage buffer. Requests to send data to a pipeline receive a successful response only after the data is committed to this storage buffer. - -Ingested data accumulates, until a sufficiently [large batch of data](/pipelines/build-with-pipelines/output-settings/#customize-batch-behavior) has been filled. Once the batch reaches its target size, the entire batch of data is converted to a file and delivered to R2. - -Transient failures, such as network connectivity issues, are automatically retried. - -However, if the [R2 API credentials associated with a pipeline](/r2/api/tokens/) expire or are revoked, data delivery will fail. In this scenario, some data might continue to accumulate in the buffers, but the pipeline will eventually start rejecting requests once the buffers are full. - -## Updating a pipeline -Pipelines update without dropping records. Updating an existing pipeline creates a new instance of the pipeline. Requests are gracefully re-routed to the new instance. The old instance continues to write data into the configured sink. Once the old instance is fully drained, it is spun down. - -This means that updates might take a few minutes to go into effect. For example, if you update a pipeline's sink, previously ingested data might continue to be delivered into the old sink. - -## Backpressure behavior -If you send too much data, the pipeline will communicate backpressure by returning a 429 response to HTTP requests, or throwing an error if using the Workers API. Refer to the [limits](/pipelines/platform/limits) to learn how much volume a single pipeline can support. You might see 429 responses if you are sending too many requests or sending too much data. - -If you are consistently seeing backpressure from your pipeline, consider the following strategies: -* Increase the [shard count](/pipelines/build-with-pipelines/shards) to increase the maximum throughput of your pipeline. -* Send data to a second pipeline if you receive an error. You can set up multiple pipelines to write to the same R2 bucket. diff --git a/src/content/docs/pipelines/concepts/index.mdx b/src/content/docs/pipelines/concepts/index.mdx deleted file mode 100644 index 350ff855e500264..000000000000000 --- a/src/content/docs/pipelines/concepts/index.mdx +++ /dev/null @@ -1,8 +0,0 @@ ---- -title: Concepts -pcx_content_type: navigation -sidebar: - order: 3 - group: - hideIndex: true ---- \ No newline at end of file diff --git a/src/content/docs/pipelines/getting-started.mdx b/src/content/docs/pipelines/getting-started.mdx index 0ca35f36abebfbb..63e5039645978a1 100644 --- a/src/content/docs/pipelines/getting-started.mdx +++ b/src/content/docs/pipelines/getting-started.mdx @@ -1,137 +1,357 @@ --- -title: Getting started pcx_content_type: get-started +title: Getting started +head: [] sidebar: - order: 1 -head: - - tag: title - content: Getting started wih Pipelines + order: 2 +description: Create your first pipeline to ingest streaming data and write to R2 Data Catalog as an Apache Iceberg table. --- -import { Render, PackageManagers, Details } from "~/components"; +import { + Render, + Steps, + Tabs, + TabItem, + DashButton, + LinkCard, +} from "~/components"; -Cloudflare Pipelines allows you to ingest load high volumes of real time streaming data, and load into [R2 Object Storage](/r2/), without managing any infrastructure. +This guide will instruct you through: -By following this guide, you will: -1. Setup an R2 bucket. -2. Create a pipeline, with HTTP as a source, and an R2 bucket as a sink. 
-3. Send data to your pipeline's HTTP ingestion endpoint. -4. Verify the output delivered to R2. +- Creating your first [R2 bucket](/r2/buckets/) and enabling its [data catalog](/r2/data-catalog/). +- Creating an [API token](/r2/api/tokens/) needed for pipelines to authenticate with your data catalog. +- Creating your first pipeline with a simple ecommerce schema that writes to an [Apache Iceberg](https://iceberg.apache.org/) table managed by R2 Data Catalog. +- Sending sample ecommerce data via HTTP endpoint. +- Validating data in your bucket and querying it with R2 SQL. -:::note +## Prerequisites -Pipelines is in **public beta**, and any developer with a [paid Workers plan](/workers/platform/pricing/#workers) can start using Pipelines immediately. + -::: +## 1. Create an R2 bucket -*** + + -## Prerequisites + +1. If not already logged in, run: -To use Pipelines, you will need: + ``` + npx wrangler login + ``` - +2. Create an R2 bucket: -## 1. Set up an R2 bucket + ``` + npx wrangler r2 bucket create pipelines-tutorial + ``` -Create a bucket by following the [get started guide for R2](/r2/get-started/), or by running the command below: + -```sh -npx wrangler r2 bucket create my-bucket -``` + + + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select **Create bucket**. + +3. Enter the bucket name: pipelines-tutorial -Save the bucket name for the next step. +4. Select **Create bucket**. + + + -## 2. Create a Pipeline +## 2. Enable R2 Data Catalog -To create a pipeline using Wrangler, run the following command in a terminal, and specify: + + -- The name of your pipeline -- The name of the R2 bucket you created in step 1 +Enable the catalog on your R2 bucket: -```sh -npx wrangler pipelines create my-clickstream-pipeline --r2-bucket my-bucket --batch-max-seconds 5 --compression none ``` +npx wrangler r2 bucket catalog enable pipelines-tutorial +``` + +When you run this command, take note of the "Warehouse" and "Catalog URI". You will need these later. + + + -After running this command, you will be prompted to authorize Cloudflare Workers Pipelines to create an R2 API token on your behalf. These tokens used by your pipeline when loading data into your bucket. You can approve the request through the browser link which will open automatically. - -
- Choosing a pipeline name -When choosing a name for your pipeline: - -- Ensure it is descriptive and relevant to the type of events you intend to ingest. You cannot change the name of the pipeline after creating it. -- The pipeline name must be between 1 and 63 characters long. -- The name cannot contain special characters outside dashes (`-`). -- The name must start and end with a letter or a number. -
- -You will notice two optional flags are set while creating the pipeline: `--batch-max-seconds` and `--compression`. These flags are added to make it faster for you to see the output of your first pipeline. For production use cases, we recommend keeping the default settings. - -Once you create your pipeline, you will receive a summary of your pipeline's configuration, as well as an HTTP endpoint which you can post data to: - -```sh -πŸŒ€ Authorizing R2 bucket "my-bucket" -πŸŒ€ Creating pipeline named "my-clickstream-pipeline" -βœ… Successfully created pipeline my-clickstream-pipeline - -Id: [PIPELINE-ID] -Name: my-clickstream-pipeline -Sources: - HTTP: - Endpoint: https://[PIPELINE-ID].pipelines.cloudflare.com/ - Authentication: off - Format: JSON - Worker: - Format: JSON -Destination: - Type: R2 - Bucket: my-bucket - Format: newline-delimited JSON - Compression: GZIP -Batch hints: - Max bytes: 100 MB - Max duration: 300 seconds - Max records: 100,000 - -πŸŽ‰ You can now send data to your Pipeline! - -Send data to your Pipeline's HTTP endpoint: -curl "https://[PIPELINE-ID].pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... }]' - -To send data to your Pipeline from a Worker, add the following configuration to your config file: + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select the bucket: pipelines-tutorial. + +3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select **Enable**. + +4. Once enabled, note the **Catalog URI** and **Warehouse name**. + +
+
+ +## 3. Create an API token + +Pipelines must authenticate to R2 Data Catalog with an [R2 API token](/r2/api/tokens/) that has catalog and R2 permissions. + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select **Manage API tokens**. + +3. Select **Create Account API token**. + +4. Give your API token a name. + +5. Under **Permissions**, choose the **Admin Read & Write** permission. + +6. Select **Create Account API Token**. + +7. Note the **Token value**. + + + +:::note +This token also includes the R2 SQL Read permission, which allows you to query your data with R2 SQL. +::: + +## 4. Create your first pipeline + + + + +First, create a schema file that defines your ecommerce data structure: + +**Create `schema.json`:** +```json { - "pipelines": [ + "fields": [ + { + "name": "user_id", + "type": "string", + "required": true + }, + { + "name": "event_type", + "type": "string", + "required": true + }, + { + "name": "product_id", + "type": "string", + "required": false + }, { - "pipeline": "my-clickstream-pipeline", - "binding": "PIPELINE" + "name": "amount", + "type": "float64", + "required": false } ] } ``` -## 3. Post data to your pipeline +Use the interactive setup to create a pipeline that writes to R2 Data Catalog: -Use a curl command in your terminal to post an array of JSON objects to the endpoint you received in step 1. +```bash +npx wrangler pipelines setup +``` -```sh -curl -H "Content-Type:application/json" \ - -d '[{"event":"viewedCart", "timestamp": "2025-04-03T15:42:30Z"},{"event":"cartAbandoned", "timestamp": "2025-04-03T15:42:37Z"}]' \ - +Follow the prompts: + +1. **Pipeline name**: Enter `ecommerce` + +2. **Stream configuration**: + - Enable HTTP endpoint: `yes` + - Require authentication: `no` (for simplicity) + - Configure custom CORS origins: `no` + - Schema definition: `Load from file` + - Schema file path: `schema.json` (or your file path) + +3. **Sink configuration**: + - Destination type: `Data Catalog Table` + - R2 bucket name: `pipelines-tutorial` + - Namespace: `default` + - Table name: `ecommerce` + - Catalog API token: Enter your token from step 3 + - Compression: `zstd` + - Roll file when size reaches (MB): `100` + - Roll file when time reaches (seconds): `10` (for faster data visibility in this tutorial) + +4. **SQL transformation**: Choose `Use simple ingestion query` to use: + ```sql + INSERT INTO ecommerce_sink SELECT * FROM ecommerce_stream + ``` + +After setup completes, note the HTTP endpoint URL displayed in the final output. + + + + + +1. In the Cloudflare dashboard, go to **Pipelines** > **Pipelines**. + + +2. Select **Create Pipeline**. + +3. **Connect to a Stream**: + - Pipeline name: `ecommerce` + - Enable HTTP endpoint for sending data: Enabled + - HTTP authentication: Disabled (default) + - Select **Next** + +4. **Define Input Schema**: + - Select **JSON editor** + - Copy in the schema: + ```json + { + "fields": [ + { + "name": "user_id", + "type": "string", + "required": true + }, + { + "name": "event_type", + "type": "string", + "required": true + }, + { + "name": "product_id", + "type": "string", + "required": false + }, + { + "name": "amount", + "type": "f64", + "required": false + } + ] + } + ``` + - Select **Next** + +5. **Define Sink**: + - Select your R2 bucket: `pipelines-tutorial` + - Storage type: **R2 Data Catalog** + - Namespace: `default` + - Table name: `ecommerce` + - **Advanced Settings**: Change **Maximum Time Interval** to `10 seconds` + - Select **Next** + +6. 
**Credentials**: + - Disable **Automatically create an Account API token for your sink** + - Enter **Catalog Token** from step 3 + - Select **Next** + +7. **Pipeline Definition**: + - Leave the default SQL query: + ```sql + INSERT INTO ecommerce_sink SELECT * FROM ecommerce_stream; + ``` + - Select **Create Pipeline** + +8. After pipeline creation, note the **Stream ID** for the next step. + + + + + +## 5. Send sample data + +Send ecommerce events to your pipeline's HTTP endpoint: + +```bash +curl -X POST https://{stream-id}.ingest.cloudflare.com \ + -H "Content-Type: application/json" \ + -d '[ + { + "user_id": "user_12345", + "event_type": "purchase", + "product_id": "widget-001", + "amount": 29.99 + }, + { + "user_id": "user_67890", + "event_type": "view_product", + "product_id": "widget-002" + }, + { + "user_id": "user_12345", + "event_type": "add_to_cart", + "product_id": "widget-003", + "amount": 15.50 + } + ]' ``` -Once the pipeline successfully accepts the data, you will receive a success message. +Replace `{stream-id}` with your actual stream endpoint from the pipeline setup. + +## 6. Validate data in your bucket + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. -You can continue posting data to the pipeline. The pipeline will automatically buffer ingested data. Based on the batch settings (`--batch-max-seconds`) specified in step 2, a batch will be generated every 5 seconds, turned into a file, and written out to your R2 bucket. +2. Select your bucket: `pipelines-tutorial`. + +3. You should see Iceberg metadata files and data files created by your pipeline. Note: If you aren't seeing any files in your bucket, try waiting a couple of minutes and trying again. + +4. The data is organized in the Apache Iceberg format with metadata tracking table versions. + + +## 7. Query your data using R2 SQL + +Set up your environment to use R2 SQL: + +```bash +export WRANGLER_R2_SQL_AUTH_TOKEN=YOUR_API_TOKEN +``` + +Or create a `.env` file with: + +``` +WRANGLER_R2_SQL_AUTH_TOKEN=YOUR_API_TOKEN +``` + +Where `YOUR_API_TOKEN` is the token you created in step 3. For more information on setting environment variables, refer to [Wrangler system environment variables](/workers/wrangler/system-environment-variables/). + +Query your data: + +```bash +npx wrangler r2 sql query "YOUR_WAREHOUSE_NAME" " +SELECT + user_id, + event_type, + product_id, + amount +FROM default.ecommerce +WHERE event_type = 'purchase' +LIMIT 10" +``` -## 4. Verify in R2 +Replace `YOUR_WAREHOUSE_NAME` with the warehouse name from step 2. -Open the [R2 dashboard](https://dash.cloudflare.com/?to=/:account/r2/overview), and navigate to the R2 bucket you created in step 1. You will see a directory, labeled with today's date (such as `event_date=2025-04-05`). Click on the directory, and you'll see a sub-directory with the current hour (such as `hr=04`). You should see a newline delimited JSON file, containing the data you posted in step 3. Download the file, and open it in a text editor of your choice, to verify that the data posted in step 2 is present. +You can also query this table with any engine that supports Apache Iceberg. To learn more about connecting other engines to R2 Data Catalog, refer to [Connect to Iceberg engines](/r2/data-catalog/config-examples/). -*** +## Learn more -## Next steps + -* Learn about how to [setup authentication, or CORS settings](/pipelines/build-with-pipelines/sources/http), on your HTTP endpoint. 
-* Send data to your Pipeline from a Cloudflare Worker using the [Workers API documentation](/pipelines/build-with-pipelines/sources/workers-apis). + -If you have any feature requests or notice any bugs, share your feedback directly with the Cloudflare team by joining the [Cloudflare Developers community on Discord](https://discord.cloudflare.com). + diff --git a/src/content/docs/pipelines/index.mdx b/src/content/docs/pipelines/index.mdx index eab41df38cac3dd..c43c253950b5332 100644 --- a/src/content/docs/pipelines/index.mdx +++ b/src/content/docs/pipelines/index.mdx @@ -1,5 +1,5 @@ --- -title: Overview +title: Cloudflare Pipelines pcx_content_type: overview sidebar: @@ -11,85 +11,66 @@ head: content: Pipelines --- -import { CardGrid, Description, Feature, LinkTitleCard, Plan, RelatedProduct } from "~/components"; +import { + CardGrid, + Description, + Feature, + LinkTitleCard, + Plan, + RelatedProduct, +} from "~/components"; + +:::note +Pipelines is in **open beta**, and any developer with a [Workers Paid plan](/workers/platform/pricing/) can start using it. Currently, outside of standard R2 storage and operations, you will not be billed for your use of Pipelines. +::: -Ingest real time data streams and load into R2, using Cloudflare Pipelines. +Ingest, transform, and load streaming data into Apache Iceberg or Parquet in R2. -Cloudflare Pipelines lets you ingest high volumes of real time data, without managing any infrastructure. Ingested data is automatically batched, written to output files, and delivered to an [R2 bucket](/r2/) in your account. You can use Pipelines to build a data lake of clickstream data, or to store events from a Worker. +Cloudflare Pipelines ingests events, transforms them with SQL, and delivers them to R2 as [Iceberg tables](/r2/data-catalog/) or as Parquet and JSON files. -## Create your first pipeline -You can setup a pipeline to ingest data via HTTP, and deliver output to R2, with a single command: +Whether you're processing server logs, mobile application events, IoT telemetry, or clickstream data, Pipelines provides durable ingestion via HTTP endpoints or Worker bindings, SQL-based transformations, and exactly-once delivery to R2. This makes it easy to build analytics-ready data warehouses and lakehouses without managing streaming infrastructure. + +Create your first pipeline by following the [getting started guide](/pipelines/getting-started) or running this [Wrangler](/workers/wrangler/) command: ```sh -$ npx wrangler@latest pipelines create my-clickstream-pipeline --r2-bucket my-bucket - -πŸŒ€ Authorizing R2 bucket "my-bucket" -πŸŒ€ Creating pipeline named "my-clickstream-pipeline" -βœ… Successfully created pipeline my-clickstream-pipeline - -Id: 0e00c5ff09b34d018152af98d06f5a1xvc -Name: my-clickstream-pipeline -Sources: - HTTP: - Endpoint: https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/ - Authentication: off - Format: JSON - Worker: - Format: JSON -Destination: - Type: R2 - Bucket: my-bucket - Format: newline-delimited JSON - Compression: GZIP -Batch hints: - Max bytes: 100 MB - Max duration: 300 seconds - Max records: 100,000 - -πŸŽ‰ You can now send data to your pipeline! - -Send data to your pipeline's HTTP endpoint: -curl "https://0e00c5ff09b34d018152af98d06f5a1xvc.pipelines.cloudflare.com/" -d '[{ ...JSON_DATA... 
}]' - -To send data to your pipeline from a Worker, add the following configuration to your config file: -{ - "pipelines": [ - { - "pipeline": "my-clickstream-pipeline", - "binding": "PIPELINE" - } - ] -} +npx wrangler pipelines setup ``` -Refer to the [getting started guide](/pipelines/getting-started) to start building with pipelines. - -:::note -While in beta, you will not be billed for Pipelines usage. You will be billed only for [R2 usage](/r2/pricing/). -::: +--- -*** ## Features - -Each pipeline generates a globally scalable HTTP endpoint, which supports authentication and CORS settings. + + +Build your first pipeline to ingest data via HTTP or Workers, apply SQL transformations, and deliver to R2 as Iceberg tables or Parquet files. + - -Send data to a pipeline directly from a Cloudflare Worker. + + +Durable, buffered queues that receive events via HTTP endpoints or Worker bindings. + + + + + +Connect streams to sinks with SQL transformations that validate, filter, transform, and enrich your data at ingestion time. + - -Define batch sizes and enable compression to generate output files that are efficient to query. + + +Configure destinations for your data. Write Apache Iceberg tables to R2 Data Catalog or export as Parquet and JSON files. + -*** +--- ## Related products @@ -105,22 +86,36 @@ Cloudflare Workers allows developers to build serverless applications and deploy -*** +--- ## More resources - -Learn about pipelines limits. + + Learn about pipelines limits. - -Follow @CloudflareDev on Twitter to learn about product announcements, and what is new in Cloudflare Workers. + + Follow @CloudflareDev on Twitter to learn about product announcements, and + what is new in Cloudflare Workers. - -Connect with the Workers community on Discord to ask questions, show what you are building, and discuss the platform with other developers. + + Connect with the Workers community on Discord to ask questions, show what you + are building, and discuss the platform with other developers. diff --git a/src/content/docs/pipelines/observability/index.mdx b/src/content/docs/pipelines/observability/index.mdx index c1576788609ddc5..a19d833118ed90a 100644 --- a/src/content/docs/pipelines/observability/index.mdx +++ b/src/content/docs/pipelines/observability/index.mdx @@ -2,11 +2,11 @@ title: Observability pcx_content_type: navigation sidebar: - order: 5 + order: 6 group: hideIndex: true --- -import { DirectoryListing } from "~/components" +import { DirectoryListing } from "~/components"; - \ No newline at end of file + diff --git a/src/content/docs/pipelines/observability/metrics.mdx b/src/content/docs/pipelines/observability/metrics.mdx index 630bdc3b3a8b201..8e9f13324456efb 100644 --- a/src/content/docs/pipelines/observability/metrics.mdx +++ b/src/content/docs/pipelines/observability/metrics.mdx @@ -5,46 +5,60 @@ sidebar: order: 10 --- -Pipelines expose metrics which allow you to measure data ingested, requests made, and data delivered. +Pipelines expose metrics which allow you to measure data ingested, processed, and delivered to sinks. -The metrics displayed in the [Cloudflare dashboard](https://dash.cloudflare.com/) are queried from Cloudflare’s [GraphQL Analytics API](/analytics/graphql-api/). You can access the metrics [programmatically](#query-via-the-graphql-api) via GraphQL or HTTP client. +The metrics displayed in the [Cloudflare dashboard](https://dash.cloudflare.com/) are queried from Cloudflare's [GraphQL Analytics API](/analytics/graphql-api/). 
You can access the metrics [programmatically](#query-via-the-graphql-api) via GraphQL or HTTP client. ## Metrics -### Ingestion +### Operator metrics -Pipelines export the below metrics within the `pipelinesIngestionAdaptiveGroups` dataset. +Pipelines export the below metrics within the `AccountPipelinesOperatorAdaptiveGroups` dataset. These metrics track data read and processed by pipeline operators. -| Metric | GraphQL Field Name | Description | -| ---------------- | ------------------ | ------------------------------------------------------------ | -| Ingestion Events | `count` | Number of ingestion events, or requests made, to a pipeline. | -| Ingested Bytes | `ingestedBytes` | Total number of bytes ingested | -| Ingested Records | `ingestedRecords` | Total number of records ingested | +| Metric | GraphQL Field Name | Description | +| ------------- | ------------------ | --------------------------------------------------------------------------------------------------------- | +| Bytes In | `bytesIn` | Total number of bytes read by the pipeline (filter by `streamId_neq: ""` to get data read from streams) | +| Records In | `recordsIn` | Total number of records read by the pipeline (filter by `streamId_neq: ""` to get data read from streams) | +| Decode Errors | `decodeErrors` | Number of messages that could not be deserialized in the stream schema | -The `pipelinesIngestionAdaptiveGroups` dataset provides the following dimensions for filtering and grouping queries: +The `AccountPipelinesOperatorAdaptiveGroups` dataset provides the following dimensions for filtering and grouping queries: - `pipelineId` - ID of the pipeline -- `datetime` - Timestamp of the ingestion event -- `date` - Timestamp of the ingestion event, truncated to the start of a day -- `datetimeHour` - Timestamp of the ingestion event, truncated to the start of an hour -- `datetimeMinute` - Timestamp of the ingestion event, truncated to the start of a minute +- `streamId` - ID of the source stream +- `datetime` - Timestamp of the operation +- `date` - Timestamp of the operation, truncated to the start of a day +- `datetimeHour` - Timestamp of the operation, truncated to the start of an hour -### Delivery +### Sink metrics -Pipelines export the below metrics within the `pipelinesDeliveryAdaptiveGroups` dataset. +Pipelines export the below metrics within the `AccountPipelinesSinkAdaptiveGroups` dataset. These metrics track data delivery to sinks. 
-| Metric | GraphQL Field Name | Description | -| ---------------- | ------------------ | ----------------------------------------- | -| Ingestion Events | `count` | Number of delivery events to an R2 bucket | -| Delivered Bytes | `deliveredBytes` | Total number of bytes ingested | +| Metric | GraphQL Field Name | Description | +| -------------------------- | -------------------------- | ------------------------------------------------------------ | +| Bytes Written | `bytesWritten` | Total number of bytes written to the sink, after compression | +| Records Written | `recordsWritten` | Total number of records written to the sink | +| Files Written | `filesWritten` | Number of files written to the sink | +| Row Groups Written | `rowGroupsWritten` | Number of row groups written (for Parquet files) | +| Uncompressed Bytes Written | `uncompressedBytesWritten` | Total number of bytes written before compression | -The `pipelinesDeliverynAdaptiveGroups` dataset provides the following dimensions for filtering and grouping queries: +The `AccountPipelinesSinkAdaptiveGroups` dataset provides the following dimensions for filtering and grouping queries: - `pipelineId` - ID of the pipeline -- `datetime` - Timestamp of the delivery event -- `date` - Timestamp of the delivery event, truncated to the start of a day -- `datetimeHour` - Timestamp of the delivery event, truncated to the start of an hour -- `datetimeMinute` - Timestamp of the delivery event, truncated to the start of a minute +- `sinkId` - ID of the destination sink +- `datetime` - Timestamp of the operation +- `date` - Timestamp of the operation, truncated to the start of a day +- `datetimeHour` - Timestamp of the operation, truncated to the start of an hour + +## View metrics in the dashboard + +Per-pipeline analytics are available in the Cloudflare dashboard. To view current and historical metrics for a pipeline: + +1. Log in to the [Cloudflare dashboard](https://dash.cloudflare.com) and select your account. +2. Go to **Pipelines** > **Pipelines**. +3. Select a pipeline. +4. Go to the **Metrics** tab to view its metrics. + +You can optionally select a time window to query. This defaults to the last 24 hours. ## Query via the GraphQL API @@ -52,10 +66,12 @@ You can programmatically query analytics for your pipelines via the [GraphQL Ana Pipelines GraphQL datasets require an `accountTag` filter with your Cloudflare account ID. -### Measure total bytes & records ingested over time period +### Measure operator metrics over time period + +This query returns the total bytes and records read by a pipeline from streams, along with any decode errors. ```graphql graphql-api-explorer -query PipelineIngestion( +query PipelineOperatorMetrics( $accountTag: string! $pipelineId: string! $datetimeStart: Time! @@ -63,17 +79,19 @@ query PipelineIngestion( ) { viewer { accounts(filter: { accountTag: $accountTag }) { - pipelinesIngestionAdaptiveGroups( + accountPipelinesOperatorAdaptiveGroups( limit: 10000 filter: { pipelineId: $pipelineId + streamId_neq: "" datetime_geq: $datetimeStart datetime_leq: $datetimeEnd } ) { sum { - ingestedBytes - ingestedRecords + bytesIn + recordsIn + decodeErrors } } } @@ -81,27 +99,35 @@ query PipelineIngestion( } ``` -### Measure volume of data delivered +### Measure sink delivery metrics + +This query returns detailed metrics about data written to a specific sink, including file and compression statistics. ```graphql graphql-api-explorer -query PipelineDelivery( +query PipelineSinkMetrics( $accountTag: string! 
$pipelineId: string! + $sinkId: string! $datetimeStart: Time! $datetimeEnd: Time! ) { viewer { accounts(filter: { accountTag: $accountTag }) { - pipelinesDeliveryAdaptiveGroups( + accountPipelinesSinkAdaptiveGroups( limit: 10000 filter: { pipelineId: $pipelineId + sinkId: $sinkId datetime_geq: $datetimeStart datetime_leq: $datetimeEnd } ) { sum { - deliveredBytes + bytesWritten + recordsWritten + filesWritten + rowGroupsWritten + uncompressedBytesWritten } } } diff --git a/src/content/docs/pipelines/pipelines-api.mdx b/src/content/docs/pipelines/pipelines-api.mdx index e0ee588dcfba75b..9a1ffe1f418abd8 100644 --- a/src/content/docs/pipelines/pipelines-api.mdx +++ b/src/content/docs/pipelines/pipelines-api.mdx @@ -4,5 +4,4 @@ title: Pipelines REST API external_link: /api/resources/pipelines/ sidebar: order: 10 - --- diff --git a/src/content/docs/pipelines/pipelines/index.mdx b/src/content/docs/pipelines/pipelines/index.mdx new file mode 100644 index 000000000000000..7ef3d5198b77bba --- /dev/null +++ b/src/content/docs/pipelines/pipelines/index.mdx @@ -0,0 +1,20 @@ +--- +title: Pipelines +pcx_content_type: navigation +sidebar: + order: 4 +--- + +import { LinkCard } from "~/components"; + +Pipelines connect [streams](/pipelines/streams/) and [sinks](/pipelines/sinks/) via SQL transformations, which can modify events before writing them to storage. This enables you to shift left, pushing validation, schematization, and processing to your ingestion layer to make your queries easy, fast, and correct. + +Pipelines enable you to filter, transform, enrich, and restructure events in real-time as data flows from streams to sinks. + +## Learn more + + diff --git a/src/content/docs/pipelines/pipelines/manage-pipelines.mdx b/src/content/docs/pipelines/pipelines/manage-pipelines.mdx new file mode 100644 index 000000000000000..50d42ac3d1cad90 --- /dev/null +++ b/src/content/docs/pipelines/pipelines/manage-pipelines.mdx @@ -0,0 +1,154 @@ +--- +pcx_content_type: configuration +title: Manage pipelines +description: Create, configure, and manage SQL transformations between streams and sinks +sidebar: + order: 1 +--- + +import { Steps, DashButton } from "~/components"; + +Learn how to: + +- Create pipelines with SQL transformations +- View pipeline configuration and SQL +- Delete pipelines when no longer needed + +## Create a pipeline + +Pipelines execute SQL statements that define how data flows from streams to sinks. + +### Dashboard + + +1. In the Cloudflare dashboard, go to the **Pipelines** page. + + +2. Select **Create Pipeline** to launch the pipeline creation wizard. +3. Follow the wizard to configure your stream, sink, and SQL transformation. + + +### Wrangler CLI + +To create a pipeline, run the [`pipelines create`](/workers/wrangler/commands/#pipelines-create) command: + +```bash +npx wrangler pipelines create my-pipeline \ + --sql "INSERT INTO my_sink SELECT * FROM my_stream" +``` + +You can also provide SQL from a file: + +```bash +npx wrangler pipelines create my-pipeline \ + --sql-file pipeline.sql +``` + +Alternatively, to use the interactive setup wizard that helps you configure a stream, sink, and pipeline, run the [`pipelines setup`](/workers/wrangler/commands/#pipelines-setup) command: + +```bash +npx wrangler pipelines setup +``` + +### SQL transformations + +Pipelines support SQL statements for data transformation. For complete syntax, supported functions, and data types, see the [SQL reference](/pipelines/sql-reference/). 
Common patterns include: + +#### Basic data flow + +Transfer all data from stream to sink: + +```sql +INSERT INTO my_sink SELECT * FROM my_stream +``` + +#### Filtering events + +Filter events based on conditions: + +```sql +INSERT INTO my_sink +SELECT * FROM my_stream +WHERE event_type = 'purchase' AND amount > 100 +``` + +#### Selecting specific fields + +Choose only the fields you need: + +```sql +INSERT INTO my_sink +SELECT user_id, event_type, timestamp, amount +FROM my_stream +``` + +#### Transforming data + +Apply transformations to fields: + +```sql +INSERT INTO my_sink +SELECT + user_id, + UPPER(event_type) as event_type, + timestamp, + amount * 1.1 as amount_with_tax +FROM my_stream +``` + +## View pipeline configuration + +### Dashboard + + + 1. In the Cloudflare dashboard, go to the **Pipelines** page. + + 2. Select a pipeline to view its SQL transformation, connected streams/sinks, and + associated metrics. + + + +### Wrangler CLI + +To view a specific pipeline, run the [`pipelines get`](/workers/wrangler/commands/#pipelines-get) command: + +```bash +npx wrangler pipelines get +``` + +To list all pipelines in your account, run the [`pipelines list`](/workers/wrangler/commands/#pipelines-list) command: + +```bash +npx wrangler pipelines list +``` + +## Delete a pipeline + +Deleting a pipeline stops data flow from the connected stream to its sink. + +### Dashboard + + + 1. In the Cloudflare dashboard, go to the **Pipelines** page. + + 2. Select the pipeline you want to delete. 3. Go to the **Settings** tab and select **Delete**. + + + +### Wrangler CLI + +To delete a pipeline, run the [`pipelines delete`](/workers/wrangler/commands/#pipelines-delete) command: + +```bash +npx wrangler pipelines delete 
 +``` + +:::caution +Deleting a pipeline immediately stops data flow between the stream and sink. +::: + +## Limitations + +Pipeline SQL cannot be modified after creation. To change the SQL transformation, you must delete and recreate the pipeline.
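+
+For example, a minimal sketch of replacing a pipeline's SQL (assuming a pipeline named `my-pipeline` and the stream and sink names used above):
+
+```bash
+# Remove the existing pipeline; this stops data flow between its stream and sink
+npx wrangler pipelines delete my-pipeline
+
+# Recreate it with the updated SQL transformation
+npx wrangler pipelines create my-pipeline \
+  --sql "INSERT INTO my_sink SELECT * FROM my_stream WHERE event_type = 'purchase'"
+```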
diff --git a/src/content/docs/pipelines/platform/index.mdx b/src/content/docs/pipelines/platform/index.mdx index a6f575945f80a9d..fe7a16eca104db5 100644 --- a/src/content/docs/pipelines/platform/index.mdx +++ b/src/content/docs/pipelines/platform/index.mdx @@ -7,6 +7,6 @@ sidebar: hideIndex: true --- -import { DirectoryListing } from "~/components" +import { DirectoryListing } from "~/components"; - \ No newline at end of file + diff --git a/src/content/docs/pipelines/platform/limits.mdx b/src/content/docs/pipelines/platform/limits.mdx index f6a38174f5ec7b8..c0c45c8b07f4cc6 100644 --- a/src/content/docs/pipelines/platform/limits.mdx +++ b/src/content/docs/pipelines/platform/limits.mdx @@ -5,23 +5,16 @@ sidebar: order: 2 --- -import { Render } from "~/components" +import { Render } from "~/components"; +While in open beta, the following limits are currently in effect: -| Feature | Limit | -| --------------------------------------------- | ------------------------------------------------------------- | -| Maximum requests per second, per pipeline | 14,000 default (configurable up to 100,000) | -| Maximum payload per request | 1 MB | -| Maximum data throughput per pipeline | 14 MB/s default (configurable up to 100 MB/s) | -| Shards per pipeline | 2 default (configurable up to 15) | -| Maximum batch size | 100 MB | -| Maximum batch records | 10,000,000 | -| Maximum batch duration | 300s | +| Feature | Limit | +| ------------------------------------------ | ------ | +| Maximum streams per account | 20 | +| Maximum payload size per ingestion request | 1 MB | +| Maximum ingest rate per stream | 5 MB/s | +| Maximum sinks per account | 20 | +| Maximum pipelines per account | 20 | - -## Exceeding requests per second or throughput limits -If you consistently exceed the requests per second or throughput limits, your pipeline might not be able to keep up with the load. The pipeline will communicate backpressure by returning a 429 response to HTTP requests or throwing an error if using the Workers API. - -If you are consistently seeing backpressure from your pipeline, consider the following strategies: -* Increase the [shard count](/pipelines/build-with-pipelines/shards) to increase the maximum throughput of your pipeline. -* Send data to a second pipeline if you receive an error. You can setup multiple pipelines to write to the same R2 bucket. \ No newline at end of file + diff --git a/src/content/docs/pipelines/platform/pricing.mdx b/src/content/docs/pipelines/platform/pricing.mdx index a02c869b7c53de7..8a7d86c0dead130 100644 --- a/src/content/docs/pipelines/platform/pricing.mdx +++ b/src/content/docs/pipelines/platform/pricing.mdx @@ -6,18 +6,10 @@ sidebar: head: - tag: title content: Cloudflare Pipelines - Pricing - --- -:::note -Pipelines requires a [Workers paid](/workers/platform/pricing/#workers) plan to use. -::: - -During the first phase of the Pipelines open beta, you will not be billed for Pipelines usage. You will be billed only for [R2 usage](/r2/pricing). +Cloudflare Pipelines is in open beta and available to any developer with a [Workers Paid plan](/workers/platform/pricing/). -We plan to price based on the volume of data ingested into and delivered from Pipelines. We expect to begin charging by September 15, 2025, and will provide at least 30 days' notice beforehand. +We are not currently billing for Pipelines during open beta. However, you will be billed for standard [R2 storage and operations](/r2/pricing/) for data written by sinks to R2 buckets. 
-| | Workers Paid Users -| ---------------------------------- | ------------------------ -| Ingestion | 50 GB / month included + $0.02 / additional GB -| Delivery to R2 | 50 GB / month included + $0.02 / additional GB +We plan to bill based on the volume of data processed by pipelines, transformed by pipelines, and delivered to sinks. We'll provide at least 30 days notice before we make any changes or start charging for Pipelines usage. diff --git a/src/content/docs/pipelines/platform/wrangler-commands.mdx b/src/content/docs/pipelines/platform/wrangler-commands.mdx index 6b26ac89a49f319..345b0c5a10dac42 100644 --- a/src/content/docs/pipelines/platform/wrangler-commands.mdx +++ b/src/content/docs/pipelines/platform/wrangler-commands.mdx @@ -3,12 +3,12 @@ pcx_content_type: concept title: Wrangler commands sidebar: order: 80 - --- -import { Render, Type, MetaInfo } from "~/components" +import { Render, Type, MetaInfo } from "~/components"; ## Global commands - \ No newline at end of file + + diff --git a/src/content/docs/pipelines/reference/index.mdx b/src/content/docs/pipelines/reference/index.mdx new file mode 100644 index 000000000000000..5dcd0961f3e7dbc --- /dev/null +++ b/src/content/docs/pipelines/reference/index.mdx @@ -0,0 +1,16 @@ +--- +pcx_content_type: navigation +title: Reference +head: [] +sidebar: + order: 9 + group: + hideIndex: true +description: Reference documentation for Cloudflare Pipelines. +--- + +import { DirectoryListing } from "~/components"; + +[Pipelines](/pipelines/) reference documentation: + + diff --git a/src/content/docs/pipelines/reference/legacy-pipelines.mdx b/src/content/docs/pipelines/reference/legacy-pipelines.mdx new file mode 100644 index 000000000000000..67630e1de2df297 --- /dev/null +++ b/src/content/docs/pipelines/reference/legacy-pipelines.mdx @@ -0,0 +1,38 @@ +--- +title: Legacy pipelines +pcx_content_type: concept +sidebar: + order: 1 +--- + +Legacy pipelines, those created before September 25, 2025 via the legacy API, are on a deprecation path. + +To check if your pipelines are legacy pipelines, view them in the dashboard under **Pipelines** > **Pipelines** or run the [`pipelines list`](/workers/wrangler/commands/#pipelines-list) command in [Wrangler](/workers/wrangler/). Legacy pipelines are labeled "legacy" in both locations. + +New pipelines offer SQL transformations, multiple output formats, and improved architecture. + +## Notable changes + +- New pipelines support SQL transformations for data processing. +- New pipelines write to JSON, Parquet, and Apache Iceberg formats instead of JSON only. +- New pipelines separate streams, pipelines, and sinks into distinct resources. +- New pipelines support optional structured schemas with validation. +- New pipelines offer configurable rolling policies and customizable partitioning. + +## Moving to new pipelines + +Legacy pipelines will continue to work until Pipelines is Generally Available, but new features and improvements are only available in the new pipeline architecture. To migrate: + +1. Create a new pipeline using the interactive setup: + + ```bash + npx wrangler pipelines setup + ``` + +2. Configure your new pipeline with the desired streams, SQL transformations, and sinks. + +3. Update your applications to send data to the new stream endpoints. + +4. Once verified, delete your legacy pipeline. + +For detailed guidance, refer to the [getting started guide](/pipelines/getting-started/). 
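+
+As a sketch, the migration can be driven from the command line (the legacy pipeline name below is illustrative):
+
+```bash
+# Identify legacy pipelines; they are labeled "legacy" in the output
+npx wrangler pipelines list
+
+# Create a replacement stream, pipeline, and sink on the new architecture
+npx wrangler pipelines setup
+
+# After updating your applications and verifying data flow, remove the legacy pipeline
+# ("my-legacy-pipeline" is an illustrative name)
+npx wrangler pipelines delete my-legacy-pipeline
+```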
diff --git a/src/content/docs/pipelines/sinks/available-sinks/index.mdx b/src/content/docs/pipelines/sinks/available-sinks/index.mdx new file mode 100644 index 000000000000000..2865764b3808541 --- /dev/null +++ b/src/content/docs/pipelines/sinks/available-sinks/index.mdx @@ -0,0 +1,16 @@ +--- +pcx_content_type: navigation +title: Available sinks +head: [] +sidebar: + order: 2 + group: + hideIndex: true +description: Find detailed configuration options for each supported sink type. +--- + +import { DirectoryListing } from "~/components"; + +[Pipelines](/pipelines/) supports the following sink types: + + diff --git a/src/content/docs/pipelines/sinks/available-sinks/r2-data-catalog.mdx b/src/content/docs/pipelines/sinks/available-sinks/r2-data-catalog.mdx new file mode 100644 index 000000000000000..0ce4d384fc217a6 --- /dev/null +++ b/src/content/docs/pipelines/sinks/available-sinks/r2-data-catalog.mdx @@ -0,0 +1,81 @@ +--- +pcx_content_type: configuration +title: R2 Data Catalog +description: Write data as Apache Iceberg tables to R2 Data Catalog +sidebar: + order: 2 +--- + +R2 Data Catalog sinks write processed data from pipelines as [Apache Iceberg](https://iceberg.apache.org/) tables to [R2 Data Catalog](/r2/data-catalog/). Iceberg tables provide ACID transactions, schema evolution, and time travel capabilities for analytics workloads. + +To create an R2 Data Catalog sink, run the [`pipelines sinks create`](/workers/wrangler/commands/#pipelines-sinks-create) command and specify the sink type, target bucket, namespace, and table name: + +```bash +npx wrangler pipelines sinks create my-sink \ + --type r2-data-catalog \ + --bucket my-bucket \ + --namespace my_namespace \ + --table my_table \ + --catalog-token YOUR_CATALOG_TOKEN +``` + +The sink will create the specified namespace and table if they do not exist. Sinks cannot be created for existing Iceberg tables. + +## Format + +R2 Data Catalog sinks only support Parquet format. JSON format is not supported for Iceberg tables. + +### Compression options + +Configure Parquet compression for optimal storage and query performance: + +```bash +--compression zstd +``` + +**Available compression options:** + +- `zstd` (default) - Best compression ratio +- `snappy` - Fastest compression +- `gzip` - Good compression, widely supported +- `lz4` - Fast compression with reasonable ratio +- `uncompressed` - No compression + +### Row group size + +[Row groups](https://parquet.apache.org/docs/file-format/configurations/) are sets of rows in a Parquet file that are stored together, affecting memory usage and query performance. Configure the target row group size in MB: + +```bash +--target-row-group-size 256 +``` + +## Batching and rolling policy + +Control when data is written to Iceberg tables. Configure based on your needs: + +- **Lower values**: More frequent writes, smaller files, lower latency +- **Higher values**: Less frequent writes, larger files, better query performance + +### Roll interval + +Set how often files are written (default: 300 seconds): + +```bash +--roll-interval 60 # Write files every 60 seconds +``` + +### Roll size + +Set maximum file size in MB before creating a new file: + +```bash +--roll-size 100 # Create new file after 100MB +``` + +## Authentication + +R2 Data Catalog sinks require an API token with [R2 Admin Read & Write permissions](/r2/data-catalog/manage-catalogs/#create-api-token-in-the-dashboard). This permission grants the sink access to both R2 Data Catalog and R2 storage. 
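+
+Putting the options on this page together, a complete sink definition might look like the following sketch. The bucket, namespace, table, and tuning values are illustrative, and the API token is supplied with the `--catalog-token` flag shown in the snippet below:
+
+```bash
+# Illustrative example combining the options described on this page
+npx wrangler pipelines sinks create my-iceberg-sink \
+  --type r2-data-catalog \
+  --bucket my-bucket \
+  --namespace my_namespace \
+  --table my_table \
+  --compression zstd \
+  --target-row-group-size 256 \
+  --roll-interval 60 \
+  --catalog-token YOUR_CATALOG_TOKEN
+```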
+ +```bash +--catalog-token YOUR_CATALOG_TOKEN +``` diff --git a/src/content/docs/pipelines/sinks/available-sinks/r2.mdx b/src/content/docs/pipelines/sinks/available-sinks/r2.mdx new file mode 100644 index 000000000000000..7f1bef3a91e9447 --- /dev/null +++ b/src/content/docs/pipelines/sinks/available-sinks/r2.mdx @@ -0,0 +1,113 @@ +--- +pcx_content_type: configuration +title: R2 +description: Write data as JSON or Parquet files to R2 object storage +sidebar: + order: 1 +--- + +R2 sinks write processed data from pipelines as raw files to [R2 object storage](/r2/). They currently support writing to JSON and Parquet formats. + +To create an R2 sink, run the [`pipelines sinks create`](/workers/wrangler/commands/#pipelines-sinks-create) command and specify the sink type and target [bucket](/r2/buckets/): + +```bash +npx wrangler pipelines sinks create my-sink \ + --type r2 \ + --bucket my-bucket +``` + +## Format options + +R2 sinks support two output formats: + +### JSON format + +Write data as newline-delimited JSON files: + +```bash +--format json +``` + +### Parquet format + +Write data as Parquet files for better query performance and compression: + +```bash +--format parquet --compression zstd +``` + +**Compression options for Parquet:** + +- `zstd` (default) - Best compression ratio +- `snappy` - Fastest compression +- `gzip` - Good compression, widely supported +- `lz4` - Fast compression with reasonable ratio +- `uncompressed` - No compression + +**Row group size:** +[Row groups](https://parquet.apache.org/docs/file-format/configurations/) are sets of rows in a Parquet file that are stored together, affecting memory usage and query performance. Configure the target row group size in MB: + +```bash +--target-row-group-size 256 +``` + +## File organization + +Files are written with UUID names within the partitioned directory structure. For example, with path `analytics` and default partitioning: + +``` +analytics/year=2025/month=09/day=18/002507a5-d449-48e8-a484-b1bea916102f.parquet +``` + +### Path + +Set a base directory in your bucket where files will be written: + +```bash +--path analytics/events +``` + +### Partitioning + +R2 sinks automatically partition files by time using a configurable pattern. The default pattern is `year=%Y/month=%m/day=%d` (Hive-style partitioning). + +```bash +--partitioning "year=%Y/month=%m/day=%d/hour=%H" +``` + +For available format specifiers, refer to [strftime documentation](https://docs.rs/chrono/latest/chrono/format/strftime/index.html). + +## Batching and rolling policy + +Control when files are written to R2. Configure based on your needs: + +- **Lower values**: More frequent writes, smaller files, lower latency +- **Higher values**: Less frequent writes, larger files, better query performance + +### Roll interval + +Set how often files are written (default: 300 seconds): + +```bash +--roll-interval 60 # Write files every 60 seconds +``` + +### Roll size + +Set maximum file size in MB before creating a new file: + +```bash +--roll-size 100 # Create new file after 100MB +``` + +## Authentication + +R2 sinks require an API credentials (Access Key ID and Secret Access Key) with [Object Read & Write permissions](/r2/api/tokens/#permissions) to write data to your bucket. 
+ +```bash +npx wrangler pipelines sinks create my-sink \ + --type r2 \ + --bucket my-bucket \ + --access-key-id YOUR_ACCESS_KEY_ID \ + --secret-access-key YOUR_SECRET_ACCESS_KEY +``` diff --git a/src/content/docs/pipelines/sinks/index.mdx b/src/content/docs/pipelines/sinks/index.mdx new file mode 100644 index 000000000000000..6c458a661f48bf7 --- /dev/null +++ b/src/content/docs/pipelines/sinks/index.mdx @@ -0,0 +1,26 @@ +--- +title: Sinks +pcx_content_type: navigation +sidebar: + order: 3 +--- + +import { LinkCard } from "~/components"; + +Sinks define destinations for your data in Cloudflare Pipelines. They support writing to [R2 Data Catalog](/r2/data-catalog/) as Apache Iceberg tables or to [R2](/r2/) as raw JSON or Parquet files. + +Sinks provide exactly-once delivery guarantees, ensuring events are never duplicated or dropped. They can be configured to write files frequently for low-latency ingestion or to write larger, less frequent files for better query performance. + +## Learn more + + + + diff --git a/src/content/docs/pipelines/sinks/manage-sinks.mdx b/src/content/docs/pipelines/sinks/manage-sinks.mdx new file mode 100644 index 000000000000000..ae503d38df91a36 --- /dev/null +++ b/src/content/docs/pipelines/sinks/manage-sinks.mdx @@ -0,0 +1,101 @@ +--- +pcx_content_type: configuration +title: Manage sinks +description: Create, configure, and manage sinks for data storage +sidebar: + order: 1 +--- + +import { Steps, DashButton } from "~/components"; + +Learn how to: + +- Create and configure sinks for data storage +- View sink configuration +- Delete sinks when no longer needed + +## Create a sink + +Sinks are made available to pipelines as SQL tables using the sink name (e.g., `INSERT INTO my_sink SELECT * FROM my_stream`). + +### Dashboard + + +1. In the Cloudflare dashboard, go to the **Pipelines** page. + + +2. Select **Create Pipeline** to launch the pipeline creation wizard. +3. Complete the wizard to create your sink along with the associated stream and pipeline. + + +### Wrangler CLI + +To create a sink, run the [`pipelines sinks create`](/workers/wrangler/commands/#pipelines-sinks-create) command: + +```bash +npx wrangler pipelines sinks create \ + --type r2 \ + --bucket my-bucket \ +``` + +For sink-specific configuration options, refer to [Available sinks](/pipelines/sinks/available-sinks/). + +Alternatively, to use the interactive setup wizard that helps you configure a stream, sink, and pipeline, run the [`pipelines setup`](/workers/wrangler/commands/#pipelines-setup) command: + +```bash +npx wrangler pipelines setup +``` + +## View sink configuration + +### Dashboard + + + 1. In the Cloudflare dashboard, go to **Pipelines** > **Sinks**. + + 2. Select a sink to view its configuration. + + + +### Wrangler CLI + +To view a specific sink, run the [`pipelines sinks get`](/workers/wrangler/commands/#pipelines-sinks-get) command: + +```bash +npx wrangler pipelines sinks get +``` + +To list all sinks in your account, run the [`pipelines sinks list`](/workers/wrangler/commands/#pipelines-sinks-list) command: + +```bash +npx wrangler pipelines sinks list +``` + +## Delete a sink + +### Dashboard + + + 1. In the Cloudflare dashboard, go to **Pipelines** > **Sinks**. + + 2. Select the sink you want to delete. + + 3. In the **Settings** tab, navigate to **General**, and select **Delete**. 
+ + + +### Wrangler CLI + +To delete a sink, run the [`pipelines sinks delete`](/workers/wrangler/commands/#pipelines-sinks-delete) command: + +```bash +npx wrangler pipelines sinks delete +``` + +:::caution +Deleting a sink stops all data writes to that destination. +::: + +## Limitations + +Sinks cannot be modified after creation. To change sink configuration, you must delete and recreate the sink. diff --git a/src/content/docs/pipelines/sql-reference/index.mdx b/src/content/docs/pipelines/sql-reference/index.mdx new file mode 100644 index 000000000000000..bb2be220a40049e --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/index.mdx @@ -0,0 +1,16 @@ +--- +pcx_content_type: navigation +title: SQL reference +head: [] +sidebar: + order: 5 + group: + hideIndex: true +description: Comprehensive reference for SQL syntax, data types, and functions supported in Pipelines. +--- + +import { DirectoryListing } from "~/components"; + +[Pipelines](/pipelines/) SQL reference documentation: + + diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/array.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/array.mdx new file mode 100644 index 000000000000000..41508861ac674bb --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/array.mdx @@ -0,0 +1,1314 @@ +--- +title: Array functions +description: Scalar functions for manipulating arrays +sidebar: + order: 8 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `array_append` + +Appends an element to the end of an array. + +``` +array_append(array, element) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to append to the array. + +**Example** + +``` +> select array_append([1, 2, 3], 4); ++--------------------------------------+ +| array_append(List([1,2,3]),Int64(4)) | ++--------------------------------------+ +| [1, 2, 3, 4] | ++--------------------------------------+ +``` + +**Aliases** + +- array_push_back +- list_append +- list_push_back + +## `array_sort` + +Sort array. + +``` +array_sort(array, desc, nulls_first) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **desc**: Whether to sort in descending order(`ASC` or `DESC`). +- **nulls_first**: Whether to sort nulls first(`NULLS FIRST` or `NULLS LAST`). + +**Example** + +``` +> select array_sort([3, 1, 2]); ++-----------------------------+ +| array_sort(List([3,1,2])) | ++-----------------------------+ +| [1, 2, 3] | ++-----------------------------+ +``` + +**Aliases** + +- list_sort + +## `array_resize` + +Resizes the list to contain size elements. Initializes new elements with value or empty if value is not set. + +``` +array_resize(array, size, value) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **size**: New size of given array. +- **value**: Defines new elements' value or empty if value is not set. 
+ +**Example** + +``` +> select array_resize([1, 2, 3], 5, 0); ++-------------------------------------+ +| array_resize(List([1,2,3],5,0)) | ++-------------------------------------+ +| [1, 2, 3, 0, 0] | ++-------------------------------------+ +``` + +**Aliases** + +- list_resize + +## `array_cat` + +_Alias of [array_concat](#array_concat)._ + +## `array_concat` + +Concatenates arrays. + +``` +array_concat(array[, ..., array_n]) +``` + +**Arguments** + +- **array**: Array expression to concatenate. + Can be a constant, column, or function, and any combination of array operators. +- **array_n**: Subsequent array column or literal array to concatenate. + +**Example** + +``` +> select array_concat([1, 2], [3, 4], [5, 6]); ++---------------------------------------------------+ +| array_concat(List([1,2]),List([3,4]),List([5,6])) | ++---------------------------------------------------+ +| [1, 2, 3, 4, 5, 6] | ++---------------------------------------------------+ +``` + +**Aliases** + +- array_cat +- list_cat +- list_concat + +## `array_contains` + +_Alias of [array_has](#array_has)._ + +## `array_has` + +Returns true if the array contains the element + +``` +array_has(array, element) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Scalar or Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Aliases** + +- list_has + +## `array_has_all` + +Returns true if all elements of sub-array exist in array + +``` +array_has_all(array, sub-array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **sub-array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Aliases** + +- list_has_all + +## `array_has_any` + +Returns true if any elements exist in both arrays + +``` +array_has_any(array, sub-array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **sub-array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Aliases** + +- list_has_any + +## `array_dims` + +Returns an array of the array's dimensions. + +``` +array_dims(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_dims([[1, 2, 3], [4, 5, 6]]); ++---------------------------------+ +| array_dims(List([1,2,3,4,5,6])) | ++---------------------------------+ +| [2, 3] | ++---------------------------------+ +``` + +**Aliases** + +- list_dims + +## `array_distinct` + +Returns distinct values from the array after removing duplicates. + +``` +array_distinct(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_distinct([1, 3, 2, 3, 1, 2, 4]); ++---------------------------------+ +| array_distinct(List([1,2,3,4])) | ++---------------------------------+ +| [1, 2, 3, 4] | ++---------------------------------+ +``` + +**Aliases** + +- list_distinct + +## `array_element` + +Extracts the element with the index n from the array. + +``` +array_element(array, index) +``` + +**Arguments** + +- **array**: Array expression. 
+ Can be a constant, column, or function, and any combination of array operators. +- **index**: Index to extract the element from the array. + +**Example** + +``` +> select array_element([1, 2, 3, 4], 3); ++-----------------------------------------+ +| array_element(List([1,2,3,4]),Int64(3)) | ++-----------------------------------------+ +| 3 | ++-----------------------------------------+ +``` + +**Aliases** + +- array_extract +- list_element +- list_extract + +## `array_extract` + +_Alias of [array_element](#array_element)._ + +## `array_fill` + +Returns an array filled with copies of the given value. + +DEPRECATED: use `array_repeat` instead! + +``` +array_fill(element, array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to copy to the array. + +## `flatten` + +Converts an array of arrays to a flat array + +- Applies to any depth of nested arrays +- Does not change arrays that are already flat + +The flattened array contains all the elements from all source arrays. + +**Arguments** + +- **array**: Array expression + Can be a constant, column, or function, and any combination of array operators. + +``` +flatten(array) +``` + +## `array_indexof` + +_Alias of [array_position](#array_position)._ + +## `array_intersect` + +Returns an array of elements in the intersection of array1 and array2. + +``` +array_intersect(array1, array2) +``` + +**Arguments** + +- **array1**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **array2**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_intersect([1, 2, 3, 4], [5, 6, 3, 4]); ++----------------------------------------------------+ +| array_intersect([1, 2, 3, 4], [5, 6, 3, 4]); | ++----------------------------------------------------+ +| [3, 4] | ++----------------------------------------------------+ +> select array_intersect([1, 2, 3, 4], [5, 6, 7, 8]); ++----------------------------------------------------+ +| array_intersect([1, 2, 3, 4], [5, 6, 7, 8]); | ++----------------------------------------------------+ +| [] | ++----------------------------------------------------+ +``` + +--- + +**Aliases** + +- list_intersect + +## `array_join` + +_Alias of [array_to_string](#array_to_string)._ + +## `array_length` + +Returns the length of the array dimension. + +``` +array_length(array, dimension) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **dimension**: Array dimension. + +**Example** + +``` +> select array_length([1, 2, 3, 4, 5]); ++---------------------------------+ +| array_length(List([1,2,3,4,5])) | ++---------------------------------+ +| 5 | ++---------------------------------+ +``` + +**Aliases** + +- list_length + +## `array_ndims` + +Returns the number of dimensions of the array. + +``` +array_ndims(array, element) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_ndims([[1, 2, 3], [4, 5, 6]]); ++----------------------------------+ +| array_ndims(List([1,2,3,4,5,6])) | ++----------------------------------+ +| 2 | ++----------------------------------+ +``` + +**Aliases** + +- list_ndims + +## `array_prepend` + +Prepends an element to the beginning of an array. 
+ +``` +array_prepend(element, array) +``` + +**Arguments** + +- **element**: Element to prepend to the array. +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_prepend(1, [2, 3, 4]); ++---------------------------------------+ +| array_prepend(Int64(1),List([2,3,4])) | ++---------------------------------------+ +| [1, 2, 3, 4] | ++---------------------------------------+ +``` + +**Aliases** + +- array_push_front +- list_prepend +- list_push_front + +## `array_pop_front` + +Returns the array without the first element. + +``` +array_pop_front(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_pop_front([1, 2, 3]); ++-------------------------------+ +| array_pop_front(List([1,2,3])) | ++-------------------------------+ +| [2, 3] | ++-------------------------------+ +``` + +**Aliases** + +- list_pop_front + +## `array_pop_back` + +Returns the array without the last element. + +``` +array_pop_back(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_pop_back([1, 2, 3]); ++-------------------------------+ +| array_pop_back(List([1,2,3])) | ++-------------------------------+ +| [1, 2] | ++-------------------------------+ +``` + +**Aliases** + +- list_pop_back + +## `array_position` + +Returns the position of the first occurrence of the specified element in the array. + +``` +array_position(array, element) +array_position(array, element, index) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to search for position in the array. +- **index**: Index at which to start searching. + +**Example** + +``` +> select array_position([1, 2, 2, 3, 1, 4], 2); ++----------------------------------------------+ +| array_position(List([1,2,2,3,1,4]),Int64(2)) | ++----------------------------------------------+ +| 2 | ++----------------------------------------------+ +``` + +**Aliases** + +- array_indexof +- list_indexof +- list_position + +## `array_positions` + +Searches for an element in the array, returns all occurrences. + +``` +array_positions(array, element) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to search for positions in the array. + +**Example** + +``` +> select array_positions([1, 2, 2, 3, 1, 4], 2); ++-----------------------------------------------+ +| array_positions(List([1,2,2,3,1,4]),Int64(2)) | ++-----------------------------------------------+ +| [2, 3] | ++-----------------------------------------------+ +``` + +**Aliases** + +- list_positions + +## `array_push_back` + +_Alias of [array_append](#array_append)._ + +## `array_push_front` + +_Alias of [array_prepend](#array_prepend)._ + +## `array_repeat` + +Returns an array containing element `count` times. + +``` +array_repeat(element, count) +``` + +**Arguments** + +- **element**: Element expression. + Can be a constant, column, or function, and any combination of array operators. +- **count**: Value of how many times to repeat the element. 
+ +**Example** + +``` +> select array_repeat(1, 3); ++---------------------------------+ +| array_repeat(Int64(1),Int64(3)) | ++---------------------------------+ +| [1, 1, 1] | ++---------------------------------+ +``` + +``` +> select array_repeat([1, 2], 2); ++------------------------------------+ +| array_repeat(List([1,2]),Int64(2)) | ++------------------------------------+ +| [[1, 2], [1, 2]] | ++------------------------------------+ +``` + +**Aliases** + +- list_repeat + +## `array_remove` + +Removes the first element from the array equal to the given value. + +``` +array_remove(array, element) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to be removed from the array. + +**Example** + +``` +> select array_remove([1, 2, 2, 3, 2, 1, 4], 2); ++----------------------------------------------+ +| array_remove(List([1,2,2,3,2,1,4]),Int64(2)) | ++----------------------------------------------+ +| [1, 2, 3, 2, 1, 4] | ++----------------------------------------------+ +``` + +**Aliases** + +- list_remove + +## `array_remove_n` + +Removes the first `max` elements from the array equal to the given value. + +``` +array_remove_n(array, element, max) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to be removed from the array. +- **max**: Number of first occurrences to remove. + +**Example** + +``` +> select array_remove_n([1, 2, 2, 3, 2, 1, 4], 2, 2); ++---------------------------------------------------------+ +| array_remove_n(List([1,2,2,3,2,1,4]),Int64(2),Int64(2)) | ++---------------------------------------------------------+ +| [1, 3, 2, 1, 4] | ++---------------------------------------------------------+ +``` + +**Aliases** + +- list_remove_n + +## `array_remove_all` + +Removes all elements from the array equal to the given value. + +``` +array_remove_all(array, element) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **element**: Element to be removed from the array. + +**Example** + +``` +> select array_remove_all([1, 2, 2, 3, 2, 1, 4], 2); ++--------------------------------------------------+ +| array_remove_all(List([1,2,2,3,2,1,4]),Int64(2)) | ++--------------------------------------------------+ +| [1, 3, 1, 4] | ++--------------------------------------------------+ +``` + +**Aliases** + +- list_remove_all + +## `array_replace` + +Replaces the first occurrence of the specified element with another specified element. + +``` +array_replace(array, from, to) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **from**: Initial element. +- **to**: Final element. + +**Example** + +``` +> select array_replace([1, 2, 2, 3, 2, 1, 4], 2, 5); ++--------------------------------------------------------+ +| array_replace(List([1,2,2,3,2,1,4]),Int64(2),Int64(5)) | ++--------------------------------------------------------+ +| [1, 5, 2, 3, 2, 1, 4] | ++--------------------------------------------------------+ +``` + +**Aliases** + +- list_replace + +## `array_replace_n` + +Replaces the first `max` occurrences of the specified element with another specified element. + +``` +array_replace_n(array, from, to, max) +``` + +**Arguments** + +- **array**: Array expression. 
+ Can be a constant, column, or function, and any combination of array operators. +- **from**: Initial element. +- **to**: Final element. +- **max**: Number of first occurrences to replace. + +**Example** + +``` +> select array_replace_n([1, 2, 2, 3, 2, 1, 4], 2, 5, 2); ++-------------------------------------------------------------------+ +| array_replace_n(List([1,2,2,3,2,1,4]),Int64(2),Int64(5),Int64(2)) | ++-------------------------------------------------------------------+ +| [1, 5, 5, 3, 2, 1, 4] | ++-------------------------------------------------------------------+ +``` + +**Aliases** + +- list_replace_n + +## `array_replace_all` + +Replaces all occurrences of the specified element with another specified element. + +``` +array_replace_all(array, from, to) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **from**: Initial element. +- **to**: Final element. + +**Example** + +``` +> select array_replace_all([1, 2, 2, 3, 2, 1, 4], 2, 5); ++------------------------------------------------------------+ +| array_replace_all(List([1,2,2,3,2,1,4]),Int64(2),Int64(5)) | ++------------------------------------------------------------+ +| [1, 5, 5, 3, 5, 1, 4] | ++------------------------------------------------------------+ +``` + +**Aliases** + +- list_replace_all + +## `array_reverse` + +Returns the array with the order of the elements reversed. + +``` +array_reverse(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_reverse([1, 2, 3, 4]); ++------------------------------------------------------------+ +| array_reverse(List([1, 2, 3, 4])) | ++------------------------------------------------------------+ +| [4, 3, 2, 1] | ++------------------------------------------------------------+ +``` + +**Aliases** + +- list_reverse + +## `array_slice` + +Returns a slice of the array based on 1-indexed start and end positions. + +``` +array_slice(array, begin, end) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **begin**: Index of the first element. + If negative, it counts backward from the end of the array. +- **end**: Index of the last element. + If negative, it counts backward from the end of the array. +- **stride**: Stride of the array slice. The default is 1. + +**Example** + +``` +> select array_slice([1, 2, 3, 4, 5, 6, 7, 8], 3, 6); ++--------------------------------------------------------+ +| array_slice(List([1,2,3,4,5,6,7,8]),Int64(3),Int64(6)) | ++--------------------------------------------------------+ +| [3, 4, 5, 6] | ++--------------------------------------------------------+ +``` + +**Aliases** + +- list_slice + +## `array_to_string` + +Converts each element to its text representation. + +``` +array_to_string(array, delimiter) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **delimiter**: Array element separator. 
+ +**Example** + +``` +> select array_to_string([[1, 2, 3, 4], [5, 6, 7, 8]], ','); ++----------------------------------------------------+ +| array_to_string(List([1,2,3,4,5,6,7,8]),Utf8(",")) | ++----------------------------------------------------+ +| 1,2,3,4,5,6,7,8 | ++----------------------------------------------------+ +``` + +**Aliases** + +- array_join +- list_join +- list_to_string + +## `array_union` + +Returns an array of elements that are present in both arrays (all elements from both arrays) with out duplicates. + +``` +array_union(array1, array2) +``` + +**Arguments** + +- **array1**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **array2**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_union([1, 2, 3, 4], [5, 6, 3, 4]); ++----------------------------------------------------+ +| array_union([1, 2, 3, 4], [5, 6, 3, 4]); | ++----------------------------------------------------+ +| [1, 2, 3, 4, 5, 6] | ++----------------------------------------------------+ +> select array_union([1, 2, 3, 4], [5, 6, 7, 8]); ++----------------------------------------------------+ +| array_union([1, 2, 3, 4], [5, 6, 7, 8]); | ++----------------------------------------------------+ +| [1, 2, 3, 4, 5, 6, 7, 8] | ++----------------------------------------------------+ +``` + +--- + +**Aliases** + +- list_union + +## `array_except` + +Returns an array of the elements that appear in the first array but not in the second. + +``` +array_except(array1, array2) +``` + +**Arguments** + +- **array1**: Array expression. + Can be a constant, column, or function, and any combination of array operators. +- **array2**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select array_except([1, 2, 3, 4], [5, 6, 3, 4]); ++----------------------------------------------------+ +| array_except([1, 2, 3, 4], [5, 6, 3, 4]); | ++----------------------------------------------------+ +| [1, 2] | ++----------------------------------------------------+ +> select array_except([1, 2, 3, 4], [3, 4, 5, 6]); ++----------------------------------------------------+ +| array_except([1, 2, 3, 4], [3, 4, 5, 6]); | ++----------------------------------------------------+ +| [1, 2] | ++----------------------------------------------------+ +``` + +--- + +**Aliases** + +- list_except + +## `cardinality` + +Returns the total number of elements in the array. + +``` +cardinality(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select cardinality([[1, 2, 3, 4], [5, 6, 7, 8]]); ++--------------------------------------+ +| cardinality(List([1,2,3,4,5,6,7,8])) | ++--------------------------------------+ +| 8 | ++--------------------------------------+ +``` + +## `empty` + +Returns 1 for an empty array or 0 for a non-empty array. + +``` +empty(array) +``` + +**Arguments** + +- **array**: Array expression. + Can be a constant, column, or function, and any combination of array operators. + +**Example** + +``` +> select empty([1]); ++------------------+ +| empty(List([1])) | ++------------------+ +| 0 | ++------------------+ +``` + +**Aliases** + +- array_empty, +- list_empty + +## `generate_series` + +Similar to the range function, but it includes the upper bound. 
+ +``` +generate_series(start, stop, step) +``` + +**Arguments** + +- **start**: start of the range +- **end**: end of the range (included) +- **step**: increase by step (can not be 0) + +**Example** + +``` +> select generate_series(1,3); ++------------------------------------+ +| generate_series(Int64(1),Int64(3)) | ++------------------------------------+ +| [1, 2, 3] | ++------------------------------------+ +``` + +## `list_append` + +_Alias of [array_append](#array_append)._ + +## `list_cat` + +_Alias of [array_concat](#array_concat)._ + +## `list_concat` + +_Alias of [array_concat](#array_concat)._ + +## `list_dims` + +_Alias of [array_dims](#array_dims)._ + +## `list_distinct` + +_Alias of [array_dims](#array_distinct)._ + +## `list_element` + +_Alias of [array_element](#array_element)._ + +## `list_empty` + +_Alias of [empty](#empty)._ + +## `list_except` + +_Alias of [array_element](#array_except)._ + +## `list_extract` + +_Alias of [array_element](#array_element)._ + +## `list_has` + +_Alias of [array_has](#array_has)._ + +## `list_has_all` + +_Alias of [array_has_all](#array_has_all)._ + +## `list_has_any` + +_Alias of [array_has_any](#array_has_any)._ + +## `list_indexof` + +_Alias of [array_position](#array_position)._ + +## `list_intersect` + +_Alias of [array_position](#array_intersect)._ + +## `list_join` + +_Alias of [array_to_string](#array_to_string)._ + +## `list_length` + +_Alias of [array_length](#array_length)._ + +## `list_ndims` + +_Alias of [array_ndims](#array_ndims)._ + +## `list_prepend` + +_Alias of [array_prepend](#array_prepend)._ + +## `list_pop_back` + +_Alias of [array_pop_back](#array_pop_back)._ + +## `list_pop_front` + +_Alias of [array_pop_front](#array_pop_front)._ + +## `list_position` + +_Alias of [array_position](#array_position)._ + +## `list_positions` + +_Alias of [array_positions](#array_positions)._ + +## `list_push_back` + +_Alias of [array_append](#array_append)._ + +## `list_push_front` + +_Alias of [array_prepend](#array_prepend)._ + +## `list_repeat` + +_Alias of [array_repeat](#array_repeat)._ + +## `list_resize` + +_Alias of [array_resize](#array_resize)._ + +## `list_remove` + +_Alias of [array_remove](#array_remove)._ + +## `list_remove_n` + +_Alias of [array_remove_n](#array_remove_n)._ + +## `list_remove_all` + +_Alias of [array_remove_all](#array_remove_all)._ + +## `list_replace` + +_Alias of [array_replace](#array_replace)._ + +## `list_replace_n` + +_Alias of [array_replace_n](#array_replace_n)._ + +## `list_replace_all` + +_Alias of [array_replace_all](#array_replace_all)._ + +## `list_reverse` + +_Alias of [array_reverse](#array_reverse)._ + +## `list_slice` + +_Alias of [array_slice](#array_slice)._ + +## `list_sort` + +_Alias of [array_sort](#array_sort)._ + +## `list_to_string` + +_Alias of [array_to_string](#array_to_string)._ + +## `list_union` + +_Alias of [array_union](#array_union)._ + +## `make_array` + +Returns an Arrow array using the specified input expressions. + +``` +make_array(expression1[, ..., expression_n]) +``` + +## `array_empty` + +_Alias of [empty](#empty)._ + +**Arguments** + +- **expression_n**: Expression to include in the output array. + Can be a constant, column, or function, and any combination of arithmetic or + string operators. 
+
+**Example**
+
+```
+> select make_array(1, 2, 3, 4, 5);
++----------------------------------------------------------+
+| make_array(Int64(1),Int64(2),Int64(3),Int64(4),Int64(5)) |
++----------------------------------------------------------+
+| [1, 2, 3, 4, 5]                                          |
++----------------------------------------------------------+
+```
+
+**Aliases**
+
+- make_list
+
+## `make_list`
+
+_Alias of [make_array](#make_array)._
+
+## `string_to_array`
+
+Splits a string into an array of substrings based on a delimiter. Any substrings matching the optional `null_str` argument are replaced with NULL.
+`SELECT string_to_array('abc##def', '##')` or `SELECT string_to_array('abc def', ' ', 'def')`
+
+```
+string_to_array(str, delimiter[, null_str])
+```
+
+**Arguments**
+
+- **str**: String expression to split.
+- **delimiter**: Delimiter string to split on.
+- **null_str**: Substring values to be replaced with `NULL`.
+
+**Aliases**
+
+- string_to_list
+
+## `string_to_list`
+
+_Alias of [string_to_array](#string_to_array)._
+
+## `trim_array`
+
+Removes the last n elements from the array.
+
+DEPRECATED: use `array_slice` instead!
+
+```
+trim_array(array, n)
+```
+
+**Arguments**
+
+- **array**: Array expression.
+  Can be a constant, column, or function, and any combination of array operators.
+- **n**: Number of elements to remove from the end of the array.
+
+## `range`
+
+Returns an Arrow array between start and stop with step. `SELECT range(2, 10, 3) -> [2, 5, 8]` or `SELECT range(DATE '1992-09-01', DATE '1993-03-01', INTERVAL '1' MONTH);`
+
+The range start..end contains all values with start <= x < end. It is empty if start >= end.
+
+Step cannot be 0 (otherwise the range is not meaningful).
+
+Note that for number ranges, the function accepts (stop), (start, stop), or (start, stop, step) as parameters, but for date ranges all 3 parameters must be provided and non-NULL.
+For example,
+
+```
+SELECT range(3);
+SELECT range(1,5);
+SELECT range(1,5,1);
+```
+
+are allowed for number ranges,
+
+but for date ranges, only
+
+```
+SELECT range(DATE '1992-09-01', DATE '1993-03-01', INTERVAL '1' MONTH);
+```
+
+is allowed, and
+
+```
+SELECT range(DATE '1992-09-01', DATE '1993-03-01', NULL);
+SELECT range(NULL, DATE '1993-03-01', INTERVAL '1' MONTH);
+SELECT range(DATE '1992-09-01', NULL, INTERVAL '1' MONTH);
+```
+
+are not allowed.
+
+**Arguments**
+
+- **start**: start of the range
+- **end**: end of the range (not included)
+- **step**: increase by step (cannot be 0)
+
+**Aliases**
+
+- generate_series
diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/binary-string.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/binary-string.mdx
new file mode 100644
index 000000000000000..a6f69b175cf980a
--- /dev/null
+++ b/src/content/docs/pipelines/sql-reference/scalar-functions/binary-string.mdx
@@ -0,0 +1,44 @@
+---
+title: Binary string functions
+description: Scalar functions for manipulating binary strings
+sidebar:
+  order: 4
+---
+
+_Cloudflare Pipelines scalar function implementations are based on
+[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from
+the DataFusion function reference._
+
+## `encode`
+
+Encodes binary data into a textual representation.
+
+```
+encode(expression, format)
+```
+
+**Arguments**
+
+- **expression**: Expression containing string or binary data.
+
+- **format**: Supported formats are `base64` and `hex`.
+
+**Related functions**:
+[decode](#decode)
+
+## `decode`
+
+Decodes binary data from its textual representation in a string.
+
+```
+decode(expression, format)
+```
+
+**Arguments**
+
+- **expression**: Expression containing encoded string data.
+
+- **format**: Same arguments as [encode](#encode).
+
+**Related functions**:
+[encode](#encode)
diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/conditional.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/conditional.mdx
new file mode 100644
index 000000000000000..51514019748dce6
--- /dev/null
+++ b/src/content/docs/pipelines/sql-reference/scalar-functions/conditional.mdx
@@ -0,0 +1,79 @@
+---
+title: Conditional functions
+description: Scalar functions to implement conditional logic
+sidebar:
+  order: 2
+---
+
+_Cloudflare Pipelines scalar function implementations are based on
+[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from
+the DataFusion function reference._
+
+## `coalesce`
+
+Returns the first of its arguments that is not _null_.
+Returns _null_ if all arguments are _null_.
+This function is often used to substitute a default value for _null_ values.
+
+```
+coalesce(expression1[, ..., expression_n])
+```
+
+**Arguments**
+
+- **expression1, expression_n**:
+  Expression to use if previous expressions are _null_.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+  Pass as many expression arguments as necessary.
+
+## `nullif`
+
+Returns _null_ if _expression1_ equals _expression2_; otherwise it returns _expression1_.
+This can be used to perform the inverse operation of [`coalesce`](#coalesce).
+
+```
+nullif(expression1, expression2)
+```
+
+**Arguments**
+
+- **expression1**: Expression to compare and return if equal to expression2.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+- **expression2**: Expression to compare to expression1.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+
+## `nvl`
+
+Returns _expression2_ if _expression1_ is NULL; otherwise it returns _expression1_.
+
+```
+nvl(expression1, expression2)
+```
+
+**Arguments**
+
+- **expression1**: Expression to return if it is not NULL.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+- **expression2**: Expression to return if expression1 is NULL.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+
+## `nvl2`
+
+Returns _expression2_ if _expression1_ is not NULL; otherwise it returns _expression3_.
+
+```
+nvl2(expression1, expression2, expression3)
+```
+
+**Arguments**
+
+- **expression1**: Conditional expression.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+- **expression2**: Expression to return if expression1 is not NULL.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+- **expression3**: Expression to return if expression1 is NULL.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
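+
+As a quick illustration of how these functions compose in a query (the `source` table and its nullable columns are illustrative):
+
+```sql
+SELECT
+  coalesce(nickname, full_name, 'anonymous') AS display_name, -- first non-NULL value
+  nullif(country, '') AS country,                             -- turn empty strings into NULL
+  nvl(referrer, 'direct') AS referrer,                        -- default when referrer is NULL
+  nvl2(email, 'has_email', 'no_email') AS email_status        -- branch on whether email is NULL
+FROM source;
+```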
+ +## `ifnull` + +_Alias of [nvl](#nvl)._ diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/hashing.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/hashing.mdx new file mode 100644 index 000000000000000..6f35cb2269f10ee --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/hashing.mdx @@ -0,0 +1,98 @@ +--- +title: Hashing functions +description: Scalar functions for hashing values +sidebar: + order: 10 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `digest` + +Computes the binary hash of an expression using the specified algorithm. + +``` +digest(expression, algorithm) +``` + +**Arguments** + +- **expression**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **algorithm**: String expression specifying algorithm to use. + Must be one of: + - md5 + - sha224 + - sha256 + - sha384 + - sha512 + - blake2s + - blake2b + - blake3 + +## `md5` + +Computes an MD5 128-bit checksum for a string expression. + +``` +md5(expression) +``` + +**Arguments** + +- **expression**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +## `sha224` + +Computes the SHA-224 hash of a binary string. + +``` +sha224(expression) +``` + +**Arguments** + +- **expression**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +## `sha256` + +Computes the SHA-256 hash of a binary string. + +``` +sha256(expression) +``` + +**Arguments** + +- **expression**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +## `sha384` + +Computes the SHA-384 hash of a binary string. + +``` +sha384(expression) +``` + +**Arguments** + +- **expression**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +## `sha512` + +Computes the SHA-512 hash of a binary string. + +``` +sha512(expression) +``` + +**Arguments** + +- **expression**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/index.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/index.mdx new file mode 100644 index 000000000000000..dfd9f208224f361 --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/index.mdx @@ -0,0 +1,16 @@ +--- +pcx_content_type: navigation +title: Scalar functions +head: [] +sidebar: + order: 3 + group: + hideIndex: true +description: Scalar functions available in Cloudflare Pipelines SQL. 
+--- + +import { DirectoryListing } from "~/components"; + +[Pipelines](/pipelines/) scalar functions: + + diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/json.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/json.mdx new file mode 100644 index 000000000000000..220734a4746a331 --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/json.mdx @@ -0,0 +1,158 @@ +--- +title: JSON functions +description: Scalar functions for manipulating JSON +sidebar: + order: 6 +--- + +Cloudflare Pipelines provides two set of JSON functions, the first based on PostgreSQL's SQL +functions and syntax, and the second based on the +[JSONPath](https://jsonpath.com/) standard. + +## SQL functions + +The SQL functions provide basic JSON parsing functions similar to those found in +PostgreSQL. + +### json_contains + +Returns `true` if the JSON string contains the specified key(s). + +```sql +SELECT json_contains('{"a": 1, "b": 2, "c": 3}', 'a') FROM source; +true +``` + +Also available via the `?` operator: + +```sql +SELECT '{"a": 1, "b": 2, "c": 3}' ? 'a' FROM source; +true +``` + +### json_get + +Retrieves the value from a JSON string by the specified path (keys). Returns the +value as its native type (string, int, etc.). + +```sql +SELECT json_get('{"a": {"b": 2}}', 'a', 'b') FROM source; +2 +``` + +Also available via the `->` operator: + +```sql +SELECT '{"a": {"b": 2}}'->'a'->'b' FROM source; +2 +``` + +Various permutations of `json_get` functions are available for retrieving values as +a specific type, or you can use SQL type annotations: + +```sql +SELECT json_get('{"a": {"b": 2}}', 'a', 'b')::int FROM source; +2 +``` + +### json_get_str + +Retrieves a string value from a JSON string by the specified path. Returns an +empty string if the value does not exist or is not a string. + +```sql +SELECT json_get_str('{"a": {"b": "hello"}}', 'a', 'b') FROM source; +"hello" +``` + +### json_get_int + +Retrieves an integer value from a JSON string by the specified path. Returns `0` +if the value does not exist or is not an integer. + +```sql +SELECT json_get_int('{"a": {"b": 42}}', 'a', 'b') FROM source; +42 +``` + +### json_get_float + +Retrieves a float value from a JSON string by the specified path. Returns `0.0` +if the value does not exist or is not a float. + +```sql +SELECT json_get_float('{"a": {"b": 3.14}}', 'a', 'b') FROM source; +3.14 +``` + +### json_get_bool + +Retrieves a boolean value from a JSON string by the specified path. Returns +`false` if the value does not exist or is not a boolean. + +```sql +SELECT json_get_bool('{"a": {"b": true}}', 'a', 'b') FROM source; +true +``` + +### json_get_json + +Retrieves a nested JSON string from a JSON string by the specified path. The +value is returned as raw JSON. + +```sql +SELECT json_get_json('{"a": {"b": {"c": 1}}}', 'a', 'b') FROM source; +'{"c": 1}' +``` + +### json_as_text + +Retrieves any value from a JSON string by the specified path and returns it as a +string, regardless of the original type. + +```sql +SELECT json_as_text('{"a": {"b": 42}}', 'a', 'b') FROM source; +"42" +``` + +Also available via the `->>` operator: + +```sql +SELECT '{"a": {"b": 42}}'->>'a'->>'b' FROM source; +"42" +``` + +### json_length + +Returns the length of a JSON object or array at the specified path. Returns `0` +if the path does not exist or is not an object/array. 
+ +```sql +SELECT json_length('{"a": [1, 2, 3]}', 'a') FROM source; +3 +``` + +## Json path functions + +JSON functions provide basic json parsing functions using +[JsonPath](https://goessner.net/articles/JsonPath/), an evolving standard for +querying JSON objects. + +### extract_json + +Returns the JSON elements in the first argument that match the JsonPath in the second argument. +The returned value is an array of json strings. + +```sql +SELECT extract_json('{"a": 1, "b": 2, "c": 3}', '$.a') FROM source; +['1'] +``` + +### extract_json_string + +Returns an unescaped String for the first item matching the JsonPath, if it is a string. + +```sql +SELECT extract_json_string('{"a": "a", "b": 2, "c": 3}', '$.a') FROM source; +'a' +``` diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/math.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/math.mdx new file mode 100644 index 000000000000000..947226692dbd6df --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/math.mdx @@ -0,0 +1,515 @@ +--- +title: Math functions +description: Scalar functions for mathematical operations +sidebar: + order: 1 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `abs` + +Returns the absolute value of a number. + +``` +abs(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `acos` + +Returns the arc cosine or inverse cosine of a number. + +``` +acos(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `acosh` + +Returns the area hyperbolic cosine or inverse hyperbolic cosine of a number. + +``` +acosh(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `asin` + +Returns the arc sine or inverse sine of a number. + +``` +asin(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `asinh` + +Returns the area hyperbolic sine or inverse hyperbolic sine of a number. + +``` +asinh(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `atan` + +Returns the arc tangent or inverse tangent of a number. + +``` +atan(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `atanh` + +Returns the area hyperbolic tangent or inverse hyperbolic tangent of a number. + +``` +atanh(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `atan2` + +Returns the arc tangent or inverse tangent of `expression_y / expression_x`. 
+ +``` +atan2(expression_y, expression_x) +``` + +**Arguments** + +- **expression_y**: First numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **expression_x**: Second numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `cbrt` + +Returns the cube root of a number. + +``` +cbrt(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `ceil` + +Returns the nearest integer greater than or equal to a number. + +``` +ceil(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `cos` + +Returns the cosine of a number. + +``` +cos(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `cosh` + +Returns the hyperbolic cosine of a number. + +``` +cosh(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `degrees` + +Converts radians to degrees. + +``` +degrees(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `exp` + +Returns the base-e exponential of a number. + +``` +exp(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to use as the exponent. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `factorial` + +Factorial. Returns 1 if value is less than 2. + +``` +factorial(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `floor` + +Returns the nearest integer less than or equal to a number. + +``` +floor(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `gcd` + +Returns the greatest common divisor of `expression_x` and `expression_y`. Returns 0 if both inputs are zero. + +``` +gcd(expression_x, expression_y) +``` + +**Arguments** + +- **expression_x**: First numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **expression_y**: Second numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `isnan` + +Returns true if a given number is +NaN or -NaN otherwise returns false. + +``` +isnan(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `iszero` + +Returns true if a given number is +0.0 or -0.0 otherwise returns false. + +``` +iszero(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. 
+ Can be a constant, column, or function, and any combination of arithmetic operators. + +## `lcm` + +Returns the least common multiple of `expression_x` and `expression_y`. Returns 0 if either input is zero. + +``` +lcm(expression_x, expression_y) +``` + +**Arguments** + +- **expression_x**: First numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **expression_y**: Second numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `ln` + +Returns the natural logarithm of a number. + +``` +ln(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `log` + +Returns the base-x logarithm of a number. +Can either provide a specified base, or if omitted then takes the base-10 of a number. + +``` +log(base, numeric_expression) +log(numeric_expression) +``` + +**Arguments** + +- **base**: Base numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `log10` + +Returns the base-10 logarithm of a number. + +``` +log10(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `log2` + +Returns the base-2 logarithm of a number. + +``` +log2(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `nanvl` + +Returns the first argument if it's not _NaN_. +Returns the second argument otherwise. + +``` +nanvl(expression_x, expression_y) +``` + +**Arguments** + +- **expression_x**: Numeric expression to return if it's not _NaN_. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **expression_y**: Numeric expression to return if the first expression is _NaN_. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `pi` + +Returns an approximate value of Ο€. + +``` +pi() +``` + +## `power` + +Returns a base expression raised to the power of an exponent. + +``` +power(base, exponent) +``` + +**Arguments** + +- **base**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **exponent**: Exponent numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +**Aliases** + +- pow + +## `pow` + +_Alias of [power](#power)._ + +## `radians` + +Converts degrees to radians. + +``` +radians(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `random` + +Returns a random float value in the range [0, 1). +The random seed is unique to each row. + +``` +random() +``` + +## `round` + +Rounds a number to the nearest integer. + +``` +round(numeric_expression[, decimal_places]) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. 
+ Can be a constant, column, or function, and any combination of arithmetic operators. +- **decimal_places**: Optional. The number of decimal places to round to. + Defaults to 0. + +## `signum` + +Returns the sign of a number. +Negative numbers return `-1`. +Zero and positive numbers return `1`. + +``` +signum(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `sin` + +Returns the sine of a number. + +``` +sin(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `sinh` + +Returns the hyperbolic sine of a number. + +``` +sinh(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `sqrt` + +Returns the square root of a number. + +``` +sqrt(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `tan` + +Returns the tangent of a number. + +``` +tan(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `tanh` + +Returns the hyperbolic tangent of a number. + +``` +tanh(numeric_expression) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `trunc` + +Truncates a number to a whole number or truncated to the specified decimal places. + +``` +trunc(numeric_expression[, decimal_places]) +``` + +**Arguments** + +- **numeric_expression**: Numeric expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +- **decimal_places**: Optional. The number of decimal places to + truncate to. Defaults to 0 (truncate to a whole number). If + `decimal_places` is a positive integer, truncates digits to the + right of the decimal point. If `decimal_places` is a negative + integer, replaces digits to the left of the decimal point with `0`. diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/other.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/other.mdx new file mode 100644 index 000000000000000..16de65bfb29912f --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/other.mdx @@ -0,0 +1,68 @@ +--- +title: Other functions +description: Miscellaneous scalar functions +sidebar: + order: 11 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `arrow_cast` + +Casts a value to a specific Arrow data type: + +``` +arrow_cast(expression, datatype) +``` + +**Arguments** + +- **expression**: Expression to cast. + Can be a constant, column, or function, and any combination of arithmetic or + string operators. +- **datatype**: [Arrow data type](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html) name + to cast to, as a string. 
The format is the same as that returned by [`arrow_typeof`] + +**Example** + +``` +> select arrow_cast(-5, 'Int8') as a, + arrow_cast('foo', 'Dictionary(Int32, Utf8)') as b, + arrow_cast('bar', 'LargeUtf8') as c, + arrow_cast('2023-01-02T12:53:02', 'Timestamp(Microsecond, Some("+08:00"))') as d + ; ++----+-----+-----+---------------------------+ +| a | b | c | d | ++----+-----+-----+---------------------------+ +| -5 | foo | bar | 2023-01-02T12:53:02+08:00 | ++----+-----+-----+---------------------------+ +1 row in set. Query took 0.001 seconds. +``` + +## `arrow_typeof` + +Returns the name of the underlying [Arrow data type](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html) of the expression: + +``` +arrow_typeof(expression) +``` + +**Arguments** + +- **expression**: Expression to evaluate. + Can be a constant, column, or function, and any combination of arithmetic or + string operators. + +**Example** + +``` +> select arrow_typeof('foo'), arrow_typeof(1); ++---------------------------+------------------------+ +| arrow_typeof(Utf8("foo")) | arrow_typeof(Int64(1)) | ++---------------------------+------------------------+ +| Utf8 | Int64 | ++---------------------------+------------------------+ +1 row in set. Query took 0.001 seconds. +``` diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/regex.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/regex.mdx new file mode 100644 index 000000000000000..07eda3e24cf4fca --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/regex.mdx @@ -0,0 +1,158 @@ +--- +title: Regex functions +description: Scalar functions for regular expressions +sidebar: + order: 5 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +Cloudflare Pipelines uses a +[PCRE-like](https://en.wikibooks.org/wiki/Regular_Expressions/Perl-Compatible_Regular_Expressions) +regular expression [syntax](https://docs.rs/regex/latest/regex/#syntax) (minus support for several features including +look-around and backreferences). + +## `regexp_like` + +Returns true if a [regular expression] has at least one match in a string, +false otherwise. + +[regular expression]: https://docs.rs/regex/latest/regex/#syntax + +``` +regexp_like(str, regexp[, flags]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **regexp**: Regular expression to test against the string expression. + Can be a constant, column, or function. +- **flags**: Optional regular expression flags that control the behavior of the + regular expression. The following flags are supported: + - **i**: case-insensitive: letters match both upper and lower case + - **m**: multi-line mode: ^ and $ match begin/end of line + - **s**: allow . to match \n + - **R**: enables CRLF mode: when multi-line mode is enabled, \r\n is used + - **U**: swap the meaning of x* and x*? 
+ +**Example** + +```sql +select regexp_like('KΓΆln', '[a-zA-Z]ΓΆ[a-zA-Z]{2}'); ++--------------------------------------------------------+ +| regexp_like(Utf8("KΓΆln"),Utf8("[a-zA-Z]ΓΆ[a-zA-Z]{2}")) | ++--------------------------------------------------------+ +| true | ++--------------------------------------------------------+ +SELECT regexp_like('aBc', '(b|d)', 'i'); ++--------------------------------------------------+ +| regexp_like(Utf8("aBc"),Utf8("(b|d)"),Utf8("i")) | ++--------------------------------------------------+ +| true | ++--------------------------------------------------+ +``` + +Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/regexp.rs) + +## `regexp_match` + +Returns a list of [regular expression](https://docs.rs/regex/latest/regex/#syntax) matches in a string. + +``` +regexp_match(str, regexp[, flags]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **regexp**: Regular expression to match against. + Can be a constant, column, or function. +- **flags**: Optional regular expression flags that control the behavior of the + regular expression. The following flags are supported: + - **i**: case-insensitive: letters match both upper and lower case + - **m**: multi-line mode: ^ and $ match begin/end of line + - **s**: allow . to match \n + - **R**: enables CRLF mode: when multi-line mode is enabled, \r\n is used + - **U**: swap the meaning of x* and x*? + +**Example** + +```sql +select regexp_match('KΓΆln', '[a-zA-Z]ΓΆ[a-zA-Z]{2}'); ++---------------------------------------------------------+ +| regexp_match(Utf8("KΓΆln"),Utf8("[a-zA-Z]ΓΆ[a-zA-Z]{2}")) | ++---------------------------------------------------------+ +| [KΓΆln] | ++---------------------------------------------------------+ +SELECT regexp_match('aBc', '(b|d)', 'i'); ++---------------------------------------------------+ +| regexp_match(Utf8("aBc"),Utf8("(b|d)"),Utf8("i")) | ++---------------------------------------------------+ +| [B] | ++---------------------------------------------------+ +``` + +Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/regexp.rs) + +## `regexp_replace` + +Replaces substrings in a string that match a [regular expression](https://docs.rs/regex/latest/regex/#syntax). + +``` +regexp_replace(str, regexp, replacement[, flags]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **regexp**: Regular expression to match against. + Can be a constant, column, or function. +- **replacement**: Replacement string expression. + Can be a constant, column, or function, and any combination of string operators. +- **flags**: Optional regular expression flags that control the behavior of the + regular expression. The following flags are supported: + - **g**: (global) Search globally and don't return after the first match + - **i**: case-insensitive: letters match both upper and lower case + - **m**: multi-line mode: ^ and $ match begin/end of line + - **s**: allow . to match \n + - **R**: enables CRLF mode: when multi-line mode is enabled, \r\n is used + - **U**: swap the meaning of x* and x*? 
+ +**Example** + +```sql +SELECT regexp_replace('foobarbaz', 'b(..)', 'X\\1Y', 'g'); ++------------------------------------------------------------------------+ +| regexp_replace(Utf8("foobarbaz"),Utf8("b(..)"),Utf8("X\1Y"),Utf8("g")) | ++------------------------------------------------------------------------+ +| fooXarYXazY | ++------------------------------------------------------------------------+ +SELECT regexp_replace('aBc', '(b|d)', 'Ab\\1a', 'i'); ++-------------------------------------------------------------------+ +| regexp_replace(Utf8("aBc"),Utf8("(b|d)"),Utf8("Ab\1a"),Utf8("i")) | ++-------------------------------------------------------------------+ +| aAbBac | ++-------------------------------------------------------------------+ +``` + +Additional examples can be found [here](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/regexp.rs) + +## `position` + +Returns the position of `substr` in `origstr` (counting from 1). If `substr` does +not appear in `origstr`, return 0. + +``` +position(substr in origstr) +``` + +**Arguments** + +- **substr**: The pattern string. +- **origstr**: The model string. diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/string.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/string.mdx new file mode 100644 index 000000000000000..b31b6311f6a915e --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/string.mdx @@ -0,0 +1,576 @@ +--- +title: String functions +description: Scalar functions for manipulating strings +sidebar: + order: 3 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `ascii` + +Returns the ASCII value of the first character in a string. + +``` +ascii(str) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +**Related functions**: +[chr](#chr) + +## `bit_length` + +Returns the bit length of a string. + +``` +bit_length(str) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +**Related functions**: +[length](#length), +[octet_length](#octet_length) + +## `btrim` + +Trims the specified trim string from the start and end of a string. +If no trim string is provided, all whitespace is removed from the start and end +of the input string. + +``` +btrim(str[, trim_str]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **trim_str**: String expression to trim from the beginning and end of the input string. + Can be a constant, column, or function, and any combination of arithmetic operators. + _Default is whitespace characters._ + +**Related functions**: +[ltrim](#ltrim), +[rtrim](#rtrim) + +**Aliases** + +- trim + +## `char_length` + +_Alias of [length](#length)._ + +## `character_length` + +_Alias of [length](#length)._ + +## `concat` + +Concatenates multiple strings together. + +``` +concat(str[, ..., str_n]) +``` + +**Arguments** + +- **str**: String expression to concatenate. + Can be a constant, column, or function, and any combination of string operators. +- **str_n**: Subsequent string column or literal string to concatenate. 
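+
+**Example**
+
+An illustrative example (a minimal sketch; `source` stands in for a stream name):
+
+```sql
+SELECT concat('data', 'f', 'usion') AS s FROM source;
+-- returns 'datafusion'
+```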
+ +**Related functions**: +[concat_ws](#concat_ws) + +## `concat_ws` + +Concatenates multiple strings together with a specified separator. + +``` +concat(separator, str[, ..., str_n]) +``` + +**Arguments** + +- **separator**: Separator to insert between concatenated strings. +- **str**: String expression to concatenate. + Can be a constant, column, or function, and any combination of string operators. +- **str_n**: Subsequent string column or literal string to concatenate. + +**Related functions**: +[concat](#concat) + +## `chr` + +Returns the character with the specified ASCII or Unicode code value. + +``` +chr(expression) +``` + +**Arguments** + +- **expression**: Expression containing the ASCII or Unicode code value to operate on. + Can be a constant, column, or function, and any combination of arithmetic or + string operators. + +**Related functions**: +[ascii](#ascii) + +## `ends_with` + +Tests if a string ends with a substring. + +``` +ends_with(str, substr) +``` + +**Arguments** + +- **str**: String expression to test. + Can be a constant, column, or function, and any combination of string operators. +- **substr**: Substring to test for. + +## `initcap` + +Capitalizes the first character in each word in the input string. +Words are delimited by non-alphanumeric characters. + +``` +initcap(str) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +**Related functions**: +[lower](#lower), +[upper](#upper) + +## `instr` + +_Alias of [strpos](#strpos)._ + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **substr**: Substring expression to search for. + Can be a constant, column, or function, and any combination of string operators. + +## `left` + +Returns a specified number of characters from the left side of a string. + +``` +left(str, n) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **n**: Number of characters to return. + +**Related functions**: +[right](#right) + +## `length` + +Returns the number of characters in a string. + +``` +length(str) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +**Aliases** + +- char_length +- character_length + +**Related functions**: +[bit_length](#bit_length), +[octet_length](#octet_length) + +## `lower` + +Converts a string to lower-case. + +``` +lower(str) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +**Related functions**: +[initcap](#initcap), +[upper](#upper) + +## `lpad` + +Pads the left side of a string with another string to a specified string length. + +``` +lpad(str, n[, padding_str]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **n**: String length to pad to. +- **padding_str**: String expression to pad with. + Can be a constant, column, or function, and any combination of string operators. + _Default is a space._ + +**Related functions**: +[rpad](#rpad) + +## `ltrim` + +Trims the specified trim string from the beginning of a string. 
+If no trim string is provided, all whitespace is removed from the start +of the input string. + +``` +ltrim(str[, trim_str]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **trim_str**: String expression to trim from the beginning of the input string. + Can be a constant, column, or function, and any combination of arithmetic operators. + _Default is whitespace characters._ + +**Related functions**: +[btrim](#btrim), +[rtrim](#rtrim) + +## `octet_length` + +Returns the length of a string in bytes. + +``` +octet_length(str) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. + +**Related functions**: +[bit_length](#bit_length), +[length](#length) + +## `repeat` + +Returns a string with an input string repeated a specified number. + +``` +repeat(str, n) +``` + +**Arguments** + +- **str**: String expression to repeat. + Can be a constant, column, or function, and any combination of string operators. +- **n**: Number of times to repeat the input string. + +## `replace` + +Replaces all occurrences of a specified substring in a string with a new substring. + +``` +replace(str, substr, replacement) +``` + +**Arguments** + +- **str**: String expression to repeat. + Can be a constant, column, or function, and any combination of string operators. +- **substr**: Substring expression to replace in the input string. + Can be a constant, column, or function, and any combination of string operators. +- **replacement**: Replacement substring expression. + Can be a constant, column, or function, and any combination of string operators. + +## `reverse` + +Reverses the character order of a string. + +``` +reverse(str) +``` + +**Arguments** + +- **str**: String expression to repeat. + Can be a constant, column, or function, and any combination of string operators. + +## `right` + +Returns a specified number of characters from the right side of a string. + +``` +right(str, n) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **n**: Number of characters to return. + +**Related functions**: +[left](#left) + +## `rpad` + +Pads the right side of a string with another string to a specified string length. + +``` +rpad(str, n[, padding_str]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **n**: String length to pad to. +- **padding_str**: String expression to pad with. + Can be a constant, column, or function, and any combination of string operators. + _Default is a space._ + +**Related functions**: +[lpad](#lpad) + +## `rtrim` + +Trims the specified trim string from the end of a string. +If no trim string is provided, all whitespace is removed from the end +of the input string. + +``` +rtrim(str[, trim_str]) +``` + +**Arguments** + +- **str**: String expression to operate on. + Can be a constant, column, or function, and any combination of string operators. +- **trim_str**: String expression to trim from the end of the input string. + Can be a constant, column, or function, and any combination of arithmetic operators. 
+  _Default is whitespace characters._
+
+**Related functions**:
+[btrim](#btrim),
+[ltrim](#ltrim)
+
+## `split_part`
+
+Splits a string based on a specified delimiter and returns the substring at the
+specified position.
+
+```
+split_part(str, delimiter, pos)
+```
+
+**Arguments**
+
+- **str**: String expression to split.
+  Can be a constant, column, or function, and any combination of string operators.
+- **delimiter**: String or character to split on.
+- **pos**: Position of the part to return.
+
+## `starts_with`
+
+Tests if a string starts with a substring.
+
+```
+starts_with(str, substr)
+```
+
+**Arguments**
+
+- **str**: String expression to test.
+  Can be a constant, column, or function, and any combination of string operators.
+- **substr**: Substring to test for.
+
+## `strpos`
+
+Returns the starting position of a specified substring in a string.
+Positions begin at 1.
+If the substring does not exist in the string, the function returns 0.
+
+```
+strpos(str, substr)
+```
+
+**Arguments**
+
+- **str**: String expression to operate on.
+  Can be a constant, column, or function, and any combination of string operators.
+- **substr**: Substring expression to search for.
+  Can be a constant, column, or function, and any combination of string operators.
+
+**Aliases**
+
+- instr
+
+## `substr`
+
+Extracts a substring of a specified number of characters from a specific
+starting position in a string.
+
+```
+substr(str, start_pos[, length])
+```
+
+**Arguments**
+
+- **str**: String expression to operate on.
+  Can be a constant, column, or function, and any combination of string operators.
+- **start_pos**: Character position to start the substring at.
+  The first character in the string has a position of 1.
+- **length**: Number of characters to extract.
+  If not specified, returns the rest of the string after the start position.
+
+## `translate`
+
+Translates characters in a string to specified translation characters.
+
+```
+translate(str, chars, translation)
+```
+
+**Arguments**
+
+- **str**: String expression to operate on.
+  Can be a constant, column, or function, and any combination of string operators.
+- **chars**: Characters to translate.
+- **translation**: Translation characters. Translation characters replace only
+  characters at the same position in the **chars** string.
+
+## `to_hex`
+
+Converts an integer to a hexadecimal string.
+
+```
+to_hex(int)
+```
+
+**Arguments**
+
+- **int**: Integer expression to convert.
+  Can be a constant, column, or function, and any combination of arithmetic operators.
+
+## `trim`
+
+_Alias of [btrim](#btrim)._
+
+## `upper`
+
+Converts a string to upper-case.
+
+```
+upper(str)
+```
+
+**Arguments**
+
+- **str**: String expression to operate on.
+  Can be a constant, column, or function, and any combination of string operators.
+
+**Related functions**:
+[initcap](#initcap),
+[lower](#lower)
+
+## `uuid`
+
+Returns a UUID v4 string value that is unique per row.
+
+```
+uuid()
+```
+
+## `overlay`
+
+Replaces a substring of `str`, starting at the specified position and spanning the specified number of characters, with another string.
+For example, `overlay('Txxxxas' placing 'hom' from 2 for 4) → Thomas`.
+
+```
+overlay(str PLACING substr FROM pos [FOR count])
+```
+
+**Arguments**
+
+- **str**: String expression to operate on.
+- **substr**: The string that replaces part of `str`.
+- **pos**: The position in `str` at which the replacement starts.
+- **count**: The number of characters of `str` to replace, starting from `pos`. If not specified, the length of `substr` is used.
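+
+**Example**
+
+A minimal illustrative sketch (with `source` as a placeholder stream name) showing the replacement described above:
+
+```sql
+SELECT overlay('Txxxxas' PLACING 'hom' FROM 2 FOR 4) AS name FROM source;
+-- returns 'Thomas'
+```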
+ +## `levenshtein` + +Returns the Levenshtein distance between the two given strings. +For example, `levenshtein('kitten', 'sitting') = 3` + +``` +levenshtein(str1, str2) +``` + +**Arguments** + +- **str1**: String expression to compute Levenshtein distance with str2. +- **str2**: String expression to compute Levenshtein distance with str1. + +## `substr_index` + +Returns the substring from str before count occurrences of the delimiter delim. +If count is positive, everything to the left of the final delimiter (counting from the left) is returned. +If count is negative, everything to the right of the final delimiter (counting from the right) is returned. +For example, `substr_index('www.apache.org', '.', 1) = www`, `substr_index('www.apache.org', '.', -1) = org` + +``` +substr_index(str, delim, count) +``` + +**Arguments** + +- **str**: String expression to operate on. +- **delim**: the string to find in str to split str. +- **count**: The number of times to search for the delimiter. Can be both a positive or negative number. + +## `find_in_set` + +Returns a value in the range of 1 to N if the string str is in the string list strlist consisting of N substrings. +For example, `find_in_set('b', 'a,b,c,d') = 2` + +``` +find_in_set(str, strlist) +``` + +**Arguments** + +- **str**: String expression to find in strlist. +- **strlist**: A string list is a string composed of substrings separated by , characters. diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/struct.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/struct.mdx new file mode 100644 index 000000000000000..3cf310ef0e75030 --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/struct.mdx @@ -0,0 +1,47 @@ +--- +title: Struct functions +description: Scalar functions for manipulating structs +sidebar: + order: 9 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `struct` + +Returns an Arrow struct using the specified input expressions. +Fields in the returned struct use the `cN` naming convention. +For example: `c0`, `c1`, `c2`, etc. + +``` +struct(expression1[, ..., expression_n]) +``` + +For example, this query converts two columns `a` and `b` to a single column with +a struct type of fields `c0` and `c1`: + +``` +select * from t; ++---+---+ +| a | b | ++---+---+ +| 1 | 2 | +| 3 | 4 | ++---+---+ + +select struct(a, b) from t; ++-----------------+ +| struct(t.a,t.b) | ++-----------------+ +| {c0: 1, c1: 2} | +| {c0: 3, c1: 4} | ++-----------------+ +``` + +#### Arguments + +- **expression_n**: Expression to include in the output struct. + Can be a constant, column, or function, and any combination of arithmetic or + string operators. 
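+
+As an illustrative sketch (assuming a stream named `events` with `user_id`, `event_type`, and `amount` columns), a pipeline query can wrap related columns into a single struct-typed column:
+
+```sql
+SELECT struct(user_id, event_type, amount) AS event_data FROM events;
+```
+
+Fields of the resulting struct can then be referenced with `.` notation (for example, `event_data.c0`), as described in [SQL data types](/pipelines/sql-reference/sql-data-types/).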
diff --git a/src/content/docs/pipelines/sql-reference/scalar-functions/time-and-date.mdx b/src/content/docs/pipelines/sql-reference/scalar-functions/time-and-date.mdx new file mode 100644 index 000000000000000..bece2eeeba3e527 --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/scalar-functions/time-and-date.mdx @@ -0,0 +1,412 @@ +--- +title: Time and date functions +description: Scalar functions for handling times and dates +sidebar: + order: 7 +--- + +_Cloudflare Pipelines scalar function implementations are based on +[Apache DataFusion](https://arrow.apache.org/datafusion/) (via [Arroyo](https://www.arroyo.dev/)) and these docs are derived from +the DataFusion function reference._ + +## `date_bin` + +Calculates time intervals and returns the start of the interval nearest to the specified timestamp. +Use `date_bin` to downsample time series data by grouping rows into time-based "bins" or "windows" +and applying an aggregate or selector function to each window. + +For example, if you "bin" or "window" data into 15 minute intervals, an input +timestamp of `2023-01-01T18:18:18Z` will be updated to the start time of the 15 +minute bin it is in: `2023-01-01T18:15:00Z`. + +``` +date_bin(interval, expression, origin-timestamp) +``` + +**Arguments** + +- **interval**: Bin interval. +- **expression**: Time expression to operate on. + Can be a constant, column, or function. +- **origin-timestamp**: Optional. Starting point used to determine bin boundaries. If not specified + defaults `1970-01-01T00:00:00Z` (the UNIX epoch in UTC). + +The following intervals are supported: + +- nanoseconds +- microseconds +- milliseconds +- seconds +- minutes +- hours +- days +- weeks +- months +- years +- century + +## `date_trunc` + +Truncates a timestamp value to a specified precision. + +``` +date_trunc(precision, expression) +``` + +**Arguments** + +- **precision**: Time precision to truncate to. + The following precisions are supported: + - year / YEAR + - quarter / QUARTER + - month / MONTH + - week / WEEK + - day / DAY + - hour / HOUR + - minute / MINUTE + - second / SECOND + +- **expression**: Time expression to operate on. + Can be a constant, column, or function. + +**Aliases** + +- datetrunc + +## `datetrunc` + +_Alias of [date_trunc](#date_trunc)._ + +## `date_part` + +Returns the specified part of the date as an integer. + +``` +date_part(part, expression) +``` + +**Arguments** + +- **part**: Part of the date to return. + The following date parts are supported: + - year + - quarter _(emits value in inclusive range [1, 4] based on which quartile of the year the date is in)_ + - month + - week _(week of the year)_ + - day _(day of the month)_ + - hour + - minute + - second + - millisecond + - microsecond + - nanosecond + - dow _(day of the week)_ + - doy _(day of the year)_ + - epoch _(seconds since Unix epoch)_ + +- **expression**: Time expression to operate on. + Can be a constant, column, or function. + +**Aliases** + +- datepart + +## `datepart` + +_Alias of [date_part](#date_part)._ + +## `extract` + +Returns a sub-field from a time value as an integer. + +``` +extract(field FROM source) +``` + +Equivalent to calling `date_part('field', source)`. For example, these are equivalent: + +```sql +extract(day FROM '2024-04-13'::date) +date_part('day', '2024-04-13'::date) +``` + +See [date_part](#date_part). + +## `make_date` + +Make a date from year/month/day component parts. + +``` +make_date(year, month, day) +``` + +**Arguments** + +- **year**: Year to use when making the date. 
+ Can be a constant, column or function, and any combination of arithmetic operators. +- **month**: Month to use when making the date. + Can be a constant, column or function, and any combination of arithmetic operators. +- **day**: Day to use when making the date. + Can be a constant, column or function, and any combination of arithmetic operators. + +**Example** + +``` +> select make_date(2023, 1, 31); ++-------------------------------------------+ +| make_date(Int64(2023),Int64(1),Int64(31)) | ++-------------------------------------------+ +| 2023-01-31 | ++-------------------------------------------+ +> select make_date('2023', '01', '31'); ++-----------------------------------------------+ +| make_date(Utf8("2023"),Utf8("01"),Utf8("31")) | ++-----------------------------------------------+ +| 2023-01-31 | ++-----------------------------------------------+ +``` + +## `to_char` + +Returns a string representation of a date, time, timestamp or duration based +on a [Chrono format]. Unlike the PostgreSQL equivalent of this function +numerical formatting is not supported. + +``` +to_char(expression, format) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function that results in a + date, time, timestamp or duration. +- **format**: A [Chrono format] string to use to convert the expression. + +**Example** + +``` +> > select to_char('2023-03-01'::date, '%d-%m-%Y'); ++----------------------------------------------+ +| to_char(Utf8("2023-03-01"),Utf8("%d-%m-%Y")) | ++----------------------------------------------+ +| 01-03-2023 | ++----------------------------------------------+ +``` + +**Aliases** + +- date_format + +## `to_timestamp` + +Converts a value to a timestamp (`YYYY-MM-DDT00:00:00Z`). +Supports strings, integer, unsigned integer, and double types as input. +Strings are parsed as RFC3339 (e.g. '2023-07-20T05:44:00') if no [Chrono formats] are provided. +Integers, unsigned integers, and doubles are interpreted as seconds since the unix epoch (`1970-01-01T00:00:00Z`). +Returns the corresponding timestamp. + +Note: `to_timestamp` returns `Timestamp(Nanosecond)`. The supported range for integer input is between `-9223372037` and `9223372036`. +Supported range for string input is between `1677-09-21T00:12:44.0` and `2262-04-11T23:47:16.0`. Please use `to_timestamp_seconds` +for the input outside of supported bounds. + +``` +to_timestamp(expression[, ..., format_n]) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **format_n**: Optional [Chrono format] strings to use to parse the expression. Formats will be tried in the order + they appear with the first successful one being returned. If none of the formats successfully parse the expression + an error will be returned. 
+ +[chrono format]: https://docs.rs/chrono/latest/chrono/format/strftime/index.html + +**Example** + +``` +> select to_timestamp('2023-01-31T09:26:56.123456789-05:00'); ++-----------------------------------------------------------+ +| to_timestamp(Utf8("2023-01-31T09:26:56.123456789-05:00")) | ++-----------------------------------------------------------+ +| 2023-01-31T14:26:56.123456789 | ++-----------------------------------------------------------+ +> select to_timestamp('03:59:00.123456789 05-17-2023', '%c', '%+', '%H:%M:%S%.f %m-%d-%Y'); ++--------------------------------------------------------------------------------------------------------+ +| to_timestamp(Utf8("03:59:00.123456789 05-17-2023"),Utf8("%c"),Utf8("%+"),Utf8("%H:%M:%S%.f %m-%d-%Y")) | ++--------------------------------------------------------------------------------------------------------+ +| 2023-05-17T03:59:00.123456789 | ++--------------------------------------------------------------------------------------------------------+ +``` + +## `to_timestamp_millis` + +Converts a value to a timestamp (`YYYY-MM-DDT00:00:00.000Z`). +Supports strings, integer, and unsigned integer types as input. +Strings are parsed as RFC3339 (e.g. '2023-07-20T05:44:00') if no [Chrono format]s are provided. +Integers and unsigned integers are interpreted as milliseconds since the unix epoch (`1970-01-01T00:00:00Z`). +Returns the corresponding timestamp. + +``` +to_timestamp_millis(expression[, ..., format_n]) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **format_n**: Optional [Chrono format] strings to use to parse the expression. Formats will be tried in the order + they appear with the first successful one being returned. If none of the formats successfully parse the expression + an error will be returned. + +**Example** + +``` +> select to_timestamp_millis('2023-01-31T09:26:56.123456789-05:00'); ++------------------------------------------------------------------+ +| to_timestamp_millis(Utf8("2023-01-31T09:26:56.123456789-05:00")) | ++------------------------------------------------------------------+ +| 2023-01-31T14:26:56.123 | ++------------------------------------------------------------------+ +> select to_timestamp_millis('03:59:00.123456789 05-17-2023', '%c', '%+', '%H:%M:%S%.f %m-%d-%Y'); ++---------------------------------------------------------------------------------------------------------------+ +| to_timestamp_millis(Utf8("03:59:00.123456789 05-17-2023"),Utf8("%c"),Utf8("%+"),Utf8("%H:%M:%S%.f %m-%d-%Y")) | ++---------------------------------------------------------------------------------------------------------------+ +| 2023-05-17T03:59:00.123 | ++---------------------------------------------------------------------------------------------------------------+ +``` + +## `to_timestamp_micros` + +Converts a value to a timestamp (`YYYY-MM-DDT00:00:00.000000Z`). +Supports strings, integer, and unsigned integer types as input. +Strings are parsed as RFC3339 (e.g. '2023-07-20T05:44:00') if no [Chrono format]s are provided. +Integers and unsigned integers are interpreted as microseconds since the unix epoch (`1970-01-01T00:00:00Z`) +Returns the corresponding timestamp. + +``` +to_timestamp_micros(expression[, ..., format_n]) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. 
+- **format_n**: Optional [Chrono format] strings to use to parse the expression. Formats will be tried in the order + they appear with the first successful one being returned. If none of the formats successfully parse the expression + an error will be returned. + +**Example** + +``` +> select to_timestamp_micros('2023-01-31T09:26:56.123456789-05:00'); ++------------------------------------------------------------------+ +| to_timestamp_micros(Utf8("2023-01-31T09:26:56.123456789-05:00")) | ++------------------------------------------------------------------+ +| 2023-01-31T14:26:56.123456 | ++------------------------------------------------------------------+ +> select to_timestamp_micros('03:59:00.123456789 05-17-2023', '%c', '%+', '%H:%M:%S%.f %m-%d-%Y'); ++---------------------------------------------------------------------------------------------------------------+ +| to_timestamp_micros(Utf8("03:59:00.123456789 05-17-2023"),Utf8("%c"),Utf8("%+"),Utf8("%H:%M:%S%.f %m-%d-%Y")) | ++---------------------------------------------------------------------------------------------------------------+ +| 2023-05-17T03:59:00.123456 | ++---------------------------------------------------------------------------------------------------------------+ +``` + +## `to_timestamp_nanos` + +Converts a value to a timestamp (`YYYY-MM-DDT00:00:00.000000000Z`). +Supports strings, integer, and unsigned integer types as input. +Strings are parsed as RFC3339 (e.g. '2023-07-20T05:44:00') if no [Chrono formats] are provided. +Integers and unsigned integers are interpreted as nanoseconds since the unix epoch (`1970-01-01T00:00:00Z`). +Returns the corresponding timestamp. + +``` +to_timestamp_nanos(expression[, ..., format_n]) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **format_n**: Optional [Chrono format] strings to use to parse the expression. Formats will be tried in the order + they appear with the first successful one being returned. If none of the formats successfully parse the expression + an error will be returned. + +**Example** + +``` +> select to_timestamp_nanos('2023-01-31T09:26:56.123456789-05:00'); ++-----------------------------------------------------------------+ +| to_timestamp_nanos(Utf8("2023-01-31T09:26:56.123456789-05:00")) | ++-----------------------------------------------------------------+ +| 2023-01-31T14:26:56.123456789 | ++-----------------------------------------------------------------+ +> select to_timestamp_nanos('03:59:00.123456789 05-17-2023', '%c', '%+', '%H:%M:%S%.f %m-%d-%Y'); ++--------------------------------------------------------------------------------------------------------------+ +| to_timestamp_nanos(Utf8("03:59:00.123456789 05-17-2023"),Utf8("%c"),Utf8("%+"),Utf8("%H:%M:%S%.f %m-%d-%Y")) | ++--------------------------------------------------------------------------------------------------------------+ +| 2023-05-17T03:59:00.123456789 | ++---------------------------------------------------------------------------------------------------------------+ +``` + +## `to_timestamp_seconds` + +Converts a value to a timestamp (`YYYY-MM-DDT00:00:00.000Z`). +Supports strings, integer, and unsigned integer types as input. +Strings are parsed as RFC3339 (e.g. '2023-07-20T05:44:00') if no [Chrono format]s are provided. +Integers and unsigned integers are interpreted as seconds since the unix epoch (`1970-01-01T00:00:00Z`). +Returns the corresponding timestamp. 
+ +``` +to_timestamp_seconds(expression[, ..., format_n]) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. +- **format_n**: Optional [Chrono format] strings to use to parse the expression. Formats will be tried in the order + they appear with the first successful one being returned. If none of the formats successfully parse the expression + an error will be returned. + +**Example** + +``` +> select to_timestamp_seconds('2023-01-31T09:26:56.123456789-05:00'); ++-------------------------------------------------------------------+ +| to_timestamp_seconds(Utf8("2023-01-31T09:26:56.123456789-05:00")) | ++-------------------------------------------------------------------+ +| 2023-01-31T14:26:56 | ++-------------------------------------------------------------------+ +> select to_timestamp_seconds('03:59:00.123456789 05-17-2023', '%c', '%+', '%H:%M:%S%.f %m-%d-%Y'); ++----------------------------------------------------------------------------------------------------------------+ +| to_timestamp_seconds(Utf8("03:59:00.123456789 05-17-2023"),Utf8("%c"),Utf8("%+"),Utf8("%H:%M:%S%.f %m-%d-%Y")) | ++----------------------------------------------------------------------------------------------------------------+ +| 2023-05-17T03:59:00 | ++----------------------------------------------------------------------------------------------------------------+ +``` + +## `from_unixtime` + +Converts an integer to RFC3339 timestamp format (`YYYY-MM-DDT00:00:00.000000000Z`). +Integers and unsigned integers are interpreted as nanoseconds since the unix epoch (`1970-01-01T00:00:00Z`) +return the corresponding timestamp. + +``` +from_unixtime(expression) +``` + +**Arguments** + +- **expression**: Expression to operate on. + Can be a constant, column, or function, and any combination of arithmetic operators. + +## `now` + +Returns the UTC timestamp at pipeline start. + +The now() return value is determined at query compilation time, and will be constant across the execution +of the pipeline. diff --git a/src/content/docs/pipelines/sql-reference/select-statements.mdx b/src/content/docs/pipelines/sql-reference/select-statements.mdx new file mode 100644 index 000000000000000..dd0119d48c6bdad --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/select-statements.mdx @@ -0,0 +1,133 @@ +--- +title: SELECT statements +description: Query syntax for data transformation in Cloudflare Pipelines SQL +sidebar: + order: 2 +--- + +SELECT statements are used to transform data in Cloudflare Pipelines. The general form is: + +```sql +[WITH with_query [, ...]] +SELECT select_expr [, ...] +FROM from_item +[WHERE condition] +``` + +## WITH clause + +The WITH clause allows you to define named subqueries that can be referenced in the main query. This can improve query readability by breaking down complex transformations. + +Syntax: + +```sql +WITH query_name AS (subquery) [, ...] +``` + +Simple example: + +```sql +WITH filtered_events AS + (SELECT user_id, event_type, amount + FROM user_events WHERE amount > 50) +SELECT user_id, amount * 1.1 as amount_with_tax +FROM filtered_events +WHERE event_type = 'purchase'; +``` + +## SELECT clause + +The SELECT clause is a comma-separated list of expressions, with optional aliases. Column names must be unique. + +```sql +SELECT select_expr [, ...] 
+``` + +Examples: + +```sql +-- Select specific columns +SELECT user_id, event_type, amount FROM events + +-- Use expressions and aliases +SELECT + user_id, + amount * 1.1 as amount_with_tax, + UPPER(event_type) as event_type_upper +FROM events + +-- Select all columns +SELECT * FROM events +``` + +## FROM clause + +The FROM clause specifies the data source for the query. It will be either a table name or subquery. The table name can be either a stream name or a table created in the WITH clause. + +```sql +FROM from_item +``` + +Tables can be given aliases: + +```sql +SELECT e.user_id, e.amount +FROM user_events e +WHERE e.event_type = 'purchase' +``` + +## WHERE clause + +The WHERE clause filters data using boolean conditions. Predicates are applied to input rows. + +```sql +WHERE condition +``` + +Examples: + +```sql +-- Filter by field value +SELECT * FROM events WHERE event_type = 'purchase' + +-- Multiple conditions +SELECT * FROM events +WHERE event_type = 'purchase' AND amount > 50 + +-- String operations +SELECT * FROM events +WHERE user_id LIKE 'user_%' + +-- Null checks +SELECT * FROM events +WHERE description IS NOT NULL +``` + +## UNNEST operator + +The UNNEST operator converts arrays into multiple rows. This is useful for processing list data types. + +UNNEST restrictions: + +- May only appear in the SELECT clause +- Only one array may be unnested per SELECT statement + +Example: + +```sql +SELECT + UNNEST([1, 2, 3]) as numbers +FROM events; +``` + +This will produce: + +``` ++---------+ +| numbers | ++---------+ +| 1 | +| 2 | +| 3 | ++---------+ +``` diff --git a/src/content/docs/pipelines/sql-reference/sql-data-types.mdx b/src/content/docs/pipelines/sql-reference/sql-data-types.mdx new file mode 100644 index 000000000000000..2be142cfd1cfa2e --- /dev/null +++ b/src/content/docs/pipelines/sql-reference/sql-data-types.mdx @@ -0,0 +1,54 @@ +--- +title: SQL data types +description: Supported data types in Cloudflare Pipelines SQL +sidebar: + order: 1 +--- + +Cloudflare Pipelines supports a set of primitive and composite data types for SQL transformations. These types can be used in stream schemas and SQL literals with automatic type inference. + +## Primitive types + +| Pipelines | SQL Types | Example Literals | +| ----------- | ----------------------------------- | ---------------------------------------------------- | +| `bool` | `BOOLEAN` | `TRUE`, `FALSE` | +| `int32` | `INT`, `INTEGER` | `0`, `1`, `-2` | +| `int64` | `BIGINT` | `0`, `1`, `-2` | +| `float32` | `FLOAT`, `REAL` | `0.0`, `-2.4`, `1E-3` | +| `float64` | `DOUBLE` | `0.0`, `-2.4`, `1E-35` | +| `string` | `VARCHAR`, `CHAR`, `TEXT`, `STRING` | `"hello"`, `"world"` | +| `timestamp` | `TIMESTAMP` | `'2020-01-01'`, `'2023-05-17T22:16:00.648662+00:00'` | +| `binary` | `BYTEA` | `X'A123'` (hex) | +| `json` | `JSON` | `'{"event": "purchase", "amount": 29.99}'` | + +## Composite types + +In addition to primitive types, Pipelines SQL supports composite types for more complex data structures. + +### List types + +Lists group together zero or more elements of the same type. In stream schemas, lists are declared using the `list` type with an `items` field specifying the element type. In SQL, lists correspond to arrays and are declared by suffixing another type with `[]`, for example `INT[]`. + +List values can be indexed using 1-indexed subscript notation (`v[1]` is the first element of `v`). 
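+
+For example (a minimal sketch assuming a stream `events` with a list-typed `tags` column):
+
+```sql
+SELECT tags[1] AS first_tag FROM events;
+```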
+ +Lists can be constructed via `[]` literals: + +```sql +SELECT [1, 2, 3] as numbers +``` + +Pipelines provides array functions for manipulating list values, and lists may be unnested using the `UNNEST` operator. + +### Struct types + +Structs combine related fields into a single value. In stream schemas, structs are declared using the `struct` type with a `fields` array. In SQL, structs can be created using the `struct` function. + +Example creating a struct in SQL: + +```sql +SELECT struct('user123', 'purchase', 29.99) as event_data FROM events +``` + +This creates a struct with fields `c0`, `c1`, `c2` containing the user ID, event type, and amount. + +Struct fields can be accessed via `.` notation, for example `event_data.c0` for the user ID. diff --git a/src/content/docs/pipelines/streams/index.mdx b/src/content/docs/pipelines/streams/index.mdx new file mode 100644 index 000000000000000..5660ad4353c509f --- /dev/null +++ b/src/content/docs/pipelines/streams/index.mdx @@ -0,0 +1,28 @@ +--- +title: Streams +pcx_content_type: navigation +sidebar: + order: 2 +--- + +import { LinkCard } from "~/components"; + +Streams are durable, buffered queues that receive and store events for processing in [Cloudflare Pipelines](/pipelines/). They provide reliable data ingestion via HTTP endpoints and Worker bindings, ensuring no data loss even during downstream processing delays or failures. + +A single stream can be read by multiple pipelines, allowing you to route the same data to different destinations or apply different transformations. For example, you might send user events to both a real-time analytics pipeline and a data warehouse pipeline. + +Streams currently accept events in JSON format and support both structured events with defined schemas and unstructured JSON. When a schema is provided, streams will validate and enforce it for incoming events. + +## Learn more + + + + diff --git a/src/content/docs/pipelines/streams/manage-streams.mdx b/src/content/docs/pipelines/streams/manage-streams.mdx new file mode 100644 index 000000000000000..87968fddd0365d3 --- /dev/null +++ b/src/content/docs/pipelines/streams/manage-streams.mdx @@ -0,0 +1,191 @@ +--- +pcx_content_type: configuration +title: Manage streams +description: Create, configure, and manage streams for data ingestion +sidebar: + order: 1 +--- + +import { Steps, Tabs, TabItem, DashButton } from "~/components"; + +Learn how to: + +- Create and configure streams for data ingestion +- View and update stream settings +- Delete streams when no longer needed + +## Create a stream + +Streams are made available to pipelines as SQL tables using the stream name (e.g., `SELECT * FROM my_stream`). + +### Dashboard + + +1. In the Cloudflare dashboard, go to the **Pipelines** page. + + +2. Select **Create Pipeline** to launch the pipeline creation wizard. +3. Complete the wizard to create your stream along with the associated sink and pipeline. + + +### Wrangler CLI + +To create a stream, run the [`pipelines streams create`](/workers/wrangler/commands/#pipelines-streams-create) command: + +```bash +npx wrangler pipelines streams create +``` + +Alternatively, to use the interactive setup wizard that helps you configure a stream, sink, and pipeline, run the [`pipelines setup`](/workers/wrangler/commands/#pipelines-setup) command: + +```bash +npx wrangler pipelines setup +``` + +### Schema configuration + +Streams support two approaches for handling data: + +- **Structured streams**: Define a schema with specific fields and data types. 
Events are validated against the schema. +- **Unstructured streams**: Accept any valid JSON without validation. These streams have a single `value` column containing the JSON data. + +To create a structured stream, provide a schema file: + +```bash +npx wrangler pipelines streams create my-stream --schema-file schema.json +``` + +Example schema file: + +```json +{ + "fields": [ + { + "name": "user_id", + "type": "string", + "required": true + }, + { + "name": "amount", + "type": "float64", + "required": false + }, + { + "name": "tags", + "type": "list", + "required": false, + "items": { + "type": "string" + } + }, + { + "name": "metadata", + "type": "struct", + "required": false, + "fields": [ + { + "name": "source", + "type": "string", + "required": false + }, + { + "name": "priority", + "type": "int32", + "required": false + } + ] + } + ] +} +``` + +**Supported data types:** + +- `string` - Text values +- `int32`, `int64` - Integer numbers +- `float32`, `float64` - Floating-point numbers +- `bool` - Boolean true/false +- `timestamp` - RFC 3339 timestamps, or numeric values parsed as Unix seconds, milliseconds, or microseconds (depending on unit) +- `json` - JSON objects +- `binary` - Binary data (base64-encoded) +- `list` - Arrays of values +- `struct` - Nested objects with defined fields + +:::note +Events that do not match the defined schema are accepted during ingestion but will be dropped during processing. Schema modifications are not supported after stream creation. +::: + +## View stream configuration + +### Dashboard + + + 1. In the Cloudflare dashboard, go to **Pipelines** > **Streams**. + + 2. Select a stream to view its associated configuration. + + + +### Wrangler CLI + +To view a specific stream, run the [`pipelines streams get`](/workers/wrangler/commands/#pipelines-streams-get) command: + +```bash +npx wrangler pipelines streams get +``` + +To list all streams in your account, run the [`pipelines streams list`](/workers/wrangler/commands/#pipelines-streams-list) command: + +```bash +npx wrangler pipelines streams list +``` + +## Update HTTP ingest settings + +You can update certain HTTP ingest settings after stream creation. Schema modifications are not supported once a stream is created. + +### Dashboard + + + 1. In the Cloudflare dashboard, go to **Pipelines** > **Streams**. + + 2. Select the stream you want to update. + + 3. In the **Settings** tab, navigate to **HTTP Ingest**. + + 4. To enable or disable HTTP ingestion, select **Enable** or **Disable**. + + 5. To update authentication and CORS settings, select **Edit** and modify. + + 6. Save your changes. + + + +:::note +For details on configuring authentication tokens and making authenticated requests, see [Writing to streams](/pipelines/streams/writing-to-streams/). +::: + +## Delete a stream + +### Dashboard + + + 1. In the Cloudflare dashboard, go to **Pipelines** > **Streams**. + + 2. Select the stream you want to delete. + + 3. In the **Settings** tab, navigate to **General**, and select **Delete**. + + + +### Wrangler CLI + +To delete a stream, run the [`pipelines streams delete`](/workers/wrangler/commands/#pipelines-streams-delete) command: + +```bash +npx wrangler pipelines streams delete +``` + +:::caution +Deleting a stream will permanently remove all buffered events that have not been processed and will delete any dependent pipelines. Ensure all data has been delivered to your sink before deletion. 
+::: diff --git a/src/content/docs/pipelines/streams/writing-to-streams.mdx b/src/content/docs/pipelines/streams/writing-to-streams.mdx new file mode 100644 index 000000000000000..f5ee84d248a0c0c --- /dev/null +++ b/src/content/docs/pipelines/streams/writing-to-streams.mdx @@ -0,0 +1,114 @@ +--- +pcx_content_type: configuration +title: Writing to streams +description: Send data to streams via Worker bindings or HTTP endpoints +sidebar: + order: 2 +--- + +import { Tabs, TabItem, WranglerConfig, TypeScriptExample } from "~/components"; + +Send events to streams using [Worker bindings](/workers/runtime-apis/bindings/) or HTTP endpoints for client-side applications and external systems. + +## Send via Workers + +Worker bindings provide a secure way to send data to streams from [Workers](/workers/) without managing API tokens or credentials. + +### Configure pipeline binding + +Add a pipeline binding to your Wrangler file that points to your stream: + + +```toml +[[pipelines]] +pipeline = "" +binding = "STREAM" +``` + + +### Workers API + +The pipeline binding exposes a method for sending data to your stream: + +#### `send(records)` + +Sends an array of JSON-serializable records to the stream. Returns a Promise that resolves when records are confirmed as ingested. + + +```typescript +export default { + async fetch(request, env, ctx): Promise { + const event = { + user_id: "12345", + event_type: "purchase", + product_id: "widget-001", + amount: 29.99 + }; + + await env.STREAM.send([event]); + + return new Response('Event sent'); + }, + +} satisfies ExportedHandler; + +``` + + +## Send via HTTP + +Each stream provides an optional HTTP endpoint for ingesting data from external applications, browsers, or any system that can make HTTP requests. + +### Endpoint format + +HTTP endpoints follow this format: +``` + +https://{stream-id}.ingest.cloudflare.com + +```` + +Find your stream's endpoint URL in the Cloudflare dashboard under **Pipelines** > **Streams** or using the Wrangler CLI: + +```bash +npx wrangler pipelines streams get +```` + +### Making requests + +Send events as JSON arrays via POST requests: + +```bash +curl -X POST https://{stream-id}.ingest.cloudflare.com \ + -H "Content-Type: application/json" \ + -d '[ + { + "user_id": "12345", + "event_type": "purchase", + "product_id": "widget-001", + "amount": 29.99 + } + ]' +``` + +### Authentication + +When authentication is enabled for your stream, include the API token in the `Authorization` header: + +```bash +curl -X POST https://{stream-id}.ingest.cloudflare.com \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer YOUR_API_TOKEN" \ + -d '[{"event": "test"}]' +``` + +The API token must have **Workers Pipeline Send** permission. To learn more, refer to the [Create API token](/fundamentals/api/get-started/create-token/) documentation. + +## Schema validation + +Streams handle validation differently based on their configuration: + +- **Structured streams**: Events must match the defined schema fields and types. +- **Unstructured streams**: Accept any valid JSON structure. Data is stored in a single `value` column. + +For structured streams, ensure your events match the schema definition. Invalid events will be accepted but dropped, so validate your data before sending to avoid dropped events. 
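+
+As a rough illustration of the difference downstream (the stream names `events` and `raw_events` are placeholders, not part of the API), a SQL transformation sees schema fields as named columns on a structured stream, but a single `value` column on an unstructured one:
+
+```sql
+-- Structured stream: schema fields become named columns
+SELECT user_id, event_type, amount FROM events
+
+-- Unstructured stream: each event arrives in a single `value` column
+SELECT value FROM raw_events
+```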
diff --git a/src/content/docs/pipelines/tutorials/index.mdx b/src/content/docs/pipelines/tutorials/index.mdx deleted file mode 100644 index 6b97f82a78b8477..000000000000000 --- a/src/content/docs/pipelines/tutorials/index.mdx +++ /dev/null @@ -1,13 +0,0 @@ ---- -hideChildren: true -pcx_content_type: navigation -title: Tutorials -sidebar: - order: 8 ---- - -import { GlossaryTooltip, ListTutorials } from "~/components"; - -View tutorials to help you get started with Pipelines. - - diff --git a/src/content/docs/pipelines/tutorials/query-data-with-motherduck/index.mdx b/src/content/docs/pipelines/tutorials/query-data-with-motherduck/index.mdx deleted file mode 100644 index dca2b2a9c30a4e5..000000000000000 --- a/src/content/docs/pipelines/tutorials/query-data-with-motherduck/index.mdx +++ /dev/null @@ -1,492 +0,0 @@ ---- -reviewed: 2025-04-09 -difficulty: Intermediate -content_type: πŸ“ Tutorial -pcx_content_type: tutorial -title: Ingest data from a Worker, and analyze using MotherDuck -products: - - R2 - - Workers -tags: - - MotherDuck - - SQL ---- - -import { Render, PackageManagers, Details, WranglerConfig } from "~/components"; - -In this tutorial, you will learn how to ingest clickstream data to a [R2 bucket](/r2) using Pipelines. You will use the Pipeline binding to send the clickstream data to the R2 bucket from your Worker. You will also learn how to connect the bucket to MotherDuck. You will then query the data using MotherDuck. - -For this tutorial, you will build a landing page of an e-commerce website. A user can click on the view button to view the product details or click on the add to cart button to add the product to their cart. - -## Prerequisites - -1. A [MotherDuck](https://motherduck.com/) account. -2. Install [`Node.js`](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm). - -
- Use a Node version manager like [Volta](https://volta.sh/) or - [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and change - Node.js versions. [Wrangler](/workers/wrangler/install-and-update/), discussed - later in this guide, requires a Node version of `16.17.0` or later. -
- -## 1. Create a new project - -You will create a new Worker project that will use [Static Assets](/workers/static-assets/) to serve the HTML file. - -Create a new Worker project by running the following commands: - - - - - -Navigate to the `e-commerce-pipelines` directory: - -```sh frame="none" -cd e-commerce-pipelines -``` - -## 2. Update the frontend - -Using Static Assets, you can serve the frontend of your application from your Worker. The above step creates a new Worker project with a default `public/index.html` file. Update the `public/index.html` file with the following HTML code: - -
- Select to view the HTML code
-```html
-<!-- index.html: a Tailwind-styled "Our Products" landing page; each product card has a
-     "View Details" and an "Add to Cart" button, both calling handleClick with the action
-     and the product ID -->
-```
- -The above code does the following: - -- Uses Tailwind CSS to style the page. -- Renders a list of products. -- Adds a button to view the details of a product. -- Adds a button to add a product to the cart. -- Contains a `handleClick` function to handle the click events. This function logs the action and the product ID. In the next steps, you will add the logic to send the click events to your pipeline. - -## 3. Generate clickstream data - -You need to send clickstream data like the `timestamp`, `user_id`, `session_id`, and `device_info` to your pipeline. You can generate this data on the client side. Add the following function in the ` - - - -
-<!-- index.html: a Tailwind-styled "Our Products" landing page; each product card has a
-     "View Details" and an "Add to Cart" button, both calling handleClick with the action
-     and the product ID -->
- - - - - -``` - - -The above code does the following: - -- Uses Tailwind CSS to style the page. -- Renders a list of products. -- Adds a button to view the details of a product. -- Adds a button to add a product to the cart. -- Contains a `handleClick` function to handle the click events. This function logs the action and the product ID. In the next steps, you will create a pipeline and add the logic to send the click events to this pipeline. - -## 3. Create an R2 Bucket -We'll create a new R2 bucket to use as the sink for our pipeline. Create a new r2 bucket `clickstream-bucket` using the [Wrangler CLI](/workers/wrangler/). Open a terminal window, and run the following command: - -```sh -npx wrangler r2 bucket create clickstream-bucket -``` - -## 4. Create a pipeline -You need to create a new pipeline and connect it to your R2 bucket. - -Create a new pipeline `clickstream-pipeline-client` using the [Wrangler CLI](/workers/wrangler/). Open a terminal window, and run the following command: - -```sh -npx wrangler pipelines create clickstream-pipeline-client --r2-bucket clickstream-bucket --compression none --batch-max-seconds 5 -``` - -When you run the command, you will be prompted to authorize Cloudflare Workers Pipelines to create R2 API tokens on your behalf. These tokens are required by your Pipeline. Your Pipeline uses these tokens when loading data into your bucket. You can approve the request through the browser link which will open automatically. - -:::note -The above command creates a pipeline using two optional flags: `--compression none`, and `--batch-max-seconds 5`. - -With these flags, your pipeline will deliver an uncompressed file of data to your R2 bucket every 5 seconds. - -These flags are useful for testing, but we recommend keeping the default settings in a production environment. -::: - -```txt output -βœ… Successfully created Pipeline "clickstream-pipeline-client" with ID - -Id: -Name: clickstream-pipeline-client -Sources: - HTTP: - Endpoint: https://.pipelines.cloudflare.com - Authentication: off - Format: JSON - Worker: - Format: JSON -Destination: - Type: R2 - Bucket: clickstream-bucket - Format: newline-delimited JSON - Compression: NONE - Batch hints: - Max bytes: 100 MB - Max duration: 300 seconds - Max records: 10,000,000 - -πŸŽ‰ You can now send data to your Pipeline! - -Send data to your Pipeline's HTTP endpoint: - -curl "https://.pipelines.cloudflare.com" -d '[{"foo": "bar"}]' -``` - -Make a note of the URL of the pipeline. You will use this URL to send the clickstream data from the client-side. - -## 5. Generate clickstream data - -You need to send clickstream data like the `timestamp`, `user_id`, `session_id`, and `device_info` to your pipeline. You can generate this data on the client side. Add the following function in the `