From cb03a30ede53a3ef3d93eaea413c5587f56756d4 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Thu, 18 Sep 2025 15:20:16 -0700 Subject: [PATCH 01/30] initial docs --- .../docs/r2/sql/end-to-end-pipeline.mdx | 364 ++++++++++++++++++ src/content/docs/r2/sql/index.mdx | 21 + .../platform/limitations-best-practices.mdx | 212 ++++++++++ src/content/docs/r2/sql/platform/pricing.mdx | 0 .../docs/r2/sql/platform/sql-reference.mdx | 251 ++++++++++++ src/content/docs/r2/sql/troubleshooting.mdx | 308 +++++++++++++++ 6 files changed, 1156 insertions(+) create mode 100644 src/content/docs/r2/sql/end-to-end-pipeline.mdx create mode 100644 src/content/docs/r2/sql/index.mdx create mode 100644 src/content/docs/r2/sql/platform/limitations-best-practices.mdx create mode 100644 src/content/docs/r2/sql/platform/pricing.mdx create mode 100644 src/content/docs/r2/sql/platform/sql-reference.mdx create mode 100644 src/content/docs/r2/sql/troubleshooting.mdx diff --git a/src/content/docs/r2/sql/end-to-end-pipeline.mdx b/src/content/docs/r2/sql/end-to-end-pipeline.mdx new file mode 100644 index 00000000000000..455d5ed5fd0a36 --- /dev/null +++ b/src/content/docs/r2/sql/end-to-end-pipeline.mdx @@ -0,0 +1,364 @@ +--- +title: Build a fraud detection pipeline with Cloudflare Pipelines and R2 SQL +summary: Learn how to create an end-to-end data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL for real-time transaction analysis. +pcx_content_type: tutorial +products: + - R2 + - R2 Data Catalog + - R2 SQL +--- + + +# Build a fraud detection pipeline with the Cloudflare Data Platform + +In this guide, you will learn how to build a complete data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL. This also includes a sample Python script that creates and sends financial transaction data to your Pipeline that can be queried by R2 SQL or any Apache Iceberg-compatible query engine. + +This tutorial demonstrates how to: +- Set up R2 Data Catalog to store our transaction events in an Apache Iceberg table +- Set up a Cloudflare Pipeline +- Create transaction data with fraud patterns to send to your Pipeline +- Query your data using R2 SQL for fraud analysis + + +## Prerequisites + +1. Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up). +2. Install [Node.js](https://nodejs.org/en/). +3. Install [Python 3.8+](https://python.org) for the data generation script. + +:::note[Node.js version manager] +Use a Node version manager like [Volta](https://volta.sh/) or [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and change Node.js versions. Wrangler requires a Node version of 16.17.0 or later. +::: + +## 1. Set up authentication + +You'll need API tokens to interact with Cloudflare services. + +### Custom API Token +1. Go to **My Profile** → **API Tokens** in the Cloudflare dashboard +2. Select **Create Token** → **Custom token** +3. Add the following permissions: + - **Workers R2 Storage** - Edit, Read + - **Workers R2 Data Catalog** - Edit, Read + - **Workers R2 SQL** - Read + - **Workers R2 SQL** - Read, Send, Edit + +Export your new token as an environment variable: + +```bash +export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here +``` + +If this is your first time using Wrangler, make sure to login. +```bash +npx wrangler login +``` + +## 2. Create an R2 bucket + +Create a new R2 bucket to store your fraud detection data: + +```bash +npx wrangler r2 bucket create fraud-detection-data +``` + +## 3. 
Enable R2 Data Catalog + +Enable the Data Catalog feature on your bucket to use Apache Iceberg tables: + +```bash +npx wrangler r2 bucket catalog enable fraud-detection-data +``` +:::note +Make sure to save the Warehouse for use later in this guide +::: + +### Optional - Enable compaction on your R2 Data Catalog +R2 Data Catalog can automatically compact tables for you. In production event streaming use cases, it's common to end up with many small files so it's recommended to enable compaction. Since this is a sample use case, this is optional. +```bash +npx wrangler r2 bucket catalog compaction enable fraud-detection-data --token $WRANGLER_R2_SQL_AUTH_TOKEN +``` + +## 4. Set up the pipeline infrastructure + +### Create the Pipeline stream + +Create a stream to receive incoming fraud detection events: + +```bash +npx wrangler pipelines streams create fraud-transactions \ + --schema '{ + "fields": [ + {"name": "transaction_id", "type": "string", "required": true}, + {"name": "user_id", "type": "int64", "required": true}, + {"name": "amount", "type": "f64", "required": false}, + {"name": "transaction_timestamp", "type": "string", "required": false}, + {"name": "location", "type": "string", "required": false}, + {"name": "merchant_category", "type": "string", "required": false}, + {"name": "is_fraud", "type": "string", "required": false}, + {"name": "ingestion_timestamp", "type": "string", "required": false} + ] + }' \ + --http-enabled true \ + --http-auth true +``` +:::note +After running the `stream create` command, note the **Stream Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline. +::: + +### Create the data sink + +Create a sink that writes data to your R2 bucket as Apache Iceberg tables: + +```bash +npx wrangler pipelines sinks create fraud-data-sink \ + --type "r2-data-catalog" \ + --bucket "fraud-detection-data" \ + --roll-interval 30 \ + --namespace "fraud_detection" \ + --table "transactions" \ + --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN +``` + +:::note +This creates a `sink` configuration that will write to the Iceberg table fraud_detection.transactions every 30 seconds. Pipelines automatically appends an `__ingest_ts` column that is used to partion the table by `DAY` +::: + +### Create the pipeline + +Connect your stream to your sink with SQL: + +```bash +npx wrangler pipelines create fraud-pipeline \ + --sql "INSERT INTO fraud-data-sink SELECT * FROM fraud-transactions" +``` + +## 5. 
Generate fraud detection data + +Create a Python script to generate realistic transaction data with fraud patterns: + +```python title="fraud_data_generator.py" +import requests +import json +import uuid +import random +import time +from datetime import datetime, timezone, timedelta + +# Configuration +STREAM_ENDPOINT = "https://YOUR_STREAM_ID.ingest.cloudflare.com" # From the stream you created +API_TOKEN = "WRANGLER_R2_SQL_AUTH_TOKEN" #the same one created earlier +EVENTS_TO_SEND = 1000 # Feel free to adjust this + +def generate_transaction(): + """Generate some transactions with occassional fraud patterns""" + + # User IDs + high_risk_users = [1001, 1002, 1003, 1004, 1005] + normal_users = list(range(1006, 2000)) + + user_id = random.choice(high_risk_users + normal_users) + is_high_risk_user = user_id in high_risk_users + + # Generate amount + if random.random() < 0.05: + amount = round(random.uniform(5000, 50000), 2) + elif random.random() < 0.03: + amount = round(random.uniform(0.01, 1.00), 2) + else: + amount = round(random.uniform(10, 500), 2) + + # Locations + normal_locations = ["NEW_YORK", "LOS_ANGELES", "CHICAGO", "MIAMI", "SEATTLE"] + high_risk_locations = ["UNKNOWN_LOCATION", "VPN_EXIT", "BELARUS", "NIGERIA"] + + if is_high_risk_user and random.random() < 0.3: + location = random.choice(high_risk_locations) + else: + location = random.choice(normal_locations) + + # Merchant categories + normal_merchants = ["GROCERY", "GAS_STATION", "RESTAURANT", "RETAIL"] + high_risk_merchants = ["GAMBLING", "CRYPTO", "MONEY_TRANSFER", "GIFT_CARDS"] + + if random.random() < 0.1: # 10% high-risk merchants + merchant_category = random.choice(high_risk_merchants) + else: + merchant_category = random.choice(normal_merchants) + + # Determine if transaction is fraudulent based on basic risk factors + fraud_score = 0 + if amount > 2000: fraud_score += 0.4 + if amount < 1: fraud_score += 0.3 + if location in high_risk_locations: fraud_score += 0.5 + if merchant_category in high_risk_merchants: fraud_score += 0.3 + if is_high_risk_user: fraud_score += 0.2 + + # Compare the fraud score + is_fraud = random.random() < min(fraud_score * 0.3, 0.8) + + # Generate timestamps (some fraud happens at unusual hours) + base_time = datetime.now(timezone.utc) + if is_fraud and random.random() < 0.4: # 40% of fraud at night + hour = random.randint(0, 5) # Late night/early morning + transaction_time = base_time.replace(hour=hour) + else: + transaction_time = base_time - timedelta( + hours=random.randint(0, 168) # Last week + ) + + return { + "transaction_id": str(uuid.uuid4()), + "user_id": user_id, + "amount": amount, + "transaction_timestamp": transaction_time.isoformat(), + "location": location, + "merchant_category": merchant_category, + "is_fraud": "TRUE" if is_fraud else "FALSE", + "ingestion_timestamp": datetime.now(timezone.utc).isoformat() + } + +def send_batch_to_stream(events, batch_size=100): + """Send events to Cloudflare Stream in batches""" + + headers = { + "Authorization": f"Bearer {API_TOKEN}", + "Content-Type": "application/json" + } + + total_sent = 0 + fraud_count = 0 + + for i in range(0, len(events), batch_size): + batch = events[i:i + batch_size] + fraud_in_batch = sum(1 for event in batch if event["is_fraud"] == "TRUE") + + try: + response = requests.post(STREAM_ENDPOINT, headers=headers, json=batch) + + if response.status_code in [200, 201]: + total_sent += len(batch) + fraud_count += fraud_in_batch + print(f"✅ Sent batch of {len(batch)} events (Total: {total_sent})") + else: + print(f"❌ 
Failed to send batch: {response.status_code} - {response.text}") + + except Exception as e: + print(f"❌ Error sending batch: {e}") + + # Small delay between batches + time.sleep(0.1) + + return total_sent, fraud_count + +def main(): + print("Generating fraud detection data...") + + # Generate events + events = [] + for i in range(EVENTS_TO_SEND): + events.append(generate_transaction()) + if (i + 1) % 100 == 0: + print(f"Generated {i + 1} events...") + + fraud_events = sum(1 for event in events if event["is_fraud"] == "TRUE") + print(f"📊 Generated {len(events)} total events ({fraud_events} fraud, {fraud_events/len(events)*100:.1f}%)") + + # Send to stream + print("📤 Sending data to Cloudflare Stream...") + sent, fraud_sent = send_batch_to_stream(events) + + print(f"\n🎉 Complete!") + print(f" Events sent: {sent:,}") + print(f" Fraud events: {fraud_sent:,} ({fraud_sent/sent*100:.1f}%)") + print(f" Data is now flowing through your pipeline!") + +if __name__ == "__main__": + main() +``` + +Update the configuration variables in the script: +- Replace `YOUR_STREAM_ID` with your actual stream endpoint from step 4 +- Replace `YOUR_API_TOKEN` with your Cloudflare API token + +Install the required Python dependency and run the script: + +```bash +pip install requests +python fraud_data_generator.py +``` + +## 6. Query your fraud data with R2 SQL + +Now you can analyze your fraud detection data using R2 SQL. Here are some example queries: + +### View recent transactions + +```bash +npx wrangler r2 sql query "YOUR_WAREHOUSE" " +SELECT + transaction_id, + user_id, + amount, + location, + merchant_category, + is_fraud, + transaction_timestamp +FROM fraud_detection.transactions +WHERE __ingest_ts > '2025-09-12T01:00:00Z' +AND is_fruad = 'TRUE' +LIMIT 10" +``` +:::note +Replace `YOUR_WAREHOUSE` with your R2 Data Catalog warehouse. This in the form of `{YOUR_ACCOUNT_ID}_{BUCKET_NAME}`. This can be found in the dash under the settings in your bucket. Adjust the `__ingest_ts` date in the query as needed. +::: + +### Let's filter the raw transactions into a new table to highlight high-value transactions + +Create a new sink that will write the filtered data to a new Apache Iceberg table in R2 Data Catalog: + +```bash +npx wrangler pipelines sink create filtered-fraud-sink \ + --type "r2-data-catalog" \ + --bucket "fraud-detection-data" \ + --roll-interval 30 \ + --namespace "fraud_detection" \ + --table "fraud_transactions" \ + --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN +``` + +Now you'll create a new SQL query to process data from the original `fraud-transactions` stream and only write flagged transactions that are over the `amount` of 1000. + +```bash +npx wrangler pipelines create fraud-pipeline \ + --sql "INSERT INTO filtered-fraud-sink SELECT * FROM fraud-transactions WHERE is_fraud='TRUE' and amount > 1000" +``` + +:::note +It may take a few minutes for the new Pipeline to fully Initialize and start processing the data. Also keep in mind the 30 second `roll-interval` +::: + +Let's query our table and check the results: +```bash +npx wrangler r2 sql query " +SELECT + transaction_id, + user_id, + amount, + location, + merchant_category, + is_fraud, + transaction_timestamp +FROM fraud_detection.fraud_transactions +WHERE __ingest_ts > '2025-09-12T01:00:00Z' +LIMIT 10" +``` + +## Conclusion + +You have successfully built an end to end data pipeline using Cloudflare's data platform. Through this tutorial, you've learned to: + +1. 
**Use R2 Data Catalog** - Leveraged Apache Iceberg tables for efficient data storage +2. **Set up Cloudflare Pipelines** - Created streams, sinks, and pipelines for data ingestion +3. **Generated sample data** - Created transaction data with basic fraud patterns +4. **Query with R2 SQL** - Performed complex fraud analysis using SQL queries diff --git a/src/content/docs/r2/sql/index.mdx b/src/content/docs/r2/sql/index.mdx new file mode 100644 index 00000000000000..3aeaa93c32873d --- /dev/null +++ b/src/content/docs/r2/sql/index.mdx @@ -0,0 +1,21 @@ +--- +pcx_content_type: navigation +title: R2 SQL +sidebar: + order: 7 + group: + badge: Beta +head: [] +description: Query your R2 Data Catalog tables with R2 SQL. +--- + +## Efficiently Query Apache Iceberg tables in R2 Data Catalog Using R2 SQL. + + +:::note +R2 SQL is in public beta, and any developer with an R2 subscription can start using it. Currently, outside of standard R2 storage and operations, you will not be billed for your use of R2 SQL +::: + +R2 SQL is Cloudflare's serverless, distributed, analytics query engine for querying Apache Iceberg tables stored in [R2 data catalog](https://developers.cloudflare.com/r2/data-catalog/). R2 SQL is designed to efficiently query large amounts of data by automatically utilizing file pruning, Cloudflare's distributed compute, and R2 object storage. + +Query your first table in R2 SQL by following the Get Started guide, learn how to create a data pipeline that takes a stream of events and automatically creates an Apache Iceberg table, making them accessible with R2 SQL. \ No newline at end of file diff --git a/src/content/docs/r2/sql/platform/limitations-best-practices.mdx b/src/content/docs/r2/sql/platform/limitations-best-practices.mdx new file mode 100644 index 00000000000000..da626501f8ff63 --- /dev/null +++ b/src/content/docs/r2/sql/platform/limitations-best-practices.mdx @@ -0,0 +1,212 @@ +--- +title: Limitations and Best Practices +pcx_content_type: concept +tags: + - SQL +sidebar: + order: 5 + +--- + +# R2 SQL Limitations and Best Practices + +## Overview + +R2 SQL is in public beta, limitations and best practices will change over time. + +R2 SQL is designed for querying **partitioned** Apache Iceberg tables in your R2 data catalog. This document outlines the supported features, limitations, and best practices of R2 SQL. + + +## Quick Reference + +| Feature | Supported | Notes | +| :---- | :---- | :---- | +| Basic SELECT | Yes | Columns, \*, aliases | +| SQL Functions | No | No COUNT, AVG, etc. | +| Single table FROM | Yes | With aliasing | +| JOINs | No | No table joins | +| WHERE with time | Yes | Required | +| Array filtering | No | No array type support | +| JSON filtering | No | No nested object queries | +| Simple LIMIT | Yes | 1-10,000 range | +| ORDER BY | Yes | Only on partition key | +| GROUP BY | No | Not supported | + +## Supported SQL Clauses + +R2 SQL supports a limited set of SQL clauses: `SELECT`, `FROM`, `WHERE`, and `LIMIT`. All other SQL clauses are not supported at the moment. New features will release often, keep an eye on this page and the changelog\[LINK TO CHANGE LOG\] for the latest. 
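
To make this concrete, here is a minimal sketch of a query that stays within the supported grammar. The `fraud_detection.transactions` table and its columns are illustrative names taken from the pipeline tutorial, and `__ingest_ts` is assumed to be the table's time-based partition column (the column Pipelines appends automatically):

```sql
-- Single table, simple filters (including one on the time-based partition column),
-- and a bounded LIMIT
SELECT transaction_id, user_id, amount, is_fraud
FROM fraud_detection.transactions
WHERE __ingest_ts > '2025-01-01T00:00:00Z'
  AND amount > 1000
LIMIT 100
```

Each clause, along with what it does and does not support, is covered in the sections below.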
+ +--- + +## SELECT Clause + +### Supported Features + +- **Individual columns**: `SELECT column1, column2` +- **All columns**: `SELECT *` + +### Limitations + +- **No JSON field querying**: Cannot query individual fields from JSON objects +- **No SQL functions**: Functions like `AVG()`, `COUNT()`, `MAX()`, `MIN()`, quantiles are not supported +- **No synthetic data**: Cannot create synthetic columns like `SELECT 1 AS what, "hello" AS greeting` +- **Field aliasing**: `SELECT field AS another_name` + + +### Examples + +```sql +-- Valid +SELECT timestamp, user_id, status +SELECT * + +-- Invalid +SELECT user_id AS uid, timestamp AS ts +SELECT COUNT(*) FROM events +SELECT json_field.property FROM table +SELECT 1 AS synthetic_column +``` + +--- + +## FROM Clause + +### Supported Features + +- **Single table queries**: `SELECT * FROM table_name` + +### Limitations + +- **No multiple tables**: Cannot specify multiple tables in FROM clause +- **No subqueries**: `SELECT ... FROM (SELECT ...)` is not supported +- **No JOINs**: No INNER, LEFT, RIGHT, or FULL JOINs +- **No SQL functions**: Cannot use functions like `read_parquet()` +- **No synthetic tables**: Cannot create tables from values +- **No schema evolution**: Schema cannot be altered (no ALTER TABLE, migrations) +- **Immutable datasets**: No UPDATE or DELETE operations allowed +- **Fully defined schema**: Dynamic or union-type fields are not supported +- **Table aliasing**: `SELECT * FROM table_name AS alias` + +### Examples + +```sql +--Valid +SELECT * FROM http_requests + +--Invalid +SELECT * FROM table1, table2 +SELECT * FROM table1 JOIN table2 ON table1.id = table2.id +SELECT * FROM (SELECT * FROM events WHERE status = 200) +``` + +--- + +## WHERE Clause + +### Supported Features + +- **Time filtering**: Queries should include a time filter +- **Simple type filtering**: Supports `string`, `boolean`, and `number` types +- **Boolean logic**: Supports `AND`, `OR`, `NOT` operators +- **Comparison operators**: `>`, `>=`, `=`, `<`, `<=`, `!=` +- **Grouped conditions**: `WHERE col_a="hello" AND (col_b>5 OR col_c != 3)` +- **Pattern mating:** `WHERE col_a LIKE ‘%hello w%’` +- **NULL Handling:** `WHERE col_a IS NOT NULL` + +### Limitations + +- **No column-to-column comparisons**: Cannot use `WHERE col_a = col_b` +- **No array filtering**: Cannot filter on array types (array\[number\], array\[string\], array\[boolean\]) +- **No JSON/object filtering**: Cannot filter on fields inside nested objects or JSON +- **No SQL functions**: No function calls in WHERE clause +- **No arithmetic operators**: Cannot use `+`, `-`, `*`, `/` in conditions + +### Examples + +```sql +--Valid +SELECT * FROM events WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02' +SELECT * FROM logs WHERE status = 200 AND user_type = 'premium' +SELECT * FROM requests WHERE (method = 'GET' OR method = 'POST') AND response_time < 1000 + +--Invalid +SELECT * FROM events -- Missing time filter +SELECT * FROM logs WHERE tags[0] = 'error' -- Array filtering +SELECT * FROM requests WHERE metadata.user_id = '123' -- JSON field filtering +SELECT * FROM events WHERE col_a = col_b -- Column comparison +SELECT * FROM logs WHERE response_time + latency > 5000 -- Arithmetic +``` + +--- + +## ORDER BY Clause + +### Supported Features + +- **ASC**: Ascending order (Default) +- **DESC**: Descending order + +### Limitations + +- **Non-partition keys not supported**: `ORDER BY` on columns other than the partition key is not supported + +### Examples + +```sql +SELECT * FROM table_name WHERE ... 
ORDER BY partitionKey +SELECT * FROM table_name WHERE ... ORDER BY partitionKey DESC +SELECT * FROM table_name WHERE ... ORDER BY partitionKey DESC +``` + +--- + +## LIMIT Clause + +### Supported Features + +- **Simple limits**: `LIMIT number` +- **Range**: Minimum 1, maximum 10,000 + +### Limitations + +- **No pagination**: `LIMIT offset, count` syntax not supported +- **No SQL functions**: Cannot use functions to determine limit +- **No arithmetic**: Cannot use expressions like `LIMIT 10 * 50` + +### Examples + +```sql +-- Valid +SELECT * FROM events WHERE ... LIMIT 100 +SELECT * FROM logs WHERE ... LIMIT 10000 + +-- Invalid +SELECT * FROM events LIMIT 100, 50 -- Pagination +SELECT * FROM logs LIMIT COUNT(*) / 2 -- Functions +SELECT * FROM events LIMIT 10 * 10 -- Arithmetic +``` + +--- + +## Unsupported SQL Clauses + +The following SQL clauses are **not supported**: + +- `GROUP BY` +- `HAVING` +- `UNION`/`INTERSECT`/`EXCEPT` +- `WITH` (Common Table Expressions) +- `WINDOW` functions +- `INSERT`/`UPDATE`/`DELETE` +- `CREATE`/`ALTER`/`DROP` + +--- + +## Best Practices + +1. **Always include time filters** in your WHERE clause to ensure efficient queries +2. **Use specific column selection** instead of `SELECT *` when possible for better performance +3. **Structure your data** to avoid nested JSON objects if you need to filter on those fields + +--- + diff --git a/src/content/docs/r2/sql/platform/pricing.mdx b/src/content/docs/r2/sql/platform/pricing.mdx new file mode 100644 index 00000000000000..e69de29bb2d1d6 diff --git a/src/content/docs/r2/sql/platform/sql-reference.mdx b/src/content/docs/r2/sql/platform/sql-reference.mdx new file mode 100644 index 00000000000000..d020aac3b378d4 --- /dev/null +++ b/src/content/docs/r2/sql/platform/sql-reference.mdx @@ -0,0 +1,251 @@ +--- +title: SQL Reference +pcx_content_type: concept +tags: + - SQL +sidebar: + order: 5 +--- + +# R2 SQL Language Reference + +## Overview + +R2 SQL is in public beta, supported SQL grammar will change over time. + +This reference documents the R2 SQL syntax based on the currently supported grammar in public beta. + +--- + +## Complete Query Syntax + +```sql +SELECT column_list +FROM table_name +WHERE conditions +[ORDER BY column_name [DESC, ASC]] +[LIMIT number] +``` + +--- + +## SELECT Clause + +### Syntax + +```sql +SELECT column_specification [, column_specification, ...] +``` + +### Column Specification + +- **Column name**: `column_name` +- **All columns**: `*` + +### Examples + +```sql +SELECT * FROM table_name +SELECT user_id FROM table_name +SELECT user_id, timestamp, status FROM table_name +SELECT timestamp, user_id, response_code FROM table_name +``` + +--- + +## FROM Clause + +### Syntax + +```sql +SELECT * FROM table_name +``` + +### Examples + +```sql +SELECT column_name FROM table_name +``` + +--- + +## WHERE Clause + +### Syntax + +```sql +SELECT * WHERE condition [AND|OR condition ...] 
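-- Note: a complete query still requires a FROM clause, i.e.
-- SELECT column_list FROM table_name WHERE condition [AND|OR condition ...]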
+``` + +### Conditions + +#### Null Checks + +- `column_name IS NULL` +- `column_name IS NOT NULL` + +#### Value Comparisons + +- `column_name BETWEEN value AND value` +- `column_name = value` +- `column_name >= value` +- `column_name > value` +- `column_name <= value` +- `column_name < value` +- `column_name != value` + +#### Logical Operators + +- `AND` \- Logical AND +- `OR` \- Logical OR + +### Data Types + +- **integer** \- Whole numbers +- **float** \- Decimal numbers +- **string** \- Text values (quoted) + +### Examples + +```sql +SELECT * FROM table_name WHERE timestamp BETWEEN '2025-01-01' AND '2025-01-02' +SELECT * FROM table_name WHERE status = 200 +SELECT * FROM table_name WHERE response_time > 1000 +SELECT * FROM table_name WHERE user_id IS NOT NULL +SELECT * FROM table_name WHERE method = 'GET' AND status >= 200 AND status < 300 +SELECT * FROM table_name WHERE (status = 404 OR status = 500) AND timestamp > '2024-01-01' +``` + +--- + +## ORDER BY Clause + +### Syntax + +```sql +--Note: ORDERY BY only supports ordering by the partition key +ORDER BY partition_key [DESC] +``` + +- **Default**: Ascending order (ASC) +- **DESC**: Descending order + +### Examples + +```sql +SELECT * FROM table_name WHERE ... ORDER BY partitionKey +SELECT * FROM table_name WHERE ... ORDER BY partitionKey DESC +SELECT * FROM table_name WHERE ... ORDER BY partitionKey DESC + +``` + +--- + +## LIMIT Clause + +### Syntax + +```sql +LIMIT number +``` + +- **Range**: 1 to 10,000 +- **Type**: Integer only + +### Examples + +```sql +SELECT * FROM table_name WHERE ... LIMIT 100 +``` + +--- + +## Complete Query Examples + +### Basic Query + +```sql +SELECT * +FROM http_requests +WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02' +LIMIT 100 +``` + +### Filtered Query with Sorting + +```sql +SELECT user_id, timestamp, status, response_time +FROM access_logs +WHERE status >= 400 AND response_time > 5000 +ORDER BY response_time DESC +LIMIT 50 +``` + +### Complex Conditions + +```sql +SELECT timestamp, method, status, user_agent +FROM http_requests +WHERE (method = 'POST' OR method = 'PUT') + AND status BETWEEN 200 AND 299 + AND user_agent IS NOT NULL +ORDER BY timestamp DESC +LIMIT 1000 +``` + +### Null Handling + +```sql +SELECT user_id, session_id, timestamp +FROM user_events +WHERE session_id IS NOT NULL + AND timestamp >= '2024-01-01' +ORDER BY timestamp +LIMIT 500 +``` + +--- + +## Data Type Reference + +### Supported Types + +| Type | Description | Example Values | +| :---- | :---- | :---- | +| `integer` | Whole numbers | `1`, `42`, `-10`, `0` | +| `float` | Decimal numbers | `1.5`, `3.14`, `-2.7`, `0.0` | +| `string` | Text values | `'hello'`, `'GET'`, `'2024-01-01'` | + +### Type Usage in Conditions + +```sql +-- Integer comparisons +SELECT * FROM table_name WHERE status = 200 +SELECT * FROM table_name WHERE response_time > 1000 + +-- Float comparisons +SELECT * FROM table_name WHERE cpu_usage >= 85.5 +SELECT * FROM table_name WHERE memory_ratio < 0.8 + +-- String comparisons +SELECT * FROM table_name WHERE method = 'POST' +SELECT * FROM table_name WHERE user_agent != 'bot' +SELECT * FROM table_name WHERE country_code = 'US' +``` + +--- + +## Operator Precedence + +1. **Comparison operators**: `=`, `!=`, `<`, `<=`, `>`, `>=`, `BETWEEN`, `IS NULL`, `IS NOT NULL` +2. **AND** (higher precedence) +3. 
**OR** (lower precedence) + +Use parentheses to override default precedence: + +```sql +SELECT * FROM table_name WHERE (status = 404 OR status = 500) AND method = 'GET' +``` + +--- + diff --git a/src/content/docs/r2/sql/troubleshooting.mdx b/src/content/docs/r2/sql/troubleshooting.mdx new file mode 100644 index 00000000000000..5f1a6542cc1819 --- /dev/null +++ b/src/content/docs/r2/sql/troubleshooting.mdx @@ -0,0 +1,308 @@ +--- +title: "R2 SQL Troubleshooting Guide" +pcx_content_type: concept +tags: + - SQL +sidebar: + order: 5 +--- + +# R2 SQL Troubleshooting Guide + +This guide covers potential errors and limitations you may encounter when using R2 SQL. R2 SQL is in open beta and supported functionality will evolve and change over time. + +## Query Structure Errors + +### Missing Required Clauses + +

**Error**: ``expected exactly 1 table in `FROM` clause``

**Problem**: Every R2 SQL query must include a `FROM` clause that names exactly one table.

```sql
-- Invalid - Missing FROM clause
SELECT user_id WHERE status = 200

-- Valid
SELECT user_id
FROM http_requests
WHERE status = 200 AND timestamp BETWEEN '2024-01-01' AND '2024-01-02'
```

**Solution**: Always include `FROM` in your queries.

---

## SELECT Clause Issues

### Unsupported SQL Functions
+**Error**: `Function not supported` +
+ +**Problem**: Trying to use aggregate or SQL functions in SELECT. + +```sql +-- Invalid - Aggregate functions not supported +SELECT COUNT(*) FROM events WHERE timestamp > '2024-01-01' +SELECT AVG(response_time) FROM http_requests WHERE status = 200 +SELECT MAX(timestamp) FROM logs WHERE user_id = '123' +``` + +**Solution**: Use basic column selection and handle aggregation in your application code. + +### JSON Field Access + +
+**Error**: `Cannot access nested fields` +
+ +**Problem**: Attempting to query individual fields from JSON objects. + +```sql +-- Invalid - JSON field access not supported +SELECT metadata.user_id FROM events +SELECT json_field->>'property' FROM logs + +-- Valid - Select entire JSON field +SELECT metadata FROM events +SELECT json_field FROM logs +``` + +**Solution**: Select the entire JSON column and parse it in your application. + +### Synthetic Data + +

**Error**: ``aliases (`AS`) are not supported``
+ +**Problem**: Creating synthetic columns with literal values. + +```sql +-- Invalid - Synthetic data not supported +SELECT user_id, 'active' as status, 1 as priority FROM users + +-- Valid +SELECT user_id, status, priority FROM users WHERE status = 'active' +``` + +**Solution**: Add the required data to your table schema or handle it in post-processing. + +--- + +## FROM Clause Issues + +### Multiple Tables + +
+**Error**: `Multiple tables not supported` or `JOIN operations not allowed` +
+ +**Problem**: Attempting to query multiple tables or use JOINs. + +```sql +-- Invalid - Multiple tables not supported +SELECT a.*, b.* FROM table1 a, table2 b WHERE a.id = b.id +SELECT * FROM events JOIN users ON events.user_id = users.id + +-- Valid - Separate queries +SELECT * FROM table1 WHERE id IN ('id1', 'id2', 'id3') +-- Then in application code, query table2 separately +SELECT * FROM table2 WHERE id IN ('id1', 'id2', 'id3') +``` + +**Solution**: +- Denormalize your data by including necessary fields in a single table +- Perform multiple queries and join data in your application +- Restructure your data model to avoid cross-table queries + +### Subqueries + +

**Error**: ``only table name is supported in `FROM` clause``
+ +**Problem**: Using subqueries in FROM clause. + +```sql +-- Invalid - Subqueries not supported +SELECT * FROM (SELECT user_id FROM events WHERE status = 200) as active_users + +-- Valid - Use direct query with appropriate filters +SELECT user_id FROM events WHERE status = 200 +``` + +**Solution**: Flatten your query logic or use multiple sequential queries. + +--- + +## WHERE Clause Issues + +### Array Filtering + +
+**Error**: `This feature is not implemented: GetFieldAccess` +
+ +**Problem**: Attempting to filter on array fields. + +```sql +-- Invalid - Array filtering not supported +SELECT * FROM logs WHERE tags[0] = 'error' +SELECT * FROM events WHERE 'admin' = ANY(roles) + +-- Valid alternatives - denormalize or use string contains +SELECT * FROM logs WHERE tags_string LIKE '%error%' +-- Or restructure data to avoid arrays +``` + +**Solution**: +- Denormalize array data into separate columns +- Use string concatenation of array values for pattern matching +- Restructure your schema to avoid array types + +### JSON Object Filtering + +
+**Error**: `unsupported binary operator` or `Error during planning: could not parse compound` +
+ +**Problem**: Filtering on fields inside JSON objects. + +```sql +-- Invalid - JSON field filtering not supported +SELECT * FROM requests WHERE metadata.country = 'US' +SELECT * FROM logs WHERE json_data->>'level' = 'error' + +-- Valid alternatives +SELECT * FROM requests WHERE country = 'US' -- If denormalized +-- Or filter entire JSON field and parse in application +SELECT * FROM logs WHERE json_data IS NOT NULL +``` + +**Solution**: +- Denormalize frequently queried JSON fields into separate columns +- Filter on the entire JSON field and handle parsing in your application + +### Column Comparisons + +
+**Error**: `right argument to a binary expression must be a literal` +
+ +**Problem**: Comparing one column to another in WHERE clause. + +```sql +-- Invalid - Column comparisons not supported +SELECT * FROM events WHERE start_time < end_time +SELECT * FROM logs WHERE request_size > response_size + +-- Valid - Use computed columns or application logic +-- Add a computed column 'duration' to your schema +SELECT * FROM events WHERE duration > 0 +``` + +**Solution**: +- Pre-compute comparisons and store as separate columns +- Handle comparisons in your application layer +- Restructure your data model + +--- + +## LIMIT Clause Issues + +### Invalid Limit Values + +
+**Error**: `maximum LIMIT is 10000` +
+ +**Problem**: Using invalid LIMIT values. + +```sql +-- Invalid - Out of range limits +SELECT * FROM events LIMIT 50000 -- Maximum is 10,000 + +-- Valid +SELECT * FROM events LIMIT 1 +SELECT * FROM events LIMIT 10000 +``` + +**Solution**: Use LIMIT values between 1 and 10,000. + +### Pagination Attempts + +
+**Error**: `OFFSET not supported` +
+ +**Problem**: Attempting to use pagination syntax. + +```sql +-- Invalid - Pagination not supported +SELECT * FROM events LIMIT 100 OFFSET 200 +SELECT * FROM events LIMIT 100, 100 + +-- Valid alternatives - Use ORDER BY with conditional filters +-- Page 1 +SELECT * FROM events WHERE timestamp >= '2024-01-01' ORDER BY timestamp LIMIT 100 + +-- Page 2 - Use last timestamp from previous page +SELECT * FROM events WHERE timestamp > '2024-01-01T10:30:00Z' ORDER BY timestamp LIMIT 100 +``` + +**Solution**: Implement cursor-based pagination using ORDER BY and WHERE conditions. + +--- + +## Schema Issues + +### Dynamic Schema Changes + +

**Error**: `invalid SQL: only top-level SELECT clause is supported`
+ +**Problem**: Attempting to modify table schema or reference non-existent columns. + +```sql +-- Invalid - Schema changes not supported +ALTER TABLE events ADD COLUMN new_field STRING +UPDATE events SET status = 200 WHERE user_id = '123' +``` + +**Solution**: +- Plan your schema carefully before data ingestion +- Contact your data engineering team for schema changes +- Ensure all column names exist in your current schema + +--- + +## Performance Optimization + +### Query Performance Issues + +If your queries are running slowly: + +1. **Always include partition (timestamp) filters**: This is the most important optimization + ```sql + -- Good + WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02' + ``` + +2. **Use selective filtering**: Include specific conditions to reduce result sets + ```sql + -- Good + WHERE status = 200 AND country = 'US' AND timestamp > '2024-01-01' + ``` + +3. **Limit result size**: Use appropriate LIMIT values + ```sql + -- Good for exploration + SELECT * FROM events WHERE timestamp > '2024-01-01' LIMIT 100 + ``` + From 1cbebacf464c99431a9d3dcb98fabd49c1e005ba Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Thu, 18 Sep 2025 15:28:16 -0700 Subject: [PATCH 02/30] fixed link in index --- src/content/docs/r2/sql/index.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/content/docs/r2/sql/index.mdx b/src/content/docs/r2/sql/index.mdx index 3aeaa93c32873d..d37f584fdf50e8 100644 --- a/src/content/docs/r2/sql/index.mdx +++ b/src/content/docs/r2/sql/index.mdx @@ -18,4 +18,4 @@ R2 SQL is in public beta, and any developer with an R2 subscription can start us R2 SQL is Cloudflare's serverless, distributed, analytics query engine for querying Apache Iceberg tables stored in [R2 data catalog](https://developers.cloudflare.com/r2/data-catalog/). R2 SQL is designed to efficiently query large amounts of data by automatically utilizing file pruning, Cloudflare's distributed compute, and R2 object storage. -Query your first table in R2 SQL by following the Get Started guide, learn how to create a data pipeline that takes a stream of events and automatically creates an Apache Iceberg table, making them accessible with R2 SQL. \ No newline at end of file +Create an end to end data pipeline and query your first table in R2 SQL by following [this step by step guide](/r2/sql/end-to-end-pipeline/), learn how to create a data pipeline that takes a stream of events and automatically creates an Apache Iceberg table, making them accessible with R2 SQL. \ No newline at end of file From dd0e8d5237a1dc543ea36042b3305187f753ed11 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Fri, 19 Sep 2025 09:12:49 -0700 Subject: [PATCH 03/30] fix indents in index, add query-data --- src/content/docs/r2/sql/index.mdx | 8 ++++---- src/content/docs/r2/sql/query-data.mdx | 0 2 files changed, 4 insertions(+), 4 deletions(-) create mode 100644 src/content/docs/r2/sql/query-data.mdx diff --git a/src/content/docs/r2/sql/index.mdx b/src/content/docs/r2/sql/index.mdx index d37f584fdf50e8..a97dd2bfbd417b 100644 --- a/src/content/docs/r2/sql/index.mdx +++ b/src/content/docs/r2/sql/index.mdx @@ -2,11 +2,11 @@ pcx_content_type: navigation title: R2 SQL sidebar: - order: 7 - group: - badge: Beta + order: 7 + group: + badge: Beta head: [] -description: Query your R2 Data Catalog tables with R2 SQL. +description: A distributed SQL engine for R2 Data Catalog --- ## Efficiently Query Apache Iceberg tables in R2 Data Catalog Using R2 SQL. 
diff --git a/src/content/docs/r2/sql/query-data.mdx b/src/content/docs/r2/sql/query-data.mdx new file mode 100644 index 00000000000000..e69de29bb2d1d6 From 87e5a32f31faa0b0e5738c92cefaeb065a77f063 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Fri, 19 Sep 2025 10:49:34 -0700 Subject: [PATCH 04/30] Improved all docs, added index.mdx in platform also tested examples e2e --- .../docs/r2/sql/end-to-end-pipeline.mdx | 142 +++++++++++++----- src/content/docs/r2/sql/platform/index.mdx | 7 + src/content/docs/r2/sql/platform/pricing.mdx | 17 +++ src/content/docs/r2/sql/query-data.mdx | 77 ++++++++++ src/content/docs/r2/sql/troubleshooting.mdx | 7 +- 5 files changed, 206 insertions(+), 44 deletions(-) create mode 100644 src/content/docs/r2/sql/platform/index.mdx diff --git a/src/content/docs/r2/sql/end-to-end-pipeline.mdx b/src/content/docs/r2/sql/end-to-end-pipeline.mdx index 455d5ed5fd0a36..b9a7b65d8a0c9e 100644 --- a/src/content/docs/r2/sql/end-to-end-pipeline.mdx +++ b/src/content/docs/r2/sql/end-to-end-pipeline.mdx @@ -1,10 +1,10 @@ --- -title: Build a fraud detection pipeline with Cloudflare Pipelines and R2 SQL +title: Build an end to end data pipeline summary: Learn how to create an end-to-end data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL for real-time transaction analysis. pcx_content_type: tutorial products: - R2 - - R2 Data Catalog + - R2 Data Catalog - R2 SQL --- @@ -83,11 +83,9 @@ npx wrangler r2 bucket catalog compaction enable fraud-detection-data --token $W ### Create the Pipeline stream -Create a stream to receive incoming fraud detection events: - -```bash -npx wrangler pipelines streams create fraud-transactions \ - --schema '{ +First, create a schema file called `raw_transactions_schema.json` with the following `json` schema: +```json +{ "fields": [ {"name": "transaction_id", "type": "string", "required": true}, {"name": "user_id", "type": "int64", "required": true}, @@ -98,20 +96,70 @@ npx wrangler pipelines streams create fraud-transactions \ {"name": "is_fraud", "type": "string", "required": false}, {"name": "ingestion_timestamp", "type": "string", "required": false} ] - }' \ +} +``` + +Create a stream to receive incoming fraud detection events: + +```bash +npx wrangler pipelines streams create rawtransactionstream \ + --schema-file raw_transactions_schema.json \ --http-enabled true \ --http-auth true ``` :::note -After running the `stream create` command, note the **Stream Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline. +Note the **HTTP Ingest Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline. ::: +```bash +# The http ingest endpoint from the output (see example below) +export STREAM_ENDPOINT= #the http ingest endpoint from the output (see example below) +``` + +The output should look like this: +```sh +🌀 Creating stream 'rawtransactionstream'... +✨ Successfully created stream 'rawtransactionstream' with id 'stream_id'. 
+ +Creation Summary: +General: + Name: rawtransactionstream + +HTTP Ingest: + Enabled: Yes + Authentication: Yes + Endpoint: https://stream_id.ingest.cloudflare.com + CORS Origins: None + +Input Schema: +┌───────────────────────┬────────┬────────────┬──────────┐ +│ Field Name │ Type │ Unit/Items │ Required │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ transaction_id │ string │ │ Yes │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ user_id │ int64 │ │ Yes │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ amount │ f64 │ │ No │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ transaction_timestamp │ string │ │ No │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ location │ string │ │ No │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ merchant_category │ string │ │ No │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ is_fraud │ string │ │ No │ +├───────────────────────┼────────┼────────────┼──────────┤ +│ ingestion_timestamp │ string │ │ No │ +└───────────────────────┴────────┴────────────┴──────────┘ +``` + + ### Create the data sink Create a sink that writes data to your R2 bucket as Apache Iceberg tables: ```bash -npx wrangler pipelines sinks create fraud-data-sink \ +npx wrangler pipelines sinks create rawtransactionsink \ --type "r2-data-catalog" \ --bucket "fraud-detection-data" \ --roll-interval 30 \ @@ -129,8 +177,8 @@ This creates a `sink` configuration that will write to the Iceberg table fraud_d Connect your stream to your sink with SQL: ```bash -npx wrangler pipelines create fraud-pipeline \ - --sql "INSERT INTO fraud-data-sink SELECT * FROM fraud-transactions" +npx wrangler pipelines create transactionspipeline \ + --sql "INSERT INTO rawtransactionsink SELECT * FROM rawtransactionstream" ``` ## 5. 
Generate fraud detection data @@ -143,15 +191,16 @@ import json import uuid import random import time +import os from datetime import datetime, timezone, timedelta -# Configuration -STREAM_ENDPOINT = "https://YOUR_STREAM_ID.ingest.cloudflare.com" # From the stream you created -API_TOKEN = "WRANGLER_R2_SQL_AUTH_TOKEN" #the same one created earlier +# Configuration - exported from the prior steps +STREAM_ENDPOINT = os.environ["STREAM_ENDPOINT"]# From the stream you created +API_TOKEN = os.environ["WRANGLER_R2_SQL_AUTH_TOKEN"] #the same one created earlier EVENTS_TO_SEND = 1000 # Feel free to adjust this def generate_transaction(): - """Generate some transactions with occassional fraud patterns""" + """Generate some random transactions with occassional fraud""" # User IDs high_risk_users = [1001, 1002, 1003, 1004, 1005] @@ -160,7 +209,7 @@ def generate_transaction(): user_id = random.choice(high_risk_users + normal_users) is_high_risk_user = user_id in high_risk_users - # Generate amount + # Generate amounts if random.random() < 0.05: amount = round(random.uniform(5000, 50000), 2) elif random.random() < 0.03: @@ -169,8 +218,8 @@ def generate_transaction(): amount = round(random.uniform(10, 500), 2) # Locations - normal_locations = ["NEW_YORK", "LOS_ANGELES", "CHICAGO", "MIAMI", "SEATTLE"] - high_risk_locations = ["UNKNOWN_LOCATION", "VPN_EXIT", "BELARUS", "NIGERIA"] + normal_locations = ["NEW_YORK", "LOS_ANGELES", "CHICAGO", "MIAMI", "SEATTLE", "SAN FRANCISCO"] + high_risk_locations = ["UNKNOWN_LOCATION", "VPN_EXIT", "MARS", "BAT_CAVE"] if is_high_risk_user and random.random() < 0.3: location = random.choice(high_risk_locations) @@ -186,7 +235,7 @@ def generate_transaction(): else: merchant_category = random.choice(normal_merchants) - # Determine if transaction is fraudulent based on basic risk factors + # Series of checks to either increase fraud score by a certain margin fraud_score = 0 if amount > 2000: fraud_score += 0.4 if amount < 1: fraud_score += 0.3 @@ -194,7 +243,7 @@ def generate_transaction(): if merchant_category in high_risk_merchants: fraud_score += 0.3 if is_high_risk_user: fraud_score += 0.2 - # Compare the fraud score + # Compare the fraud scores is_fraud = random.random() < min(fraud_score * 0.3, 0.8) # Generate timestamps (some fraud happens at unusual hours) @@ -239,14 +288,13 @@ def send_batch_to_stream(events, batch_size=100): if response.status_code in [200, 201]: total_sent += len(batch) fraud_count += fraud_in_batch - print(f"✅ Sent batch of {len(batch)} events (Total: {total_sent})") + print(f"Sent batch of {len(batch)} events (Total: {total_sent})") else: - print(f"❌ Failed to send batch: {response.status_code} - {response.text}") + print(f"Failed to send batch: {response.status_code} - {response.text}") except Exception as e: - print(f"❌ Error sending batch: {e}") + print(f"Error sending batch: {e}") - # Small delay between batches time.sleep(0.1) return total_sent, fraud_count @@ -265,10 +313,10 @@ def main(): print(f"📊 Generated {len(events)} total events ({fraud_events} fraud, {fraud_events/len(events)*100:.1f}%)") # Send to stream - print("📤 Sending data to Cloudflare Stream...") + print("Sending data to Pipeline stream...") sent, fraud_sent = send_batch_to_stream(events) - print(f"\n🎉 Complete!") + print(f"\nComplete!") print(f" Events sent: {sent:,}") print(f" Fraud events: {fraud_sent:,} ({fraud_sent/sent*100:.1f}%)") print(f" Data is now flowing through your pipeline!") @@ -305,8 +353,8 @@ SELECT is_fraud, transaction_timestamp FROM 
fraud_detection.transactions -WHERE __ingest_ts > '2025-09-12T01:00:00Z' -AND is_fruad = 'TRUE' +WHERE __ingest_ts > '2025-09-24T01:00:00Z' +AND is_fraud = 'TRUE' LIMIT 10" ``` :::note @@ -318,7 +366,7 @@ Replace `YOUR_WAREHOUSE` with your R2 Data Catalog warehouse. This in the form o Create a new sink that will write the filtered data to a new Apache Iceberg table in R2 Data Catalog: ```bash -npx wrangler pipelines sink create filtered-fraud-sink \ +npx wrangler pipelines sinks create filteredfraudsink \ --type "r2-data-catalog" \ --bucket "fraud-detection-data" \ --roll-interval 30 \ @@ -327,20 +375,20 @@ npx wrangler pipelines sink create filtered-fraud-sink \ --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN ``` -Now you'll create a new SQL query to process data from the original `fraud-transactions` stream and only write flagged transactions that are over the `amount` of 1000. +Now you'll create a new SQL query to process data from the original `rawtransactionstream` stream and only write flagged transactions that are over the `amount` of 1000. ```bash -npx wrangler pipelines create fraud-pipeline \ - --sql "INSERT INTO filtered-fraud-sink SELECT * FROM fraud-transactions WHERE is_fraud='TRUE' and amount > 1000" +npx wrangler pipelines create fraudpipeline \ + --sql "INSERT INTO filteredfraudsink SELECT * FROM rawtransactionstream WHERE is_fraud='TRUE' and amount > 1000" ``` :::note It may take a few minutes for the new Pipeline to fully Initialize and start processing the data. Also keep in mind the 30 second `roll-interval` ::: -Let's query our table and check the results: +Let's query the table and check the results: ```bash -npx wrangler r2 sql query " +npx wrangler r2 sql query "YOUR_WAREHOUSE" " SELECT transaction_id, user_id, @@ -350,9 +398,27 @@ SELECT is_fraud, transaction_timestamp FROM fraud_detection.fraud_transactions -WHERE __ingest_ts > '2025-09-12T01:00:00Z' LIMIT 10" ``` +Let's also verify that the non-fraudulent events are being filtered out: +```bash +npx wrangler r2 sql query "YOUR_WAREHOUSE" " +SELECT + transaction_id, + user_id, + amount, + location, + merchant_category, + is_fraud, + transaction_timestamp +FROM fraud_detection.fraud_transactions +WHERE is_fraud = 'FALSE' +LIMIT 10" +``` +You should see the following output: +```text +Query executed successfully with no results +``` ## Conclusion @@ -360,5 +426,5 @@ You have successfully built an end to end data pipeline using Cloudflare's data 1. **Use R2 Data Catalog** - Leveraged Apache Iceberg tables for efficient data storage 2. **Set up Cloudflare Pipelines** - Created streams, sinks, and pipelines for data ingestion -3. **Generated sample data** - Created transaction data with basic fraud patterns -4. **Query with R2 SQL** - Performed complex fraud analysis using SQL queries +3. **Generated sample data** - Created transaction data with some basic fraud patterns +4. 
**Query your tables with R2 SQL** - Access raw and processed data tables stored in R2 Data Catalog diff --git a/src/content/docs/r2/sql/platform/index.mdx b/src/content/docs/r2/sql/platform/index.mdx new file mode 100644 index 00000000000000..ef43ff93fe3c19 --- /dev/null +++ b/src/content/docs/r2/sql/platform/index.mdx @@ -0,0 +1,7 @@ +--- +title: Platform +pcx_content_type: navigation +sidebar: + group: + hideIndex: true +--- diff --git a/src/content/docs/r2/sql/platform/pricing.mdx b/src/content/docs/r2/sql/platform/pricing.mdx index e69de29bb2d1d6..2b41cd9c6df209 100644 --- a/src/content/docs/r2/sql/platform/pricing.mdx +++ b/src/content/docs/r2/sql/platform/pricing.mdx @@ -0,0 +1,17 @@ +--- +pcx_content_type: concept +title: Pricing +sidebar: + order: 1 +head: + - tag: title + content: R2 SQL - Pricing + +--- + + +R2 SQL is currently not billed during open beta but will eventually be billed on the amount of data queried. + +During the first phase of the R2 SQL open beta, you will not be billed for R2 SQL usage. You will be billed only for R2 usage. + +We plan to price based on the volume of data queried by R2 SQL. We will provide at least 30 days' notice and exact pricing before charging. \ No newline at end of file diff --git a/src/content/docs/r2/sql/query-data.mdx b/src/content/docs/r2/sql/query-data.mdx index e69de29bb2d1d6..3905eb47b20380 100644 --- a/src/content/docs/r2/sql/query-data.mdx +++ b/src/content/docs/r2/sql/query-data.mdx @@ -0,0 +1,77 @@ +--- +title: Query data in R2 Data Catalog +pcx_content_type: example +--- + +:::note +R2 SQL is currently in open beta +::: + +## Prerequisites + +- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages). +- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket). +- [Create an R2 API token](/r2/api/tokens/) with [R2, R2 SQL, and data catalog permissions](/r2/api/tokens/#permissions). +- Tables must have a time-based partition key in order be queried by R2 SQL. Read about the current [limitations](/r2/sql/platform/limitations-best-practices) to learn more. + +R2 SQL can currently be accessed via Wrangler commands or a REST API. + +## Wrangler + + +Export your R2 API token as an environment variable: + +```bash +export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here +``` + +If this is your first time using Wrangler, make sure to login. +```bash +npx wrangler login +``` + +You'll also want to grab the **warehouse** of the your R2 Data Catalog: + +```sh +❯ npx wrangler r2 bucket catalog get [BUCKET_NAME] + + ⛅️ wrangler 4.38.0 +──────────────────────────────────────────────────────────────────────────── +▲ [WARNING] 🚧 `wrangler r2 bucket catalog get` is an open-beta command. Please report any issues to https://github.com/cloudflare/workers-sdk/issues/new/choose + + +Catalog URI: https://catalog.cloudflarestorage.com/[ACCOUNT_ID]/[BUCKET_NAME] +Warehouse: [ACCOUNT_ID]_[BUCKET_NAME] +Status: active +``` + +To query R2 SQL with Wrangler, simply run: + +```sh +npx wrangler r2 sql query "YOUR_WAREHOUSE" "SELECT * FROM namespace.table_name limit 10;" +``` +For a full list of supported sql commands, check out the [R2 SQL reference page](/r2/sql/platform/sql-reference). 
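
Queries are most efficient when they filter on the table's time-based partition column. As a sketch, assuming the table was written by Pipelines (which appends an `__ingest_ts` partition column) and using placeholder column names, the SQL you pass to the command above might look like:

```sql
-- Narrow the scan to a recent time window and cap the result size
SELECT transaction_id, user_id, amount
FROM namespace.table_name
WHERE __ingest_ts > '2025-01-01T00:00:00Z'
LIMIT 100;
```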
+ + +## REST API + +Set your environment variable + +```bash +export ACCOUNT_ID="your-cloudflare-account-id" +export BUCKET_NAME="your-r2-bucket-name" +export WRANGLER_R2_SQL_AUTH_TOKEN="your_token_here" +``` + +Now you're ready to use the REST endpoint + +```bash +curl -X POST \ + "https://api.sql.cloudflarestorage.com/api/v1/accounts/${ACCOUNT_ID}/r2-sql/query/${BUCKET_NAME}" \ + -H "Authorization: Bearer ${WRANGLER_R2_SQL_AUTH_TOKEN}" \ + -H "Content-Type: application/json" \ + -d '{ + "warehouse": "your-warehouse-name", + "query": "SELECT * FROM namespace.table_name limit 10;" + }' | jq . +``` \ No newline at end of file diff --git a/src/content/docs/r2/sql/troubleshooting.mdx b/src/content/docs/r2/sql/troubleshooting.mdx index 5f1a6542cc1819..d233ccdaeffa0c 100644 --- a/src/content/docs/r2/sql/troubleshooting.mdx +++ b/src/content/docs/r2/sql/troubleshooting.mdx @@ -118,7 +118,6 @@ SELECT * FROM table2 WHERE id IN ('id1', 'id2', 'id3') **Solution**: - Denormalize your data by including necessary fields in a single table - Perform multiple queries and join data in your application -- Restructure your data model to avoid cross-table queries ### Subqueries @@ -206,10 +205,7 @@ SELECT * FROM logs WHERE request_size > response_size SELECT * FROM events WHERE duration > 0 ``` -**Solution**: -- Pre-compute comparisons and store as separate columns -- Handle comparisons in your application layer -- Restructure your data model +**Solution**: Handle comparisons in your application layer --- @@ -277,7 +273,6 @@ UPDATE events SET status = 200 WHERE user_id = '123' **Solution**: - Plan your schema carefully before data ingestion -- Contact your data engineering team for schema changes - Ensure all column names exist in your current schema --- From b8abf911b2e24ddf2a2edeb85d07a01b522c319a Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Fri, 19 Sep 2025 10:50:54 -0700 Subject: [PATCH 05/30] removed redundant command --- src/content/docs/r2/sql/end-to-end-pipeline.mdx | 4 ---- 1 file changed, 4 deletions(-) diff --git a/src/content/docs/r2/sql/end-to-end-pipeline.mdx b/src/content/docs/r2/sql/end-to-end-pipeline.mdx index b9a7b65d8a0c9e..3e236a1e31f28d 100644 --- a/src/content/docs/r2/sql/end-to-end-pipeline.mdx +++ b/src/content/docs/r2/sql/end-to-end-pipeline.mdx @@ -325,10 +325,6 @@ if __name__ == "__main__": main() ``` -Update the configuration variables in the script: -- Replace `YOUR_STREAM_ID` with your actual stream endpoint from step 4 -- Replace `YOUR_API_TOKEN` with your Cloudflare API token - Install the required Python dependency and run the script: ```bash From 1f9632f0822aedbcbcf6a5f89186a9b7d0e6664b Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Sat, 20 Sep 2025 15:27:52 -0700 Subject: [PATCH 06/30] A ton of changes and improvements implemented Jerome's feedback in virtually all docs. Docs are properly organized now. 
--- .gitignore | 3 +- src/content/docs/r2/sql/get-started.mdx | 210 ++++++++++++++++++ .../platform/limitations-best-practices.mdx | 26 +-- src/content/docs/r2/sql/platform/pricing.mdx | 2 +- .../docs/r2/sql/platform/sql-reference.mdx | 23 +- src/content/docs/r2/sql/query-data.mdx | 7 +- src/content/docs/r2/sql/troubleshooting.mdx | 10 +- .../{ => tutorials}/end-to-end-pipeline.mdx | 76 +++---- src/content/docs/r2/sql/tutorials/index.mdx | 7 + 9 files changed, 292 insertions(+), 72 deletions(-) create mode 100644 src/content/docs/r2/sql/get-started.mdx rename src/content/docs/r2/sql/{ => tutorials}/end-to-end-pipeline.mdx (84%) create mode 100644 src/content/docs/r2/sql/tutorials/index.mdx diff --git a/.gitignore b/.gitignore index 501cf6fc5247c5..673fe364a8a3ec 100644 --- a/.gitignore +++ b/.gitignore @@ -29,4 +29,5 @@ pnpm-debug.log* /assets/secrets /worker/functions/ -.idea \ No newline at end of file +.idea +package-lock.json diff --git a/src/content/docs/r2/sql/get-started.mdx b/src/content/docs/r2/sql/get-started.mdx new file mode 100644 index 00000000000000..cc6773163e5977 --- /dev/null +++ b/src/content/docs/r2/sql/get-started.mdx @@ -0,0 +1,210 @@ +--- +pcx_content_type: get-started +title: Getting started +head: [] +sidebar: + order: 2 +description: Learn how to get up and running with R2 SQL using R2 Data Catalog and Pipelines +--- +import { + Render, + LinkCard, +} from "~/components"; + +## Overview + +This guide will instruct you through: + +- Creating an [R2 bucket](/r2/buckets/) and enabling its [data catalog](/r2/data-catalog/). +- Using Wrangler to create a Pipeline Stream, Sink, and the SQL that reads from the stream and writes it to the sink +- Sending some data to the stream via the HTTP Streams endpoint +- Querying the data using R2 SQL + +## Prerequisites + +1. Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up). +2. Install [Node.js](https://nodejs.org/en/). +3. Install [Wrangler](/workers/wranger/install-and-update) + +:::note[Node.js version manager] +Use a Node version manager like [Volta](https://volta.sh/) or [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and change Node.js versions. Wrangler requires a Node version of 16.17.0 or later. +::: + +## 1. Set up authentication + +You'll need API tokens to interact with Cloudflare services. + +### Custom API Token +1. Go to **My Profile** → **API Tokens** in the Cloudflare dashboard +2. Select **Create Token** → **Custom token** +3. Add the following permissions: + - **Workers Pipelines** - Read, Send, Edit + - **Workers R2 Storage** - Edit, Read + - **Workers R2 Data Catalog** - Edit, Read + - **Workers R2 SQL** - Read + +Export your new token as an environment variable: + +```bash +export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here +``` + +If this is your first time using Wrangler, make sure to login. +```bash +npx wrangler login +``` + +## 2. Create an R2 bucket + +Create a new R2 bucket: + +```bash +npx wrangler r2 bucket create r2-sql-demo +``` + +## 3. Enable R2 Data Catalog + +Enable [R2 Data Catalog](/r2/data-catalog/) feature on your bucket to use Apache Iceberg tables: + +```bash +npx wrangler r2 bucket catalog enable r2-sql-demo +``` +## 4. Create the data Pipeline + +### 1. 
Create the Pipeline Stream + +First, create a schema file called `demo_schema.json` with the following `json` schema: +```json +{ + "fields": [ + {"name": "user_id", "type": "int64", "required": true}, + {"name": "payload", "type": "string", "required": false}, + {"name": "numbers", "type": "int32", "required": false} + ] +} +``` +Next, crete the stream we'll use to ingest events to: + +```bash +npx wrangler pipelines streams create demo_stream \ + --schema-file demo_schema.json \ + --http-enabled true \ + --http-auth false +``` +:::note +Note the **HTTP Ingest Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline. +::: + +```bash +# The http ingest endpoint from the output (see example below) +export STREAM_ENDPOINT= #the http ingest endpoint from the output (see example below) +``` +The output should look like this: +```sh +🌀 Creating stream 'demo_stream'... +✨ Successfully created stream 'demo_stream' with id 'stream_id'. + +Creation Summary: +General: + Name: demo_stream + +HTTP Ingest: + Enabled: Yes + Authentication: No + Endpoint: https://stream_id.ingest.cloudflare.com + CORS Origins: None + +Input Schema: +┌────────────┬────────┬────────────┬──────────┐ +│ Field Name │ Type │ Unit/Items │ Required │ +├────────────┼────────┼────────────┼──────────┤ +│ user_id │ int64 │ │ Yes │ +├────────────┼────────┼────────────┼──────────┤ +│ payload │ string │ │ No │ +├────────────┼────────┼────────────┼──────────┤ +│ numbers │ int32 │ │ No │ +└────────────┴────────┴────────────┴──────────┘ +``` + + +### 2. Create the Pipeline Sink + +Create a sink that writes data to your R2 bucket as Apache Iceberg tables: + +```bash +npx wrangler pipelines sinks create demo_sink \ + --type "r2-data-catalog" \ + --bucket "r2-sql-demo" \ + --roll-interval 30 \ + --namespace "demo" \ + --table "first_table" \ + --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN +``` + +:::note +This creates a `sink` configuration that will write to the Iceberg table demo.first_table in your R2 Data Catalog every 30 seconds. Pipelines automatically appends an `__ingest_ts` column that is used to partition the table by `DAY` +::: + +### 3. Create the Pipeline + +Pipelines are SQL statements read data from the stream, does some work, and writes it to the sink + +```bash +npx wrangler pipelines create demo_pipeline \ + --sql "INSERT INTO demo_sink SELECT * FROM demo_stream WHERE numbers > 5;" +``` +:::note +Note that there is a filter on this statement that will only send events where `numbers` is greater than 5 +::: + +## 5. Send some data + +Next, let's send some events to our stream: + +```curl +curl -X POST "$STREAM_ENDPOINT" \ + -H "Authorization: Bearer YOUR_API_TOKEN" \ + -H "Content-Type: application/json" \ + -d '[ + { + "user_id": 1, + "payload": "you should see this", + "numbers": 42 + }, + { + "user_id": 2, + "payload": "you should also see this", + "numbers": 100 + }, + { + "user_id": 3, + "payload": null, + "numbers": 1 + }, + { + "user_id": 4, + "numbers": null + } + ]' +``` +This will send 4 events in one `POST`. Since our Pipeline is filtering out records with `numbers` less than 5, `user_id` `3` and `4` should not appear in the table. Feel free to change values and send more events. + +## 6. Query the table with R2 SQL + +After you've sent your events to the stream, it will take about 30 seconds for the data to show in the table since that's what we configured our `roll interval` to be in the Sink. 
+ +```bash +npx wrangler r2 sql query "SELECT * FROM demo.first_table LIMIT 10" +``` + + + + diff --git a/src/content/docs/r2/sql/platform/limitations-best-practices.mdx b/src/content/docs/r2/sql/platform/limitations-best-practices.mdx index da626501f8ff63..adb53dfab59804 100644 --- a/src/content/docs/r2/sql/platform/limitations-best-practices.mdx +++ b/src/content/docs/r2/sql/platform/limitations-best-practices.mdx @@ -21,20 +21,20 @@ R2 SQL is designed for querying **partitioned** Apache Iceberg tables in your R2 | Feature | Supported | Notes | | :---- | :---- | :---- | -| Basic SELECT | Yes | Columns, \*, aliases | -| SQL Functions | No | No COUNT, AVG, etc. | -| Single table FROM | Yes | With aliasing | +| Basic SELECT | Yes | Columns, \* | +| Aggregation functions | No | No COUNT, AVG, etc. | +| Single table FROM | Yes | Note, aliasing not supported| +| WHERE clause | Yes | Filters, comparisons, equality, etc | | JOINs | No | No table joins | -| WHERE with time | Yes | Required | | Array filtering | No | No array type support | | JSON filtering | No | No nested object queries | | Simple LIMIT | Yes | 1-10,000 range | -| ORDER BY | Yes | Only on partition key | +| ORDER BY | Yes | Any columns of the partition key only| | GROUP BY | No | Not supported | ## Supported SQL Clauses -R2 SQL supports a limited set of SQL clauses: `SELECT`, `FROM`, `WHERE`, and `LIMIT`. All other SQL clauses are not supported at the moment. New features will release often, keep an eye on this page and the changelog\[LINK TO CHANGE LOG\] for the latest. +R2 SQL supports a limited set of SQL clauses: `SELECT`, `FROM`, `WHERE`, and `LIMIT`. All other SQL clauses are not supported at the moment. New features will be released in the future, keep an eye on this page and the changelog\[LINK TO CHANGE LOG\] for the latest. 
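+
+As a quick illustration, the following query stays entirely within that supported surface (it assumes the sample `fraud_detection.transactions` table from the pipeline tutorial; substitute your own namespace, table, and columns):
+
+```sql
+-- Uses only SELECT, FROM, WHERE, and LIMIT
+SELECT transaction_id, amount, merchant_category
+FROM fraud_detection.transactions
+WHERE __ingest_ts > '2025-09-24T01:00:00Z' AND amount > 1000
+LIMIT 100
+```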
---
@@ -50,7 +50,7 @@ R2 SQL supports a limited set of SQL clauses: `SELECT`, `FROM`, `WHERE`, and `LI

- **No JSON field querying**: Cannot query individual fields from JSON objects
- **No SQL functions**: Functions like `AVG()`, `COUNT()`, `MAX()`, `MIN()`, quantiles are not supported
- **No synthetic data**: Cannot create synthetic columns like `SELECT 1 AS what, "hello" AS greeting`
-- **Field aliasing**: `SELECT field AS another_name`
+- **No field aliasing**: `SELECT field AS another_name`

### Examples

@@ -85,7 +85,7 @@ SELECT 1 AS synthetic_column

- **No schema evolution**: Schema cannot be altered (no ALTER TABLE, migrations)
- **Immutable datasets**: No UPDATE or DELETE operations allowed
- **Fully defined schema**: Dynamic or union-type fields are not supported
-- **Table aliasing**: `SELECT * FROM table_name AS alias`
+- **No table aliasing**: `SELECT * FROM table_name AS alias`

### Examples

@@ -105,13 +105,12 @@ SELECT * FROM (SELECT * FROM events WHERE status = 200)

### Supported Features

-- **Time filtering**: Queries should include a time filter
-- **Simple type filtering**: Supports `string`, `boolean`, and `number` types
+- **Simple type filtering**: Supports `string`, `boolean`, `number` types, and timestamps expressed in RFC3339 format
- **Boolean logic**: Supports `AND`, `OR`, `NOT` operators
- **Comparison operators**: `>`, `>=`, `=`, `<`, `<=`, `!=`
- **Grouped conditions**: `WHERE col_a="hello" AND (col_b>5 OR col_c != 3)`
-- **Pattern mating:** `WHERE col_a LIKE ‘%hello w%’`
-- **NULL Handling:** `WHERE col_a IS NOT NULL`
+- **Pattern matching:** `WHERE col_a LIKE 'hello w%'` (prefix matching only)
+- **NULL handling:** `WHERE col_a IS NOT NULL` (`IS`/`IS NOT`)

### Limitations

@@ -208,5 +207,4 @@ The following SQL clauses are **not supported**:

2. **Use specific column selection** instead of `SELECT *` when possible for better performance
3. **Structure your data** to avoid nested JSON objects if you need to filter on those fields

----
-
+--- \ No newline at end of file
diff --git a/src/content/docs/r2/sql/platform/pricing.mdx b/src/content/docs/r2/sql/platform/pricing.mdx
index 2b41cd9c6df209..b408b2f4192f63 100644
--- a/src/content/docs/r2/sql/platform/pricing.mdx
+++ b/src/content/docs/r2/sql/platform/pricing.mdx
@@ -14,4 +14,4 @@ R2 SQL is currently not billed during open beta but will eventually be billed on

During the first phase of the R2 SQL open beta, you will not be billed for R2 SQL usage. You will be billed only for R2 usage.

-We plan to price based on the volume of data queried by R2 SQL. We will provide at least 30 days' notice and exact pricing before charging. \ No newline at end of file
+We plan to price based on the volume of data queried by R2 SQL. We will provide at least 30 days' notice and exact pricing before charging. \ No newline at end of file
diff --git a/src/content/docs/r2/sql/platform/sql-reference.mdx b/src/content/docs/r2/sql/platform/sql-reference.mdx
index d020aac3b378d4..be24d1660642c8 100644
--- a/src/content/docs/r2/sql/platform/sql-reference.mdx
+++ b/src/content/docs/r2/sql/platform/sql-reference.mdx
@@ -93,6 +93,7 @@ SELECT * WHERE condition [AND|OR condition ...]

- `column_name <= value`
- `column_name < value`
- `column_name != value`
+- `column_name LIKE 'value%'`

#### Logical Operators

@@ -104,11 +105,12 @@
- **integer** \- Whole numbers
- **float** \- Decimal numbers
- **string** \- Text values (quoted)
+- **timestamp** - RFC3339 format (`'YYYY-MM-DDTHH:MM:SSZ'`)

### Examples

```sql
-SELECT * FROM table_name WHERE timestamp BETWEEN '2025-01-01' AND '2025-01-02'
+SELECT * FROM table_name WHERE timestamp BETWEEN '2025-09-24T01:00:00Z' AND '2025-09-25T01:00:00Z'
SELECT * FROM table_name WHERE status = 200
SELECT * FROM table_name WHERE response_time > 1000
SELECT * FROM table_name WHERE user_id IS NOT NULL
@@ -123,19 +125,21 @@ SELECT * FROM table_name WHERE (status = 404 OR status = 500) AND timestamp > '2

### Syntax

```sql
---Note: ORDERY BY only supports ordering by the partition key
+--Note: ORDER BY only supports ordering by the partition key
ORDER BY partition_key [DESC]
```

-- **Default**: Ascending order (ASC)
+- **ASC**: Ascending order
- **DESC**: Descending order
+- **Default**: partition_key DESC
+- Can contain any column from the partition key

### Examples

```sql
-SELECT * FROM table_name WHERE ... ORDER BY partitionKey
-SELECT * FROM table_name WHERE ... ORDER BY partitionKey DESC
-SELECT * FROM table_name WHERE ... ORDER BY partitionKey DESC
+SELECT * FROM table_name WHERE ... ORDER BY partition_key_A
+SELECT * FROM table_name WHERE ... ORDER BY partition_key_B DESC
+SELECT * FROM table_name WHERE ... ORDER BY partition_key_A ASC
```


@@ -151,6 +155,7 @@ LIMIT number

- **Range**: 1 to 10,000
- **Type**: Integer only
+- **Default**: 500

### Examples

@@ -167,7 +172,7 @@ SELECT * FROM table_name WHERE ... LIMIT 100

```sql
SELECT * FROM http_requests
-WHERE timestamp BETWEEN '2024-01-01' AND '2024-01-02'
+WHERE timestamp BETWEEN '2025-09-24T01:00:00Z' AND '2025-09-25T01:00:00Z'
LIMIT 100
```

@@ -215,6 +220,8 @@ LIMIT 500
| `integer` | Whole numbers | `1`, `42`, `-10`, `0` |
| `float` | Decimal numbers | `1.5`, `3.14`, `-2.7`, `0.0` |
| `string` | Text values | `'hello'`, `'GET'`, `'2024-01-01'` |
+| `boolean` | Boolean values | `true`, `false` |
+| `timestamp` | RFC3339 | `'2025-09-24T01:00:00Z'` |

### Type Usage in Conditions

@@ -237,7 +244,7 @@ SELECT * FROM table_name WHERE country_code = 'US'

## Operator Precedence

-1. **Comparison operators**: `=`, `!=`, `<`, `<=`, `>`, `>=`, `BETWEEN`, `IS NULL`, `IS NOT NULL`
+1. **Comparison operators**: `=`, `!=`, `<`, `<=`, `>`, `>=`, `LIKE`, `BETWEEN`, `IS NULL`, `IS NOT NULL`
2. **AND** (higher precedence)
3. **OR** (lower precedence)

diff --git a/src/content/docs/r2/sql/query-data.mdx b/src/content/docs/r2/sql/query-data.mdx
index 3905eb47b20380..c30e8d6c9ee1d9 100644
--- a/src/content/docs/r2/sql/query-data.mdx
+++ b/src/content/docs/r2/sql/query-data.mdx
@@ -1,6 +1,8 @@
---
title: Query data in R2 Data Catalog
pcx_content_type: example
+sidebar:
+  order: 3
---

:::note
@@ -12,7 +14,7 @@ R2 SQL is currently in open beta
- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket).
- [Create an R2 API token](/r2/api/tokens/) with [R2, R2 SQL, and data catalog permissions](/r2/api/tokens/#permissions).
-- Tables must have a time-based partition key in order be queried by R2 SQL. Read about the current [limitations](/r2/sql/platform/limitations-best-practices) to learn more.
+- Tables must have a time-based partition key in order to be queried by R2 SQL. Read about the current [limitations](/r2/sql/platform/limitations-best-practices) to learn more.
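+
+For example, for a table that Pipelines partitions by day on the `__ingest_ts` column it appends, a typical query bounds that column (the namespace, table name, and time window below are placeholders):
+
+```sql
+SELECT * FROM namespace.table_name
+WHERE __ingest_ts > '2025-09-24T01:00:00Z'
+LIMIT 10
+```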
R2 SQL can currently be accessed via Wrangler commands or a REST API. @@ -30,7 +32,7 @@ If this is your first time using Wrangler, make sure to login. npx wrangler login ``` -You'll also want to grab the **warehouse** of the your R2 Data Catalog: +You'll also want to grab the **warehouse** of the R2 Data Catalog: ```sh ❯ npx wrangler r2 bucket catalog get [BUCKET_NAME] @@ -71,7 +73,6 @@ curl -X POST \ -H "Authorization: Bearer ${WRANGLER_R2_SQL_AUTH_TOKEN}" \ -H "Content-Type: application/json" \ -d '{ - "warehouse": "your-warehouse-name", "query": "SELECT * FROM namespace.table_name limit 10;" }' | jq . ``` \ No newline at end of file diff --git a/src/content/docs/r2/sql/troubleshooting.mdx b/src/content/docs/r2/sql/troubleshooting.mdx index d233ccdaeffa0c..9a1a18349aa1c7 100644 --- a/src/content/docs/r2/sql/troubleshooting.mdx +++ b/src/content/docs/r2/sql/troubleshooting.mdx @@ -4,7 +4,7 @@ pcx_content_type: concept tags: - SQL sidebar: - order: 5 + order: 7 --- # R2 SQL Troubleshooting Guide @@ -23,12 +23,12 @@ This guide covers potential errors and limitations you may encounter when using ```sql -- Invalid - Missing FROM clause -SELECT user_id WHERE status = 200 +SELECT user_id WHERE status = 200; -- Valid SELECT user_id FROM http_requests -WHERE status = 200 AND timestamp BETWEEN '2024-01-01' AND '2024-01-02' +WHERE status = 200 AND timestamp BETWEEN '2025-09-24T01:00:00Z' AND '2025-09-25T01:00:00Z'; ``` **Solution**: Always include `FROM` in your queries. @@ -47,7 +47,7 @@ WHERE status = 200 AND timestamp BETWEEN '2024-01-01' AND '2024-01-02' ```sql -- Invalid - Aggregate functions not supported -SELECT COUNT(*) FROM events WHERE timestamp > '2024-01-01' +SELECT COUNT(*) FROM events WHERE timestamp > '2025-09-24T01:00:00Z' SELECT AVG(response_time) FROM http_requests WHERE status = 200 SELECT MAX(timestamp) FROM logs WHERE user_id = '123' ``` @@ -260,7 +260,7 @@ SELECT * FROM events WHERE timestamp > '2024-01-01T10:30:00Z' ORDER BY timestamp ### Dynamic Schema Changes
-**Error**: `Sinvalid SQL: only top-level SELECT clause is supported` +**Error**: `invalid SQL: only top-level SELECT clause is supported`
**Problem**: Attempting to modify table schema or reference non-existent columns. diff --git a/src/content/docs/r2/sql/end-to-end-pipeline.mdx b/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx similarity index 84% rename from src/content/docs/r2/sql/end-to-end-pipeline.mdx rename to src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx index 3e236a1e31f28d..cca15b3489b507 100644 --- a/src/content/docs/r2/sql/end-to-end-pipeline.mdx +++ b/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx @@ -59,7 +59,7 @@ npx wrangler login Create a new R2 bucket to store your fraud detection data: ```bash -npx wrangler r2 bucket create fraud-detection-data +npx wrangler r2 bucket create fraud-pipeline ``` ## 3. Enable R2 Data Catalog @@ -67,16 +67,21 @@ npx wrangler r2 bucket create fraud-detection-data Enable the Data Catalog feature on your bucket to use Apache Iceberg tables: ```bash -npx wrangler r2 bucket catalog enable fraud-detection-data +npx wrangler r2 bucket catalog enable fraud-pipeline ``` + :::note -Make sure to save the Warehouse for use later in this guide +Copy the warehouse (ACCOUNTID_BUCKETNAME) and paste it in the `export` below. We'll use it later in the tutorial. ::: +```bash +export $WAREHOUSE= #Paste your warehouse here +``` + ### Optional - Enable compaction on your R2 Data Catalog -R2 Data Catalog can automatically compact tables for you. In production event streaming use cases, it's common to end up with many small files so it's recommended to enable compaction. Since this is a sample use case, this is optional. +R2 Data Catalog can automatically compact tables for you. In production event streaming use cases, it's common to end up with many small files, so it's recommended to enable compaction. Since this is a sample use case, this is optional. ```bash -npx wrangler r2 bucket catalog compaction enable fraud-detection-data --token $WRANGLER_R2_SQL_AUTH_TOKEN +npx wrangler r2 bucket catalog compaction enable fraud-pipeline --token $WRANGLER_R2_SQL_AUTH_TOKEN ``` ## 4. Set up the pipeline infrastructure @@ -93,8 +98,7 @@ First, create a schema file called `raw_transactions_schema.json` with the follo {"name": "transaction_timestamp", "type": "string", "required": false}, {"name": "location", "type": "string", "required": false}, {"name": "merchant_category", "type": "string", "required": false}, - {"name": "is_fraud", "type": "string", "required": false}, - {"name": "ingestion_timestamp", "type": "string", "required": false} + {"name": "is_fraud", "type": "bool", "required": false} ] } ``` @@ -102,10 +106,10 @@ First, create a schema file called `raw_transactions_schema.json` with the follo Create a stream to receive incoming fraud detection events: ```bash -npx wrangler pipelines streams create rawtransactionstream \ +npx wrangler pipelines streams create raw_stream \ --schema-file raw_transactions_schema.json \ --http-enabled true \ - --http-auth true + --http-auth false ``` :::note Note the **HTTP Ingest Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline. @@ -117,12 +121,12 @@ export STREAM_ENDPOINT= #the http ingest endpoint from the output (see example b The output should look like this: ```sh -🌀 Creating stream 'rawtransactionstream'... -✨ Successfully created stream 'rawtransactionstream' with id 'stream_id'. +🌀 Creating stream 'raw_stream'... +✨ Successfully created stream 'raw_stream' with id 'stream_id'. 
Creation Summary: General: - Name: rawtransactionstream + Name: raw_stream HTTP Ingest: Enabled: Yes @@ -146,22 +150,18 @@ Input Schema: ├───────────────────────┼────────┼────────────┼──────────┤ │ merchant_category │ string │ │ No │ ├───────────────────────┼────────┼────────────┼──────────┤ -│ is_fraud │ string │ │ No │ -├───────────────────────┼────────┼────────────┼──────────┤ -│ ingestion_timestamp │ string │ │ No │ +│ is_fraud │ bool │ │ No │ └───────────────────────┴────────┴────────────┴──────────┘ ``` - - ### Create the data sink Create a sink that writes data to your R2 bucket as Apache Iceberg tables: ```bash -npx wrangler pipelines sinks create rawtransactionsink \ +npx wrangler pipelines sinks create raw_sink \ --type "r2-data-catalog" \ - --bucket "fraud-detection-data" \ + --bucket "fraud-pipeline" \ --roll-interval 30 \ --namespace "fraud_detection" \ --table "transactions" \ @@ -169,7 +169,7 @@ npx wrangler pipelines sinks create rawtransactionsink \ ``` :::note -This creates a `sink` configuration that will write to the Iceberg table fraud_detection.transactions every 30 seconds. Pipelines automatically appends an `__ingest_ts` column that is used to partion the table by `DAY` +This creates a `sink` configuration that will write to the Iceberg table fraud_detection.transactions in your R2 Data Catalog every 30 seconds. Pipelines automatically appends an `__ingest_ts` column that is used to partition the table by `DAY` ::: ### Create the pipeline @@ -177,8 +177,8 @@ This creates a `sink` configuration that will write to the Iceberg table fraud_d Connect your stream to your sink with SQL: ```bash -npx wrangler pipelines create transactionspipeline \ - --sql "INSERT INTO rawtransactionsink SELECT * FROM rawtransactionstream" +npx wrangler pipelines create raw_events_pipeline \ + --sql "INSERT INTO raw_sink SELECT * FROM raw_stream" ``` ## 5. Generate fraud detection data @@ -200,7 +200,7 @@ API_TOKEN = os.environ["WRANGLER_R2_SQL_AUTH_TOKEN"] #the same one created earli EVENTS_TO_SEND = 1000 # Feel free to adjust this def generate_transaction(): - """Generate some random transactions with occassional fraud""" + """Generate some random transactions with occasional fraud""" # User IDs high_risk_users = [1001, 1002, 1003, 1004, 1005] @@ -263,8 +263,7 @@ def generate_transaction(): "transaction_timestamp": transaction_time.isoformat(), "location": location, "merchant_category": merchant_category, - "is_fraud": "TRUE" if is_fraud else "FALSE", - "ingestion_timestamp": datetime.now(timezone.utc).isoformat() + "is_fraud": True if is_fraud else False } def send_batch_to_stream(events, batch_size=100): @@ -280,7 +279,7 @@ def send_batch_to_stream(events, batch_size=100): for i in range(0, len(events), batch_size): batch = events[i:i + batch_size] - fraud_in_batch = sum(1 for event in batch if event["is_fraud"] == "TRUE") + fraud_in_batch = sum(1 for event in batch if event["is_fraud"] == True) try: response = requests.post(STREAM_ENDPOINT, headers=headers, json=batch) @@ -309,7 +308,7 @@ def main(): if (i + 1) % 100 == 0: print(f"Generated {i + 1} events...") - fraud_events = sum(1 for event in events if event["is_fraud"] == "TRUE") + fraud_events = sum(1 for event in events if event["is_fraud"] == True) print(f"📊 Generated {len(events)} total events ({fraud_events} fraud, {fraud_events/len(events)*100:.1f}%)") # Send to stream @@ -339,7 +338,7 @@ Now you can analyze your fraud detection data using R2 SQL. 
Here are some exampl ### View recent transactions ```bash -npx wrangler r2 sql query "YOUR_WAREHOUSE" " +npx wrangler r2 sql query "$WAREHOUSE" " SELECT transaction_id, user_id, @@ -350,32 +349,29 @@ SELECT transaction_timestamp FROM fraud_detection.transactions WHERE __ingest_ts > '2025-09-24T01:00:00Z' -AND is_fraud = 'TRUE' +AND is_fraud = true LIMIT 10" ``` -:::note -Replace `YOUR_WAREHOUSE` with your R2 Data Catalog warehouse. This in the form of `{YOUR_ACCOUNT_ID}_{BUCKET_NAME}`. This can be found in the dash under the settings in your bucket. Adjust the `__ingest_ts` date in the query as needed. -::: ### Let's filter the raw transactions into a new table to highlight high-value transactions Create a new sink that will write the filtered data to a new Apache Iceberg table in R2 Data Catalog: ```bash -npx wrangler pipelines sinks create filteredfraudsink \ +npx wrangler pipelines sinks create fraud_filter_sink \ --type "r2-data-catalog" \ - --bucket "fraud-detection-data" \ + --bucket "fraud-pipeline" \ --roll-interval 30 \ --namespace "fraud_detection" \ --table "fraud_transactions" \ --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN ``` -Now you'll create a new SQL query to process data from the original `rawtransactionstream` stream and only write flagged transactions that are over the `amount` of 1000. +Now you'll create a new SQL query to process data from the original `raw_stream` stream and only write flagged transactions that are over the `amount` of 1000. ```bash -npx wrangler pipelines create fraudpipeline \ - --sql "INSERT INTO filteredfraudsink SELECT * FROM rawtransactionstream WHERE is_fraud='TRUE' and amount > 1000" +npx wrangler pipelines create fraud_events_pipeline \ + --sql "INSERT INTO fraud_filter_sink SELECT * FROM raw_stream WHERE is_fraud=true and amount > 1000" ``` :::note @@ -384,7 +380,7 @@ It may take a few minutes for the new Pipeline to fully Initialize and start pro Let's query the table and check the results: ```bash -npx wrangler r2 sql query "YOUR_WAREHOUSE" " +npx wrangler r2 sql query "$WAREHOUSE" " SELECT transaction_id, user_id, @@ -398,7 +394,7 @@ LIMIT 10" ``` Let's also verify that the non-fraudulent events are being filtered out: ```bash -npx wrangler r2 sql query "YOUR_WAREHOUSE" " +npx wrangler r2 sql query "$WAREHOUSE" " SELECT transaction_id, user_id, @@ -408,7 +404,7 @@ SELECT is_fraud, transaction_timestamp FROM fraud_detection.fraud_transactions -WHERE is_fraud = 'FALSE' +WHERE is_fraud = false LIMIT 10" ``` You should see the following output: diff --git a/src/content/docs/r2/sql/tutorials/index.mdx b/src/content/docs/r2/sql/tutorials/index.mdx new file mode 100644 index 00000000000000..78c229f8741630 --- /dev/null +++ b/src/content/docs/r2/sql/tutorials/index.mdx @@ -0,0 +1,7 @@ +--- +title: Tutorials +pcx_content_type: navigation +sidebar: + group: + hideIndex: true +--- From 9512bde1bea1c150737ce93a76555ff1aaaafbe4 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Sat, 20 Sep 2025 15:38:40 -0700 Subject: [PATCH 07/30] Update get-started.mdx --- src/content/docs/r2/sql/get-started.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/content/docs/r2/sql/get-started.mdx b/src/content/docs/r2/sql/get-started.mdx index cc6773163e5977..3b4a518140d710 100644 --- a/src/content/docs/r2/sql/get-started.mdx +++ b/src/content/docs/r2/sql/get-started.mdx @@ -39,7 +39,7 @@ You'll need API tokens to interact with Cloudflare services. 2. Select **Create Token** → **Custom token** 3. 
Add the following permissions: - **Workers Pipelines** - Read, Send, Edit - - **Workers R2 Storage** - Edit, Read + - **Workers R2 Storage** - Edit, Read - **Workers R2 Data Catalog** - Edit, Read - **Workers R2 SQL** - Read From 579cbf2b6a849938f49b014e21b3c9d85b7ea654 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Sat, 20 Sep 2025 15:39:30 -0700 Subject: [PATCH 08/30] Update end-to-end-pipeline.mdx --- src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx b/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx index cca15b3489b507..6759f1cb0d8273 100644 --- a/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx +++ b/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx @@ -41,7 +41,7 @@ You'll need API tokens to interact with Cloudflare services. - **Workers R2 Storage** - Edit, Read - **Workers R2 Data Catalog** - Edit, Read - **Workers R2 SQL** - Read - - **Workers R2 SQL** - Read, Send, Edit + - **Workers R2 SQL** - Read, Send, Edit Export your new token as an environment variable: From 3b1acc70e06b30d47761ca7cf8732d3555df81d1 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Mon, 22 Sep 2025 13:11:13 -0700 Subject: [PATCH 09/30] added dash steps/tabs, moved out of r2, reformatted most of the R2 SQL docs --- src/content/dash-routes/index.json | 2 +- .../docs/{r2/sql => r2-sql}/get-started.mdx | 165 ++++++++++++-- src/content/docs/r2-sql/index.mdx | 55 +++++ src/content/docs/r2-sql/query-data.mdx | 133 ++++++++++++ .../platform => r2-sql/reference}/index.mdx | 2 +- .../reference}/limitations-best-practices.mdx | 0 .../platform => r2-sql/reference}/pricing.mdx | 0 .../reference}/sql-reference.mdx | 0 .../{r2/sql => r2-sql}/troubleshooting.mdx | 2 + .../tutorials/end-to-end-pipeline.mdx | 201 +++++++++++++++--- .../{r2/sql => r2-sql}/tutorials/index.mdx | 0 src/content/docs/r2/r2-sql.mdx | 9 + src/content/docs/r2/sql/index.mdx | 21 -- src/content/docs/r2/sql/query-data.mdx | 78 ------- src/content/products/r2-sql.yaml | 12 ++ src/icons/r2-sql.svg | 1 + 16 files changed, 533 insertions(+), 148 deletions(-) rename src/content/docs/{r2/sql => r2-sql}/get-started.mdx (61%) create mode 100644 src/content/docs/r2-sql/index.mdx create mode 100644 src/content/docs/r2-sql/query-data.mdx rename src/content/docs/{r2/sql/platform => r2-sql/reference}/index.mdx (81%) rename src/content/docs/{r2/sql/platform => r2-sql/reference}/limitations-best-practices.mdx (100%) rename src/content/docs/{r2/sql/platform => r2-sql/reference}/pricing.mdx (100%) rename src/content/docs/{r2/sql/platform => r2-sql/reference}/sql-reference.mdx (100%) rename src/content/docs/{r2/sql => r2-sql}/troubleshooting.mdx (99%) rename src/content/docs/{r2/sql => r2-sql}/tutorials/end-to-end-pipeline.mdx (72%) rename src/content/docs/{r2/sql => r2-sql}/tutorials/index.mdx (100%) create mode 100644 src/content/docs/r2/r2-sql.mdx delete mode 100644 src/content/docs/r2/sql/index.mdx delete mode 100644 src/content/docs/r2/sql/query-data.mdx create mode 100644 src/content/products/r2-sql.yaml create mode 100644 src/icons/r2-sql.svg diff --git a/src/content/dash-routes/index.json b/src/content/dash-routes/index.json index c9037feda8e904..7cba51353839fc 100644 --- a/src/content/dash-routes/index.json +++ b/src/content/dash-routes/index.json @@ -261,7 +261,7 @@ }, { "name": "Pipelines", - "deeplink": "/?to=/:account/workers/pipelines", + "deeplink": "/?to=/:account/pipelines", "parent": ["Storage & 
Databases"] }, { diff --git a/src/content/docs/r2/sql/get-started.mdx b/src/content/docs/r2-sql/get-started.mdx similarity index 61% rename from src/content/docs/r2/sql/get-started.mdx rename to src/content/docs/r2-sql/get-started.mdx index 3b4a518140d710..909218bee3ed01 100644 --- a/src/content/docs/r2/sql/get-started.mdx +++ b/src/content/docs/r2-sql/get-started.mdx @@ -8,6 +8,10 @@ description: Learn how to get up and running with R2 SQL using R2 Data Catalog a --- import { Render, + Steps, + Tabs, + TabItem, + DashButton, LinkCard, } from "~/components"; @@ -34,19 +38,28 @@ Use a Node version manager like [Volta](https://volta.sh/) or [nvm](https://gith You'll need API tokens to interact with Cloudflare services. -### Custom API Token -1. Go to **My Profile** → **API Tokens** in the Cloudflare dashboard -2. Select **Create Token** → **Custom token** -3. Add the following permissions: - - **Workers Pipelines** - Read, Send, Edit - - **Workers R2 Storage** - Edit, Read - - **Workers R2 Data Catalog** - Edit, Read - - **Workers R2 SQL** - Read + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select **Manage API tokens**. + +3. Select **Create API token**. + +4. Select the **R2 Token** text to edit your API token name. + +5. Under **Permissions**, choose the **Admin Read & Write** permission. + +6. Select **Create API Token**. + +7. Note the **Token value**. + + Export your new token as an environment variable: ```bash -export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here +export WRANGLER_R2_SQL_AUTH_TOKEN= #paste your token here ``` If this is your first time using Wrangler, make sure to login. @@ -54,23 +67,75 @@ If this is your first time using Wrangler, make sure to login. npx wrangler login ``` -## 2. Create an R2 bucket +## 2. Create an R2 bucket and enable R2 Data Catalog + + + + +Create an R2 bucket: + + ```bash + npx wrangler r2 bucket create r2-sql-demo + ``` + + + + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + -Create a new R2 bucket: +2. Select **Create bucket**. + +3. Enter the bucket name: r2-sql-demo + +4. Select **Create bucket**. + + + + +## 2. Enable R2 Data Catalog + + + + +Enable the catalog on your R2 bucket: ```bash -npx wrangler r2 bucket create r2-sql-demo +npx wrangler r2 bucket catalog enable r2-sql-demo ``` -## 3. Enable R2 Data Catalog +When you run this command, take note of the "Warehouse". You will need these later. + + + + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select the bucket: r2-sql-demo. + +3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select **Enable**. + +4. Once enabled, note the **Catalog URI** and **Warehouse name**. + + + -Enable [R2 Data Catalog](/r2/data-catalog/) feature on your bucket to use Apache Iceberg tables: + +:::note +Copy the warehouse (ACCOUNTID_BUCKETNAME) and paste it in the `export` below. We'll use it later in the tutorial. +::: ```bash -npx wrangler r2 bucket catalog enable r2-sql-demo +export $WAREHOUSE= #Paste your warehouse here ``` -## 4. Create the data Pipeline +## 3. Create the data Pipeline + + + ### 1. Create the Pipeline Stream First, create a schema file called `demo_schema.json` with the following `json` schema: @@ -157,13 +222,75 @@ npx wrangler pipelines create demo_pipeline \ Note that there is a filter on this statement that will only send events where `numbers` is greater than 5 ::: + + + +1. In the Cloudflare dashboard, go to **Pipelines** > **Pipelines**. + + +2. 
Select **Create Pipeline**. + +3. **Connect to a Stream**: + - Pipeline name: `demo` + - Enable HTTP endpoint for sending data: Enabled + - HTTP authentication: Disabled (default) + - Select **Next** + +4. **Define Input Schema**: + - Select **JSON editor** + - Copy in the schema: + ```json + { + "fields": [ + {"name": "user_id", "type": "int64", "required": true}, + {"name": "payload", "type": "string", "required": false}, + {"name": "numbers", "type": "int32", "required": false} + ] + } + ``` + + - Select **Next** + +5. **Define Sink**: + - Select your R2 bucket: `r2-sql-demo` + - Storage type: **R2 Data Catalog** + - Namespace: `fraud_detection` + - Table name: `transactions` + - **Advanced Settings**: Change **Maximum Time Interval** to `30 seconds` + - Select **Next** + +6. **Credentials**: + - Disable **Automatically create an Account API token for your sink** + - Enter **Catalog Token** from step 1 + - Select **Next** + +7. **Pipeline Definition**: + - Leave the default SQL query: + ```sql + INSERT INTO demo_sink SELECT * FROM demo_stream; + ``` + - Select **Create Pipeline** + +8. :::note + Note the **HTTP Ingest Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline. + ::: + + + +```bash +# The http ingest endpoint +export STREAM_ENDPOINT= #the http ingest endpoint from the output (see example below) +``` + + + + ## 5. Send some data Next, let's send some events to our stream: ```curl curl -X POST "$STREAM_ENDPOINT" \ - -H "Authorization: Bearer YOUR_API_TOKEN" \ -H "Content-Type: application/json" \ -d '[ { @@ -194,7 +321,7 @@ This will send 4 events in one `POST`. Since our Pipeline is filtering out recor After you've sent your events to the stream, it will take about 30 seconds for the data to show in the table since that's what we configured our `roll interval` to be in the Sink. ```bash -npx wrangler r2 sql query "SELECT * FROM demo.first_table LIMIT 10" +npx wrangler r2 sql query "$WAREHOUSE" "SELECT * FROM demo.first_table LIMIT 10" ``` diff --git a/src/content/docs/r2-sql/index.mdx b/src/content/docs/r2-sql/index.mdx new file mode 100644 index 00000000000000..554d31c511cd99 --- /dev/null +++ b/src/content/docs/r2-sql/index.mdx @@ -0,0 +1,55 @@ +--- +pcx_content_type: navigation +title: R2 SQL +sidebar: + order: 7 +head: + - tag: title + content: R2 SQL +description: A distributed SQL engine for R2 Data Catalog +--- + +## Query Apache Iceberg tables in R2 Data Catalog Using R2 SQL + + +:::note +R2 SQL is in public beta, and any developer with an R2 subscription can start using it. Currently, outside of standard R2 storage and operations, you will not be billed for your use of R2 SQL. We will update [the pricing page](/r2-sql/reference/pricing) and provide at least 30 days notice before enabling billing. +::: + +R2 SQL is Cloudflare's serverless, distributed, analytics query engine for querying [Apache Iceberg](https://iceberg.apache.org/) tables stored in [R2 data catalog](https://developers.cloudflare.com/r2/data-catalog/). R2 SQL is designed to efficiently query large amounts of data by automatically utilizing file pruning, Cloudflare's distributed compute, and R2 object storage. + +```sh +❯ npx wrangler r2 sql query "3373912de3f5202317188ae01300bd6_data-catalog" \ +"SELECT * FROM default.transactions LIMIT 10" + + ⛅️ wrangler 4.38.0 +──────────────────────────────────────────────────────────────────────────── +▲ [WARNING] 🚧 `wrangler r2 sql query` is an open-beta command. 
Please report any issues to https://github.com/cloudflare/workers-sdk/issues/new/choose + + +┌─────────────────────────────┬──────────────────────────────────────┬─────────┬──────────┬──────────────────────────────────┬───────────────┬───────────────────┬──────────┐ +│ __ingest_ts │ transaction_id │ user_id │ amount │ transaction_timestamp │ location │ merchant_category │ is_fraud │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.872554Z │ fdc1beed-157c-4d2d-90cf-630fdea58051 │ 1679 │ 13241.59 │ 2025-09-20T02:23:04.269988+00:00 │ NEW_YORK │ RESTAURANT │ false │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.724378Z │ ea7ef106-8284-4d08-9348-ad33989b6381 │ 1279 │ 17615.79 │ 2025-09-20T02:23:04.271090+00:00 │ MIAMI │ GAS_STATION │ true │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.724330Z │ afcdee4d-5c71-42be-97ec-e282b6937a8c │ 1843 │ 7311.65 │ 2025-09-20T06:23:04.267890+00:00 │ SEATTLE │ GROCERY │ true │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.657007Z │ b99d14e0-dbe0-49bc-a417-0ee57f8bed99 │ 1976 │ 15228.21 │ 2025-09-16T23:23:04.269426+00:00 │ NEW_YORK │ RETAIL │ false │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.656992Z │ 712cd094-ad4c-4d24-819a-0d3daaaceea1 │ 1184 │ 7570.89 │ 2025-09-20T00:23:04.269163+00:00 │ LOS_ANGELES │ RESTAURANT │ true │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.656912Z │ b5a1aab3-676d-4492-92b8-aabcde6db261 │ 1196 │ 46611.25 │ 2025-09-20T16:23:04.268693+00:00 │ NEW_YORK │ RETAIL │ true │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.613740Z │ 432d3976-8d89-4813-9099-ea2afa2c0e70 │ 1720 │ 21547.9 │ 2025-09-20T05:23:04.273681+00:00 │ SAN FRANCISCO │ GROCERY │ true │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.532068Z │ 25e0b851-3092-4ade-842f-e3189e07d4ee │ 1562 │ 29311.54 │ 2025-09-20T05:23:04.277405+00:00 │ NEW_YORK │ RETAIL │ false │ +├─────────────────────────────┼──────────────────────────────────────┼─────────┼──────────┼──────────────────────────────────┼───────────────┼───────────────────┼──────────┤ +│ 2025-09-20T22:30:11.526037Z │ 8001746d-05fe-42fe-a189-40caf81d7aa2 │ 1817 │ 15976.5 │ 2025-09-15T16:23:04.266632+00:00 │ SEATTLE │ RESTAURANT │ true │ +└─────────────────────────────┴──────────────────────────────────────┴─────────┴──────────┴──────────────────────────────────┴───────────────┴───────────────────┴──────────┘ +Read 11.3 kB across 4 files 
from R2 +On average, 3.36 kB / s +``` + +Create an end to end data pipeline and query your first table in R2 SQL by following [this step by step guide](/r2-sql/tutorials/end-to-end-pipeline/), learn how to create a data pipeline that takes a stream of events and automatically creates an Apache Iceberg table, making them accessible with R2 SQL. \ No newline at end of file diff --git a/src/content/docs/r2-sql/query-data.mdx b/src/content/docs/r2-sql/query-data.mdx new file mode 100644 index 00000000000000..a4aeee435eb273 --- /dev/null +++ b/src/content/docs/r2-sql/query-data.mdx @@ -0,0 +1,133 @@ +--- +title: Query data in R2 Data Catalog +pcx_content_type: configuration +description: Understand how to query data with R2 SQL +sidebar: + order: 3 +--- +import { + Render, + LinkCard, +} from "~/components"; + +:::note +R2 SQL is currently in open beta +::: + +Learn how to: +- Create an API key with the necessary permissions. +- Query data with R2 SQL. + +R2 SQL can currently be accessed via Wrangler commands or a REST API. + +## Create an API key with the right permissions + +To query Apache Iceberg tables in R2 Data Catalog, you must provide a Cloudflare API token with R2 SQL, R2 Data Catalog, and R2 storage permissions. + +### Create API token in the dashboard + +Create an [API token](https://dash.cloudflare.com/profile/api-tokens) with: + +- Access to R2 Data Catalog (**minimum**: read-only) +- Access to R2 storage (**minimum**: read-only) +- Access to R2 SQL (**minimum**: read-only) + +Wrangler now supports the environment variable `WRANGLER_R2_SQL_AUTH_TOKEN` which you can `export` your token as. + +### Create API token via API + +To create an API token programmatically for use with R2 SQL, you'll need to specify R2 SQL, R2 Data Catalog, and R2 storage permission groups in your [Access Policy](/r2/api/tokens/#access-policy). + +#### Example Access Policy + +```json +[ + { + "id": "f267e341f3dd4697bd3b9f71dd96247f", + "effect": "allow", + "resources": { + "com.cloudflare.edge.r2.bucket.4793d734c0b8e484dfc37ec392b5fa8a_default_my-bucket": "*", + "com.cloudflare.edge.r2.bucket.4793d734c0b8e484dfc37ec392b5fa8a_eu_my-eu-bucket": "*" + }, + "permission_groups": [ + { + "id": "45db74139a62490b9b60eb7c4f34994b", + "name": "Workers R2 Data Catalog Read" + }, + { + "id": "6a018a9f2fc74eb6b293b0c548f38b39", + "name": "Workers R2 Storage Bucket Item Read" + }, + { + "id": "f45430d92e2b4a6cb9f94f2594c141b8", + "name": "Workers R2 SQL Read" + } + ] + } +] +``` + + +## Query data via Wrangler + +Export your R2 API token as an environment variable: + +```bash +export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here +``` + +If this is your first time using Wrangler, make sure to login. +```bash +npx wrangler login +``` + +:::note +You'll want to copy the **warehouse** of the R2 Data Catalog: +::: + +```sh +❯ npx wrangler r2 bucket catalog get [BUCKET_NAME] + + ⛅️ wrangler 4.38.0 +──────────────────────────────────────────────────────────────────────────── +▲ [WARNING] 🚧 `wrangler r2 bucket catalog get` is an open-beta command. 
Please report any issues to https://github.com/cloudflare/workers-sdk/issues/new/choose + + +Catalog URI: https://catalog.cloudflarestorage.com/[ACCOUNT_ID]/[BUCKET_NAME] +Warehouse: [ACCOUNT_ID]_[BUCKET_NAME] +Status: active +``` + +To query R2 SQL with Wrangler, simply run: + +```sh +npx wrangler r2 sql query "YOUR_WAREHOUSE" "SELECT * FROM namespace.table_name limit 10;" +``` +For a full list of supported sql commands, check out the [R2 SQL reference page](/r2-sql/reference/sql-reference). + + +## REST API +Below is an example of using R2 SQL via the REST endpoint: + +```bash +curl -X POST \ + "https://api.sql.cloudflarestorage.com/api/v1/accounts/{ACCOUNT_ID}/r2-sql/query/{BUCKET_NAME}" \ + -H "Authorization: Bearer ${WRANGLER_R2_SQL_AUTH_TOKEN}" \ + -H "Content-Type: application/json" \ + -d '{ + "query": "SELECT * FROM namespace.table_name limit 10;" + }' +``` + +Learn more: + + + diff --git a/src/content/docs/r2/sql/platform/index.mdx b/src/content/docs/r2-sql/reference/index.mdx similarity index 81% rename from src/content/docs/r2/sql/platform/index.mdx rename to src/content/docs/r2-sql/reference/index.mdx index ef43ff93fe3c19..ab0a6ad35089fb 100644 --- a/src/content/docs/r2/sql/platform/index.mdx +++ b/src/content/docs/r2-sql/reference/index.mdx @@ -1,5 +1,5 @@ --- -title: Platform +title: Reference pcx_content_type: navigation sidebar: group: diff --git a/src/content/docs/r2/sql/platform/limitations-best-practices.mdx b/src/content/docs/r2-sql/reference/limitations-best-practices.mdx similarity index 100% rename from src/content/docs/r2/sql/platform/limitations-best-practices.mdx rename to src/content/docs/r2-sql/reference/limitations-best-practices.mdx diff --git a/src/content/docs/r2/sql/platform/pricing.mdx b/src/content/docs/r2-sql/reference/pricing.mdx similarity index 100% rename from src/content/docs/r2/sql/platform/pricing.mdx rename to src/content/docs/r2-sql/reference/pricing.mdx diff --git a/src/content/docs/r2/sql/platform/sql-reference.mdx b/src/content/docs/r2-sql/reference/sql-reference.mdx similarity index 100% rename from src/content/docs/r2/sql/platform/sql-reference.mdx rename to src/content/docs/r2-sql/reference/sql-reference.mdx diff --git a/src/content/docs/r2/sql/troubleshooting.mdx b/src/content/docs/r2-sql/troubleshooting.mdx similarity index 99% rename from src/content/docs/r2/sql/troubleshooting.mdx rename to src/content/docs/r2-sql/troubleshooting.mdx index 9a1a18349aa1c7..ad4688334bbb3b 100644 --- a/src/content/docs/r2/sql/troubleshooting.mdx +++ b/src/content/docs/r2-sql/troubleshooting.mdx @@ -7,6 +7,8 @@ sidebar: order: 7 --- + + # R2 SQL Troubleshooting Guide This guide covers potential errors and limitations you may encounter when using R2 SQL. R2 SQL is in open beta and supported functionality will evolve and change over time. 
diff --git a/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx b/src/content/docs/r2-sql/tutorials/end-to-end-pipeline.mdx similarity index 72% rename from src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx rename to src/content/docs/r2-sql/tutorials/end-to-end-pipeline.mdx index 6759f1cb0d8273..f8b6d7face7e29 100644 --- a/src/content/docs/r2/sql/tutorials/end-to-end-pipeline.mdx +++ b/src/content/docs/r2-sql/tutorials/end-to-end-pipeline.mdx @@ -6,12 +6,21 @@ products: - R2 - R2 Data Catalog - R2 SQL + - Pipelines --- +import { + Render, + Steps, + Tabs, + TabItem, + DashButton, + LinkCard, +} from "~/components"; -# Build a fraud detection pipeline with the Cloudflare Data Platform +# Build an end to end data pipeline with the Cloudflare Data Platform -In this guide, you will learn how to build a complete data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL. This also includes a sample Python script that creates and sends financial transaction data to your Pipeline that can be queried by R2 SQL or any Apache Iceberg-compatible query engine. +In this tutorial, you will learn how to build a complete data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL. This also includes a sample Python script that creates and sends financial transaction data to your Pipeline that can be queried by R2 SQL or any Apache Iceberg-compatible query engine. This tutorial demonstrates how to: - Set up R2 Data Catalog to store our transaction events in an Apache Iceberg table @@ -21,7 +30,6 @@ This tutorial demonstrates how to: ## Prerequisites - 1. Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up). 2. Install [Node.js](https://nodejs.org/en/). 3. Install [Python 3.8+](https://python.org) for the data generation script. @@ -34,19 +42,37 @@ Use a Node version manager like [Volta](https://volta.sh/) or [nvm](https://gith You'll need API tokens to interact with Cloudflare services. -### Custom API Token -1. Go to **My Profile** → **API Tokens** in the Cloudflare dashboard -2. Select **Create Token** → **Custom token** -3. Add the following permissions: - - **Workers R2 Storage** - Edit, Read - - **Workers R2 Data Catalog** - Edit, Read - - **Workers R2 SQL** - Read - - **Workers R2 SQL** - Read, Send, Edit + + +1. In the Cloudflare dashboard, go to the **API tokens** page. + + +2. Select **Create Token**. + +3. Select **Get started** next to Create Custom Token. + +4. Enter a name for your API token. + +5. Under **Permissions**, choose: + - **Workers Pipelines** with Read, Send, and Edit permissions + - **Workers R2 Data Catalog** with Read and Edit permissions + - **Workers R2 SQL** with Read permissions + - **Workers R2 Storage** with Read and Edit permissions + +6. Optionally add a TTL to this token + +7. Select **Continue to summary**. + +8. Click **Create Token** + +8. Note the **Token value**. + + Export your new token as an environment variable: ```bash -export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here +export WRANGLER_R2_SQL_AUTH_TOKEN= #paste your token here ``` If this is your first time using Wrangler, make sure to login. @@ -54,22 +80,63 @@ If this is your first time using Wrangler, make sure to login. npx wrangler login ``` -## 2. Create an R2 bucket +## 2. Create an R2 bucket and enable R2 Data Catalog -Create a new R2 bucket to store your fraud detection data: + + -```bash -npx wrangler r2 bucket create fraud-pipeline -``` +Create an R2 bucket: + + ```bash + npx wrangler r2 bucket create fraud-pipeline + ``` + + + + + +1. 
In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select **Create bucket**. -## 3. Enable R2 Data Catalog +3. Enter the bucket name: fraud-pipeline -Enable the Data Catalog feature on your bucket to use Apache Iceberg tables: +4. Select **Create bucket**. + + + + +## 2. Enable R2 Data Catalog + + + + +Enable the catalog on your R2 bucket: ```bash npx wrangler r2 bucket catalog enable fraud-pipeline ``` +When you run this command, take note of the "Warehouse" and "Catalog URI". You will need these later. + + + + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select the bucket: fraud-pipeline. + +3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select **Enable**. + +4. Once enabled, note the **Catalog URI** and **Warehouse name**. + + + + + :::note Copy the warehouse (ACCOUNTID_BUCKETNAME) and paste it in the `export` below. We'll use it later in the tutorial. ::: @@ -80,14 +147,34 @@ export $WAREHOUSE= #Paste your warehouse here ### Optional - Enable compaction on your R2 Data Catalog R2 Data Catalog can automatically compact tables for you. In production event streaming use cases, it's common to end up with many small files, so it's recommended to enable compaction. Since this is a sample use case, this is optional. + + + ```bash npx wrangler r2 bucket catalog compaction enable fraud-pipeline --token $WRANGLER_R2_SQL_AUTH_TOKEN ``` + + + + +1. In the Cloudflare dashboard, go to the **R2 object storage** page. + + +2. Select the bucket: fraud-pipeline. + +3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, click on edit icon, and select **Enable**. + +4. You can choose a target file size or leave the default. Click save. + + + + ## 4. Set up the pipeline infrastructure ### Create the Pipeline stream - + + First, create a schema file called `raw_transactions_schema.json` with the following `json` schema: ```json { @@ -106,7 +193,7 @@ First, create a schema file called `raw_transactions_schema.json` with the follo Create a stream to receive incoming fraud detection events: ```bash -npx wrangler pipelines streams create raw_stream \ +npx wrangler pipelines streams create raw_events_stream \ --schema-file raw_transactions_schema.json \ --http-enabled true \ --http-auth false @@ -121,12 +208,12 @@ export STREAM_ENDPOINT= #the http ingest endpoint from the output (see example b The output should look like this: ```sh -🌀 Creating stream 'raw_stream'... -✨ Successfully created stream 'raw_stream' with id 'stream_id'. +🌀 Creating stream 'raw_events_stream'... +✨ Successfully created stream 'raw_events_stream' with id 'stream_id'. Creation Summary: General: - Name: raw_stream + Name: raw_events_stream HTTP Ingest: Enabled: Yes @@ -159,7 +246,7 @@ Input Schema: Create a sink that writes data to your R2 bucket as Apache Iceberg tables: ```bash -npx wrangler pipelines sinks create raw_sink \ +npx wrangler pipelines sinks create raw_events_sink \ --type "r2-data-catalog" \ --bucket "fraud-pipeline" \ --roll-interval 30 \ @@ -178,8 +265,66 @@ Connect your stream to your sink with SQL: ```bash npx wrangler pipelines create raw_events_pipeline \ - --sql "INSERT INTO raw_sink SELECT * FROM raw_stream" + --sql "INSERT INTO raw_events_sink SELECT * FROM raw_events_stream" ``` + + + +1. In the Cloudflare dashboard, go to **Pipelines** > **Pipelines**. + + +2. Select **Create Pipeline**. + +3. 
**Connect to a Stream**: + - Pipeline name: `raw_events` + - Enable HTTP endpoint for sending data: Enabled + - HTTP authentication: Disabled (default) + - Select **Next** + +4. **Define Input Schema**: + - Select **JSON editor** + - Copy in the schema: + ```json + { + "fields": [ + {"name": "transaction_id", "type": "string", "required": true}, + {"name": "user_id", "type": "int64", "required": true}, + {"name": "amount", "type": "f64", "required": false}, + {"name": "transaction_timestamp", "type": "string", "required": false}, + {"name": "location", "type": "string", "required": false}, + {"name": "merchant_category", "type": "string", "required": false}, + {"name": "is_fraud", "type": "bool", "required": false} + ] + } + ``` + + - Select **Next** + +5. **Define Sink**: + - Select your R2 bucket: `fraud-pipeline` + - Storage type: **R2 Data Catalog** + - Namespace: `fraud_detection` + - Table name: `transactions` + - **Advanced Settings**: Change **Maximum Time Interval** to `30 seconds` + - Select **Next** + +6. **Credentials**: + - Disable **Automatically create an Account API token for your sink** + - Enter **Catalog Token** from step 1 + - Select **Next** + +7. **Pipeline Definition**: + - Leave the default SQL query: + ```sql + INSERT INTO raw_events_sink SELECT * FROM raw_events_stream; + ``` + - Select **Create Pipeline** + +8. After pipeline creation, note the **Stream ID** for the next step. + + + + ## 5. Generate fraud detection data @@ -367,11 +512,11 @@ npx wrangler pipelines sinks create fraud_filter_sink \ --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN ``` -Now you'll create a new SQL query to process data from the original `raw_stream` stream and only write flagged transactions that are over the `amount` of 1000. +Now you'll create a new SQL query to process data from the original `raw_events_stream` stream and only write flagged transactions that are over the `amount` of 1000. ```bash npx wrangler pipelines create fraud_events_pipeline \ - --sql "INSERT INTO fraud_filter_sink SELECT * FROM raw_stream WHERE is_fraud=true and amount > 1000" + --sql "INSERT INTO fraud_filter_sink SELECT * FROM raw_events_stream WHERE is_fraud=true and amount > 1000" ``` :::note diff --git a/src/content/docs/r2/sql/tutorials/index.mdx b/src/content/docs/r2-sql/tutorials/index.mdx similarity index 100% rename from src/content/docs/r2/sql/tutorials/index.mdx rename to src/content/docs/r2-sql/tutorials/index.mdx diff --git a/src/content/docs/r2/r2-sql.mdx b/src/content/docs/r2/r2-sql.mdx new file mode 100644 index 00000000000000..44656fe71508e6 --- /dev/null +++ b/src/content/docs/r2/r2-sql.mdx @@ -0,0 +1,9 @@ +--- +pcx_content_type: navigation +title: R2 SQL +external_link: /r2-sql/ +sidebar: + order: 7 + group: + badge: Beta +--- \ No newline at end of file diff --git a/src/content/docs/r2/sql/index.mdx b/src/content/docs/r2/sql/index.mdx deleted file mode 100644 index a97dd2bfbd417b..00000000000000 --- a/src/content/docs/r2/sql/index.mdx +++ /dev/null @@ -1,21 +0,0 @@ ---- -pcx_content_type: navigation -title: R2 SQL -sidebar: - order: 7 - group: - badge: Beta -head: [] -description: A distributed SQL engine for R2 Data Catalog ---- - -## Efficiently Query Apache Iceberg tables in R2 Data Catalog Using R2 SQL. - - -:::note -R2 SQL is in public beta, and any developer with an R2 subscription can start using it. 
Currently, outside of standard R2 storage and operations, you will not be billed for your use of R2 SQL -::: - -R2 SQL is Cloudflare's serverless, distributed, analytics query engine for querying Apache Iceberg tables stored in [R2 data catalog](https://developers.cloudflare.com/r2/data-catalog/). R2 SQL is designed to efficiently query large amounts of data by automatically utilizing file pruning, Cloudflare's distributed compute, and R2 object storage. - -Create an end to end data pipeline and query your first table in R2 SQL by following [this step by step guide](/r2/sql/end-to-end-pipeline/), learn how to create a data pipeline that takes a stream of events and automatically creates an Apache Iceberg table, making them accessible with R2 SQL. \ No newline at end of file diff --git a/src/content/docs/r2/sql/query-data.mdx b/src/content/docs/r2/sql/query-data.mdx deleted file mode 100644 index c30e8d6c9ee1d9..00000000000000 --- a/src/content/docs/r2/sql/query-data.mdx +++ /dev/null @@ -1,78 +0,0 @@ ---- -title: Query data in R2 Data Catalog -pcx_content_type: example -sidebar: - order: 3 ---- - -:::note -R2 SQL is currently in open beta -::: - -## Prerequisites - -- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages). -- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket). -- [Create an R2 API token](/r2/api/tokens/) with [R2, R2 SQL, and data catalog permissions](/r2/api/tokens/#permissions). -- Tables must have a time-based partition key in order to be queried by R2 SQL. Read about the current [limitations](/r2/sql/platform/limitations-best-practices) to learn more. - -R2 SQL can currently be accessed via Wrangler commands or a REST API. - -## Wrangler - - -Export your R2 API token as an environment variable: - -```bash -export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here -``` - -If this is your first time using Wrangler, make sure to login. -```bash -npx wrangler login -``` - -You'll also want to grab the **warehouse** of the R2 Data Catalog: - -```sh -❯ npx wrangler r2 bucket catalog get [BUCKET_NAME] - - ⛅️ wrangler 4.38.0 -──────────────────────────────────────────────────────────────────────────── -▲ [WARNING] 🚧 `wrangler r2 bucket catalog get` is an open-beta command. Please report any issues to https://github.com/cloudflare/workers-sdk/issues/new/choose - - -Catalog URI: https://catalog.cloudflarestorage.com/[ACCOUNT_ID]/[BUCKET_NAME] -Warehouse: [ACCOUNT_ID]_[BUCKET_NAME] -Status: active -``` - -To query R2 SQL with Wrangler, simply run: - -```sh -npx wrangler r2 sql query "YOUR_WAREHOUSE" "SELECT * FROM namespace.table_name limit 10;" -``` -For a full list of supported sql commands, check out the [R2 SQL reference page](/r2/sql/platform/sql-reference). - - -## REST API - -Set your environment variable - -```bash -export ACCOUNT_ID="your-cloudflare-account-id" -export BUCKET_NAME="your-r2-bucket-name" -export WRANGLER_R2_SQL_AUTH_TOKEN="your_token_here" -``` - -Now you're ready to use the REST endpoint - -```bash -curl -X POST \ - "https://api.sql.cloudflarestorage.com/api/v1/accounts/${ACCOUNT_ID}/r2-sql/query/${BUCKET_NAME}" \ - -H "Authorization: Bearer ${WRANGLER_R2_SQL_AUTH_TOKEN}" \ - -H "Content-Type: application/json" \ - -d '{ - "query": "SELECT * FROM namespace.table_name limit 10;" - }' | jq . 
-``` \ No newline at end of file diff --git a/src/content/products/r2-sql.yaml b/src/content/products/r2-sql.yaml new file mode 100644 index 00000000000000..4c72d34cd28cab --- /dev/null +++ b/src/content/products/r2-sql.yaml @@ -0,0 +1,12 @@ + +name: R2 SQL + +product: + title: R2 SQL + url: /r2-sql/ + group: Developer platform + +meta: + title: R2 SQL docs + description: Cloudflare's serverless, distributed query engine for data stored in R2 Data Catalog + author: '@cloudflare' \ No newline at end of file diff --git a/src/icons/r2-sql.svg b/src/icons/r2-sql.svg new file mode 100644 index 00000000000000..3d391a2de4036a --- /dev/null +++ b/src/icons/r2-sql.svg @@ -0,0 +1 @@ + \ No newline at end of file From 5a2776806d5a0eeb06e8facf6549cde5a8b6cb2f Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Mon, 22 Sep 2025 13:27:03 -0700 Subject: [PATCH 10/30] added new R2 SQL token env variable --- .../docs/workers/wrangler/system-environment-variables.mdx | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/content/docs/workers/wrangler/system-environment-variables.mdx b/src/content/docs/workers/wrangler/system-environment-variables.mdx index d24c93dacafc7c..c8864ecc75c06c 100644 --- a/src/content/docs/workers/wrangler/system-environment-variables.mdx +++ b/src/content/docs/workers/wrangler/system-environment-variables.mdx @@ -84,6 +84,8 @@ Wrangler supports the following environment variables: - `DOCKER_HOST` - Used for local development of [Containers](/containers/local-dev). Wrangler will attempt to automatically find the correct socket to use to communicate with your container engine. If that does not work (usually surfacing as an `internal error` when attempting to connect to your Container), you can try setting the socket path using this environment variable. +* `WRANGLER_R2_SQL_AUTH_TOKEN` + - API token used for executing queries with [R2 SQL](/r2-sql). ## Example `.env` file The following is an example `.env` file: @@ -96,6 +98,7 @@ WRANGLER_SEND_METRICS=true CLOUDFLARE_API_BASE_URL=https://api.cloudflare.com/client/v4 WRANGLER_LOG=debug WRANGLER_LOG_PATH=../Desktop/my-logs/my-log-file.log +WRANGLER_R2_SQL_AUTH_TOKEN= ``` ## Deprecated global variables From dde1d628cf417fdbefa5d4b597c659b339440d79 Mon Sep 17 00:00:00 2001 From: Marc Selwan Date: Mon, 22 Sep 2025 13:56:15 -0700 Subject: [PATCH 11/30] adding wrangler commands --- src/content/docs/r2-sql/index.mdx | 2 +- src/content/docs/r2-sql/platform/index.mdx | 7 ++++++ .../{reference => platform}/pricing.mdx | 0 .../r2-sql/platform/wrangler-commands.mdx | 14 +++++++++++ .../docs/workers/wrangler/commands.mdx | 6 +++++ .../workers/wrangler-commands/r2-sql.mdx | 24 +++++++++++++++++++ 6 files changed, 52 insertions(+), 1 deletion(-) create mode 100644 src/content/docs/r2-sql/platform/index.mdx rename src/content/docs/r2-sql/{reference => platform}/pricing.mdx (100%) create mode 100644 src/content/docs/r2-sql/platform/wrangler-commands.mdx create mode 100644 src/content/partials/workers/wrangler-commands/r2-sql.mdx diff --git a/src/content/docs/r2-sql/index.mdx b/src/content/docs/r2-sql/index.mdx index 554d31c511cd99..2bb44f23e21e14 100644 --- a/src/content/docs/r2-sql/index.mdx +++ b/src/content/docs/r2-sql/index.mdx @@ -13,7 +13,7 @@ description: A distributed SQL engine for R2 Data Catalog :::note -R2 SQL is in public beta, and any developer with an R2 subscription can start using it. Currently, outside of standard R2 storage and operations, you will not be billed for your use of R2 SQL. 
We will update [the pricing page](/r2-sql/reference/pricing) and provide at least 30 days notice before enabling billing. +R2 SQL is in public beta, and any developer with an R2 subscription can start using it. Currently, outside of standard R2 storage and operations, you will not be billed for your use of R2 SQL. We will update [the pricing page](/r2-sql/platform/pricing) and provide at least 30 days notice before enabling billing. ::: R2 SQL is Cloudflare's serverless, distributed, analytics query engine for querying [Apache Iceberg](https://iceberg.apache.org/) tables stored in [R2 data catalog](https://developers.cloudflare.com/r2/data-catalog/). R2 SQL is designed to efficiently query large amounts of data by automatically utilizing file pruning, Cloudflare's distributed compute, and R2 object storage. diff --git a/src/content/docs/r2-sql/platform/index.mdx b/src/content/docs/r2-sql/platform/index.mdx new file mode 100644 index 00000000000000..ef43ff93fe3c19 --- /dev/null +++ b/src/content/docs/r2-sql/platform/index.mdx @@ -0,0 +1,7 @@ +--- +title: Platform +pcx_content_type: navigation +sidebar: + group: + hideIndex: true +--- diff --git a/src/content/docs/r2-sql/reference/pricing.mdx b/src/content/docs/r2-sql/platform/pricing.mdx similarity index 100% rename from src/content/docs/r2-sql/reference/pricing.mdx rename to src/content/docs/r2-sql/platform/pricing.mdx diff --git a/src/content/docs/r2-sql/platform/wrangler-commands.mdx b/src/content/docs/r2-sql/platform/wrangler-commands.mdx new file mode 100644 index 00000000000000..76e85fae5e3720 --- /dev/null +++ b/src/content/docs/r2-sql/platform/wrangler-commands.mdx @@ -0,0 +1,14 @@ +--- +pcx_content_type: concept +title: Wrangler commands +sidebar: + order: 80 +--- + +import { Render, Type, MetaInfo } from "~/components"; + + + +## Global commands + + \ No newline at end of file diff --git a/src/content/docs/workers/wrangler/commands.mdx b/src/content/docs/workers/wrangler/commands.mdx index 383044d7e36322..622ef7b15c76b6 100644 --- a/src/content/docs/workers/wrangler/commands.mdx +++ b/src/content/docs/workers/wrangler/commands.mdx @@ -36,6 +36,7 @@ Wrangler offers a number of commands to manage your Cloudflare Workers. - [`kv bulk`](#kv-bulk) - Manage multiple key-value pairs within a Workers KV namespace in batches. - [`r2 bucket`](#r2-bucket) - Manage Workers R2 buckets. - [`r2 object`](#r2-object) - Manage Workers R2 objects. +- [`r2 sql`](#r2-sql) - Query tables in R2 Data Catalog with R2 SQL. - [`secret`](#secret) - Manage the secret variables for a Worker. - [`secret bulk`](#secret-bulk) - Manage multiple secret variables for a Worker. - [`secrets-store secret`](#secrets-store-secret) - Manage account secrets within a secrets store. @@ -356,6 +357,11 @@ wrangler delete [