---
pcx_content_type: get-started
title: Getting started
head: []
sidebar:
  order: 2
description: Learn how to get up and running with R2 SQL using R2 Data Catalog and Pipelines
---

import {
	Render,
	LinkCard,
} from "~/components";

## Overview

This guide will instruct you through:

- Creating an [R2 bucket](/r2/buckets/) and enabling its [data catalog](/r2/data-catalog/).
- Using Wrangler to create a Pipelines stream, a sink, and the SQL pipeline that reads from the stream and writes to the sink.
- Sending data to the stream via its HTTP ingest endpoint.
- Querying the data using R2 SQL.

## Prerequisites

1. Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up).
2. Install [Node.js](https://nodejs.org/en/).
3. Install [Wrangler](/workers/wrangler/install-and-update/).

:::note[Node.js version manager]
Use a Node version manager like [Volta](https://volta.sh/) or [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and to change Node.js versions. Wrangler requires a Node.js version of 16.17.0 or later.
:::

## 1. Set up authentication

You'll need an API token to interact with Cloudflare services.

### Custom API token

1. Go to **My Profile** → **API Tokens** in the Cloudflare dashboard.
2. Select **Create Token** → **Custom token**.
3. Add the following permissions:
   - **Workers Pipelines** - Read, Send, Edit
   - **Workers R2 Storage** - Edit, Read
   - **Workers R2 Data Catalog** - Edit, Read
   - **Workers R2 SQL** - Read

Export your new token as an environment variable:

```bash
export WRANGLER_R2_SQL_AUTH_TOKEN=your_token_here
```

If this is your first time using Wrangler, make sure to log in:

```bash
npx wrangler login
```

## 2. Create an R2 bucket

Create a new R2 bucket:

```bash
npx wrangler r2 bucket create r2-sql-demo
```

## 3. Enable R2 Data Catalog

Enable the [R2 Data Catalog](/r2/data-catalog/) feature on your bucket to use Apache Iceberg tables:

```bash
npx wrangler r2 bucket catalog enable r2-sql-demo
```

## 4. Create the data pipeline

### 1. Create the Pipeline stream

First, create a schema file called `demo_schema.json` with the following JSON schema:

```json
{
  "fields": [
    {"name": "user_id", "type": "int64", "required": true},
    {"name": "payload", "type": "string", "required": false},
    {"name": "numbers", "type": "int32", "required": false}
  ]
}
```
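Before wiring up the stream, it can help to sanity-check events against this schema locally. The helper below is an illustrative sketch that mirrors `demo_schema.json`; it is not part of Wrangler or the Pipelines API:

```python
# Hypothetical local validator mirroring demo_schema.json above.
SCHEMA = {
    "fields": [
        {"name": "user_id", "type": "int64", "required": True},
        {"name": "payload", "type": "string", "required": False},
        {"name": "numbers", "type": "int32", "required": False},
    ]
}

# Rough mapping from schema types to Python types for the check.
PYTHON_TYPES = {"int64": int, "int32": int, "string": str}

def validate_event(event: dict) -> bool:
    """Return True if the event satisfies the schema's fields and types."""
    for field in SCHEMA["fields"]:
        name = field["name"]
        if name not in event or event[name] is None:
            if field["required"]:
                return False  # missing a required field
            continue  # optional fields may be absent or null
        if not isinstance(event[name], PYTHON_TYPES[field["type"]]):
            return False  # wrong type for this field
    return True

print(validate_event({"user_id": 1, "payload": "ok", "numbers": 42}))  # True
print(validate_event({"payload": "missing user_id"}))                  # False
```

Events that fail a check like this would be rejected at ingest, so validating client-side saves a round trip.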
Next, create the stream we'll use to ingest events into:

```bash
npx wrangler pipelines streams create demo_stream \
  --schema-file demo_schema.json \
  --http-enabled true \
  --http-auth false
```

:::note
Note the **HTTP Ingest Endpoint URL** from the output. This is the endpoint you'll use to send data to your pipeline.
:::

```bash
# Set this to the HTTP ingest endpoint from the output (see example below)
export STREAM_ENDPOINT=
```
The output should look like this:

```sh
🌀 Creating stream 'demo_stream'...
✨ Successfully created stream 'demo_stream' with id 'stream_id'.

Creation Summary:
General:
  Name: demo_stream

HTTP Ingest:
  Enabled: Yes
  Authentication: No
  Endpoint: https://stream_id.ingest.cloudflare.com
  CORS Origins: None

Input Schema:
┌────────────┬────────┬────────────┬──────────┐
│ Field Name │ Type   │ Unit/Items │ Required │
├────────────┼────────┼────────────┼──────────┤
│ user_id    │ int64  │            │ Yes      │
├────────────┼────────┼────────────┼──────────┤
│ payload    │ string │            │ No       │
├────────────┼────────┼────────────┼──────────┤
│ numbers    │ int32  │            │ No       │
└────────────┴────────┴────────────┴──────────┘
```


### 2. Create the Pipeline sink

Create a sink that writes data to your R2 bucket as Apache Iceberg tables:

```bash
npx wrangler pipelines sinks create demo_sink \
  --type "r2-data-catalog" \
  --bucket "r2-sql-demo" \
  --roll-interval 30 \
  --namespace "demo" \
  --table "first_table" \
  --catalog-token $WRANGLER_R2_SQL_AUTH_TOKEN
```

:::note
This creates a sink configuration that writes to the Iceberg table `demo.first_table` in your R2 Data Catalog every 30 seconds. Pipelines automatically appends an `__ingest_ts` column, which is used to partition the table by `DAY`.
:::

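To see what `DAY` partitioning means in practice, the sketch below truncates ingest timestamps to their calendar day: rows whose `__ingest_ts` falls on the same day land in the same partition. The function name is illustrative, not part of the Pipelines API:

```python
from datetime import datetime, timezone

def day_partition(ingest_ts: datetime) -> str:
    """Truncate an ingest timestamp to its DAY partition value."""
    return ingest_ts.strftime("%Y-%m-%d")

a = datetime(2024, 6, 1, 9, 30, tzinfo=timezone.utc)
b = datetime(2024, 6, 1, 23, 59, tzinfo=timezone.utc)
c = datetime(2024, 6, 2, 0, 1, tzinfo=timezone.utc)

# a and b share a partition; c falls into the next day's partition.
print(day_partition(a), day_partition(b), day_partition(c))
# 2024-06-01 2024-06-01 2024-06-02
```

Day partitioning means queries that filter on ingest time only need to scan the matching days' files.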
### 3. Create the pipeline

A pipeline is a SQL statement that reads data from the stream, optionally transforms it, and writes it to the sink:

```bash
npx wrangler pipelines create demo_pipeline \
  --sql "INSERT INTO demo_sink SELECT * FROM demo_stream WHERE numbers > 5;"
```

:::note
Note that this statement includes a filter: only events where `numbers` is greater than 5 are written to the sink.
:::

## 5. Send some data

Next, let's send some events to our stream. Since the stream was created with `--http-auth false`, no Authorization header is needed:

```bash
curl -X POST "$STREAM_ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '[
    {
      "user_id": 1,
      "payload": "you should see this",
      "numbers": 42
    },
    {
      "user_id": 2,
      "payload": "you should also see this",
      "numbers": 100
    },
    {
      "user_id": 3,
      "payload": null,
      "numbers": 1
    },
    {
      "user_id": 4,
      "numbers": null
    }
  ]'
```
This sends 4 events in one `POST`. Since our pipeline only accepts records where `numbers` is greater than 5, `user_id` `3` and `4` should not appear in the table. Feel free to change values and send more events.

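To double-check which events should survive the filter, you can mirror the pipeline's `WHERE numbers > 5` predicate locally. This is a plain Python sketch of the filter logic, not how Pipelines actually executes SQL:

```python
# The same four events sent in the curl request above.
events = [
    {"user_id": 1, "payload": "you should see this", "numbers": 42},
    {"user_id": 2, "payload": "you should also see this", "numbers": 100},
    {"user_id": 3, "payload": None, "numbers": 1},
    {"user_id": 4, "numbers": None},
]

# Mirror the pipeline's WHERE numbers > 5. In SQL, comparing NULL is
# never true, so user_id 4 is dropped along with user_id 3.
kept = [e for e in events if e.get("numbers") is not None and e["numbers"] > 5]

print([e["user_id"] for e in kept])  # [1, 2]
```

Only `user_id` 1 and 2 pass the predicate, which matches what you should see in the table below.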
## 6. Query the table with R2 SQL

After you've sent your events to the stream, it will take about 30 seconds for the data to appear in the table, since that's the `roll-interval` we configured on the sink.

```bash
npx wrangler r2 sql query "SELECT * FROM demo.first_table LIMIT 10"
```

<LinkCard
  title="Managing R2 Data Catalogs"
  href="/r2/data-catalog/manage-catalogs/"
  description="Enable or disable R2 Data Catalog on your bucket, retrieve configuration details, and authenticate your Iceberg engine."
/>

<LinkCard
  title="Try another example"
  href="/r2/sql/tutorials/end-to-end-pipeline"
  description="Detailed tutorial for setting up a simple fraud detection data pipeline and generating events for it in Python."
/>