Commit b073768

Adds documentation for R2 Data Catalog
1 parent b3a03ce commit b073768

5 files changed

+448
-40
lines changed

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
---
pcx_content_type: navigation
title: Configuration examples
head: []
sidebar:
  order: 3
  group:
    hideIndex: true
description: Find detailed setup instructions for Apache Spark and other common query engines.
---

import { DirectoryListing } from "~/components";

Below are configuration examples to connect various Iceberg engines to [R2 Data Catalog](/r2/data-catalog/):

<DirectoryListing />
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: PyIceberg
pcx_content_type: example
---

Below is an example of using [PyIceberg](https://py.iceberg.apache.org/) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/s3/tokens/) with both R2 and data catalog permissions.
- Install the [PyIceberg](https://py.iceberg.apache.org/#installation) and [PyArrow](https://arrow.apache.org/docs/python/install.html) libraries.

## Example usage

```py
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URL = f"https://catalog.cloudflarestorage.com/{WAREHOUSE}"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URL,
    token=TOKEN,
)

# Create the default namespace, ignoring the error if it already exists
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

# Create simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
test_table = ("default", "my_table")
table = catalog.create_table(
    test_table,
    schema=df.schema,
)
```
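
The example above calls `create_table` unconditionally, so running the script a second time will likely fail because `default.my_table` already exists. Below is a minimal sketch of a rerunnable variant. The helper name `load_or_create_table` is hypothetical; it relies only on the catalog methods `table_exists`, `load_table`, and `create_table`, and accepts any object that provides them.

```python
def load_or_create_table(catalog, identifier, schema):
    """Return the table at `identifier`, creating it on first use.

    `catalog` is expected to be a PyIceberg catalog such as the
    RestCatalog above. The helper name is hypothetical.
    """
    if catalog.table_exists(identifier):
        # The table is already registered in the catalog: reuse it.
        return catalog.load_table(identifier)
    # First run: create the table with the given schema.
    return catalog.create_table(identifier, schema=schema)
```

With the variables from the example, this could be called as `table = load_or_create_table(catalog, ("default", "my_table"), df.schema)`. The check and the create are two separate calls, so this sketch assumes a single writer.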
Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
---
pcx_content_type: get-started
title: Get Started
head: []
sidebar:
  order: 2
description: Learn how to enable the R2 Data Catalog on your bucket, load sample data, and run your first query.
---

import {
	Render,
	PackageManagers,
	Steps,
	FileTree,
	Tabs,
	TabItem,
	TypeScriptExample,
	WranglerConfig,
	LinkCard,
} from "~/components";

## Overview

This guide walks you through:

- Creating your first [R2 bucket](/r2/buckets/) and enabling its [data catalog](/r2/data-catalog/).
- Creating an API token that query engines need to authenticate with your data catalog.
- Using [PyIceberg](https://py.iceberg.apache.org/) to create your first Iceberg table in a [marimo](https://marimo.io/) Python notebook.
- Using [PyIceberg](https://py.iceberg.apache.org/) to load sample data into your table and query it.

## Prerequisites

<Render file="prereqs" product="workers" />

## 1. Create an R2 bucket

<Tabs syncKey='CLIvDash'>
<TabItem label='Wrangler CLI'>

<Steps>
1. If not already logged in, run:

```
npx wrangler login
```

2. Create an R2 bucket:

```
npx wrangler r2 bucket create r2-data-catalog-tutorial
```

</Steps>

</TabItem>
<TabItem label='Dashboard'>

<Steps>
1. From the Cloudflare dashboard, select **R2 Object Storage** from the sidebar.
2. Select **Create bucket**.
3. Enter the bucket name: `r2-data-catalog-tutorial`.
4. Select **Create bucket**.
</Steps>
</TabItem>
</Tabs>

## 2. Enable the data catalog for your bucket

<Tabs syncKey='CLIvDash'>
<TabItem label='Wrangler CLI'>

Enable the catalog on your R2 bucket:

```
npx wrangler r2 bucket catalog enable r2-data-catalog-tutorial
```

When the command completes, note the **Warehouse** and **Catalog URI** values. You will need them later.

</TabItem>
<TabItem label='Dashboard'>

<Steps>
1. From the Cloudflare dashboard, select **R2 Object Storage** from the sidebar.
2. Select the bucket you want to enable as a data catalog.
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select **Enable**.
4. Once enabled, note the **Catalog URI** and **Warehouse name**.
</Steps>
</TabItem>
</Tabs>

## 3. Create an API token

Iceberg clients (including [PyIceberg](https://py.iceberg.apache.org/)) must authenticate to the catalog with a [Cloudflare API token](/fundamentals/api/get-started/create-token/) that has both R2 and catalog permissions.

<Steps>
1. From the Cloudflare dashboard, select **R2 Object Storage** from the sidebar.

2. Expand the **API** dropdown and select **Manage API tokens**.

3. Select **Create API token**.

4. Select the **R2 Token** text to edit your API token name.

5. Under **Permissions**, choose the **Admin Read & Write** permission.

6. Select **Create API Token**.

7. Note the **Token value**. You will need it when you configure the notebook.

</Steps>

## 4. Install uv

Next, install a Python package manager. This guide uses [uv](https://docs.astral.sh/uv/). If you do not already have uv installed, follow the [installing uv guide](https://docs.astral.sh/uv/getting-started/installation/).

## 5. Install marimo

This guide uses [marimo](https://github.com/marimo-team/marimo) as a Python notebook.

<Steps>
1. Create a directory where the notebook will live:

```
mkdir r2-data-catalog-notebook
```

2. Change into the new directory:

```
cd r2-data-catalog-notebook
```

3. Create a new Python virtual environment:

```
uv venv
```

4. Activate the Python virtual environment:

```
source .venv/bin/activate
```

5. Install marimo and the libraries the notebook imports:

```
uv pip install marimo pyarrow pyiceberg pandas
```

</Steps>

## 6. Create a Python notebook to interact with our data warehouse

<Steps>
1. Create a file called `r2-data-catalog-tutorial.py`.

2. Paste the following code snippet into your `r2-data-catalog-tutorial.py` file:

```py
import marimo

__generated_with = "0.11.31"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _():
    import pandas
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    from pyiceberg.catalog.rest import RestCatalog
    from pyiceberg.exceptions import NamespaceAlreadyExistsError

    # Define catalog connection details (replace variables)
    WAREHOUSE = "<WAREHOUSE>"
    TOKEN = "<TOKEN>"
    CATALOG_URL = f"https://catalog.cloudflarestorage.com/{WAREHOUSE}"

    # Connect to R2 Data Catalog
    catalog = RestCatalog(
        name="my_catalog",
        warehouse=WAREHOUSE,
        uri=CATALOG_URL,
        token=TOKEN,
    )
    return (
        CATALOG_URL,
        NamespaceAlreadyExistsError,
        RestCatalog,
        TOKEN,
        WAREHOUSE,
        catalog,
        pa,
        pandas,
        pc,
        pq,
    )


@app.cell
def _(NamespaceAlreadyExistsError, catalog):
    # Create default namespace if needed
    try:
        catalog.create_namespace("default")
    except NamespaceAlreadyExistsError:
        pass
    return


@app.cell
def _(pa):
    # Create simple PyArrow table
    df = pa.table({
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "score": [80.0, 92.5, 88.0],
    })
    return (df,)


@app.cell
def _(catalog, df):
    # Create or load Iceberg table
    test_table = ("default", "people")
    if not catalog.table_exists(test_table):
        print(f"Creating table: {test_table}")
        table = catalog.create_table(
            test_table,
            schema=df.schema,
        )
    else:
        table = catalog.load_table(test_table)
    return table, test_table


@app.cell
def _(df, table):
    # Append data
    table.append(df)
    return


@app.cell
def _(table):
    print("Table contents:")
    scanned = table.scan().to_arrow()
    print(scanned.to_pandas())
    return (scanned,)


@app.cell
def _():
    # Optional cleanup. To run uncomment and run cell
    # print(f"Deleting table: {test_table}")
    # catalog.drop_table(test_table)
    # print("Table dropped.")
    return


if __name__ == "__main__":
    app.run()
```

3. Replace the `WAREHOUSE` and `TOKEN` variables with your values from sections **2** and **3** respectively.

4. Open the notebook with `marimo edit r2-data-catalog-tutorial.py` and run its cells.

</Steps>

In the Python notebook above, you:

1. Connect to your catalog.
2. Create the `default` namespace.
3. Create a simple PyArrow table.
4. Create (or load) the `people` table in the `default` namespace.
5. Append sample data to the table.
6. Print the contents of the table.
7. (Optional) Drop the `people` table created for this tutorial.

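The notebook stores the warehouse name and API token as string literals in the source file. One way to keep the token out of the file is to read both values from environment variables. Below is a minimal sketch of that idea. The function name `catalog_settings` and the variable names `R2_CATALOG_WAREHOUSE` and `R2_CATALOG_TOKEN` are assumptions made for this example and are not defined by Cloudflare or PyIceberg. The URI is built the same way as `CATALOG_URL` in the notebook.

```python
import os

CATALOG_HOST = "https://catalog.cloudflarestorage.com"

# Hypothetical environment variable names chosen for this sketch.
WAREHOUSE_VAR = "R2_CATALOG_WAREHOUSE"
TOKEN_VAR = "R2_CATALOG_TOKEN"


def catalog_settings(env=os.environ):
    """Build the keyword arguments the notebook passes to RestCatalog."""
    missing = [name for name in (WAREHOUSE_VAR, TOKEN_VAR) if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    warehouse = env[WAREHOUSE_VAR]
    return {
        "name": "my_catalog",
        "warehouse": warehouse,
        "uri": f"{CATALOG_HOST}/{warehouse}",
        "token": env[TOKEN_VAR],
    }
```

In the notebook, the connection cell could then call `RestCatalog(**catalog_settings())` instead of using the hard-coded `WAREHOUSE` and `TOKEN` values.
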
## Learn more

<LinkCard
	title="Configuration examples"
	href="/r2/data-catalog/config-examples/"
	description="Find detailed setup instructions for Apache Spark and other common query engines."
/>
