Commit d6c9f60

Adds documentation for R2 Data Catalog
1 parent 7ead684 commit d6c9f60

File tree

15 files changed, +780 -70 lines changed


src/content/docs/r2/api/tokens.mdx

Lines changed: 12 additions & 6 deletions
@@ -45,12 +45,18 @@ Jurisdictional buckets can only be accessed via the corresponding jurisdictional

## Permissions

-| Permission          | Description |
-| ------------------- | ----------- |
-| Admin Read & Write  | Allows the ability to create, list and delete buckets, and edit bucket configurations in addition to list, write, and read object access. |
-| Admin Read only     | Allows the ability to list buckets and view bucket configuration in addition to list and read object access. |
-| Object Read & Write | Allows the ability to read, write, and list objects in specific buckets. |
-| Object Read only    | Allows the ability to read and list objects in specific buckets. |
+| Permission          | Description |
+| ------------------- | ----------- |
+| Admin Read & Write  | Allows the ability to create, list, and delete buckets, edit bucket configuration, read, write, and list objects, and read and write access to data catalog tables and associated metadata. |
+| Admin Read only     | Allows the ability to list buckets and view bucket configuration, read and list objects, and read access to data catalog tables and associated metadata. |
+| Object Read & Write | Allows the ability to read, write, and list objects in specific buckets. |
+| Object Read only    | Allows the ability to read and list objects in specific buckets. |
+
+:::note
+
+Currently, Admin Read & Write or Admin Read only permission is required to interact with and query [R2 Data Catalog](/r2/data-catalog/).
+
+:::

## Create API tokens via API

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
---
pcx_content_type: navigation
title: Configuration examples
head: []
sidebar:
  order: 3
  group:
    hideIndex: true
description: Find detailed setup instructions for Apache Spark and other common query engines.
---

import { DirectoryListing } from "~/components";

Below are configuration examples to connect various Iceberg engines to [R2 Data Catalog](/r2/data-catalog/):

<DirectoryListing />
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
title: PyIceberg
pcx_content_type: example
---

Below is an example of using [PyIceberg](https://py.iceberg.apache.org/) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install the [PyIceberg](https://py.iceberg.apache.org/#installation) and [PyArrow](https://arrow.apache.org/docs/python/install.html) libraries.

## Example usage

```py
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URI = "<CATALOG_URI>"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Create default namespace if it does not already exist
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

# Create simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
test_table = ("default", "my_table")
table = catalog.create_table(
    test_table,
    schema=df.schema,
)
```
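
The example above only creates the table. A natural next step is to write the PyArrow data into it and read it back; the snippet below is a minimal sketch that assumes the `catalog`, `df`, and `table` objects from the example are still in scope.

```py
# Append the example PyArrow data to the newly created Iceberg table
table.append(df)

# Read the table back into PyArrow to confirm the rows were written
print(table.scan().to_arrow())
```
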
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
---
title: Snowflake
pcx_content_type: example
---

Below is an example of using [Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-catalog-integration-rest) to connect and query data from R2 Data Catalog (read-only).

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- A [Snowflake](https://www.snowflake.com/) account with the necessary privileges to create external volumes and catalog integrations.

## Example usage

In your Snowflake [SQL worksheet](https://docs.snowflake.com/en/user-guide/ui-snowsight-worksheets-gs) or [notebook](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks), run the following commands:

```sql
-- Create a database (if you don't already have one) to organize your external data
CREATE DATABASE IF NOT EXISTS r2_example_db;

-- Create an external volume pointing to your R2 bucket
CREATE OR REPLACE EXTERNAL VOLUME ext_vol_r2
  STORAGE_LOCATIONS = (
    (
      NAME = 'my_r2_storage_location'
      STORAGE_PROVIDER = 'S3COMPAT'
      STORAGE_BASE_URL = 's3compat://<bucket-name>'
      CREDENTIALS = (
        AWS_KEY_ID = '<access_key>'
        AWS_SECRET_KEY = '<secret_access_key>'
      )
      STORAGE_ENDPOINT = '<account_id>.r2.cloudflarestorage.com'
    )
  )
  ALLOW_WRITES = FALSE;

-- Create a catalog integration for R2 Data Catalog (read-only)
CREATE OR REPLACE CATALOG INTEGRATION r2_data_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = '<catalog_uri>'
    CATALOG_NAME = '<warehouse_name>'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<token>'
  )
  ENABLED = TRUE;

-- Create an Apache Iceberg table in your selected Snowflake database
CREATE ICEBERG TABLE my_iceberg_table
  CATALOG = 'r2_data_catalog'
  EXTERNAL_VOLUME = 'ext_vol_r2'
  CATALOG_TABLE_NAME = 'my_table'; -- Name of existing table in your R2 data catalog

-- Query your Iceberg table
SELECT * FROM my_iceberg_table;
```
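
If you would rather issue the query from code than from a worksheet, the Snowflake Python connector can run the same SQL. The snippet below is a sketch, not part of the walkthrough above: it assumes `snowflake-connector-python` is installed, that the catalog integration and Iceberg table above already exist, and that the connection parameters are placeholders you replace with your own values.

```py
import snowflake.connector

# Connect to Snowflake (replace the placeholders with your account details)
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<snowflake_virtual_warehouse>",  # Snowflake compute warehouse, not the R2 catalog warehouse
    database="r2_example_db",
)

try:
    cur = conn.cursor()
    # Query the Iceberg table backed by R2 Data Catalog (read-only)
    cur.execute("SELECT * FROM my_iceberg_table")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```
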
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
---
title: Spark
pcx_content_type: example
---

Below is an example of how you can build an [Apache Spark](https://spark.apache.org/) application (with Scala) which connects to R2 Data Catalog. This application is built to run locally, but it can be adapted to run on a cluster.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install Java 17, Spark 3.5.3, and sbt 1.10.11.
  - Note: The specific versions of these tools are critical for getting this example to work.
  - Tip: [SDKMAN](https://sdkman.io/) is a convenient package manager for installing SDKs.

## Example usage

Create a new empty project directory somewhere on your machine. In your project directory, create the following file at `src/main/scala/com/example/R2DataCatalogDemo.scala`.

```scala
package com.example

import org.apache.spark.sql.SparkSession

object R2DataCatalogDemo {
  def main(args: Array[String]): Unit = {

    val uri = sys.env("CATALOG_URI")
    val warehouse = sys.env("WAREHOUSE")
    val token = sys.env("TOKEN")

    val spark = SparkSession.builder()
      .appName("My R2 Data Catalog Demo")
      .master("local[*]")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.mydemo.type", "rest")
      .config("spark.sql.catalog.mydemo.uri", uri)
      .config("spark.sql.catalog.mydemo.warehouse", warehouse)
      .config("spark.sql.catalog.mydemo.token", token)
      .getOrCreate()

    import spark.implicits._

    val data = Seq(
      (1, "Alice", 25),
      (2, "Bob", 30),
      (3, "Charlie", 35),
      (4, "Diana", 40)
    ).toDF("id", "name", "age")

    spark.sql("USE mydemo")

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demoNamespace")

    data.writeTo("demoNamespace.demotable").createOrReplace()

    val readResult = spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30")
    println("Records with age > 30:")
    readResult.show()
  }
}
```

For this demo, we will use [sbt (“simple build tool”)](https://www.scala-sbt.org/) to build the application and manage dependencies. Here is an example `build.sbt` file (stored in your project root) for running this application. It will produce a “fat JAR” which bundles the dependencies into a single JAR.

```scala
name := "R2DataCatalogDemo"

version := "1.0"

val sparkVersion = "3.5.3"
val icebergVersion = "1.8.1"

// You need to use binaries of Spark compiled with either Scala 2.12 or 2.13; 2.12 is more common.
// If you download Spark 3.5.3 with SDKMAN, it comes with Scala 2.12.18.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.iceberg" % "iceberg-core" % icebergVersion,
  "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.12" % icebergVersion,
  "org.apache.iceberg" % "iceberg-aws-bundle" % icebergVersion,
)

// Build a fat JAR with all dependencies
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case "application.conf" => MergeStrategy.concat
  case x if x.endsWith(".properties") => MergeStrategy.first
  case x => MergeStrategy.first
}

// For Java 17 compatibility
Compile / javacOptions ++= Seq("--release", "17")
```

[sbt-assembly](https://github.com/sbt/sbt-assembly?tab=readme-ov-file) is a plugin that builds fat JARs. Create a `project/assembly.sbt` file and add the following:

```scala
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```

Before building, make sure Java, Spark, and sbt are installed (you may need to restart your shell after installing them). With SDKMAN, for example:

```bash
sdk install java 17.0.14-amzn
sdk install spark 3.5.3
sdk install sbt 1.10.11
```

Now you can run sbt to build the fat JAR:

```bash
sbt clean assembly
```

The fat JAR should be available at `target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar`. Now you can run it via `spark-submit`. Here's an example shell script at `submit.sh` to execute it:

```bash
# We need to set these "--add-opens" flags so that Spark can run on Java 17 (it needs access to
# parts of the JVM which have been modularized and made internal).
JAVA_17_COMPATIBILITY="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED"

spark-submit \
  --conf "spark.driver.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --conf "spark.executor.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --class com.example.R2DataCatalogDemo target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar
```

To make the script executable, run:

```bash
chmod +x submit.sh
```

To recap, your project structure should look something like this:

```
.
├── Makefile
├── README.md
├── build.sbt
├── project
│   ├── assembly.sbt
│   ├── build.properties
│   └── project
├── submit.sh
└── src
    └── main
        └── scala
            └── com
                └── example
                    └── R2DataCatalogDemo.scala
```

To run, set environment variables for your catalog's URI, warehouse name, and your Cloudflare API token:

```bash
export CATALOG_URI=
export WAREHOUSE=
export TOKEN=
```

Now you can run your submit script:

```bash
./submit.sh
```
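
If you want to experiment with the same catalog from Python instead of building the Scala JAR, roughly the same session configuration can be expressed with PySpark. The snippet below is a sketch, not part of the walkthrough above: it assumes `pyspark` 3.5.x is installed, that the same `CATALOG_URI`, `WAREHOUSE`, and `TOKEN` environment variables are set, and that pulling the Iceberg runtime through `spark.jars.packages` (using the same versions as `build.sbt`) is acceptable for local prototyping.

```py
import os

from pyspark.sql import SparkSession

# Iceberg runtime coordinates matching the versions used in build.sbt (assumption)
ICEBERG_PACKAGES = (
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1,"
    "org.apache.iceberg:iceberg-aws-bundle:1.8.1"
)

spark = (
    SparkSession.builder
    .appName("R2 Data Catalog PySpark sketch")
    .master("local[*]")
    .config("spark.jars.packages", ICEBERG_PACKAGES)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.mydemo.type", "rest")
    .config("spark.sql.catalog.mydemo.uri", os.environ["CATALOG_URI"])
    .config("spark.sql.catalog.mydemo.warehouse", os.environ["WAREHOUSE"])
    .config("spark.sql.catalog.mydemo.token", os.environ["TOKEN"])
    .getOrCreate()
)

# Reuse the namespace and table created by the Scala demo
spark.sql("USE mydemo")
spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30").show()
```
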
