22 changes: 22 additions & 0 deletions src/content/changelog/r2/2025-04-10-r2-data-catalog-beta.mdx
@@ -0,0 +1,22 @@
---
title: R2 Data Catalog is a managed Apache Iceberg data catalog built directly into R2 buckets
description: A managed Apache Iceberg data catalog built directly into R2 buckets
products:
- r2
date: 2025-04-10T13:00:00Z
hidden: true
---

Today, we're launching [R2 Data Catalog](/r2/data-catalog/) in open beta, a managed Apache Iceberg catalog built directly into your [Cloudflare R2](/r2/) bucket.

If you're not already familiar with it, [Apache Iceberg](https://iceberg.apache.org/) is an open table format designed to handle large-scale analytics datasets stored in object storage, offering ACID transactions and schema evolution. R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect engines like [Spark](/r2/data-catalog/config-examples/spark/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/) to start querying your tables using the tools you already know.

To enable a data catalog on your R2 bucket, find **R2 Data Catalog** in your bucket's settings in the dashboard or run:

```bash
npx wrangler r2 bucket catalog enable my-bucket
```

And that's it. You'll get a catalog URI and warehouse you can plug into your favorite Iceberg engines.
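
For example, with [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/), connecting takes only a few lines. The snippet below is a minimal sketch, assuming you substitute your catalog URI, warehouse, and an [API token](/r2/api/tokens/) with data catalog permissions for the placeholders:

```py
from pyiceberg.catalog.rest import RestCatalog

# Connect to R2 Data Catalog (replace the placeholders with your values)
catalog = RestCatalog(
    name="r2_catalog",
    uri="<CATALOG_URI>",
    warehouse="<WAREHOUSE>",
    token="<TOKEN>",
)

# List namespaces to confirm the connection works
print(catalog.list_namespaces())
```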

Visit our [getting started guide](/r2/data-catalog/get-started/) for step-by-step instructions on enabling R2 Data Catalog, creating tables, and running your first queries.
18 changes: 12 additions & 6 deletions src/content/docs/r2/api/tokens.mdx
@@ -45,12 +45,18 @@ Jurisdictional buckets can only be accessed via the corresponding jurisdictional

## Permissions

| Permission | Description |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| Admin Read & Write | Allows the ability to create, list and delete buckets, and edit bucket configurations in addition to list, write, and read object access. |
| Admin Read only | Allows the ability to list buckets and view bucket configuration in addition to list and read object access. |
| Object Read & Write | Allows the ability to read, write, and list objects in specific buckets. |
| Object Read only | Allows the ability to read and list objects in specific buckets. |
| Permission | Description |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Admin Read & Write  | Allows the ability to create, list, and delete buckets, edit bucket configuration, read, write, and list objects, and read and write data catalog tables and associated metadata. |
| Admin Read only     | Allows the ability to list buckets and view bucket configuration, read and list objects, and read data catalog tables and associated metadata.                                     |
| Object Read & Write | Allows the ability to read, write, and list objects in specific buckets. |
| Object Read only | Allows the ability to read and list objects in specific buckets. |

:::note

Currently, Admin Read & Write or Admin Read only permission is required to interact with and query [R2 Data Catalog](/r2/data-catalog/).

:::

## Create API tokens via API

16 changes: 16 additions & 0 deletions src/content/docs/r2/data-catalog/config-examples/index.mdx
@@ -0,0 +1,16 @@
---
pcx_content_type: navigation
title: Connect to Iceberg engines
head: []
sidebar:
  order: 4
  group:
    hideIndex: true
description: Find detailed setup instructions for Apache Spark and other common query engines.
---

import { DirectoryListing } from "~/components";

Below are configuration examples to connect various Iceberg engines to [R2 Data Catalog](/r2/data-catalog/):

<DirectoryListing />
50 changes: 50 additions & 0 deletions src/content/docs/r2/data-catalog/config-examples/pyiceberg.mdx
@@ -0,0 +1,50 @@
---
title: PyIceberg
pcx_content_type: example
---

Below is an example of using [PyIceberg](https://py.iceberg.apache.org/) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install the [PyIceberg](https://py.iceberg.apache.org/#installation) and [PyArrow](https://arrow.apache.org/docs/python/install.html) libraries.

## Example usage

```py
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URI = "<CATALOG_URI>"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Create the default namespace (skip if it already exists)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

# Create a simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
test_table = ("default", "my_table")
table = catalog.create_table(
    test_table,
    schema=df.schema,
)
```
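
As a quick follow-up, you can append the PyArrow data to the new table and read it back. This is a minimal sketch that continues the script above and assumes the `table` and `df` variables from it are still in scope:

```py
# Append the PyArrow data to the Iceberg table
table.append(df)

# Read the table back into PyArrow and print the rows
print(table.scan().to_arrow())
```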
62 changes: 62 additions & 0 deletions src/content/docs/r2/data-catalog/config-examples/snowflake.mdx
@@ -0,0 +1,62 @@
---
title: Snowflake
pcx_content_type: example
---

Below is an example of using [Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-catalog-integration-rest) to connect to and query data from R2 Data Catalog (read-only).

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- A [Snowflake](https://www.snowflake.com/) account with the necessary privileges to create external volumes and catalog integrations.

## Example usage

In your Snowflake [SQL worksheet](https://docs.snowflake.com/en/user-guide/ui-snowsight-worksheets-gs) or [notebook](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks), run the following commands:

```sql
-- Create a database (if you don't already have one) to organize your external data
CREATE DATABASE IF NOT EXISTS r2_example_db;

-- Create an external volume pointing to your R2 bucket
CREATE OR REPLACE EXTERNAL VOLUME ext_vol_r2
  STORAGE_LOCATIONS = (
    (
      NAME = 'my_r2_storage_location'
      STORAGE_PROVIDER = 'S3COMPAT'
      STORAGE_BASE_URL = 's3compat://<bucket-name>'
      CREDENTIALS = (
        AWS_KEY_ID = '<access_key>'
        AWS_SECRET_KEY = '<secret_access_key>'
      )
      STORAGE_ENDPOINT = '<account_id>.r2.cloudflarestorage.com'
    )
  )
  ALLOW_WRITES = FALSE;

-- Create a catalog integration for R2 Data Catalog (read-only)
CREATE OR REPLACE CATALOG INTEGRATION r2_data_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = '<catalog_uri>'
    CATALOG_NAME = '<warehouse_name>'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<token>'
  )
  ENABLED = TRUE;

-- Create an Apache Iceberg table in your selected Snowflake database
CREATE ICEBERG TABLE my_iceberg_table
  CATALOG = 'r2_data_catalog'
  EXTERNAL_VOLUME = 'ext_vol_r2'
  CATALOG_TABLE_NAME = 'my_table'; -- Name of existing table in your R2 data catalog

-- Query your Iceberg table
SELECT * FROM my_iceberg_table;
```
175 changes: 175 additions & 0 deletions src/content/docs/r2/data-catalog/config-examples/spark.mdx
@@ -0,0 +1,175 @@
---
title: Spark
pcx_content_type: example
---

Below is an example of how you can build an [Apache Spark](https://spark.apache.org/) application (in Scala) that connects to R2 Data Catalog. This application is built to run locally, but it can be adapted to run on a cluster.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install Java 17, Spark 3.5.3, and sbt 1.10.11.
  - Note: The specific versions of these tools are important for this example to work.
  - Tip: [SDKMAN](https://sdkman.io/) is a convenient package manager for installing SDKs.

## Example usage

To start, create a new empty project directory somewhere on your machine. Inside that directory, create the following file at `src/main/scala/com/example/R2DataCatalogDemo.scala`. This will serve as the main entry point for your Spark application.

```scala
package com.example

import org.apache.spark.sql.SparkSession

object R2DataCatalogDemo {
  def main(args: Array[String]): Unit = {

    val uri = sys.env("CATALOG_URI")
    val warehouse = sys.env("WAREHOUSE")
    val token = sys.env("TOKEN")

    val spark = SparkSession.builder()
      .appName("My R2 Data Catalog Demo")
      .master("local[*]")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.mydemo.type", "rest")
      .config("spark.sql.catalog.mydemo.uri", uri)
      .config("spark.sql.catalog.mydemo.warehouse", warehouse)
      .config("spark.sql.catalog.mydemo.token", token)
      .getOrCreate()

    import spark.implicits._

    val data = Seq(
      (1, "Alice", 25),
      (2, "Bob", 30),
      (3, "Charlie", 35),
      (4, "Diana", 40)
    ).toDF("id", "name", "age")

    spark.sql("USE mydemo")

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demoNamespace")

    data.writeTo("demoNamespace.demotable").createOrReplace()

    val readResult = spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30")
    println("Records with age > 30:")
    readResult.show()
  }
}
```

For building this application and managing dependencies, we'll use [sbt (“simple build tool”)](https://www.scala-sbt.org/). The following is an example `build.sbt` file to place at the root of your project. It is configured to produce a "fat JAR", bundling all required dependencies.

```scala
name := "R2DataCatalogDemo"

version := "1.0"

val sparkVersion = "3.5.3"
val icebergVersion = "1.8.1"

// Use a Spark binary built for Scala 2.12 or 2.13; 2.12 is the more common choice.
// If you install Spark 3.5.3 with SDKMAN, it ships with Scala 2.12.18.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.iceberg" % "iceberg-core" % icebergVersion,
  "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.12" % icebergVersion,
  "org.apache.iceberg" % "iceberg-aws-bundle" % icebergVersion,
)

// Build a fat JAR with all dependencies
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case "application.conf" => MergeStrategy.concat
  case x if x.endsWith(".properties") => MergeStrategy.first
  case x => MergeStrategy.first
}

// For Java 17 compatibility
Compile / javacOptions ++= Seq("--release", "17")
```

To enable the [sbt-assembly plugin](https://github.com/sbt/sbt-assembly?tab=readme-ov-file) (used to build fat JARs), add the following to a new file at `project/assembly.sbt`:

```
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```

Make sure Java, Spark, and sbt are installed and available in your shell. If you're using SDKMAN, you can install them as shown below:

```bash
sdk install java 17.0.14-amzn
sdk install spark 3.5.3
sdk install sbt 1.10.11
```

With everything installed, you can now build the project using sbt. This will generate a single bundled JAR file.

```bash
sbt clean assembly
```

After building, the output JAR should be located at `target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar`.

To run the application, you'll use `spark-submit`. Below is an example shell script (`submit.sh`) that includes the necessary Java compatibility flags for running Spark on Java 17:

```bash
# We need to set these "--add-opens" flags so that Spark can run on Java 17 (it needs access to
# parts of the JVM that have been modularized and made internal).
JAVA_17_COMPATIBILITY="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED"

spark-submit \
  --conf "spark.driver.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --conf "spark.executor.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --class com.example.R2DataCatalogDemo target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar
```

Before running it, make sure the script is executable:

```bash
chmod +x submit.sh
```

At this point, your project directory should be structured like this:

```
.
├── Makefile
├── README.md
├── build.sbt
├── project
│   ├── assembly.sbt
│   ├── build.properties
│   └── project
├── src
│   └── main
│       └── scala
│           └── com
│               └── example
│                   └── R2DataCatalogDemo.scala
└── submit.sh
```

Before submitting the job, make sure you have the required environment variables set for your catalog URI, warehouse, and [Cloudflare API token](/r2/api/tokens/).

```bash
export CATALOG_URI=
export WAREHOUSE=
export TOKEN=
```

You're now ready to run the job:

```bash
./submit.sh
```
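
If you'd like to confirm the write from outside Spark, a small PyIceberg script can load and scan the same table. This is an optional, minimal sketch that reuses the `CATALOG_URI`, `WAREHOUSE`, and `TOKEN` environment variables exported above:

```py
import os

from pyiceberg.catalog.rest import RestCatalog

# Connect to the same catalog the Spark job wrote to
catalog = RestCatalog(
    name="verify",
    uri=os.environ["CATALOG_URI"],
    warehouse=os.environ["WAREHOUSE"],
    token=os.environ["TOKEN"],
)

# Load the table created by the Spark job and print its contents
table = catalog.load_table(("demoNamespace", "demotable"))
print(table.scan().to_arrow())
```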