Adds documentation for data catalog #21422
Merged

Changes from 3 commits
Commits (9):

- ffb61c4 Adds documentation for R2 Data Catalog (jonesphillip)
- 5eab4a3 Added managing catalogs documentation and R2 Data Catalog as a product. (jonesphillip)
- 4826f79 Add changelog entry (jonesphillip)
- fdc400b PCX review (Oxyjun)
- 9b8cc06 Fix PR comments/typos. (jonesphillip)
- 492e01c Added PySpark example configuration. (jonesphillip)
- e9c21fd Update src/content/docs/r2/data-catalog/config-examples/spark-scala.mdx (Oxyjun)
- 765fad6 Added more context for data catalog auth (jonesphillip)
- 7fa96e3 Add access policy example for r2 data catalog API tokens (jonesphillip)

**src/content/changelog/r2/2025-04-10-r2-data-catalog-beta.mdx** (22 additions, 0 deletions)

---
title: R2 Data Catalog is a managed Apache Iceberg data catalog built directly into R2 buckets
description: A managed Apache Iceberg data catalog built directly into R2 buckets
products:
  - r2
date: 2025-04-10T13:00:00Z
hidden: true
---

Today, we're launching [R2 Data Catalog](/r2/data-catalog/) in open beta, a managed Apache Iceberg catalog built directly into your [Cloudflare R2](/r2/) bucket.

If you're not already familiar with it, [Apache Iceberg](https://iceberg.apache.org/) is an open table format designed to handle large-scale analytics datasets stored in object storage, offering ACID transactions and schema evolution. R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect engines like [Spark](/r2/data-catalog/config-examples/spark/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/) to start querying your tables using the tools you already know.

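Because the interface follows the standard Iceberg REST catalog specification, you can sanity-check connectivity with a single HTTP call. The sketch below probes the spec's `/v1/config` endpoint; it is an illustration added here (not part of this announcement), and the placeholder values and the choice of Python's `requests` are assumptions:

```py
# Hypothetical connectivity check against the standard Iceberg REST
# catalog /v1/config endpoint. CATALOG_URI, WAREHOUSE, and TOKEN are
# placeholders for the values you get after enabling the catalog.
import requests

CATALOG_URI = "<CATALOG_URI>"
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"  # an R2 API token with data catalog permissions

resp = requests.get(
    f"{CATALOG_URI}/v1/config",
    params={"warehouse": WAREHOUSE},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
# Prints the catalog defaults/overrides defined by the Iceberg REST spec
print(resp.json())
```
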
To enable a data catalog on your R2 bucket, find **R2 Data Catalog** in your bucket's settings in the dashboard, or run:

```bash
npx wrangler r2 bucket catalog enable my-bucket
```

And that's it. You'll get a catalog URI and warehouse you can plug into your favorite Iceberg engines.

Visit our [getting started guide](/r2/data-catalog/get-started/) for step-by-step instructions on enabling R2 Data Catalog, creating tables, and running your first queries.

**src/content/docs/r2/data-catalog/config-examples/index.mdx** (16 additions, 0 deletions)

---
pcx_content_type: navigation
title: Connect to Iceberg engines
head: []
sidebar:
  order: 4
  group:
    hideIndex: true
description: Find detailed setup instructions for Apache Spark and other common query engines.
---

import { DirectoryListing } from "~/components";

Below are configuration examples to connect various Iceberg engines to [R2 Data Catalog](/r2/data-catalog/):

<DirectoryListing />

**src/content/docs/r2/data-catalog/config-examples/pyiceberg.mdx** (50 additions, 0 deletions)

---
title: PyIceberg
pcx_content_type: example
---

Below is an example of using [PyIceberg](https://py.iceberg.apache.org/) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install the [PyIceberg](https://py.iceberg.apache.org/#installation) and [PyArrow](https://arrow.apache.org/docs/python/install.html) libraries.

## Example usage

```py
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URI = "<CATALOG_URI>"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Create default namespace (skip if it already exists)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

# Create simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
test_table = ("default", "my_table")
table = catalog.create_table(
    test_table,
    schema=df.schema,
)
```

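The example above creates the table but writes no rows. As a quick follow-up sketch (not part of this PR's diff; it uses PyIceberg's `Table.append` and table-scan APIs on the `table` and `df` from the example), you could append the PyArrow data and read it back:

```py
# Follow-up sketch: append the PyArrow rows created above and read them
# back. `table` and `df` come from the previous example.
table.append(df)

# Scan the Iceberg table back into a PyArrow table and print it
print(table.scan().to_arrow())
```
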
**src/content/docs/r2/data-catalog/config-examples/snowflake.mdx** (62 additions, 0 deletions)

---
title: Snowflake
pcx_content_type: example
---

Below is an example of using [Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-catalog-integration-rest) to connect to and query data from R2 Data Catalog (read-only).

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Have a [Snowflake](https://www.snowflake.com/) account with the privileges needed to create external volumes and catalog integrations.

## Example usage

In your Snowflake [SQL worksheet](https://docs.snowflake.com/en/user-guide/ui-snowsight-worksheets-gs) or [notebook](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks), run the following commands:

```sql
-- Create a database (if you don't already have one) to organize your external data
CREATE DATABASE IF NOT EXISTS r2_example_db;

-- Create an external volume pointing to your R2 bucket
CREATE OR REPLACE EXTERNAL VOLUME ext_vol_r2
  STORAGE_LOCATIONS = (
    (
      NAME = 'my_r2_storage_location'
      STORAGE_PROVIDER = 'S3COMPAT'
      STORAGE_BASE_URL = 's3compat://<bucket-name>'
      CREDENTIALS = (
        AWS_KEY_ID = '<access_key>'
        AWS_SECRET_KEY = '<secret_access_key>'
      )
      STORAGE_ENDPOINT = '<account_id>.r2.cloudflarestorage.com'
    )
  )
  ALLOW_WRITES = FALSE;

-- Create a catalog integration for R2 Data Catalog (read-only)
CREATE OR REPLACE CATALOG INTEGRATION r2_data_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = '<catalog_uri>'
    CATALOG_NAME = '<warehouse_name>'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<token>'
  )
  ENABLED = TRUE;

-- Create an Apache Iceberg table in your Snowflake database, backed by the
-- existing table in your R2 data catalog
CREATE ICEBERG TABLE my_iceberg_table
  CATALOG = 'r2_data_catalog'
  EXTERNAL_VOLUME = 'ext_vol_r2'
  CATALOG_TABLE_NAME = 'my_table'; -- Name of existing table in your R2 data catalog

-- Query your Iceberg table
SELECT * FROM my_iceberg_table;
```

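If you'd rather drive that final query from code, here is a minimal sketch using the `snowflake-connector-python` package; the package choice and all connection parameters are assumptions for illustration, not part of this PR:

```py
# Hypothetical sketch: run the read query from Python via the Snowflake
# connector. All connection values below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    database="r2_example_db",
    schema="<schema>",
    warehouse="<snowflake_compute_warehouse>",
)

try:
    cur = conn.cursor()
    cur.execute("SELECT * FROM my_iceberg_table")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```
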
**src/content/docs/r2/data-catalog/config-examples/spark.mdx** (175 additions, 0 deletions)

---
title: Spark
pcx_content_type: example
---

Below is an example of how you can build an [Apache Spark](https://spark.apache.org/) application (in Scala) that connects to R2 Data Catalog. The application is built to run locally, but it can be adapted to run on a cluster.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install Java 17, Spark 3.5.3, and sbt 1.10.11.
  - Note: the specific versions of these tools are critical to getting this example to work.
  - Tip: [SDKMAN](https://sdkman.io/) is a convenient package manager for installing SDKs.

## Example usage

To start, create a new empty project directory somewhere on your machine. Inside that directory, create the following file at `src/main/scala/com/example/R2DataCatalogDemo.scala`. It will serve as the main entry point for your Spark application.

```scala
package com.example

import org.apache.spark.sql.SparkSession

object R2DataCatalogDemo {
  def main(args: Array[String]): Unit = {

    val uri = sys.env("CATALOG_URI")
    val warehouse = sys.env("WAREHOUSE")
    val token = sys.env("TOKEN")

    val spark = SparkSession.builder()
      .appName("My R2 Data Catalog Demo")
      .master("local[*]")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.mydemo.type", "rest")
      .config("spark.sql.catalog.mydemo.uri", uri)
      .config("spark.sql.catalog.mydemo.warehouse", warehouse)
      .config("spark.sql.catalog.mydemo.token", token)
      .getOrCreate()

    import spark.implicits._

    val data = Seq(
      (1, "Alice", 25),
      (2, "Bob", 30),
      (3, "Charlie", 35),
      (4, "Diana", 40)
    ).toDF("id", "name", "age")

    spark.sql("USE mydemo")

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demoNamespace")

    data.writeTo("demoNamespace.demotable").createOrReplace()

    val readResult = spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30")
    println("Records with age > 30:")
    readResult.show()
  }
}
```

For building this application and managing dependencies, we'll use [sbt ("simple build tool")](https://www.scala-sbt.org/). The following is an example `build.sbt` file to place at the root of your project. It is configured to produce a "fat JAR", bundling all required dependencies.

```scala
name := "R2DataCatalogDemo"

version := "1.0"

val sparkVersion = "3.5.3"
val icebergVersion = "1.8.1"

// You need Spark binaries compiled with either Scala 2.12 or 2.13; 2.12 is more common.
// If you download Spark 3.5.3 with SDKMAN, it comes with Scala 2.12.18.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.iceberg" % "iceberg-core" % icebergVersion,
  "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.12" % icebergVersion,
  "org.apache.iceberg" % "iceberg-aws-bundle" % icebergVersion,
)

// Build a fat JAR with all dependencies bundled
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case "application.conf" => MergeStrategy.concat
  case x if x.endsWith(".properties") => MergeStrategy.first
  case x => MergeStrategy.first
}

// For Java 17 compatibility
Compile / javacOptions ++= Seq("--release", "17")
```

To enable the [sbt-assembly plugin](https://github.com/sbt/sbt-assembly) (used to build fat JARs), add the following to a new file at `project/assembly.sbt`:

```scala
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```

Make sure Java, Spark, and sbt are installed and available in your shell. If you're using SDKMAN, you can install them as shown below:

```bash
sdk install java 17.0.14-amzn
sdk install spark 3.5.3
sdk install sbt 1.10.11
```

With everything installed, you can now build the project using sbt. This will generate a single bundled JAR file.

```bash
sbt clean assembly
```

After building, the output JAR should be located at `target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar`.

To run the application, you'll use `spark-submit`. Below is an example shell script (`submit.sh`) that includes the necessary Java compatibility flags for Spark on Java 17:

```bash
# We need to set these "--add-opens" flags so that Spark can run on Java 17
# (it needs access to parts of the JVM which have been modularized and made internal).
JAVA_17_COMPATIBILITY="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED"

spark-submit \
  --conf "spark.driver.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --conf "spark.executor.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --class com.example.R2DataCatalogDemo target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar
```

Before running it, make sure the script is executable:

```bash
chmod +x submit.sh
```

At this point, your project directory should be structured like this:

```
.
├── build.sbt
├── project
│   ├── assembly.sbt
│   ├── build.properties
│   └── project
├── submit.sh
└── src
    └── main
        └── scala
            └── com
                └── example
                    └── R2DataCatalogDemo.scala
```

(`project/build.properties` and `project/project` are generated by sbt when you build.)

Before submitting the job, make sure you have the required environment variables set for your catalog URI, warehouse, and [Cloudflare API token](/r2/api/tokens/):

```bash
export CATALOG_URI=
export WAREHOUSE=
export TOKEN=
```

You're now ready to run the job:

```bash
./submit.sh
```

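A later commit in this PR (492e01c) adds a PySpark example as well; that file isn't included in this three-commit view. As a rough sketch of what the equivalent PySpark configuration might look like, mirroring the Scala settings above (the catalog name, package coordinates, and environment variables are our assumptions, not the file from the PR):

```py
# Hypothetical PySpark equivalent of the Scala configuration above.
# Reads the same CATALOG_URI / WAREHOUSE / TOKEN environment variables.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("R2 Data Catalog PySpark Demo")
    .master("local[*]")
    # Pull in the Iceberg Spark runtime; the version must match your Spark/Scala build.
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.mydemo.type", "rest")
    .config("spark.sql.catalog.mydemo.uri", os.environ["CATALOG_URI"])
    .config("spark.sql.catalog.mydemo.warehouse", os.environ["WAREHOUSE"])
    .config("spark.sql.catalog.mydemo.token", os.environ["TOKEN"])
    .getOrCreate()
)

spark.sql("USE mydemo")
spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30").show()
```
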