Adds documentation for data catalog #21422
Merged

Changes from 3 commits
Commits (9):

- ffb61c4 Adds documentation for R2 Data Catalog (jonesphillip)
- 5eab4a3 Added managing catalogs documentation and R2 Data Catalog as a product. (jonesphillip)
- 4826f79 Add changelog entry (jonesphillip)
- fdc400b PCX review (Oxyjun)
- 9b8cc06 Fix PR comments/typos. (jonesphillip)
- 492e01c Added PySpark example configuration. (jonesphillip)
- e9c21fd Update src/content/docs/r2/data-catalog/config-examples/spark-scala.mdx (Oxyjun)
- 765fad6 Added more context for data catalog auth (jonesphillip)
- 7fa96e3 Add access policy example for r2 data catalog API tokens (jonesphillip)

**src/content/changelog/r2/2025-04-10-r2-data-catalog-beta.mdx** (22 additions, 0 deletions)

---
title: R2 Data Catalog is a managed Apache Iceberg data catalog built directly into R2 buckets
description: A managed Apache Iceberg data catalog built directly into R2 buckets
products:
  - r2
date: 2025-04-10T13:00:00Z
hidden: true
---

Today, we're launching [R2 Data Catalog](/r2/data-catalog/) in open beta, a managed Apache Iceberg catalog built directly into your [Cloudflare R2](/r2/) bucket.

If you're not already familiar with it, [Apache Iceberg](https://iceberg.apache.org/) is an open table format designed to handle large-scale analytics datasets stored in object storage, offering ACID transactions and schema evolution. R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect engines like [Spark](/r2/data-catalog/config-examples/spark/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/) to start querying your tables using the tools you already know.

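Because the interface follows the standard Iceberg REST catalog specification, you can sanity-check connectivity with a single HTTP call. The sketch below probes the spec's `/v1/config` endpoint; it is an illustration added here (not part of this announcement), and the placeholder values and the choice of Python's `requests` are assumptions:

```py
# Hypothetical connectivity check against the standard Iceberg REST
# catalog /v1/config endpoint. CATALOG_URI, WAREHOUSE, and TOKEN are
# placeholders for the values you get after enabling the catalog.
import requests

CATALOG_URI = "<CATALOG_URI>"
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"  # an R2 API token with data catalog permissions

resp = requests.get(
    f"{CATALOG_URI}/v1/config",
    params={"warehouse": WAREHOUSE},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
# Prints the catalog defaults/overrides defined by the Iceberg REST spec
print(resp.json())
```
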
To enable a data catalog on your R2 bucket, find **R2 Data Catalog** in your bucket's settings in the dashboard, or run:

```bash
npx wrangler r2 bucket catalog enable my-bucket
```

And that's it. You'll get a catalog URI and warehouse you can plug into your favorite Iceberg engines.

Visit our [getting started guide](/r2/data-catalog/get-started/) for step-by-step instructions on enabling R2 Data Catalog, creating tables, and running your first queries.

**src/content/docs/r2/data-catalog/config-examples/index.mdx** (16 additions, 0 deletions)

---
pcx_content_type: navigation
title: Connect to Iceberg engines
head: []
sidebar:
  order: 4
  group:
    hideIndex: true
description: Find detailed setup instructions for Apache Spark and other common query engines.
---

import { DirectoryListing } from "~/components";

Below are configuration examples to connect various Iceberg engines to [R2 Data Catalog](/r2/data-catalog/):

<DirectoryListing />

**src/content/docs/r2/data-catalog/config-examples/pyiceberg.mdx** (50 additions, 0 deletions)

---
title: PyIceberg
pcx_content_type: example
---

Below is an example of using [PyIceberg](https://py.iceberg.apache.org/) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install the [PyIceberg](https://py.iceberg.apache.org/#installation) and [PyArrow](https://arrow.apache.org/docs/python/install.html) libraries.

## Example usage

```py
import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.exceptions import NamespaceAlreadyExistsError

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URI = "<CATALOG_URI>"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Create default namespace (skip if it already exists)
try:
    catalog.create_namespace("default")
except NamespaceAlreadyExistsError:
    pass

# Create simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
test_table = ("default", "my_table")
table = catalog.create_table(
    test_table,
    schema=df.schema,
)
```

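The example above creates the table but writes no rows. As a quick follow-up sketch (not part of this PR's diff; it uses PyIceberg's `Table.append` and table-scan APIs on the `table` and `df` from the example), you could append the PyArrow data and read it back:

```py
# Follow-up sketch: append the PyArrow rows created above and read them
# back. `table` and `df` come from the previous example.
table.append(df)

# Scan the Iceberg table back into a PyArrow table and print it
print(table.scan().to_arrow())
```
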
**src/content/docs/r2/data-catalog/config-examples/snowflake.mdx** (62 additions, 0 deletions)

---
title: Snowflake
pcx_content_type: example
---

Below is an example of using [Snowflake](https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-catalog-integration-rest) to connect to and query data from R2 Data Catalog (read-only).

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Have a [Snowflake](https://www.snowflake.com/) account with the privileges needed to create external volumes and catalog integrations.

## Example usage

In your Snowflake [SQL worksheet](https://docs.snowflake.com/en/user-guide/ui-snowsight-worksheets-gs) or [notebook](https://docs.snowflake.com/en/user-guide/ui-snowsight/notebooks), run the following commands:

```sql
-- Create a database (if you don't already have one) to organize your external data
CREATE DATABASE IF NOT EXISTS r2_example_db;

-- Create an external volume pointing to your R2 bucket
CREATE OR REPLACE EXTERNAL VOLUME ext_vol_r2
  STORAGE_LOCATIONS = (
    (
      NAME = 'my_r2_storage_location'
      STORAGE_PROVIDER = 'S3COMPAT'
      STORAGE_BASE_URL = 's3compat://<bucket-name>'
      CREDENTIALS = (
        AWS_KEY_ID = '<access_key>'
        AWS_SECRET_KEY = '<secret_access_key>'
      )
      STORAGE_ENDPOINT = '<account_id>.r2.cloudflarestorage.com'
    )
  )
  ALLOW_WRITES = FALSE;

-- Create a catalog integration for R2 Data Catalog (read-only)
CREATE OR REPLACE CATALOG INTEGRATION r2_data_catalog
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = '<catalog_uri>'
    CATALOG_NAME = '<warehouse_name>'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<token>'
  )
  ENABLED = TRUE;

-- Create an Apache Iceberg table in your Snowflake database, backed by the
-- existing table in your R2 data catalog
CREATE ICEBERG TABLE my_iceberg_table
  CATALOG = 'r2_data_catalog'
  EXTERNAL_VOLUME = 'ext_vol_r2'
  CATALOG_TABLE_NAME = 'my_table'; -- Name of existing table in your R2 data catalog

-- Query your Iceberg table
SELECT * FROM my_iceberg_table;
```

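If you'd rather drive that final query from code, here is a minimal sketch using the `snowflake-connector-python` package; the package choice and all connection parameters are assumptions for illustration, not part of this PR:

```py
# Hypothetical sketch: run the read query from Python via the Snowflake
# connector. All connection values below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    database="r2_example_db",
    schema="<schema>",
    warehouse="<snowflake_compute_warehouse>",
)

try:
    cur = conn.cursor()
    cur.execute("SELECT * FROM my_iceberg_table")
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```
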
**src/content/docs/r2/data-catalog/config-examples/spark.mdx** (175 additions, 0 deletions)

---
title: Spark
pcx_content_type: example
---

Below is an example of how you can build an [Apache Spark](https://spark.apache.org/) application (in Scala) that connects to R2 Data Catalog. The application is built to run locally, but it can be adapted to run on a cluster.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- Create an [R2 bucket](/r2/buckets/) and enable the data catalog.
- Create an [R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install Java 17, Spark 3.5.3, and sbt 1.10.11.
  - Note: the specific versions of these tools are critical to getting this example to work.
  - Tip: [SDKMAN](https://sdkman.io/) is a convenient package manager for installing SDKs.

## Example usage

To start, create a new empty project directory somewhere on your machine. Inside that directory, create the following file at `src/main/scala/com/example/R2DataCatalogDemo.scala`. It will serve as the main entry point for your Spark application.

```scala
package com.example

import org.apache.spark.sql.SparkSession

object R2DataCatalogDemo {
  def main(args: Array[String]): Unit = {

    val uri = sys.env("CATALOG_URI")
    val warehouse = sys.env("WAREHOUSE")
    val token = sys.env("TOKEN")

    val spark = SparkSession.builder()
      .appName("My R2 Data Catalog Demo")
      .master("local[*]")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.mydemo.type", "rest")
      .config("spark.sql.catalog.mydemo.uri", uri)
      .config("spark.sql.catalog.mydemo.warehouse", warehouse)
      .config("spark.sql.catalog.mydemo.token", token)
      .getOrCreate()

    import spark.implicits._

    val data = Seq(
      (1, "Alice", 25),
      (2, "Bob", 30),
      (3, "Charlie", 35),
      (4, "Diana", 40)
    ).toDF("id", "name", "age")

    spark.sql("USE mydemo")

    spark.sql("CREATE NAMESPACE IF NOT EXISTS demoNamespace")

    data.writeTo("demoNamespace.demotable").createOrReplace()

    val readResult = spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30")
    println("Records with age > 30:")
    readResult.show()
  }
}
```

For building this application and managing dependencies, we'll use [sbt ("simple build tool")](https://www.scala-sbt.org/). The following is an example `build.sbt` file to place at the root of your project. It is configured to produce a "fat JAR", bundling all required dependencies.

```scala
name := "R2DataCatalogDemo"

version := "1.0"

val sparkVersion = "3.5.3"
val icebergVersion = "1.8.1"

// You need Spark binaries compiled with either Scala 2.12 or 2.13; 2.12 is more common.
// If you download Spark 3.5.3 with SDKMAN, it comes with Scala 2.12.18.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.iceberg" % "iceberg-core" % icebergVersion,
  "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.12" % icebergVersion,
  "org.apache.iceberg" % "iceberg-aws-bundle" % icebergVersion,
)

// Build a fat JAR with all dependencies bundled
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "reference.conf" => MergeStrategy.concat
  case "application.conf" => MergeStrategy.concat
  case x if x.endsWith(".properties") => MergeStrategy.first
  case x => MergeStrategy.first
}

// For Java 17 compatibility
Compile / javacOptions ++= Seq("--release", "17")
```

To enable the [sbt-assembly plugin](https://github.com/sbt/sbt-assembly) (used to build fat JARs), add the following to a new file at `project/assembly.sbt`:

```scala
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```

Make sure Java, Spark, and sbt are installed and available in your shell. If you're using SDKMAN, you can install them as shown below:

```bash
sdk install java 17.0.14-amzn
sdk install spark 3.5.3
sdk install sbt 1.10.11
```

With everything installed, you can now build the project using sbt. This will generate a single bundled JAR file.

```bash
sbt clean assembly
```

After building, the output JAR should be located at `target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar`.

To run the application, you'll use `spark-submit`. Below is an example shell script (`submit.sh`) that includes the necessary Java compatibility flags for Spark on Java 17:

```bash
# We need to set these "--add-opens" flags so that Spark can run on Java 17
# (it needs access to parts of the JVM which have been modularized and made internal).
JAVA_17_COMPATIBILITY="--add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED"

spark-submit \
  --conf "spark.driver.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --conf "spark.executor.extraJavaOptions=$JAVA_17_COMPATIBILITY" \
  --class com.example.R2DataCatalogDemo target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar
```

Before running it, make sure the script is executable:

```bash
chmod +x submit.sh
```

At this point, your project directory should be structured like this:

```
.
├── build.sbt
├── project
│   ├── assembly.sbt
│   ├── build.properties
│   └── project
├── submit.sh
└── src
    └── main
        └── scala
            └── com
                └── example
                    └── R2DataCatalogDemo.scala
```

(`project/build.properties` and `project/project` are generated by sbt when you build.)

Before submitting the job, make sure you have the required environment variables set for your catalog URI, warehouse, and [Cloudflare API token](/r2/api/tokens/):

```bash
export CATALOG_URI=
export WAREHOUSE=
export TOKEN=
```

You're now ready to run the job:

```bash
./submit.sh
```

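A later commit in this PR (492e01c) adds a PySpark example as well; that file isn't included in this three-commit view. As a rough sketch of what the equivalent PySpark configuration might look like, mirroring the Scala settings above (the catalog name, package coordinates, and environment variables are our assumptions, not the file from the PR):

```py
# Hypothetical PySpark equivalent of the Scala configuration above.
# Reads the same CATALOG_URI / WAREHOUSE / TOKEN environment variables.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("R2 Data Catalog PySpark Demo")
    .master("local[*]")
    # Pull in the Iceberg Spark runtime; the version must match your Spark/Scala build.
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.mydemo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.mydemo.type", "rest")
    .config("spark.sql.catalog.mydemo.uri", os.environ["CATALOG_URI"])
    .config("spark.sql.catalog.mydemo.warehouse", os.environ["WAREHOUSE"])
    .config("spark.sql.catalog.mydemo.token", os.environ["TOKEN"])
    .getOrCreate()
)

spark.sql("USE mydemo")
spark.sql("SELECT * FROM demoNamespace.demotable WHERE age > 30").show()
```
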