---
title: Spark (PySpark)
pcx_content_type: example
---

Below is an example of using [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket).
- [Create an R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install the [PySpark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) library (for example, with `pip`, as shown below).
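
PySpark is distributed on PyPI. The example below pins `iceberg-spark-runtime-3.5_2.12`, an artifact built for Spark 3.5 and Scala 2.12, so install a matching 3.5.x release of PySpark (adjust both versions together if you use a different Spark release):

```sh
pip install "pyspark==3.5.*"
```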

## Example usage

```py
from pyspark.sql import SparkSession

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URI = "<CATALOG_URI>"

# Build Spark session with Iceberg configurations
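# "vended-credentials" asks the catalog to hand out temporary credentials
# for the underlying R2 storage; with credentials vended this way, remote
# request signing is not needed and is turned off below.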
spark = SparkSession.builder \
    .appName("R2DataCatalogExample") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "rest") \
    .config("spark.sql.catalog.my_catalog.uri", CATALOG_URI) \
    .config("spark.sql.catalog.my_catalog.warehouse", WAREHOUSE) \
    .config("spark.sql.catalog.my_catalog.token", TOKEN) \
    .config("spark.sql.catalog.my_catalog.header.X-Iceberg-Access-Delegation", "vended-credentials") \
    .config("spark.sql.catalog.my_catalog.s3.remote-signing-enabled", "false") \
    .config("spark.sql.defaultCatalog", "my_catalog") \
    .getOrCreate()
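
# Make the new catalog the active one for SQL statements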
spark.sql("USE my_catalog")

# Create namespace if it does not exist
spark.sql("CREATE NAMESPACE IF NOT EXISTS default")

# Create a table in the namespace using Iceberg
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.my_table (
        id BIGINT,
        name STRING
    )
    USING iceberg
""")

# Create a simple DataFrame
df = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Charlie")],
    ["id", "name"]
)

# Write the DataFrame to the Iceberg table
df.write \
    .format("iceberg") \
    .mode("append") \
    .save("default.my_table")

# Read the data back from the Iceberg table
result_df = spark.read \
    .format("iceberg") \
    .load("default.my_table")

result_df.show()
```
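
Because `my_catalog` is configured as the session's default catalog, the same table can also be queried with plain SQL. For example, as a quick sanity check (the `ORDER BY` keeps the row order deterministic):

```py
spark.sql("SELECT * FROM default.my_table ORDER BY id").show()
```

This should print output like:

```
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
+---+-------+
```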