
Commit 492e01c

Added PySpark example configuration.
1 parent 9b8cc06 commit 492e01c

File tree

4 files changed (+75, -4 lines)


src/content/changelog/r2/2025-04-10-r2-data-catalog-beta.mdx

Lines changed: 2 additions & 2 deletions

````diff
@@ -9,9 +9,9 @@ hidden: true
 
 Today, we're launching [R2 Data Catalog](/r2/data-catalog/) in open beta, a managed Apache Iceberg catalog built directly into your [Cloudflare R2](/r2/) bucket.
 
-If you're not already familiar with it, [Apache Iceberg](https://iceberg.apache.org/) is an open table format designed to handle large-scale analytics datasets stored in object storage, offering ACID transactions and schema evolution. R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect engines like [Spark](/r2/data-catalog/config-examples/spark/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/) to start querying your tables using the tools you already know.
+If you're not already familiar with it, [Apache Iceberg](https://iceberg.apache.org/) is an open table format designed to handle large-scale analytics datasets stored in object storage, offering ACID transactions and schema evolution. R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect engines like [Spark](/r2/data-catalog/config-examples/spark-scala/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/) to start querying your tables using the tools you already know.
 
-To enable a data catalog on your R2 bucket, find **R2 Data Catalog** in your buckets settings in the dashboard or run:
+To enable a data catalog on your R2 bucket, find **R2 Data Catalog** in your buckets settings in the dashboard, or run:
 
 ```bash
 npx wrangler r2 bucket catalog enable my-bucket
````

Lines changed: 71 additions & 0 deletions

````mdx
---
title: Spark (PySpark)
pcx_content_type: example
---

Below is an example of using [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) to connect to R2 Data Catalog.

## Prerequisites

- Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).
- [Create an R2 bucket](/r2/buckets/create-buckets/) and [enable the data catalog](/r2/data-catalog/manage-catalogs/#enable-r2-data-catalog-on-a-bucket).
- [Create an R2 API token](/r2/api/tokens/) with both [R2 and data catalog permissions](/r2/api/tokens/#permissions).
- Install the [PySpark](https://spark.apache.org/docs/latest/api/python/getting_started/install.html) library.

## Example usage

```py
from pyspark.sql import SparkSession

# Define catalog connection details (replace variables)
WAREHOUSE = "<WAREHOUSE>"
TOKEN = "<TOKEN>"
CATALOG_URI = "<CATALOG_URI>"

# Build Spark session with Iceberg configurations
spark = SparkSession.builder \
    .appName("R2DataCatalogExample") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "rest") \
    .config("spark.sql.catalog.my_catalog.uri", CATALOG_URI) \
    .config("spark.sql.catalog.my_catalog.warehouse", WAREHOUSE) \
    .config("spark.sql.catalog.my_catalog.token", TOKEN) \
    .config("spark.sql.catalog.my_catalog.header.X-Iceberg-Access-Delegation", "vended-credentials") \
    .config("spark.sql.catalog.my_catalog.s3.remote-signing-enabled", "false") \
    .config("spark.sql.defaultCatalog", "my_catalog") \
    .getOrCreate()
spark.sql("USE my_catalog")

# Create namespace if it does not exist
spark.sql("CREATE NAMESPACE IF NOT EXISTS default")

# Create a table in the namespace using Iceberg
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.my_table (
        id BIGINT,
        name STRING
    )
    USING iceberg
""")

# Create a simple DataFrame
df = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Charlie")],
    ["id", "name"]
)

# Write the DataFrame to the Iceberg table
df.write \
    .format("iceberg") \
    .mode("append") \
    .save("default.my_table")

# Read the data back from the Iceberg table
result_df = spark.read \
    .format("iceberg") \
    .load("default.my_table")

result_df.show()
```
````
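The long chain of `.config(...)` calls in the new example can be generated from a plain dictionary, which keeps the catalog settings in one place and out of the session-building code. A minimal sketch, assuming the same property keys as the example above (the helper name and placeholder values are illustrative, not part of this commit):

```python
# Illustrative helper (not from this commit): collect the Iceberg REST catalog
# settings used by the SparkSession builder into a single dictionary.
def r2_catalog_conf(catalog: str, uri: str, warehouse: str, token: str) -> dict:
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.token": token,
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.s3.remote-signing-enabled": "false",
        "spark.sql.defaultCatalog": catalog,
    }

conf = r2_catalog_conf("my_catalog", "<CATALOG_URI>", "<WAREHOUSE>", "<TOKEN>")

# With PySpark available, the settings would then be applied like:
# builder = SparkSession.builder.appName("R2DataCatalogExample")
# for key, value in conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

This also makes it easy to swap in real values from environment variables or a secrets manager instead of hard-coding the token.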

src/content/docs/r2/data-catalog/config-examples/spark.mdx renamed to src/content/docs/r2/data-catalog/config-examples/spark-scala.mdx

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,5 +1,5 @@
 ---
-title: Spark
+title: Spark (Scala)
 pcx_content_type: example
 ---
 
```

src/content/docs/r2/data-catalog/index.mdx

Lines changed: 1 addition & 1 deletion

```diff
@@ -15,7 +15,7 @@ import { Render, LinkCard } from "~/components";
 R2 Data Catalog is in **public beta**, and any developer with an [R2 subscription](/r2/pricing/) can start using it. Currently, outside of standard R2 storage and operations, you will not be billed for your use of R2 Data Catalog.
 :::
 
-R2 Data Catalog is a managed [Apache Iceberg](https://iceberg.apache.org/) data catalog built directly into your R2 bucket. It exposes a standard Iceberg REST catalog interface, so you can connect the engines you already use, like [Spark](/r2/data-catalog/config-examples/spark/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/).
+R2 Data Catalog is a managed [Apache Iceberg](https://iceberg.apache.org/) data catalog built directly into your R2 bucket. It exposes a standard Iceberg REST catalog interface, so you can connect the engines you already use, like [Spark](/r2/data-catalog/config-examples/spark-scala/), [Snowflake](/r2/data-catalog/config-examples/snowflake/), and [PyIceberg](/r2/data-catalog/config-examples/pyiceberg/).
 
 R2 Data Catalog makes it easy to turn an R2 bucket into a data warehouse or lakehouse for a variety of analytical workloads including log analytics, business intelligence, and data pipelines. R2's zero-egress fee model means that data users and consumers can access and analyze data from different clouds, data platforms, or regions without incurring transfer costs.
 
```
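Because the catalog speaks the standard Iceberg REST protocol described above, a client only needs the catalog URI, warehouse name, and a token. A minimal sketch with PyIceberg, one of the engines the docs link to (placeholder values; the `load_catalog` call is shown per PyIceberg's REST catalog configuration, not this commit):

```python
# Connection properties for an Iceberg REST catalog client such as PyIceberg.
# The angle-bracket values are placeholders for the real ones from the
# R2 dashboard and API token setup.
props = {
    "uri": "<CATALOG_URI>",
    "warehouse": "<WAREHOUSE>",
    "token": "<TOKEN>",
}

# With pyiceberg installed and real credentials, loading the catalog
# would look like:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("my_catalog", **props)
# print(catalog.list_namespaces())

print(sorted(props))
```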
