A collection of example Python scripts for managing Apache Iceberg tables on Cloudflare R2 Data Catalog using PySpark.
- Create: Create namespaces and tables from SQL files
- Insert: Insert data from SQL files, CSV, or Parquet
- JSON to Iceberg: Read JSON files from R2 and convert them to partitioned Iceberg tables with schema inference
- Delete: Delete data with optional cleanup of unreferenced files
- Drop: Drop tables or namespaces with optional purge of all files from R2
- Orphan File Removal: Clean up old snapshots and orphan files for a single table
- File Analysis: Analyze Iceberg table metadata to determine if compaction or re-partitioning is needed
- Spark Config: Shared Spark session configuration used by all scripts
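All scripts share one Spark configuration. As a minimal sketch, the R2 Data Catalog credentials map onto Iceberg's standard REST catalog properties roughly like this (the catalog name `r2dc` and the helper name are assumptions for illustration; the property keys are standard Iceberg Spark options):

```python
# Sketch: map R2 Data Catalog credentials onto Iceberg REST catalog
# properties for Spark. Catalog name "r2dc" is an assumption.
WAREHOUSE = "your-warehouse-path"
TOKEN = "your-token"
ENDPOINT = "your-catalog-uri"

def build_spark_conf(catalog: str = "r2dc") -> dict:
    """Return Spark conf entries for an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": ENDPOINT,
        f"{prefix}.warehouse": WAREHOUSE,
        f"{prefix}.token": TOKEN,
    }
```

Each key/value pair would be passed to `SparkSession.builder.config(k, v)` when building the session.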
Requirements:
- Python 3.8+
- Java 8, 11, or 17 (required by PySpark)
- A Cloudflare R2 Data Catalog account
Install Java:

```
# macOS with Homebrew
brew install openjdk@17

# Add to PATH (add to ~/.zshrc or ~/.bash_profile)
export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"
```

Install the Python dependencies:

```
pip install -r requirements.txt
```

Rename r2dc_spark_config.py.example to r2dc_spark_config.py and update the credentials:
```
WAREHOUSE = "your-warehouse-path"
TOKEN = "your-token"
ENDPOINT = "your-catalog-uri"
```

NOTE: if you want to clean up or remove files, you'll also need to configure S3-style access in the config:

```
S3_ACCESS_KEY_ID = "key"
S3_SECRET_ACCESS_KEY = "secret"
S3_ENDPOINT = "https://<account_id>.r2.cloudflarestorage.com/"
```

Create a namespace:
```
python r2dc_create.py --namespace my_namespace
```
Create a table from SQL file:
```
python r2dc_create.py --sql-file create_table.sql
```
Example create_table.sql:
```sql
CREATE TABLE IF NOT EXISTS my_namespace.users (
  id INT,
  name STRING,
  email STRING,
  created_at TIMESTAMP
) PARTITIONED BY (days(created_at))
```
List all namespaces:
```
python r2dc_create.py --list-namespaces
```
List tables in a namespace:
```
python r2dc_create.py --list-tables my_namespace
```
Describe a table:
```
python r2dc_create.py --describe my_namespace.users
```
Insert from SQL file:
```
python r2dc_insert.py --sql-file insert_data.sql
```
Example insert_data.sql:
```sql
INSERT INTO my_namespace.users VALUES
  (1, 'Alice', 'alice@example.com', TIMESTAMP '2024-01-01 10:00:00'),
  (2, 'Bob', 'bob@example.com', TIMESTAMP '2024-01-02 11:00:00')
```
Insert from CSV:
```
python r2dc_insert.py --csv-file data.csv --table my_namespace.users
```
Insert from Parquet:
```
python r2dc_insert.py --parquet-file data.parquet --table my_namespace.users
```
Insert with overwrite mode:
```
python r2dc_insert.py --csv-file data.csv --table my_namespace.users --mode overwrite
```
Show table data:
```
python r2dc_insert.py --show my_namespace.users --limit 20
```
Reads JSON files from an R2 bucket, infers the schema, and writes a partitioned Iceberg table. By default, tables are partitioned by `days(__ingest_ts)` for R2 SQL compatibility.
Basic usage — read all JSON from a bucket:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events
```
Read from a specific prefix (folder):
```
python r2dc_json_to_iceberg.py --bucket my-data --prefix logs/2026/03 --namespace analytics --table march_logs
```
Use an existing timestamp column for the ingest timestamp:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --timestamp-col event_time
```
Custom partition key — identity partition by a column:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --partition-by category
```
Custom partition key — partition by month instead of day:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --partition-by "months(__ingest_ts)"
```
Multiple partition keys:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events \
    --partition-by "days(__ingest_ts)" --partition-by category
```
Hash bucket partition (good for high-cardinality columns):
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --partition-by "bucket(16, user_id)"
```
Partition by a nested JSON field (e.g. {"metadata": {"region": "us-east"}}):
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --partition-by metadata.region
```
Nested fields are automatically extracted to top-level columns (metadata.region → metadata_region).
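The flattening rule can be sketched as a simple recursive rename that joins path segments with underscores (illustrative only; the real script operates on Spark schemas, not plain dicts):

```python
# Sketch: flatten nested JSON keys into top-level column names,
# joining path segments with "_" (e.g. metadata.region -> metadata_region).
def flatten(record: dict, parent: str = "") -> dict:
    out = {}
    for key, value in record.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            out.update(flatten(value, name))  # recurse into nested objects
        else:
            out[name] = value
    return out
```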
Append to an existing table:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --mode append
```
Multi-line JSON files:
```
python r2dc_json_to_iceberg.py --bucket my-data --namespace analytics --table events --multiline
```
Supported partition expressions:
| Expression | Description |
|---|---|
| `days(col)` | Time-based partition by day (R2 SQL compatible) |
| `hours(col)` | Time-based partition by hour |
| `months(col)` | Time-based partition by month |
| `years(col)` | Time-based partition by year |
| `bucket(n, col)` | Hash partition into n buckets |
| `truncate(n, col)` | Truncate partition (width n) |
| `col` | Identity partition (exact column value) |
| `parent.child` | Nested field (extracted to parent_child) |
Delete all records:
```
python r2dc_delete.py --table my_namespace.users
```
Delete with WHERE clause:
```
python r2dc_delete.py --table my_namespace.users --where "id > 100"
```
Delete with cleanup (recommended):
```
python r2dc_delete.py --table my_namespace.users --cleanup
```
This will:
- Delete the data
- Expire old snapshots (default: 7 days)
- Remove orphan files (default: 3 days)
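Snapshot expiry and orphan removal correspond to Iceberg's standard Spark SQL maintenance procedures. A sketch of the statements such a cleanup step would issue (the catalog name `r2dc` and the helper function are assumptions; `expire_snapshots` and `remove_orphan_files` are standard Iceberg procedures):

```python
from datetime import datetime, timedelta, timezone

# Sketch: build the Iceberg maintenance CALL statements for a cleanup pass.
# Catalog name "r2dc" is an assumption for this example.
def cleanup_statements(table: str, expire_days: int = 7, orphan_days: int = 3,
                       catalog: str = "r2dc") -> list:
    now = datetime.now(timezone.utc)
    expire_ts = (now - timedelta(days=expire_days)).strftime("%Y-%m-%d %H:%M:%S")
    orphan_ts = (now - timedelta(days=orphan_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{expire_ts}')",
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{orphan_ts}')",
    ]
```

Each statement would then be run via `spark.sql(...)` on a session configured for the R2 catalog.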
Delete with custom cleanup settings:
```
python r2dc_delete.py --table my_namespace.users --cleanup --expire-days 5 --orphan-days 2
```
Drop table (removes from catalog, keeps files in storage):
```
python r2dc_drop.py --table my_namespace.users
```
Drop table and DELETE ALL FILES from storage:
```
python r2dc_drop.py --table my_namespace.users --purge
```
Drop namespace (must be empty):
```
python r2dc_drop.py --namespace my_namespace
```
Drop namespace and all tables in it:
```
python r2dc_drop.py --namespace my_namespace --cascade
```
Drop namespace, all tables, and DELETE ALL FILES:
```
python r2dc_drop.py --namespace my_namespace --cascade --purge-tables
```
Skip confirmation prompts (use with caution):
```
python r2dc_drop.py --table my_namespace.users --purge --force
```
IMPORTANT NOTES:
- `--purge` operations DELETE ALL FILES from storage and CANNOT BE UNDONE
- `--purge` requires S3 credentials to be configured in `r2dc_spark_config.py`
- Without `--purge`, the table is removed from the catalog but files remain in storage
- `--purge-tables` only works with `--cascade` for namespace operations
- By default, you will be prompted to confirm destructive operations
The orphan file removal script performs maintenance on a single table by expiring old snapshots and removing orphan files that are no longer referenced by any snapshot.
Clean up a table with default settings:
```
python r2dc_orphan_file_removal.py pokegraph.pokemon
```
This will:
- Expire snapshots older than 7 days
- Remove orphan files older than 3 days
Clean up with custom retention periods:
```
python r2dc_orphan_file_removal.py pokegraph.pokemon 5 2
```
This will:
- Expire snapshots older than 5 days
- Remove orphan files older than 2 days
Usage:
```
python r2dc_orphan_file_removal.py <namespace.table> [expire_days] [orphan_days]
```
Parameters:
- `namespace.table`: Required. Fully qualified table name
- `expire_days`: Optional. Default 7. Expire snapshots older than N days
- `orphan_days`: Optional. Default 3. Remove orphan files older than N days
IMPORTANT NOTES:
- The script only deletes snapshots and files older than the specified cutoffs, never anything newer
- For example, with `expire_days=7`, only snapshots created MORE than 7 days ago are expired
- All snapshots and files within the retention window are kept safe
- This operation requires S3 credentials to be configured in `r2dc_spark_config.py`
- Run this periodically to keep storage costs down by removing unused files
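"Older than N days" is a strict timestamp cutoff; a sketch of the comparison (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Sketch: a snapshot qualifies for expiry only if it is strictly older
# than now minus the retention window.
def is_expirable(snapshot_ts: datetime, expire_days: int = 7,
                 now: datetime = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return snapshot_ts < now - timedelta(days=expire_days)
```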
The file analysis script inspects Iceberg table metadata (snapshots, manifests, data files, partitions) and produces a compaction health report with GREEN/YELLOW/RED scores and actionable recommendations.
Analyze a single table:
```
python r2dc_file_analysis.py my_namespace.my_table
```
This will output:
- Table overview (file count, data size, records, partitions, snapshots, manifests)
- File size distribution (min, p10, median, avg, p90, max, std dev)
- Files per partition breakdown
- Small file analysis distinguishing compactable vs non-compactable files
- Over-partitioning detection
- Metadata health (snapshot count, manifest stats)
- Compaction scoreboard with GREEN/YELLOW/RED ratings
- Actionable recommendations
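The size distribution is ordinary summary statistics over the per-file sizes reported in the table metadata; a sketch using nearest-rank percentiles (the script's exact percentile method is an assumption):

```python
import statistics

# Sketch: summarize a list of data-file sizes (in bytes) the way the
# report's distribution section does. Percentiles use nearest-rank.
def size_summary(sizes: list) -> dict:
    s = sorted(sizes)
    pct = lambda p: s[min(len(s) - 1, int(p / 100 * len(s)))]
    return {
        "min": s[0], "p10": pct(10), "median": statistics.median(s),
        "avg": statistics.mean(s), "p90": pct(90), "max": s[-1],
        "std_dev": statistics.pstdev(s),
    }
```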
Analyze and save detailed results to JSON:
```
python r2dc_file_analysis.py my_namespace.my_table --json analysis_results.json
```
Analyze all tables in a namespace:
```
python r2dc_file_analysis.py --all my_namespace
```
Specify a custom target file size (default: 128 MB):
```
python r2dc_file_analysis.py my_namespace.my_table --target-file-size 64
```
Disable colored output (for piping or logging):
```
python r2dc_file_analysis.py my_namespace.my_table --no-color
```
Usage:
```
python r2dc_file_analysis.py <namespace.table> [--json PATH] [--target-file-size MB] [--no-color]
python r2dc_file_analysis.py --all <namespace> [--target-file-size MB] [--no-color]
```
What the scores mean:
| Score | Data Compaction | Metadata Compaction |
|---|---|---|
| GREEN | No compaction needed | Metadata is healthy |
| YELLOW | Consider running compaction | Consider expiring snapshots |
| RED | Compaction recommended | Expire snapshots and rewrite manifests |
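As an illustration, a scoreboard like this typically thresholds a measured ratio, such as the fraction of files below the target size. The cut-offs below are invented for the sketch, not the analyzer's actual values:

```python
# Sketch: map the fraction of undersized data files to a traffic-light
# score. Thresholds are illustrative, not the analyzer's real values.
def data_compaction_score(small_file_ratio: float) -> str:
    if small_file_ratio < 0.2:
        return "GREEN"    # no compaction needed
    if small_file_ratio < 0.5:
        return "YELLOW"   # consider running compaction
    return "RED"          # compaction recommended
```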
File structure:
```
pyspark/
├── r2dc_spark_config.py          # Shared Spark configuration
├── r2dc_create.py                # Create namespaces and tables
├── r2dc_insert.py                # Insert operations
├── r2dc_json_to_iceberg.py       # Convert JSON files in R2 to Iceberg tables
├── r2dc_delete.py                # Delete operations with cleanup
├── r2dc_drop.py                  # Drop tables/namespaces with optional file purge
├── r2dc_orphan_file_removal.py   # Orphan file cleanup for a single table
├── r2dc_file_analysis.py         # Iceberg file analysis & compaction advisor
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
When inserting data, you can specify different modes:
- `append` (default): Add new data to existing table
- `overwrite`: Replace all existing data
- `error`: Throw error if table exists
- `ignore`: Silently ignore if table exists
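These names correspond to Spark's DataFrameWriter save modes. A sketch of validating the `--mode` flag before handing it to `df.write.mode(...)` (the helper name is illustrative):

```python
# Sketch: validate the --mode flag against Spark's save modes before writing.
VALID_MODES = {"append", "overwrite", "error", "ignore"}

def check_mode(mode: str) -> str:
    mode = mode.lower()
    if mode not in VALID_MODES:
        raise ValueError(f"--mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return mode
```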
The delete script includes several cleanup operations:
- Expire Snapshots: Marks old snapshots for deletion
- Remove Orphan Files: Removes data files no longer referenced by any snapshot
- Cleanup Metadata: Removes old metadata files
Example workflow:

```
# 1. Create namespace
python r2dc_create.py --namespace sales_data

# 2. Create table from SQL file
python r2dc_create.py --sql-file tables/sales.sql

# 3. Insert data from CSV
python r2dc_insert.py --csv-file data/sales_2024.csv --table sales_data.sales

# 4. Verify data
python r2dc_insert.py --show sales_data.sales --limit 10

# 5. Delete old records with cleanup
python r2dc_delete.py --table sales_data.sales --where "date < '2024-01-01'" --cleanup

# 6. Check if table needs compaction or maintenance
python r2dc_file_analysis.py sales_data.sales

# 7. Perform maintenance to clean up old snapshots and orphan files
python r2dc_orphan_file_removal.py sales_data.sales

# 8. Drop table and remove all files from storage (when done)
python r2dc_drop.py --table sales_data.sales --purge
```