Commit 8a470c6

Marcinthecloud, jonesphillip, and Oxyjun authored
Adds compaction to R2 Data Catalog docs and wrangler commands (#25366)
* Adds compaction to R2 Data Catalog docs and wrangler commands
* adding changelog for compaction in the PR due to deep linking to new doc pages in this repo
* Mostly small changes to compaction docs + changelog
* PCX Review

---------

Co-authored-by: Phillip Jones <[email protected]>
Co-authored-by: Jun Lee <[email protected]>
1 parent 215e5b4 commit 8a470c6

6 files changed

+177
-8
lines changed

Lines changed: 27 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,27 @@
1+
---
2+
title: R2 Data Catalog now supports compaction
3+
description: Users can now enable compaction on R2 Data Catalog
4+
products:
5+
- r2
6+
date: 2025-09-25T13:00:00
7+
hidden: true
8+
---
9+
import {
10+
LinkCard,
11+
} from "~/components";
12+
13+
You can now enable automatic compaction for [Apache Iceberg](https://iceberg.apache.org/) tables in [R2 Data Catalog](/r2/data-catalog/) to improve query performance.
14+
15+
Compaction is the process of taking a group of small files and combining them into fewer larger files. This is an important maintenance operation as it helps ensure that query performance remains consistent by reducing the number of files that need to be scanned.
16+
17+
To enable automatic compaction in R2 Data Catalog, find it under **R2 Data Catalog** in your R2 bucket settings in the dashboard.
18+
19+
![compaction-dash](~/assets/images/changelog/r2/compaction.png)
20+
21+
Or with [Wrangler](/workers/wrangler/), run:
22+
23+
```bash
24+
npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --target-size 128 --token <API_TOKEN>
25+
```
26+
27+
To get started with compaction, check out [manage catalogs](/r2/data-catalog/manage-catalogs/). For best practices and limitations, refer to [about compaction](/r2/data-catalog/about-compaction/).
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
---
2+
pcx_content_type: configuration
3+
title: About compaction
4+
description: Learn about R2 Data Catalog compaction
5+
sidebar:
6+
order: 4
7+
---
8+
9+
## What is compaction?
10+
11+
Compaction is the process of taking a group of small files and combining them into fewer larger files. This is an important maintenance operation as it helps ensure that query performance remains consistent by reducing the number of files that need to be scanned.
12+
13+
## Why do I need compaction?
14+
15+
Every write operation in [Apache Iceberg](https://iceberg.apache.org/), no matter how small or large, results in a series of new files being generated. As time goes on, the number of files can grow unbounded. This can lead to:
16+
- Slower queries and increased I/O operations: Without compaction, query engines will have to open and read each individual file, resulting in longer query times and increased costs.
17+
- Increased metadata overhead: Query engines must scan metadata files to determine which ones to read. With thousands of small files, query planning takes longer even before data is accessed.
18+
- Reduced compression efficiency: Smaller files compress less efficiently than larger files, leading to higher storage costs and more data to transfer during queries.
19+
20+
## R2 Data Catalog automatic compaction
21+
22+
R2 Data Catalog can now [manage compaction](/r2/data-catalog/manage-catalogs) for Apache Iceberg tables stored in R2. When enabled, compaction runs automatically and combines new files that have not been compacted yet.
23+
24+
Compacted files are prefixed with `compacted-` in the `/data/` directory.
25+
26+
### Choosing the right target file size
27+
28+
You can configure the target file size for compaction. Currently, the minimum is 64 MB and the maximum is 512 MB.
29+
30+
Different compute engines have different optimal file sizes, so check their documentation.
31+
32+
Performance tradeoffs depend on your use case. For example, queries that return small amounts of data may perform better with smaller files, as larger files could result in reading unnecessary data.
33+
- For workloads that are more latency sensitive, consider a smaller target file size (for example, 64 MB - 128 MB)
34+
- For streaming ingest workloads, consider medium file sizes (for example, 128 MB - 256 MB)
35+
- For OLAP style queries that need to scan a lot of data, consider larger file sizes (for example, 256 MB - 512 MB)
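
As a rough illustration of the guidance in the list above, the following shell sketch maps a workload type to a starting target size and rejects values outside the 64 MB to 512 MB range. The function names and the workload labels (`latency`, `streaming`, `olap`) are hypothetical, and the sizes chosen are just points picked from the ranges in the list, not official recommendations.

```bash
# Hypothetical helpers; only the 64-512 MB limits and the per-workload
# ranges come from the documentation above.

# Return 0 if the given size (in MB) is a whole number within the supported range.
is_valid_target_size() {
  local size_mb="$1"
  [[ "$size_mb" =~ ^[0-9]+$ ]] || return 1
  # 10# forces base 10 so a value like "064" is not read as octal.
  (( 10#$size_mb >= 64 && 10#$size_mb <= 512 ))
}

# Print a starting target size (in MB) for a workload label.
suggest_target_size() {
  case "$1" in
    latency)   echo 64  ;;  # latency-sensitive: 64 MB - 128 MB
    streaming) echo 128 ;;  # streaming ingest: 128 MB - 256 MB
    olap)      echo 512 ;;  # large scans: 256 MB - 512 MB
    *)         echo 128 ;;  # fall back to the 128 MB default
  esac
}
```

A value from `suggest_target_size` is only a starting point; the query engine's own documentation remains the better guide for a final setting.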
36+
37+
## Current limitations
38+
- During open beta, compaction will compact up to 2 GB worth of files once per hour for each table.
39+
- Only data files stored in parquet format are currently supported with compaction.
40+
- Snapshot expiration and orphan file cleanup are not supported yet.
41+
- Minimum target file size is 64 MB and maximum is 512 MB.
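
To make the first limitation concrete, here is a back-of-the-envelope sketch of how long a backlog of uncompacted files might take to work through at 2 GB per table per hour. It assumes 1 GB = 1024 MB and that the full 2 GB budget is used on every run; both are simplifying assumptions, and the function names are hypothetical.

```bash
# Hours needed to compact a backlog (whole GB) at the open beta rate of
# 2 GB per hour per table. Rounds up, since a partial batch still needs a run.
hours_to_compact() {
  local backlog_gb="$1"
  echo $(( (backlog_gb + 1) / 2 ))
}

# Approximate upper bound on output files per hourly run for a target size in MB,
# ignoring any change in size from recompression.
files_per_run() {
  local target_mb="$1"
  echo $(( 2 * 1024 / target_mb ))
}

hours_to_compact 10   # prints 5
files_per_run 128     # prints 16
```

Under these assumptions, a table with a 10 GB backlog would need about five hourly runs to catch up, provided no new data arrives in the meantime.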

src/content/docs/r2/data-catalog/config-examples/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ pcx_content_type: navigation
33
title: Connect to Iceberg engines
44
head: []
55
sidebar:
6-
order: 4
6+
order: 5
77
group:
88
hideIndex: true
99
description: Find detailed setup instructions for Apache Spark and other common query engines.

src/content/docs/r2/data-catalog/manage-catalogs.mdx

Lines changed: 74 additions & 7 deletions
@@ -28,7 +28,8 @@ Learn how to:
2828

2929
Enabling the catalog on a bucket turns on the REST catalog interface and provides a **Catalog URI** and **Warehouse name** required by Iceberg clients. Once enabled, you can create and manage Iceberg tables in that bucket.
3030

31-
### Dashboard
31+
<Tabs syncKey='CLIvDash'>
32+
<TabItem label='Dashboard'>
3233

3334
<Steps>
3435
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
@@ -39,22 +40,25 @@ Enabling the catalog on a bucket turns on the REST catalog interface and provide
3940
4. Once enabled, note the **Catalog URI** and **Warehouse name**.
4041
</Steps>
4142

42-
### Wrangler CLI
43-
43+
</TabItem>
44+
<TabItem label='Wrangler CLI'>
4445
To enable the catalog on your bucket, run the [`r2 bucket catalog enable` command](/workers/wrangler/commands/#r2-bucket-catalog-enable):
4546

4647
```bash
4748
npx wrangler r2 bucket catalog enable <BUCKET_NAME>
4849
```
4950

5051
After enabling, Wrangler will return your catalog URI and warehouse name.
52+
</TabItem>
53+
</Tabs>
54+
5155

5256
## Disable R2 Data Catalog on a bucket
5357

5458
When you disable the catalog on a bucket, it immediately stops serving requests from the catalog interface. Any Iceberg table references stored in that catalog become inaccessible until you re-enable it.
5559

56-
### Dashboard
57-
60+
<Tabs syncKey='CLIvDash'>
61+
<TabItem label='Dashboard'>
5862
<Steps>
5963
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
6064

@@ -63,13 +67,76 @@ When you disable the catalog on a bucket, it immediately stops serving requests
6367
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select **Disable**.
6468
</Steps>
6569

66-
### Wrangler CLI
67-
70+
</TabItem>
71+
<TabItem label='Wrangler CLI'>
6872
To disable the catalog on your bucket, run the [`r2 bucket catalog disable` command](/workers/wrangler/commands/#r2-bucket-catalog-disable):
6973

7074
```bash
7175
npx wrangler r2 bucket catalog disable <BUCKET_NAME>
7276
```
77+
</TabItem>
78+
</Tabs>
79+
80+
## Enable compaction
81+
Compaction improves query performance by combining the many small files created during data ingestion into fewer, larger files according to the set `target file size`. For more information about compaction and why it's valuable, refer to [About compaction](/r2/data-catalog/about-compaction/).
82+
<Tabs syncKey='CLIvDash'>
83+
<TabItem label='Dashboard'>
84+
85+
<Steps>
86+
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
87+
88+
<DashButton url="/?to=/:account/r2/overview" />
89+
2. Select the bucket you want to enable compaction on.
90+
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select the **Edit** icon next to the compaction card.
91+
4. Enable compaction and optionally set a target file size. The default is 128 MB.
92+
5. (Optional) Provide a Cloudflare API token for compaction to access and rewrite files in your bucket.
93+
6. Select **Save**.
94+
</Steps>
95+
96+
</TabItem>
97+
<TabItem label='Wrangler CLI'>
98+
99+
To enable compaction on your catalog, run the [`r2 bucket catalog compaction enable` command](/workers/wrangler/commands/#r2-bucket-catalog-compaction-enable):
100+
101+
```bash
102+
npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --target-size 128 --token <API_TOKEN>
103+
```
104+
</TabItem>
105+
</Tabs>
106+
107+
:::note[API token permission requirements]
108+
Compaction requires a Cloudflare API token with both R2 storage and R2 Data Catalog read/write permissions to act as a service credential. The compaction process uses this token to read files, combine them, and update table metadata.
109+
110+
Refer to [Authenticate your Iceberg engine](#authenticate-your-iceberg-engine) for details on creating a token with the required permissions.
111+
:::
112+
Once enabled, compaction applies retroactively to all existing tables and automatically to newly created tables. During open beta, we currently compact up to 2 GB worth of files once per hour for each table.
113+
114+
## Disable compaction
115+
Disabling compaction will prevent the process from running for all tables managed by the catalog. You can re-enable it at any time.
116+
117+
<Tabs syncKey='CLIvDash'>
118+
<TabItem label='Dashboard'>
119+
120+
<Steps>
121+
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
122+
123+
<DashButton url="/?to=/:account/r2/overview" />
124+
2. Select the bucket you want to disable compaction on.
125+
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select the **Edit** icon next to the compaction card.
126+
4. Disable compaction.
127+
5. Select **Save**.
128+
</Steps>
129+
130+
</TabItem>
131+
<TabItem label='Wrangler CLI'>
132+
133+
To disable compaction on your catalog, run the [`r2 bucket catalog compaction disable` command](/workers/wrangler/commands/#r2-bucket-catalog-compaction-disable):
134+
135+
```bash
136+
npx wrangler r2 bucket catalog compaction disable <BUCKET_NAME>
137+
```
138+
</TabItem>
139+
</Tabs>
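
The enable and disable commands above can be wrapped in a small script so that the token is read from the environment rather than typed on the command line. This is a sketch only: the `R2_COMPACTION_TOKEN` variable, the `DRY_RUN` switch, and the function name are assumptions made for illustration; the wrangler invocations themselves are the ones shown on this page.

```bash
# Build and run (or just print) the compaction command for a bucket.
# Usage: set_compaction <enable|disable> <bucket> [target_size_mb]
set_compaction() {
  local action="$1" bucket="$2" target_mb="${3:-128}"
  local -a cmd=(npx wrangler r2 bucket catalog compaction "$action" "$bucket")

  case "$action" in
    enable)
      # Hypothetical environment variable holding the API token.
      if [[ -z "${R2_COMPACTION_TOKEN:-}" ]]; then
        echo "R2_COMPACTION_TOKEN is not set" >&2
        return 1
      fi
      cmd+=(--target-size "$target_mb" --token "$R2_COMPACTION_TOKEN")
      ;;
    disable)
      ;;
    *)
      echo "unknown action: $action" >&2
      return 2
      ;;
  esac

  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    # Print instead of executing. Note that this echoes the token as well.
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Running `DRY_RUN=1 set_compaction disable my-bucket` prints the command without calling wrangler, which is a convenient way to check the arguments before making a change.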
73140

74141
## Authenticate your Iceberg engine
75142

src/content/partials/workers/wrangler-commands/r2.mdx

Lines changed: 34 additions & 0 deletions
@@ -106,6 +106,40 @@ wrangler r2 bucket catalog get <NAME> [OPTIONS]
106106
- `NAME` <Type text="string" /> <MetaInfo text="required" />
107107
- The name of the R2 bucket whose data catalog status to retrieve.
108108

109+
<AnchorHeading
110+
title="`catalog compaction enable`"
111+
slug="r2-bucket-catalog-compaction-enable"
112+
depth={3}
113+
/>
114+
115+
Enable compaction on an [R2 Data Catalog](/r2/data-catalog/).
116+
117+
```txt
118+
wrangler r2 bucket catalog compaction enable <BUCKET> [OPTIONS]
119+
```
120+
121+
- `BUCKET` <Type text="string" /> <MetaInfo text="required" />
122+
- The name of the bucket to enable R2 Data Catalog compaction for.
123+
- `--token` <Type text="string" /> <MetaInfo text="required" />
124+
- The R2 API token with R2 Data Catalog edit permissions.
125+
- `--target-size` <Type text="number" /> <MetaInfo text="optional" />
126+
- The target file size (in MB) compaction will attempt to generate. Default: 128.
127+
128+
<AnchorHeading
129+
title="`catalog compaction disable`"
130+
slug="r2-bucket-catalog-compaction-disable"
131+
depth={3}
132+
/>
133+
134+
Disable compaction on an [R2 Data Catalog](/r2/data-catalog/).
135+
136+
```txt
137+
wrangler r2 bucket catalog compaction disable <BUCKET> [OPTIONS]
138+
```
139+
140+
- `BUCKET` <Type text="string" /> <MetaInfo text="required" />
141+
- The name of the bucket to disable R2 Data Catalog compaction for.
142+
109143
<AnchorHeading title="`cors set`" slug="r2-bucket-cors-set" depth={3} />
110144

111145
Set the [CORS configuration](/r2/buckets/cors/) for an R2 bucket from a JSON file.
