Commit aa90ed0

Mostly small changes to compaction docs + changelog

1 parent 2babf07 commit aa90ed0

5 files changed: +34 −50 lines changed
Lines changed: 8 additions & 20 deletions

@@ -1,39 +1,27 @@
 ---
-title: R2 Data Catalog now supports compaction.
+title: R2 Data Catalog now supports compaction
 description: Users can now enable compaction on R2 Data Catalog
 products:
   - r2
-date: 2025-09-25
+date: 2025-09-25T13:00:00
 hidden: true
 ---
 import {
   LinkCard,
 } from "~/components";
 
-Today, we're adding support for managed [compaction](/r2/data-catalog/about-compaction) for [Apache Iceberg](https://iceberg.apache.org/) tables managed by [R2 Data Catalog](/r2/data-catalog/).
+You can now enable automatic compaction for [Apache Iceberg](https://iceberg.apache.org/) tables in [R2 Data Catalog](/r2/data-catalog/) to improve query performance.
 
-Compaction is the process of taking a group of small files and combining them into fewer larger files. This is an important maintenance operation as it helps ensure that performance remains consistent by reducing the number of files that needs to be scanned when running queries.
+Compaction is the process of taking a group of small files and combining them into fewer, larger files. This is an important maintenance operation: it helps keep query performance consistent by reducing the number of files that need to be scanned.
 
-To enable compaction in R2 Data Catalog, find it under **R2 Data Catalog** in your bucket settings in the dashboard
+To enable automatic compaction in R2 Data Catalog, find it under **R2 Data Catalog** in your R2 bucket settings in the dashboard.
 
 ![compaction-dash](~/assets/images/changelog/r2/compaction.png)
 
-or simply run:
+Or with [Wrangler](/workers/wrangler/), run:
 
 ```bash
-npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --targetSizeMb 128 --token <API_TOKEN>
+npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --target-size 128 --token <API_TOKEN>
 ```
 
-And that's it. Compaction will start running automatically.
-
-<LinkCard
-  title="Learn more about R2 Data Catalog"
-  href="/r2/data-catalog/manage-catalogs"
-  description="Learn how to manage R2 Data Catalog and enable compaction on your bucket."
-/>
-
-<LinkCard
-  title="Learn more about compaction"
-  href="/r2/data-catalog/about-compaction"
-  description="Learn more about compaction, best practices, and limitations."
-/>
+To get started with compaction, check out [manage catalogs](/r2/data-catalog/manage-catalogs/). For best practices and limitations, refer to [about compaction](/r2/data-catalog/about-compaction/).

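The changelog above describes compaction as combining groups of small files into fewer, larger files near a target size. As a rough illustration of that idea only (the real grouping logic runs inside the R2 Data Catalog service; `plan_compaction` is a hypothetical helper), a greedy grouping pass might look like:

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into batches of roughly target_mb.

    Hypothetical sketch of the idea behind compaction; the actual R2
    Data Catalog service decides groupings internally.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Flush the current batch once adding another file would
        # exceed the target size.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Twelve 16 MB files collapse into two larger outputs at a 128 MB target.
print(len(plan_compaction([16] * 12, target_mb=128)))  # → 2
```

Fewer, larger files means fewer objects for a query engine to open, which is exactly the benefit the changelog entry describes.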
src/content/docs/r2/data-catalog/about-compaction.mdx

Lines changed: 11 additions & 15 deletions

@@ -8,36 +8,32 @@ sidebar:
 
 ## What is compaction?
 
-Compaction is the process of taking a group of small files and combining them into fewer larger files. This is an important maintenance operation as it helps ensure that performance remains consistent by reducing the number of files that needs to be scanned.
+Compaction is the process of taking a group of small files and combining them into fewer, larger files. This is an important maintenance operation: it helps keep query performance consistent by reducing the number of files that need to be scanned.
 
 ## Why do I need compaction?
 
-Every write operation in Apache Iceberg, no matter how small or large, results in a series of new files being generated. As time goes on, the number of files can grow unbounded. This can lead to:
-- Increased metadata overhead: Each file has its own metadata, file path, column statistics, etc. This means the query engine will have to read a large amount of metadata to satisfy a given query.
-- Increased I/O operations: Without compaction, query engines will have to open and read each individual file, resulting in increased resource usage and cost.
-- Reduced compression efficiency: Smaller files tend to compress less efficiently compared to larger files.
+Every write operation in [Apache Iceberg](https://iceberg.apache.org/), no matter how small or large, generates a series of new files. Over time, the number of files can grow unbounded. This can lead to:
+- Slower queries and increased I/O operations: Without compaction, query engines must open and read each individual file, resulting in longer query times and increased costs.
+- Increased metadata overhead: Query engines must scan metadata files to determine which ones to read. With thousands of small files, query planning takes longer even before any data is accessed.
+- Reduced compression efficiency: Smaller files compress less efficiently than larger files, leading to higher storage costs and more data to transfer during queries.
 
-## R2 Data Catalog Compaction
+## R2 Data Catalog automatic compaction
 
-R2 Data Catalog can now [manage compaction](/r2/data-catalog/manage-catalogs) for Apache Iceberg tables stored in R2. The compaction service periodically runs every hour and compacts new files that have not been compacted yet.
+R2 Data Catalog can now [manage compaction](/r2/data-catalog/manage-catalogs) for Apache Iceberg tables stored in R2. When enabled, compaction runs automatically and combines new files that have not yet been compacted.
 
-You can tell which files in R2 have been compacted as they have a `compacted-` added to the file name in the `/data/` directory.
+Compacted files are prefixed with `compacted-` in the `/data/` directory.
 
 ### Choosing the right target file size
 
-You can configure the target file size compaction should try to generate if possible. There is a minimum of `64 MB` and a maximum of `512 MB` currently.
+You can configure the target file size for compaction. Currently, the minimum is 64 MB and the maximum is 512 MB.
 
-Different compute engines tend to have different best practices when it comes to ideal file sizes so it's best to consult their documentation to find out what's best.
+Different compute engines have different optimal file sizes, so check their documentation for recommendations.
 
-It's important to note that there are performance tradeoffs to keep in mind based on the use case. For example, if your use case is primarily performing queries that are well defined and will return small amounts of data, having a smaller target file size might be more beneficial as you might end up reading more data than necessary with larger files.
+Performance tradeoffs depend on your use case. For example, queries that return small amounts of data may perform better with smaller files, as larger files could result in reading unnecessary data.
 - For workloads that are more latency sensitive, consider a smaller target file size (for example, 64MB - 128MB)
 - For streaming ingest workloads, consider medium file sizes (for example, 128MB - 256MB)
 - For OLAP style queries that need to scan a lot of data, consider larger file sizes (for example, 256MB - 512MB)
 
-:::note
-Make sure to check to your compute engine documentation to check if there's a recommended file size.
-:::
-
 ## Current limitations
 - During open beta, compaction will compact up to 2GB worth of files once per hour for each table.
 - Only data files stored in parquet format are currently supported with compaction.
src/content/docs/r2/data-catalog/config-examples/index.mdx

Lines changed: 1 addition & 1 deletion

@@ -3,7 +3,7 @@ pcx_content_type: navigation
 title: Connect to Iceberg engines
 head: []
 sidebar:
-  order: 4
+  order: 5
   group:
     hideIndex: true
 description: Find detailed setup instructions for Apache Spark and other common query engines.

src/content/docs/r2/data-catalog/manage-catalogs.mdx

Lines changed: 12 additions & 12 deletions

@@ -78,7 +78,7 @@ npx wrangler r2 bucket catalog disable <BUCKET_NAME>
 </Tabs>
 
 ## Enable compaction
-Compaction is a performance optimization that takes the many small files created when ingesting data and combines them into fewer, larger files according to the set `target file size`. [Click here](http:///r2/data-catalog/about-compaction) to learn more about compaction.
+Compaction improves query performance by combining the many small files created during data ingestion into fewer, larger files, according to the configured target file size. For more information about compaction and why it's valuable, refer to [About compaction](/r2/data-catalog/about-compaction/).
 <Tabs syncKey='CLIvDash'>
 <TabItem label='Dashboard'>
 
@@ -87,10 +87,10 @@ Compaction is a performance optimization that takes the many small files created
 
 <DashButton url="/?to=/:account/r2/overview" />
 2. Select the bucket you want to enable compaction on.
-3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and click on the **edit** icon next to the compaction card.
-4. Click enable and optionally set a target file size. The default is 128MB.
-5. Provide a credential: you can choose to allow us generate an account-level token on your behalf, which is scoped to your bucket (recommended); or, you can manually input a token.
-6. Slick *save*
+3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and click on the **Edit** icon next to the compaction card.
+4. Enable compaction and optionally set a target file size. The default is 128MB.
+5. (Optional) Provide a Cloudflare API token for compaction to access and rewrite files in your bucket.
+6. Select **Save**.
 </Steps>
 
 </TabItem>
@@ -99,16 +99,16 @@ Compaction is a performance optimization that takes the many small files created
 To enable the compaction on your catalog, run the [`r2 bucket catalog enable command`](/workers/wrangler/commands/#r2-bucket-catalog-compaction-enable):
 
 ```bash
-npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --targetSizeMb 128 --token <API_TOKEN>
+npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --target-size 128 --token <API_TOKEN>
 ```
 </TabItem>
 </Tabs>
-An API Token is a required argument to enable compaction because our process needs to access and write metadata in R2 Catalog and access files in the R2 bucket, just like any Iceberg client would require. So we need an API Token with R2 Bucket Read/Write permissions and R2 Catalog Read/Write permissions. The token is encrypted and securely accessed only by our compaction processor. Once saved, it is not required again if you re-enable compaction on your Catalog after disabling it.
+Compaction requires a Cloudflare API token with **both** R2 storage and R2 Data Catalog read/write permissions to act as a service credential. The compaction process uses this token to read files, combine them, and update table metadata. Refer to [Authenticate your Iceberg engine](#authenticate-your-iceberg-engine) for details on creating a token with the required permissions.
 
-After enabling, compaction will be enabled retroactively on all existing tables, and will be enabled by default for newly created tables. During open beta, we currently compact up to 2GB worth of files once per hour for each table.
+Once enabled, compaction applies retroactively to all existing tables and automatically to newly created tables. During open beta, we currently compact up to 2GB worth of files once per hour for each table.
 
 ## Disable compaction
-Disabling compaction will prevent the process from running for all tables within the Catalog. You can re-enable it at any time.
+Disabling compaction prevents the process from running for all tables managed by the catalog. You can re-enable it at any time.
 
 <Tabs syncKey='CLIvDash'>
 <TabItem label='Dashboard'>
@@ -119,8 +119,8 @@ Disabling compaction will prevent the process from running for all tables within
 <DashButton url="/?to=/:account/r2/overview" />
 2. Select the bucket you want to enable compaction on.
 3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and click on the **edit** icon next to the compaction card.
-4. Click **disable**.
-5. Slick *save*.
+4. Disable compaction.
+5. Select **Save**.
 </Steps>
 
 </TabItem>
@@ -129,7 +129,7 @@ Disabling compaction will prevent the process from running for all tables within
 To disable the compaction on your catalog, run the [`r2 bucket catalog disable command`](/workers/wrangler/commands/#r2-bucket-catalog-compaction-disable):
 
 ```bash
-npx wrangler r2 bucket catalog compaction disable <BUCKET_NAME> --token <API_TOKEN>
+npx wrangler r2 bucket catalog compaction disable <BUCKET_NAME>
 ```
 </TabItem>
 </Tabs>
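The hunks above note that, during open beta, compaction processes up to 2GB of files per table once per hour. That limit makes it easy to estimate how long an existing backlog of small files will take to drain after you enable compaction. A hypothetical back-of-the-envelope helper (not part of Wrangler or the R2 API; actual throughput may vary):

```python
import math

def hours_to_drain(backlog_gb: float, gb_per_hour: float = 2) -> int:
    """Estimate hours for the hourly compaction pass to work through a
    backlog, assuming the open-beta limit of ~2 GB per table per hour."""
    return math.ceil(backlog_gb / gb_per_hour)

# A table with 50 GB of uncompacted files needs about 25 hourly passes.
print(hours_to_drain(50))  # → 25
```

This is worth keeping in mind when enabling compaction retroactively on large existing tables: the first full pass over historical data can take a while.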

src/content/partials/workers/wrangler-commands/r2.mdx

Lines changed: 2 additions & 2 deletions

@@ -122,8 +122,8 @@ wrangler r2 bucket catalog compaction enable <BUCKET> [OPTIONS]
   - The name of the bucket to enable R2 Data Catalog compaction for.
 - `--token` <Type text="string" /> <MetaInfo text="required" />
   - The R2 API token with R2 Data Catalog edit permissions
-- `--targetSizeMb` <Type text="number" /> <MetaInfo text="required" />
-  - The target file size compaction will attempt to generate if possible. Default: 128MB.
+- `--target-size` <Type text="number" /> <MetaInfo text="required" />
+  - The target file size (in MB) compaction will attempt to generate. Default: 128.
 
 <AnchorHeading
 title="`catalog compaction disable`"
