Commit a3e6fe0

Adds compaction to R2 Data Catalog docs and wrangler commands
1 parent eb19c17 commit a3e6fe0

3 files changed

+149
-7
lines changed

Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
---
2+
pcx_content_type: configuration
3+
title: About compaction
4+
description: Learn about R2 Data Catalog compaction
5+
sidebar:
6+
order: 4
7+
---
8+
9+
## What is compaction?
10+
11+
Compaction is the process of taking a group of small files and combining them into fewer, larger files. It is an important maintenance operation because it keeps query performance consistent by reducing the number of files that need to be scanned.
12+
13+
## Why do I need compaction?
14+
15+
Every write operation in Apache Iceberg, no matter how small or large, generates a series of new files. Over time, the number of files can grow without bound. This can lead to:
16+
- Increased metadata overhead: Each file has its own metadata, file path, column statistics, etc. This means the query engine will have to read a large amount of metadata to satisfy a given query.
17+
- Increased I/O operations: Without compaction, query engines will have to open and read each individual file, resulting in increased resource usage and cost.
18+
- Reduced compression efficiency: Smaller files tend to compress less efficiently compared to larger files.
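To make the file-count effect concrete, here is a minimal arithmetic sketch. The table size (10 GiB) and the 1 MB write size are hypothetical figures chosen for illustration, and `files_needed` is a helper name invented for this example.

```bash
# Estimate how many files hold a given volume of data at a given file size.
# Uses integer ceiling division: (total + size - 1) / size.
files_needed() {
  total_mb=$1
  file_mb=$2
  echo $(( (total_mb + file_mb - 1) / file_mb ))
}

# Hypothetical table holding 10 GiB (10240 MB) of data:
files_needed 10240 1     # written as 1 MB files: prints 10240
files_needed 10240 128   # compacted to 128 MB files: prints 80
```

Each of those files carries its own metadata entry and has to be opened separately, so the gap between the two counts is roughly the overhead that compaction removes.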
19+
20+
## R2 Data Catalog Compaction
21+
22+
R2 Data Catalog can now [manage compaction](/r2/data-catalog/manage-catalogs) for Apache Iceberg tables stored in R2. The compaction service runs once per hour and compacts new files that have not yet been compacted.
23+
24+
You can tell which files in R2 have been compacted because `compacted-` is added to their file names in the `/data/` directory.
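As a rough sketch of how you might check this, the helper below counts compacted data files in a list of object keys read from standard input. The key layout in the usage example is hypothetical, and the sketch assumes you already have a key listing from whichever S3-compatible client you use. It only relies on the two facts stated above: the file lives under `/data/` and its file name contains `compacted-`.

```bash
# Count object keys that live under a /data/ directory and whose file name
# (the part after the last slash) contains "compacted-".
# Reads one key per line from stdin.
count_compacted() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep '/data/' | grep -c 'compacted-[^/]*$' || true
}

# Hypothetical keys, for illustration only.
printf '%s\n' \
  'warehouse/my_table/data/compacted-0001.parquet' \
  'warehouse/my_table/data/0002.parquet' \
  'warehouse/my_table/metadata/v3.metadata.json' \
  | count_compacted
```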
25+
26+
### Choosing the right target file size
27+
28+
You can configure the target file size that compaction tries to produce. Currently, the minimum is `64 MB` and the maximum is `512 MB`.
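If you script your configuration, a small pre-flight check can catch out-of-range values before you submit them. This is a minimal sketch: `validate_target_size_mb` is a hypothetical helper, not part of Wrangler, and it only encodes the 64 MB and 512 MB bounds stated above.

```bash
# Check that a target file size (in MB) is a whole number within the
# documented 64-512 MB range. Prints "ok" or a reason and sets the exit status.
validate_target_size_mb() {
  size=$1
  case "$size" in
    ''|*[!0-9]*)
      echo "invalid: not a whole number"
      return 1
      ;;
  esac
  if [ "$size" -lt 64 ] || [ "$size" -gt 512 ]; then
    echo "invalid: must be between 64 and 512"
    return 1
  fi
  echo "ok"
}

validate_target_size_mb 128   # prints "ok"
validate_target_size_mb 32    # prints "invalid: must be between 64 and 512"
```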
29+
30+
Different compute engines have different best practices for ideal file sizes, so consult your engine's documentation for its recommendation.
31+
32+
There are performance tradeoffs to consider based on your use case. For example, if your queries are well defined and return small amounts of data, a smaller target file size may be more beneficial, because larger files can force the engine to read more data than necessary.
33+
- For workloads that are more latency sensitive, consider a smaller target file size (for example, 64 MB to 128 MB).
34+
- For streaming ingest workloads, consider medium file sizes (for example, 128 MB to 256 MB).
35+
- For OLAP-style queries that need to scan a lot of data, consider larger file sizes (for example, 256 MB to 512 MB).
36+
37+
:::note
38+
Check your compute engine's documentation for a recommended file size.
39+
:::
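The guidance above can be captured as a simple lookup. This is only a sketch: the workload labels (`latency`, `streaming`, `olap`) and the function name are invented for this example, and the ranges are the starting points listed above rather than hard rules.

```bash
# Map a workload type to the suggested target file size range in MB.
# The workload labels are hypothetical names chosen for this sketch.
suggest_target_size_mb() {
  case "$1" in
    latency)   echo "64-128" ;;    # latency-sensitive queries
    streaming) echo "128-256" ;;   # streaming ingest
    olap)      echo "256-512" ;;   # large analytical scans
    *)
      echo "unknown workload: $1" >&2
      return 1
      ;;
  esac
}

suggest_target_size_mb olap   # prints 256-512
```

Treat the result as a first guess and adjust it against your engine's own recommendation.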
40+
41+
## Current limitations
42+
- During open beta, compaction will compact up to 2GB worth of files once per hour for each table.
43+
- Only data files stored in Parquet format are currently supported by compaction.
44+
- Snapshot expiration and orphan file cleanup are not supported yet.
45+
- Minimum target file size is 64 MB and maximum is 512 MB.
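
The hourly cap gives a rough lower bound on how long an existing backlog takes to compact. The sketch below assumes that every hourly run processes the full 2 GB and that no new data arrives in the meantime. Both are simplifications, and the 25 GB backlog is a hypothetical figure.

```bash
# Estimate the minimum number of hourly runs needed to compact a backlog,
# given the open beta cap of 2 GB per table per hour.
hours_to_compact() {
  backlog_gb=$1
  per_hour_gb=2   # open beta limit from the list above
  echo $(( (backlog_gb + per_hour_gb - 1) / per_hour_gb ))
}

hours_to_compact 25   # prints 13
```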

src/content/docs/r2/data-catalog/manage-catalogs.mdx

Lines changed: 70 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,8 @@ Learn how to:
2828

2929
Enabling the catalog on a bucket turns on the REST catalog interface and provides a **Catalog URI** and **Warehouse name** required by Iceberg clients. Once enabled, you can create and manage Iceberg tables in that bucket.
3030

31-
### Dashboard
31+
<Tabs syncKey='CLIvDash'>
32+
<TabItem label='Dashboard'>
3233

3334
<Steps>
3435
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
@@ -39,22 +40,25 @@ Enabling the catalog on a bucket turns on the REST catalog interface and provide
3940
4. Once enabled, note the **Catalog URI** and **Warehouse name**.
4041
</Steps>
4142

42-
### Wrangler CLI
43-
43+
</TabItem>
44+
<TabItem label='Wrangler CLI'>
4445
To enable the catalog on your bucket, run the [`r2 bucket catalog enable command`](/workers/wrangler/commands/#r2-bucket-catalog-enable):
4546

4647
```bash
4748
npx wrangler r2 bucket catalog enable <BUCKET_NAME>
4849
```
4950

5051
After enabling, Wrangler will return your catalog URI and warehouse name.
52+
</TabItem>
53+
</Tabs>
54+
5155

5256
## Disable R2 Data Catalog on a bucket
5357

5458
When you disable the catalog on a bucket, it immediately stops serving requests from the catalog interface. Any Iceberg table references stored in that catalog become inaccessible until you re-enable it.
5559

56-
### Dashboard
57-
60+
<Tabs syncKey='CLIvDash'>
61+
<TabItem label='Dashboard'>
5862
<Steps>
5963
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
6064

@@ -63,13 +67,72 @@ When you disable the catalog on a bucket, it immediately stops serving requests
6367
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and select **Disable**.
6468
</Steps>
6569

66-
### Wrangler CLI
67-
70+
</TabItem>
71+
<TabItem label='Wrangler CLI'>
6872
To disable the catalog on your bucket, run the [`r2 bucket catalog disable command`](/workers/wrangler/commands/#r2-bucket-catalog-disable):
6973

7074
```bash
7175
npx wrangler r2 bucket catalog disable <BUCKET_NAME>
7276
```
77+
</TabItem>
78+
</Tabs>
79+
80+
## Enable compaction
81+
Compaction is a performance optimization that takes the many small files created when ingesting data and combines them into fewer, larger files according to the configured target file size. For more information, refer to [About compaction](/r2/data-catalog/about-compaction).
82+
<Tabs syncKey='CLIvDash'>
83+
<TabItem label='Dashboard'>
84+
85+
<Steps>
86+
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
87+
88+
<DashButton url="/?to=/:account/r2/overview" />
89+
2. Select the bucket you want to enable compaction on.
90+
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and click on the **edit** icon next to the compaction card.
91+
4. Click enable and optionally set a target file size. The default is 128MB.
92+
5. Provide a credential. You can either allow us to generate an account-level token on your behalf, scoped to your bucket (recommended), or manually enter a token.
93+
6. Click *save*.
94+
</Steps>
95+
96+
</TabItem>
97+
<TabItem label='Wrangler CLI'>
98+
99+
To enable compaction on your catalog, run the [`r2 bucket catalog compaction enable` command](/workers/wrangler/commands/#r2-bucket-catalog-compaction-enable):
100+
101+
```bash
102+
npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --targetSizeMb 128 --token <API_TOKEN>
103+
```
104+
</TabItem>
105+
</Tabs>
106+
An API token is required to enable compaction because the compaction process needs to read and write metadata in R2 Data Catalog and access files in the R2 bucket, just as any Iceberg client would. The token therefore needs R2 Bucket Read/Write permissions and R2 Catalog Read/Write permissions. The token is encrypted and accessed only by the compaction processor. Once saved, it is not required again if you re-enable compaction on your catalog after disabling it.
107+
108+
After you enable compaction, it applies retroactively to all existing tables and is enabled by default for newly created tables. During open beta, compaction processes up to 2 GB of files once per hour for each table.
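
If you manage several buckets, you may want to review the commands before running them. The sketch below is a dry run: it only prints the enable command for each bucket, using the `--targetSizeMb` and `--token` flags shown above. The bucket names are hypothetical, `enable_cmd` is a helper invented for this example, and the token is left as a literal `$API_TOKEN` reference so that the secret never appears in the output.

```bash
# Print (but do not run) the compaction enable command for one bucket.
enable_cmd() {
  bucket=$1
  size_mb=$2
  printf 'npx wrangler r2 bucket catalog compaction enable %s --targetSizeMb %s --token "$API_TOKEN"\n' \
    "$bucket" "$size_mb"
}

# Hypothetical bucket names, for illustration only.
for bucket in analytics-raw analytics-events; do
  enable_cmd "$bucket" 256
done
```

Once the printed commands look right, export `API_TOKEN` and run them in your shell.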
109+
110+
## Disable compaction
111+
Disabling compaction will prevent the process from running for all tables within the Catalog. You can re-enable it at any time.
112+
113+
<Tabs syncKey='CLIvDash'>
114+
<TabItem label='Dashboard'>
115+
116+
<Steps>
117+
1. In the Cloudflare dashboard, go to the **R2 object storage** page.
118+
119+
<DashButton url="/?to=/:account/r2/overview" />
120+
2. Select the bucket you want to disable compaction on.
121+
3. Switch to the **Settings** tab, scroll down to **R2 Data Catalog**, and click on the **edit** icon next to the compaction card.
122+
4. Click **disable**.
123+
5. Click *save*.
124+
</Steps>
125+
126+
</TabItem>
127+
<TabItem label='Wrangler CLI'>
128+
129+
To disable compaction on your catalog, run the [`r2 bucket catalog compaction disable` command](/workers/wrangler/commands/#r2-bucket-catalog-compaction-disable):
130+
131+
```bash
132+
npx wrangler r2 bucket catalog compaction disable <BUCKET_NAME> --token <API_TOKEN>
133+
```
134+
</TabItem>
135+
</Tabs>
73136

74137
## Authenticate your Iceberg engine
75138

src/content/partials/workers/wrangler-commands/r2.mdx

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,40 @@ wrangler r2 bucket catalog get <NAME> [OPTIONS]
106106
- `NAME` <Type text="string" /> <MetaInfo text="required" />
107107
- The name of the R2 bucket whose data catalog status to retrieve.
108108

109+
<AnchorHeading
110+
title="`catalog compaction enable`"
111+
slug="r2-bucket-catalog-compaction-enable"
112+
depth={3}
113+
/>
114+
115+
Enable compaction on an [R2 Data Catalog](/r2/data-catalog/).
116+
117+
```txt
118+
wrangler r2 bucket catalog compaction enable <BUCKET> [OPTIONS]
119+
```
120+
121+
- `BUCKET` <Type text="string" /> <MetaInfo text="required" />
122+
- The name of the bucket to enable R2 Data Catalog compaction for.
123+
- `--token` <Type text="string" /> <MetaInfo text="required" />
124+
  - The R2 API token with R2 Data Catalog edit permissions.
125+
- `--targetSizeMb` <Type text="number" /> <MetaInfo text="optional" />
126+
- The target file size compaction will attempt to generate if possible. Default: 128MB.
127+
128+
<AnchorHeading
129+
title="`catalog compaction disable`"
130+
slug="r2-bucket-catalog-compaction-disable"
131+
depth={3}
132+
/>
133+
134+
Disable compaction on an [R2 Data Catalog](/r2/data-catalog/).
135+
136+
```txt
137+
wrangler r2 bucket catalog compaction disable <BUCKET> [OPTIONS]
138+
```
139+
140+
- `BUCKET` <Type text="string" /> <MetaInfo text="required" />
141+
  - The name of the bucket to disable R2 Data Catalog compaction for.
142+
109143
<AnchorHeading title="`cors set`" slug="r2-bucket-cors-set" depth={3} />
110144

111145
Set the [CORS configuration](/r2/buckets/cors/) for an R2 bucket from a JSON file.
