Adds compaction to R2 Data Catalog docs and wrangler commands #25366
Merged: jonesphillip merged 4 commits into cloudflare:production from Marcinthecloud:R2DC-Compaction-Docs on Sep 25, 2025
Commits
- a3e6fe0 Adds compaction to R2 Data Catalog docs and wrangler commands (Marcinthecloud)
- 2babf07 adding changelog for compaction in the PR due to deep linking to new … (Marcinthecloud)
- aa90ed0 Mostly small changes to compaction docs + changelog (jonesphillip)
- e807d09 PCX Review (Oxyjun)
27 changes: 27 additions & 0 deletions
src/content/changelog/r2/2025-09-25-data-catalog-compaction.mdx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| --- | ||
| title: R2 Data Catalog now supports compaction | ||
| description: Users can now enable compaction on R2 Data Catalog | ||
| products: | ||
| - r2 | ||
| date: 2025-09-25T13:00:00 | ||
| hidden: true | ||
| --- | ||
| import { | ||
| LinkCard, | ||
| } from "~/components"; | ||
| | ||
| You can now enable automatic compaction for [Apache Iceberg](https://iceberg.apache.org/) tables in [R2 Data Catalog](/r2/data-catalog/) to improve query performance. | ||
| | ||
| Compaction is the process of taking a group of small files and combining them into fewer, larger files. This is an important maintenance operation because it keeps query performance consistent by reducing the number of files that need to be scanned. | ||
| | ||
| To enable automatic compaction, go to **R2 Data Catalog** in your R2 bucket settings in the dashboard. | ||
| | ||
|  | ||
| | ||
| Or with [Wrangler](/workers/wrangler/), run: | ||
| | ||
| ```bash | ||
| npx wrangler r2 bucket catalog compaction enable <BUCKET_NAME> --target-size 128 --token <API_TOKEN> | ||
| ``` | ||
| | ||
| To get started with compaction, check out [manage catalogs](/r2/data-catalog/manage-catalogs/). For best practices and limitations, refer to [about compaction](/r2/data-catalog/about-compaction/). |
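
As an illustration of the Wrangler command in this changelog entry, here is a minimal sketch that checks a target size against the documented 64 MB to 512 MB range before printing the enable command. It assumes `--target-size` is given in megabytes, as the `128` in the example suggests. `build_compaction_cmd` is a hypothetical helper, not part of Wrangler, and the script only prints the command rather than running it.

```bash
#!/bin/sh
# Sketch only: prints the enable command instead of executing it.
# build_compaction_cmd is a hypothetical helper, not a Wrangler feature.

MIN_TARGET_MB=64    # documented minimum target file size
MAX_TARGET_MB=512   # documented maximum target file size

build_compaction_cmd() {
  bucket="$1"
  target_mb="$2"

  # Reject empty or non-numeric sizes before comparing them.
  case "$target_mb" in
    ''|*[!0-9]*)
      echo "target size must be a whole number of MB" >&2
      return 1
      ;;
  esac

  # Enforce the documented 64-512 MB range.
  if [ "$target_mb" -lt "$MIN_TARGET_MB" ] || [ "$target_mb" -gt "$MAX_TARGET_MB" ]; then
    echo "target size must be between ${MIN_TARGET_MB} and ${MAX_TARGET_MB} MB" >&2
    return 1
  fi

  # The API token stays a placeholder; substitute your own before running.
  printf '%s\n' "npx wrangler r2 bucket catalog compaction enable ${bucket} --target-size ${target_mb} --token <API_TOKEN>"
}

# Example: "my-bucket" is a placeholder bucket name.
build_compaction_cmd my-bucket 128
```

Validating locally like this is a convenience for scripts; the service itself remains the authority on which values are accepted.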
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| --- | ||
| pcx_content_type: configuration | ||
| title: About compaction | ||
| description: Learn about R2 Data Catalog compaction | ||
| sidebar: | ||
| order: 4 | ||
| --- | ||
| | ||
| ## What is compaction? | ||
| | ||
| Compaction is the process of taking a group of small files and combining them into fewer, larger files. This is an important maintenance operation because it keeps query performance consistent by reducing the number of files that need to be scanned. | ||
| | ||
| ## Why do I need compaction? | ||
| | ||
| Every write operation in [Apache Iceberg](https://iceberg.apache.org/), no matter how small or large, results in a series of new files being generated. As time goes on, the number of files can grow unbounded. This can lead to: | ||
| - Slower queries and increased I/O operations: Without compaction, query engines will have to open and read each individual file, resulting in longer query times and increased costs. | ||
| - Increased metadata overhead: Query engines must scan metadata files to determine which data files to read. With thousands of small files, query planning takes longer even before any data is accessed. | ||
| - Reduced compression efficiency: Smaller files compress less efficiently than larger files, leading to higher storage costs and more data to transfer during queries. | ||
| | ||
| ## R2 Data Catalog automatic compaction | ||
| | ||
| R2 Data Catalog can now [manage compaction](/r2/data-catalog/manage-catalogs) for Apache Iceberg tables stored in R2. When enabled, compaction runs automatically and combines new files that have not been compacted yet. | ||
| | ||
| Compacted files are prefixed with `compacted-` in the `/data/` directory. | ||
| | ||
| ### Choosing the right target file size | ||
| | ||
| You can configure the target file size for compaction. Currently, the minimum is 64 MB and the maximum is 512 MB. | ||
| | ||
| Different compute engines have different optimal file sizes, so check the documentation for the engine you use. | ||
| | ||
| Performance tradeoffs depend on your use case. For example, queries that return small amounts of data may perform better with smaller files, as larger files could result in reading unnecessary data. | ||
| - For latency-sensitive workloads, consider a smaller target file size (for example, 64 MB to 128 MB). | ||
| - For streaming ingest workloads, consider medium file sizes (for example, 128 MB to 256 MB). | ||
| - For OLAP-style queries that need to scan a lot of data, consider larger file sizes (for example, 256 MB to 512 MB). | ||
| | ||
| ## Current limitations | ||
| - During open beta, compaction processes up to 2 GB of files once per hour for each table. | ||
| - Compaction currently supports only data files stored in Parquet format. | ||
| - Snapshot expiration and orphan file cleanup are not supported yet. | ||
| - Minimum target file size is 64 MB and maximum is 512 MB. |
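
To make the sizing guidance and the open beta limit above more concrete, the following sketch does some rough planning arithmetic. It assumes 1 GB = 1024 MB and that each hourly run uses its full 2 GB budget, neither of which the docs state. The function names are hypothetical and exist only for illustration.

```bash
#!/bin/sh
# Rough planning arithmetic under stated assumptions:
# 1 GB = 1024 MB, and every hourly run processes its full 2 GB budget.

HOURLY_BUDGET_MB=2048   # 2 GB per table per hour (open beta limit)

# Map a workload type to the target size range suggested in the docs.
suggest_target_range() {
  case "$1" in
    latency)   echo "64-128" ;;
    streaming) echo "128-256" ;;
    olap)      echo "256-512" ;;
    *)
      echo "unknown workload: $1" >&2
      return 1
      ;;
  esac
}

# Hours needed to compact a backlog (in MB) for one table, rounded up.
hours_to_compact() {
  backlog_mb="$1"
  echo $(( (backlog_mb + HOURLY_BUDGET_MB - 1) / HOURLY_BUDGET_MB ))
}

# Approximate number of output files one hourly run produces at a target size.
files_per_run() {
  target_mb="$1"
  echo $(( HOURLY_BUDGET_MB / target_mb ))
}

suggest_target_range olap   # prints 256-512
hours_to_compact 10240      # 10 GB backlog: prints 5
files_per_run 128           # prints 16
```

These figures are upper-bound estimates. Actual runs may compact less than 2 GB, and output files will not land exactly on the target size.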
For the Wrangler commands, we should consider using the `WranglerNamespace` component at some point: https://developers.cloudflare.com/style-guide/components/wrangler-namespace/