Compacting data files can be a very memory intensive operation. You may consider performing this operation in batches by specifying the `max_compacted_files` parameter.
## Advanced Options
The `merge_adjacent_files` function supports optional parameters to filter which files are considered for compaction and control memory usage. This enables advanced compaction strategies and more granular control over the compaction process.
- **`max_compacted_files`**: Limits the maximum number of files to compact in a single operation. Because compaction can be very memory intensive, consider using this parameter to perform the work in batches.
- **`min_file_size`**: Files smaller than this size (in bytes) are excluded from compaction. If not specified, all files are considered regardless of minimum size.
- **`max_file_size`**: Files at or larger than this size (in bytes) are excluded from compaction. If not specified, it defaults to `target_file_size`. Must be greater than 0.
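To make the filtering semantics concrete, here is a minimal, self-contained sketch of the selection logic described above. `select_compaction_candidates` is a hypothetical helper for illustration, not part of the library: a file is a candidate when `min_file_size <= size < max_file_size`, and `max_compacted_files` caps how many are taken in one operation.

```python
def select_compaction_candidates(file_sizes, min_file_size=None,
                                 max_file_size=None, max_compacted_files=None):
    """Hypothetical illustration of the documented filters.

    Keeps files with min_file_size <= size < max_file_size, then caps
    the result at max_compacted_files entries.
    """
    candidates = [
        size for size in file_sizes
        if (min_file_size is None or size >= min_file_size)
        and (max_file_size is None or size < max_file_size)
    ]
    if max_compacted_files is not None:
        candidates = candidates[:max_compacted_files]
    return candidates

# Files of 0.5MB, 2MB, 8MB, and 64MB; only the 2MB and 8MB files
# fall inside the [1MB, 10MB) window.
sizes = [512_000, 2_000_000, 8_000_000, 64_000_000]
print(select_compaction_candidates(sizes,
                                   min_file_size=1_000_000,
                                   max_file_size=10_000_000))
# → [2000000, 8000000]
```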
### Example: Tiered Compaction Strategy for Streaming Workloads
File size filtering enables tiered compaction strategies, which are particularly useful for real-time/streaming ingestion patterns. A tiered approach merges files in stages:
- **Tier 0 → Tier 1**: Done often, merge small files (< 1MB) into ~5MB files
- **Tier 1 → Tier 2**: Done occasionally, merge medium files (1MB-10MB) into ~32MB files
- **Tier 2 → Tier 3**: Done rarely, merge large files (10MB-64MB) into ~128MB files
This compaction strategy provides more predictable I/O amplification and better incremental compaction for streaming workloads.
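The effect of one tier can be simulated with plain Python to see how file counts evolve. `run_tier` below is a hypothetical illustration of the strategy, not the library's API: it merges eligible files (those within the tier's size window) into outputs of roughly the target size, while files outside the window pass through untouched.

```python
MB = 1024 * 1024

def run_tier(file_sizes, min_size, max_size, target_size):
    """Hypothetical simulation of one compaction tier.

    Files with min_size <= size < max_size are merged into outputs of
    roughly target_size bytes; all other files pass through unchanged.
    """
    eligible, untouched = [], []
    for size in file_sizes:
        if (min_size is None or size >= min_size) and size < max_size:
            eligible.append(size)
        else:
            untouched.append(size)

    # Greedily fold eligible files into ~target_size outputs.
    merged, batch = [], 0
    for size in eligible:
        batch += size
        if batch >= target_size:
            merged.append(batch)
            batch = 0
    if batch:
        merged.append(batch)
    return untouched + merged

# Tier 0 -> Tier 1: twelve 0.5MB files plus one 32MB file.
# The 32MB file is untouched; the small files fold into ~5MB outputs.
sizes = [512 * 1024] * 12 + [32 * MB]
print(run_tier(sizes, None, 1 * MB, 5 * MB))
# → [33554432, 5242880, 1048576]
```

Running each tier at a different cadence (often, occasionally, rarely) is what keeps the write amplification of any single pass bounded.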