
Conversation


@nagraham nagraham commented Jan 5, 2026

See the related issue: #111

Problem

We have observed compaction run repeatedly, for days at a time, on tables that have not added or removed any data. We would expect the compaction job to produce files that are either larger than our configured max size or fewer than the min_group_file_count, so that subsequent jobs would not run at all. This impacts partitioned tables.

Solution

The ideal approach is to update the strategies in file_selection/strategy.rs to group files by partition, using the partition information in the FileScanTask. Unfortunately, that partition information is not in the forked iceberg-rust (yet); it is a recently added attribute: link to task.rs.

Approaches

  • (SUGGESTED) Option A: Cherry-pick the commit adding partition info into the forked iceberg-rust and then group by partition info from the FileScanTask
    • (+) This is an ideal long term solution because it uses partition information in the FileScanTask struct.
    • (-) One risk is that it may have a lot of complex conflicts to resolve, which could delay getting out a fix
      • UPDATE: I cherry-picked that commit into the rising-wave fork of iceberg-rust, and resolved the conflicts. So this may not be a risk.
  • Option B: Get partition information from data file metadata in Manifest files, and pass a mapping into Strategy.execute(). Compare a FileScanTask's data file path with the path in the map (see the sketch after this list).
    • (+) We can do this now without upgrading iceberg-rust
    • (-) It adds annoying mapping code / complexity
    • (-) Temporary solution. Would move to Option A later once iceberg-rust is up to date.
    • (-) It adds a bit more I/O to get manifest files
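
For illustration, here is a minimal sketch of the Option B lookup, assuming a hypothetical map from data file path to partition value built from manifest metadata (none of these names come from the codebase):

```rust
use std::collections::HashMap;

// Stand-in for however the partition value would be keyed; the real mapping
// would be built by reading the table's manifest files before the strategy runs.
type PartitionKey = String;

fn lookup_partition<'a>(
    path_to_partition: &'a HashMap<String, PartitionKey>,
    data_file_path: &str,
) -> Option<&'a PartitionKey> {
    path_to_partition.get(data_file_path)
}

fn main() {
    let mut path_to_partition: HashMap<String, PartitionKey> = HashMap::new();
    path_to_partition.insert("s3://bucket/p=1/a.parquet".to_string(), "p=1".to_string());

    // The strategy would look up each FileScanTask's data file path in the map.
    assert_eq!(
        lookup_partition(&path_to_partition, "s3://bucket/p=1/a.parquet").map(String::as_str),
        Some("p=1")
    );
}
```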

Implemented Option A

This PR implements Option A, using partition values to group FileScanTasks when the table is partitioned.
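
As a rough illustration of the grouping, here is a minimal sketch; the FileScanTask struct below is a simplified stand-in (the real implementation keys on the partition value carried by iceberg-rust's FileScanTask, not a plain string):

```rust
use std::collections::HashMap;

// Simplified stand-in for iceberg-rust's FileScanTask; only the fields needed
// to illustrate grouping are included, and the partition is modeled as a string.
#[derive(Debug, Clone)]
struct FileScanTask {
    data_file_path: String,
    partition_key: Option<String>,
}

// Group tasks by partition value so thresholds such as min_group_file_count
// and the max target size are evaluated per partition, not across the table.
fn group_by_partition(tasks: Vec<FileScanTask>) -> HashMap<Option<String>, Vec<FileScanTask>> {
    let mut groups: HashMap<Option<String>, Vec<FileScanTask>> = HashMap::new();
    for task in tasks {
        groups.entry(task.partition_key.clone()).or_default().push(task);
    }
    groups
}

fn main() {
    let tasks = vec![
        FileScanTask { data_file_path: "p=1/a.parquet".into(), partition_key: Some("p=1".into()) },
        FileScanTask { data_file_path: "p=2/b.parquet".into(), partition_key: Some("p=2".into()) },
        FileScanTask { data_file_path: "p=1/c.parquet".into(), partition_key: Some("p=1".into()) },
    ];
    for (partition, group) in group_by_partition(tasks) {
        let paths: Vec<&str> = group.iter().map(|t| t.data_file_path.as_str()).collect();
        println!("{:?}: {:?}", partition, paths);
    }
}
```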

Testing

I wrote an initial commit with an integration test asserting that the min_group_file_count filters out a table in which every partition contains fewer files than the min_group_file_count. Without the fix the test fails, reproducing the issue; with the fix it passes.
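
Roughly, the test asserts over the rewrite stats returned by the compaction run; the helper below is hypothetical and only shows the shape of the check, not the actual test code:

```rust
// Hypothetical helper illustrating the assertion: after an initial compaction,
// every partition holds fewer files than min_group_file_count, so a second run
// should select and rewrite nothing.
fn assert_no_recompaction(input_files_count: u64, output_files_count: u64) {
    assert_eq!(
        input_files_count, 0,
        "Compaction should NOT have re-run compaction because the files within \
         each partition are less than the min_group_file_count"
    );
    assert_eq!(output_files_count, 0);
}

fn main() {
    // Without the fix, the stats report 5 input and 5 output files, so the
    // assertion panics (see the failing output below).
    assert_no_recompaction(0, 0);
}
```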

Here is an example of the failing test:

---- integration_tests::test_min_files_in_group_applies_to_partitioned_table stdout ----

thread 'integration_tests::test_min_files_in_group_applies_to_partitioned_table' panicked at integration-tests/src/integration_tests.rs:460:5:
Compaction should NOT have re-run compaction because the files within each partition are less than the min_group_file_count; stats: RewriteFilesStat { input_files_count: 5, output_files_count: 5, input_total_bytes: 48312, output_total_bytes: 48312, input_data_file_count: 5, input_position_delete_file_count: 0, input_equality_delete_file_count: 0, input_data_file_total_bytes: 48312, input_position_delete_file_total_bytes: 0, input_equality_delete_file_total_bytes: 0 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    integration_tests::test_min_files_in_group_applies_to_partitioned_table

test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.66s

Bonus: This change also updates the integration test library to run on Mac and Windows machines.

@nagraham nagraham marked this pull request as ready for review January 7, 2026 17:51

Li0k commented Jan 7, 2026

Thanks for the PR, I will review it ASAP

max_concurrent_closes: self
.max_concurrent_closes
.unwrap_or(DEFAULT_MAX_CONCURRENT_CLOSES),
partition_key: self.partition_key,

@nagraham nagraham Jan 8, 2026


We discovered this triggers a panic deeper in iceberg-rust. The root cause is that self.partition_key is set to None. However, the RecordBatchPartitionSplitter provides a partition_key when build() is invoked, and we should use that partition_key instead. By passing None, we write invalid partition data. I suspect the first instance works because the initial writer has the correct partition_key.

This triggered a panic in construct_partition_summaries due to the iterators not having the same length. Long term, iceberg-rust should probably return a better error rather than panic. But perhaps panicking is better than writing corrupted data.

Truncated stack trace:

itertools: .zip_eq() reached end of one iterator before the other
iceberg::spec::manifest::writer::ManifestWriter::construct_partition_summaries
iceberg::spec::manifest::writer::ManifestWriter::write_manifest_file
iceberg::transaction::snapshot::SnapshotProducer::write_added_manifest
iceberg::transaction::rewrite_files::RewriteFilesAction::commit
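
For context, here is a minimal, hypothetical sketch of the intent of the fix; the names and types below are illustrative stand-ins, not the real builder or RecordBatchPartitionSplitter API:

```rust
// Stand-in writer configuration; the real partition key type lives in iceberg-rust.
struct WriterConfig {
    partition_key: Option<String>,
}

// Prefer the partition_key supplied at build() time (by the splitter) so that
// roll-over writers carry the same partition value as the first writer, instead
// of defaulting to the builder's configured value of None.
fn build_writer_config(
    builder_partition_key: Option<String>,
    splitter_partition_key: Option<String>,
) -> WriterConfig {
    WriterConfig {
        partition_key: splitter_partition_key.or(builder_partition_key),
    }
}

fn main() {
    // With the old behavior (always None), later writers produced data files with
    // no partition value, desynchronizing the partition summaries and triggering
    // the zip_eq panic shown in the stack trace above.
    let cfg = build_writer_config(None, Some("p=1".to_string()));
    assert_eq!(cfg.partition_key.as_deref(), Some("p=1"));
}
```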

I wrote a second integration test and verified that it triggers a panic without this modification. It also demonstrates the conditions that trigger the error:

  1. The table must be partitioned
  2. At least one partition must have a group of input files whose combined size is larger than the target_size, which triggers a "roll over" to a new output file
