
Conversation

@xingbowang
Contributor

Summary:
This adds a new public API to allow applications to abort all running compactions and prevent new ones from starting. Unlike DisableManualCompaction(), which only pauses manual compactions and waits for them to finish naturally, AbortAllCompactions() actively signals running compactions (both automatic and manual) to terminate early and waits for them to complete before returning.

The abort signal is checked periodically during compaction (every 100 keys), so ongoing compactions abort quickly. Any output files from aborted compactions are automatically cleaned up to prevent partial results from being installed.

This is useful for scenarios where applications need to quickly stop all compaction activity, such as during graceful shutdown or when performing maintenance operations.

Test Plan:

  • Unit tests in db_compaction_abort_test.cc cover various abort scenarios including: abort before/during compaction, abort with multiple subcompactions, nested abort/resume calls, abort with CompactFiles API, abort across multiple column families, and timing guarantees
  • Updated compaction_job_test.cc to include the new parameter

@meta-cla meta-cla bot added the CLA Signed label Jan 8, 2026
@xingbowang xingbowang force-pushed the comp_abort branch 2 times, most recently from b112174 to 405863c, January 23, 2026 18:16

This also adds a new public API to resume compactions after the call to abort.

Limitation: the compaction service is not supported.

Test Plan:
- Stress test
@meta-codesync

meta-codesync bot commented Jan 26, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D91480994.

@xingbowang xingbowang marked this pull request as ready for review January 26, 2026 21:31
@xingbowang xingbowang requested a review from anand1976 January 27, 2026 05:02
Contributor

@anand1976 anand1976 left a comment


Overall LGTM

}
for (const std::string& file_path :
sub_compact.Outputs(is_proximal_level)->GetOutputFilePaths()) {
Status s = env_->DeleteFile(file_path);
Contributor


Wouldn't a compaction with non-ok status get automatically cleaned up? Why do we need to explicitly do the cleanup here?

Contributor Author


The normal cleanup path (Cleanup() in compaction_outputs.h lines 247-253) only abandons in-progress builders. This does not delete already-finished output files that were successfully written to disk.

When compaction runs with multiple subcompactions in parallel:

  1. Subcompaction A completes successfully → produces finished SST/blob files on disk
  2. Subcompaction B gets aborted (or the overall compaction is paused)

The overall compaction status becomes CompactionAborted or ManualCompactionPaused. At that point, Subcompaction A's output files are fully written and finished on disk, but because the overall compaction is aborted, they will never be installed into the LSM tree. Without explicit cleanup, these files become orphans on disk.

Contributor


Interesting. Thanks for the clarification. I think you're right. Is this a problem with subcompactions in general, then? For example, if subcompaction B fails due to an IO error, then there's no cleanup of subcompaction A's files. Not saying that needs to be addressed in this PR since it's a separate issue, but is it something to be tracked?

Contributor Author


I did some more investigation around this. There is another function, FindObsoleteFiles, that scans directories to find files not in any live version and cleans them up on compaction or flush failure. We could rely on that for rare failures such as IO errors. For the abort operation, we could switch to that as well. However, it would break resumable compaction, since FindObsoleteFiles does not know whether a compaction is resumable or not.

const uint64_t num_records = c_iter->iter_stats().num_input_records;

// Periodic cron operations: stats update, abort check, and sync points
if (num_records % kCronEvery == kCronEvery - 1) {
Contributor


Nit: Can we avoid the % (or make kCronEvery a power of 2)?

Contributor Author


Will do

// max_subcompactions values
class DBCompactionAbortSubcompactionTest
: public DBCompactionAbortTest,
public ::testing::WithParamInterface<int> {};
Contributor


I would add a comment specifying what exactly the param is for

Contributor Author


ok

ConfigureOptionsForStyle(options, style);
Reopen(options);

// Use larger value size for Universal compaction to ensure compaction work
Contributor


Could you elaborate a bit more? Why wouldn't it work? If not having a specific amount of work breaks timing of the test, it may not be ideal

Contributor Author


I forgot to clean this up. We no longer need this special configuration after tuning the parameters. Removed the specialization.

Contributor

@anand1976 anand1976 left a comment


LGTM

@meta-codesync

meta-codesync bot commented Jan 29, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D91480994.

@meta-codesync

meta-codesync bot commented Jan 30, 2026

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D91480994.

@meta-codesync

meta-codesync bot commented Jan 30, 2026

@xingbowang merged this pull request in 656b734.
