Fix PartitionDeleted error during index rebuild by handling stale partition references #2
… partitions Co-authored-by: anidotnet <696662+anidotnet@users.noreply.github.com>
… issues Co-authored-by: anidotnet <696662+anidotnet@users.noreply.github.com>
…mentation Co-authored-by: anidotnet <696662+anidotnet@users.noreply.github.com>
Pull request overview
This PR fixes a PartitionDeleted error that occurs during index rebuild operations when Fjall partitions are deleted and immediately recreated, causing stale internal references.
Key Changes:
- Added open_partition_with_retry() helper function with retry logic and delays for handling deleted partition scenarios
- Refactored open_map() to delegate to the retry helper instead of directly calling open_partition
- Enhanced remove_map() error handling with debug logging for already-deleted partitions
    // If the retry also fails, wait a moment for file system cleanup
    log::debug!("First retry failed, waiting briefly for cleanup: {}", retry_err);

    // Block briefly to allow Fjall's file system cleanup to complete
    std::thread::sleep(std::time::Duration::from_millis(
        Self::PARTITION_CLEANUP_DELAY_MS
    ));

    // Final attempt
    ks.open_partition(name, config_clone)
        .map_err(|e| {
            log::error!("Failed to recreate partition '{}' after retries: {}", name, e);
            to_nitrite_error(e)
        })
Copilot
AI
Dec 15, 2025
The retry logic unconditionally retries after sleep regardless of the error type. If the second attempt (line 290) fails for a reason other than PartitionDeleted (e.g., permission error, disk full), the code still sleeps and retries. Consider checking if retry_err is also a PartitionDeleted error before sleeping, or at minimum logging the error type to help diagnose non-transient failures.
Suggested change
-   // If the retry also fails, wait a moment for file system cleanup
-   log::debug!("First retry failed, waiting briefly for cleanup: {}", retry_err);
-   // Block briefly to allow Fjall's file system cleanup to complete
-   std::thread::sleep(std::time::Duration::from_millis(
-       Self::PARTITION_CLEANUP_DELAY_MS
-   ));
-   // Final attempt
-   ks.open_partition(name, config_clone)
-       .map_err(|e| {
-           log::error!("Failed to recreate partition '{}' after retries: {}", name, e);
-           to_nitrite_error(e)
-       })
+   let retry_err_msg = retry_err.to_string();
+   if Self::is_partition_deleted_error(&retry_err_msg) {
+       log::debug!("First retry failed with PartitionDeleted, waiting briefly for cleanup: {}", retry_err);
+       // Block briefly to allow Fjall's file system cleanup to complete
+       std::thread::sleep(std::time::Duration::from_millis(
+           Self::PARTITION_CLEANUP_DELAY_MS
+       ));
+       // Final attempt
+       ks.open_partition(name, config_clone)
+           .map_err(|e| {
+               log::error!("Failed to recreate partition '{}' after retries: {}", name, e);
+               to_nitrite_error(e)
+           })
+   } else {
+       log::error!(
+           "Failed to recreate partition '{}' on retry: {} (error type: {})",
+           name, retry_err, retry_err_msg
+       );
+       Err(to_nitrite_error(retry_err))
+   }
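The suggested change classifies the failure by inspecting its message string. As a rough illustration of that check, here is a minimal sketch that puts a string-based test next to a variant-based one. `StoreError` and both helper functions are hypothetical stand-ins written for this example; they are not Fjall's error type or the adapter's actual `is_partition_deleted_error`.

```rust
use std::fmt;

/// Hypothetical stand-in for the storage engine's error type (an assumption,
/// not Fjall's real error enum).
#[derive(Debug)]
enum StoreError {
    PartitionDeleted,
    Io(String),
}

impl fmt::Display for StoreError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StoreError::PartitionDeleted => write!(f, "PartitionDeleted"),
            StoreError::Io(msg) => write!(f, "I/O error: {}", msg),
        }
    }
}

/// String-based check, in the style used by the suggested change.
/// It only looks at the rendered message.
fn is_partition_deleted_error(msg: &str) -> bool {
    msg.contains("PartitionDeleted")
}

/// Variant-based check: possible only when the concrete error type is
/// available at the call site.
fn is_partition_deleted(err: &StoreError) -> bool {
    matches!(err, StoreError::PartitionDeleted)
}
```

The string check works on any error that implements `Display`, but it can also match an unrelated error whose message happens to contain the same text. Matching on the variant avoids that, at the cost of depending on the concrete error type.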
    ///
    /// This handles the case where a partition was deleted (e.g., during index rebuild)
    /// and needs to be recreated. When Fjall reports a partition is deleted, we simply
    /// try to open it again, which will create a new partition.
    ///
Copilot
AI
Dec 15, 2025
The documentation states "try to open it again" but the implementation actually tries to open it twice more (line 290 and line 302), for a total of 3 attempts. The documentation should clarify that up to 3 total attempts are made: initial attempt, immediate retry, and retry after delay.
Suggested change
-   ///
-   /// This handles the case where a partition was deleted (e.g., during index rebuild)
-   /// and needs to be recreated. When Fjall reports a partition is deleted, we simply
-   /// try to open it again, which will create a new partition.
-   ///
+   ///
+   /// This handles the case where a partition was deleted (e.g., during index rebuild)
+   /// and needs to be recreated. When Fjall reports a partition is deleted, this method
+   /// will attempt to open the partition up to three times: the initial attempt, an immediate
+   /// retry if the partition was deleted, and a final retry after a brief delay if the second
+   /// attempt also fails. This ensures that transient file system cleanup issues are handled
+   /// gracefully and a new partition is created if needed.
+   ///
    // If the retry also fails, wait a moment for file system cleanup
    log::debug!("First retry failed, waiting briefly for cleanup: {}", retry_err);

    // Block briefly to allow Fjall's file system cleanup to complete
    std::thread::sleep(std::time::Duration::from_millis(
        Self::PARTITION_CLEANUP_DELAY_MS
    ));

    // Final attempt
    ks.open_partition(name, config_clone)
        .map_err(|e| {
            log::error!("Failed to recreate partition '{}' after retries: {}", name, e);
            to_nitrite_error(e)
        })
Copilot
AI
Dec 15, 2025
The immediate retry attempt (line 290) doesn't check if the error is a PartitionDeleted error before proceeding to the delayed retry. This means even non-transient errors (e.g., permission denied, invalid configuration) will trigger the sleep-and-retry path. Consider checking if retry_err indicates a transient PartitionDeleted error before proceeding to the delayed retry, otherwise fail fast for unrelated errors.
Suggested change
-   // If the retry also fails, wait a moment for file system cleanup
-   log::debug!("First retry failed, waiting briefly for cleanup: {}", retry_err);
-   // Block briefly to allow Fjall's file system cleanup to complete
-   std::thread::sleep(std::time::Duration::from_millis(
-       Self::PARTITION_CLEANUP_DELAY_MS
-   ));
-   // Final attempt
-   ks.open_partition(name, config_clone)
-       .map_err(|e| {
-           log::error!("Failed to recreate partition '{}' after retries: {}", name, e);
-           to_nitrite_error(e)
-       })
+   // Only proceed to delayed retry if this is still a PartitionDeleted error
+   let retry_err_msg = retry_err.to_string();
+   if Self::is_partition_deleted_error(&retry_err_msg) {
+       log::debug!("First retry failed with PartitionDeleted, waiting briefly for cleanup: {}", retry_err);
+       // Block briefly to allow Fjall's file system cleanup to complete
+       std::thread::sleep(std::time::Duration::from_millis(
+           Self::PARTITION_CLEANUP_DELAY_MS
+       ));
+       // Final attempt
+       ks.open_partition(name, config_clone)
+           .map_err(|e| {
+               log::error!("Failed to recreate partition '{}' after retries: {}", name, e);
+               to_nitrite_error(e)
+           })
+   } else {
+       log::error!("Failed to recreate partition '{}' on retry: {}", name, retry_err);
+       Err(to_nitrite_error(retry_err))
+   }
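To make the overall control flow of the fail-fast variant easier to follow, here is a self-contained sketch of the three-attempt pattern the review describes: initial attempt, immediate retry, then one delayed retry, with any error other than PartitionDeleted returned straight away. `OpenError` and `open_with_retry` are hypothetical names for this illustration, the closure stands in for the real `open_partition` call, and the 50 ms constant is taken from the PR description. Failing fast on the very first attempt is an assumption here, since that part of the helper is not shown in the diff.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch (not Fjall's real error enum).
#[derive(Debug, Clone, PartialEq)]
enum OpenError {
    PartitionDeleted,
    Other(String),
}

/// Delay before the final attempt; 50 ms follows the PR description.
const PARTITION_CLEANUP_DELAY_MS: u64 = 50;

/// Calls `open` up to three times. Only `PartitionDeleted` is treated as
/// transient; every other error is returned immediately.
fn open_with_retry<T, F>(mut open: F) -> Result<T, OpenError>
where
    F: FnMut() -> Result<T, OpenError>,
{
    // Attempt 1: the normal path.
    match open() {
        Ok(partition) => return Ok(partition),
        Err(OpenError::PartitionDeleted) => {}
        Err(other) => return Err(other),
    }

    // Attempt 2: immediate retry, relying on create-if-not-exists behavior.
    match open() {
        Ok(partition) => return Ok(partition),
        Err(OpenError::PartitionDeleted) => {}
        Err(other) => return Err(other),
    }

    // Attempt 3: give pending cleanup a moment, then try one last time.
    thread::sleep(Duration::from_millis(PARTITION_CLEANUP_DELAY_MS));
    open()
}
```

With this shape, a permission or disk-full error on the second attempt never reaches the sleep, which is the behavior both review comments ask for.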
    // Clone config once to avoid multiple clones in retry paths
    let config_clone = config.clone();

    match ks.open_partition(name, config_clone.clone()) {
Copilot
AI
Dec 15, 2025
The config parameter is cloned unnecessarily. Since config is already a reference, you can use it directly in the open_partition calls. The intermediate config_clone variable and its subsequent clones add overhead without providing any benefit. Consider removing line 274 and using config.clone() directly in the open_partition calls.
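The cost difference can be checked with a small experiment. The sketch below counts clones for the two shapes, assuming the open call takes its configuration by value. `Config`, `open_partition`, and the two wrapper functions are hypothetical stand-ins written for this example, reduced to a single retry for brevity.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical config type that counts how often it is cloned.
struct Config {
    clones: Rc<Cell<u32>>,
}

impl Clone for Config {
    fn clone(&self) -> Self {
        self.clones.set(self.clones.get() + 1);
        Config { clones: Rc::clone(&self.clones) }
    }
}

/// Stand-in for an open call that takes the config by value (an assumption).
fn open_partition(_name: &str, _config: Config) -> Result<(), String> {
    Ok(())
}

/// Shape from the diff: clone up front, then clone again for the first attempt.
fn open_with_intermediate_clone(name: &str, config: &Config) -> Result<(), String> {
    let config_clone = config.clone();
    match open_partition(name, config_clone.clone()) {
        Ok(p) => Ok(p),
        // The retry consumes the intermediate clone.
        Err(_) => open_partition(name, config_clone),
    }
}

/// Shape from the review comment: clone from the reference at each call.
fn open_with_direct_clone(name: &str, config: &Config) -> Result<(), String> {
    match open_partition(name, config.clone()) {
        Ok(p) => Ok(p),
        Err(_) => open_partition(name, config.clone()),
    }
}

/// Runs one successful open and reports how many clones it took.
fn clones_on_success(open: fn(&str, &Config) -> Result<(), String>) -> u32 {
    let counter = Rc::new(Cell::new(0));
    let config = Config { clones: Rc::clone(&counter) };
    open("idx", &config).unwrap();
    counter.get()
}
```

In this sketch the success path costs two clones with the intermediate variable and one without it. When every attempt fails, both shapes clone once per attempt, so the saving applies to the common case where the first open succeeds.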
During rebuild_index, Fjall partitions are deleted then immediately recreated. The open_partition call fails with PartitionDeleted because Fjall retains stale internal references to the deleted partition.
Changes
- open_partition_with_retry(): Handles PartitionDeleted errors by retrying with a 50ms delay for file system cleanup
- open_map(): Delegates to retry helper instead of direct open_partition call
- remove_map(): Added debug logging for already-deleted partitions
Implementation
The retry mechanism accounts for the timing window where Fjall's partition deletion has completed but internal cleanup is still in progress. Fjall's open_partition has create-if-not-exists semantics, so subsequent calls create a fresh partition.
Original prompt
Root Cause Analysis
The current fix in PR #1 addresses the wrong part of the code flow. The actual issue is:
1. During rebuild_index, the indexer calls drop_index() which deletes the Fjall partition and removes it from the registry
2. write_index_entry() is called to rebuild the index
3. write_index_entry() tries to find the index, doesn't find it (because it was just dropped), and calls create_nitrite_index()
4. create_nitrite_index() creates a new SimpleIndex or CompoundIndex, which tries to open the map from the store
5. The open_map() method tries to open a partition that was just deleted
6. This raises a PartitionDeleted error because, on open_partition(), Fjall detects it was deleted and throws the error
The Real Fix Needed
The issue is that Fjall's open_partition() should handle the case where a partition was deleted and needs to be recreated. Currently, when you try to open a deleted partition, it throws PartitionDeleted instead of creating a new one.
We need to modify the Fjall adapter's open_map function to handle the case where open_partition fails with PartitionDeleted error.
Implementation
Update nitrite-fjall-adapter/src/store.rs in the open_map function.
Additional Improvements
Also ensure that remove_map properly handles already-deleted partitions without propagating errors.
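One way to read the remove_map requirement is that an "already deleted" result counts as success while other failures still propagate. The sketch below illustrates that idea with an in-memory stand-in. `Store`, `RemoveError`, and `delete_partition` are hypothetical names for this example and do not reflect the adapter's real types or Fjall's API; `eprintln!` stands in for the debug logging mentioned in the PR.

```rust
use std::collections::HashSet;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum RemoveError {
    PartitionDeleted,
    Other(String),
}

/// In-memory stand-in for the adapter's store (an assumption).
struct Store {
    live: HashSet<String>,     // partitions the backend still holds
    registry: HashSet<String>, // maps the adapter has handed out
}

impl Store {
    fn new(names: &[&str]) -> Self {
        let set: HashSet<String> = names.iter().map(|n| n.to_string()).collect();
        Store { live: set.clone(), registry: set }
    }

    /// Hypothetical backend call: fails if the partition is already gone.
    fn delete_partition(&mut self, name: &str) -> Result<(), RemoveError> {
        if name.is_empty() {
            return Err(RemoveError::Other("empty partition name".to_string()));
        }
        if self.live.remove(name) {
            Ok(())
        } else {
            Err(RemoveError::PartitionDeleted)
        }
    }

    /// Treats "already deleted" as success; other errors still propagate.
    fn remove_map(&mut self, name: &str) -> Result<(), RemoveError> {
        match self.delete_partition(name) {
            Ok(()) => {}
            Err(RemoveError::PartitionDeleted) => {
                // Stand-in for a debug-level log line.
                eprintln!("partition '{}' was already deleted", name);
            }
            Err(other) => return Err(other),
        }
        // Drop the registry entry in both cases so no stale handle remains.
        self.registry.remove(name);
        Ok(())
    }
}
```

Calling `remove_map` twice on the same name succeeds both times in this sketch, which makes the operation idempotent from the caller's point of view.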