Fix: Write .liv files to disk for segment replication and remote store #20573
cuonghm2809 wants to merge 1 commit into opensearch-project:main
In segment replication and remote store scenarios, deletion information must be persisted to .liv (live documents) files on disk so they can be replicated to replica shards. Without this fix, replicas lack deletion metadata, causing NullPointerException during background merge operations when processing k-NN vectors.

Root Cause:
- DirectoryReader.open() was called with single parameter (IndexWriter only)
- This defaults to writeAllDeletes=false, keeping deletions in memory only
- Segment replication copies only disk files, missing .liv files
- During merge, Lucene assumes all docs are live (no deletion info)
- k-NN merge tries to access deleted document vectors -> NPE

The Fix:
- Set writeAllDeletes=true for segment replication and remote store
- Forces .liv files to be written to disk during refresh
- Replicas receive complete deletion information
- Background merge correctly skips deleted documents

Impact:
- Affects: SEGMENT replication + k-NN vectors + document deletions
- Severity: Index becomes RED, requires manual shard restoration
- Fix: Aligns with Lucene's intended behavior for file-based replication

Related: Issue opensearch-project#20572

Signed-off-by: Cuong Ha <cuonghm2809@gmail.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #20573 +/- ##
============================================
+ Coverage 73.25% 73.33% +0.07%
- Complexity 72103 72170 +67
============================================
Files 5798 5798
Lines 329732 329757 +25
Branches 47519 47524 +5
============================================
+ Hits 241554 241830 +276
+ Misses 68805 68516 -289
- Partials 19373 19411 +38
This PR is stalled because it has been open for 30 days with no activity.
Description
This PR fixes a critical bug in segment replication and remote store that causes `NullPointerException` during background merge operations when processing k-NN vector fields with document deletions.

Problem
When using SEGMENT replication or remote store with k-NN vectors, the index becomes RED when a background merge fails with a `NullPointerException`.

Root Cause
`DirectoryReader.open()` is called with only one parameter (`IndexWriter`), which defaults to `writeAllDeletes=false`. This keeps deletion information in memory only and does NOT write `.liv` (live documents) files to disk.

Impact on Segment Replication:
- Deletions stay in memory (no `.liv` files written to disk)
- Replication copies the on-disk segment files (`.vec`, `.cfs`, etc.) but NOT `.liv` files

The Fix
Set `writeAllDeletes=true` when opening `DirectoryReader` for segment replication and remote store. This ensures `.liv` files are written to disk during refresh and replicated to replica shards.

Impact
Affected Users:
Severity: HIGH
Related Issues
Additional Notes
This fix aligns OpenSearch's behavior with Lucene's design intent: when using file-based replication (segment replication, remote store), deletion information must be persisted to disk via `.liv` files. The current default (`writeAllDeletes=false`) is optimized for document replication where each replica independently manages deletions.

The fix applies to:
- `index.replication.type=SEGMENT`

Backward Compatibility: This change only affects segment replication and remote store scenarios. Document replication (default) behavior is unchanged.
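
To make the mechanism described in this PR easier to follow, here is a toy model in plain Java. It is not Lucene, k-NN, or OpenSearch code: `Primary`, `replicate`, `mergeOnReplica`, and `shouldWriteAllDeletes` are hypothetical names invented for this sketch, and the way a deleted document's vector becomes unreadable (modeled as a `null` entry) is an assumption, since the PR text does not show the actual k-NN code path. The sketch only illustrates the reasoning: file-based replication sees what is on disk, so a merge on the replica treats every document as live unless the live-docs bitmap was written out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy model only; all names here are hypothetical and not part of any real API.
class LivReplicationSketch {

    /** Toy primary shard: files on "disk" plus deletions tracked in memory. */
    static final class Primary {
        final Map<String, Object> disk = new HashMap<>();
        final float[][] vectors;
        final boolean[] liveInMemory;

        Primary(float[][] input) {
            vectors = input.clone();          // shallow copy of the outer array
            disk.put("_0.vec", vectors);
            liveInMemory = new boolean[vectors.length];
            Arrays.fill(liveInMemory, true);
        }

        void delete(int doc) {
            liveInMemory[doc] = false;
            // Assumption for illustration: a deleted doc's vector cannot be read back.
            vectors[doc] = null;
        }

        /** Models a refresh: only writeAllDeletes=true persists the live-docs bitmap. */
        void refresh(boolean writeAllDeletes) {
            if (writeAllDeletes) {
                disk.put("_0.liv", liveInMemory.clone());
            }
        }
    }

    /** File-based replication: the replica receives only what is on the primary's disk. */
    static Map<String, Object> replicate(Primary primary) {
        return new HashMap<>(primary.disk);
    }

    /** Toy merge: counts merged docs; with no .liv file, every doc is assumed live. */
    static int mergeOnReplica(Map<String, Object> replicaDisk) {
        float[][] vectors = (float[][]) replicaDisk.get("_0.vec");
        boolean[] live = (boolean[]) replicaDisk.get("_0.liv");
        int mergedDocs = 0;
        for (int doc = 0; doc < vectors.length; doc++) {
            if (live != null && !live[doc]) {
                continue; // deletion info available: skip the deleted doc
            }
            int dims = vectors[doc].length; // throws NPE if a deleted doc is treated as live
            if (dims > 0) {
                mergedDocs++;
            }
        }
        return mergedDocs;
    }

    /** The condition the PR describes: persist deletes for file-based replication. */
    static boolean shouldWriteAllDeletes(boolean segmentReplication, boolean remoteStore) {
        return segmentReplication || remoteStore;
    }

    public static void main(String[] args) {
        Primary primary = new Primary(new float[][] {{1f, 2f}, {3f, 4f}, {5f, 6f}});
        primary.delete(1);

        primary.refresh(false); // deletions stay in memory only
        try {
            mergeOnReplica(replicate(primary));
        } catch (NullPointerException e) {
            System.out.println("merge without .liv failed with NullPointerException");
        }

        primary.refresh(shouldWriteAllDeletes(true, false)); // bitmap written to disk
        System.out.println("docs merged with .liv: " + mergeOnReplica(replicate(primary)));
    }
}
```

In the real code the equivalent switch is the `writeAllDeletes` argument passed when opening the `DirectoryReader` from the `IndexWriter`; the toy `refresh(boolean)` stands in for that call and nothing more.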