
Commit 0ab989d

Merge pull request #23 from triblespace/233ept-codex/remove-opstamp-and-refactor-indexing-code
Remove operation stamps
2 parents 2c1bcd4 + 23286b9 commit 0ab989d

39 files changed (+178, -7154 lines)

ARCHITECTURE.md

Lines changed: 4 additions & 6 deletions
@@ -28,11 +28,11 @@ Roughly speaking the design is following these guiding principles:
 - Indexing should be O(1) in memory. (In practice it is just sublinear)
 - Search should be as fast as possible

-This comes at the cost of the dynamicity of the index: while it is possible to add, and delete documents from our corpus, the tantivy is designed to handle these updates in large batches.
+This comes at the cost of the dynamicity of the index: the indexer is append-only and does not natively support document deletions.

 ## [core/](src/core): Index, segments, searchers

-Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
+Core contains all of the high-level code to make it possible to create an index, add documents, and commit.

 This is both the most high-level part of tantivy, the least performance-sensitive one, the seemingly most mundane code... And paradoxically the most complicated part.

@@ -56,12 +56,10 @@ For a better idea of how indexing works, you may read the [following blog post](

 ### Deletes

-Deletes happen by deleting a "term". Tantivy does not offer any notion of primary id, so it is up to the user to use a field in their schema as if it was a primary id, and delete the associated term if they want to delete only one specific document.
+Document removal is handled externally by triblespace. The indexer itself only supports adding documents and committing new segments.

 On commit, tantivy will find all of the segments with documents matching this existing term and remove from [alive bitset file](src/fastfield/alive_bitset.rs) that represents the bitset of the alive document ids.
-Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.
-
-An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
+Like all segment files, this file is immutable. The alive bitset filename has the format ```segment_id . del```.

 ### DocId
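
The updated ARCHITECTURE text above reduces the writer to an append-only add/commit cycle. Below is a minimal sketch of that flow, assuming the yeehaw API used by the examples in this commit (`Schema::builder`, `TantivyDocument`, `IndexWriter`); the `title` field and the in-RAM index via `Index::create_in_ram` are illustrative assumptions, not taken from this diff:

```rust
use yeehaw::schema::{Schema, TEXT};
use yeehaw::{Index, IndexWriter, TantivyDocument};

fn main() -> yeehaw::Result<()> {
    // Define a schema with a single text field (illustrative).
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let schema = schema_builder.build();

    // The indexer is append-only: documents can be added and committed,
    // but never deleted through this API.
    let index = Index::create_in_ram(schema);
    let mut index_writer: IndexWriter = index.writer(30_000_000)?;

    let mut doc = TantivyDocument::new();
    doc.add_text(title, "Fried egg");
    index_writer.add_document(doc)?;

    // commit() publishes a new immutable segment; any removal happens
    // outside the indexer (handled externally by triblespace).
    index_writer.commit()?;
    Ok(())
}
```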

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -38,3 +38,6 @@ have been removed to keep the changelog focused on Yeehaw's history.
 - reorder inventory tasks to prioritize fixing doctest regressions.
 - remove `quickwit` feature and associated asynchronous APIs.
 - remove obsolete document type codes.
+- remove delete queue and segment delete tracking; document removal now handled externally by triblespace.
+- remove operation stamp infrastructure in preparation for commit-handle redesign.
+- simplify searcher generation to track segment ids only and drop legacy delete tests.

examples/deleting_updating_documents.rs

Lines changed: 0 additions & 115 deletions
This file was deleted.

examples/faceted_search_with_tweaked_score.rs

Lines changed: 24 additions & 22 deletions
@@ -12,7 +12,7 @@ use std::collections::HashSet;
 use yeehaw::collector::TopDocs;
 use yeehaw::query::BooleanQuery;
 use yeehaw::schema::*;
-use yeehaw::{doc, DocId, Index, IndexWriter, Score, SegmentReader};
+use yeehaw::{DocId, Index, IndexWriter, Score, SegmentReader};

 fn main() -> yeehaw::Result<()> {
     let mut schema_builder = Schema::builder();
@@ -25,27 +25,29 @@ fn main() -> yeehaw::Result<()> {

     let mut index_writer: IndexWriter = index.writer(30_000_000)?;

-    index_writer.add_document(doc!(
-        title => "Fried egg",
-        ingredient => Facet::from("/ingredient/egg"),
-        ingredient => Facet::from("/ingredient/oil"),
-    ))?;
-    index_writer.add_document(doc!(
-        title => "Scrambled egg",
-        ingredient => Facet::from("/ingredient/egg"),
-        ingredient => Facet::from("/ingredient/butter"),
-        ingredient => Facet::from("/ingredient/milk"),
-        ingredient => Facet::from("/ingredient/salt"),
-    ))?;
-    index_writer.add_document(doc!(
-        title => "Egg rolls",
-        ingredient => Facet::from("/ingredient/egg"),
-        ingredient => Facet::from("/ingredient/garlic"),
-        ingredient => Facet::from("/ingredient/salt"),
-        ingredient => Facet::from("/ingredient/oil"),
-        ingredient => Facet::from("/ingredient/tortilla-wrap"),
-        ingredient => Facet::from("/ingredient/mushroom"),
-    ))?;
+    let mut doc = TantivyDocument::new();
+    doc.add_text(title, "Fried egg");
+    doc.add_facet(ingredient, Facet::from("/ingredient/egg"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/oil"));
+    index_writer.add_document(doc)?;
+
+    let mut doc = TantivyDocument::new();
+    doc.add_text(title, "Scrambled egg");
+    doc.add_facet(ingredient, Facet::from("/ingredient/egg"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/butter"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/milk"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/salt"));
+    index_writer.add_document(doc)?;
+
+    let mut doc = TantivyDocument::new();
+    doc.add_text(title, "Egg rolls");
+    doc.add_facet(ingredient, Facet::from("/ingredient/egg"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/garlic"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/salt"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/oil"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/tortilla-wrap"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/mushroom"));
+    index_writer.add_document(doc)?;
     index_writer.commit()?;

     let reader = index.reader()?;

examples/warmer.rs

Lines changed: 9 additions & 9 deletions
@@ -7,8 +7,7 @@ use yeehaw::index::SegmentId;
 use yeehaw::query::QueryParser;
 use yeehaw::schema::{Schema, FAST, TEXT};
 use yeehaw::{
-    doc, DocAddress, DocId, Index, IndexWriter, Opstamp, Searcher, SearcherGeneration,
-    SegmentReader, Warmer,
+    doc, DocAddress, DocId, Index, IndexWriter, Searcher, SearcherGeneration, SegmentReader, Warmer,
 };

 // This example shows how warmers can be used to
@@ -26,7 +25,7 @@ pub trait PriceFetcher: Send + Sync + 'static {
     fn fetch_prices(&self, product_ids: &[ProductId]) -> Vec<Price>;
 }

-type SegmentKey = (SegmentId, Option<Opstamp>);
+type SegmentKey = SegmentId;

 struct DynamicPriceColumn {
     field: String,
@@ -44,8 +43,11 @@ impl DynamicPriceColumn {
     }

     pub fn price_for_segment(&self, segment_reader: &SegmentReader) -> Option<Arc<Vec<Price>>> {
-        let segment_key = (segment_reader.segment_id(), segment_reader.delete_opstamp());
-        self.price_cache.read().unwrap().get(&segment_key).cloned()
+        self.price_cache
+            .read()
+            .unwrap()
+            .get(&segment_reader.segment_id())
+            .cloned()
     }
 }
 impl Warmer for DynamicPriceColumn {
@@ -72,11 +74,10 @@ impl Warmer for DynamicPriceColumn {
                })
                .collect();

-            let key = (segment.segment_id(), segment.delete_opstamp());
            self.price_cache
                .write()
                .unwrap()
-                .insert(key, Arc::new(prices));
+                .insert(segment.segment_id(), Arc::new(prices));
        }

        Ok(())
@@ -85,8 +86,7 @@ impl Warmer for DynamicPriceColumn {
     fn garbage_collect(&self, live_generations: &[&SearcherGeneration]) {
         let live_keys: HashSet<SegmentKey> = live_generations
             .iter()
-            .flat_map(|gen| gen.segments())
-            .map(|(&segment_id, &opstamp)| (segment_id, opstamp))
+            .flat_map(|gen| gen.segments().iter().copied())
             .collect();

         self.price_cache

src/core/mod.rs

Lines changed: 0 additions & 3 deletions
@@ -20,6 +20,3 @@ pub static META_FILEPATH: Lazy<&'static Path> = Lazy::new(|| Path::new("meta.jso
 /// Removing this file is safe, but will prevent the garbage collection of all of the file that
 /// are currently in the directory
 pub static MANAGED_FILEPATH: Lazy<&'static Path> = Lazy::new(|| Path::new(".managed.json"));
-
-#[cfg(test)]
-mod tests;

src/core/searcher.rs

Lines changed: 12 additions & 28 deletions
@@ -1,4 +1,3 @@
-use std::collections::BTreeMap;
 use std::sync::Arc;
 use std::{fmt, io};

@@ -8,27 +7,18 @@ use crate::index::{SegmentId, SegmentReader};
 use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
 use crate::schema::{Schema, Term};
 use crate::space_usage::SearcherSpaceUsage;
-use crate::{Index, Opstamp, TrackedObject};
+use crate::{Index, TrackedObject};

 /// Identifies the searcher generation accessed by a [`Searcher`].
 ///
-/// While this might seem redundant, a [`SearcherGeneration`] contains
-/// both a `generation_id` AND a list of `(SegmentId, DeleteOpstamp)`.
-///
-/// This is on purpose. This object is used by the [`Warmer`](crate::reader::Warmer) API.
-/// Having both information makes it possible to identify which
-/// artifact should be refreshed or garbage collected.
+/// Identifies the searcher generation accessed by a [`Searcher`].
 ///
-/// Depending on the use case, `Warmer`'s implementers can decide to
-/// produce artifacts per:
-/// - `generation_id` (e.g. some searcher level aggregates)
-/// - `(segment_id, delete_opstamp)` (e.g. segment level aggregates)
-/// - `segment_id` (e.g. for immutable document level information)
-/// - `(generation_id, segment_id)` (e.g. for consistent dynamic column)
-/// - ...
+/// This object is used by the [`Warmer`](crate::reader::Warmer) API to tie
+/// external resources to the set of segments currently loaded by the
+/// searcher.
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
 pub struct SearcherGeneration {
-    segments: BTreeMap<SegmentId, Option<Opstamp>>,
+    segments: Vec<SegmentId>,
     generation_id: u64,
 }

@@ -37,13 +27,9 @@ impl SearcherGeneration {
         segment_readers: &[SegmentReader],
         generation_id: u64,
     ) -> Self {
-        let mut segment_id_to_del_opstamp = BTreeMap::new();
-        for segment_reader in segment_readers {
-            segment_id_to_del_opstamp
-                .insert(segment_reader.segment_id(), segment_reader.delete_opstamp());
-        }
+        let segments = segment_readers.iter().map(|s| s.segment_id()).collect();
         Self {
-            segments: segment_id_to_del_opstamp,
+            segments,
             generation_id,
         }
     }
@@ -53,8 +39,8 @@ impl SearcherGeneration {
         self.generation_id
     }

-    /// Return a `(SegmentId -> DeleteOpstamp)` mapping.
-    pub fn segments(&self) -> &BTreeMap<SegmentId, Option<Opstamp>> {
+    /// Return the list of segment ids referenced by this generation.
+    pub fn segments(&self) -> &[SegmentId] {
         &self.segments
     }
 }
@@ -222,11 +208,9 @@ impl SearcherInner {
         segment_readers: Vec<SegmentReader>,
         generation: TrackedObject<SearcherGeneration>,
     ) -> io::Result<SearcherInner> {
+        let expected: Vec<_> = segment_readers.iter().map(|r| r.segment_id()).collect();
         assert_eq!(
-            &segment_readers
-                .iter()
-                .map(|reader| (reader.segment_id(), reader.delete_opstamp()))
-                .collect::<BTreeMap<_, _>>(),
+            &expected,
             generation.segments(),
             "Set of segments referenced by this Searcher and its SearcherGeneration must match"
         );
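
With `SearcherGeneration::segments()` now returning a plain `&[SegmentId]`, per-segment artifacts are keyed by segment id alone, as the warmer example above does. The following standalone sketch shows how a consumer might garbage-collect such a cache against the live generations; the cache shape and the `garbage_collect_cache` helper are illustrative, not part of this commit:

```rust
use std::collections::{HashMap, HashSet};

use yeehaw::index::SegmentId;
use yeehaw::SearcherGeneration;

/// Drop cached per-segment values whose segment is no longer referenced
/// by any live searcher generation.
fn garbage_collect_cache<V>(
    cache: &mut HashMap<SegmentId, V>,
    live_generations: &[&SearcherGeneration],
) {
    // segments() returns the generation's segment ids as a slice.
    let live: HashSet<SegmentId> = live_generations
        .iter()
        .flat_map(|generation| generation.segments().iter().copied())
        .collect();
    // Keep only entries whose segment is still referenced.
    cache.retain(|segment_id, _| live.contains(segment_id));
}
```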
