Merged
10 changes: 4 additions & 6 deletions ARCHITECTURE.md
@@ -28,11 +28,11 @@ Roughly speaking the design is following these guiding principles:
 - Indexing should be O(1) in memory. (In practice it is just sublinear)
 - Search should be as fast as possible
 
-This comes at the cost of the dynamicity of the index: while it is possible to add, and delete documents from our corpus, the tantivy is designed to handle these updates in large batches.
+This comes at the cost of the dynamicity of the index: the indexer is append-only and does not natively support document deletions.
 
 ## [core/](src/core): Index, segments, searchers
 
-Core contains all of the high-level code to make it possible to create an index, add documents, delete documents and commit.
+Core contains all of the high-level code to make it possible to create an index, add documents, and commit.
 
 This is both the most high-level part of tantivy, the least performance-sensitive one, the seemingly most mundane code... And paradoxically the most complicated part.
 
@@ -56,12 +56,10 @@ For a better idea of how indexing works, you may read the [following blog post](
 
 ### Deletes
 
-Deletes happen by deleting a "term". Tantivy does not offer any notion of primary id, so it is up to the user to use a field in their schema as if it was a primary id, and delete the associated term if they want to delete only one specific document.
+Document removal is handled externally by triblespace. The indexer itself only supports adding documents and committing new segments.
 
-On commit, tantivy will find all of the segments with documents matching this existing term and remove from [alive bitset file](src/fastfield/alive_bitset.rs) that represents the bitset of the alive document ids.
-Like all segment files, this file is immutable. Because it is possible to have more than one alive bitset file at a given instant, the alive bitset filename has the format ```segment_id . commit_opstamp . del```.
-
-An opstamp is simply an incremental id that identifies any operation applied to the index. For instance, performing a commit or adding a document.
+Like all segment files, the [alive bitset file](src/fastfield/alive_bitset.rs) is immutable. Its filename has the format ```segment_id . del```.
 
 ### DocId
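To make the simplified naming scheme in the Deletes hunk concrete, here is a minimal sketch of how a per-segment alive-bitset path would be derived once the `commit_opstamp` component is gone. The helper name and the string-typed segment id are illustrative assumptions, not APIs introduced by this PR:

```rust
use std::path::PathBuf;

// Sketch only: derives the alive-bitset filename for a segment under the
// new scheme. Before this change the path embedded a commit opstamp
// ("{segment_id}.{commit_opstamp}.del"); with opstamps removed there is
// at most one alive-bitset file per segment.
fn alive_bitset_path(segment_id: &str) -> PathBuf {
    PathBuf::from(format!("{segment_id}.del"))
}
```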
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -38,3 +38,6 @@ have been removed to keep the changelog focused on Yeehaw's history.
 - reorder inventory tasks to prioritize fixing doctest regressions.
 - remove `quickwit` feature and associated asynchronous APIs.
 - remove obsolete document type codes.
+- remove delete queue and segment delete tracking; document removal now handled externally by triblespace.
+- remove operation stamp infrastructure in preparation for commit-handle redesign.
+- simplify searcher generation to track segment ids only and drop legacy delete tests.
115 changes: 0 additions & 115 deletions examples/deleting_updating_documents.rs

This file was deleted.

46 changes: 24 additions & 22 deletions examples/faceted_search_with_tweaked_score.rs
@@ -12,7 +12,7 @@ use std::collections::HashSet;
 use yeehaw::collector::TopDocs;
 use yeehaw::query::BooleanQuery;
 use yeehaw::schema::*;
-use yeehaw::{doc, DocId, Index, IndexWriter, Score, SegmentReader};
+use yeehaw::{DocId, Index, IndexWriter, Score, SegmentReader};
 
 fn main() -> yeehaw::Result<()> {
     let mut schema_builder = Schema::builder();
@@ -25,27 +25,29 @@ fn main() -> yeehaw::Result<()> {
 
     let mut index_writer: IndexWriter = index.writer(30_000_000)?;
 
-    index_writer.add_document(doc!(
-        title => "Fried egg",
-        ingredient => Facet::from("/ingredient/egg"),
-        ingredient => Facet::from("/ingredient/oil"),
-    ))?;
-    index_writer.add_document(doc!(
-        title => "Scrambled egg",
-        ingredient => Facet::from("/ingredient/egg"),
-        ingredient => Facet::from("/ingredient/butter"),
-        ingredient => Facet::from("/ingredient/milk"),
-        ingredient => Facet::from("/ingredient/salt"),
-    ))?;
-    index_writer.add_document(doc!(
-        title => "Egg rolls",
-        ingredient => Facet::from("/ingredient/egg"),
-        ingredient => Facet::from("/ingredient/garlic"),
-        ingredient => Facet::from("/ingredient/salt"),
-        ingredient => Facet::from("/ingredient/oil"),
-        ingredient => Facet::from("/ingredient/tortilla-wrap"),
-        ingredient => Facet::from("/ingredient/mushroom"),
-    ))?;
+    let mut doc = TantivyDocument::new();
+    doc.add_text(title, "Fried egg");
+    doc.add_facet(ingredient, Facet::from("/ingredient/egg"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/oil"));
+    index_writer.add_document(doc)?;
+
+    let mut doc = TantivyDocument::new();
+    doc.add_text(title, "Scrambled egg");
+    doc.add_facet(ingredient, Facet::from("/ingredient/egg"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/butter"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/milk"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/salt"));
+    index_writer.add_document(doc)?;
+
+    let mut doc = TantivyDocument::new();
+    doc.add_text(title, "Egg rolls");
+    doc.add_facet(ingredient, Facet::from("/ingredient/egg"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/garlic"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/salt"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/oil"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/tortilla-wrap"));
+    doc.add_facet(ingredient, Facet::from("/ingredient/mushroom"));
+    index_writer.add_document(doc)?;
     index_writer.commit()?;
 
     let reader = index.reader()?;
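The diff above trades the `doc!` macro for explicit `TantivyDocument` construction. For repeated documents the same pattern can be factored into a small helper; this is a hypothetical convenience for illustration, not something this PR adds:

```rust
use yeehaw::schema::{Facet, Field, TantivyDocument};

// Hypothetical helper: builds a recipe document from a title and a list
// of ingredient facet paths, mirroring the explicit construction above.
fn recipe_doc(title: Field, ingredient: Field, name: &str, facets: &[&str]) -> TantivyDocument {
    let mut doc = TantivyDocument::new();
    doc.add_text(title, name);
    for path in facets {
        doc.add_facet(ingredient, Facet::from(*path));
    }
    doc
}
```

A call like `recipe_doc(title, ingredient, "Fried egg", &["/ingredient/egg", "/ingredient/oil"])` would then replace each hand-built block.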
18 changes: 9 additions & 9 deletions examples/warmer.rs
@@ -7,8 +7,7 @@ use yeehaw::index::SegmentId;
 use yeehaw::query::QueryParser;
 use yeehaw::schema::{Schema, FAST, TEXT};
 use yeehaw::{
-    doc, DocAddress, DocId, Index, IndexWriter, Opstamp, Searcher, SearcherGeneration,
-    SegmentReader, Warmer,
+    doc, DocAddress, DocId, Index, IndexWriter, Searcher, SearcherGeneration, SegmentReader, Warmer,
 };
 
 // This example shows how warmers can be used to
@@ -26,7 +25,7 @@ pub trait PriceFetcher: Send + Sync + 'static {
     fn fetch_prices(&self, product_ids: &[ProductId]) -> Vec<Price>;
 }
 
-type SegmentKey = (SegmentId, Option<Opstamp>);
+type SegmentKey = SegmentId;
 
 struct DynamicPriceColumn {
     field: String,
@@ -44,8 +43,11 @@ impl DynamicPriceColumn {
     }
 
     pub fn price_for_segment(&self, segment_reader: &SegmentReader) -> Option<Arc<Vec<Price>>> {
-        let segment_key = (segment_reader.segment_id(), segment_reader.delete_opstamp());
-        self.price_cache.read().unwrap().get(&segment_key).cloned()
+        self.price_cache
+            .read()
+            .unwrap()
+            .get(&segment_reader.segment_id())
+            .cloned()
     }
 }
 impl Warmer for DynamicPriceColumn {
@@ -72,11 +74,10 @@ impl Warmer for DynamicPriceColumn {
                 })
                 .collect();
 
-            let key = (segment.segment_id(), segment.delete_opstamp());
             self.price_cache
                 .write()
                 .unwrap()
-                .insert(key, Arc::new(prices));
+                .insert(segment.segment_id(), Arc::new(prices));
         }
 
         Ok(())
@@ -85,8 +86,7 @@ impl Warmer for DynamicPriceColumn {
     fn garbage_collect(&self, live_generations: &[&SearcherGeneration]) {
         let live_keys: HashSet<SegmentKey> = live_generations
             .iter()
-            .flat_map(|gen| gen.segments())
-            .map(|(&segment_id, &opstamp)| (segment_id, opstamp))
+            .flat_map(|gen| gen.segments().iter().copied())
             .collect();
 
         self.price_cache
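With `SegmentKey` collapsed to a bare `SegmentId`, cache maintenance in the warmer reduces to set membership on segment ids. A minimal sketch of that shape, using a string stand-in for the id type so the snippet stays self-contained (an assumption, not the example's real types):

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};

// Sketch: a price cache keyed by segment id alone, as in the example above.
struct PriceCache {
    inner: RwLock<HashMap<String, Arc<Vec<u64>>>>,
}

impl PriceCache {
    // Evict entries whose segment no longer appears in any live generation.
    fn garbage_collect(&self, live: &HashSet<String>) {
        self.inner
            .write()
            .unwrap()
            .retain(|segment_id, _| live.contains(segment_id));
    }
}
```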
3 changes: 0 additions & 3 deletions src/core/mod.rs
@@ -20,6 +20,3 @@ pub static META_FILEPATH: Lazy<&'static Path> = Lazy::new(|| Path::new("meta.jso
 /// Removing this file is safe, but will prevent the garbage collection of all of the file that
 /// are currently in the directory
 pub static MANAGED_FILEPATH: Lazy<&'static Path> = Lazy::new(|| Path::new(".managed.json"));
-
-#[cfg(test)]
-mod tests;
40 changes: 12 additions & 28 deletions src/core/searcher.rs
@@ -1,4 +1,3 @@
-use std::collections::BTreeMap;
 use std::sync::Arc;
 use std::{fmt, io};
 
@@ -8,27 +7,18 @@ use crate::index::{SegmentId, SegmentReader};
 use crate::query::{Bm25StatisticsProvider, EnableScoring, Query};
 use crate::schema::{Schema, Term};
 use crate::space_usage::SearcherSpaceUsage;
-use crate::{Index, Opstamp, TrackedObject};
+use crate::{Index, TrackedObject};
 
 /// Identifies the searcher generation accessed by a [`Searcher`].
 ///
-/// While this might seem redundant, a [`SearcherGeneration`] contains
-/// both a `generation_id` AND a list of `(SegmentId, DeleteOpstamp)`.
-///
-/// This is on purpose. This object is used by the [`Warmer`](crate::reader::Warmer) API.
-/// Having both information makes it possible to identify which
-/// artifact should be refreshed or garbage collected.
-///
-/// Depending on the use case, `Warmer`'s implementers can decide to
-/// produce artifacts per:
-/// - `generation_id` (e.g. some searcher level aggregates)
-/// - `(segment_id, delete_opstamp)` (e.g. segment level aggregates)
-/// - `segment_id` (e.g. for immutable document level information)
-/// - `(generation_id, segment_id)` (e.g. for consistent dynamic column)
-/// - ...
+/// This object is used by the [`Warmer`](crate::reader::Warmer) API to tie
+/// external resources to the set of segments currently loaded by the
+/// searcher.
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
 pub struct SearcherGeneration {
-    segments: BTreeMap<SegmentId, Option<Opstamp>>,
+    segments: Vec<SegmentId>,
     generation_id: u64,
 }
 
@@ -37,13 +27,9 @@ impl SearcherGeneration {
         segment_readers: &[SegmentReader],
         generation_id: u64,
     ) -> Self {
-        let mut segment_id_to_del_opstamp = BTreeMap::new();
-        for segment_reader in segment_readers {
-            segment_id_to_del_opstamp
-                .insert(segment_reader.segment_id(), segment_reader.delete_opstamp());
-        }
+        let segments = segment_readers.iter().map(|s| s.segment_id()).collect();
         Self {
-            segments: segment_id_to_del_opstamp,
+            segments,
             generation_id,
         }
     }
@@ -53,8 +39,8 @@
         self.generation_id
     }
 
-    /// Return a `(SegmentId -> DeleteOpstamp)` mapping.
-    pub fn segments(&self) -> &BTreeMap<SegmentId, Option<Opstamp>> {
+    /// Return the list of segment ids referenced by this generation.
+    pub fn segments(&self) -> &[SegmentId] {
         &self.segments
     }
 }
@@ -222,11 +208,9 @@ impl SearcherInner {
         segment_readers: Vec<SegmentReader>,
         generation: TrackedObject<SearcherGeneration>,
     ) -> io::Result<SearcherInner> {
+        let expected: Vec<_> = segment_readers.iter().map(|r| r.segment_id()).collect();
         assert_eq!(
-            &segment_readers
-                .iter()
-                .map(|reader| (reader.segment_id(), reader.delete_opstamp()))
-                .collect::<BTreeMap<_, _>>(),
+            &expected,
             generation.segments(),
             "Set of segments referenced by this Searcher and its SearcherGeneration must match"
         );
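Since `SearcherGeneration::segments()` now returns a plain `&[SegmentId]` instead of a `BTreeMap`, warmers iterate it as an ordinary slice. A small sketch of the consumer side, generic over the id type so it stays self-contained (an assumption; the real code uses `SegmentId`):

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Sketch: gather the set of live segment ids across several generations,
// each exposing its segments as a slice (as `segments()` does after this PR).
fn live_ids<Id: Copy + Eq + Hash>(generations: &[&[Id]]) -> HashSet<Id> {
    generations
        .iter()
        .flat_map(|segments| segments.iter().copied())
        .collect()
}
```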