
Commit 2c1bcd4

Merge pull request #21 from triblespace/e4veap-codex/remove-unnecessary-features-for-rewrite
chore: drop docstore module
2 parents: 8588e53 + 9c6ac0e

83 files changed: +266 additions, -5142 deletions


ARCHITECTURE.md

Lines changed: 0 additions & 24 deletions
```diff
@@ -106,7 +106,6 @@ The schema defines all of the fields that the indexes [`Document`](src/schema/do
 
 Depending on the type of the field, you can decide to
 
-- put it in the docstore
 - store it as a fast field
 - index it
 
@@ -135,29 +134,6 @@ This conversion is done by the serializer.
 Finally, the reader is in charge of offering an API to read on this on-disk read-only representation.
 In tantivy, readers are designed to require very little anonymous memory. The data is read straight from an mmapped file, and loading an index is as fast as mmapping its files.
 
-## [store/](src/store): Here is my DocId, Gimme my document
-
-The docstore is a row-oriented storage that, for each document, stores a subset of the fields
-that are marked as stored in the schema. The docstore is compressed using a general-purpose algorithm
-like LZ4.
-
-**Useful for**
-
-In search engines, it is often used to display search results.
-Once the top 10 documents have been identified, we fetch them from the store, and display them or their snippet on the search result page (aka SERP).
-
-**Not useful for**
-
-Fetching a document from the store is typically a "slow" operation. It usually consists in
-
-- searching into a compact tree-like data structure to find the position of the right block.
-- decompressing a small block
-- returning the document from this block.
-
-It is NOT meant to be called for every document matching a query.
-
-As a rule of thumb, if you hit the docstore more than 100 times per search query, you are probably misusing tantivy.
-
 ## [fastfield/](src/fastfield): Here is my DocId, Gimme my value
 
 Fast fields are stored in a column-oriented storage that allows for random access.
```
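With the docstore section removed, fast fields are the remaining "DocId in, value out" path in ARCHITECTURE.md. A toy illustration of the column contract that section promises; this is a minimal sketch, not Yeehaw's actual fastfield API, whose columns are bit-packed and compressed:

```rust
// Toy fast-field column: one densely packed slot per DocId, so a lookup
// is a plain array index with no block search or decompression step,
// unlike the removed docstore. Illustrative only.
struct Column {
    values: Vec<u64>,
}

impl Column {
    fn get(&self, doc_id: u32) -> u64 {
        self.values[doc_id as usize]
    }
}

fn main() {
    let prices = Column {
        values: vec![995, 1250, 430],
    };
    assert_eq!(prices.get(1), 1250); // DocId 1 -> its value, O(1)
}
```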

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -12,8 +12,13 @@ have been removed to keep the changelog focused on Yeehaw's history.
 - update examples to import the `yeehaw` crate instead of `tantivy`.
 - run preflight tests without enabling the `unstable` feature.
 - handle unknown column codes gracefully in `ColumnarReader::iter_columns`.
+- rewrite doctests and examples to import the `yeehaw` crate directly.
 
 ## Features/Improvements
+- drop docstore module and references in preparation for trible.space rewrite.
+- purge remaining docstore references from core modules and tests.
+- remove docstore-dependent code from examples.
+- drop binary document serializer/deserializer now that docstore is gone.
 - remove `quickwit` feature flag and related async code.
 - add docs/example and Vec<u32> values to sstable [#2660](https://github.com/quickwit-oss/yeehaw/pull/2660)(@PSeitz)
 - Add string fast field support to `TopDocs`. [#2642](https://github.com/quickwit-oss/yeehaw/pull/2642)(@stuhood)
@@ -32,3 +37,4 @@ have been removed to keep the changelog focused on Yeehaw's history.
 - expand documentation for document deserialization traits.
 - reorder inventory tasks to prioritize fixing doctest regressions.
 - remove `quickwit` feature and associated asynchronous APIs.
+- remove obsolete document type codes.
```

INVENTORY.md

Lines changed: 6 additions & 8 deletions
```diff
@@ -20,23 +20,19 @@ This document outlines the long term plan to rewrite this project so that it rel
    - Replace the `Directory` abstraction with a backend that reads and writes blobs via the Trible Space `BlobStore`.
    - Index writers and readers operate on blob handles instead of filesystem paths.
 
-3. **Drop the docstore module**
-   - Primary documents are kept in Trible Space; segments no longer store their own row oriented docstore.
-   - Search results fetch documents via blob handles.
-
-4. **Remove `Opstamp` and use commit handles**
+3. **Remove `Opstamp` and use commit handles**
    - Commits record the segments they include.
    - Merges rely on commit ancestry instead of monotonic operation stamps.
 
-5. **Introduce 128-bit IDs with `Universe` mapping**
+4. **Introduce 128-bit IDs with `Universe` mapping**
    - Map external `u128` identifiers to compact `DocId` values.
    - Persist the mapping so search results can translate back.
 
-6. **Typed DSL for fuzzy search**
+5. **Typed DSL for fuzzy search**
    - Generate search filters from Trible namespaces.
    - Provide macros that participate in both `find!` queries and full text search.
 
-7. **Index update merge workflow**
+6. **Index update merge workflow**
    - Wrap indexing operations in workspace commits.
    - Use Trible's compare-and-swap push mechanism so multiple writers merge gracefully.
 
@@ -59,3 +55,5 @@ This inventory captures the direction of the rewrite and the major tasks require
    - Migrate inline benchmarks to a stable harness so the `unstable` feature can be tested on stable Rust.
 15. **Evaluate removing sstable term dictionary and crate now that `quickwit` feature is gone**
    - Determine whether the `sstable` crate should remain in the workspace or be extracted.
+16. **Prune obsolete document type codes** *(done)*
+   - Removed unused `type_codes` constants after dropping docstore serialization.
```
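Item 4's `Universe` mapping is only named in the inventory. A hypothetical sketch of the intended shape, with illustrative names and an in-memory representation standing in for whatever ultimately gets persisted:

```rust
use std::collections::HashMap;

// Hypothetical `Universe`: interns external 128-bit identifiers to dense
// `DocId`s and keeps a reverse table so search results can translate back.
#[derive(Default)]
struct Universe {
    forward: HashMap<u128, u32>, // external id -> compact DocId
    reverse: Vec<u128>,          // DocId -> external id
}

impl Universe {
    fn intern(&mut self, external: u128) -> u32 {
        if let Some(&doc_id) = self.forward.get(&external) {
            return doc_id;
        }
        let doc_id = self.reverse.len() as u32;
        self.forward.insert(external, doc_id);
        self.reverse.push(external);
        doc_id
    }

    fn resolve(&self, doc_id: u32) -> Option<u128> {
        self.reverse.get(doc_id as usize).copied()
    }
}

fn main() {
    let mut universe = Universe::default();
    let doc_id = universe.intern(0xDEAD_BEEF_u128);
    assert_eq!(universe.resolve(doc_id), Some(0xDEAD_BEEF_u128));
}
```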

columnar/columnar-cli-inspect/Cargo.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -5,7 +5,7 @@ edition = "2021"
 license = "MIT"
 
 [dependencies]
-tantivy = {path="../..", package="tantivy"}
+yeehaw = {path="../.."}
 columnar = {path="../", package="tantivy-columnar"}
 common = {path="../../common", package="tantivy-common"}
```

columnar/columnar-cli-inspect/src/main.rs

Lines changed: 1 addition & 1 deletion
```diff
@@ -2,7 +2,7 @@ use columnar::ColumnarReader;
 use common::file_slice::{FileSlice, WrapFile};
 use std::io;
 use std::path::Path;
-use tantivy::directory::footer::Footer;
+use yeehaw::directory::footer::Footer;
 
 fn main() -> io::Result<()> {
     println!("Opens a columnar file written by tantivy and validates it.");
```

examples/aggregation.rs

Lines changed: 1 addition & 2 deletions
```diff
@@ -37,8 +37,7 @@ fn main() -> yeehaw::Result<()> {
             .set_index_option(IndexRecordOption::WithFreqs)
             .set_tokenizer("raw"),
         )
-        .set_fast(None)
-        .set_stored();
+        .set_fast(None);
     schema_builder.add_text_field("category", text_fieldtype);
     schema_builder.add_f64_field("stock", FAST);
     schema_builder.add_f64_field("price", FAST);
```

examples/basic_search.rs

Lines changed: 3 additions & 29 deletions
```diff
@@ -8,7 +8,6 @@
 // - create an index in a directory
 // - index a few documents into our index
 // - search for the best document matching a basic query
-// - retrieve the best document's original content.
 
 // ---
 // Importing yeehaw...
@@ -33,28 +32,10 @@ fn main() -> yeehaw::Result<()> {
     // First we need to define a schema ...
     let mut schema_builder = Schema::builder();
 
-    // Our first field is title.
-    // We want full-text search for it, and we also want
-    // to be able to retrieve the document after the search.
-    //
-    // `TEXT | STORED` is some syntactic sugar to describe
-    // that.
-    //
-    // `TEXT` means the field should be tokenized and indexed,
-    // along with its term frequency and term positions.
-    //
-    // `STORED` means that the field will also be saved
-    // in a compressed, row-oriented key-value store.
-    // This store is useful for reconstructing the
-    // documents that were selected during the search phase.
-    schema_builder.add_text_field("title", TEXT | STORED);
+    // Our first field is title. We want full-text search for it.
+    schema_builder.add_text_field("title", TEXT);
 
     // Our second field is body.
-    // We want full-text search for it, but we do not
-    // need to be able to retrieve it
-    // for our application.
-    //
-    // We can make our index lighter by omitting the `STORED` flag.
     schema_builder.add_text_field("body", TEXT);
 
     let schema = schema_builder.build();
@@ -210,15 +191,8 @@ fn main() -> yeehaw::Result<()> {
     // We can now perform our query.
     let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
 
-    // The actual documents still need to be
-    // retrieved from Yeehaw's store.
-    //
-    // Since the body field was not configured as stored,
-    // the document returned will only contain
-    // a title.
     for (_score, doc_address) in top_docs {
-        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
-        println!("{}", retrieved_doc.to_json(&schema));
+        println!("{doc_address:?}");
     }
 
     // We can also get an explanation to understand
```
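The reworked example stops at printing the `DocAddress`. Under the rewrite plan, callers that need the original content resolve hits against an external primary store (per INVENTORY.md, eventually Trible Space blob handles). A hypothetical sketch of that pattern; `ExternalStore`, `fetch`, and the key type are illustrative names, not Yeehaw types:

```rust
use std::collections::HashMap;

// Hypothetical external primary store: the index returns only addresses
// or ids, and the document bodies live outside the index entirely.
struct ExternalStore {
    docs: HashMap<u64, String>, // external doc key -> original document
}

impl ExternalStore {
    fn fetch(&self, key: u64) -> Option<&str> {
        self.docs.get(&key).map(String::as_str)
    }
}

fn main() {
    let mut docs = HashMap::new();
    docs.insert(42, "The Old Man and the Sea".to_string());
    let store = ExternalStore { docs };
    // After a search, translate the hit's key and fetch the document body.
    assert_eq!(store.fetch(42), Some("The Old Man and the Sea"));
}
```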

examples/custom_tokenizer.rs

Lines changed: 2 additions & 5 deletions
```diff
@@ -26,9 +26,7 @@ fn main() -> yeehaw::Result<()> {
     let text_field_indexing = TextFieldIndexing::default()
         .set_tokenizer("ngram3")
         .set_index_option(IndexRecordOption::WithFreqsAndPositions);
-    let text_options = TextOptions::default()
-        .set_indexing_options(text_field_indexing)
-        .set_stored();
+    let text_options = TextOptions::default().set_indexing_options(text_field_indexing);
     let title = schema_builder.add_text_field("title", text_options);
 
     // Our second field is body.
@@ -103,8 +101,7 @@ fn main() -> yeehaw::Result<()> {
     let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
 
     for (_, doc_address) in top_docs {
-        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
-        println!("{}", retrieved_doc.to_json(&schema));
+        println!("{doc_address:?}");
     }
 
     Ok(())
```

examples/date_time_field.rs

Lines changed: 3 additions & 17 deletions
```diff
@@ -4,19 +4,18 @@
 
 use yeehaw::collector::TopDocs;
 use yeehaw::query::QueryParser;
-use yeehaw::schema::{DateOptions, Document, Schema, Value, INDEXED, STORED, STRING};
+use yeehaw::schema::{DateOptions, Schema, INDEXED, STRING};
 use yeehaw::{Index, IndexWriter, TantivyDocument};
 
 fn main() -> yeehaw::Result<()> {
     // # Defining the schema
     let mut schema_builder = Schema::builder();
     let opts = DateOptions::from(INDEXED)
-        .set_stored()
         .set_fast()
         .set_precision(yeehaw::schema::DateTimePrecision::Seconds);
     // Add `occurred_at` date field type
-    let occurred_at = schema_builder.add_date_field("occurred_at", opts);
-    let event_type = schema_builder.add_text_field("event", STRING | STORED);
+    let _occurred_at = schema_builder.add_date_field("occurred_at", opts);
+    let event_type = schema_builder.add_text_field("event", STRING);
     let schema = schema_builder.build();
 
     // # Indexing documents
@@ -59,19 +58,6 @@ fn main() -> yeehaw::Result<()> {
             .parse_query(r#"occurred_at:[2022-06-22T12:58:00Z TO 2022-06-23T00:00:00Z}"#)?;
         let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?;
         assert_eq!(count_docs.len(), 1);
-        for (_score, doc_address) in count_docs {
-            let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
-            assert!(retrieved_doc
-                .get_first(occurred_at)
-                .unwrap()
-                .as_value()
-                .as_datetime()
-                .is_some(),);
-            assert_eq!(
-                retrieved_doc.to_json(&schema),
-                r#"{"event":["comment"],"occurred_at":["2022-06-22T13:00:00.22Z"]}"#
-            );
-        }
     }
     Ok(())
 }
```

examples/deleting_updating_documents.rs

Lines changed: 8 additions & 37 deletions
```diff
@@ -13,30 +13,12 @@ use yeehaw::query::TermQuery;
 use yeehaw::schema::*;
 use yeehaw::{doc, Index, IndexReader, IndexWriter};
 
-// A simple helper function to fetch a single document
-// given its id from our index.
-// It will be helpful to check our work.
-fn extract_doc_given_isbn(
-    reader: &IndexReader,
-    isbn_term: &Term,
-) -> yeehaw::Result<Option<TantivyDocument>> {
+// Helper to check whether a document with the given ISBN exists.
+fn exists_doc_with_isbn(reader: &IndexReader, isbn_term: &Term) -> yeehaw::Result<bool> {
     let searcher = reader.searcher();
-
-    // This is the simplest query you can think of.
-    // It matches all of the documents containing a specific term.
-    //
-    // The second argument is here to tell we don't care about decoding positions,
-    // or term frequencies.
     let term_query = TermQuery::new(isbn_term.clone(), IndexRecordOption::Basic);
     let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1))?;
-
-    if let Some((_score, doc_address)) = top_docs.first() {
-        let doc = searcher.doc(*doc_address)?;
-        Ok(Some(doc))
-    } else {
-        // no doc matching this ID.
-        Ok(None)
-    }
+    Ok(top_docs.first().is_some())
 }
 
 fn main() -> yeehaw::Result<()> {
@@ -61,10 +43,8 @@ fn main() -> yeehaw::Result<()> {
     // use the `STRING` shortcut. `STRING` stands for indexed (without term frequency or positions)
     // and untokenized.
     //
-    // Because we also want to be able to see this `id` in our returned documents,
-    // we also mark the field as stored.
-    let isbn = schema_builder.add_text_field("isbn", STRING | STORED);
-    let title = schema_builder.add_text_field("title", TEXT | STORED);
+    let isbn = schema_builder.add_text_field("isbn", STRING);
+    let title = schema_builder.add_text_field("title", TEXT);
     let schema = schema_builder.build();
 
     let index = Index::create_in_ram(schema.clone());
@@ -92,11 +72,7 @@ fn main() -> yeehaw::Result<()> {
     let frankenstein_isbn = Term::from_field_text(isbn, "978-9176370711");
 
     // Oops our frankenstein doc seems misspelled
-    let frankenstein_doc_misspelled = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
-    assert_eq!(
-        frankenstein_doc_misspelled.to_json(&schema),
-        r#"{"isbn":["978-9176370711"],"title":["Frankentein"]}"#,
-    );
+    assert!(exists_doc_with_isbn(&reader, &frankenstein_isbn)?);
 
     // # Update = Delete + Insert
     //
@@ -106,8 +82,7 @@ fn main() -> yeehaw::Result<()> {
     // and reinsert the document.
     //
     // This can be complicated as it means you need to have access
-    // to the entire document. It is good practise to integrate yeehaw
-    // with a key value store for this reason.
+    // to the entire document.
     //
     // To remove one of the document, we just call `delete_term`
     // on its id.
@@ -134,11 +109,7 @@ fn main() -> yeehaw::Result<()> {
     reader.reload()?;
 
     // No more typo!
-    let frankenstein_new_doc = extract_doc_given_isbn(&reader, &frankenstein_isbn)?.unwrap();
-    assert_eq!(
-        frankenstein_new_doc.to_json(&schema),
-        r#"{"isbn":["978-9176370711"],"title":["Frankenstein"]}"#,
-    );
+    assert!(exists_doc_with_isbn(&reader, &frankenstein_isbn)?);
 
     Ok(())
 }
```
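The retained comments still describe the update flow. A minimal sketch of that "update = delete + insert" sequence, assuming the usual writer API (`delete_term`, `add_document`, `commit`); schema setup is omitted and the `Field` handles come from the caller:

```rust
use yeehaw::schema::Field;
use yeehaw::{doc, IndexWriter, Term};

// Sketch of updating one document identified by its ISBN term.
fn update_title(
    writer: &mut IndexWriter,
    isbn: Field,
    title: Field,
    isbn_value: &str,
    new_title: &str,
) -> yeehaw::Result<()> {
    // Queue a delete for every document carrying this ISBN term ...
    writer.delete_term(Term::from_field_text(isbn, isbn_value));
    // ... then reinsert the corrected version.
    writer.add_document(doc!(isbn => isbn_value, title => new_title))?;
    // Both operations become visible together at the commit, which is why
    // the example reloads its reader afterwards.
    writer.commit()?;
    Ok(())
}
```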
