|
1 | 1 | # 📘 Powered by Lucene |
2 | 2 |
|
3 | | -https://lucene.apache.org |
| 3 | +[Lucene](https://lucene.apache.org) is a Java library providing powerful indexing and search
| 4 | +features, as well as spellchecking, hit highlighting, and advanced analysis/tokenization |
| 5 | +capabilities. You have almost certainly used Lucene already, perhaps without even realizing it.
| 6 | +It powers the search facilities of countless websites and applications, both public and private. |
4 | 7 |
|
5 | 8 | ## Anatomy of a Lucene index |
6 | 9 |
|
7 | | -A Lucene index encapsulates specialized data structures unique to each type of data indexed. |
| 10 | +A Lucene index encapsulates specialized data structures unique to each type of data indexed. |
8 | 11 |
|
9 | | - * Numbers and dates: ... |
10 | | - * Geo-spatial: ... |
| 12 | + * Numbers, dates, geo-spatial points: indexed into a k-d tree structure
| 13 | + * Vectors: indexed into a Hierarchical Navigable Small World (HNSW) graph
11 | 14 | * Text: via inverted indexes |
12 | 15 |
|
13 | | -Each field is indexed independently. |
| 16 | +What Lucene and Atlas Search call an "index" is really a collection of separate
| 17 | +per-field data structures.
14 | 18 |
|
15 | | -Segmented architecture, append-only, for fast indexing. Background processes to optimize the index |
16 | | -segments. |
| 19 | +Lucene is designed for both fast searches and speedy indexing. The indexing speed derives from its
| 20 | +append-only segmented architecture. When new documents are indexed, they are added to new segments.
| 21 | +When that indexing session is complete, the new segments are opened and blended with all the other
| 22 | +active segments. Background processes optimize the index segments by combining them to form larger,
| 23 | +and fewer, segments over time.
| 24 | + |
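The append-only behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `SegmentedIndex` class, its method names, and the "merge the smallest segments" rule are hypothetical and chosen for illustration; they are not Lucene's API or its actual merge policy.

```python
class SegmentedIndex:
    """Toy model of append-only segments (illustrative, not Lucene's implementation)."""

    def __init__(self):
        # Each segment is an immutable tuple of documents.
        self.segments = []

    def commit(self, docs):
        # An indexing session writes a brand-new segment;
        # existing segments are never modified in place.
        self.segments.append(tuple(docs))

    def merge_smallest(self, count=2):
        # Stand-in for a background merge: combine the `count` smallest
        # segments into one larger segment (hypothetical policy).
        if len(self.segments) < count:
            return
        by_size = sorted(self.segments, key=len)
        merged = tuple(doc for seg in by_size[:count] for doc in seg)
        self.segments = by_size[count:] + [merged]

    def search(self, predicate):
        # A search consults every active segment.
        return [doc for seg in self.segments for doc in seg if predicate(doc)]


index = SegmentedIndex()
index.commit(["red apple", "green pear"])  # first indexing session
index.commit(["red plum"])                 # second session, new segment
index.commit(["blue berry"])               # third session, new segment
index.merge_smallest()                     # two small segments become one
matches = index.search(lambda doc: "red" in doc)
```

After the merge the index holds two segments instead of three, and the search results are unchanged, which is the point of merging: fewer segments to consult, same documents.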
| 25 | +A single Lucene index can handle up to 2 billion documents. There is generally a one-to-one correspondence
| 26 | +between documents in your collection and Lucene documents, with the exception of nested documents
| 27 | +mapped as `embeddedDocuments` (a topic covered later). To differentiate the terminology, Atlas Search
| 28 | +calls the documents in a Lucene index "index objects". See the
| 29 | +[index size and configuration doc](https://www.mongodb.com/docs/atlas/atlas-search/performance/index-performance/#index-size-and-configuration) |
| 30 | +for more details. |
17 | 31 |
|
18 | 32 | ## Inverted Index |
19 | 33 |
|
| 34 | +Textual content is the heart and soul of Lucene. `string` fields are analyzed. The output of the
| 35 | +analysis process is a series of **terms**. Terms are generally normalized versions of the individual
| 36 | +words of the text. These terms are then organized into an **inverted index** data structure. |
| 37 | +This data structure is lexicographically (or alphabetically) ordered. The following image illustrates |
| 38 | +an inverted index built from 3 documents, each with a single string field. |
| 39 | + |
20 | 40 |  |
21 | 41 |
|
| 42 | +Along with an ordered dictionary of terms, corpus and document-level statistics are also collected into |
| 43 | +the inverted index structure. These statistics include: |
| 44 | + |
| 45 | + * term frequency (`tf`): the number of times a term occurs in the field |
| 46 | + * document frequency (`df`): how many documents contain the term |
| 47 | + * field length: how many terms there are in each field
| 48 | + |
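To make the term dictionary and its statistics concrete, here is a minimal sketch in Python. The three sample documents, the whitespace-and-lowercase `analyze` function, and the dictionary layout are all assumptions made for illustration; Lucene's analyzers and on-disk format are considerably more sophisticated.

```python
from collections import Counter


def analyze(text):
    # Stand-in analyzer (assumption): lowercase, then split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    postings = {}       # term -> {doc_id: tf}
    field_lengths = {}  # doc_id -> number of terms in the field
    for doc_id, text in docs.items():
        terms = analyze(text)
        field_lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            postings.setdefault(term, {})[doc_id] = tf
    # Keep the term dictionary in lexicographic order.
    return dict(sorted(postings.items())), field_lengths


# Hypothetical sample documents, each with a single string field.
docs = {1: "The quick brown fox", 2: "The lazy dog", 3: "Quick quick fox"}
index, field_lengths = build_inverted_index(docs)

# Document frequency: how many documents contain each term.
df = {term: len(doc_tfs) for term, doc_tfs in index.items()}
```

In this sketch `index["quick"]` maps document 1 to a term frequency of 1 and document 3 to a term frequency of 2, so `df["quick"]` is 2.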
22 | 49 | ## Search algorithms |
23 | 50 |
|
24 | | - * "index intersection" using skip lists |
25 | | - * link to Adrien's presentation |
| 51 | +Lucene queries leverage the data structures built at index time to quickly find and rank matching
| 52 | +documents. Combining the matches from several of these structures, known as "index intersection",
| 53 | +shines when searching across multiple fields in a single query.
| 54 | + |
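The idea behind index intersection can be sketched as intersecting two sorted lists of matching document ids, one per field. This is an illustrative approximation only: the `intersect` function and the sample id lists are hypothetical, and binary search is used here as a stand-in for whatever skipping structure the real postings provide.

```python
from bisect import bisect_left


def intersect(a, b):
    """Intersect two sorted doc-id lists by leapfrogging between them."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Both lists contain this document: it matches both clauses.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in `a` to the first id >= b[j].
            i = bisect_left(a, b[j], i)
        else:
            # Skip ahead in `b` to the first id >= a[i].
            j = bisect_left(b, a[i], j)
    return result


# Hypothetical matches from two different per-field structures.
title_matches = [2, 5, 9, 14, 21]
body_matches = [1, 5, 7, 14, 30]
both = intersect(title_matches, body_matches)
```

Because each list is skipped forward to the other list's current candidate, large runs of non-matching ids are never examined one by one.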
| 55 | +Atlas Search translates its search operators directly to Lucene's `Query` API. |
| 56 | + |
| 57 | +## Resources |
26 | 58 |
|
27 | | -Atlas Search translates its search operators to Lucene's `Query` API. |
| 59 | + * ["What is in a Lucene index"](https://www.youtube.com/watch?v=T5RmMNDR5XI) - a very |
| 60 | + educational presentation delivered by Lucene project committer Adrien Grand. |
| 61 | + * ["Visualizing Lucene's segment merges"](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html) |
| 62 | + - an illustrative set of animations on how Lucene keeps itself optimized, balancing both
| 63 | + indexing and searching needs. |