Commit 5f5d194

LUCENE-9353: revise format documentation of Lucene90BlockTreeTermsWriter (#90)
1 parent 5592d58 commit 5f5d194

1 file changed: +38 -23 lines changed

lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java

Lines changed: 38 additions & 23 deletions
@@ -93,6 +93,7 @@ order, meaning if you just next() the file pointer will
  *
  * <ul>
  * <li><code>.tim</code>: <a href="#Termdictionary">Term Dictionary</a>
+ * <li><code>.tmd</code>: <a href="#Termmetadata">Term Metadata</a>
  * <li><code>.tip</code>: <a href="#Termindex">Term Index</a>
  * </ul>
  *
@@ -113,7 +114,7 @@ order, meaning if you just next() the file pointer will
  *
  * <ul>
  * <li>TermsDict (.tim) --&gt; Header, <i>PostingsHeader</i>, NodeBlock<sup>NumBlocks</sup>,
- * FieldSummary, DirOffset, Footer
+ * Footer
  * <li>NodeBlock --&gt; (OuterNode | InnerNode)
  * <li>OuterNode --&gt; EntryCount, SuffixLength, Byte<sup>SuffixLength</sup>, StatsLength, &lt;
  * TermStats &gt;<sup>EntryCount</sup>, MetaLength,
@@ -122,16 +123,10 @@ order, meaning if you just next() the file pointer will
  * &lt; TermStats ? &gt;<sup>EntryCount</sup>, MetaLength, &lt;<i>TermMetadata ?
  * </i>&gt;<sup>EntryCount</sup>
  * <li>TermStats --&gt; DocFreq, TotalTermFreq
- * <li>FieldSummary --&gt; NumFields, &lt;FieldNumber, NumTerms, RootCodeLength,
- * Byte<sup>RootCodeLength</sup>, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, MinTerm,
- * MaxTerm&gt;<sup>NumFields</sup>
  * <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}
- * <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}
- * <li>MinTerm,MaxTerm --&gt; {@link DataOutput#writeVInt VInt} length followed by the byte[]
- * <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,NumFields,
- * FieldNumber,RootCodeLength,DocCount,LongsSize --&gt; {@link DataOutput#writeVInt VInt}
- * <li>TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq --&gt; {@link DataOutput#writeVLong
- * VLong}
+ * <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength --&gt; {@link DataOutput#writeVInt
+ * VInt}
+ * <li>TotalTermFreq --&gt; {@link DataOutput#writeVLong VLong}
  * <li>Footer --&gt; {@link CodecUtil#writeFooter CodecFooter}
  * </ul>
  *
@@ -140,24 +135,48 @@ order, meaning if you just next() the file pointer will
  * <ul>
  * <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information for
  * the BlockTree implementation.
- * <li>DirOffset is a pointer to the FieldSummary section.
  * <li>DocFreq is the count of documents which contain the term.
  * <li>TotalTermFreq is the total number of occurrences of the term. This is encoded as the
  * difference between the total number of occurrences and the DocFreq.
+ * <li>PostingsHeader and TermMetadata are plugged into by the specific postings implementation:
+ * these contain arbitrary per-file data (such as parameters or versioning information) and
+ * per-term data (such as pointers to inverted files).
+ * <li>For inner nodes of the tree, every entry will steal one bit to mark whether it points to
+ * child nodes(sub-block). If so, the corresponding TermStats and TermMetaData are omitted.
+ * </ul>
+ *
+ * <p><a id="Termmetadata"></a>
+ *
+ * <h2>Term Metadata</h2>
+ *
+ * <p>The .tmd file contains the list of term metadata (such as FST index metadata) and field level
+ * statistics (such as sum of total term freq).
+ *
+ * <ul>
+ * <li>TermsMeta (.tmd) --&gt; Header, NumFields, &lt;FieldStats&gt;<sup>NumFields</sup>,
+ * TermIndexLength, TermDictLength, Footer
+ * <li>FieldStats --&gt; FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
+ * SumTotalTermFreq?, SumDocFreq, DocCount, MinTerm, MaxTerm, IndexStartFP, FSTHeader,
+ * <i>FSTMetadata</i>
+ * <li>Header,FSTHeader --&gt; {@link CodecUtil#writeHeader CodecHeader}
+ * <li>TermIndexLength, TermDictLength --&gt; {@link DataOutput#writeLong Uint64}
+ * <li>MinTerm,MaxTerm --&gt; {@link DataOutput#writeVInt VInt} length followed by the byte[]
+ * <li>NumFields,FieldNumber,RootCodeLength,DocCount --&gt; {@link DataOutput#writeVInt VInt}
+ * <li>NumTerms,SumTotalTermFreq,SumDocFreq,IndexStartFP --&gt; {@link DataOutput#writeVLong
+ * VLong}
+ * <li>Footer --&gt; {@link CodecUtil#writeFooter CodecFooter}
+ * </ul>
+ *
+ * <p>Notes:
+ *
+ * <ul>
  * <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)
  * <li>NumTerms is the number of unique terms for the field.
  * <li>RootCode points to the root block for the field.
  * <li>SumDocFreq is the total number of postings, the number of term-document pairs across the
  * entire field.
  * <li>DocCount is the number of documents that have at least one posting for this field.
- * <li>LongsSize records how many long values the postings writer/reader record per term (e.g., to
- * hold freq/prox/doc file offsets).
  * <li>MinTerm, MaxTerm are the lowest and highest term in this field.
- * <li>PostingsHeader and TermMetadata are plugged into by the specific postings implementation:
- * these contain arbitrary per-file data (such as parameters or versioning information) and
- * per-term data (such as pointers to inverted files).
- * <li>For inner nodes of the tree, every entry will steal one bit to mark whether it points to
- * child nodes(sub-block). If so, the corresponding TermStats and TermMetaData are omitted
  * </ul>
  *
  * <a id="Termindex"></a>
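
Reviewer note: the TermStats notes in this hunk say that TotalTermFreq is stored as its difference from DocFreq, with DocFreq as a VInt and TotalTermFreq as a VLong. The sketch below illustrates only that description. `TermStatsSketch` and its methods are hypothetical names, it does not use Lucene's `DataOutput`, and the real writer may apply further packing that this diff does not show.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

/** Illustrative only: mirrors the variable-length byte layout, not Lucene's own classes. */
final class TermStatsSketch {

  /** Writes a non-negative value 7 bits per byte, low-order bits first; a set high bit means more bytes follow. */
  static void writeVLong(ByteArrayOutputStream out, long value) {
    if (value < 0) {
      throw new IllegalArgumentException("value must be non-negative");
    }
    while ((value & ~0x7FL) != 0) {
      out.write((int) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.write((int) value);
  }

  /** Reads a value written by writeVLong. */
  static long readVLong(ByteArrayInputStream in) {
    long value = 0;
    int shift = 0;
    while (true) {
      int b = in.read();
      if (b < 0) {
        throw new IllegalStateException("truncated input");
      }
      value |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
      shift += 7;
    }
  }

  /** Encodes DocFreq followed by (TotalTermFreq - DocFreq), as the notes describe. */
  static byte[] encode(int docFreq, long totalTermFreq) {
    if (totalTermFreq < docFreq) {
      throw new IllegalArgumentException("totalTermFreq must be >= docFreq");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // A non-negative int has the same bytes as a VInt and as a VLong.
    writeVLong(out, docFreq);
    writeVLong(out, totalTermFreq - docFreq);
    return out.toByteArray();
  }

  /** Returns {docFreq, totalTermFreq} recovered from the encoded bytes. */
  static long[] decode(byte[] bytes) {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    long docFreq = readVLong(in);
    long totalTermFreq = docFreq + readVLong(in);
    return new long[] {docFreq, totalTermFreq};
  }

  public static void main(String[] args) {
    long[] stats = decode(encode(3, 10));
    System.out.println("docFreq=" + stats[0] + " totalTermFreq=" + stats[1]);
  }
}
```

Storing the difference keeps the second value small whenever most documents contain the term only once, so it usually fits in a single byte.
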
@@ -169,11 +188,8 @@ order, meaning if you just next() the file pointer will
  * saving a disk seek.
  *
  * <ul>
- * <li>TermsIndex (.tip) --&gt; Header, FSTIndex<sup>NumFields</sup>
- * &lt;IndexStartFP&gt;<sup>NumFields</sup>, DirOffset, Footer
+ * <li>TermsIndex (.tip) --&gt; Header, FSTIndex<sup>NumFields</sup>Footer
  * <li>Header --&gt; {@link CodecUtil#writeHeader CodecHeader}
- * <li>DirOffset --&gt; {@link DataOutput#writeLong Uint64}
- * <li>IndexStartFP --&gt; {@link DataOutput#writeVLong VLong}
  * <!-- TODO: better describe FST output here -->
  * <li>FSTIndex --&gt; {@link FST FST&lt;byte[]&gt;}
  * <li>Footer --&gt; {@link CodecUtil#writeFooter CodecFooter}
@@ -185,7 +201,6 @@ order, meaning if you just next() the file pointer will
  * <li>The .tip file contains a separate FST for each field. The FST maps a term prefix to the
  * on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP
  * points to its FST.
- * <li>DirOffset is a pointer to the start of the IndexStartFPs for all fields
  * <li>It's possible that an on-disk block would contain too many terms (more than the allowed
  * maximum (default: 48)). When this happens, the block is sub-divided into new blocks (called
  * "floor blocks"), and then the output in the FST for the block's prefix encodes the leading
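
Reviewer note: the Term Index notes above describe the per-field index as a map from a term prefix to the on-disk block holding all terms with that prefix. The sketch below illustrates that lookup idea only. A `HashMap` stands in for the FST, `String` stands in for the `byte[]` terms, and `PrefixIndexSketch` and the file pointer values are hypothetical; floor blocks are not modelled.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in for a per-field term index; not Lucene's FST. */
final class PrefixIndexSketch {
  private final Map<String, Long> blockPointerByPrefix = new HashMap<>();

  /** Registers the file pointer of the block that holds all terms starting with the prefix. */
  void put(String prefix, long blockFilePointer) {
    blockPointerByPrefix.put(prefix, blockFilePointer);
  }

  /** Returns the block pointer for the longest indexed prefix of the term, or -1 if none matches. */
  long lookup(String term) {
    for (int len = term.length(); len >= 0; len--) {
      Long pointer = blockPointerByPrefix.get(term.substring(0, len));
      if (pointer != null) {
        return pointer;
      }
    }
    return -1L;
  }

  public static void main(String[] args) {
    PrefixIndexSketch index = new PrefixIndexSketch();
    index.put("", 0L);      // root block for the field
    index.put("fo", 120L);  // hypothetical block offsets
    index.put("foo", 480L);
    System.out.println(index.lookup("foobar"));
  }
}
```

Resolving the longest matching prefix in memory means only one block has to be read from the term dictionary to confirm whether the term exists.
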
