@@ -93,6 +93,7 @@ order, meaning if you just next() the file pointer will
  *
  * <ul>
  * <li><code>.tim</code>: <a href="#Termdictionary">Term Dictionary</a>
+ * <li><code>.tmd</code>: <a href="#Termmetadata">Term Metadata</a>
  * <li><code>.tip</code>: <a href="#Termindex">Term Index</a>
  * </ul>
  *
@@ -113,7 +114,7 @@ order, meaning if you just next() the file pointer will
  *
  * <ul>
  * <li>TermsDict (.tim) --> Header, <i>PostingsHeader</i>, NodeBlock<sup>NumBlocks</sup>,
- * FieldSummary, DirOffset, Footer
+ * Footer
  * <li>NodeBlock --> (OuterNode | InnerNode)
  * <li>OuterNode --> EntryCount, SuffixLength, Byte<sup>SuffixLength</sup>, StatsLength, <
  * TermStats ><sup>EntryCount</sup>, MetaLength,
@@ -122,16 +123,10 @@ order, meaning if you just next() the file pointer will
  * < TermStats ? ><sup>EntryCount</sup>, MetaLength, <<i>TermMetadata ?
  * </i>><sup>EntryCount</sup>
  * <li>TermStats --> DocFreq, TotalTermFreq
- * <li>FieldSummary --> NumFields, <FieldNumber, NumTerms, RootCodeLength,
- * Byte<sup>RootCodeLength</sup>, SumTotalTermFreq?, SumDocFreq, DocCount, LongsSize, MinTerm,
- * MaxTerm><sup>NumFields</sup>
  * <li>Header --> {@link CodecUtil#writeHeader CodecHeader}
- * <li>DirOffset --> {@link DataOutput#writeLong Uint64}
- * <li>MinTerm,MaxTerm --> {@link DataOutput#writeVInt VInt} length followed by the byte[]
- * <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength,NumFields,
- * FieldNumber,RootCodeLength,DocCount,LongsSize --> {@link DataOutput#writeVInt VInt}
- * <li>TotalTermFreq,NumTerms,SumTotalTermFreq,SumDocFreq --> {@link DataOutput#writeVLong
- * VLong}
+ * <li>EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength --> {@link DataOutput#writeVInt
+ * VInt}
+ * <li>TotalTermFreq --> {@link DataOutput#writeVLong VLong}
  * <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
  * </ul>
  *
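Review note: the hunk above keeps EntryCount, SuffixLength, StatsLength, DocFreq and MetaLength as VInt fields. As a reminder of what that encoding looks like on disk, here is a minimal, self-contained sketch of a variable-length int codec (7 payload bits per byte, low-order group first, high bit set when another byte follows). The class `VIntSketch` and its method names are hypothetical and written only for illustration; they are not Lucene's `DataOutput` implementation, and the sketch handles non-negative values only.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Lucene.
class VIntSketch {
  // Encodes a non-negative int in 7-bit groups, low-order group first.
  // The high bit of each byte signals that another byte follows.
  static byte[] encodeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  // Decodes a value produced by encodeVInt, stopping at the first byte
  // whose high bit is clear.
  static int decodeVInt(byte[] bytes) {
    int value = 0;
    int shift = 0;
    for (byte b : bytes) {
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        break;
      }
      shift += 7;
    }
    return value;
  }

  public static void main(String[] args) {
    byte[] encoded = encodeVInt(300);
    System.out.println(encoded.length + " bytes, decoded=" + decodeVInt(encoded));
  }
}
```

Small values such as typical entry counts fit in a single byte under this scheme, which is presumably why the per-block fields stay VInt-encoded.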
@@ -140,24 +135,48 @@ order, meaning if you just next() the file pointer will
  * <ul>
  * <li>Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information for
  * the BlockTree implementation.
- * <li>DirOffset is a pointer to the FieldSummary section.
  * <li>DocFreq is the count of documents which contain the term.
  * <li>TotalTermFreq is the total number of occurrences of the term. This is encoded as the
  * difference between the total number of occurrences and the DocFreq.
+ * <li>PostingsHeader and TermMetadata are plugged into by the specific postings implementation:
+ * these contain arbitrary per-file data (such as parameters or versioning information) and
+ * per-term data (such as pointers to inverted files).
+ * <li>For inner nodes of the tree, every entry will steal one bit to mark whether it points to
+ * child nodes(sub-block). If so, the corresponding TermStats and TermMetaData are omitted.
+ * </ul>
+ *
+ * <p><a id="Termmetadata"></a>
+ *
+ * <h2>Term Metadata</h2>
+ *
+ * <p>The .tmd file contains the list of term metadata (such as FST index metadata) and field level
+ * statistics (such as sum of total term freq).
+ *
+ * <ul>
+ * <li>TermsMeta (.tmd) --> Header, NumFields, <FieldStats><sup>NumFields</sup>,
+ * TermIndexLength, TermDictLength, Footer
+ * <li>FieldStats --> FieldNumber, NumTerms, RootCodeLength, Byte<sup>RootCodeLength</sup>,
+ * SumTotalTermFreq?, SumDocFreq, DocCount, MinTerm, MaxTerm, IndexStartFP, FSTHeader,
+ * <i>FSTMetadata</i>
+ * <li>Header,FSTHeader --> {@link CodecUtil#writeHeader CodecHeader}
+ * <li>TermIndexLength, TermDictLength --> {@link DataOutput#writeLong Uint64}
+ * <li>MinTerm,MaxTerm --> {@link DataOutput#writeVInt VInt} length followed by the byte[]
+ * <li>NumFields,FieldNumber,RootCodeLength,DocCount --> {@link DataOutput#writeVInt VInt}
+ * <li>NumTerms,SumTotalTermFreq,SumDocFreq,IndexStartFP --> {@link DataOutput#writeVLong
+ * VLong}
+ * <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
+ * </ul>
+ *
+ * <p>Notes:
+ *
+ * <ul>
  * <li>FieldNumber is the fields number from {@link FieldInfos}. (.fnm)
  * <li>NumTerms is the number of unique terms for the field.
  * <li>RootCode points to the root block for the field.
  * <li>SumDocFreq is the total number of postings, the number of term-document pairs across the
  * entire field.
  * <li>DocCount is the number of documents that have at least one posting for this field.
- * <li>LongsSize records how many long values the postings writer/reader record per term (e.g., to
- * hold freq/prox/doc file offsets).
  * <li>MinTerm, MaxTerm are the lowest and highest term in this field.
- * <li>PostingsHeader and TermMetadata are plugged into by the specific postings implementation:
- * these contain arbitrary per-file data (such as parameters or versioning information) and
- * per-term data (such as pointers to inverted files).
- * <li>For inner nodes of the tree, every entry will steal one bit to mark whether it points to
- * child nodes(sub-block). If so, the corresponding TermStats and TermMetaData are omitted
  * </ul>
  *
  * <a id="Termindex"></a>
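Review note: two of the notes in the hunk above describe small encoding tricks that are easier to follow with numbers. The sketch below illustrates them under stated assumptions: TotalTermFreq is stored as its difference from DocFreq (as the Javadoc says), and the per-entry "stolen" bit is assumed here to be the lowest bit of the suffix-length code, which the Javadoc does not specify. The class `EntrySketch` and all method names are hypothetical and are not taken from the BlockTree writer.

```java
// Hypothetical illustration class; not part of Lucene.
class EntrySketch {
  // TotalTermFreq is written as (totalTermFreq - docFreq). Every document
  // that contains the term contributes at least one occurrence, so the
  // stored delta is never negative.
  static long encodeTotalTermFreq(int docFreq, long totalTermFreq) {
    return totalTermFreq - docFreq;
  }

  static long decodeTotalTermFreq(int docFreq, long storedDelta) {
    return docFreq + storedDelta;
  }

  // ASSUMPTION: the sub-block flag occupies the lowest bit of the
  // suffix-length code; the remaining bits hold the suffix length.
  static int encodeSuffixCode(int suffixLength, boolean isSubBlock) {
    return (suffixLength << 1) | (isSubBlock ? 1 : 0);
  }

  static int suffixLength(int code) {
    return code >>> 1;
  }

  static boolean isSubBlock(int code) {
    return (code & 1) != 0;
  }

  public static void main(String[] args) {
    long delta = encodeTotalTermFreq(3, 10L);
    System.out.println("stored delta=" + delta
        + ", decoded totalTermFreq=" + decodeTotalTermFreq(3, delta));

    int code = encodeSuffixCode(5, true);
    System.out.println("suffixLength=" + suffixLength(code)
        + ", isSubBlock=" + isSubBlock(code));
  }
}
```

When `isSubBlock` is true, a reader following the note above would skip TermStats and TermMetadata for that entry, since they are omitted for entries that point to child blocks.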
@@ -169,11 +188,8 @@ order, meaning if you just next() the file pointer will
  * saving a disk seek.
  *
  * <ul>
- * <li>TermsIndex (.tip) --> Header, FSTIndex<sup>NumFields</sup>
- * <IndexStartFP><sup>NumFields</sup>, DirOffset, Footer
+ * <li>TermsIndex (.tip) --> Header, FSTIndex<sup>NumFields</sup>Footer
  * <li>Header --> {@link CodecUtil#writeHeader CodecHeader}
- * <li>DirOffset --> {@link DataOutput#writeLong Uint64}
- * <li>IndexStartFP --> {@link DataOutput#writeVLong VLong}
  * <!-- TODO: better describe FST output here -->
  * <li>FSTIndex --> {@link FST FST<byte[]>}
  * <li>Footer --> {@link CodecUtil#writeFooter CodecFooter}
@@ -185,7 +201,6 @@ order, meaning if you just next() the file pointer will
  * <li>The .tip file contains a separate FST for each field. The FST maps a term prefix to the
  * on-disk block that holds all terms starting with that prefix. Each field's IndexStartFP
  * points to its FST.
- * <li>DirOffset is a pointer to the start of the IndexStartFPs for all fields
  * <li>It's possible that an on-disk block would contain too many terms (more than the allowed
  * maximum (default: 48)). When this happens, the block is sub-divided into new blocks (called
  * "floor blocks"), and then the output in the FST for the block's prefix encodes the leading