-"""
-    BM25Transformer()
-
-Convert a collection of raw documents to a matrix using the Okapi BM25 document-word statistic.
-
-BM25 is an approach similar to that of TF-IDF in terms of representing documents in a vector
-space. The BM25 scoring function uses both term frequency (TF) and inverse document frequency
-(IDF) so that, for each term in a document, its relative concentration in the document is
-scored (like TF-IDF). However, BM25 improves upon TF-IDF by incorporating probability - particularly,
-the probability that a user will consider a search result relevant based on the terms in the search query
-and those in each document.
-
-The parameters `max_doc_freq`, `min_doc_freq`, and `smooth_idf` all work identically to those in the
-`TfidfTransformer`. BM25 introduces two additional parameters:
-
-`κ` is the term frequency saturation characteristic. Higher values represent slower saturation. What
-we mean by saturation is the degree to which a term occurring extra times adds to the overall score. This defaults
-to 2.
-
-`β` is a parameter, bound between 0 and 1, that amplifies the particular document length compared to the average length.
-The bigger β is, the more document length is amplified in terms of the overall score. The default value is 0.75.
-
-For more explanations, please see:
-- http://ethen8181.github.io/machine-learning/search/bm25_intro.html
-- https://en.wikipedia.org/wiki/Okapi_BM25
-- https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
-
-The parameters `max_doc_freq` and `min_doc_freq` restrict the vocabulary
-that the transformer will consider. `max_doc_freq` indicates that terms in only
-up to the specified percentage of documents will be considered. For example, if
-`max_doc_freq` is set to 0.9, terms that are in more than 90% of documents
-will be removed. Similarly, the `min_doc_freq` parameter restricts terms in the
-other direction. A value of 0.01 means that only terms that are at least in 1% of
-documents will be included.
-"""
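The doc-frequency filtering that `max_doc_freq` and `min_doc_freq` describe can be sketched language-agnostically. The Python fragment below (a hypothetical helper for illustration, not part of this package) shows the intended keep/drop rule: a term survives only if the fraction of documents containing it lies within the two bounds.

```python
def filter_vocab(doc_freq, n_docs, max_doc_freq=1.0, min_doc_freq=0.0):
    """Keep terms whose document frequency (fraction of documents
    containing the term) lies within [min_doc_freq, max_doc_freq]."""
    return {term for term, df in doc_freq.items()
            if min_doc_freq <= df / n_docs <= max_doc_freq}
```

With `max_doc_freq=0.9` and `min_doc_freq=0.02` over 100 documents, a term in 95 documents is dropped as too common and a term in 1 document as too rare.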
mutable struct BM25Transformer <: AbstractTextTransformer
    max_doc_freq::Float64
    min_doc_freq::Float64
@@ -41,13 +6,13 @@ mutable struct BM25Transformer <: AbstractTextTransformer
    smooth_idf::Bool
end

-function BM25Transformer(;
+function BM25Transformer(;
    max_doc_freq::Float64 = 1.0,
    min_doc_freq::Float64 = 0.0,
    κ::Int = 2,
    β::Float64 = 0.75,
    smooth_idf::Bool = true
-)
+)
    transformer = BM25Transformer(max_doc_freq, min_doc_freq, κ, β, smooth_idf)
    message = MMI.clean!(transformer)
    isempty(message) || @warn message
@@ -103,14 +68,14 @@ function build_bm25!(doc_term_mat::SparseMatrixCSC{T},
    return bm25
end

-function _transform(transformer::BM25Transformer,
+function _transform(transformer::BM25Transformer,
                    result::BM25TransformerResult,
                    v::Corpus)
    doc_terms = build_dtm(v, result.vocab)
    bm25 = similar(doc_terms.dtm, eltype(result.idf_vector))
    build_bm25!(doc_terms.dtm, bm25, result.idf_vector, result.mean_words_in_docs; κ=transformer.κ, β=transformer.β)

-    # here we return the `adjoint` of our sparse matrix to conform to
+    # here we return the `adjoint` of our sparse matrix to conform to
    # the `n x p` dimensions throughout MLJ
    return adjoint(bm25)
end
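The per-term score that `build_bm25!` fills in follows the standard Okapi BM25 form. As a sketch under that assumption (Python used purely for illustration; `bm25_score` is a hypothetical name, not part of this package):

```python
def bm25_score(tf, idf, doc_len, avg_doc_len, kappa=2.0, beta=0.75):
    # Okapi BM25 term score: term frequency is saturated by kappa
    # (the score is bounded by (kappa + 1) * idf), and normalized by
    # document length relative to the corpus average, weighted by beta.
    norm = 1.0 - beta + beta * (doc_len / avg_doc_len)
    return idf * tf * (kappa + 1.0) / (tf + kappa * norm)
```

The shape of this function explains both hyper-parameters: each extra occurrence of a term adds less than the previous one (saturation controlled by `kappa`), and setting `beta = 0` removes the document-length adjustment entirely.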
@@ -142,6 +107,82 @@ MMI.metadata_model(BM25Transformer,
        AbstractVector{<:STB.Multiset{STB.Textual}}
    },
    output_scitype = AbstractMatrix{STB.Continuous},
-    docstring = "Build BM-25 matrix from raw documents",
    path = "MLJText.BM25Transformer"
)
+
+# # DOC STRING
+
+"""
+$(MMI.doc_header(BM25Transformer))
+
+The transformer converts a collection of documents, tokenized or pre-parsed as bags of
+words/ngrams, to a matrix of [Okapi BM25 document-word
+statistics](https://en.wikipedia.org/wiki/Okapi_BM25). The BM25 scoring function uses both
+term frequency (TF) and inverse document frequency (IDF, defined below), as in
+[`TfidfTransformer`](@ref), but additionally adjusts for the probability that a user will
+consider a search result relevant based on the terms in the search query and those in
+each document.
+
+$DOC_IDF
+
+References:
+
+- http://ethen8181.github.io/machine-learning/search/bm25_intro.html
+- https://en.wikipedia.org/wiki/Okapi_BM25
+- https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html
+
+# Training data
+
+In MLJ or MLJBase, bind an instance `model` to data with
+
+    mach = machine(model, X)
+
+Train the machine using `fit!(mach, rows=...)`.
+
+# Hyper-parameters
+
+- `max_doc_freq=1.0`: Restricts the vocabulary that the transformer will consider.
+  Terms that occur in `> max_doc_freq` documents will not be considered by the
+  transformer. For example, if `max_doc_freq` is set to 0.9, terms that are in more than
+  90% of the documents will be removed.
+
+- `min_doc_freq=0.0`: Restricts the vocabulary that the transformer will consider.
+  Terms that occur in `< min_doc_freq` documents will not be considered by the
+  transformer. A value of 0.01 means that only terms that are in at least 1% of the
+  documents will be included.
+
+- `κ=2`: The term frequency saturation characteristic. Higher values represent slower
+  saturation; saturation here means the degree to which a term occurring extra times
+  adds to the overall score.
+
+- `β=0.75`: Amplifies the particular document length compared to the average length. The
+  bigger β is, the more document length is amplified in terms of the overall score. β is
+  restricted between 0 and 1.
+
+- `smooth_idf=true`: Control which definition of IDF to use (see above).
+
+# Operations
+
+- `transform(mach, Xnew)`: Based on the vocabulary, IDF, and mean word counts learned in
+  training, return the matrix of BM25 scores for `Xnew`, a vector of the same form as `X`
+  above. The matrix has size `(n, p)`, where `n = length(Xnew)` and `p` is the size of the
+  vocabulary. Tokens/ngrams not appearing in the learned vocabulary are scored zero.
+
+# Fitted parameters
+
+The fields of `fitted_params(mach)` are:
+
+- `vocab`: A vector containing the strings used in the transformer's vocabulary.
+
+- `idf_vector`: The transformer's calculated IDF vector.
+
+- `mean_words_in_docs`: The mean number of words in each document.
+
+$(doc_examples(:BM25Transformer))
+
+See also [`TfidfTransformer`](@ref), [`CountTransformer`](@ref).
+
+"""
+BM25Transformer
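The `smooth_idf` flag toggles between two definitions of IDF (the ones expanded by `$DOC_IDF` in the docstring above). Assuming the sklearn-style definitions commonly used for these transformers, a sketch (Python, hypothetical helper name):

```python
import math

def idf(df, n_docs, smooth=True):
    # Smoothing acts as if one extra document contained every term,
    # so terms absent from training (df = 0) still get a finite weight.
    if smooth:
        return math.log((1 + n_docs) / (1 + df)) + 1.0
    return math.log(n_docs / df) + 1.0
```

Under either definition a term appearing in every document scores 1.0; smoothing slightly dampens the weight of rare terms while keeping `df = 0` well defined.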