|
1 | 1 | """
|
2 | 2 | BM25Transformer()
|
3 | 3 |
|
4 |
| - Convert a collection of raw documents to a matrix using the Okapi BM25 document-word statistic. |
5 |
| -
|
6 |
| - BM25 is an approach similar to that of TF-IDF in terms of representing documents in a vector |
7 |
| - space. The BM25 scoring function uses both term frequency (TF) and inverse document frequency |
8 |
| - (IDF) so that, for each term in a document, its relative concentration in the document is |
9 |
| - scored (like TF-IDF). However, BM25 improves upon TF-IDF by incorporating probability - particularly, |
10 |
| - the probability that a user will consider a search result relevant based on the terms in the search query |
11 |
| - and those in each document. |
12 |
| -
|
13 |
| - The parameters `max_doc_freq`, `min_doc_freq`, and `smooth_idf` all work identically to those in the |
14 |
| - `TfidfTransformer`. BM25 introduces two additional parameters: |
15 |
| -
|
16 |
| - `κ` is the term frequency saturation characteristic. Higher values represent slower satuartion. What |
17 |
| - we mean by saturation is the degree to which a term occuring extra times adds to the overall score. This defaults |
18 |
| - to 2. |
19 |
| -
|
20 |
| - `β` is a parameter, bound between 0 and 1, that amplifies the particular document length compared to the average length. |
21 |
| - The bigger β is, the more document length is amplified in terms of the overall score. The default value is 0.75. |
22 |
| -
|
23 |
| - For more explanations, please see: |
24 |
| - http://ethen8181.github.io/machine-learning/search/bm25_intro.html |
25 |
| - https://en.wikipedia.org/wiki/Okapi_BM25 |
26 |
| - https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html |
| 4 | +Convert a collection of raw documents to a matrix using the Okapi BM25 document-word statistic. |
| 5 | +
|
| 6 | +BM25 is an approach similar to that of TF-IDF in terms of representing documents in a vector |
| 7 | +space. The BM25 scoring function uses both term frequency (TF) and inverse document frequency |
| 8 | +(IDF) so that, for each term in a document, its relative concentration in the document is |
| 9 | +scored (like TF-IDF). However, BM25 improves upon TF-IDF by incorporating probability - particularly, |
| 10 | +the probability that a user will consider a search result relevant based on the terms in the search query |
| 11 | +and those in each document. |
| 12 | +
|
| 13 | +The parameters `max_doc_freq`, `min_doc_freq`, and `smooth_idf` all work identically to those in the |
| 14 | +`TfidfTransformer`. BM25 introduces two additional parameters: |
| 15 | +
|
| 16 | +`κ` is the term frequency saturation characteristic. Higher values represent slower saturation. What |
| 17 | +we mean by saturation is the degree to which a term occurring extra times adds to the overall score. This defaults |
| 18 | +to 2. |
| 19 | +
|
| 20 | +`β` is a parameter, bounded between 0 and 1, that controls how strongly a document's length, relative to the average document length, affects the score. |
| 21 | +The larger β is, the more a document's length influences the overall score. The default value is 0.75. |
| 22 | +
|
| 23 | +For more explanations, please see: |
| 24 | +http://ethen8181.github.io/machine-learning/search/bm25_intro.html |
| 25 | +https://en.wikipedia.org/wiki/Okapi_BM25 |
| 26 | +https://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html |
| 27 | +
|
| 28 | +The parameters `max_doc_freq` and `min_doc_freq` restrict the vocabulary |
| 29 | +that the transformer will consider. `max_doc_freq` is the largest fraction of |
| 30 | +documents in which a term may appear and still be considered. For example, if |
| 31 | +`max_doc_freq` is set to 0.9, terms that are in more than 90% of documents |
| 32 | +will be removed. Similarly, the `min_doc_freq` parameter restricts terms in the |
| 33 | +other direction. A value of 0.01 means that only terms that appear in at least 1% of |
| 34 | +documents will be included. |
27 | 35 | """
|
28 |
| -MMI.@mlj_model mutable struct BM25Transformer <: AbstractTextTransformer |
29 |
| - max_doc_freq::Float64 = 1.0 |
30 |
| - min_doc_freq::Float64 = 0.0 |
31 |
| - κ::Int=2 |
32 |
| - β::Float64=0.75 |
| 36 | +mutable struct BM25Transformer <: AbstractTextTransformer |
| 37 | + max_doc_freq::Float64 |
| 38 | + min_doc_freq::Float64 |
| 39 | + κ::Int |
| 40 | + β::Float64 |
| 41 | + smooth_idf::Bool |
| 42 | +end |
| 43 | + |
| 44 | +function BM25Transformer(; |
| 45 | + max_doc_freq::Float64 = 1.0, |
| 46 | + min_doc_freq::Float64 = 0.0, |
| 47 | + κ::Int=2, |
| 48 | + β::Float64=0.75, |
33 | 49 | smooth_idf::Bool = true
|
| 50 | + ) |
| 51 | + transformer = BM25Transformer(max_doc_freq, min_doc_freq, κ, β, smooth_idf) |
| 52 | + message = MMI.clean!(transformer) |
| 53 | + isempty(message) || @warn message |
| 54 | + return transformer |
34 | 55 | end
|
35 | 56 |
|
36 | 57 | struct BMI25TransformerResult
|
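For reference alongside the docstring in this diff, the textbook Okapi BM25 term weight can be written with the docstring's symbols `κ` and `β` as below. This is the standard form, not something shown in the diff: the symbols $\mathrm{tf}(t,d)$, $|d|$, and $\mathrm{avgdl}$ are introduced here for illustration, and the exact IDF variant the transformer uses (for example under `smooth_idf`) is an assumption.

```latex
w(t, d) \;=\; \mathrm{IDF}(t)\,
  \frac{\mathrm{tf}(t, d)\,(\kappa + 1)}
       {\mathrm{tf}(t, d) + \kappa\left(1 - \beta + \beta\,\dfrac{|d|}{\mathrm{avgdl}}\right)}
```

With $\beta = 0$ the length term drops out entirely, and as $\mathrm{tf}(t,d)$ grows the fraction approaches $\kappa + 1$, which is the saturation behaviour the docstring describes.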
|
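The effect of `κ` and `β` described in the docstring can be sketched with a small standalone function. This is a minimal illustration of the standard Okapi BM25 weight, not the package's implementation: `bm25_weight` and its argument names are hypothetical, and the IDF factor is passed in by the caller rather than computed.

```julia
# Hypothetical helper, not part of the package API: per-term BM25 weight
# in the standard Okapi form, with the IDF factor supplied by the caller.
function bm25_weight(tf::Real, doc_len::Real, avg_doc_len::Real, idf::Real;
                     κ::Real = 2, β::Real = 0.75)
    0 <= β <= 1 || throw(ArgumentError("β must lie in [0, 1]"))
    # Length normalization: equals 1 when β = 0 or when doc_len is the average.
    norm = 1 - β + β * doc_len / avg_doc_len
    return idf * tf * (κ + 1) / (tf + κ * norm)
end

# Saturation: each extra occurrence adds less than the previous one.
for tf in (1, 2, 4, 8)
    println(tf, " => ", bm25_weight(tf, 100, 100, 1.0))
end

# Length effect: the same term count scores lower in a longer document.
println(bm25_weight(3, 200, 100, 1.0) < bm25_weight(3, 50, 100, 1.0))
```

Raising `κ` makes the loop's values level off more slowly, and setting `β = 0` makes the final comparison false because document length no longer affects the weight.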