An `MLJ` extension providing tools and models for text analysis.
The goal of this package is to provide an interface to various Natural Language Processing (NLP) resources for `MLJ` via existing packages such as [TextAnalysis](https://github.com/JuliaText/TextAnalysis.jl).
Currently, we have a TF-IDF Transformer, which converts a collection of raw documents into a TF-IDF matrix. We also have a similar way of representing documents using the Okapi Best Match 25 algorithm: it works in a similar fashion to TF-IDF but incorporates the probability that a term is relevant in a particular document. See [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25). Finally, there is also a simple Bag-of-Words representation available.
## TF-IDF Transformer
"TF" means term-frequency while "TF-IDF" means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
17
17
18
18
The goal of using TF-IDF instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
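To make this concrete, here is the weighting worked by hand for a single term, using the common smoothed formulation (an assumption for illustration; the exact formula `MLJText` applies may differ slightly):

```julia
# Common smoothed TF-IDF weighting (illustrative formulation):
#   tf(t, d) = (count of t in d) / (tokens in d)
#   idf(t)   = log((1 + n_docs) / (1 + df(t))) + 1
tf    = 2 / 10                       # term occurs twice in a 10-token document
idf   = log((1 + 2) / (1 + 1)) + 1   # corpus of 2 documents; term appears in 1
tfidf = tf * idf                     # ≈ 0.28
```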
### Usage
The TF-IDF Transformer accepts a variety of inputs for the raw documents that one wishes to convert into a TF-IDF matrix.
Raw documents can simply be provided as tokenized documents.
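For example (a minimal sketch mirroring the `BM25Transformer` example below, so the exact call pattern is an assumption; `tokenize` comes from `TextAnalysis`):

```julia
using MLJ, MLJText, TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]
tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, tokenize.(docs))
MLJ.fit!(mach)

tfidf_mat = transform(mach, tokenize.(docs))
```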
The resulting matrix looks like:

```
2×11 adjoint(::SparseArrays.SparseMatrixCSC{Float64, Int64}) with eltype Float64:
```
Functionality similar to Scikit-Learn's N-Grams support can easily be replicated using features from `TextAnalysis`. The N-Grams themselves (either as a dictionary of Strings or a dictionary of Tuples) can then be passed into the transformer. We will likely introduce an additional transformer to handle these types of conversions in a future update to `MLJText`.
```julia
docs = ["Hi my name is Sam.", "How are you today?"]

# this will create unigrams and bigrams
# (the elided setup is reconstructed here; NGramDocument and ngrams
# come from TextAnalysis)
ngram_docs = ngrams.(NGramDocument.(docs, 1, 2))

tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, ngram_docs)
MLJ.fit!(mach)
tfidf_mat = transform(mach, ngram_docs)
```
## BM25 Transformer
BM25 is an approach similar to TF-IDF for representing documents in a vector space. The BM25 scoring function uses both term frequency (TF) and inverse document frequency (IDF), so that, for each term in a document, its relative concentration in the document is scored (as with TF-IDF). However, BM25 improves upon TF-IDF by incorporating probability: in particular, the probability that a user will consider a search result relevant based on the terms in the search query and those in each document.
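For intuition, here is the textbook per-term BM25 score (a sketch, not necessarily the exact variant `MLJText` implements), written in terms of the `κ` and `β` parameters the transformer exposes below:

```julia
# Per-term BM25 score: idf saturates term frequency (via κ) and normalises
# for document length relative to the corpus average (via β).
bm25_term(tf, idf, doclen, avgdl; κ = 2, β = 0.75) =
    idf * tf * (κ + 1) / (tf + κ * (1 - β + β * doclen / avgdl))

bm25_term(2, 1.4, 10, 12)   # ≈ 2.24 for a slightly shorter-than-average document
```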
### Usage
This transformer is used in much the same way as the `TfidfTransformer`.
```julia
using MLJ, MLJText, TextAnalysis
docs = ["Hi my name is Sam.", "How are you today?"]
bm25_transformer = BM25Transformer()
mach = machine(bm25_transformer, tokenize.(docs))
MLJ.fit!(mach)
bm25_mat = transform(mach, tokenize.(docs))
```
The resulting matrix looks like:
```
2×11 adjoint(::SparseArrays.SparseMatrixCSC{Float64, Int64}) with eltype Float64:
```

You will note that this transformer has some additional parameters compared to the `TfidfTransformer`:
```
BM25Transformer(
  max_doc_freq = 1.0,
  min_doc_freq = 0.0,
  κ = 2,
  β = 0.75,
  smooth_idf = true)
```
Please see [http://ethen8181.github.io/machine-learning/search/bm25_intro.html](http://ethen8181.github.io/machine-learning/search/bm25_intro.html) for more details about how these parameters affect the matrix that is generated.
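For example, the defaults can be overridden at construction time (the values below are purely illustrative):

```julia
# illustrative settings: drop very common and very rare terms, and reduce
# the term-frequency saturation parameter
bm25_transformer = BM25Transformer(
  max_doc_freq = 0.8,   # ignore terms in more than 80% of documents
  min_doc_freq = 0.1,   # ignore terms in fewer than 10% of documents
  κ = 1.2)
```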
## Bag-of-Words Transformer
The `MLJText` package also offers a way to represent documents using the simpler bag-of-words representation. This returns a document-term matrix (like the one you would get in `TextAnalysis`) consisting of the count of every word in the corpus for each document.
### Usage
```julia
using MLJ, MLJText, TextAnalysis
docs = ["Hi my name is Sam.", "How are you today?"]
# completed by analogy with the BM25 example above
# (the transformer name is an assumption)
bagofwords_transformer = BagOfWordsTransformer()
mach = machine(bagofwords_transformer, tokenize.(docs))
MLJ.fit!(mach)

bagofwords_mat = transform(mach, tokenize.(docs))
```