
Commit d772365

Merge pull request #12 from JuliaAI/add_cv_and_bmi25
add BM25 and BagOfWords transformers, update tests, update readme, re…
2 parents 000a451 + 30b470a commit d772365

11 files changed: +606 −183 lines changed

Project.toml

Lines changed: 2 additions & 1 deletion

@@ -1,14 +1,15 @@
 name = "MLJText"
 uuid = "5e27fcf9-6bac-46ba-8580-b5712f3d6387"
 authors = ["Chris Alexander <[email protected]>, Anthony D. Blaom <[email protected]>"]
-version = "0.1.0"
+version = "0.1.1"

 [deps]
 CorpusLoaders = "214a0ac2-f95b-54f7-a80b-442ed9c2c9e8"
 MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
 ScientificTypes = "321657f4-b219-11e9-178b-2701a2544e81"
 ScientificTypesBase = "30f210dd-8aff-4c5f-94ba-8e64358c1161"
 SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
+Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 TextAnalysis = "a2db99b7-8b79-58f8-94bf-bbc811eef33d"

 [compat]

README.md

Lines changed: 62 additions & 5 deletions

@@ -10,14 +10,14 @@ extension providing tools and models for text analysis.

 The goal of this package is to provide an interface to various Natural Language Processing (NLP) resources for `MLJ` via such existing packages like [TextAnalysis](https://github.com/JuliaText/TextAnalysis.jl)

-Currently, we have TF-IDF Transformer which converts a collection of raw documents into a TF-IDF matrix.
+Currently, we have a TF-IDF Transformer which converts a collection of raw documents into a TF-IDF matrix. We also have a similar way of representing documents using the Okapi Best Match 25 algorithm - this works in a similar fashion to TF-IDF but introduces the probability that a term is relevant in a particular document. See [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25). Finally, there is also a simple Bag-of-Words representation available.

 ## TF-IDF Transformer
-"TF" means term-frequency while "TF-IDF" means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.
+"TF" means term-frequency while "TF-IDF" means term-frequency times inverse document-frequency. This is a common term-weighting scheme in information retrieval that has also found good use in document classification.

 The goal of using TF-IDF instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

-### Uses
+### Usage
 The TF-IDF Transformer accepts a variety of inputs for the raw documents that one wishes to convert into a TF-IDF matrix.

 Raw documents can simply be provided as tokenized documents.
@@ -38,9 +38,9 @@ The resulting matrix looks like:
 2×11 adjoint(::SparseArrays.SparseMatrixCSC{Float64, Int64}) with eltype Float64:
  0.234244  0.0       0.234244  0.0       0.234244  0.0       0.234244  0.234244  0.234244  0.0       0.0
  0.0       0.281093  0.0       0.281093  0.0       0.281093  0.0       0.0       0.0       0.281093  0.281093
-```
+```

-Functionality similar to Scikit-Learn's implementation with N-Grams can easily be implemented using features from `TextAnalysis`. Then the N-Grams themselves (either as a dictionary of Strings or dictionary of Tuples) can be passed into the transformer. We will likely introduce an additional transformer to handle these types of conversions in a future update to `MLJText`.
+Functionality similar to Scikit-Learn's implementation with N-Grams can easily be implemented using features from `TextAnalysis`. Then the N-Grams themselves (either as a dictionary of Strings or a dictionary of Tuples) can be passed into the transformer. We will likely introduce an additional transformer to handle these types of conversions in a future update to `MLJText`.
 ```julia

 # this will create unigrams and bigrams
@@ -53,3 +53,60 @@ MLJ.fit!(mach)

 tfidf_mat = transform(mach, ngram_docs)
 ```
+
+## BM25 Transformer
+BM25 is an approach similar to that of TF-IDF in terms of representing documents in a vector space. The BM25 scoring function uses both term frequency (TF) and inverse document frequency (IDF) so that, for each term in a document, its relative concentration in the document is scored (like TF-IDF). However, BM25 improves upon TF-IDF by incorporating probability - particularly, the probability that a user will consider a search result relevant based on the terms in the search query and those in each document.
+
+### Usage
+This transformer is used in much the same way as the `TfidfTransformer`.
+
+```julia
+using MLJ, MLJText, TextAnalysis
+
+docs = ["Hi my name is Sam.", "How are you today?"]
+bm25_transformer = BM25Transformer()
+mach = machine(bm25_transformer, tokenize.(docs))
+MLJ.fit!(mach)
+
+bm25_mat = transform(mach, tokenize.(docs))
+```
+
+The resulting matrix looks like:
+```
+2×11 adjoint(::SparseArrays.SparseMatrixCSC{Float64, Int64}) with eltype Float64:
+ 0.676463  0.0      0.676463  0.0      0.676463  0.0      0.676463  0.676463  0.676463  0.0      0.0
+ 0.0       0.81599  0.0       0.81599  0.0       0.81599  0.0       0.0       0.0       0.81599  0.81599
+```
+
+You will note that this transformer has some additional parameters compared to the `TfidfTransformer`:
+```
+BM25Transformer(
+    max_doc_freq = 1.0,
+    min_doc_freq = 0.0,
+    κ = 2,
+    β = 0.75,
+    smooth_idf = true)
+```
+Please see [http://ethen8181.github.io/machine-learning/search/bm25_intro.html](http://ethen8181.github.io/machine-learning/search/bm25_intro.html) for more details about how these parameters affect the matrix that is generated.
+
+## Bag-of-Words Transformer
+The `MLJText` package also offers a way to represent documents using the simpler bag-of-words representation. This returns a document-term matrix (as you would get in `TextAnalysis`) that consists of the count for every word in the corpus for each document in the corpus.
+
+### Usage
+```julia
+using MLJ, MLJText, TextAnalysis
+
+docs = ["Hi my name is Sam.", "How are you today?"]
+bagofwords_transformer = BagOfWordsTransformer()
+mach = machine(bagofwords_transformer, tokenize.(docs))
+MLJ.fit!(mach)
+
+bagofwords_mat = transform(mach, tokenize.(docs))
+```
+
+The resulting matrix looks like:
+```
+2×11 adjoint(::SparseArrays.SparseMatrixCSC{Int64, Int64}) with eltype Int64:
+ 1  0  1  0  1  0  1  1  1  0  0
+ 0  1  0  1  0  1  0  0  0  1  1
+```
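The new README points to the link above for what the BM25 hyperparameters do but only shows the defaults. As a minimal sketch, not part of the commit, using the keyword names from the parameter list shown above, a `BM25Transformer` with a restricted vocabulary might be configured like this:

```julia
using MLJ, MLJText, TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]

# hypothetical settings: drop terms appearing in more than 90% of
# documents; κ and β are the term-saturation and length-normalization
# knobs from the parameter list above
bm25_transformer = BM25Transformer(
    max_doc_freq = 0.9,
    min_doc_freq = 0.0,
    κ = 2,
    β = 0.75,
    smooth_idf = true)

mach = machine(bm25_transformer, tokenize.(docs))
MLJ.fit!(mach)
bm25_mat = transform(mach, tokenize.(docs))
```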

src/MLJText.jl

Lines changed: 9 additions & 1 deletion

@@ -6,16 +6,24 @@ import ScientificTypes: DefaultConvention
 import CorpusLoaders
 using SparseArrays
 using TextAnalysis
+using Statistics

 const MMI = MLJModelInterface
 const STB = ScientificTypesBase
 const CL = CorpusLoaders

 const PKG = "MLJText" # substitute model-providing package name

+const ScientificNGram{N} = NTuple{<:Any,STB.Textual}
+const NGram{N} = NTuple{<:Any,<:AbstractString}
+
 include("scitypes.jl")
+include("utils.jl")
+include("abstract_text_transformer.jl")
 include("tfidf_transformer.jl")
+include("bagofwords_transformer.jl")
+include("bm25_transformer.jl")

-export TfidfTransformer
+export TfidfTransformer, BM25Transformer, BagOfWordsTransformer

 end # module
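The new `NGram`/`ScientificNGram` aliases let the transformers declare n-gram inputs in their scitype signatures. As a rough illustration, not part of the commit and mirroring the README's N-gram snippet, such input can be produced with `TextAnalysis`:

```julia
using MLJ, MLJText, TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]

# unigrams and bigrams, as in the README's N-gram example; each
# document becomes a dictionary counting its n-grams
corpus = Corpus(NGramDocument.(docs, 1, 2))
ngram_docs = ngrams.(corpus)

tfidf_transformer = TfidfTransformer()
mach = machine(tfidf_transformer, ngram_docs)
MLJ.fit!(mach)
tfidf_mat = transform(mach, ngram_docs)
```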

src/abstract_text_transformer.jl

Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+abstract type AbstractTextTransformer <: MMI.Unsupervised end
+
+function MMI.clean!(transformer::AbstractTextTransformer)
+    warning = ""
+    if transformer.min_doc_freq < 0.0
+        warning *= "Need min_doc_freq ≥ 0. Resetting min_doc_freq=0. "
+        transformer.min_doc_freq = 0.0
+    end
+
+    if transformer.max_doc_freq > 1.0
+        warning *= "Need max_doc_freq ≤ 1. Resetting max_doc_freq=1. "
+        transformer.max_doc_freq = 1.0
+    end
+
+    if transformer.max_doc_freq < transformer.min_doc_freq
+        warning *= "max_doc_freq cannot be less than min_doc_freq, resetting to defaults. "
+        transformer.min_doc_freq = 0.0
+        transformer.max_doc_freq = 1.0
+    end
+    return warning
+end
+
+## General method to fit text transformer models ##
+MMI.fit(transformer::AbstractTextTransformer, verbosity::Int, X) =
+    _fit(transformer, verbosity, build_corpus(X))
+
+function _fit(transformer::AbstractTextTransformer, verbosity::Int, X::Corpus)
+    # process corpus vocab
+    update_lexicon!(X)
+    dtm_matrix = build_dtm(X)
+    n = size(dtm_matrix.dtm, 2) # docs are columns
+
+    # calculate min and max doc freq limits
+    if transformer.max_doc_freq < 1 || transformer.min_doc_freq > 0
+        high = round(Int, transformer.max_doc_freq * n)
+        low = round(Int, transformer.min_doc_freq * n)
+        new_dtm, vocab = limit_features(dtm_matrix, high, low)
+    else
+        new_dtm = dtm_matrix.dtm
+        vocab = dtm_matrix.terms
+    end
+
+    # calculate IDF
+    idf = compute_idf(transformer.smooth_idf, new_dtm)
+
+    # prepare result
+    fitresult = get_result(transformer, idf, vocab)
+    cache = nothing
+
+    return fitresult, cache, NamedTuple()
+end
+
+## General method to transform using text transformer models ##
+MMI.transform(transformer::AbstractTextTransformer, result, v) =
+    _transform(transformer, result, build_corpus(v))
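The `MMI.clean!` method above resets out-of-range hyperparameters with a warning rather than throwing. A minimal sketch of that behavior, not part of the commit, using the `BagOfWordsTransformer` keyword constructor introduced later in this diff (which calls `MMI.clean!`):

```julia
using MLJText

# 0.2 < 0.5 trips the third check in MMI.clean!, so the constructor
# emits a warning and resets both fields to their defaults
transformer = BagOfWordsTransformer(max_doc_freq = 0.2, min_doc_freq = 0.5)

@assert transformer.max_doc_freq == 1.0
@assert transformer.min_doc_freq == 0.0
```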

src/bagofwords_transformer.jl

Lines changed: 99 additions & 0 deletions

@@ -0,0 +1,99 @@
+"""
+    BagOfWordsTransformer()
+
+Convert a collection of raw documents to a matrix representing a bag-of-words structure.
+
+Essentially, a bag-of-words approach to representing documents in a matrix is comprised of
+a count of every word in the document corpus/collection for every document. This is a simple
+but often quite powerful way of representing documents as vectors. The resulting representation is
+a matrix with rows representing every document in the corpus and columns representing every word
+in the corpus. The value for each cell is the raw count of a particular word in a particular
+document.
+
+Similarly to the `TfidfTransformer`, the vocabulary considered can be restricted
+to words occurring in a maximum or minimum portion of documents.
+
+The parameters `max_doc_freq` and `min_doc_freq` restrict the vocabulary
+that the transformer will consider. `max_doc_freq` indicates that terms in only
+up to the specified percentage of documents will be considered. For example, if
+`max_doc_freq` is set to 0.9, terms that are in more than 90% of documents
+will be removed. Similarly, the `min_doc_freq` parameter restricts terms in the
+other direction. A value of 0.01 means that only terms that are in at least 1% of
+documents will be included.
+"""
+mutable struct BagOfWordsTransformer <: AbstractTextTransformer
+    max_doc_freq::Float64
+    min_doc_freq::Float64
+end
+
+function BagOfWordsTransformer(; max_doc_freq::Float64 = 1.0, min_doc_freq::Float64 = 0.0)
+    transformer = BagOfWordsTransformer(max_doc_freq, min_doc_freq)
+    message = MMI.clean!(transformer)
+    isempty(message) || @warn message
+    return transformer
+end
+
+struct BagOfWordsTransformerResult
+    vocab::Vector{String}
+end
+
+function _fit(transformer::BagOfWordsTransformer, verbosity::Int, X::Corpus)
+    # process corpus vocab
+    update_lexicon!(X)
+
+    # calculate min and max doc freq limits
+    if transformer.max_doc_freq < 1 || transformer.min_doc_freq > 0
+        # we need to build out the DTM
+        dtm_matrix = build_dtm(X)
+        n = size(dtm_matrix.dtm, 2) # docs are columns
+        high = round(Int, transformer.max_doc_freq * n)
+        low = round(Int, transformer.min_doc_freq * n)
+        _, vocab = limit_features(dtm_matrix, high, low)
+    else
+        vocab = sort(collect(keys(lexicon(X))))
+    end
+
+    # prepare result
+    fitresult = BagOfWordsTransformerResult(vocab)
+    cache = nothing
+
+    return fitresult, cache, NamedTuple()
+end
+
+function _transform(::BagOfWordsTransformer,
+                    result::BagOfWordsTransformerResult,
+                    v::Corpus)
+    dtm_matrix = build_dtm(v, result.vocab)
+
+    # here we return the `adjoint` of our sparse matrix to conform to
+    # the `n x p` dimensions throughout MLJ
+    return adjoint(dtm_matrix.dtm)
+end
+
+# for returning user-friendly form of the learned parameters:
+function MMI.fitted_params(::BagOfWordsTransformer, fitresult::BagOfWordsTransformerResult)
+    vocab = fitresult.vocab
+    return (vocab = vocab,)
+end
+
+## META DATA
+
+MMI.metadata_pkg(BagOfWordsTransformer,
+    name="$PKG",
+    uuid="7876af07-990d-54b4-ab0e-23690620f79a",
+    url="https://github.com/JuliaAI/MLJText.jl",
+    is_pure_julia=true,
+    license="MIT",
+    is_wrapper=false
+)
+
+MMI.metadata_model(BagOfWordsTransformer,
+    input_scitype = Union{
+        AbstractVector{<:AbstractVector{STB.Textual}},
+        AbstractVector{<:STB.Multiset{<:ScientificNGram}},
+        AbstractVector{<:STB.Multiset{STB.Textual}}
+    },
+    output_scitype = AbstractMatrix{STB.Continuous},
+    docstring = "Build Bag-of-Words matrix for corpus of documents",
+    path = "MLJText.BagOfWordsTransformer"
+)
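Tying the new pieces together, a hedged end-to-end sketch, not part of the commit, mirroring the README example and retrieving the learned vocabulary via the `fitted_params` method defined above:

```julia
using MLJ, MLJText, TextAnalysis

docs = ["Hi my name is Sam.", "How are you today?"]

# restrict the vocabulary via the doc-frequency limit; on a corpus this
# small the limit rounds to a no-op, but the plumbing is the same
bagofwords_transformer = BagOfWordsTransformer(max_doc_freq = 0.9)
mach = machine(bagofwords_transformer, tokenize.(docs))
MLJ.fit!(mach)

vocab = fitted_params(mach).vocab   # learned vocabulary as fitted
bagofwords_mat = transform(mach, tokenize.(docs))
```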
