The goal of this package is to provide an interface to various Natural Language Processing (NLP) resources for `MLJ` via existing packages such as [TextAnalysis](https://github.com/JuliaText/TextAnalysis.jl).
Currently, the package provides a TF-IDF Transformer, which converts a collection of raw documents into a TF-IDF matrix.
## TF-IDF Transformer
"TF" means term frequency, while "TF-IDF" means term frequency times inverse document frequency. This is a common term-weighting scheme in information retrieval that has also found good use in document classification.
The goal of using TF-IDF instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and are hence empirically less informative than tokens that occur in only a small fraction of the training corpus.
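
To make the weighting concrete, here is a minimal sketch in plain Julia. It assumes one common textbook variant, with term frequency normalized by document length and `idf = log(N / df)`; the exact formula and smoothing used by `TfidfTransformer` may differ. The helper names `term_freq` and `inv_doc_freq` and the toy corpus are hypothetical and exist only for illustration.

```julia
# Toy corpus of already-tokenized documents (illustrative data only).
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]

# Hypothetical helper: count of each token in one document,
# normalized by the document length.
function term_freq(doc)
    counts = Dict{String,Float64}()
    for tok in doc
        counts[tok] = get(counts, tok, 0.0) + 1.0
    end
    return Dict(tok => c / length(doc) for (tok, c) in counts)
end

# Hypothetical helper: log(N / df), where df is the number of
# documents that contain the token at least once.
function inv_doc_freq(docs)
    df = Dict{String,Int}()
    for doc in docs, tok in unique(doc)
        df[tok] = get(df, tok, 0) + 1
    end
    n = length(docs)
    return Dict(tok => log(n / d) for (tok, d) in df)
end

idf = inv_doc_freq(docs)

# One Dict of TF-IDF weights per document.
tfidf = [Dict(tok => tf * idf[tok] for (tok, tf) in term_freq(doc)) for doc in docs]
```

Under this variant, "the" appears in every document and receives a weight of zero, "sat" appears in two of the three documents and receives a small weight, and "cat" appears in only one document and receives the largest weight in the first document.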
### Uses
The TF-IDF Transformer accepts a variety of inputs for the raw documents that one wishes to convert into a TF-IDF matrix.
Raw documents can simply be provided as tokenized documents.
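```julia
using MLJ, MLJText, TextAnalysis

# Two raw documents, tokenized below before being passed to the transformer
docs = ["Hi my name is Sam.", "How are you today?"]
tfidf_transformer = TfidfTransformer()

# Bind the transformer to the tokenized documents and fit it
mach = machine(tfidf_transformer, tokenize.(docs))
MLJ.fit!(mach)

# Convert the tokenized documents into a TF-IDF matrix
tfidf_mat = transform(mach, tokenize.(docs))
```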
The resulting matrix looks like:
```
2×11 adjoint(::SparseArrays.SparseMatrixCSC{Float64, Int64}) with eltype Float64:
```

Functionality similar to Scikit-Learn's implementation with N-Grams can easily be implemented using features from `TextAnalysis`. Then the N-Grams themselves (either as a dictionary of Strings or dictionary of Tuples) can be passed into the transformer. We will likely introduce an additional transformer to handle these types of conversions in a future update to `MLJText`.
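
As a rough illustration of the kind of N-Gram dictionaries described above, the following plain-Julia sketch counts the N-Grams of a tokenized document, first keyed by tuples and then by strings. The helper `ngram_counts` is hypothetical and is not part of `MLJText` or `TextAnalysis`; `TextAnalysis` ships its own N-Gram features, which would normally be used instead. The exact dictionary type the transformer expects is not confirmed here.

```julia
# Hypothetical helper (not part of MLJText or TextAnalysis): count the
# n-grams of a tokenized document, keyed by tuples of tokens.
function ngram_counts(tokens::Vector{String}, n::Int)
    counts = Dict{Tuple{Vararg{String}},Int}()
    # Slide a window of width n over the tokens; the range is empty
    # when the document has fewer than n tokens.
    for i in 1:(length(tokens) - n + 1)
        gram = Tuple(tokens[i:i+n-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

tokens = ["how", "are", "you", "how", "are", "they"]

# Dictionary of Tuples, e.g. ("how", "are") => 2
bigrams = ngram_counts(tokens, 2)

# Dictionary of Strings, e.g. "how are" => 2
bigram_strings = Dict(join(gram, " ") => c for (gram, c) in bigrams)
```

Keeping the tuple form preserves token boundaries, while the string form is convenient when tokens never contain spaces.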