Description
I've used the CountTransformer to produce a word frequency matrix as follows:
using MLJ   # provides @load, machine, fit! and transform
CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
X1 = transform(trans_machine, doc_list)
Then a function word_count has been applied to X1 (it aggregates the numbers in X1 for doing Naive Bayes, i.e. each element of X1 is accessed exactly once). This takes about 245 seconds on an M1 iMac; the size of X1 is (33716, 159093).
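The word_count function itself is not shown here; a rough sketch of the access pattern it describes, assuming a simple per-word aggregation over all documents (names are illustrative only, not the actual code), would be:

# Hypothetical stand-in for word_count: touches every element of X exactly once
# and sums the counts per word (column). The real function aggregates per class
# for Naive Bayes, but the element-access pattern is the same.
function word_count_sketch(X::AbstractMatrix{<:Integer})
    n_docs, n_words = size(X)
    totals = zeros(Int, n_words)
    for j in 1:n_words, i in 1:n_docs
        totals[j] += X[i, j]        # one getindex call per element
    end
    return totals
end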
If I produce the word frequency matrix using TextAnalysis directly as follows:
using TextAnalysis   # provides Corpus, TokenDocument, update_lexicon!, DocumentTermMatrix, dtm
crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X2 = dtm(m)
... then word_count runs in about 16.7 seconds on matrix X2. So accessing the elements of X1 is almost 15 times slower than accessing the elements of X2.
The difference between the two is that X2 is a "pure" SparseMatrixCSC, whereas X1 is of type LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}. I couldn't find any documentation on how this data structure is represented in Julia.
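For illustration, a small self-contained snippet (toy data, not the actual matrices) showing the same type difference: the adjoint is a lazy wrapper around the parent SparseMatrixCSC, and every element access is forwarded to the parent with swapped indices.

using SparseArrays, LinearAlgebra

S = sparse([1, 2, 3], [2, 3, 1], [5, 7, 9], 3, 4)   # SparseMatrixCSC{Int64, Int64}
A = S'                                              # lazy wrapper, no CSC data of its own

typeof(S)           # SparseMatrixCSC{Int64, Int64}
typeof(A)           # LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}
parent(A) === S     # true
A[3, 2] == S[2, 3]  # true: indexing goes through the parent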
Therefore I have a few questions:
- Is there a way to access the elements of X1 faster (or rather: why is it so slow)?
- I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end], but this takes almost 364 seconds. Is there a faster way to get it? (A small sketch of this attempt, with toy data, follows below.)
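For reference, here is a self-contained version of that extraction attempt on toy data, together with two generic alternatives (copy and sparse); these are shown only as candidates to benchmark, not as a confirmed faster path.

using SparseArrays, LinearAlgebra

S = sparse([1, 2, 3], [2, 3, 1], [5, 7, 9], 3, 4)
X1_toy = S'                    # stand-in for the Adjoint-wrapped matrix above

X3a = X1_toy[1:end, 1:end]     # the slicing approach from the question
X3b = copy(X1_toy)             # materializes the lazy wrapper
X3c = sparse(X1_toy)           # explicit conversion to SparseMatrixCSC

X3a == X3b == X3c              # true: all hold the same entries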
With these findings, it is of course not advisable to use CountTransformer for this purpose ... or did I miss something?