Description
I've used the `CountTransformer` to produce a word frequency matrix as follows:

```julia
CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
X1 = transform(trans_machine, doc_list)
```

Then a function `word_count` has been applied to `X1` (it aggregates the numbers in `X1` for Naive Bayes; i.e. each element of `X1` is accessed exactly once). This takes about 245 seconds (on an M1 iMac); the size of `X1` is (33716, 159093).
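The actual `word_count` is not reproduced here; the following is only an illustrative sketch of the access pattern involved (every element read once and aggregated per column), with the function name and the per-column totals being hypothetical:

```julia
# Purely illustrative sketch of the access pattern, not the actual word_count.
# Every element of the matrix is read exactly once.
function word_count_sketch(X)
    nrows, ncols = size(X)
    totals = zeros(Int, ncols)        # per-word (per-column) aggregates
    for j in 1:ncols, i in 1:nrows    # column-major traversal
        totals[j] += X[i, j]          # scalar indexing on every element
    end
    return totals
end
```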
If I produce the word frequency matrix using TextAnalysis directly as follows:

```julia
crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X2 = dtm(m)
```

... then `word_count` runs in about 16.7 seconds on matrix `X2`. So accessing the elements of `X1` is almost 15 times slower than accessing the elements of `X2`.
The difference between the two is that `X2` is a "pure" `SparseMatrixCSC`, whereas `X1` is of type `LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}`. I didn't find much information on how this data structure is represented in Julia.
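As far as I can tell from the `LinearAlgebra`/`SparseArrays` documentation (this is my own reading, not something stated by MLJText), `Adjoint` is a lazy wrapper around the parent sparse matrix: no data is copied, and every scalar access is forwarded to the parent with swapped indices. A small stand-in example:

```julia
using LinearAlgebra, SparseArrays

A = sparse([1 0 2; 0 3 0])   # small stand-in for the underlying matrix
B = A'                       # Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}; no data is copied

parent(B) === A              # true: B merely wraps A
B[3, 1] == A[1, 3]           # true: B[i, j] forwards to A[j, i]
```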
Therefore I have a few questions:
- Is there a way to access the elements of `X1` faster (or rather: why is that so slow)?
- I've tried to extract the "pure" `SparseMatrixCSC` from `X1` using `X3 = X1[1:end, 1:end]`, but this takes almost 364 seconds. Is there a faster way to get it? (See the sketch after this list.)
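For reference, here are two candidate conversions that, if I read the `SparseArrays` documentation correctly, materialize a plain `SparseMatrixCSC` without going through scalar indexing; whether they are actually faster in this setting is part of my question:

```julia
# Candidate conversions (untested here); both should return a plain SparseMatrixCSC.
X3a = copy(X1)                 # copy of an Adjoint-wrapped sparse matrix
X3b = permutedims(parent(X1))  # explicitly transpose the underlying SparseMatrixCSC
```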
Given these findings, it is of course not advisable to use `CountTransformer` for this purpose ... or did I miss something?