Accessing elements in data structure produced by CountTransformer is quite slow #29

@roland-KA

I've used the CountTransformer to produce a word frequency matrix as follows:

using MLJ  # provides @load, machine, fit! and transform
CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
X1 = transform(trans_machine, doc_list)

I then applied a function word_count to X1. It aggregates the numbers in X1 for Naive Bayes, so each element of X1 is accessed exactly once.

This takes about 245 seconds on an M1 iMac; the size of X1 is (33716, 159093).

If I produce the word frequency matrix using TextAnalysis directly as follows:

using TextAnalysis  # provides Corpus, TokenDocument, update_lexicon!, DocumentTermMatrix and dtm
crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
X2 = dtm(m)

... then word_count runs in about 16.7 seconds on matrix X2. So accessing the elements of X1 is almost 15 times slower than accessing those of X2.

The difference between the two is that X2 is a "pure" SparseMatrix, whereas X1 is of type LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}. I could not find any information on how this data structure is represented in Julia.

Therefore I have a few questions:

  • Is there a way to access the elements of X1 faster (or rather: why is that so slow)?
  • I've tried to extract the "pure" SparseMatrix from X1 using X3 = X1[1:end, 1:end]. But this takes almost 364 sec. Is there a faster way to get it?
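
For reference, the wrapper type in question can be reproduced on a toy matrix. The following is only a minimal sketch, assuming that X1 is a lazy adjoint around a SparseMatrixCSC, as its type suggests. The small matrix A, its values and the helper stored_sum are made up for illustration and are not CountTransformer output or part of any package. It uses parent, copy and findnz from Base and the SparseArrays and LinearAlgebra standard libraries. Whether either route removes the slowdown on the real data has not been measured here.

```julia
using SparseArrays, LinearAlgebra

# Hypothetical 3×2 stand-in for the matrix wrapped inside X1.
A = sparse([1, 2, 3], [1, 1, 2], [4, 5, 6], 3, 2)

# Taking the adjoint is lazy: no data is copied, only a wrapper is created.
X1 = A'
@show typeof(X1)

# parent returns the wrapped SparseMatrixCSC itself, without copying.
# Its rows and columns are swapped relative to X1.
P = parent(X1)

# copy materializes the adjoint as a standalone SparseMatrixCSC
# with the same orientation as X1.
X3 = copy(X1)

# Hypothetical helper: visit every stored entry once via the parent.
# For real-valued entries the adjoint leaves the values unchanged,
# so the entry (i, j, v) of the parent corresponds to X[j, i] == v.
function stored_sum(X::Adjoint{<:Real,<:SparseMatrixCSC})
    rows, cols, vals = findnz(parent(X))
    total = zero(eltype(X))
    for (i, j, v) in zip(rows, cols, vals)
        total += v
    end
    return total
end

@show stored_sum(X1)
```

If the aggregation in word_count can be expressed over the stored entries of parent(X1) in this way, it would avoid indexing through the wrapper altogether; that is an assumption about word_count, which is not shown here.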

Given these findings, it does not seem advisable to use CountTransformer for this purpose ... or did I miss something?

Project status: priority low / involved
