Accessing elements in the data structure produced by `CountTransformer` is quite slow #29

I've used the `CountTransformer` to produce a word frequency matrix as follows:

```julia
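using MLJ

# Load the model type from MLJText and bind it to the documents
CountTransformer = @load CountTransformer pkg=MLJText
trans_machine = machine(CountTransformer(), doc_list)
fit!(trans_machine)
# Word frequency matrix (documents × words)
X1 = transform(trans_machine, doc_list)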
```
I then apply a function `word_count` to `X1`. It aggregates the counts in `X1` for Naive Bayes, so each element of `X1` is accessed exactly once.

This takes about 245 seconds (on an M1 iMac); the size of `X1` is (33716, 159093).

If I produce the word frequency matrix using `TextAnalysis` directly as follows:

```julia
using TextAnalysis

crps = Corpus(TokenDocument.(doc_list))
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
# Extract the sparse document-term matrix
X2 = dtm(m)
```
... then `word_count` runs in about 16.7 seconds on the matrix `X2`. So accessing the elements of `X1` is almost 15 times slower than accessing those of `X2`.

The difference between the two is that `X2` is a "pure" `SparseMatrix`, whereas `X1` is of type `LinearAlgebra.Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}`. I couldn't find any information on how this data structure is represented in Julia.

So I have two questions:
- Is there a faster way to access the elements of `X1` (or rather: why is it so slow)?
- I've tried to extract the "pure" `SparseMatrix` from `X1` using `X3 = X1[1:end, 1:end]`, but this takes almost 364 seconds. Is there a faster way to get it?

Given these findings, using `CountTransformer` for this purpose does not seem advisable ... or did I miss something?
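
To make the two questions above concrete, here is a minimal sketch on a tiny matrix. It assumes that `X1` behaves like any lazy `Adjoint` wrapper around a `SparseMatrixCSC`; the matrix `P` and the helper `column_sums_via_parent` are hypothetical stand-ins, not part of MLJText and not my actual `word_count`. As far as I understand, `copy` on such a wrapper materializes it as a plain `SparseMatrixCSC`, and `parent` exposes the underlying storage, where entry `(i, j)` of the parent corresponds to entry `(j, i)` of the adjoint.

```julia
using SparseArrays, LinearAlgebra

# Tiny stand-in for the transformer output: a lazy adjoint of a sparse matrix.
P = sparse([1, 2, 3, 1], [1, 1, 2, 3], [2, 1, 4, 5], 3, 3)
X = P'    # Adjoint{Int64, SparseMatrixCSC{Int64, Int64}}

# Option 1: materialize the adjoint once as a plain SparseMatrixCSC.
S = copy(X)

# Option 2: aggregate over the stored entries of the parent directly.
# Hypothetical helper: sums each column of X without indexing into X.
function column_sums_via_parent(X::Adjoint{<:Real,<:SparseMatrixCSC})
    rows, _, vals = findnz(parent(X))    # stored entries of the parent
    sums = zeros(eltype(X), size(X, 2))
    for (i, v) in zip(rows, vals)
        sums[i] += v    # row i of the parent is column i of X
    end
    return sums
end

sums = column_sums_via_parent(X)
```

On the real data the first option would read `X3 = copy(X1)`; whether it beats `X1[1:end, 1:end]` at this size would still need to be measured.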
