Using SingleCellExperiment slows down processing on QFeatures objects #83

@leopoldguyot

Hello,
Some weeks ago, @lgatto showed me that readQFeatures performs significantly better than readSCP. This is because readSCP requires an additional conversion step: first, the QFeatures object is created with SE (SummarizedExperiment) objects, and then these SE objects are converted into SCE (SingleCellExperiment) objects. This conversion takes a significant amount of time: in the benchmark below, readSCP takes roughly 1.6 times as long as readQFeatures (median of about 221 ms versus 137 ms).

Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
1       readSCP 200.7866 215.9122 221.5260 220.8470 231.5974 238.6010    10
2 readQFeatures 128.7114 131.8418 146.7216 136.9536 143.7293 197.8956    10

To address this, I modified readSCP so that the QFeatures object is created directly with SCE objects, eliminating the need for a conversion. In practice, I reused the code of readQFeatures and replaced the call to SummarizedExperiment() with a call to SingleCellExperiment().

However, to my surprise, this new implementation, which I expected to be faster, actually takes more time than the current readSCP implementation.

Unit: milliseconds
           expr      min       lq     mean   median       uq      max neval
1       readSCP 199.9432 203.7096 217.4162 208.5496 233.9788 256.7043    10
2      readSCP2 370.7522 377.4332 387.5432 387.9019 391.7997 406.4224    10
3 readQFeatures 126.8745 128.1580 135.7015 133.2495 139.8490 157.7520    10

After profiling different chunks of code, I found that the functions called inside readQFeatures are significantly slower when they process SCE objects than when they process SE objects.

Further investigation revealed that this slowdown is due to the fact that the SCE (SingleCellExperiment) class inherits from RangedSummarizedExperiment rather than directly from SummarizedExperiment. As a result, when methods are called on an SCE object, the implementation from RangedSummarizedExperiment is used. This implementation is much slower because it often requires a call to rowRanges, which takes some computation. For example, the execution time of rowData differs noticeably between an SE and an SCE. A difference between roughly 1600 and 100 microseconds (median) per call may not seem like much, but if the operation is performed many times, it can noticeably impact performance.

Unit: microseconds
         expr      min       lq      mean   median        uq       max neval
 rowData(sce) 1524.946 1573.279 2159.7648 1615.196 1672.3090 52978.701   100
  rowData(se)   86.633   98.068  116.2062  115.442  124.5145   311.793   100
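
For anyone who wants to reproduce this comparison, here is a minimal sketch. The toy matrix, its dimensions and the object names are assumptions made for illustration, not the data used for the timings above, so the absolute numbers will differ.

```r
library(SummarizedExperiment)
library(SingleCellExperiment)
library(microbenchmark)

# Toy data (hypothetical): 1000 features x 10 samples
set.seed(1)
m <- matrix(rnorm(1000 * 10), nrow = 1000,
            dimnames = list(paste0("feat", 1:1000), paste0("cell", 1:10)))
rd <- DataFrame(id = rownames(m))

# Same content stored once as an SE and once as an SCE
se <- SummarizedExperiment(assays = list(intensity = m), rowData = rd)
sce <- SingleCellExperiment(assays = list(intensity = m), rowData = rd)

# The SCE is a RangedSummarizedExperiment, the plain SE is not
is(sce, "RangedSummarizedExperiment")
is(se, "RangedSummarizedExperiment")

# Time the same accessor on both objects
microbenchmark(rowData(sce), rowData(se), times = 100)
```

The two `is()` calls are only there to confirm which class each object belongs to before timing the accessor.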

This means that using SCE objects instead of SE objects could slow down the entire workflow, as illustrated by the new readSCP implementation above, where the only change was using SCE instead of SE.

To test this in a real use case, I used the leduc2022 vignette from SCP.replication. For each step of the vignette, I recorded the execution time, comparing the case where the initial QFeatures object contained SCE objects with the case where it contained SE objects.
Note that to make this comparison possible I had to remove one step, medianCVperCell, which in the current implementation of scp does not work with SE objects. This is caused by a check in the internal function filterCV that forces the use of an SCE, although the function could also work with an SE. I assume that other functions from scp could have the same issue.

[Figure: execution time of each step of the leduc2022 vignette, SCE versus SE]

The results show a performance difference between SE and SCE for all the steps, but this difference is not as pronounced as in readSCP.

Note that this benchmark was made with the current Bioconductor version of QFeatures, which does not include the optimisation for aggregateFeatures.

Given that, as far as I understand, the functionalities provided by the SCE class are mainly used in the context of SCPlainer, wouldn't it be more efficient to use SE objects in QFeatures while allowing the possibility of exporting an assay as an SCE object for use with SCPlainer, for instance?
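
To illustrate what such an export could look like, here is a minimal sketch that relies on the SE-to-SCE coercion provided by the SingleCellExperiment package. The toy object and its names are assumptions made for illustration; this is not an existing QFeatures or scp export function.

```r
library(SummarizedExperiment)
library(SingleCellExperiment)

# Toy SE standing in for one assay extracted from a QFeatures object (hypothetical)
m <- matrix(as.numeric(1:20), nrow = 5,
            dimnames = list(paste0("feat", 1:5), paste0("cell", 1:4)))
se <- SummarizedExperiment(assays = list(intensity = m),
                           rowData = DataFrame(id = rownames(m)))

# Convert to an SCE only at the point where SCE-specific functionality is needed
sce <- as(se, "SingleCellExperiment")

is(sce, "SingleCellExperiment")
dim(sce)
```

With this approach the conversion cost would be paid once, at export time, rather than on every processing step.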

What do you think @lgatto @cvanderaa ?
