Extremely Slow Performance of dataSummarizationPTM Function (2+ Days for Summarization) #113

@dingdaxianxinnn

Hi MSstatsPTM team,
I'm experiencing extremely slow performance when running the dataSummarizationPTM function. I would like to understand why this happens and whether there are ways to speed it up.
Environment Details:
R version: 4.3.1
Systems tested:
Personal computer: 16GB RAM, 8 cores
High-Performance Computing (HPC) cluster: 32GB RAM (per node), 4 cores allocated

Issue Description:
The summarization step takes an extremely long time to complete:

On my personal computer (16GB RAM, 8 cores): ~2 days
On HPC cluster (4 cores, 32GB RAM): Still very slow

Code Used:
# Parallel setup attempted
library(parallel)
library(doParallel)
cores <- 4
cl <- makeCluster(cores)
registerDoParallel(cl)

# Function call
Spectronaut_summary_PTM <- dataSummarizationPTM(
  spectronaut_test,
  logTrans = 2,
  normalization = "quantile",
  normalization.PTM = "quantile",
  summaryMethod = "TMP",
  MBimpute = TRUE,
  MBimpute.PTM = TRUE,
  use_log_file = TRUE,
  verbose = TRUE
  # ... other parameters as shown above
)

stopCluster(cl)
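
To help narrow this down, here is a minimal sketch of how the run time could be measured on a small subset of proteins before committing to the full dataset. The helper `subset_first_proteins` and the toy data are purely illustrative, the `ProteinName` column is an assumption about the input tables, and `aggregate()` only stands in for the real summarization call.

```r
# Hypothetical helper (not part of MSstatsPTM): keep only the rows that belong
# to the first `n` unique values of the assumed `ProteinName` column.
subset_first_proteins <- function(df, n) {
  keep <- head(unique(df$ProteinName), n)
  df[df$ProteinName %in% keep, , drop = FALSE]
}

# Toy data standing in for one table of the real input
toy <- data.frame(
  ProteinName = rep(c("P1", "P2", "P3", "P4"), each = 3),
  Intensity = seq_len(12),
  stringsAsFactors = FALSE
)

# Reduce the input to the first two proteins
small <- subset_first_proteins(toy, 2)

# Time an expression on the reduced input; aggregate() is a stand-in
# for the actual summarization step
elapsed <- system.time(
  result <- aggregate(Intensity ~ ProteinName, data = small, FUN = median)
)["elapsed"]
```

With the real data, the same idea would presumably mean subsetting each input table (assuming spectronaut_test is a list holding PTM and PROTEIN data frames) and wrapping the dataSummarizationPTM call in system.time(), so that the time per protein can be extrapolated to all 2333 proteins.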

Log

INFO [2025-04-20 00:20:06] == Start the summarization per subplot...
INFO [2025-04-20 05:23:43] == Summarization is done.
INFO [2025-04-20 05:23:45] Starting Protein summarization..
INFO [2025-04-20 05:23:45] MSstats - dataProcess function
INFO [2025-04-20 05:23:45] Summary method: TMP
INFO [2025-04-20 05:23:45] censoredInt: NA
INFO [2025-04-20 05:24:11] ** Features with one or two measurements across runs are removed.
INFO [2025-04-20 05:24:11] ** Fractionation handled.
INFO [2025-04-20 05:24:13] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2025-04-20 05:24:13]

INFO [2025-04-20 05:24:16] Logarithm transformation with base 2 is done
INFO [2025-04-20 05:24:19] Factorize in columns(GROUP, SUBJECT, GROUP_ORIGINAL, SUBJECT_ORIGINAL, FEATURE, RUN)
INFO [2025-04-20 05:24:20] Normalization : Quantile normalization - okay
INFO [2025-04-20 05:24:20] ** Log2 intensities under cutoff = 4.128 were considered as censored missing values.
INFO [2025-04-20 05:24:20] ** Log2 intensities = NA were considered as censored missing values.
INFO [2025-04-20 05:24:20] ** Use all features that the dataset originally has.
INFO [2025-04-20 05:24:28]

  # proteins: 2333
  # peptides per protein: 1-3880
  # features per peptide: 1-6

INFO [2025-04-20 05:24:28] Some proteins have only one feature:
A0A1D5NY89,
A0A1D5P4S3,
A0A1D5P997,
A0A1D5PRB6,
A0A8V0X6F7 ...
INFO [2025-04-20 05:24:28]
                    NM WB
             # runs  4  4
    # bioreplicates  4  4
 # tech. replicates  1  1

INFO [2025-04-20 05:24:29] Some features are completely missing in at least one condition:
AALDVDER_2_y4_1,
EINDYTEK_2_y4_1,
EINDYTEK_2_y6_1,
GTEASATAATPK_2_y8_1,
IANPTTTSR_2_y4_1 ...
INFO [2025-04-20 05:24:29] == Start the summarization per subplot...
