Description
Hi MSstatsPTM team,
I'm seeing extremely slow performance when running the dataSummarizationPTM function. I would like to understand why this happens and whether there are ways to speed it up.
Environment Details:
R version: 4.3.1
Systems tested:
Personal computer: 16GB RAM, 8 cores
High-Performance Computing (HPC) cluster: 32GB RAM (per node), 4 cores allocated
Issue Description:
The summarization step takes an extremely long time to complete:
On my personal computer (16GB RAM, 8 cores): ~2 days
On HPC cluster (4 cores, 32GB RAM): Still very slow
Code Used:
# Parallel setup attempted
library(MSstatsPTM)
library(parallel)
library(doParallel)
cores <- 4
cl <- makeCluster(cores)
registerDoParallel(cl)

# Function call
Spectronaut_summary_PTM <- dataSummarizationPTM(
  spectronaut_test,
  logTrans = 2,
  normalization = "quantile",
  normalization.PTM = "quantile",
  summaryMethod = "TMP",
  MBimpute = TRUE,
  MBimpute.PTM = TRUE,
  use_log_file = TRUE,
  verbose = TRUE
  # ... other parameters as shown above
)

# Release the worker processes
stopCluster(cl)
Log
INFO [2025-04-20 00:20:06] == Start the summarization per subplot...
INFO [2025-04-20 05:23:43] == Summarization is done.
INFO [2025-04-20 05:23:45] Starting Protein summarization..
INFO [2025-04-20 05:23:45] MSstats - dataProcess function
INFO [2025-04-20 05:23:45] Summary method: TMP
INFO [2025-04-20 05:23:45] censoredInt: NA
INFO [2025-04-20 05:24:11] ** Features with one or two measurements across runs are removed.
INFO [2025-04-20 05:24:11] ** Fractionation handled.
INFO [2025-04-20 05:24:13] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO [2025-04-20 05:24:13]
INFO [2025-04-20 05:24:16] Logarithm transformation with base 2 is done
INFO [2025-04-20 05:24:19] Factorize in columns(GROUP, SUBJECT, GROUP_ORIGINAL, SUBJECT_ORIGINAL, FEATURE, RUN)
INFO [2025-04-20 05:24:20] Normalization : Quantile normalization - okay
INFO [2025-04-20 05:24:20] ** Log2 intensities under cutoff = 4.128 were considered as censored missing values.
INFO [2025-04-20 05:24:20] ** Log2 intensities = NA were considered as censored missing values.
INFO [2025-04-20 05:24:20] ** Use all features that the dataset originally has.
INFO [2025-04-20 05:24:28]
proteins: 2333
peptides per protein: 1-3880
features per peptide: 1-6
INFO [2025-04-20 05:24:28] Some proteins have only one feature:
A0A1D5NY89,
A0A1D5P4S3,
A0A1D5P997,
A0A1D5PRB6,
A0A8V0X6F7 ...
INFO [2025-04-20 05:24:28]
                   NM  WB
# runs              4   4
# bioreplicates     4   4
tech. replicates    1   1
INFO [2025-04-20 05:24:29] Some features are completely missing in at least one condition:
AALDVDER_2_y4_1,
EINDYTEK_2_y4_1,
EINDYTEK_2_y6_1,
GTEASATAATPK_2_y8_1,
IANPTTTSR_2_y4_1 ...
INFO [2025-04-20 05:24:29] == Start the summarization per subplot...
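For reference, the duration of the first summarization step can be read off the two timestamps in the log (00:20:06 to 05:23:43), which comes to a little over 5 hours before protein summarization even starts. A minimal base R sketch of that arithmetic follows. The variable names are mine, and the UTC time zone is an assumption made only so the subtraction is unambiguous; the log itself does not state a zone.

```r
# Timestamps copied from the log above: start and end of the first
# "summarization per subplot" step.
# Assumption: tz = "UTC" is used only to avoid daylight-saving ambiguity.
start_time <- as.POSIXct("2025-04-20 00:20:06", tz = "UTC")
end_time   <- as.POSIXct("2025-04-20 05:23:43", tz = "UTC")

# Elapsed wall-clock time, expressed in hours
elapsed <- difftime(end_time, start_time, units = "hours")
print(elapsed)
```

This only covers the first step; the second "Start the summarization per subplot" entry at 05:24:29 has no matching "done" line in the excerpt, so its duration cannot be computed from what is shown here.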