Extremely Slow Performance of dataSummarizationPTM Function (2+ Days for Summarization)

Hi MSstatsPTM team,
The dataSummarizationPTM function runs extremely slowly for me. I would like to understand why this happens and whether there are ways to speed it up.
Environment Details:
R version: 4.3.1
Systems tested:
Personal computer: 16GB RAM, 8 cores
High-Performance Computing (HPC) cluster: 32GB RAM (per node), 4 cores allocated

Issue Description:
The summarization step takes an extremely long time to complete:

On my personal computer (16GB RAM, 8 cores): ~2 days
On HPC cluster (4 cores, 32GB RAM): Still very slow

Code Used:
# Parallel setup attempted
library(parallel)
library(doParallel)
cores <- 4
cl <- makeCluster(cores)
registerDoParallel(cl)

# Function call
Spectronaut_summary_PTM <- dataSummarizationPTM(
  spectronaut_test,
  logTrans = 2,
  normalization = "quantile",
  normalization.PTM = "quantile",
  summaryMethod = "TMP",
  MBimpute = TRUE,
  MBimpute.PTM = TRUE,
  use_log_file = TRUE,
  verbose = TRUE
  # ... other parameters as shown above
)

stopCluster(cl)
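
One thing worth checking about the setup above: as far as I understand, registerDoParallel() only registers a backend for foreach loops that use %dopar%. Whether dataSummarizationPTM runs any such loops internally is something I have not confirmed, so the registration may have no effect on it. The sketch below is a toy check, not MSstatsPTM code: it only confirms that the backend is registered and that a %dopar% loop actually uses it. The 2-worker cluster and the squares loop are illustrative assumptions.

```r
library(parallel)
library(doParallel)  # also attaches foreach

# Small illustrative cluster (2 workers is an arbitrary choice)
cl <- makeCluster(2)
registerDoParallel(cl)

# Number of workers the registered foreach backend reports
n_workers <- getDoParWorkers()

# Only loops written with %dopar% are sent to the workers;
# ordinary function calls are unaffected by the registration
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

stopCluster(cl)
registerDoSEQ()  # switch foreach back to sequential execution
```

If n_workers matches the cluster size but the summarization is no faster, the function is probably not dispatching work through foreach.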

Log:
INFO  [2025-04-20 00:20:06] == Start the summarization per subplot...
INFO  [2025-04-20 05:23:43] == Summarization is done.
INFO  [2025-04-20 05:23:45] Starting Protein summarization..
INFO  [2025-04-20 05:23:45] MSstats - dataProcess function
INFO  [2025-04-20 05:23:45] Summary method: TMP
INFO  [2025-04-20 05:23:45] censoredInt: NA
INFO  [2025-04-20 05:24:11] ** Features with one or two measurements across runs are removed.
INFO  [2025-04-20 05:24:11] ** Fractionation handled.
INFO  [2025-04-20 05:24:13] ** Updated quantification data to make balanced design. Missing values are marked by NA
INFO  [2025-04-20 05:24:13] 

INFO  [2025-04-20 05:24:16] Logarithm transformation with base 2 is done
INFO  [2025-04-20 05:24:19] Factorize in columns(GROUP, SUBJECT, GROUP_ORIGINAL, SUBJECT_ORIGINAL, FEATURE, RUN)
INFO  [2025-04-20 05:24:20] Normalization : Quantile normalization - okay
INFO  [2025-04-20 05:24:20] ** Log2 intensities under cutoff = 4.128  were considered as censored missing values.
INFO  [2025-04-20 05:24:20] ** Log2 intensities = NA were considered as censored missing values.
INFO  [2025-04-20 05:24:20] ** Use all features that the dataset originally has.
INFO  [2025-04-20 05:24:28] 
 # proteins: 2333
 # peptides per protein: 1-3880
 # features per peptide: 1-6
INFO  [2025-04-20 05:24:28] Some proteins have only one feature: 
 A0A1D5NY89,
 A0A1D5P4S3,
 A0A1D5P997,
 A0A1D5PRB6,
 A0A8V0X6F7 ...
INFO  [2025-04-20 05:24:28] 
                    NM WB
             # runs  4  4
    # bioreplicates  4  4
 # tech. replicates  1  1
INFO  [2025-04-20 05:24:29] Some features are completely missing in at least one condition:  
 AALDVDER_2_y4_1,
 EINDYTEK_2_y4_1,
 EINDYTEK_2_y6_1,
 GTEASATAATPK_2_y8_1,
 IANPTTTSR_2_y4_1 ...
INFO  [2025-04-20 05:24:29] == Start the summarization per subplot...
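
To put a number on the first summarization pass, the two timestamps around it in the log above can be subtracted. This is a minimal sketch using base R only. The timestamps are copied from the log, the UTC time zone is an assumption that does not affect the difference, and I am assuming this first pass is the PTM-level summarization because the "Starting Protein summarization" line follows it.

```r
# Timestamps copied from the log lines
# "== Start the summarization per subplot..." and "== Summarization is done."
pass_start <- as.POSIXct("2025-04-20 00:20:06", tz = "UTC")
pass_end   <- as.POSIXct("2025-04-20 05:23:43", tz = "UTC")

# Elapsed time of the first summarization pass, in hours
elapsed_hours <- as.numeric(difftime(pass_end, pass_start, units = "hours"))
print(round(elapsed_hours, 2))
```

That first pass took about 5 hours. The second pass starts at 05:24:29, and the log excerpt ends there.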

