ProteoMaker: Simulating proteoforms in bottom-up proteomics

ProteoMaker is a platform for the generation of an in-silico bottom-up proteomics data set with a ground truth on the level of proteoforms.

All parameters used to generate the data are described in man/Parameters.qmd (render to HTML if you prefer). The script that runs the entire pipeline is RunSims.R. Alternatively, you can use the vignette. Simulations covering multiple parameter settings can be set up and run as a batch. ProteoMaker also compares the results with the ground truth using benchmarking metrics and supports visual comparison between simulated data sets.

The pipeline is illustrated in the figure ProteoMakerLayout.svg and consists of the following parts:

  1. General functions to run the simulations: 00_BatchRunFuncs.R.
  2. Generation of ground truth data at the proteoform level: 01_GenerateGroundTruth.R.
  3. Digestion of the proteoforms from the ground truth: 02_Digestion.R.
  4. In silico MS run: 03_MSRun.R.
  5. Functions for data analysis from peptides to proteins: 04_DataAnalysis.R.
  6. Statistical testing: 05_Statistics.R.
  7. Benchmarking: 06_Benchmarks.R.

Installation

Install the package from GitHub with devtools:

devtools::install_github("computproteomics/ProteoMaker")

Quick Start

  1. Load the package in R:
    library(ProteoMaker)
  2. Run the default simulation (adjust the output folder as needed):
    Param  <- def_param()
    Config <- set_proteomaker(resultFilePath = "results")
    run_sims(Param, Config)
    Intermediate outputDataAnalysis_<hash>.RData files and benchmark tables are written to results/.
  3. Explore the outputs, for example:
    visualize_benchmarks(Config$resultFilePath)
    or open the vignette for a full walkthrough.

Repository Overview

Path                          Description
R/                            Core simulation, analysis, and benchmarking functions
inst/config/parameters.yaml   Default parameter definitions consumed by def_param()
inst/cmd/RunSims.R            Convenience script to run the full simulation pipeline
vignettes/                    Walkthroughs and usage examples
inst/img/                     Diagrams and figures (e.g. pipeline layout)
tests/                        Automated tests for key functionality
inst/shiny/                   Shiny interface for interactive configuration (if used)

Running full batches and benchmarking

The vignette lets you run full batches without regenerating data sets that have already been built with the same set of parameters. In addition, the pipeline runs hierarchically to avoid repeating identical downstream analyses. This is achieved by creating hashes of the parameter configurations and writing intermediate and final results into the respective tables.
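The caching idea can be illustrated with a minimal sketch in base R. It is not ProteoMaker's implementation: the helpers hash_params() and run_cached(), the stage name, the .rds file layout, and the use of an MD5 checksum are all assumptions made for illustration. The point is only that a result stored under the hash of its parameters can be reloaded instead of recomputed.

```r
# Hypothetical helper: hash a parameter list by writing it to a temporary
# file and taking the MD5 checksum (base R only; ProteoMaker may hash differently).
hash_params <- function(params) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # Sort entries by name so that their order does not change the hash
  dput(params[order(names(params))], file = tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical cache wrapper: compute fun(params) only if no result file
# exists for this parameter hash; otherwise load the stored result.
run_cached <- function(stage, params, fun, result_dir) {
  path <- file.path(result_dir, paste0(stage, "_", hash_params(params), ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fun(params)
  saveRDS(result, path)
  result
}

# Fresh, empty folder for the demonstration
result_dir <- tempfile("cache_demo_")
dir.create(result_dir)

n_calls <- 0
toy_stage <- function(p) {
  n_calls <<- n_calls + 1  # count how often the stage is actually computed
  p$n * 2
}

first  <- run_cached("groundtruth", list(n = 10), toy_stage, result_dir)
second <- run_cached("groundtruth", list(n = 10), toy_stage, result_dir)  # loaded from disk
third  <- run_cached("groundtruth", list(n = 20), toy_stage, result_dir)  # new hash, recomputed
```

In a hierarchical setup, the key of a later stage would additionally depend on the hash of the stage before it, so that two configurations sharing the same early-stage parameters also share the stored early-stage result.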

Re-running the full batch with a different assessment of the benchmarking metrics skips data set generation and analysis, and should therefore be fast.
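As a rough sketch of what reusing stored results can look like, the snippet below finds files that follow the outputDataAnalysis_<hash>.RData naming and reports which objects each one contains, without running any simulation. The helper list_cached_outputs(), the demo file, and its contents are made up for illustration; the objects that ProteoMaker actually stores in these files are not described here.

```r
# Hypothetical helper: find stored analysis outputs in a result folder and
# report which objects each file contains, without re-running anything.
list_cached_outputs <- function(result_dir) {
  files <- list.files(result_dir,
                      pattern = "^outputDataAnalysis_.*\\.RData$",
                      full.names = TRUE)
  lapply(stats::setNames(files, basename(files)), function(f) {
    env <- new.env()
    load(f, envir = env)  # load into a private environment, not the workspace
    ls(env)
  })
}

# Toy demonstration with a made-up file; real files are written by run_sims().
demo_dir <- tempfile("results_")
dir.create(demo_dir)
toy_result <- data.frame(protein = c("P1", "P2"), log_ratio = c(0.5, -1.2))
save(toy_result, file = file.path(demo_dir, "outputDataAnalysis_abc123.RData"))

cached <- list_cached_outputs(demo_dir)
print(cached)
```

Loading into a separate environment keeps stored objects from overwriting variables in the current session, which matters when several result files contain objects with the same name.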

Important remarks:

  • Always define all parameters at the beginning of the script.
  • Be aware that varying upstream parameters (ground truth, digestion) can greatly increase the number of possible parameter settings.
  • Always keep the result files in the respective folder (resultFilePath) unless you have changed the pipeline itself, for example any of the methods in the sourced files. This allows you to run the full batch without re-running the data set generation and analysis.
  • Benchmarking results are skipped for data sets with fewer than 100 quantified proteins.
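
The growth in the number of parameter settings can be made concrete with a small base R sketch. The parameter names and values below are hypothetical and chosen only for illustration; the actual parameters are listed in man/Parameters.qmd. Every additional value for a parameter multiplies the number of full configurations, while the number of distinct ground-truth configurations stays much smaller.

```r
# Hypothetical parameter names and values, for illustration only
ground_truth <- list(frac_modified = c(0, 0.2, 0.5), n_replicates = c(3, 5))
digestion    <- list(missed_cleavages = c(0, 1, 2))
ms_run       <- list(noise_sd = c(0.1, 0.5))

# Each row of the grid is one full parameter configuration
grid <- expand.grid(c(ground_truth, digestion, ms_run))
n_total <- nrow(grid)  # 3 * 2 * 3 * 2 = 36

# Distinct ground-truth configurations; with reuse of intermediate results,
# only this many ground-truth data sets would need to be generated
n_ground_truth <- nrow(unique(grid[names(ground_truth)]))  # 3 * 2 = 6

# One extra value for a ground-truth parameter multiplies the total
ground_truth$frac_modified <- c(0, 0.2, 0.5, 0.8)
n_total_after <- nrow(expand.grid(c(ground_truth, digestion, ms_run)))  # 4 * 2 * 3 * 2 = 48
```

Counting the grid before launching a batch is a cheap way to check that the planned run is feasible.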
