This repository contains the R code and results for a project investigating the impact of different database search strategies on False Discovery Rate (FDR) estimation in shotgun proteomics. The analysis compares an emulated concatenated search strategy against a separate, un-concatenated target-decoy search.
The primary finding is that the concatenated approach, which incorporates a competition model, identifies approximately 1.25 times more confident peptide-spectrum matches (PSMs) at a 1% FDR threshold than the more conservative un-concatenated method.
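The difference between the two estimates can be sketched in a few lines of R. This is a minimal illustration, not the project's analysis code: the toy scores, the function names `fdr_separate` and `fdr_concatenated`, and the tie-breaking rule are all assumptions made for the example, and the toy numbers are not meant to reproduce the 1.25-fold result.

```r
# Hypothetical toy data: one row per spectrum, holding the best target score
# and the best decoy score for that spectrum (higher is better).
psms <- data.frame(
  target_score = c(9.1, 8.4, 7.9, 7.2, 6.5, 5.8, 5.1, 4.4, 3.7, 3.0),
  decoy_score  = c(2.0, 3.1, 1.5, 7.5, 2.2, 6.0, 1.9, 4.9, 4.1, 3.6)
)

# Separate (un-concatenated) estimate: every spectrum contributes both its
# target PSM and its decoy PSM. FDR = decoys passing / targets passing.
fdr_separate <- function(target, decoy, threshold) {
  n_target <- sum(target >= threshold)
  if (n_target == 0) return(0)
  min(1, sum(decoy >= threshold) / n_target)
}

# Emulated concatenated estimate: target and decoy compete within each
# spectrum and only the winner is kept. Ties are assigned to the decoy,
# which is an assumption chosen here to stay conservative.
fdr_concatenated <- function(target, decoy, threshold) {
  target_wins <- target > decoy
  winner_score <- pmax(target, decoy)
  n_target <- sum(target_wins & winner_score >= threshold)
  if (n_target == 0) return(0)
  min(1, sum(!target_wins & winner_score >= threshold) / n_target)
}

fdr_separate(psms$target_score, psms$decoy_score, threshold = 5)
fdr_concatenated(psms$target_score, psms$decoy_score, threshold = 5)
```

In practice the threshold would be scanned over all observed scores to find the lowest one at which the estimated FDR stays at or below 1%.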
- SearchGUI (version 4.3.17 or similar), used to run the Comet search engine.
- R (version 4.0 or later), used for data analysis.
Run the following commands in your R console to install the necessary packages.
# For reading FASTA files
install.packages("microseq")
# For data manipulation and plotting
install.packages("dplyr")
install.packages("ggplot2")
install.packages("stringr")
# For reading mass spectrometry data files (from Bioconductor)
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("mzR", "Spectra", "MsBackendMzR", "Biostrings"))
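To confirm the installation succeeded before running the analysis, a quick check along these lines may help. It is only a sketch: the helper name `find_missing_packages` is hypothetical, and the package list simply repeats the names from the commands above.

```r
# Packages named in the installation commands above.
required_pkgs <- c("microseq", "dplyr", "ggplot2", "stringr",
                   "mzR", "Spectra", "MsBackendMzR", "Biostrings")

# Return the subset of `pkgs` whose namespace cannot be loaded.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_pkgs <- find_missing_packages(required_pkgs)
if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```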
The raw mass spectrometry file (JD_06232014_sample1-A.raw) is too large to be included in this repository.
- Download the file from the ProteomeXchange repository: PXD015300.
- Convert the .raw file to .mzML format using a tool like ProteoWizard's msConvert.
- Place the resulting JD_06232014_sample1-A.mzML file in the root directory of this project.
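Three separate searches must be performed using SearchGUI with the Comet search engine: a target-only search, a decoy-only search, and a top-100 target search on top100.mzML. Their settings are listed below in that order.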
- Spectrum File(s): JD_06232014_sample1-A.mzML
- Database File: data/iPRG2015.fasta
- Output Folder: Create and select search_results/target_results/
- Search Parameters:
  - Precursor Tolerance: 10.0 ppm
  - Fragment Tolerance: 0.02 Da
  - Enzyme: Trypsin (Specific), Max 2 Missed Cleavages
  - Fixed Modifications: Carbamidomethylation of C
  - Variable Modifications: Oxidation of M, Acetylation of protein N-term
  - Comet Advanced Settings -> Number of Spectrum Matches: 1
- Spectrum File(s): JD_06232014_sample1-A.mzML
- Database File: decoy.fasta (This file is generated by the first chunk of the R script, so run that chunk before starting this search.)
- Output Folder: Create and select search_results/decoy_results/
- Search Parameters: Use the exact same settings as the Target-Only search.
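How the R script builds decoy.fasta is not described here. One common approach is to reverse each target protein sequence, and the sketch below assumes that approach purely for illustration. The function name `make_reversed_decoys`, the `DECOY_` prefix, and the toy sequences are hypothetical; the `Header` and `Sequence` column names are assumed to match the layout produced when reading a FASTA file with microseq.

```r
# Reverse every protein sequence and tag its header so decoys stay
# distinguishable from targets. `fasta` is a data frame with `Header` and
# `Sequence` columns (assumed layout, see the note above).
make_reversed_decoys <- function(fasta, prefix = "DECOY_") {
  reverse_one <- function(s) {
    paste(rev(strsplit(s, "")[[1]]), collapse = "")
  }
  data.frame(
    Header   = paste0(prefix, fasta$Header),
    Sequence = vapply(fasta$Sequence, reverse_one, character(1),
                      USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy target entries, not taken from iPRG2015.fasta.
targets <- data.frame(
  Header   = c("protA", "protB"),
  Sequence = c("MKTAYIAK", "MSDNELR"),
  stringsAsFactors = FALSE
)

decoys <- make_reversed_decoys(targets)
print(decoys)
```

Reversal keeps the amino acid composition and length of each protein unchanged, which is why it is often preferred over random sequence generation.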
- Note: This search can only be run after generating the top100.mzML file using the R script.
- Spectrum File(s): top100.mzML
- Database File: data/iPRG2015.fasta
- Output Folder: Create and select search_results/top100_target_results/
- Search Parameters (with changes):
  - Precursor Tolerance: 100.0 ppm (Wider tolerance)
  - Fragment Tolerance: 0.02 Da
  - Enzyme: Trypsin (Specific), Max 2 Missed Cleavages
  - Fixed/Variable Modifications: Same as above.
  - Comet Advanced Settings -> Number of Spectrum Matches: 100
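To make the effect of the wider precursor tolerance concrete, the sketch below converts a ppm tolerance into an absolute m/z window. The helper `ppm_window` and the example m/z of 1000 are illustrative assumptions, not values from the project.

```r
# Convert a relative tolerance in ppm into the m/z window it allows
# around a given precursor m/z.
ppm_window <- function(mz, ppm) {
  delta <- mz * ppm / 1e6
  c(lower = mz - delta, upper = mz + delta)
}

# Compare the two precursor tolerances used in the searches above.
narrow <- ppm_window(1000, 10)
wide   <- ppm_window(1000, 100)

print(narrow)
print(wide)
```

At m/z 1000 the window is 0.01 on either side for 10 ppm and 0.1 on either side for 100 ppm, so the top-100 search admits roughly ten times more candidate precursor masses per spectrum.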
- Open the APE_Project_Analysis.Rmd file in RStudio.
- Ensure all prerequisite packages are installed.
- Run the code chunks sequentially or click the "Knit" button to generate the HTML report and all figures. The script will automatically load the search results, perform the FDR calculations, and generate the plots used in the final report.