This repository contains the code for the paper:
Laurie Prélot 1,2, Jiayu Chen 1, Matthias Hüser 1,3, André Kahles 1,2,∗ and Gunnar Rätsch 1,2,3,4,5,∗
1 Department of Computer Science, ETH Zürich, Zürich, Switzerland
2 University Hospital Zürich, Biomedical Informatics Research, Zürich, Switzerland
3 SIB Swiss Institute of Bioinformatics, Zürich, Switzerland
4 Department of Biology, ETH Zürich, Zürich, Switzerland
5 ETH AI Center, Zürich, Switzerland
∗ Corresponding authors
The scripts below are listed in sequential order.
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_germline.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_ref.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_somatic.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_somatic_and_germline.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_tgx_germline.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_tgx_ref.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_tgx_somatic.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/run_file_tgx_somatic_and_germline.sh
projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA/tmp_run/*
projects2020_immunopepper_analysis/immunopepper/translate_GTEX/GTEX2017/run_all_no_count_GTEX2017.sh
Annotation removed (for the requant pipeline) / annotation and GTEX translated in all reading frames removed (for the allframes pipeline) with:
projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/step_1_remove_GTEX/run_all_samples_BRCA-GTEX2017.sh projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/step_1_remove_GTEX/tmp_launch
projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/step_2_combine_remove_GTEX/filter-requant.py projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/step_2_combine_remove_GTEX/run_filter-requant.sh
projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/helpers/helpers_analyze_results.py projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/helpers/helpers_plotting.py
Cancer cohort recurrence filter applied (requant pipeline, allframes pipeline)
projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/step_3_recurrence_cancer_remove_junctions_BAM/20230824_dev-recurrence-star-save.ipynb (notebook)
projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/helpers/helpers_analyze_results.py projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/helpers/helpers_plotting.py
projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/helpers/helpers_analyze_results.py projects2020_immunopepper_analysis/immunopepper/filter_cancerspecific/helpers/helpers_plotting.py
These plots can be found in the Main and the Supplementary sections of the paper.
projects2020_immunopepper_analysis/mhcBinding/send_process_mhc_filter.sh
projects2020_immunopepper_analysis/mhcBinding/launch_files
Cleaning in: projects2020_immunopepper_analysis/mhcBinding/clean_intermediate.sh
projects2020_immunopepper_analysis/pepFasta/20231106_meta-matching_format-peptides-modular.py projects2020_immunopepper_analysis/pepFasta/helpers_format_peptides.py projects2020_immunopepper_analysis/pepFasta/helpers_metadata_matching.py projects2020_immunopepper_analysis/pepFasta/run_meta_matching.sh projects2020_immunopepper_analysis/pepFasta/send_meta-matching.sh
projects2020_immunopepper_analysis/pepDigest/send_trypsine.sh
projects2020_immunopepper_analysis/pepQuery/pep_search/multi_pepQ.sh
projects2020_immunopepper_analysis/pepQuery/pep_search/run_tmp/*
projects2020_immunopepper_analysis/pepNeighborsSearch
searched projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-index/. projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-index/./runall-createIndex-ipp.sh projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-index/./script_index.sh projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-index/./launch_multi_createIndex.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-search/. projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-search/./runall-search-ipp.sh projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-search/./launch_multi_search.sh projects2020_immunopepper_analysis/pepNeighborsSearch/20240108-tide-search/./script_search.sh
a) The search results are pooled across fractions.
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/a_extract_concat/.
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/a_extract_concat/./runall-extract-ipp.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/a_extract_concat/./script_extract.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/a_extract_concat/./launch_multi_extract.sh
b) FDR control is performed either PSM-wise with the Crux search engine or peptide-wise with the crema software tool.
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/b_conf_FDR/.
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/b_conf_FDR/./script_confidence_crema.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/b_conf_FDR/./launch_multi_FDR.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/b_conf_FDR/./runall-FDR-ipp.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/b_conf_FDR/./script_confidence.sh
projects2020_immunopepper_analysis/pepNeighborsSearch/20240109-FDR-correct/b_conf_FDR/./script_crema.py
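To illustrate what the FDR step computes, here is a minimal sketch of a generic target-decoy q-value estimate. It is not the Crux or crema implementation; the function name `target_decoy_qvalues`, the input layout (score, decoy flag) and the `(decoys + 1) / targets` estimator are assumptions chosen for illustration, and ties in score are not treated specially.

```python
from typing import List, Tuple


def target_decoy_qvalues(psms: List[Tuple[float, bool]]) -> List[float]:
    """Estimate q-values for (score, is_decoy) pairs; higher score is better.

    Hypothetical helper: a generic target-decoy estimate, not the method
    used by Crux or crema. Q-values are returned in input order.
    """
    # Rank all matches from best to worst score.
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)

    # Running FDR estimate at each rank: (decoys + 1) / targets, capped at 1.
    fdrs = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, (decoys + 1) / max(targets, 1)))

    # q-value: the smallest FDR at this rank or at any worse-scoring rank.
    for k in range(len(fdrs) - 2, -1, -1):
        fdrs[k] = min(fdrs[k], fdrs[k + 1])

    # Map back to the original input order.
    qvalues = [0.0] * len(psms)
    for rank, i in enumerate(order):
        qvalues[i] = fdrs[rank]
    return qvalues
```

Matches would then be accepted when their q-value falls below a chosen threshold; the threshold used in the paper is not stated here.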
results between the two proteomics methods
projects2020_immunopepper_analysis/plotting/plot_proteomics_results/20240109_from_parsed_compare_with_CC.ipynb
These plots can be found in the Main and the Supplementary sections of the paper.
projects2020_immunopepper_analysis/plotting/plot_proteomics_results/20240109_from_parsed_plot_raw_numbers.ipynb
These plots can be found in the Main and the Supplementary sections of the paper.
projects2020_immunopepper_analysis/plotting/plot_proteomics_results/helpers_initialize.py projects2020_immunopepper_analysis/plotting/plot_proteomics_results/helpers_parse_results.py projects2020_immunopepper_analysis/plotting/plot_proteomics_results/helpers_plotting_bars.py projects2020_immunopepper_analysis/plotting/plot_proteomics_results/helpers_validated_kmers.py
Found in projects2020_immunopepper_analysis/separatePeptideOrigin/
First step aims at pooling all the generated k-mers by categories (database creation to simplify post-processing) projects2020_immunopepper_analysis/separatePeptideOrigin/step1_generated_kmers_extract/send_generated_kmers_extract.sh
Second step creates a file which maps each k-mer to its class: junction_only (reference), germline, germline_and_somatic, somatic
projects2020_immunopepper_analysis/separatePeptideOrigin/step2_assign_mutation_type_kmers/20241216_RUN_Isolate_mutation_type.ipynb
It generates a map saved at: path_save = os.path.join(base_dir, f'filter_{sample}/result/FILTERED/part-kmers_CLASS_MAP.tsv.gz')
Then several plots are generated.
A. A new plot (swarmplot) is created and applied to previous outputs. This looks good for some of the filtering results projects2020_immunopepper_analysis/plotting/plot_proteomics_results/20240109_from_parsed_plot_raw_numbers_swarmplot.ipynb
B. The filtered k-mers are plotted per mutation class on a swarmplot. projects2020_immunopepper_analysis/plotting/plot_proteomics_results/20240109_from_parsed_plot_raw_numbers_swarmplot-multiclass_version.ipynb This plot can be found in the Supplementary section of the paper.
C. The MS-validated k-mers are plotted per mutation class on a swarmplot. projects2020_immunopepper_analysis/plotting/plot_proteomics_results/20240109_parse_proteomics_results-kmers-rates-1_ReviewPaper.ipynb This plot can be found in the Supplementary section of the paper.
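The k-mer class map produced in the second step above could be built along the following lines. This is only a sketch under stated assumptions: the notebook's actual assignment rule is not documented in this README, so the priority order used here (a k-mer gets the class of the first translation mode in which it appears), the function names and the column names `kmer` and `class` are all hypothetical.

```python
import csv
import gzip
from typing import Dict, Set

# Assumed priority: the first mode whose output contains a k-mer sets its class.
CLASS_PRIORITY = ["junction_only", "germline", "somatic", "germline_and_somatic"]


def assign_classes(kmers_by_mode: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each k-mer to one class label (hypothetical rule, see above)."""
    class_map: Dict[str, str] = {}
    for label in CLASS_PRIORITY:
        for kmer in kmers_by_mode.get(label, set()):
            # setdefault keeps the label from the highest-priority mode.
            class_map.setdefault(kmer, label)
    return class_map


def write_class_map(class_map: Dict[str, str], path: str) -> None:
    """Write the map as a gzip-compressed, tab-separated file."""
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["kmer", "class"])
        for kmer in sorted(class_map):
            writer.writerow([kmer, class_map[kmer]])
```

A file written this way has one row per k-mer, which makes it straightforward to join the class label onto the filtered or MS-validated k-mers before plotting them per mutation class.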
Review Task 2: Compute some statistics about the number of somatic mutations applied to each of the TCGA samples + cross-run comparisons
projects2020_immunopepper_analysis/posthocAnalyses/20250203_Compare_gene.ipynb
Review Task 3: Translate the peptides (junctions) in the wrong frame and assess the proteomics validation rate
- Step 1: Translate the peptides in the wrong frame.
  The code can be found in projects2020_immunopepper_analysis/immunopepper/translate_TCGA-BRCA (GitHub project).
  Bash script to launch ImmunoPepper: immunopepper/translate_TCGA-BRCA/translate_wrong_frame/send_all_cross_sample_TCGA_frames.sh
  Example of command: immunopepper/translate_TCGA-BRCA/translate_wrong_frame/run_file_ref.sh
  This plot can be found in the Supplementary section of the paper.
- Then remove the annotated frames from all frames.
  The operation performed is: k-mers from all frames \ k-mers from the annotated frame (novel or not) \ k-mers from the annotation \ Uniprot (performed at the ImmunoPepper stage), where \ denotes set difference.
  Code in: projects2020_immunopepper_analysis/posthocAnalyses/translate_wrong_frame/
  The filtering is performed in a notebook: posthocAnalyses/translate_wrong_frame/20250217_wrongFrame_select_kmers.ipynb
  This plot can be found in the Supplementary section of the paper.
- Extract the 2- or 3-exon context peptides which contain the filtered junction k-mers and create a fasta file.
  Code is in: projects2020_immunopepper_analysis/posthocAnalyses/translate_wrong_frame/
  The fasta is generated in a notebook (a bit slow): posthocAnalyses/translate_wrong_frame/20240217_wrongFrame_Fasta_matching.ipynb
  Therefore the code is also available as a script: posthocAnalyses/translate_wrong_frame/20240217_wrongFrame_Fasta_matching.py
  The following helper code is used: posthocAnalyses/translate_wrong_frame/helpers_format_peptides.py
  Updated to script in: posthocAnalyses/translate_wrong_frame/helpers_metadata_matching.py
  NOTE: some batches (of up to 10 genes each) could not be run. There are 2035 batches in total; the affected batches are 220, 495-6, 535, 1000, 1051, 1106 and 1945. This does not matter much because the peptides are sampled afterwards anyway.
- Digest the peptides from the fasta file, filter the tryptic peptides for size and make them unique.
  Code is in: projects2020_immunopepper_analysis/posthocAnalyses/pepDigest_wrong_frame/send_trypsine.sh
  Helper code in: projects2020_immunopepper_analysis/posthocAnalyses/pepDigest_wrong_frame/andy_lin_scripts/
  Peptide counts at each stage:
  Non-digested fasta: 1881861
  Digested fasta with some processing: 1123529
  After exclusion of peptides that are too short or too long: 851857
  After the uniqueness operation: 476485
- Sample the tryptic peptides and generate small fasta files.
  The sampling is performed to match the number of candidates that we use as input for proteomics in the analysis of the paper. The motivation behind the sampling is that the validation rate is heavily influenced by the size of the peptide set.
  Code is in: projects2020_immunopepper_analysis/posthocAnalyses/pepSampleFasta/20240217_wrongFrame_Fasta_sampling.ipynb
  The sampling was performed 10 times.
- Proteomics with Subset Neighbor Search: Compute the neighbor peptides and index the database.
  Code in: projects2020_immunopepper_analysis/posthocAnalyses/pepNeighborsSearch_wrong_frame/20240108-tide-index
- Proteomics with Subset Neighbor Search: Compare the spectra with the peptide database (Crux search engine).
  Code in: projects2020_immunopepper_analysis/posthocAnalyses/pepNeighborsSearch_wrong_frame/20240108-tide-search
  (Each fraction of the sample is matched.)
- Proteomics with Subset Neighbor Search: Perform FDR calculation.
  Code in: projects2020_immunopepper_analysis/posthocAnalyses/pepNeighborsSearch_wrong_frame/20240109-FDR-correct
- Proteomics with PepQuery.
  Code in: projects2020_immunopepper_analysis/posthocAnalyses/pepQuery_wrong_frame
- Extraction of validation rates.
  Code in: projects2020_immunopepper_analysis/posthocAnalyses/pepValidationRate_wrong_frame/20250219_parse_proteomics_rates.ipynb
The plots related to this experiment can be found in the Supplementary section of the paper.
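The core operations of the wrong-frame experiment (k-mer set difference, tryptic digestion with size filtering and deduplication, repeated sampling, and the validation rate) can be sketched as follows. This is a minimal illustration, not the code in the repository: all function names are hypothetical, the digestion rule (cleave after K or R unless followed by P, no missed cleavages) is the textbook trypsin rule rather than a confirmed setting, and the length bounds are placeholder values because the README does not state the ones used.

```python
import random
import re
from typing import Iterable, List, Set


def remove_known_kmers(all_frames: Set[str], *known: Set[str]) -> Set[str]:
    """Set difference: k-mers from all frames minus every known k-mer set."""
    return all_frames.difference(*known)


def tryptic_digest(peptide: str) -> List[str]:
    """Cleave after K or R, except before P (assumed rule, no missed cleavages)."""
    return [frag for frag in re.split(r"(?<=[KR])(?!P)", peptide) if frag]


def filter_and_deduplicate(fragments: Iterable[str],
                           min_len: int = 7,    # placeholder bound
                           max_len: int = 30    # placeholder bound
                           ) -> List[str]:
    """Keep fragments within the length bounds, dropping repeats (order kept)."""
    seen: Set[str] = set()
    kept: List[str] = []
    for frag in fragments:
        if min_len <= len(frag) <= max_len and frag not in seen:
            seen.add(frag)
            kept.append(frag)
    return kept


def sample_repeats(peptides: List[str], n_candidates: int,
                   n_repeats: int = 10, seed: int = 0) -> List[List[str]]:
    """Draw n_repeats random subsets, each matching the candidate count."""
    rng = random.Random(seed)
    size = min(n_candidates, len(peptides))
    return [rng.sample(peptides, size) for _ in range(n_repeats)]


def validation_rate(candidates: Iterable[str], validated: Iterable[str]) -> float:
    """Fraction of candidate peptides that were validated by proteomics."""
    candidate_set = set(candidates)
    if not candidate_set:
        return 0.0
    return len(candidate_set & set(validated)) / len(candidate_set)
```

Fixing the subset size to the candidate count of the main analysis keeps the validation rates of the wrong-frame subsets comparable with those reported for the real candidates, since the rate depends strongly on the size of the peptide set.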