Add section for running pVACseq

susannasiebert · susannasiebert · commit dfd1554880ac · 2023-07-05T15:58:01.000-05:00
diff --git a/02-prerequisites.Rmd b/02-prerequisites.Rmd
@@ -90,7 +90,7 @@ General:
 - `Homo_sapiens.GRCh38.pep.all.fa.gz`: A reference proteome peptide FASTA to use
   for determining whether there are any reference matches of neoantigen candidates
 
-To download this data, please run the following command:
+To download this data, please run the following commands:
 
 ```{r, engine = 'bash', eval = FALSE}
 wget https://raw.githubusercontent.com/griffithlab/pVACtools_Intro_Course/main/HCC1395_inputs.zip
@@ -99,4 +99,4 @@ unzip HCC1395_inputs.zip
 
 This course will not cover the required pre-processing steps for the pVACtools
 input data but extensive instructions on how to prepare your own data for use
-with pVACtools can be found at[pvactools.org](https://www.pvactools.org)
+with pVACtools can be found at [pvactools.org](https://www.pvactools.org)
diff --git a/03-running_pvactools.Rmd b/03-running_pvactools.Rmd
@@ -9,12 +9,183 @@ ottrpal::set_knitr_image_path()
 
 This chapter will cover:
 
-- Running pVACtools
+- Starting an interactive Docker session
+- Running pVACseq
+- Running pVACfuse
 - Understanding pVACtools outputs
 
-## Running pVACtools
+## Starting Docker
 
-This section will explain how to run pVACtools either using Docker.
+In your terminal, execute the following commands from the directory that
+contains the `HCC1395_inputs` folder:
+
+```{r, engine = 'bash', eval = FALSE}
+mkdir pVACtools_outputs
+
+# Bind mounts need an absolute host path; a bare name would create a named volume instead.
+docker run \
+-v "$(pwd)/HCC1395_inputs":/HCC1395_inputs \
+-v "$(pwd)/pVACtools_outputs":/pVACtools_outputs \
+-it griffithlab/pvactools:4.0.0 \
+/bin/bash
+```
+
+This will pull the 4.0.0 version of the griffithlab/pvactools Docker image and
+start an interactive session in a container created from that image. The
+`-v "$(pwd)/HCC1395_inputs":/HCC1395_inputs` part of the command will mount the
+`HCC1395_inputs` folder at `/HCC1395_inputs` inside of the Docker container
+so that you will have access to the input data from inside the Docker
+container. The `-v "$(pwd)/pVACtools_outputs":/pVACtools_outputs` part of the
+command will mount the `pVACtools_outputs` folder you just created. We will
+write the outputs from pVACseq and pVACfuse to that folder so that you will
+have access to them once you exit the Docker container.
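+
+Before starting the container, it can help to confirm that the input files
+referenced later in this chapter are present on the host. The following is a
+minimal sketch and not part of pVACtools: `count_missing` is a hypothetical
+helper name, and the file list is taken from the commands used in this course.
+
+```{r, engine = 'bash', eval = FALSE}
+# Hypothetical helper: print how many expected input files are absent from a folder.
+count_missing() {
+  dir=$1
+  missing=0
+  for f in annotated.expression.vcf.gz phased.vcf.gz Homo_sapiens.GRCh38.pep.all.fa.gz; do
+    if [ ! -f "$dir/$f" ]; then
+      echo "missing: $dir/$f" >&2
+      missing=$((missing + 1))
+    fi
+  done
+  echo "$missing"
+}
+
+count_missing HCC1395_inputs
+```
+
+A printed count of `0` suggests the folder is ready to be mounted; any other
+count is accompanied by the names of the missing files on standard error.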
+
+## Running pVACseq
+
+The pVACseq pipeline is run using the `pvacseq run` command.
+
+
+### Required Parameters
+
+The `pvacseq run` command takes a number of required parameters in the
+following order:
+
+- `vcf_file`: A VEP-annotated single- or multi-sample VCF containing genotype,
+  transcript, Wildtype protein sequence, and Frameshift protein sequence
+  information.
+- `sample_name`: The name of the tumor sample being processed. When processing
+  a multi-sample VCF the sample name must be a sample ID in the input VCF #CHROM
+  header line. Only variants that are called (genotype/GT 0/1 or 1/1) in that
+  sample will be processed.
+- `allele(s)`: The name of the HLA allele to use for epitope prediction. Multiple
+  alleles can be specified using a comma-separated list. These should be the
+  HLA alleles of your patient. You might have clinical typing information for
+  your patient. If not, you will need to computationally predict the patient's
+  HLA type using software such as OptiType.
+- `prediction_algorithms`: The epitope prediction algorithms to use. Multiple
+  prediction algorithms can be specified, separated by spaces. Use `all` to
+  run all available prediction algorithms.
+- `output_dir`: The directory for writing all result files.
+
+### Optional Parameters
+
+The `pvacseq run` command offers quite a few optional arguments to fine-tune
+your run. Here is a list of parameters we generally recommend:
+
+- `--phased-proximal-variants-vcf`: This is an additional VCF file that
+  includes both somatic and germline variants with phasing information. This
+  file is used to identify variants near a somatic variant of interest and
+  in-phase that would, as a result, change the predicted protein sequence
+  around the somatic variant of interest and, thus, change the predicted
+  neoantigens. Please note that pVACseq is currently only able to incorporate
+  proximal missense variants, so users should still manually investigate their
+  candidates for other types of nearby variants (e.g. inframe and frameshift
+  indels).
+- `--normal-sample-name`: When using a tumor-normal input VCF, this parameter
+  is used to identify the normal sample in the VCF in order to parse
+  coverage metrics for the normal sample.
+- `--iedb-install-directory`: For speed and reliability, we generally recommend
+  that users use a standalone installation of the IEDB software. The pVACtools
+  Docker containers already come with this software pre-installed in the
+  `/opt/iedb` directory.
+- `--allele-specific-binding-thresholds`: When filtering and tiering
+  neoantigen candidates, one main criterion is the predicted peptide-MHC
+  binding affinity. By default, pVACseq uses a cutoff of <500 nmol IC50.
+  However, for some HLA alleles, other cutoffs are more appropriate depending
+  on the distribution of binding affinities across peptides. Setting
+  this flag enables allele-specific binding cutoffs as recommended by
+  [IEDB](https://help.iedb.org/hc/en-us/articles/114094152371-What-thresholds-cut-offs-should-I-use-for-MHC-class-I-and-II-binding-predictions).
+- `--allele-specific-anchors`: When considering a neoantigen candidate, only a
+  subset of peptide positions are presented to the T cell receptor
+  for recognition, while others are responsible for anchoring to the MHC, making
+  these positional considerations critical for predicting T cell responses.
+  Conventionally, the 1st, 2nd, n-1, and n positions in a neoantigen candidate
+  were considered anchors, but recent studies [@Xia2023] have shown that
+  these positions depend on the HLA allele. Setting this flag will use
+  allele-specific anchor locations.
+- `--run-reference-proteome-similarity`: One consideration when selecting
+  neoantigen candidates is that the neoantigen should not occur natively in
+  the patient's proteome. When this flag is set, pVACseq will search for each
+  neoantigen candidate in the reference proteome and report any hits found.
+  By default this is done using BLASTp but we recommend using a proteome FASTA
+  file via the `--peptide-fasta` parameter to speed up this step.
+- `--pass-only`: By default, all variants that were called in the tumor sample
+  are considered by pVACseq. This flag will lead pVACseq to skip variants that
+  have a FILTER applied in the VCF to, e.g., exclude variants that were marked
+  as low quality by the variant caller.
+- `--percentile-threshold`: When considering the peptide-MHC binding affinity
+  for filtering and prioritizing neoantigen candidates, by default only the
+  IC50 value is used. Setting this parameter will additionally filter
+  on the predicted percentile. We recommend a value of 0.01 (1%) for this
+  threshold.
+
+Additionally there are a number of parameters that might be useful depending
+on your specific analysis needs:
+
+- `--class-i-epitope-length` and `--class-ii-epitope-length`: By default 8,
+  9, 10, 11 and 12, 13, 14, 15, 16, 17, 18 are set for these parameters,
+  respectively, but different lengths might be desired.
+- `--tumor-purity`: This parameter is used to bin variants into clonal and
+  sub-clonal. This parameter might need to be adjusted based on the tumor
+  purity of your data.
+- `--problematic-amino-acids`: Some vaccine manufacturers will consider certain amino
+  acids in the neoantigen candidates difficult to manufacture. For example, a
+  Cysteine is commonly considered problematic as it makes the peptide
+  unstable. This parameter allows users to set their own rules as to which
+  peptides are considered problematic and peptides meeting those rules will be marked in the
+  pVACseq results and deprioritized.
+- `--n-threads`: This argument sets the number of threads to use and allows
+  pVACseq to run in multi-processing mode.
+- `--keep-tmp-files`: Setting this flag will save intermediate files created by pVACseq.
+- `--downstream-sequence-length`: For frameshift variants, the downstream
+  sequence can potentially be very long, which can be computationally
+  expensive. This parameter limits how many amino acids of the downstream
+  sequence are included in the prediction.
+
+### pVACseq Command
+
+Given the considerations outlined above, let's run pVACseq on our sample data.
+
+From the `optitype_normal_result.tsv` file we know that the patient's class I
+alleles are HLA-A\*29:02, HLA-B\*45:01, HLA-B\*82:02, and HLA-C\*06:02. We also
+have clinical typing information that confirms these class I alleles and
+identifies DQA1\*03:03, DQB1\*03:02, and DRB1\*04:05 as the patient's class II
+alleles.
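+
+Because each allele name contains a `*`, which the shell can treat as a glob
+character, one cautious way to build the comma-separated allele argument is to
+quote each name and then join them. This is only a sketch; the variable names
+`alleles` and `allele_arg` are illustrative and not part of pVACtools.
+
+```{r, engine = 'bash', eval = FALSE}
+# Quote each allele so the shell never attempts filename expansion on "*".
+alleles=('HLA-A*29:02' 'HLA-B*45:01' 'HLA-B*82:02' 'HLA-C*06:02' 'DQA1*03:03' 'DQB1*03:02' 'DRB1*04:05')
+
+# Join the array with commas (IFS is changed only inside the subshell).
+allele_arg=$(IFS=,; echo "${alleles[*]}")
+echo "$allele_arg"
+```
+
+The resulting string can then be passed, in double quotes, as the third
+positional argument of `pvacseq run`.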
+
+To identify the tumor and normal sample names we will grep the VCF file for
+the CHROM header:
+
+```{r, engine = 'bash', eval = FALSE}
+zgrep CHROM /HCC1395_inputs/annotated.expression.vcf.gz
+```
+
+This shows that the tumor sample is named `HCC1395_TUMOR_DNA` and the normal sample is named `HCC1395_NORMAL_DNA`.
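+
+If you prefer to list only the sample names rather than the whole header line,
+the sample columns can be cut out directly, since the VCF format places the
+samples after nine fixed columns. The sketch below builds a tiny stand-in VCF
+with made-up sample names (`NORMAL_DEMO`, `TUMOR_DEMO`) so that it runs without
+the course data.
+
+```{r, engine = 'bash', eval = FALSE}
+# Create a two-line, gzip-compressed stand-in VCF (contents are illustrative only).
+demo_vcf=$(mktemp)
+printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL_DEMO\tTUMOR_DEMO\n' \
+  | gzip -c > "$demo_vcf"
+
+# Keep the #CHROM line, drop the nine fixed columns, and print one sample per line.
+samples=$(gzip -dc < "$demo_vcf" | grep -m1 '^#CHROM' | cut -f10- | tr '\t' '\n')
+echo "$samples"
+```
+
+Against the course data, the same pipeline would read from
+`/HCC1395_inputs/annotated.expression.vcf.gz` instead of the stand-in file.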
+
+For our test run, please execute the `pvacseq run` command below. The
+prediction run might take a while, but pVACseq will output progress messages as
+it works through the pipeline.
+
+```{r, engine = 'bash', eval = FALSE}
+pvacseq run \
+/HCC1395_inputs/annotated.expression.vcf.gz \
+HCC1395_TUMOR_DNA \
+HLA-A*29:02,HLA-B*45:01,HLA-B*82:02,HLA-C*06:02,DQA1*03:03,DQB1*03:02,DRB1*04:05 \
+all \
+/pVACtools_outputs/pvacseq_predictions \
+--normal-sample-name HCC1395_NORMAL_DNA \
+--phased-proximal-variants-vcf /HCC1395_inputs/phased.vcf.gz \
+--iedb-install-directory /opt/iedb \
+--pass-only \
+--allele-specific-binding-thresholds \
+--percentile-threshold 0.01 \
+--allele-specific-anchors \
+--run-reference-proteome-similarity \
+--peptide-fasta /HCC1395_inputs/Homo_sapiens.GRCh38.pep.all.fa.gz \
+--problematic-amino-acids C \
+--downstream-sequence-length 100 \
+--n-threads 8 \
+--keep-tmp-files
+```
+
+## Running pVACfuse
 
 ## Understanding pVACtools outputs
 
diff --git a/book.bib b/book.bib
@@ -53,6 +53,19 @@ @article{Keskin2018
   journal = {Nature}
 }
 
+@article{Xia2023,
+  doi = {10.1126/sciimmunol.abg2200},
+  url = {https://doi.org/10.1126/sciimmunol.abg2200},
+  year = {2023},
+  month = apr,
+  publisher = {American Association for the Advancement of Science ({AAAS})},
+  volume = {8},
+  number = {82},
+  author = {Huiming Xia and Joshua McMichael and Michelle Becker-Hapak and Onyinyechi C. Onyeador and Rico Buchli and Ethan McClain and Patrick Pence and Suangson Supabphol and Megan M. Richters and Anamika Basu and Cody A. Ramirez and Cristina Puig-Saus and Kelsy C. Cotto and Sharon L. Freshour and Jasreet Hundal and Susanna Kiwala and S. Peter Goedegebuure and Tanner M. Johanns and Gavin P. Dunn and Antoni Ribas and Christopher A. Miller and William E. Gillanders and Todd A. Fehniger and Obi L. Griffith and Malachi Griffith},
+  title = {Computational prediction of {MHC} anchor locations guides neoantigen identification and prioritization},
+  journal = {Science Immunology}
+}
+
 @article{Ott2017,
   doi = {10.1038/nature22991},
   url = {https://doi.org/10.1038/nature22991},
diff --git a/resources/dictionary.txt b/resources/dictionary.txt
@@ -1,16 +1,31 @@
+AGFusion
+Arriba
 AnVIL
 BIPOC
+BLASTp
 Bloomberg
 Bookdown
+bioinformatics
+CHROM
+CLI
 ClinVar
 Coursera
+Cysteine
+clonality
 css
+cytotoxic
+DQA
+DQB
+DRB
 Datatrail
 DataTrail
 Dockerfile
 Dockerhub
 dropdown
+epitope
 epitopes
+Ensembl
+FASTA
 favicon
 frameshift
 fyi
@@ -19,29 +34,55 @@ GenBank
 GH
 GitHub
 Github
+germline
 gnomAD
+griffithlab
+HCC
 HLA
+histocompatibility
 https
+IC
+IEDB
+ITCR
+ITN
+immunotherapies
+immunotherapy
+isoform
 immunogenomics
 impactful
-ITCR
+indels
+inframe
 itcrtraining
-ITN
 json
 junctional
 Leanpub
+MHCnuggets
 Markua
+mRNA
+manufacturability
 mentorship
 mers
+missense
 MHC
 MHCflurry
 NCI
+NHGRI
+NetChop
+NetMHCpan
+NetMHCstabpan
+natively
+nd
 neoantigen
 Neoantigen
 neoantigens
-NHGRI
+nmol
+OptiType
 ottrpal
+PHLAT
+pVACbind
+proteome
 Pandoc
+pre
 proteomics
 pVAC
 pVACfuse
@@ -53,32 +94,17 @@ pVACview
 pVACviz
 RefSeq
 reproducibility
+somatically
 subclonal
+STARFusion
+tbi
+tiering
 tsv
 UE
 UE5
 underserved
-www
-AGfusion
-Arriba
-clonality
-cytotoxic
-Ensembl
-histocompatibility
-IEDB
-immunotherapies
-immunotherapy
-isoform
-manufacturability
-MHCnuggets
-mRNA
-NetChop
-NetMHCpan
-NetMHCstabpan
-OptiType
-PHLAT
-proteome
-pVACbind
-somatically
+VCF
 vaxrank
 VEP
+www
+Wildtype