## Test solubility modifications to peptides

### Preamble
Starting with a list of peptides and proposed amino acid modifications at the N- or C-terminal ends (or automatically selected modifications) aimed at improving the solubility of synthesized long peptides, test whether any strong-binding peptides are created that contain the modified amino acids. Summarize these findings and avoid modified peptides that lead to predicted strong-binding peptides containing these synthetic modifications.

### Local dependencies
The following assumes you have gcloud installed and have authenticated to use the Google Cloud project below.

Set up Google Cloud configurations and make sure the right one is activated:
```bash
export GCS_PROJECT=jlf-rcrf
export GCS_VM_NAME=mg-test-peptide-mods

#list possible configs that are set up
gcloud config configurations list

#activate the rcrf config
gcloud config configurations activate rcrf

#login if needed (only needs to be done once)
gcloud auth login

#view active config/login (should show the correct project "jlf-rcrf", zone, and email address)
gcloud config list

```

Configure these configurations to use a specific zone. Once the config is set up and you have logged in at least once, the config file should look like this:

`cat ~/.config/gcloud/configurations/config_rcrf`

```
[compute]
region = us-central1
zone = us-central1-c
[core]
account = <email address associated with rcrf account>
disable_usage_reporting = True
project = jlf-rcrf
```
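
If the region and zone are not yet set, they can be added to the rcrf configuration directly (a sketch; the configuration name and zone follow the example above, adjust to your setup):

```bash
#set the default region and zone on the rcrf configuration
gcloud config set compute/region us-central1 --configuration=rcrf
gcloud config set compute/zone us-central1-c --configuration=rcrf
```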

### Launching a Google VM to perform the predictions
Launch a GCP instance set up for ad hoc analyses (including Docker):

```bash
gcloud compute instances create $GCS_VM_NAME --service-account=cromwell-server@$GCS_PROJECT.iam.gserviceaccount.com --source-machine-image=jlf-adhoc-v1 --network=cloud-workflows --subnet=cloud-workflows-default --boot-disk-size=250GB --boot-disk-type=pd-ssd --machine-type=e2-standard-8
```

### Log into the GCP instance and check status

```bash
gcloud compute ssh $GCS_VM_NAME

#confirm start up scripts have completed. use <ctrl> <c> to exit
journalctl -u google-startup-scripts -f

#check for expected disk space
df -h

```

### Configure Docker to work for the current user

```bash
sudo usermod -a -G docker $USER
sudo reboot

```

Log out and back in for this change to take effect, then test the Docker install:
```bash
exit

gcloud compute ssh $GCS_VM_NAME

docker run hello-world

```


### Generating the peptide_table.tsv file

The input is a csv file containing names and base sequences. The name does not have to be unique;
for example, the gene name can be used to identify each sequence. The sequences should not contain any modifications.

```
Base Sequence name,Base sequence
CUL9,RMLDYYEEISAGDEGEFRQS
CUL9,RVRMLDYYEEISAGDEGEFRQSN
CUL9,RVRMLDYYEEISAGDEGEFR
EXOC4,SVIRTLSTIDDVEDRENEKGR
EXOC4,ISVIRTLSTIDDVEDRENEKGR
EXOC4,LISVIRTLSTIDDVEDRENEKGR
VPS13B,GLRQGLFRLGISLLGAIAGIVD
VPS13B,GEGLRQGLFRLGISLLGAIAG
VPS13B,SLGEGLRQGLFRLGISLLGAI
DYNC1H1,KRFHATISFDTDTGLKQALET
DYNC1H1,GKRFHATISFDTDTGLKQALET
DYNC1H1,KRFHATISFDTDTGLKQAL
MYO9A,FDWIVFRINHALLNSKVLEHNTK
MYO9A,FDWIVFRINHALLNSKVL
MYO9A,SALFDWIVFRINHALLNSKVLEHN
EPG5,KELPLYLWQPSTSEIAVIRDW
```
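
Before running the modification script, it can help to sanity-check the input csv for non-standard residues. A minimal sketch (demo data written to a temporary file with a deliberately bad row; point the commands at your real csv instead):

```bash
#write a small demo csv (the BAD1 row contains an invalid residue "X")
cat > /tmp/peptides_demo.csv <<'EOF'
Base Sequence name,Base sequence
CUL9,RMLDYYEEISAGDEGEFRQS
BAD1,RMLDXYEEISAGDEGEFRQS
EOF

#flag any sequence containing characters outside the 20 standard amino acids
tail -n +2 /tmp/peptides_demo.csv | awk -F',' '$2 ~ /[^ACDEFGHIKLMNPQRSTVWY]/ {print "non-standard residues: " $0}'
```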

```bash
export HOME=/Users/evelynschmidt/jlf/JLF-100-048/ModifiedPeptides
export HLA_ALLELES=HLA-A*24:02,HLA-A*29:02,HLA-B*14:02,HLA-B*14:02,HLA-C*02:02,HLA-C*08:02
export SAMPLE_NAME=jlf-100-048
```



Pull the Docker image and execute the following command inside the container to produce the peptide_table.tsv used by pVACbind.
The -n argument is the maximum number of modified peptides and the -m argument is the path to the csv file.

```bash
docker pull griffithlab/neoang_scripts:latest

docker run -it -v $HOME/:$HOME/ -v $HOME/.config/gcloud:/root/.config/gcloud --env HOME --env SAMPLE_NAME --env HLA_ALLELES griffithlab/neoang_scripts:latest /bin/bash

cd $HOME

python3 /opt/scripts/modify_peptides.py -n 3 -m *.csv -samp $SAMPLE_NAME -HLA $HLA_ALLELES -WD $HOME
```

For example, if you specify -n 1 then the modified sequences produced will be:
| 128 | + |
| 129 | +```bash |
| 130 | +CUL9.1.n-term-K KRMLDYYEEISAGDEGEFRQS K|RMLDYYEEISAGDEGEFRQS |
| 131 | +CUL9.1.c-term-K RMLDYYEEISAGDEGEFRQSK RMLDYYEEISAGDEGEFRQS|K |
| 132 | +CUL9.1.n-term-R RRMLDYYEEISAGDEGEFRQS R|RMLDYYEEISAGDEGEFRQS |
| 133 | +CUL9.1.c-term-R RMLDYYEEISAGDEGEFRQSR RMLDYYEEISAGDEGEFRQS|R |
| 134 | +. |
| 135 | +. |
| 136 | +. |
| 137 | +EPG5.n-term-K KKELPLYLWQPSTSEIAVIRDW K|KELPLYLWQPSTSEIAVIRDW |
| 138 | +EPG5.c-term-K KELPLYLWQPSTSEIAVIRDWK KELPLYLWQPSTSEIAVIRDW|K |
| 139 | +EPG5.n-term-R RKELPLYLWQPSTSEIAVIRDW R|KELPLYLWQPSTSEIAVIRDW |
| 140 | +EPG5.c-term-R KELPLYLWQPSTSEIAVIRDWR KELPLYLWQPSTSEIAVIRDW|R |
| 141 | +``` |
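
The expansion above can be illustrated with a small shell sketch (an illustration only, assuming K and R are the candidate terminal tags; the real modify_peptides.py may differ in naming and options):

```bash
#expand one base peptide with K/R tags at each terminus
NAME=EPG5
SEQ=KELPLYLWQPSTSEIAVIRDW
for AA in K R; do
  printf '%s\t%s\t%s\n' "${NAME}.n-term-${AA}" "${AA}${SEQ}" "${AA}|${SEQ}"
  printf '%s\t%s\t%s\n' "${NAME}.c-term-${AA}" "${SEQ}${AA}" "${SEQ}|${AA}"
done
```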

### Enter a pVACtools docker environment to run pVACbind on the sub-peptide sequences containing modified AAs

```bash
docker pull griffithlab/pvactools:4.0.5
docker run -it -v $HOME/:$HOME/ --env HOME --env SAMPLE_NAME --env HLA_ALLELES griffithlab/pvactools:4.0.5 /bin/bash
cd $HOME

for LENGTH in 8 9 10 11
do
  #process n-term fasta for this length
  echo "Running pVACbind for length: $LENGTH (n-term sequences)"
  export LENGTH_FASTA=$HOME/n-term/pvacbind_inputs/${LENGTH}-mer-test.fa
  export LENGTH_RESULT_DIR=$HOME/n-term/pvacbind_results/${LENGTH}-mer-test
  mkdir -p $LENGTH_RESULT_DIR #make sure the result dir exists before redirecting logs into it
  pvacbind run $LENGTH_FASTA $SAMPLE_NAME $HLA_ALLELES all_class_i $LENGTH_RESULT_DIR -e1 $LENGTH --n-threads 8 --iedb-install-directory /opt/iedb/ 1>$LENGTH_RESULT_DIR/stdout.txt 2>$LENGTH_RESULT_DIR/stderr.txt

  #process c-term fasta for this length
  echo "Running pVACbind for length: $LENGTH (c-term sequences)"
  export LENGTH_FASTA=$HOME/c-term/pvacbind_inputs/${LENGTH}-mer-test.fa
  export LENGTH_RESULT_DIR=$HOME/c-term/pvacbind_results/${LENGTH}-mer-test
  mkdir -p $LENGTH_RESULT_DIR #make sure the result dir exists before redirecting logs into it
  pvacbind run $LENGTH_FASTA $SAMPLE_NAME $HLA_ALLELES all_class_i $LENGTH_RESULT_DIR -e1 $LENGTH --n-threads 8 --iedb-install-directory /opt/iedb/ 1>$LENGTH_RESULT_DIR/stdout.txt 2>$LENGTH_RESULT_DIR/stderr.txt
done

```


To check for successful completion of all jobs, inspect the stdout logs that have been saved. There should be 8 successful jobs total: 4 lengths for n-term modified peptides and 4 lengths for c-term.

```bash
grep "Pipeline finished" */pvacbind_results/*/stdout.txt | wc -l

# leave docker
exit
```

### Combine all the pVACbind results into a single file
Create a combined TSV file by concatenating all the individual "all_epitopes.tsv" files and avoiding redundant headers. Store this file locally (or in a cloud bucket) so that it can be accessed after the VM is destroyed.

```bash
#get the header line
grep -h "^Mutation" --color=never */pvacbind_results/*/MHC_Class_I/${SAMPLE_NAME}.all_epitopes.tsv | sort | uniq > header.tsv

#combine the results from all prediction runs and add the header on
cat */pvacbind_results/*/MHC_Class_I/${SAMPLE_NAME}.all_epitopes.tsv | grep -v "^Mutation" | cat header.tsv - > ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv

```

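An equivalent single-pass alternative uses awk, where `FNR==1` matches each file's header line and only the first one is kept. A sketch on demo files (substitute the real glob of all_epitopes.tsv paths):

```bash
#demo files, each with its own header line
mkdir -p /tmp/combine_demo
printf 'Mutation\tScore\nA\t1\n' > /tmp/combine_demo/run1.tsv
printf 'Mutation\tScore\nB\t2\n' > /tmp/combine_demo/run2.tsv

#keep the first header, skip repeated headers from subsequent files
awk 'FNR==1 && NR!=1 {next} {print}' /tmp/combine_demo/run*.tsv > /tmp/combine_demo/combined.tsv
cat /tmp/combine_demo/combined.tsv
```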
### Evaluate the proposed modified peptide sequences
The goal of this analysis is to test whether any strong-binding peptides are created that include the modified amino acids added to improve solubility. For example, one could require that no such peptides exist where the median binding affinity is < 500 nM OR the median binding score percentile is < 1%.

For each candidate modified peptide sequence, summarize the number of such potentially problematic peptides.

```bash

#pull out all the rows that correspond to strong binders according to default criteria (<500 nM affinity OR <1 percentile score)
cut -f 1,2,4,5,8 ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv | perl -ne 'chomp; @l=split("\t",$_); $median_affinity=$l[3]; $median_percentile=$l[4]; if ($median_affinity < 500 || $median_percentile < 1){print "$_\n"}' > ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv

#summarize the number of problematic results for each unique candidate peptide
cat ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv | grep -v "^Mutation" | cut -f 1 | sort | uniq -c | sed 's/^[ ]*//' | tr " " "\t" | awk 'BEGIN {FS="\t"; OFS="\t"} {print $2, $1}' > ${SAMPLE_NAME}.problematic.summary.tsv

#create a list of all unique peptide names for modified peptides to be summarized
cut -f 1 ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv | grep -v "^Mutation" | sort | uniq > peptide_name_list.tsv

#create an output table with a count of problematic binders for all peptides (include 0 if that is the case)
join -t $'\t' -a 1 -a 2 -e'0' -o '0,2.2' peptide_name_list.tsv ${SAMPLE_NAME}.problematic.summary.tsv > ${SAMPLE_NAME}.problematic.summary.complete.tsv

```

### Retrieve final result files to local system

Files to be kept:

- ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv
- ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv
- ${SAMPLE_NAME}.problematic.summary.complete.tsv

```bash
#leave the GCP VM
exit

#re-export the sample name locally (it was previously set inside the VM session)
export SAMPLE_NAME="jlf-100-048"

mkdir ${SAMPLE_NAME}_modified_peptide_results
cd ${SAMPLE_NAME}_modified_peptide_results

gcloud compute scp $USER@$GCS_VM_NAME:${SAMPLE_NAME}.all_epitopes.all_modifications.tsv ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv

gcloud compute scp $USER@$GCS_VM_NAME:${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv

gcloud compute scp $USER@$GCS_VM_NAME:${SAMPLE_NAME}.problematic.summary.complete.tsv ${SAMPLE_NAME}.problematic.summary.complete.tsv


```

### Once the analysis is done and results retrieved, destroy the Google VM on GCP to avoid wasting resources

```bash

gcloud compute instances delete $GCS_VM_NAME

```

### Final report generation and interpretation
Use the information in `${SAMPLE_NAME}.all_epitopes.all_modifications.tsv` and `${SAMPLE_NAME}.problematic.summary.complete.tsv` to produce summary spreadsheets.