## Test solubility modifications to peptides

### Preamble
Starting with a list of peptides and proposed amino acid modifications on the N- or C-terminal ends (or automatically selected modifications), aimed at improving the solubility of synthesized long peptides, test for the creation of strong-binding peptides containing the modified amino acids. Summarize these findings and avoid using modified peptides that lead to predicted strong-binding peptides containing these synthetic modifications.

### Local dependencies

The following assumes you have gcloud installed and have authenticated to use the Google Cloud project below.

Set up Google Cloud configurations and make sure the right one is activated:
```bash
export GCS_PROJECT=jlf-rcrf
export GCS_VM_NAME=mg-test-peptide-mods

# list possible configs that are set up
gcloud config configurations list

# activate the rcrf config
gcloud config configurations activate rcrf

# login if needed (only needs to be done once)
gcloud auth login

# view active config/login (should show the correct project "jlf-rcrf", zone, and email address)
gcloud config list
```
Configure these configurations to use a specific zone. Once the config is set up and you have logged in at least once, the config file should look like this:

`cat ~/.config/gcloud/configurations/config_rcrf`
```
[compute]
region = us-central1
zone = us-central1-c
[core]
account = <email address associated with rcrf account>
disable_usage_reporting = True
project = jlf-rcrf
```
### Launching a Google VM to perform the predictions

Launch a GCP instance set up for ad hoc analyses (including Docker):
```bash
gcloud compute instances create $GCS_VM_NAME --service-account=cromwell-server@$GCS_PROJECT.iam.gserviceaccount.com --source-machine-image=jlf-adhoc-v1 --network=cloud-workflows --subnet=cloud-workflows-default --boot-disk-size=250GB --boot-disk-type=pd-ssd --machine-type=e2-standard-8
```
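Optionally, confirm the instance was created and is running before logging in (a quick check using the VM name set earlier):

```bash
# confirm the new instance exists and reports RUNNING status
gcloud compute instances list --filter="name=$GCS_VM_NAME"
```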
### Log into the GCP instance and check status
```bash
gcloud compute ssh $GCS_VM_NAME

# confirm startup scripts have completed; use <ctrl> <c> to exit
journalctl -u google-startup-scripts -f

# check for expected disk space
df -h
```
### Configure Docker to work for the current user
```bash
sudo usermod -a -G docker $USER
sudo reboot
```

Log out and log back in for this change to take effect, then test the Docker install:
```bash
exit

gcloud compute ssh $GCS_VM_NAME

docker run hello-world
```
## Generating the peptide_table.tsv file

The input is a CSV file containing the names and base sequences. The name does not have to be unique; for example, the gene name can be used to identify each sequence. The sequences should not contain any modifications (a quick format check is sketched after the example below).
```
Base Sequence name,Base sequence
CUL9,RMLDYYEEISAGDEGEFRQS
CUL9,RVRMLDYYEEISAGDEGEFRQSN
CUL9,RVRMLDYYEEISAGDEGEFR
EXOC4,SVIRTLSTIDDVEDRENEKGR
EXOC4,ISVIRTLSTIDDVEDRENEKGR
EXOC4,LISVIRTLSTIDDVEDRENEKGR
VPS13B,GLRQGLFRLGISLLGAIAGIVD
VPS13B,GEGLRQGLFRLGISLLGAIAG
VPS13B,SLGEGLRQGLFRLGISLLGAI
DYNC1H1,KRFHATISFDTDTGLKQALET
DYNC1H1,GKRFHATISFDTDTGLKQALET
DYNC1H1,KRFHATISFDTDTGLKQAL
MYO9A,FDWIVFRINHALLNSKVLEHNTK
MYO9A,FDWIVFRINHALLNSKVL
MYO9A,SALFDWIVFRINHALLNSKVLEHN
EPG5,KELPLYLWQPSTSEIAVIRDW
```
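Before running the modification script, a quick format check can catch malformed rows. A minimal sketch, assuming the input is saved as peptides.csv (the filename is illustrative; any CSV passed to the -m argument below works):

```bash
# flag rows (after the header) that do not have exactly two fields,
# or whose sequence contains characters outside the 20 standard amino acids
# (peptides.csv is a hypothetical filename)
awk -F',' 'NR > 1 && (NF != 2 || $2 !~ /^[ACDEFGHIKLMNPQRSTVWY]+$/) {print "line " NR ": " $0}' peptides.csv
```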
```bash
export HOME=/Users/evelynschmidt/jlf/JLF-100-048/ModifiedPeptides
export HLA_ALLELES=HLA-A*24:02,HLA-A*29:02,HLA-B*14:02,HLA-B*14:02,HLA-C*02:02,HLA-C*08:02
export SAMPLE_NAME=jlf-100-048
```
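These variables are forwarded into the containers below via --env, so it is worth confirming they are set in the current shell first:

```bash
# verify the variables that will be passed into Docker with --env
echo "$SAMPLE_NAME"
echo "$HLA_ALLELES"
```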
Pulling the Docker image and executing the command below will produce the peptide_table.tsv file used by pVACbind. The -n argument is the maximum number of modified peptides and the -m argument is the path to the CSV file.
```bash
docker pull griffithlab/neoang_scripts:latest

docker run -it -v $HOME/:$HOME/ -v $HOME/.config/gcloud:/root/.config/gcloud --env HOME --env SAMPLE_NAME --env HLA_ALLELES griffithlab/neoang_scripts:latest /bin/bash

cd $HOME

python3 /opt/scripts/modify_peptides.py -n 3 -m *.csv -samp $SAMPLE_NAME -HLA $HLA_ALLELES -WD $HOME
```
For example, if you specify -n 1 then the modified sequences produced will be:
```
CUL9.1.n-term-K KRMLDYYEEISAGDEGEFRQS K|RMLDYYEEISAGDEGEFRQS
CUL9.1.c-term-K RMLDYYEEISAGDEGEFRQSK RMLDYYEEISAGDEGEFRQS|K
CUL9.1.n-term-R RRMLDYYEEISAGDEGEFRQS R|RMLDYYEEISAGDEGEFRQS
CUL9.1.c-term-R RMLDYYEEISAGDEGEFRQSR RMLDYYEEISAGDEGEFRQS|R
.
.
.
EPG5.n-term-K KKELPLYLWQPSTSEIAVIRDW K|KELPLYLWQPSTSEIAVIRDW
EPG5.c-term-K KELPLYLWQPSTSEIAVIRDWK KELPLYLWQPSTSEIAVIRDW|K
EPG5.n-term-R RKELPLYLWQPSTSEIAVIRDW R|KELPLYLWQPSTSEIAVIRDW
EPG5.c-term-R KELPLYLWQPSTSEIAVIRDWR KELPLYLWQPSTSEIAVIRDW|R
```
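To see how many modified sequences were generated per gene, the name column of the resulting table can be tallied. A minimal sketch, assuming peptide_table.tsv has the modified peptide name (e.g. CUL9.1.n-term-K) in column 1:

```bash
# count generated modified sequences per gene name prefix
# (assumes names like CUL9.1.n-term-K in column 1 of peptide_table.tsv)
cut -f 1 peptide_table.tsv | cut -d '.' -f 1 | sort | uniq -c
```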
### Enter a pVACtools Docker environment to run pVACbind on the sub-peptide sequences containing modified AAs
```bash
docker pull griffithlab/pvactools:4.0.5
docker run -it -v $HOME/:$HOME/ --env HOME --env SAMPLE_NAME --env HLA_ALLELES griffithlab/pvactools:4.0.5 /bin/bash
cd $HOME

for LENGTH in 8 9 10 11
do
  # process n-term fasta for this length
  echo "Running pVACbind for length: $LENGTH (n-term sequences)"
  export LENGTH_FASTA=$HOME/n-term/pvacbind_inputs/${LENGTH}-mer-test.fa
  export LENGTH_RESULT_DIR=$HOME/n-term/pvacbind_results/${LENGTH}-mer-test
  mkdir -p $LENGTH_RESULT_DIR  # ensure the log destination exists before redirecting output into it
  pvacbind run $LENGTH_FASTA $SAMPLE_NAME $HLA_ALLELES all_class_i $LENGTH_RESULT_DIR -e1 $LENGTH --n-threads 8 --iedb-install-directory /opt/iedb/ 1>$LENGTH_RESULT_DIR/stdout.txt 2>$LENGTH_RESULT_DIR/stderr.txt

  # process c-term fasta for this length
  echo "Running pVACbind for length: $LENGTH (c-term sequences)"
  export LENGTH_FASTA=$HOME/c-term/pvacbind_inputs/${LENGTH}-mer-test.fa
  export LENGTH_RESULT_DIR=$HOME/c-term/pvacbind_results/${LENGTH}-mer-test
  mkdir -p $LENGTH_RESULT_DIR  # ensure the log destination exists before redirecting output into it
  pvacbind run $LENGTH_FASTA $SAMPLE_NAME $HLA_ALLELES all_class_i $LENGTH_RESULT_DIR -e1 $LENGTH --n-threads 8 --iedb-install-directory /opt/iedb/ 1>$LENGTH_RESULT_DIR/stdout.txt 2>$LENGTH_RESULT_DIR/stderr.txt
done
```
To check for successful completion of all jobs, check the stdout logs that have been saved. There should be 8 successful jobs in total: 4 lengths for n-term modified peptides and 4 lengths for c-term.
```bash
grep "Pipeline finished" */pvacbind_results/*/stdout.txt | wc -l

# leave docker
exit
```
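If the count is less than 8, `grep -L` will list the log files that are missing the completion message, identifying which runs need to be re-examined:

```bash
# list stdout logs that do NOT contain the completion message
grep -L "Pipeline finished" */pvacbind_results/*/stdout.txt
```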
### Combine all the pVACbind results into a single file

Create a combined TSV file by concatenating all the individual "all_epitopes.tsv" files while avoiding redundant headers. Store this file locally (or in a cloud bucket) so that it can be accessed after the VM is destroyed.
```bash
# get the header line
grep -h "^Mutation" --color=never */pvacbind_results/*/MHC_Class_I/${SAMPLE_NAME}.all_epitopes.tsv | sort | uniq > header.tsv

# combine the results from all prediction runs and add the header on
cat */pvacbind_results/*/MHC_Class_I/${SAMPLE_NAME}.all_epitopes.tsv | grep -v "^Mutation" | cat header.tsv - > ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv
```
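As a sanity check, the combined file should contain exactly one header line plus all of the data lines from the individual runs; the first count below should be one more than the second:

```bash
# total lines in the combined file (data lines + 1 header)
wc -l ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv

# total data lines across all individual result files
cat */pvacbind_results/*/MHC_Class_I/${SAMPLE_NAME}.all_epitopes.tsv | grep -v "^Mutation" | wc -l
```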
### Evaluate the proposed modified peptide sequences

The goal of this analysis is to test whether any strong-binding peptides are created that include the modified amino acids introduced to improve solubility. For example, one could require that no such peptides exist where the median binding affinity is < 500 nM OR the median binding score percentile is < 1%.

For each candidate modified peptide sequence, summarize the number of such potentially problematic peptides.
```bash
# pull out all the rows that correspond to strong binders according to default criteria (< 500 nM affinity OR < 1 percentile score)
cut -f 1,2,4,5,8 ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv | perl -ne 'chomp; @l=split("\t",$_); $median_affinity=$l[3]; $median_percentile=$l[4]; if ($median_affinity < 500 || $median_percentile < 1){print "$_\n"}' > ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv

# summarize the number of problematic results for each unique candidate proposed peptide
cat ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv | grep -v "^Mutation" | cut -f 1 | sort | uniq -c | sed 's/^[ ]*//' | tr " " "\t" | awk 'BEGIN {FS="\t"; OFS="\t"} {print $2, $1}' > ${SAMPLE_NAME}.problematic.summary.tsv

# create a list of all unique peptide names for modified peptides to be summarized
cut -f 1 ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv | grep -v "^Mutation" | sort | uniq > peptide_name_list.tsv

# create an output table with a count of problematic binders for all peptides (include 0 if that is the case)
join -t $'\t' -a 1 -a 2 -e'0' -o '0,2.2' peptide_name_list.tsv ${SAMPLE_NAME}.problematic.summary.tsv > ${SAMPLE_NAME}.problematic.summary.complete.tsv
```
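To see which proposed modifications are most problematic at a glance, sort the completed summary by its count column:

```bash
# rank modified peptides by their count of problematic strong binders (column 2)
sort -t $'\t' -k2,2nr ${SAMPLE_NAME}.problematic.summary.complete.tsv | head -20
```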
### Retrieve final result files to the local system

Files to be kept:

- ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv
- ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv
- ${SAMPLE_NAME}.problematic.summary.complete.tsv
```bash
# leave the GCP VM
exit

export SAMPLE_NAME="jlf-100-048"

mkdir ${SAMPLE_NAME}_modified_peptide_results
cd ${SAMPLE_NAME}_modified_peptide_results

gcloud compute scp $USER@$GCS_VM_NAME:${SAMPLE_NAME}.all_epitopes.all_modifications.tsv ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv

gcloud compute scp $USER@$GCS_VM_NAME:${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv ${SAMPLE_NAME}.all_epitopes.all_modifications.problematic.tsv

gcloud compute scp $USER@$GCS_VM_NAME:${SAMPLE_NAME}.problematic.summary.complete.tsv ${SAMPLE_NAME}.problematic.summary.complete.tsv
```
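Before destroying the VM, confirm the retrieved copies are intact (non-empty and carrying the expected header):

```bash
# each file should be non-empty and the combined file should start with the header line
wc -l ${SAMPLE_NAME}*.tsv
head -1 ${SAMPLE_NAME}.all_epitopes.all_modifications.tsv
```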
### Once the analysis is done and results retrieved, destroy the Google VM on GCP to avoid wasting resources
```bash
gcloud compute instances delete $GCS_VM_NAME
```
### Final report generation and interpretation

Use the information in `${SAMPLE_NAME}.all_epitopes.all_modifications.tsv` and `${SAMPLE_NAME}.problematic.summary.complete.tsv` to produce summary spreadsheets.
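As a starting point, each candidate can be flagged for review based on its problematic-binder count; a minimal sketch (the OK/REVIEW labels and the output filename are illustrative):

```bash
# flag candidates with zero problematic strong binders as OK, all others for review
awk 'BEGIN {FS=OFS="\t"} {print $1, $2, ($2 == 0 ? "OK" : "REVIEW")}' \
  ${SAMPLE_NAME}.problematic.summary.complete.tsv > ${SAMPLE_NAME}.modification_review.tsv
```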
