-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathREADME
More file actions
175 lines (140 loc) · 6.47 KB
/
README
File metadata and controls
175 lines (140 loc) · 6.47 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
Proteins are associated with other proteins, somehow.
STRING-DB (https://string-db.org) provides
a free, public database of functional associations
between proteins (for many species including human).
The data is of the form:
A B 123
where A and B are proteins and 123 is the confidence
that they are functionally associated. Some of these
associations are transitive and some are not.
For humans (the 9606 files in string-db.org) there are
11353057 associations (not unique proteins but pairs of
associated proteins).
This program reads a STRING-DB file, with the format
A B 123
...
and then takes the names of two proteins (the 'from' and
the 'to') and a range of the number of hops.
It then proceeds to search the database as a Graph
and report all associations which start with 'from',
end with 'to' and have as many intermediates as the
number of hops specified.
For example, running the command
graphos_enquiry_for_STRING --input /data/DATASETS/STRING_DB/9606.protein.links.v10.5.txt --ifs ' ' --skip_header_lines 1 --max_num_hops 2 --search '9606.ENSP00000000233,9606.ENSP00000301744' --output out
will report all associations between 9606.ENSP00000000233 and
9606.ENSP00000301744 with a maximum of two hops between them,
i.e.
9606.ENSP00000000233 9606.ENSP00000301744 (the direct one if it exists)
9606.ENSP00000000233 A 9606.ENSP00000301744 (1st degree, 1 intermediary protein 'A')
9606.ENSP00000000233 A B 9606.ENSP00000301744 (2nd degree, 2 intermediaries 'A' and 'B')
This is version 4.2 and it is a preliminary version
provided as a proof-of-concept for my idea.
Version 11.0 does many many more things.
It does search. Very well.
But it also builds a network of
interactions with the protein intermediaries too.
It clusters this network either wrt to function or
with say expression if specified.
It then assesses the statistical
significance of the clustering using monte carlo
simulation and t-test.
V11 can be provided upon request and subsequent interviewing.
If successful, the candidate user will be provided with V11
and supported.
# download:
I suggest you download the tarball from git repository at:
https://github.com/hadjiprocopis/protein-hops/blob/master/grapher-4.2.tar.gz
# compile and install:
At a unix (OSX or Linux, please do not ask me question about windows)
command prompt do:
```
tar xvzf grapher-4.2.tar.gz
cd grapher-4.2
./configure --disable-debug
```
If configure likes your system, finds all
dependancies and is in a good day,
you may proceed to compilation:
```
make clean && make all
```
Then probably you have the executable at
```
src/C/graphos_enquiry_for_STRING
```
# running the program:
Firstly you must download a STRING-DB protein
associations file, for example the one for
human species proteins at:
```
http://stringdb-static.org/download/protein.links.v10.5/9606.protein.links.v10.5.txt.gz
```
Note that I have changed 'https://...' to 'http://...'
Using https for public protein data is one way to destroy the world.
Then you need to uncompress it and store it somewhere,
let's say at the current dir
```
gunzip 9606.protein.links.v10.5.txt.gz
```
which will yield the file
```
9606.protein.links.v10.5.txt
```
in the current directory.
Now you are ready to run the grapher.
```
graphos_enquiry_for_STRING --input 9606.protein.links.v10.5.txt --ifs ' ' --skip_header_lines 1 --max_num_hops 1 --search '9606.ENSP00000000233,9606.ENSP00000301744' --output out.hops1
```
This will
produce the following output:
```
graphos_enquiry_for_STRING : reading edges from file '/data/DATASETS/STRING_DB/9606.protein.links.v10.5.txt' ...
read_file_in_memory : read 522240609 bytes from file '/data/DATASETS/STRING_DB/9606.protein.links.v10.5.txt' in memory.
done, read 11353057 lines (including header & empty), user: 5.170934, system: 0.276579 (5.170934, 0.278411)
graphos_enquiry_for_STRING : 1 of 1) recursive search 9606.ENSP00000000233 -> 9606.ENSP00000301744, min num hops: 1, max num hops: 1
graphos_enquiry_for_STRING : 1 of 1) found 818 paths with criteria specified, user: 0.509191, system: 0.000000 (5.680152, 0.278413)
graphos_enquiry_for_STRING : doing calculations on the paths ... user: 0.000075, system: 0.000000 (5.680241, 0.278413)
graphos_enquiry_for_STRING : 1 of 1) saving paths to files (out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.*) ...
1 hops yielded 818 paths, output file 'out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.1hops.txt'.
1 hops yielded 1636 edges, output file 'out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.edgelist.1hops.txt'.
all hops yielded 1636 edges, output file 'out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.edgelist.allhops.txt'.
1 hops yielded 820 nodes, output file 'out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.nodelist.1hops.txt'.
all hops yielded 820 nodes, output file 'out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.nodelist.allhops.txt'.
user: 0.003015, system: 0.000000 (5.683262, 0.278413)
graphos_enquiry_for_STRING : done, 0 (0.0%) searches failed, 1 (100.0%) searches succeeded, of a total of 1.
user: 6.039501, system: 0.276642 (6.039501, 0.278413)
```
Which says that it finished in about 6 seconds and the output is in the
current directory prefixed with 'out.hops1.'
The file
```
out.hops1.9606.ENSP00000000233.9606.ENSP00000301744.1hops.txt
```
contains entries like:
```
9606.ENSP00000000233() () 9606.ENSP00000358867() () 9606.ENSP00000301744() (1)
```
which, as you can imagine is the association of the 'from' protein
(at the left) with the 'to' protein (at the right) and all
the intermediate proteins in the middle.
If you have a fast computer and a lot of memory then you can try
the 2nd degree associations, i.e. with 2 intermediaries (plus 1 'from'
and 1 'to' makes a chain of 4 proteins in total).
It will take exponentially more time to complete.
In my computer it took 16 minutes instead of
6 seconds.
Sometimes the program may run out of memory. So try more hops than
1 if you have confidence in your hardware. Otherwise, if all memory is
exhausted, the computer will become very slow and it may not be
easy to save any work you have been doing.
So, before running the 2-hops, save all your work and be ready
to reboot your computer if it becomes too slow (usually having access
to a terminal will be sufficient to kill the program using:
```
killall graphos_enquiry_for_STRING
```
author (idea and implementation): Andreas Hadjiprocopis
contact: andreashad2@gmail.com
when: at the time I was working at the Institute of Cancer Research.
Although the idea and implementation are mine, I used
ICR's resources.