Commit d83e766

Merge pull request #23 from 3D-e-Chem/similarity
Renamed distance to similarity
2 parents b8666b3 + 8cb43e3 commit d83e766

22 files changed: 372 additions & 371 deletions

CHANGELOG.md

Lines changed: 31 additions & 25 deletions
@@ -1,82 +1,88 @@
11
# Change log
2+
All notable changes to this project will be documented in this file.
3+
This project adheres to [Semantic Versioning](http://semver.org/).
4+
Formatted as described on http://keepachangelog.com/.
25

36
## Unreleased
47

8+
## [2.0.0] - 2016-07-14
9+
510
### Changed
611

7-
* Flag to ignore upper triangle when calculating distances, instead of always ignore (#20)
12+
- Renamed distance to similarity (#21)
13+
- Flag to ignore upper triangle when calculating distances, instead of always ignore (#20)
814

9-
## 1.4.2 - 3 June 2016
15+
## [1.4.2] - 2016-06-03
1016

1117
### Changed
1218

13-
* Lower webservice cutoff to 0.45 (#18)
19+
- Lower webservice cutoff to 0.45 (#18)
1420

15-
## 1.4.1 - 31 May 2016
21+
## [1.4.1] - 2016-05-31
1622

1723
### Added
1824

19-
* Webservice online at http://3d-e-chem.vu-compmedchem.nl/kripodb/ui/
20-
* Ignore_upper triangle option in distance import sub command
25+
- Webservice online at http://3d-e-chem.vu-compmedchem.nl/kripodb/ui/
26+
- Ignore_upper triangle option in distance import sub command
2127

22-
## 1.4.0 - 3 May 2016
28+
## [1.4.0] - 2016-05-03
2329

2430
### Changed
2531

26-
* Using nested sub-commands instead of long sub-command. For example `kripodb distmatrix_import` now is `kripodb distances import`
32+
- Using nested sub-commands instead of long sub-command. For example `kripodb distmatrix_import` now is `kripodb distances import`
2733

2834
### Added
2935

30-
* Faster distance matrix storage format
31-
* Python3 support (#12)
32-
* Automated build to docker hub.
36+
- Faster distance matrix storage format
37+
- Python3 support (#12)
38+
- Automated build to docker hub.
3339

3440
### Removed
3541

36-
* CLI argument `--precision`
42+
- CLI argument `--precision`
3743

38-
## 1.3.0 - 23 Apr 2016
44+
## [1.3.0] - 2016-04-23
3945

4046
### Added
4147

42-
* webservice server/client for distance matrix (#16). The CLI and canned commands can now take a local file or a url.
48+
- webservice server/client for distance matrix (#16). The CLI and canned commands can now take a local file or a url.
4349

4450
### Fixed
4551

46-
* het_seq_nr contains non-numbers (#15)
52+
- het_seq_nr contains non-numbers (#15)
4753

48-
## 1.2.5 - 24 Mar 2016
54+
## [1.2.5] - 2016-03-24
4955

5056
### Fixed
5157

52-
* fpneigh2tsv not available as sub command
58+
- fpneigh2tsv not available as sub command
5359

54-
## 1.2.4 - 24 Mar 2016
60+
## [1.2.4] - 2016-03-24
5561

5662
### Added
5763

58-
* Sub command to convert fpneight distance file to tsv.
64+
- Sub command to convert fpneight distance file to tsv.
5965

60-
## 1.2.3 - 1 Mar 2016
66+
## [1.2.3] - 2016-03-01
6167

6268
### Changed
6369

64-
* Converting distances matrix will load id2label lookup into memory to speed up conversion
70+
- Converting distances matrix will load id2label lookup into memory to speed up conversion
6571

66-
## 1.2.2 - 22 Feb 2016
72+
## [1.2.2] - 2016-02-22
6773

6874
### Added
6975

7076
- Added sub command to read fpneigh formatted distance matrix file (#14)
7177

72-
## 1.2.1 - 12 Feb 2016
78+
## [1.2.1] - 2016-02-12
7379

7480
### Added
7581

7682
- Added sub commands to read/write distance matrix in tab delimited format (#13)
7783
- Created repo for Knime example and plugin at https://github.com/3D-e-Chem/knime-kripodb (#8)
7884

79-
## 1.2.0 - 11 Feb 2016
85+
## [1.2.0] - 2016-02-11
8086

8187
### Added
8288

@@ -89,7 +95,7 @@
8995
- Merging of distance matrix files more robust (#10)
9096
- Tanimoto coefficient is rounded up (#7)
9197

92-
## 1.0.0 - 5 Feb 2016
98+
## [1.0.0] - 2016-02-05
9399

94100
### Added
95101
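
The header changes in this file follow one pattern: `## 1.4.2 - 3 June 2016` becomes `## [1.4.2] - 2016-06-03`. A minimal sketch of that conversion is given below. The function name `convert_header` and the regular expression are illustrative assumptions, not part of the commit, and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches old-style headers such as "## 1.4.2 - 3 June 2016".
# Hypothetical helper pattern, not taken from the repository.
OLD_HEADER = re.compile(r"^## (\d+\.\d+\.\d+) - (\d{1,2} [A-Za-z]+ \d{4})$")


def convert_header(line):
    """Return the header as "## [x.y.z] - YYYY-MM-DD", or the line unchanged."""
    match = OLD_HEADER.match(line)
    if match is None:
        return line
    version, raw_date = match.groups()
    # The old headers mix abbreviated ("Apr") and full ("June") month names,
    # so both formats are tried in turn.
    for fmt in ("%d %b %Y", "%d %B %Y"):
        try:
            parsed = datetime.strptime(raw_date, fmt)
        except ValueError:
            continue
        return "## [{}] - {}".format(version, parsed.date().isoformat())
    return line


if __name__ == "__main__":
    print(convert_header("## 1.4.2 - 3 June 2016"))
```

Lines that are not old-style version headers, such as `### Changed` or headers already in the new form, pass through untouched.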

README.md

Lines changed: 24 additions & 24 deletions
@@ -17,14 +17,14 @@ KRIPO stands for Key Representation of Interaction in POckets, see [reference](h
1717
* Subpocket, part of the protein pocket which binds with the fragment
1818
* Fingerprint, fingerprint of structure-based pharmacophore of subpocket
1919
* Similarity matrix, similarities between all fingerprint pairs calculated using the modified tanimoto similarity index
20-
* Kripo identifier, used as identifier for fragment, subpocket and fingerprint
20+
* Kripo fragment identifier, used as identifier for fragment, subpocket and fingerprint
2121

2222
# Install
2323

2424
Requirements:
2525

2626
* rdkit, http://rdkit.org, to read SDF files and generate smile strings from molecules
27-
* libhdf5 headers, to read/write distance matrix in hdf5 format
27+
* libhdf5 headers, to read/write similarity matrix in hdf5 format
2828

2929
```
3030
pip install -U setuptools
@@ -48,42 +48,42 @@ kripodb fragments sdf fragment??.sdf fragments.sqlite
4848
kripodb fragments pdb fragments.sqlite
4949
kripodb fingerprints import 01.fp 01.fp.db
5050
kripodb fingerprints import 02.fp 02.fp.db
51-
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 01.fp.db 01.fp.db dist_01_01.h5
52-
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 02.fp.db 02.fp.db dist_02_02.h5
53-
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite 01.fp.db 02.fp.db dist_01_02.h5
54-
kripodb distances merge dist_*_*.h5 dist_all.h5
55-
kripodb distances freeze dist_all.h5 dist_all.frozen.h5
56-
# Make froze distance matrix smaller, by using slower compression
57-
ptrepack --complevel 6 --complib blosc:zlib dist_all.frozen.h5 dist_all.packedfrozen.h5
58-
rm dist_all.frozen.h5
59-
kripodb distances serve dist_all.packedfrozen.h5
51+
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 01.fp.db 01.fp.db sim_01_01.h5
52+
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite --ignore_upper_triangle 02.fp.db 02.fp.db sim_02_02.h5
53+
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite 01.fp.db 02.fp.db sim_01_02.h5
54+
kripodb similarities merge sim_*_*.h5 sim_all.h5
55+
kripodb similarities freeze sim_all.h5 sim_all.frozen.h5
56+
# Make frozen similarity matrix smaller, by using slower compression
57+
ptrepack --complevel 6 --complib blosc:zlib sim_all.frozen.h5 sim_all.packedfrozen.h5
58+
rm sim_all.frozen.h5
59+
kripodb similarities serve sim_all.packedfrozen.h5
6060
```
6161

6262
## Search for most similar fragments
6363

6464
Command to find fragments most similar to `3kxm_K74_frag1` fragment.
6565
```
66-
kripodb similar dist_all.h5 3kxm_K74_frag1 --cutoff 0.45
66+
kripodb similar sim_all.h5 3kxm_K74_frag1 --cutoff 0.45
6767
```
6868

69-
## Create distance matrix from text files
69+
## Create similarity matrix from text files
7070

71-
Input files `dist_??_??.txt.gz` looks like:
71+
Input files `sim_??_??.txt.gz` looks like:
7272
```
7373
Compounds similar to 2xry_FAD_frag4:
7474
2xry_FAD_frag4 1.0000
7575
3cvv_FAD_frag3 0.5600
7676
```
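
The text format above consists of a header line naming the query fragment, followed by one `hit_id score` line per similar fragment. As a rough illustration of how such a file could be read into (query, hit, score) triples, here is a stdlib-only sketch. The function name `parse_similar_text` is hypothetical and this is not the parser used by `kripodb similarities import`.

```python
HEADER_PREFIX = "Compounds similar to "


def parse_similar_text(lines):
    """Yield (query_id, hit_id, score) tuples from lines in the format above."""
    query = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(HEADER_PREFIX):
            # Header line: remember the query fragment, dropping the trailing colon.
            query = line[len(HEADER_PREFIX):].rstrip(":")
            continue
        if query is None:
            raise ValueError("score line before any header: " + line)
        hit_id, score = line.split()
        yield query, hit_id, float(score)


if __name__ == "__main__":
    example = [
        "Compounds similar to 2xry_FAD_frag4:",
        "2xry_FAD_frag4 1.0000",
        "3cvv_FAD_frag3 0.5600",
    ]
    for row in parse_similar_text(example):
        print(row)
```

Because the function accepts any iterable of lines, it can be fed an open file or the output of `gzip.open(path, "rt")` without loading the whole file into memory.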
7777

78-
To create a single distance matrix from multiple text files:
78+
To create a single similarity matrix from multiple text files:
7979
```
80-
gunzip -c dist_01_01.txt.gz | kripodb distances import --ignore_upper_triangle - fragments.sqlite dist_01_01.h5
81-
gunzip -c dist_01_02.txt.gz | kripodb distances import - fragments.sqlite dist_01_02.h5
82-
gunzip -c dist_02_02.txt.gz | kripodb distances import --ignore_upper_triangle - fragments.sqlite dist_02_02.h5
83-
kripodb distances merge dist_??_??.h5 dist_all.h5
80+
gunzip -c sim_01_01.txt.gz | kripodb similarities import --ignore_upper_triangle - fragments.sqlite sim_01_01.h5
81+
gunzip -c sim_01_02.txt.gz | kripodb similarities import - fragments.sqlite sim_01_02.h5
82+
gunzip -c sim_02_02.txt.gz | kripodb similarities import --ignore_upper_triangle - fragments.sqlite sim_02_02.h5
83+
kripodb similarities merge sim_??_??.h5 sim_all.h5
8484
```
8585

86-
The `--ignore_upper_triangle` flag is used to prevent scores corruption when freezing distance matrix.
86+
The `--ignore_upper_triangle` flag is used to prevent scores corruption when freezing similarity matrix.
8787

8888
# Data sets
8989

@@ -96,7 +96,7 @@ An example data set included in the [data/](data/) directory of this repo. See [
9696
All fragments based on GPCR proteins compared with all proteins in PDB.
9797

9898
* kripo.gpcrandhits.sqlite - Fragments sqlite database
99-
* kripo.gpcr.h5 - HDF5 file with distance matrix
99+
* kripo.gpcr.h5 - HDF5 file with similarity matrix
100100

101101
The data set has been published at [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.50835.svg)](http://dx.doi.org/10.5281/zenodo.50835)
102102

@@ -106,8 +106,8 @@ All fragments form all proteins-ligand complexes in PDB compared with all.
106106
Data set contains PDB entries that were available on 23 December 2015.
107107

108108
* kripo.sqlite - Fragments sqlite database
109-
* Distance matrix is too big to ship with VM so use http://3d-e-chem.vu-compmedchem.nl/kripodb webservice url to query.
110-
* kripo_fingerprint_2015_*.fp.gz - Fragment fingerprints, see [here](#create-distance-matrix-from-text-files) for instructions how to convert to a distance matrix.
109+
* Similarity matrix is too big to ship with VM so use http://3d-e-chem.vu-compmedchem.nl/kripodb webservice url to query.
110+
* kripo_fingerprint_2015_*.fp.gz - Fragment fingerprints, see [here](#create-similarity-matrix-from-text-files) for instructions how to convert to a similarity matrix.
111111

112112
The data set has been published at [![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.55254.svg)](http://dx.doi.org/10.5281/zenodo.55254)
113113

@@ -152,7 +152,7 @@ The Kripo data files can be queried using a web service.
152152

153153
Start webservice with:
154154
```
155-
kripodb serve --port 8084 data/distances.h5
155+
kripodb serve --port 8084 data/similarities.h5
156156
```
157157
It will print the urls for the swagger spec and UI.
158158

data/README.md

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
* fragments.sqlite - Fragments sqlite database containing a small number of fragments with their smiles string and molblock.
44
* fingerprints.sqlite - Fingerprints sqlite database with fingerprint stored as [fastdumped intbitset](http://intbitset.readthedocs.org/en/latest/index.html#intbitset.intbitset.fastdump)
5-
* distances.h5 - HDF5 file with distance matrix of fingerprints using modified tanimoto coefficient
5+
* similarities.h5 - HDF5 file with similarities matrix of fingerprints using modified tanimoto similarity index
66

77
## Creating tiny data set
88

@@ -23,8 +23,9 @@ EOF
2323
2424
```
2525

26-
3. Create distance matrix
26+
3. Create similarity matrix
2727

2828
```
29-
kripodb fingerprints distances --fragmentsdbfn fragments.sqlite fingerprints.sqlite fingerprints.sqlite distances.h5
30-
```
29+
kripodb fingerprints similarities --fragmentsdbfn fragments.sqlite fingerprints.sqlite fingerprints.sqlite similarities.h5
30+
```
31+
File renamed without changes.

kripodb/canned.py

Lines changed: 16 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -13,28 +13,28 @@
1313
# limitations under the License.
1414
"""Module with functions which use pandas DataFrame as input and output.
1515
16-
For using Kripo data files inside Knime (http://www.knime.org)
16+
For using Kripo data files inside KNIME (http://www.knime.org)
1717
"""
1818

1919
from __future__ import absolute_import
2020

2121
import tables
2222

2323
import pandas as pd
24-
from kripodb.frozen import FrozenDistanceMatrix
24+
from kripodb.frozen import FrozenSimilarityMatrix
2525
from .db import FragmentsDb
26-
from .hdf5 import DistanceMatrix
27-
from .pairs import similar
26+
from .hdf5 import SimilarityMatrix
27+
from .pairs import similar, open_similarity_matrix
2828
from .webservice.client import WebserviceClient
2929

3030

31-
def similarities(queries, distance_matrix_filename_or_url, cutoff, limit=1000):
32-
"""Find similar fragments to queries based on distance matrix.
31+
def similarities(queries, similarity_matrix_filename_or_url, cutoff, limit=1000):
32+
"""Find similar fragments to queries based on similarity matrix.
3333
3434
Args:
3535
queries (List[str]): Query fragment identifiers
36-
distance_matrix_filename_or_url (str): Filename of distance matrix file or base url of kripodb webservice
37-
cutoff (float): Cutoff, distance scores below cutoff are discarded.
36+
similarity_matrix_filename_or_url (str): Filename of similarity matrix file or base url of kripodb webservice
37+
cutoff (float): Cutoff, similarity scores below cutoff are discarded.
3838
limit (int): Maximum number of hits for each query.
3939
Default is 1000. Use None for no limit.
4040
@@ -44,12 +44,12 @@ def similarities(queries, distance_matrix_filename_or_url, cutoff, limit=1000):
4444
>>> import pandas as pd
4545
>>> from kripodb.canned import similarities
4646
>>> queries = pd.Series(['3j7u_NDP_frag24'])
47-
>>> hits = similarities(queries, 'data/distances.h5', 0.55)
47+
>>> hits = similarities(queries, 'data/similarities.h5', 0.55)
4848
>>> len(hits)
4949
11
5050
51-
Retrieved from web service instead of local distance matrix file.
52-
Make sure the web service is running, for example by `kripodb serve data/distances.h5`.
51+
Retrieved from web service instead of local similarity matrix file.
52+
Make sure the web service is running, for example by `kripodb serve data/similarities.h5`.
5353
5454
>>> hits = similarities(queries, 'http://localhost:8084/kripo', 0.55)
5555
>>> len(hits)
@@ -59,28 +59,22 @@ def similarities(queries, distance_matrix_filename_or_url, cutoff, limit=1000):
5959
pandas.DataFrame: Data frame with query_fragment_id, hit_frag_id and score columns
6060
"""
6161
hits = []
62-
if distance_matrix_filename_or_url.startswith('http'):
63-
client = WebserviceClient(distance_matrix_filename_or_url)
62+
if similarity_matrix_filename_or_url.startswith('http'):
63+
client = WebserviceClient(similarity_matrix_filename_or_url)
6464
for query in queries:
6565
qhits = client.similar_fragments(query, cutoff, limit)
6666
hits.extend(qhits)
6767
else:
68-
f = tables.open_file(distance_matrix_filename_or_url, 'r')
69-
is_frozen = 'scores' in f.root
70-
f.close()
71-
if is_frozen:
72-
distance_matrix = FrozenDistanceMatrix(distance_matrix_filename_or_url)
73-
else:
74-
distance_matrix = DistanceMatrix(distance_matrix_filename_or_url)
68+
similarity_matrix = open_similarity_matrix(similarity_matrix_filename_or_url)
7569
for query in queries:
76-
for query_id, hit_id, score in similar(query, distance_matrix, cutoff, limit):
70+
for query_id, hit_id, score in similar(query, similarity_matrix, cutoff, limit):
7771
hit = {'query_frag_id': query_id,
7872
'hit_frag_id': hit_id,
7973
'score': score,
8074
}
8175
hits.append(hit)
8276

83-
distance_matrix.close()
77+
similarity_matrix.close()
8478

8579
return pd.DataFrame(hits)
8680
