Commit fe006f6

Fix docs links (#473)
1 parent e412750 commit fe006f6

9 files changed, +45 -49 lines changed

examples/01_skfp_introduction.ipynb

Lines changed: 4 additions & 4 deletions
@@ -49,10 +49,10 @@
 "\n",
 "We compute the popular [Extended Connectivity Fingerprint (ECFP)](https://docs.chemaxon.com/display/docs/fingerprints_extended-connectivity-fingerprint-ecfp.md), also known as Morgan fingerprint. By default, it uses radius 2 (diameter 4, we call this ECFP4 fingerprints) and 2048 bits (dimensions). Then, we train Random Forest classifier on those features, and evaluate it using AUROC (Area Under Receiver Operating Characteristic curve).\n",
 "\n",
-"All those elements are described in [scikit-fingerprints documentation](https://scikit-fingerprints.github.io/scikit-fingerprints/index.html):\n",
-"- [BACE dataset](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/datasets/generated/skfp.datasets.moleculenet.load_bace.html)\n",
-"- [scaffold split](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.model_selection.scaffold_train_test_split.html)\n",
-"- [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html)"
+"All those elements are described in [scikit-fingerprints documentation](https://scikit-fingerprints.readthedocs.io/latest/index.html):\n",
+"- [BACE dataset](https://scikit-fingerprints.readthedocs.io/latest/modules/datasets/generated/skfp.datasets.moleculenet.load_bace.html)\n",
+"- [scaffold split](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.model_selection.scaffold_train_test_split.html)\n",
+"- [ECFP fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.ECFPFingerprint.html)"
 ]
 },
 {
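
Every link change in this commit swaps the same URL prefix, so the edit could be reproduced mechanically. The following is a minimal sketch of such a rewrite over a parsed notebook, not the method actually used for this commit. The function names `rewrite_links` and `rewrite_notebook_sources` are hypothetical, and the two prefixes are taken from the removed and added lines above.

```python
import json

# Prefixes as they appear in the removed (-) and added (+) lines of this diff.
OLD_PREFIX = "https://scikit-fingerprints.github.io/scikit-fingerprints/"
NEW_PREFIX = "https://scikit-fingerprints.readthedocs.io/latest/"


def rewrite_links(text: str) -> str:
    """Replace every occurrence of the old docs prefix with the new one."""
    return text.replace(OLD_PREFIX, NEW_PREFIX)


def rewrite_notebook_sources(notebook: dict) -> dict:
    """Return a copy of a parsed .ipynb dict with links rewritten in markdown cells."""
    updated = json.loads(json.dumps(notebook))  # deep copy via JSON round trip
    for cell in updated.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # leave code cells untouched
        source = cell.get("source", [])
        # A cell's "source" may be a single string or a list of strings.
        if isinstance(source, str):
            cell["source"] = rewrite_links(source)
        else:
            cell["source"] = [rewrite_links(line) for line in source]
    return updated
```

Links that only share the `readthedocs.io` domain, such as the UMAP documentation URL, are left alone because the match is on the full old prefix.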

examples/02_fingerprint_types.ipynb

Lines changed: 8 additions & 10 deletions
@@ -68,7 +68,7 @@
 "source": [
 "### Descriptors\n",
 "\n",
-"**Descriptors** are sets of physicochemical properties of molecule, e.g. number of heavy atoms (non-hydrogens), number of rings, estimated solubility, distributions of inter-atomic distances, and more. Those are typically floating point numbers or counts of simple topology features (graph structure). They are often very interpretable, as each feature has a certain chemical meaning. Those are e.g. [Mordred](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [VSA](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.VSAFingerprint.html).\n",
+"**Descriptors** are sets of physicochemical properties of molecule, e.g. number of heavy atoms (non-hydrogens), number of rings, estimated solubility, distributions of inter-atomic distances, and more. Those are typically floating point numbers or counts of simple topology features (graph structure). They are often very interpretable, as each feature has a certain chemical meaning. Those are e.g. [Mordred](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [VSA](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.VSAFingerprint.html).\n",
 "\n",
 "**Pros:**\n",
 "- well-correlated with many global properties of molecule\n",
@@ -81,7 +81,7 @@
 "- typically don't benefit from sparse arrays\n",
 "- some are very slow to compute\n",
 "\n",
-"Let's compute two descriptor fingerprints: [Mordred](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [RDKit2DDescriptorsFingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.RDKit2DDescriptorsFingerprint.html). Mordred is a set of descriptors proposed in the [Mordred software publication](https://doi.org/10.1186/s13321-018-0258-y), and the RDKit2DDescriptorsFingerprint is simply the collection of all topological descriptors available in RDKit.\n",
+"Let's compute two descriptor fingerprints: [Mordred](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [RDKit2DDescriptorsFingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.RDKit2DDescriptorsFingerprint.html). Mordred is a set of descriptors proposed in the [Mordred software publication](https://doi.org/10.1186/s13321-018-0258-y), and the RDKit2DDescriptorsFingerprint is simply the collection of all topological descriptors available in RDKit.\n",
 "\n",
 "We will set a few options: `n_jobs=-1, batch_size=1, verbose=1`. Fingerprints can be computed for all molecules independently, so parallelism with multiple cores is very efficient. Setting `n_jobs=-1` by default uses all available N cores, dividing the dataset into N equal-sized batches. `batch_size` gives us more fine-grained control, and combined with `verbose=True`, it will show a nice progress bar, allowing us to check the progress molecule by molecule.\n",
 "\n",
@@ -569,7 +569,7 @@
 "source": [
 "### Substructure fingerprints\n",
 "\n",
-"**Substructure fingerprints** check for existence of selected substructures (subgraphs, patterns) in a molecule, such as functional groups, rings of given size, or counts of atoms of particular element. They are often hand-crafted and selected by domain experts, reflecting which parts of molecule is typically interesting to e.g. medicinal chemists. Substructures are typically described using [SMARTS patterns](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html), which can be though of as a kind of \"regular expressions\" for molecule structures. Examples include [MACCS fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MACCSFingerprint.html) and [PubChem fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.PubChemFingerprint.html).\n",
+"**Substructure fingerprints** check for existence of selected substructures (subgraphs, patterns) in a molecule, such as functional groups, rings of given size, or counts of atoms of particular element. They are often hand-crafted and selected by domain experts, reflecting which parts of molecule is typically interesting to e.g. medicinal chemists. Substructures are typically described using [SMARTS patterns](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html), which can be though of as a kind of \"regular expressions\" for molecule structures. Examples include [MACCS fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.MACCSFingerprint.html) and [PubChem fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.PubChemFingerprint.html).\n",
 "\n",
 "Like descriptors, they have a constant length, but they result in integer-valued vectors. We can distinguish **binary** and **count** variants of those fingerprints. Binary ones only check for existence of a given pattern, and often are created as if/else conditions, e.g. \"number of oxygens >= 4\" or \"is there a ring of size 6?\". Count variants instead use the number of occurrences of a substructure, e.g. \"number of oxygens\" or \"number of rings of size 6\". Due to this difference, the count version may have less features. Most of those are scikit-fingerprints novel propositions.\n",
 "\n",
@@ -582,7 +582,7 @@
 "- don't generalize well outside the chemical space they've been designed for\n",
 "- longer ones are quite slow\n",
 "\n",
-"We'll compute [MACCS fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MACCSFingerprint.html) and [PubChem fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.PubChemFingerprint.html), in both binary and count versions. SMARTS pattern matching can be quite slow, so those fingerprints often benefit from parallelism, similarly to descriptors."
+"We'll compute [MACCS fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.MACCSFingerprint.html) and [PubChem fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.PubChemFingerprint.html), in both binary and count versions. SMARTS pattern matching can be quite slow, so those fingerprints often benefit from parallelism, similarly to descriptors."
 ]
 },
 {
@@ -678,11 +678,11 @@
 "source": [
 "### Hashed fingerprints\n",
 "\n",
-"**Hashed fingerprints** extract all subgraphs of general shape, such as shortest paths between pairs of atoms (linear subgraphs) or neighborhoods of bonded atoms (circular fingerprints). From each substructure, an integer identifier is then computed. Then we use the hashing function (hence the name), which translates the identifier into an index in the final vector, where we put information that subgraph was detected. Those are e.g. [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html) and [Atom Pair fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.AtomPairFingerprint.html).\n",
+"**Hashed fingerprints** extract all subgraphs of general shape, such as shortest paths between pairs of atoms (linear subgraphs) or neighborhoods of bonded atoms (circular fingerprints). From each substructure, an integer identifier is then computed. Then we use the hashing function (hence the name), which translates the identifier into an index in the final vector, where we put information that subgraph was detected. Those are e.g. [ECFP fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.ECFPFingerprint.html) and [Atom Pair fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.AtomPairFingerprint.html).\n",
 "\n",
 "Initially, each atom gets assigned a numerical identifier, called atom invariant. It combines a few basic properties like e.g. atomic number, charge, and atomic mass, into a single integer value (typically with XOR function). This allows us to distinguish e.g. carbon in different contexts.\n",
 "\n",
-"Then, subgraphs are computed, with their shape depending on a fingerprint. For example, [ECFP](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html) uses circular neighborhood, e.g. atom with neighbors (bonded atoms), then with their neighbors (radius 2 neighborhood), up to a given radius (by default 2). The identifiers of atoms and bonds in the subgraph are combined into a single identifier of a whole substructure.\n",
+"Then, subgraphs are computed, with their shape depending on a fingerprint. For example, [ECFP](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.ECFPFingerprint.html) uses circular neighborhood, e.g. atom with neighbors (bonded atoms), then with their neighbors (radius 2 neighborhood), up to a given radius (by default 2). The identifiers of atoms and bonds in the subgraph are combined into a single identifier of a whole substructure.\n",
 "\n",
 "Output vector starts with only zeros. It has a given length (also called \"number of bits\"), which is a common hyperparameter of all hashed fingerprints. Subgraph identifiers are hashed into it, translating subgraph identifier into index of a vector, e.g. with a [modulo function](https://en.wikipedia.org/wiki/Modulo). Hashing collisions may occur, when two distinct substructures get the same index, but this is typically not a big problem. Binary variant ignores such collisions (just marks 1 at a given position), and count version sums up all occurrences at each index.\n",
 "\n",
@@ -695,7 +695,7 @@
 "**Cons:**\n",
 "- not interpretable\n",
 "\n",
-"Let's check out [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html) and [Atom Pair fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.AtomPairFingerprint.html). We will create default versions with length 2048, and short ones with 1024. Those fingerprints are often so fast to compute they don't benefit from parallelism at all."
+"Let's check out [ECFP fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.ECFPFingerprint.html) and [Atom Pair fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.AtomPairFingerprint.html). We will create default versions with length 2048, and short ones with 1024. Those fingerprints are often so fast to compute they don't benefit from parallelism at all."
 ]
 },
 {
@@ -919,9 +919,7 @@
 "cell_type": "markdown",
 "id": "a2fd6d70-1479-46e0-81bd-5b6f146f7db2",
 "metadata": {},
-"source": [
-"For evaluating the efficiency of downstream ML models, we'll train a Random Forest classifier and check its AUROC. Both this model type and metric are commonly used in chemoinformatics. For train-test splitting, we use [scaffold split](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.model_selection.scaffold_train_test_split.html), which typically gives better estimation than overly optimistic random split."
-]
+"source": "For evaluating the efficiency of downstream ML models, we'll train a Random Forest classifier and check its AUROC. Both this model type and metric are commonly used in chemoinformatics. For train-test splitting, we use [scaffold split](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.model_selection.scaffold_train_test_split.html), which typically gives better estimation than overly optimistic random split."
 },
 {
 "cell_type": "code",
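
The "Hashed fingerprints" cell in this notebook describes folding subgraph identifiers into a fixed-length vector with a modulo function, with binary and count variants treating collisions differently. A minimal pure-Python sketch of that folding step follows. The function `fold_identifiers` and the example identifiers are hypothetical, and real implementations (RDKit, scikit-fingerprints) may hash differently; only the modulo idea comes from the notebook text.

```python
def fold_identifiers(identifiers, n_bits=2048, count=False):
    """Fold integer subgraph identifiers into a vector of length n_bits.

    Binary variant: mark 1 at each hit index, so collisions are ignored.
    Count variant: sum all occurrences that land on the same index.
    """
    vector = [0] * n_bits  # the output vector starts with only zeros
    for identifier in identifiers:
        index = identifier % n_bits  # modulo maps the identifier to an index
        if count:
            vector[index] += 1
        else:
            vector[index] = 1
    return vector


# Made-up identifiers: 5 and 2053 collide for n_bits=2048 (2053 % 2048 == 5),
# and 7 occurs twice.
example_ids = [5, 2053, 7, 7]
binary_fp = fold_identifiers(example_ids, n_bits=2048)
count_fp = fold_identifiers(example_ids, n_bits=2048, count=True)
```

In the binary vector, indices 5 and 7 both hold 1. In the count vector, both hold 2, although index 5 got there through a collision between two distinct identifiers.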

examples/03_pipelines.ipynb

Lines changed: 4 additions & 4 deletions
@@ -32,7 +32,7 @@
 "\n",
 "We can easily chain all relevant operations in a single pipeline.\n",
 "\n",
-"Also, note that for convenience `scaffold_train_test_split` and all fingerprints, e.g. `ECFPFingerprint` ([docs](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html#skfp.fingerprints.ECFPFingerprint)) used below, can also take SMILES input. Those strings will be converted to molecules inside as necessary, since this is a very cheap operation. Whole pipeline then also takes lists of SMILES as inputs, instead of NumPy arrays, e.g. `smiles_train` instead of `X_train`."
+"Also, note that for convenience `scaffold_train_test_split` and all fingerprints, e.g. `ECFPFingerprint` ([docs](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.ECFPFingerprint.html#skfp.fingerprints.ECFPFingerprint)) used below, can also take SMILES input. Those strings will be converted to molecules inside as necessary, since this is a very cheap operation. Whole pipeline then also takes lists of SMILES as inputs, instead of NumPy arrays, e.g. `smiles_train` instead of `X_train`."
 ]
 },
 {
@@ -91,7 +91,7 @@
 "source": [
 "### More complex molecular property prediction pipeline\n",
 "\n",
-"We will reuse the pipeline from above, but also add a second [MACCS fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MACCSFingerprint.html#skfp.fingerprints.MACCSFingerprint) and concatenate it as additional features. Scikit-learn has a built-in `FeatureUnion` class for this, with `make_union` function for easy usage. As it often benefits from parallelism, we will add `n_jobs=-1` to it.\n",
+"We will reuse the pipeline from above, but also add a second [MACCS fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.MACCSFingerprint.html#skfp.fingerprints.MACCSFingerprint) and concatenate it as additional features. Scikit-learn has a built-in `FeatureUnion` class for this, with `make_union` function for easy usage. As it often benefits from parallelism, we will add `n_jobs=-1` to it.\n",
 "\n",
 "Since we may have all-zero features, we will also filter them out as another step, using `VarianceThreshold` ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html))."
 ]
@@ -146,7 +146,7 @@
 "\n",
 "Here, we will build a visualization pipeline that uses dimensionality reduction for 2D visualization.\n",
 "\n",
-"[UMAP](https://umap-learn.readthedocs.io/en/latest/) is a particularly powerful, yet easy to use nonlinear dimensionality reduction. It requires using some distance metric, which is used for pairwise distance calculation. While scikit-fingerprints implements [distance and similarity measures](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/distances.html) commonly used in chemoinformatics, like [Tanimoto distance](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.distances.tanimoto_binary_distance.html), UMAP also supports it directly under the name `\"jaccard\"`. Lastly, we'll plot the training and testing data with Matplotlib.\n",
+"[UMAP](https://umap-learn.readthedocs.io/en/latest/) is a particularly powerful, yet easy to use nonlinear dimensionality reduction. It requires using some distance metric, which is used for pairwise distance calculation. While scikit-fingerprints implements [distance and similarity measures](https://scikit-fingerprints.readthedocs.io/latest/modules/distances.html) commonly used in chemoinformatics, like [Tanimoto distance](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.distances.tanimoto_binary_distance.html), UMAP also supports it directly under the name `\"jaccard\"`. Lastly, we'll plot the training and testing data with Matplotlib.\n",
 "\n",
 "Since neither UMAP, nor Matplotlib are required by `scikit-fingerprints`, we will install them separately."
 ]
@@ -236,7 +236,7 @@
 "source": [
 "### Pipelines with conformational fingerprints\n",
 "\n",
-"Some fingerprints, like [RDF fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.RDFFingerprint.html), are based on the 3D (spatial) structure of a molecule conformer. They can be easily generated in scikit-fingerprints with [ConformerGenerator](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.preprocessing.ConformerGenerator.html), which uses ETKDGv3 algorithm underneath. This information is saved as a molecule attribute, and such molecules can be vectorized with both conformational fingerprints and regular ones. Note that conformer generation is hard, and can sometimes fail. It is also quite computationally expensive, and using parallelization is very useful.\n",
+"Some fingerprints, like [RDF fingerprint](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.fingerprints.RDFFingerprint.html), are based on the 3D (spatial) structure of a molecule conformer. They can be easily generated in scikit-fingerprints with [ConformerGenerator](https://scikit-fingerprints.readthedocs.io/latest/modules/generated/skfp.preprocessing.ConformerGenerator.html), which uses ETKDGv3 algorithm underneath. This information is saved as a molecule attribute, and such molecules can be vectorized with both conformational fingerprints and regular ones. Note that conformer generation is hard, and can sometimes fail. It is also quite computationally expensive, and using parallelization is very useful.\n",
 "\n",
 "We will use both conformational RDF fingerprint and topological (\"flat\") ECFP fingerprint here, concatenated to create a more rich feature space for classification. Since they create features with really different value ranges, we'll add min-max scaling. This is particularly important for linear classifiers and anything not based on decision trees. To make things more interesting, we will use logistic regression, the most popular linear model."
 ]
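
The UMAP cell in this notebook notes that the Tanimoto distance on binary fingerprints is available in UMAP under the name `"jaccard"`, since the two are the same measure for binary vectors. A minimal pure-Python sketch of that distance follows. The function `binary_tanimoto_distance` is a hypothetical illustration, not the `skfp.distances` implementation, and returning 0.0 for two all-zero vectors is an assumed convention.

```python
def binary_tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two equal-length binary vectors.

    distance = 1 - |a AND b| / |a OR b|
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    if either == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return 1.0 - both / either


# Toy 4-bit fingerprints: one shared bit out of three set bits in total.
fp_a = [1, 1, 0, 0]
fp_b = [1, 0, 1, 0]
distance = binary_tanimoto_distance(fp_a, fp_b)  # 1 - 1/3
```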
