|
7 | 7 | "source": [ |
8 | 8 | "# GP Regression on Molecules #\n", |
9 | 9 | "\n", |
10 | | - "An example notebook for basic GP regression on a molecular dataset. We showcase two different GP models on the Photoswitch Dataset --- one using a Tanimoto kernel applied to fingerprint representations of the molecules and another also using a Tanimoto kernel but applied to a bag-of-characters representations of molecular SMILES strings (a.k.a the bag-of-SMILES model). Towards the end of the tutorial it is shown that the GP's uncertainty estimates are correlated with prediction error and can thus act as a criteria for prioritising molecules for laboratory synthesis.\n", |
| 10 | + "An example notebook for basic GP regression on a molecular dataset. We showcase two GP models on the Photoswitch Dataset: one using a Tanimoto kernel applied to fingerprint representations of the molecules, and another also using a Tanimoto kernel but applied to a bag-of-characters representation of molecular SMILES strings (a.k.a. the bag-of-SMILES model). Note that the bag-of-SMILES model is equivalent to the SMILES string kernel in [1]. Towards the end of the tutorial, we show that the GP's uncertainty estimates are correlated with prediction error and can thus act as a criterion for prioritising molecules for laboratory synthesis.\n", |
11 | 11 | "\n", |
12 | 12 | "Paper: https://pubs.rsc.org/en/content/articlelanding/2022/sc/d2sc04306h\n", |
13 | 13 | "\n", |
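The bag-of-characters representation mentioned in the intro cell can be illustrated with a small sketch. This is only an assumption about the general idea (counting single characters of each SMILES string over a shared vocabulary); the helper name `bag_of_characters` is hypothetical, and the notebook's actual featuriser may tokenise differently.

```python
from collections import Counter


def bag_of_characters(smiles_list):
    """Hypothetical helper: count single characters of each SMILES string."""
    # Shared vocabulary: every distinct character seen, in sorted order.
    vocab = sorted({ch for smiles in smiles_list for ch in smiles})
    rows = []
    for smiles in smiles_list:
        counts = Counter(smiles)
        # One count per vocabulary entry; characters not present count as 0.
        rows.append([counts.get(ch, 0) for ch in vocab])
    return vocab, rows


# Ethanol and benzene as example SMILES strings.
vocab, X = bag_of_characters(["CCO", "c1ccccc1"])
```

Each row of `X` is a count vector that a Tanimoto kernel could then compare in the same way as a fingerprint.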
|
62 | 62 | "k_{\\text{Tanimoto}}(\\mathbf{x}, \\mathbf{x}') = \\sigma^2_{f}\\frac{<\\mathbf{x}, \\mathbf{x}'>}{||\\mathbf{x}||^2 + ||\\mathbf{x}'||^2 \\: - <\\mathbf{x}, \\mathbf{x}'>},\n", |
63 | 63 | "\\end{equation}\n", |
64 | 64 | "\n", |
65 | | - "where $\\mathbf{x} \\in \\mathbb{R}^D$ is a D-dimensional binary fingerprint vector i.e. components $\\mathbf{x}_i \\in \\{0, 1\\}$, $<\\cdot, \\cdot>$ is the Euclidean inner product, $||\\cdot||$ is the Euclidean norm and $\\sigma_{f}$ is a scalar kernel signal amplitude (vertical lengthscale) hyperparameter. One of the first instances of the Tanimoto kernel being used in conjunction with GP regression was in [1]. While common GP kernels that operate on continuous spaces can be applied to molecules, there is evidence to suggest that using an appropriate similarity metric for bit vectors yields improved performance [2]. \n", |
| 65 | + "where $\\mathbf{x} \\in \\mathbb{R}^D$ is a D-dimensional binary fingerprint vector, i.e. components $\\mathbf{x}_i \\in \\{0, 1\\}$, $<\\cdot, \\cdot>$ is the Euclidean inner product, $||\\cdot||$ is the Euclidean norm and $\\sigma_{f}$ is a scalar kernel signal amplitude (vertical lengthscale) hyperparameter. One of the first instances of the Tanimoto kernel being used in conjunction with GP regression was in [2]. While common GP kernels that operate on continuous spaces can be applied to molecules, there is evidence to suggest that using an appropriate similarity metric for bit vectors yields improved performance [3]. \n", |
66 | 66 | "\n" |
67 | 67 | ] |
68 | 68 | }, |
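For readers of this hunk, the Tanimoto kernel formula in the cell above can be written out directly in NumPy. This is a minimal sketch, not the notebook's or any library's implementation: the function name `tanimoto_kernel`, the `variance` argument (standing in for $\sigma^2_f$), and the convention of returning similarity 1 for two all-zero vectors are all assumptions made for illustration.

```python
import numpy as np


def tanimoto_kernel(X1, X2, variance=1.0):
    """Sketch of k(x, x') = variance * <x, x'> / (||x||^2 + ||x'||^2 - <x, x'>)."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    cross = X1 @ X2.T                          # pairwise inner products <x, x'>
    sq1 = np.sum(X1 ** 2, axis=1)[:, None]     # ||x||^2 as a column
    sq2 = np.sum(X2 ** 2, axis=1)[None, :]     # ||x'||^2 as a row
    denom = sq1 + sq2 - cross
    # Assumed convention: two all-zero vectors (0/0) get similarity 1.
    zero = denom == 0
    similarity = np.where(zero, 1.0, cross / np.where(zero, 1.0, denom))
    return variance * similarity


# Three toy 4-bit fingerprints.
X = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])
K = tanimoto_kernel(X, X)
```

For the first two fingerprints the inner product is 2 and the squared norms are 3 and 2, so the entry is 2 / (3 + 2 - 2) = 2/3; fingerprints with no shared bits give 0.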
|
654 | 654 | "metadata": {}, |
655 | 655 | "source": [ |
656 | 656 | "## References \n", |
| 657 | + "[1] D-S Cao, J-C Zhao, Y-N Yang, C-X Zhao, J Yan, S Liu, Q-N Hu, Q-S Xu, and Y-Z Liang. [In silico toxicity prediction by support vector machine and SMILES representation-based string kernel](https://pubmed.ncbi.nlm.nih.gov/22224501/). SAR and QSAR in Environmental Research, 2012.\n", |
657 | 658 | "\n", |
658 | | - "[1] Griffiths, R.R., Greenfield, J.L., Thawani, AR, Jamasb, A., Moss, H.B, Bourached, A., Jones, P., McCorkindale, W., Aldrick, A.A. Fuchter, M.J. and Lee, A.A., [Data-driven discovery of molecular photoswitches with multioutput Gaussian processes](https://pubs.rsc.org/en/content/articlehtml/2022/sc/d2sc04306h). Chemical Science 2022.\n", |
659 | 659 | "\n", |
660 | | - "[2] Bajusz, D., Rácz, A. and Héberger, K., 2015. [Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3). Journal of Cheminformatics, 7(1), pp.1-13.\n", |
| 660 | + "[2] Griffiths, R.R., Greenfield, J.L., Thawani, AR, Jamasb, A., Moss, H.B, Bourached, A., Jones, P., McCorkindale, W., Aldrick, A.A. Fuchter, M.J. and Lee, A.A., [Data-driven discovery of molecular photoswitches with multioutput Gaussian processes](https://pubs.rsc.org/en/content/articlehtml/2022/sc/d2sc04306h). Chemical Science 2022.\n", |
661 | 661 | "\n", |
662 | | - "[3] Moriwaki, H., Tian, Y.S., Kawashita, N. and Takagi, T., 2018. [Mordred: a molecular descriptor calculator](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y?ref=https://githubhelp.com). Journal of Cheminformatics, 10(1), pp.1-14." |
| 662 | + "[3] Bajusz, D., Rácz, A. and Héberger, K., 2015. [Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-015-0069-3). Journal of Cheminformatics, 7(1), pp.1-13.\n", |
| 663 | + "\n", |
| 664 | + "[4] Moriwaki, H., Tian, Y.S., Kawashita, N. and Takagi, T., 2018. [Mordred: a molecular descriptor calculator](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y?ref=https://githubhelp.com). Journal of Cheminformatics, 10(1), pp.1-14." |
663 | 665 | ] |
664 | 666 | } |
665 | 667 | ], |
|