|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "5d7a1aa6", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# Molecular Clustering Algorithms " |
| 9 | + ] |
| 10 | + }, |
| 11 | + { |
| 12 | + "cell_type": "markdown", |
| 13 | + "id": "6d326fb9", |
| 14 | + "metadata": {}, |
| 15 | + "source": [ |
| 16 | + "**scikit-fingerprints** provides tools to partition chemical space using fingerprint-based methods. These clustering algorithms make it possible to divide large molecular collections into smaller, representative groups, which is useful in a variety of settings such as:\n", |
| 17 | + "\n", |
| 18 | + "- organizing compounds in virtual screening campaigns\n", |
| 19 | + "- exploring chemical diversity\n", |
| 20 | + "- constructing stratified splits for machine-learning model validation\n", |
| 21 | + "\n", |
| 22 | + "In the following tutorial, we demonstrate how to work with built-in datasets, preprocess molecular data, and apply clustering algorithms to partition chemical space in practice." |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "code", |
| 27 | + "execution_count": 1, |
| 28 | + "id": "df2e8529", |
| 29 | + "metadata": {}, |
| 30 | + "outputs": [], |
| 31 | + "source": [ |
| 32 | + "import pandas as pd\n", |
| 33 | + "\n", |
| 34 | + "from skfp.clustering import MaxMinClustering\n", |
| 35 | + "from skfp.datasets.moleculenet import load_bace\n", |
| 36 | + "from skfp.fingerprints import ECFPFingerprint" |
| 37 | + ] |
| 38 | + }, |
| 39 | + { |
| 40 | + "cell_type": "code", |
| 41 | + "execution_count": 2, |
| 42 | + "id": "b65ec205", |
| 43 | + "metadata": {}, |
| 44 | + "outputs": [], |
| 45 | + "source": [ |
| 46 | + "# Get and preprocess data - here, we use a built-in MoleculeNet dataset and compute binary fingerprints\n", |
| 47 | + "\n", |
| 48 | + "smiles, _ = load_bace()\n", |
| 49 | + "fps = ECFPFingerprint().fit_transform(smiles)" |
| 50 | + ] |
| 51 | + }, |
| 52 | + { |
| 53 | + "cell_type": "markdown", |
| 54 | + "id": "3e80daf8", |
| 55 | + "metadata": {}, |
| 56 | + "source": [ |
| 57 | + "## Available clustering algorithms\n", |
| 58 | + "\n", |
| 59 | + "The table below summarizes the clustering algorithms currently implemented in\n", |
| 60 | + "**scikit-fingerprints**. Additional methods may be added in future releases.\n", |
| 61 | + "\n", |
| 62 | + "| Algorithm | Distance / Similarity | Centroid-based | Produces labels | Predict on new data | Typical use case |\n", |
| 63 | + "|---------|----------------------|----------------|------------------|---------------------|------------------|\n", |
| 64 | + "| MaxMin clustering | Tanimoto (binary fingerprints) | Yes (diverse representatives) | Yes | Yes | Partitioning chemical space into representative clusters |\n" |
| 65 | + ] |
| 66 | + }, |
| 67 | + { |
| 68 | + "cell_type": "markdown", |
| 69 | + "id": "4b229e44", |
| 70 | + "metadata": {}, |
| 71 | + "source": [ |
| 72 | + "## MaxMin clustering\n", |
| 73 | + "\n", |
| 74 | + "MaxMin clustering is a centroid-based clustering method designed for binary molecular fingerprints. It combines diversity-based centroid selection with similarity-based cluster assignment, making it particularly suitable for partitioning chemical space into representative regions.\n", |
| 75 | + "\n", |
| 76 | + "Unlike many classical clustering algorithms, MaxMin clustering explicitly selects cluster centers to maximize diversity before assigning molecules to clusters.\n", |
| 77 | + "\n", |
| 78 | + "**Algorithm overview**\n", |
| 79 | + "\n", |
| 80 | + "MaxMin clustering proceeds in two stages:\n", |
| 81 | + "\n", |
| 82 | + "- *Centroid selection (MaxMin diversity picking)*\n", |
| 83 | + "A set of representative molecules is selected using RDKit’s MaxMinPicker. Centroids are chosen iteratively such that each newly selected centroid maximizes the minimum Tanimoto distance to all previously selected centroids. This encourages broad coverage of chemical space.\n", |
| 84 | + "\n", |
| 85 | + "- *Cluster assignment*\n", |
| 86 | + "Once centroids are fixed, every molecule (including the centroids themselves) is assigned to the cluster of the centroid with which it has the highest Tanimoto similarity.\n", |
| 87 | + "\n", |
| 88 | + "**Distance and similarity**\n", |
| 89 | + "\n", |
| 90 | + "MaxMin clustering operates on binary fingerprints and uses:\n", |
| 91 | + "\n", |
| 92 | + "- *Tanimoto similarity* to measure molecular similarity\n", |
| 93 | + "- *Tanimoto distance* (1 − similarity) during centroid selection\n", |
| 94 | + "\n", |
| 95 | + "This choice is standard in cheminformatics and well suited for sparse binary representations such as ECFP (Morgan) fingerprints.\n", |
| 96 | + "\n", |
| 97 | + "**Controlling cluster granularity**\n", |
| 98 | + "\n", |
| 99 | + "The number and spread of clusters are controlled by the distance threshold used during centroid selection:\n", |
| 100 | + "- *Lower thresholds* allow centroids to lie closer together, producing more, finer-grained clusters\n", |
| 101 | + "- *Higher thresholds* force centroids further apart, producing fewer, broader clusters\n", |
| 102 | + "\n", |
| 103 | + "The exact number of clusters emerges from the data and the chosen threshold.\n", |
| 104 | + "\n", |
| 105 | + "The algorithm does not balance cluster sizes or select fixed-size subsets; such operations are left to downstream processing." |
| 106 | + ] |
| 107 | + }, |
| 108 | + { |
| 109 | + "cell_type": "code", |
| 110 | + "execution_count": 3, |
| 111 | + "id": "0ef9b136", |
| 112 | + "metadata": {}, |
| 113 | + "outputs": [ |
| 114 | + { |
| 115 | + "data": { |
| 116 | + "text/plain": [ |
| 117 | + "array([124, 307, 194, ..., 263, 263, 216], shape=(1513,))" |
| 118 | + ] |
| 119 | + }, |
| 120 | + "execution_count": 3, |
| 121 | + "metadata": {}, |
| 122 | + "output_type": "execute_result" |
| 123 | + } |
| 124 | + ], |
| 125 | + "source": [ |
| 126 | + "# run MaxMin clustering (distance threshold 0.4) - this outputs a label for each molecule\n", |
| 127 | + "clusterer = MaxMinClustering(distance_threshold=0.4, random_state=0)\n", |
| 128 | + "labels = clusterer.fit_predict(fps)\n", |
| 129 | + "labels" |
| 130 | + ] |
| 131 | + }, |
| 132 | + { |
| 133 | + "cell_type": "markdown", |
| 134 | + "id": "28dcb24b", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "**Inspecting Clustering Results**\n", |
| 138 | + "\n", |
| 139 | + "To better understand the outcome, we first examine how many clusters were created and how many molecules each cluster contains on average." |
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code", |
| 144 | + "execution_count": 4, |
| 145 | + "id": "32da15cb", |
| 146 | + "metadata": {}, |
| 147 | + "outputs": [ |
| 148 | + { |
| 149 | + "name": "stdout", |
| 150 | + "output_type": "stream", |
| 151 | + "text": [ |
| 152 | + "Number of molecules: 1513\n", |
| 153 | + "Number of clusters (distance_threshold=0.4): 347\n", |
| 154 | + "Average molecules per cluster: 4.4\n" |
| 155 | + ] |
| 156 | + } |
| 157 | + ], |
| 158 | + "source": [ |
| 159 | + "n_molecules = len(smiles)\n", |
| 160 | + "n_clusters = len(clusterer.centroid_indices_)\n", |
| 161 | + "\n", |
| 162 | + "print(f\"Number of molecules: {n_molecules}\")\n", |
| 163 | + "print(f\"Number of clusters (distance_threshold=0.4): {n_clusters}\")\n", |
| 164 | + "print(f\"Average molecules per cluster: {n_molecules / n_clusters:.1f}\")" |
| 165 | + ] |
| 166 | + }, |
| 167 | + { |
| 168 | + "cell_type": "markdown", |
| 169 | + "id": "aa74a3d7", |
| 170 | + "metadata": {}, |
| 171 | + "source": [ |
| 172 | + "**Cluster size distribution**\n", |
| 173 | + "\n", |
| 174 | + "Next, we attach the cluster labels to a table and *rank clusters by size*. \n", |
| 175 | + "\n", |
| 176 | + "Larger clusters correspond to densely populated regions of chemical space, while smaller clusters often represent more unique or isolated chemotypes." |
| 177 | + ] |
| 178 | + }, |
| 179 | + { |
| 180 | + "cell_type": "code", |
| 181 | + "execution_count": 5, |
| 182 | + "id": "44b94576", |
| 183 | + "metadata": {}, |
| 184 | + "outputs": [ |
| 185 | + { |
| 186 | + "name": "stdout", |
| 187 | + "output_type": "stream", |
| 188 | + "text": [ |
| 189 | + "Top 10 largest clusters:\n" |
| 190 | + ] |
| 191 | + }, |
| 192 | + { |
| 193 | + "data": { |
| 194 | + "text/plain": [ |
| 195 | + "cluster\n", |
| 196 | + "0 33\n", |
| 197 | + "17 32\n", |
| 198 | + "261 29\n", |
| 199 | + "281 29\n", |
| 200 | + "187 24\n", |
| 201 | + "15 21\n", |
| 202 | + "324 18\n", |
| 203 | + "218 18\n", |
| 204 | + "190 18\n", |
| 205 | + "145 17\n", |
| 206 | + "dtype: int64" |
| 207 | + ] |
| 208 | + }, |
| 209 | + "execution_count": 5, |
| 210 | + "metadata": {}, |
| 211 | + "output_type": "execute_result" |
| 212 | + } |
| 213 | + ], |
| 214 | + "source": [ |
| 215 | + "df = pd.DataFrame(\n", |
| 216 | + " {\n", |
| 217 | + " \"smiles\": smiles,\n", |
| 218 | + " \"cluster\": labels,\n", |
| 219 | + " }\n", |
| 220 | + ")\n", |
| 221 | + "\n", |
| 222 | + "cluster_sizes = df.groupby(\"cluster\").size().sort_values(ascending=False)\n", |
| 223 | + "\n", |
| 224 | + "print(\"Top 10 largest clusters:\")\n", |
| 225 | + "cluster_sizes.head(10)" |
| 226 | + ] |
| 227 | + }, |
| 228 | + { |
| 229 | + "cell_type": "markdown", |
| 230 | + "id": "01d91dc5", |
| 231 | + "metadata": {}, |
| 232 | + "source": [ |
| 233 | + "**Effect of the distance threshold**\n", |
| 234 | + "\n", |
| 235 | + "Finally, we explore how changing the distance threshold affects the number of clusters." |
| 236 | + ] |
| 237 | + }, |
| 238 | + { |
| 239 | + "cell_type": "code", |
| 240 | + "execution_count": 6, |
| 241 | + "id": "4feedf52", |
| 242 | + "metadata": {}, |
| 243 | + "outputs": [ |
| 244 | + { |
| 245 | + "name": "stdout", |
| 246 | + "output_type": "stream", |
| 247 | + "text": [ |
| 248 | + "threshold=0.30 -> n_clusters=600\n", |
| 249 | + "threshold=0.40 -> n_clusters=347\n", |
| 250 | + "threshold=0.50 -> n_clusters=211\n", |
| 251 | + "threshold=0.70 -> n_clusters=83\n" |
| 252 | + ] |
| 253 | + } |
| 254 | + ], |
| 255 | + "source": [ |
| 256 | + "for t in [0.3, 0.4, 0.5, 0.7]:\n", |
| 257 | + " c = MaxMinClustering(distance_threshold=t, random_state=0)\n", |
| 258 | + " c.fit(fps)\n", |
| 259 | + " print(f\"threshold={t:0.2f} -> n_clusters={len(c.centroid_indices_)}\")" |
| 260 | + ] |
| 261 | + }, |
| 262 | + { |
| 263 | + "cell_type": "markdown", |
| 264 | + "id": "e6f1faf3", |
| 265 | + "metadata": {}, |
| 266 | + "source": [ |
| 267 | + "Increasing the distance threshold forces centroids to be more dissimilar, so fewer centroids are selected and the clusters become fewer and broader: here the count drops from 600 at a threshold of 0.30 to 83 at 0.70." |
| 268 | + ] |
| 269 | + }, |
| 270 | + { |
| 271 | + "cell_type": "markdown", |
| 272 | + "id": "ebdd6d5f", |
| 273 | + "metadata": {}, |
| 274 | + "source": [] |
| 275 | + } |
| 276 | + ], |
| 277 | + "metadata": { |
| 278 | + "kernelspec": { |
| 279 | + "display_name": "scikit-fingerprints", |
| 280 | + "language": "python", |
| 281 | + "name": "python3" |
| 282 | + }, |
| 283 | + "language_info": { |
| 284 | + "codemirror_mode": { |
| 285 | + "name": "ipython", |
| 286 | + "version": 3 |
| 287 | + }, |
| 288 | + "file_extension": ".py", |
| 289 | + "mimetype": "text/x-python", |
| 290 | + "name": "python", |
| 291 | + "nbconvert_exporter": "python", |
| 292 | + "pygments_lexer": "ipython3", |
| 293 | + "version": "3.11.9" |
| 294 | + } |
| 295 | + }, |
| 296 | + "nbformat": 4, |
| 297 | + "nbformat_minor": 5 |
| 298 | +} |
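
The two stages described in the "MaxMin clustering" cell can be illustrated with a small NumPy sketch. This is not the scikit-fingerprints or RDKit implementation: the function names, the `first` argument (a fixed starting molecule, used here only to keep the example deterministic), and the toy fingerprints are all assumptions made for illustration. Labels in this sketch index into the list of selected centroids, which may differ from how the library numbers clusters.

```python
import numpy as np


def tanimoto_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Tanimoto distance between one binary vector `a` and every row of `b`."""
    intersection = (a & b).sum(axis=1)
    union = (a | b).sum(axis=1)
    # Assumption: two all-zero fingerprints are treated as identical.
    similarity = np.where(union > 0, intersection / np.maximum(union, 1), 1.0)
    return 1.0 - similarity


def maxmin_cluster(fps, distance_threshold: float, first: int = 0):
    """Greedy MaxMin centroid picking followed by nearest-centroid assignment."""
    if distance_threshold <= 0:
        raise ValueError("distance_threshold must be positive in this sketch")
    fps = np.asarray(fps, dtype=bool)

    # Stage 1: repeatedly pick the molecule farthest from all current centroids.
    centroids = [first]
    min_dist = tanimoto_distance(fps[first], fps)  # distance to nearest centroid
    while len(centroids) < len(fps):
        candidate = int(np.argmax(min_dist))
        if min_dist[candidate] < distance_threshold:
            break  # every molecule is already close enough to some centroid
        centroids.append(candidate)
        min_dist = np.minimum(min_dist, tanimoto_distance(fps[candidate], fps))

    # Stage 2: assign each molecule to its closest (most similar) centroid.
    dists = np.stack([tanimoto_distance(fps[c], fps) for c in centroids], axis=1)
    labels = dists.argmin(axis=1)
    return np.array(centroids), labels


# Toy data: two pairs of near-duplicate fingerprints with no shared bits.
toy_fps = np.array(
    [
        [1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1, 1, 0],
        [0, 0, 0, 0, 1, 1, 1, 1],
    ],
    dtype=bool,
)

coarse_centroids, coarse_labels = maxmin_cluster(toy_fps, distance_threshold=0.5)
fine_centroids, fine_labels = maxmin_cluster(toy_fps, distance_threshold=0.2)
print(len(coarse_centroids), len(fine_centroids))
```

On this toy input the within-pair distance is 0.25, so a threshold of 0.5 keeps one centroid per pair, while a threshold of 0.2 promotes every molecule to its own centroid. This matches the trend in the notebook output, where higher thresholds yield fewer clusters.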