5 | 5 | def infer_nct_year(nct_id): |
6 | 6 | """ |
7 | 7 | Infer the approximate registration year from a ClinicalTrials.gov NCT identifier. |
8 | | - NCT IDs follow the format ``NCT########``, where the numeric portion generally increases |
9 | | - over time. This function uses approximate year ranges based on observed NCT ID |
10 | | - allocation patterns to estimate when a trial was registered. |
| 8 | + |
| 9 | + NCT IDs follow the format ``NCT########``, where the 8-digit numeric portion is assigned |
| 10 | + sequentially and increases over time. This function uses empirically observed NCT ID |
| 11 | + allocation patterns to estimate when a trial was registered, which is useful for temporal |
| 12 | + filtering and analysis when exact registration dates are not available. |
11 | 13 | |
12 | | - NCT IDs are sequential and follow approximate ranges: |
13 | | - - ``NCT00000000``-``NCT00999999``: ~1999-2004 |
14 | | - - ``NCT01000000``-``NCT01999999``: ~2005-2011 |
15 | | - - ``NCT02000000``-``NCT02999999``: ~2012-2015 |
16 | | - - ``NCT03000000``-``NCT03999999``: ~2016-2018 |
17 | | - - ``NCT04000000``-``NCT04999999``: ~2019-2021 |
18 | | - - ``NCT05000000``-``NCT05999999``: ~2022-2023 |
| 14 | + **NCT ID Allocation Ranges:** |
| 15 | + |
| 16 | + - ``NCT00000000`` - ``NCT00999999``: ~1999-2004 |
| 17 | + - ``NCT01000000`` - ``NCT01999999``: ~2005-2011 |
| 18 | + - ``NCT02000000`` - ``NCT02999999``: ~2012-2015 |
| 19 | + - ``NCT03000000`` - ``NCT03999999``: ~2016-2018 |
| 20 | + - ``NCT04000000`` - ``NCT04999999``: ~2019-2021 |
| 21 | + - ``NCT05000000`` - ``NCT05999999``: ~2022-2023 |
19 | 22 | - ``NCT06000000``+: ~2024+ |
20 | 23 | |
21 | | - :param nct_id: A ClinicalTrials.gov identifier (e.g., ``NCT00000001``) |
| 24 | + :param nct_id: A ClinicalTrials.gov identifier (e.g., ``NCT00500000``) |
22 | 25 | :type nct_id: str |
23 | | - :return: Estimated year of trial registration, or None if the NCT ID is invalid |
| 26 | + :return: Estimated year of trial registration as an integer, or ``None`` if the NCT ID is invalid or malformed |
24 | 27 | :rtype: int or None |
25 | 28 | |
26 | | - **Example** |
27 | | - |
| 29 | + **Examples** |
| 30 | + |
28 | 31 | >>> infer_nct_year("NCT00500000") |
29 | 32 | 2002 |
30 | 33 | >>> infer_nct_year("NCT03000000") |
31 | 34 | 2016 |
| 35 | + >>> infer_nct_year("NCT06123456") |
| 36 | + 2024 |
32 | 37 | >>> infer_nct_year("invalid") is None |
33 | 38 | True |
| 39 | + >>> infer_nct_year("NCT123") is None # Too short |
| 40 | + True |
| 41 | + |
| 42 | + **Use Cases:** |
| 43 | + |
| 44 | + - Filtering clinical trials data by approximate time period |
| 45 | + - Temporal analysis of drug development trends |
| 46 | + - Quick year estimation when full trial metadata is unavailable |
| 47 | + |
34 | 48 | .. note:: |
35 | | - This function provides an approximation based on historical NCT ID allocation |
36 | | - patterns and may not be accurate for all trials. The actual registration date |
37 | | - should be obtained from the official ClinicalTrials.gov database when precision |
38 | | - is required. |
| 49 | + This function provides an **approximation** based on historical NCT ID allocation |
| 50 | + patterns. Individual trials may vary by ±1-2 years from the estimated value. |
| 51 | + For precise temporal analysis, obtain the actual registration date from the |
| 52 | + ClinicalTrials.gov API or database. |
| 53 | + |
| 54 | + .. warning:: |
| 55 | + Returns ``None`` for invalid inputs including non-string types, IDs without the |
| 56 | + "NCT" prefix, or IDs that don't contain exactly 8 digits after "NCT". |
39 | 57 | """ |
40 | 58 | if not isinstance(nct_id, str) or not nct_id.startswith('NCT'): |
41 | 59 | return None |
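The rest of the function body is elided by the diff, so the following is only a minimal sketch of how the documented allocation ranges could be turned into a year estimate. The name `sketch_infer_nct_year` and the linear interpolation within each million-ID block are assumptions made for illustration; they are not taken from the actual implementation.

```python
import re

# (first_year, last_year) for each million-ID block, copied from the docstring.
# Index 0 covers NCT00000000-NCT00999999, index 1 covers NCT01xxxxxx, and so on.
_BLOCK_YEARS = [
    (1999, 2004),
    (2005, 2011),
    (2012, 2015),
    (2016, 2018),
    (2019, 2021),
    (2022, 2023),
]


def sketch_infer_nct_year(nct_id):
    """Hypothetical stand-in for infer_nct_year (interpolation is an assumption)."""
    if not isinstance(nct_id, str):
        return None
    # Require the "NCT" prefix followed by exactly 8 ASCII digits.
    match = re.fullmatch(r"NCT([0-9]{8})", nct_id)
    if match is None:
        return None
    block, offset = divmod(int(match.group(1)), 1_000_000)
    if block >= len(_BLOCK_YEARS):
        # NCT06000000 and above are documented only as "~2024+".
        return 2024
    first_year, last_year = _BLOCK_YEARS[block]
    n_years = last_year - first_year + 1
    # Assumed: IDs are spread evenly over the years of their block.
    return first_year + offset * n_years // 1_000_000
```

Under these assumptions the sketch returns 2002 for `"NCT00500000"`, 2016 for `"NCT03000000"`, and 2024 for `"NCT06123456"`, which agrees with the docstring examples, and `None` for `"invalid"` and `"NCT123"`.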
@@ -65,51 +83,143 @@ def infer_nct_year(nct_id): |
65 | 83 | |
66 | 84 | def molecules(version: str = '36', top_n_activities: int = 1): |
67 | 85 | """ |
68 | | - Query ChEMBL database for parent molecules with clinical trial data and drug indications. |
| 86 | + Query ChEMBL database for bioactive drug molecules with clinical trial data and therapeutic indications. |
| 87 | + |
| 88 | + This function retrieves comprehensive drug-target-indication relationships from ChEMBL, automatically |
| 89 | + normalizing all molecular forms (salts, formulations, etc.) to their parent compounds. It integrates |
| 90 | + clinical trial phases, MeSH disease classifications, drug mechanisms, and target information to create |
| 91 | + a unified dataset for drug target prioritization and discovery. |
69 | 92 | |
70 | | - This function normalizes all molecules to their parent forms and aggregates indications from |
71 | | - both parent and child molecules (e.g., salt forms). Mechanism assignment follows this hierarchy: |
72 | | - 1. Use parent's ``DRUG_MECHANISM`` if available |
73 | | - 2. Inherit from any child's ``DRUG_MECHANISM`` if parent lacks mechanisms |
74 | | - 3. Use top N activities from ``ACTIVITIES`` table if no mechanisms exist |
| 93 | + **Data Processing Workflow:** |
75 | 94 | |
76 | | - All mechanisms are independent of indication - a molecule has one set of targets that apply |
77 | | - to all its indications. |
| 95 | + 1. **Parent Normalization**: All child molecules (salts, prodrugs, formulations) are mapped to their |
| 96 | + parent compound using ChEMBL's molecule hierarchy |
| 97 | + 2. **Indication Aggregation**: Drug indications from both parent and all child molecules are combined |
| 98 | + 3. **Target Assignment**: Molecular targets are identified using a three-tier hierarchy: |
| 99 | + |
| 100 | + - Primary: Parent molecule's ``DRUG_MECHANISM`` table entries (known mechanisms) |
| 101 | + - Secondary: Child molecule mechanisms (inherited when parent lacks mechanisms) |
| 102 | + - Tertiary: Top N most-studied targets from ``ACTIVITIES`` table (bioassay data) |
78 | 103 | |
79 | | - :param version: ChEMBL database version to query, defaults to ``36`` |
| 104 | + 4. **Clinical Trial Mapping**: Links molecules to ClinicalTrials.gov identifiers with phase information |
| 105 | + 5. **Year Inference**: Estimates trial registration year from NCT identifiers |
| 106 | + |
| 107 | + **Key Features:** |
| 108 | + |
| 109 | + - Only includes molecules with clinical trial references (ClinicalTrials.gov) |
| 110 | + - Filters to human targets only (``Homo sapiens``) |
| 111 | + - One row per molecule-indication-target-trial combination (comma-separated trial IDs are exploded into separate rows) |
| 112 | + - Mechanisms apply to all indications of a molecule (not indication-specific) |
| 113 | + |
| 114 | + :param version: ChEMBL database version to query, defaults to ``'36'``. Version 36 covers data through 2024. |
| 115 | + See https://www.ebi.ac.uk/chembl/ for available versions. |
80 | 116 | :type version: str, optional |
81 | | - :param top_n_activities: For molecules without ``DRUG_MECHANISM``, use top N targets from ``ACTIVITIES`` table, defaults to 1 |
| 117 | + :param top_n_activities: For molecules without documented mechanisms (no ``DRUG_MECHANISM`` entries), |
| 118 | + include the top N most-studied targets from bioassay data. Set to 0 to |
| 119 | + exclude activity-based targets entirely. Defaults to 1 (most-studied target only). |
82 | 120 | :type top_n_activities: int, optional |
83 | | - :return: DataFrame containing parent molecule information with the following key columns: |
84 | | - - ``chembl_id``: ChEMBL identifier for the parent molecule |
85 | | - - ``pref_name``: Preferred name of the parent molecule |
86 | | - - ``mesh_heading``: MeSH term for the indication (aggregated from parent and children) |
87 | | - - ``mesh_id``: MeSH identifier |
88 | | - - ``phase``: Clinical trial phase for this indication |
89 | | - - ``reference_type``: Type of reference (filtered to 'ClinicalTrials') |
90 | | - - ``clinical_trial_id``: ClinicalTrials.gov identifier(s), exploded if multiple |
91 | | - - ``target_chembl_id``: ChEMBL identifier for the target |
92 | | - - ``target_organism``: Target organism (filtered to 'Homo sapiens') |
93 | | - - ``target_type``: Type of target |
94 | | - - ``target_uniprot_id``: UniProt accession for the target |
95 | | - - ``target_gene_name``: Gene symbol for the target |
96 | | - - ``mechanism_of_action``: Description of the mechanism of action (NULL for activity-derived targets) |
97 | | - - ``action_type``: Type of action on the target (NULL for activity-derived targets) |
98 | | - - ``parent_molregno``: Internal molecule registry number of parent |
99 | | - - ``trial_year``: Inferred year from clinical trial ID (nullable integer) |
100 | | - - ``target_source``: ``DRUG_MECHANISM``, ``DRUG_MECHANISM_CHILD``, or ``ACTIVITIES`` |
| 121 | + |
| 122 | + :return: DataFrame with one row per parent molecule, indication, target, and clinical trial ID combination. |
| 123 | + |
| 124 | + **Columns:** |
| 125 | + |
| 126 | + **Molecule Information:** |
| 127 | + |
| 128 | + - ``chembl_id`` (str): ChEMBL identifier for the parent molecule (e.g., 'CHEMBL25') |
| 129 | + - ``pref_name`` (str): Preferred drug name (e.g., 'ASPIRIN') |
| 130 | + - ``parent_molregno`` (int): Internal ChEMBL registry number for parent molecule |
| 131 | + |
| 132 | + **Indication Information:** |
| 133 | + |
| 134 | + - ``mesh_heading`` (str): MeSH disease term (e.g., 'Lung Neoplasms') |
| 135 | + - ``mesh_id`` (str): MeSH unique identifier (e.g., 'D008175') |
| 136 | + - ``phase`` (int): Maximum clinical trial phase for this indication (0-4, where 4=approved) |
| 137 | + - ``reference_type`` (str): Always 'ClinicalTrials' (pre-filtered) |
| 138 | + - ``clinical_trial_id`` (str): ClinicalTrials.gov NCT identifier (e.g., 'NCT00123456') |
| 139 | + - ``trial_year`` (int, nullable): Inferred trial registration year via :func:`infer_nct_year` |
| 140 | + |
| 141 | + **Target Information:** |
| 142 | + |
| 143 | + - ``target_chembl_id`` (str): ChEMBL target identifier (e.g., 'CHEMBL240') |
| 144 | + - ``target_organism`` (str): Always 'Homo sapiens' (pre-filtered) |
| 145 | + - ``target_type`` (str): Target classification (e.g., 'SINGLE PROTEIN', 'PROTEIN COMPLEX') |
| 146 | + - ``target_uniprot_id`` (str, nullable): UniProt accession (e.g., 'P35354') |
| 147 | + - ``target_gene_name`` (str, nullable): HGNC gene symbol (e.g., 'EGFR') |
| 148 | + - ``mechanism_of_action`` (str, nullable): Mechanism description (NULL for activity-derived targets) |
| 149 | + - ``action_type`` (str, nullable): Drug action type (e.g., 'INHIBITOR', 'AGONIST'; NULL for activities) |
| 150 | + - ``target_source`` (str): Data provenance - one of: |
| 151 | + |
| 152 | + - ``DRUG_MECHANISM``: From parent's mechanism table (highest confidence) |
| 153 | + - ``DRUG_MECHANISM_CHILD``: Inherited from child molecule's mechanism |
| 154 | + - ``ACTIVITIES``: Derived from bioassay activity data (lower confidence) |
| 155 | + |
101 | 156 | :rtype: pandas.DataFrame |
102 | 157 | |
| 158 | + **Examples** |
| 159 | + |
| 160 | + Basic usage - retrieve all molecules from ChEMBL v36:: |
| 161 | + |
| 162 | + >>> from alethiotx.artemis.chembl import molecules |
| 163 | + >>> df = molecules(version='36', top_n_activities=1) |
| 164 | + >>> print(f"{len(df)} records, {df['chembl_id'].nunique()} unique molecules") |
| 165 | + >>> print(df[['chembl_id', 'pref_name', 'mesh_heading', 'target_gene_name']].head()) |
| 166 | + |
| 167 | + Filter to specific disease and approved drugs only:: |
| 168 | + |
| 169 | + >>> df = molecules(version='36') |
| 170 | + >>> lung_cancer = df[df['mesh_heading'] == 'Lung Neoplasms'] |
| 171 | + >>> approved = lung_cancer[lung_cancer['phase'] == 4] |
| 172 | + >>> print(approved[['pref_name', 'target_gene_name']].drop_duplicates()) |
| 173 | + |
| 174 | + Exclude activity-based targets (mechanism data only):: |
| 175 | + |
| 176 | + >>> df = molecules(version='36', top_n_activities=0) |
| 177 | + >>> print(f"Mechanisms only: {df['target_source'].value_counts()}") |
| 178 | + |
| 179 | + Analyze recent trials (last 6 years):: |
| 180 | + |
| 181 | + >>> from datetime import datetime |
| 182 | + >>> df = molecules(version='36') |
| 183 | + >>> current_year = datetime.now().year |
| 184 | + >>> recent = df[df['trial_year'] >= current_year - 6] |
| 185 | + >>> print(f"Recent trials: {len(recent)} records") |
| 186 | + |
103 | 187 | .. note:: |
104 | | - All child molecules (salts, formulations) are converted to their parent compound. |
105 | | - Indications are aggregated from both parent and all children. |
| 188 | + **Parent-Child Normalization**: All molecular forms (salts like 'aspirin sodium', |
| 189 | + formulations like 'aspirin tablet') are normalized to their parent compound ('aspirin'). |
| 190 | + This ensures consistent target mapping and prevents double-counting. |
106 | 191 | |
107 | 192 | .. note:: |
108 | | - Mechanisms are assigned at the parent level and apply to all indications. |
109 | | - If a parent has no mechanism but children do, the child's mechanism is inherited. |
| 193 | + **Mechanism-Indication Independence**: A molecule's targets are the same across all its |
| 194 | + indications. For example, if aspirin targets COX1/COX2, these targets apply whether the |
| 195 | + indication is 'Pain' or 'Cardiovascular Disease'. This reflects biological reality - a |
| 196 | + drug's mechanism doesn't change based on what it's prescribed for. |
110 | 197 | |
111 | 198 | .. note:: |
112 | | - Clinical trial IDs containing multiple comma-separated values are exploded into separate rows. |
| 199 | + **Clinical Trial ID Explosion**: When a molecule has multiple comma-separated trial IDs |
| 200 | + (e.g., 'NCT00000001,NCT00000002'), they are exploded into separate rows. This enables per-trial |
| 201 | + analysis and proper trial counting. |
| 202 | + |
| 203 | + .. warning:: |
| 204 | + **Data Volume**: ChEMBL v36 contains hundreds of thousands of molecule-target relationships. |
| 205 | + Full queries may take several minutes and return large DataFrames (>100K rows). Consider |
| 206 | + filtering by phase or disease after loading to reduce memory usage. |
| 207 | + |
| 208 | + .. warning:: |
| 209 | + **Activity-Based Targets**: Targets from the ``ACTIVITIES`` table (``target_source='ACTIVITIES'``) |
| 210 | + have lower confidence than mechanism-based targets. They represent bioassay activity but may |
| 211 | + not reflect clinical mechanisms. Set ``top_n_activities=0`` to exclude these if you need |
| 212 | + high-confidence mechanisms only. |
| 213 | + |
| 214 | + .. warning:: |
| 215 | + **Requires chembl-downloader**: This function requires the ``chembl-downloader`` package |
| 216 | + to be installed. Install via: ``pip install chembl-downloader`` |
| 217 | + |
| 218 | + .. seealso:: |
| 219 | + - :func:`infer_nct_year`: Used internally to estimate trial registration years |
| 220 | + - ChEMBL Documentation: https://chembl.gitbook.io/chembl-interface-documentation/ |
| 221 | + - ClinicalTrials.gov: https://clinicaltrials.gov/ |
| 222 | + |
113 | 223 | """ |
114 | 224 | |
115 | 225 | print("Step 1: Getting all parent molecules with their children's indications...") |
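The three-tier target assignment described in the docstring can be sketched in plain Python. This is an illustration only: `assign_targets` and its arguments are hypothetical names, and ranking "most-studied" targets by the number of activity records is an assumption, since the query itself is not shown in the diff.

```python
def assign_targets(parent_mechanisms, child_mechanisms, activity_counts,
                   top_n_activities=1):
    """Return (target_source, targets) using the documented fallback order.

    parent_mechanisms / child_mechanisms: lists of target IDs (hypothetical inputs).
    activity_counts: dict mapping target ID -> number of activity records
                     (assumed proxy for "most-studied").
    """
    # Tier 1: the parent's own DRUG_MECHANISM entries win outright.
    if parent_mechanisms:
        return "DRUG_MECHANISM", list(parent_mechanisms)
    # Tier 2: inherit from child molecules when the parent has none.
    if child_mechanisms:
        return "DRUG_MECHANISM_CHILD", list(child_mechanisms)
    # Tier 3: fall back to the top N targets by activity count, unless disabled.
    if top_n_activities > 0 and activity_counts:
        ranked = sorted(activity_counts.items(),
                        key=lambda item: (-item[1], item[0]))  # ties broken by ID
        return "ACTIVITIES", [target for target, _ in ranked[:top_n_activities]]
    # No mechanism and no usable activity data.
    return None, []
```

Because each tier returns immediately, activity-derived targets never mix with mechanism-derived ones for the same molecule, which is consistent with `target_source` holding a single value per row.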
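The clinical trial ID explosion mentioned in the docstring notes can be illustrated without pandas. The helper below is a hypothetical sketch, not the function's code; dropping rows that have no trial ID is an assumption, chosen because the docstring says only molecules with clinical trial references are included.

```python
def explode_trial_ids(rows):
    """Split comma-separated ``clinical_trial_id`` values into one row per ID.

    rows: list of dicts, each with a ``clinical_trial_id`` key (hypothetical shape).
    Returns a new list; the input rows are not modified.
    """
    exploded = []
    for row in rows:
        raw = row.get("clinical_trial_id") or ""
        for part in raw.split(","):
            trial_id = part.strip()
            if trial_id:
                # Copy the row so every trial ID gets its own independent record.
                exploded.append({**row, "clinical_trial_id": trial_id})
    return exploded
```

In pandas the same effect is commonly obtained by splitting the column with `Series.str.split(',')` and then calling `DataFrame.explode` on it; whether `molecules` does exactly that is not visible in this diff.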