Commit 891da8b

wikiselev committed
Enhance docstrings for infer_nct_year and molecules functions with detailed descriptions, examples, and warnings
1 parent 6bb096e commit 891da8b

src/alethiotx/artemis/chembl/query.py

Lines changed: 161 additions & 51 deletions
Original file line number | Diff line number | Diff line change
@@ -5,37 +5,55 @@
55
def infer_nct_year(nct_id):
66
"""
77
Infer the approximate registration year from a ClinicalTrials.gov NCT identifier.
8-
NCT IDs follow the format ``NCT########``, where the numeric portion generally increases
9-
over time. This function uses approximate year ranges based on observed NCT ID
10-
allocation patterns to estimate when a trial was registered.
8+
9+
NCT IDs follow the format ``NCT########``, where the 8-digit numeric portion is assigned
10+
sequentially and increases over time. This function uses empirically observed NCT ID
11+
allocation patterns to estimate when a trial was registered, which is useful for temporal
12+
filtering and analysis when exact registration dates are not available.
1113
12-
NCT IDs are sequential and follow approximate ranges:
13-
- ``NCT00000000``-``NCT00999999``: ~1999-2004
14-
- ``NCT01000000``-``NCT01999999``: ~2005-2011
15-
- ``NCT02000000``-``NCT02999999``: ~2012-2015
16-
- ``NCT03000000``-``NCT03999999``: ~2016-2018
17-
- ``NCT04000000``-``NCT04999999``: ~2019-2021
18-
- ``NCT05000000``-``NCT05999999``: ~2022-2023
14+
**NCT ID Allocation Ranges:**
15+
16+
- ``NCT00000000`` - ``NCT00999999``: ~1999-2004
17+
- ``NCT01000000`` - ``NCT01999999``: ~2005-2011
18+
- ``NCT02000000`` - ``NCT02999999``: ~2012-2015
19+
- ``NCT03000000`` - ``NCT03999999``: ~2016-2018
20+
- ``NCT04000000`` - ``NCT04999999``: ~2019-2021
21+
- ``NCT05000000`` - ``NCT05999999``: ~2022-2023
1922
- ``NCT06000000``+: ~2024+
2023
21-
:param nct_id: A ClinicalTrials.gov identifier (e.g., ``NCT00000001``)
24+
:param nct_id: A ClinicalTrials.gov identifier (e.g., ``NCT00500000``)
2225
:type nct_id: str
23-
:return: Estimated year of trial registration, or None if the NCT ID is invalid
26+
:return: Estimated year of trial registration as an integer, or ``None`` if the NCT ID is invalid/malformed
2427
:rtype: int or None
2528
26-
**Example**
27-
29+
**Examples**
30+
2831
>>> infer_nct_year("NCT00500000")
2932
2002
3033
>>> infer_nct_year("NCT03000000")
3134
2016
35+
>>> infer_nct_year("NCT06123456")
36+
2024
3237
>>> infer_nct_year("invalid")
3338
None
39+
>>> infer_nct_year("NCT123") # Too short
40+
None
41+
42+
**Use Cases:**
43+
44+
- Filtering clinical trials data by approximate time period
45+
- Temporal analysis of drug development trends
46+
- Quick year estimation when full trial metadata is unavailable
47+
3448
.. note::
35-
This function provides an approximation based on historical NCT ID allocation
36-
patterns and may not be accurate for all trials. The actual registration date
37-
should be obtained from the official ClinicalTrials.gov database when precision
38-
is required.
49+
This function provides an **approximation** based on historical NCT ID allocation
50+
patterns. Individual trials may vary by ±1-2 years from the estimated value.
51+
For precise temporal analysis, obtain the actual registration date from the
52+
ClinicalTrials.gov API or database.
53+
54+
.. warning::
55+
Returns ``None`` for invalid inputs including non-string types, IDs without the
56+
"NCT" prefix, or IDs that don't contain exactly 8 digits after "NCT".
3957
"""
4058
if not isinstance(nct_id, str) or not nct_id.startswith('NCT'):
4159
return None
@@ -65,51 +83,143 @@ def infer_nct_year(nct_id):
6583

6684
def molecules(version: str = '36', top_n_activities: int = 1):
6785
"""
68-
Query ChEMBL database for parent molecules with clinical trial data and drug indications.
86+
Query ChEMBL database for bioactive drug molecules with clinical trial data and therapeutic indications.
87+
88+
This function retrieves comprehensive drug-target-indication relationships from ChEMBL, automatically
89+
normalizing all molecular forms (salts, formulations, etc.) to their parent compounds. It integrates
90+
clinical trial phases, MeSH disease classifications, drug mechanisms, and target information to create
91+
a unified dataset for drug target prioritization and discovery.
6992
70-
This function normalizes all molecules to their parent forms and aggregates indications from
71-
both parent and child molecules (e.g., salt forms). Mechanism assignment follows this hierarchy:
72-
1. Use parent's ``DRUG_MECHANISM`` if available
73-
2. Inherit from any child's ``DRUG_MECHANISM`` if parent lacks mechanisms
74-
3. Use top N activities from ``ACTIVITIES`` table if no mechanisms exist
93+
**Data Processing Workflow:**
7594
76-
All mechanisms are independent of indication - a molecule has one set of targets that apply
77-
to all its indications.
95+
1. **Parent Normalization**: All child molecules (salts, prodrugs, formulations) are mapped to their
96+
parent compound using ChEMBL's molecule hierarchy
97+
2. **Indication Aggregation**: Drug indications from both parent and all child molecules are combined
98+
3. **Target Assignment**: Molecular targets are identified using a three-tier hierarchy:
99+
100+
- Primary: Parent molecule's ``DRUG_MECHANISM`` table entries (known mechanisms)
101+
- Secondary: Child molecule mechanisms (inherited when parent lacks mechanisms)
102+
- Tertiary: Top N most-studied targets from ``ACTIVITIES`` table (bioassay data)
78103
79-
:param version: ChEMBL database version to query, defaults to ``36``
104+
4. **Clinical Trial Mapping**: Links molecules to ClinicalTrials.gov identifiers with phase information
105+
5. **Year Inference**: Estimates trial registration year from NCT identifiers
106+
107+
**Key Features:**
108+
109+
- Only includes molecules with clinical trial references (ClinicalTrials.gov)
110+
- Filters to human targets only (``Homo sapiens``)
111+
- One molecule-indication-target per row (exploded format for multi-trial drugs)
112+
- Mechanisms apply to all indications of a molecule (not indication-specific)
113+
114+
:param version: ChEMBL database version to query. Version 36 covers data through 2024.
115+
See https://www.ebi.ac.uk/chembl/ for available versions.
80116
:type version: str, optional
81-
:param top_n_activities: For molecules without ``DRUG_MECHANISM``, use top N targets from ``ACTIVITIES`` table, defaults to 1
117+
:param top_n_activities: For molecules without documented mechanisms (no ``DRUG_MECHANISM`` entries),
118+
include the top N most-studied targets from bioassay data. Set to 0 to
119+
exclude activity-based targets entirely. Defaults to 1 (most-studied target only).
82120
:type top_n_activities: int, optional
83-
:return: DataFrame containing parent molecule information with the following key columns:
84-
- ``chembl_id``: ChEMBL identifier for the parent molecule
85-
- ``pref_name``: Preferred name of the parent molecule
86-
- ``mesh_heading``: MeSH term for the indication (aggregated from parent and children)
87-
- ``mesh_id``: MeSH identifier
88-
- ``phase``: Clinical trial phase for this indication
89-
- ``reference_type``: Type of reference (filtered to 'ClinicalTrials')
90-
- ``clinical_trial_id``: ClinicalTrials.gov identifier(s), exploded if multiple
91-
- ``target_chembl_id``: ChEMBL identifier for the target
92-
- ``target_organism``: Target organism (filtered to 'Homo sapiens')
93-
- ``target_type``: Type of target
94-
- ``target_uniprot_id``: UniProt accession for the target
95-
- ``target_gene_name``: Gene symbol for the target
96-
- ``mechanism_of_action``: Description of the mechanism of action (NULL for activity-derived targets)
97-
- ``action_type``: Type of action on the target (NULL for activity-derived targets)
98-
- ``parent_molregno``: Internal molecule registry number of parent
99-
- ``trial_year``: Inferred year from clinical trial ID (nullable integer)
100-
- ``target_source``: ``DRUG_MECHANISM``, ``DRUG_MECHANISM_CHILD``, or ``ACTIVITIES``
121+
122+
:return: DataFrame with one row per parent-molecule-indication-target combination.
123+
124+
**Columns:**
125+
126+
**Molecule Information:**
127+
128+
- ``chembl_id`` (str): ChEMBL identifier for the parent molecule (e.g., 'CHEMBL25')
129+
- ``pref_name`` (str): Preferred drug name (e.g., 'ASPIRIN')
130+
- ``parent_molregno`` (int): Internal ChEMBL registry number for parent molecule
131+
132+
**Indication Information:**
133+
134+
- ``mesh_heading`` (str): MeSH disease term (e.g., 'Lung Neoplasms')
135+
- ``mesh_id`` (str): MeSH unique identifier (e.g., 'D008175')
136+
- ``phase`` (int): Maximum clinical trial phase for this indication (0-4, where 4=approved)
137+
- ``reference_type`` (str): Always 'ClinicalTrials' (pre-filtered)
138+
- ``clinical_trial_id`` (str): ClinicalTrials.gov NCT identifier (e.g., 'NCT00123456')
139+
- ``trial_year`` (int, nullable): Inferred trial registration year via :func:`infer_nct_year`
140+
141+
**Target Information:**
142+
143+
- ``target_chembl_id`` (str): ChEMBL target identifier (e.g., 'CHEMBL240')
144+
- ``target_organism`` (str): Always 'Homo sapiens' (pre-filtered)
145+
- ``target_type`` (str): Target classification (e.g., 'SINGLE PROTEIN', 'PROTEIN COMPLEX')
146+
- ``target_uniprot_id`` (str, nullable): UniProt accession (e.g., 'P35354')
147+
- ``target_gene_name`` (str, nullable): HGNC gene symbol (e.g., 'EGFR')
148+
- ``mechanism_of_action`` (str, nullable): Mechanism description (NULL for activity-derived targets)
149+
- ``action_type`` (str, nullable): Drug action type (e.g., 'INHIBITOR', 'AGONIST'; NULL for activities)
150+
- ``target_source`` (str): Data provenance - one of:
151+
152+
- ``DRUG_MECHANISM``: From parent's mechanism table (highest confidence)
153+
- ``DRUG_MECHANISM_CHILD``: Inherited from child molecule's mechanism
154+
- ``ACTIVITIES``: Derived from bioassay activity data (lower confidence)
155+
101156
:rtype: pandas.DataFrame
102157
158+
**Examples**
159+
160+
Basic usage - retrieve all molecules from ChEMBL v36::
161+
162+
>>> from alethiotx.artemis.chembl import molecules
163+
>>> df = molecules(version='36', top_n_activities=1)
164+
>>> print(f"{len(df)} records, {df['chembl_id'].nunique()} unique molecules")
165+
>>> print(df[['chembl_id', 'pref_name', 'mesh_heading', 'target_gene_name']].head())
166+
167+
Filter to specific disease and approved drugs only::
168+
169+
>>> df = molecules(version='36')
170+
>>> lung_cancer = df[df['mesh_heading'] == 'Lung Neoplasms']
171+
>>> approved = lung_cancer[lung_cancer['phase'] == 4]
172+
>>> print(approved[['pref_name', 'target_gene_name']].drop_duplicates())
173+
174+
Exclude activity-based targets (mechanism data only)::
175+
176+
>>> df = molecules(version='36', top_n_activities=0)
177+
>>> print(f"Mechanisms only: {df['target_source'].value_counts()}")
178+
179+
Analyze recent trials (last 6 years)::
180+
181+
>>> from datetime import datetime
182+
>>> df = molecules(version='36')
183+
>>> current_year = datetime.now().year
184+
>>> recent = df[df['trial_year'] >= current_year - 6]
185+
>>> print(f"Recent trials: {len(recent)} records")
186+
103187
.. note::
104-
All child molecules (salts, formulations) are converted to their parent compound.
105-
Indications are aggregated from both parent and all children.
188+
**Parent-Child Normalization**: All molecular forms (salts like 'aspirin sodium',
189+
formulations like 'aspirin tablet') are normalized to their parent compound ('aspirin').
190+
This ensures consistent target mapping and prevents double-counting.
106191
107192
.. note::
108-
Mechanisms are assigned at the parent level and apply to all indications.
109-
If a parent has no mechanism but children do, the child's mechanism is inherited.
193+
**Mechanism-Indication Independence**: A molecule's targets are the same across all its
194+
indications. For example, if aspirin targets COX1/COX2, these targets apply whether the
195+
indication is 'Pain' or 'Cardiovascular Disease'. This reflects biological reality - a
196+
drug's mechanism doesn't change based on what it's prescribed for.
110197
111198
.. note::
112-
Clinical trial IDs containing multiple comma-separated values are exploded into separate rows.
199+
**Clinical Trial ID Explosion**: When a molecule has multiple comma-separated trial IDs
200+
(e.g., 'NCT001,NCT002'), they are exploded into separate rows. This enables per-trial
201+
analysis and proper trial counting.
202+
203+
.. warning::
204+
**Data Volume**: ChEMBL v36 contains hundreds of thousands of molecule-target relationships.
205+
Full queries may take several minutes and return large DataFrames (>100K rows). Consider
206+
filtering by phase or disease after loading to reduce memory usage.
207+
208+
.. warning::
209+
**Activity-Based Targets**: Targets from the ``ACTIVITIES`` table (``target_source='ACTIVITIES'``)
210+
have lower confidence than mechanism-based targets. They represent bioassay activity but may
211+
not reflect clinical mechanisms. Set ``top_n_activities=0`` to exclude these if you need
212+
high-confidence mechanisms only.
213+
214+
.. warning::
215+
**Requires chembl-downloader**: This function requires the ``chembl-downloader`` package
216+
to be installed. Install via: ``pip install chembl-downloader``
217+
218+
.. seealso::
219+
- :func:`infer_nct_year`: Used internally to estimate trial registration years
220+
- ChEMBL Documentation: https://chembl.gitbook.io/chembl-interface-documentation/
221+
- ClinicalTrials.gov: https://clinicaltrials.gov/
222+
113223
"""
114224

115225
print("Step 1: Getting all parent molecules with their children's indications...")
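
The diff shows only the first check of `infer_nct_year`; the rest of its body falls outside the hunk. As a rough illustration of how the documented allocation ranges could map to a year, here is a minimal sketch. The name `sketch_infer_nct_year` and the linear interpolation within each range are assumptions for illustration, not the committed implementation. They were chosen because they reproduce the docstring's examples.

```python
import re
from typing import Optional

# Year ranges copied from the docstring; keys are the leading two digits
# of the 8-digit numeric part (i.e. the "millions" block).
_NCT_YEAR_RANGES = {
    0: (1999, 2004),
    1: (2005, 2011),
    2: (2012, 2015),
    3: (2016, 2018),
    4: (2019, 2021),
    5: (2022, 2023),
}


def sketch_infer_nct_year(nct_id) -> Optional[int]:
    """Hypothetical sketch: estimate a registration year from an NCT ID."""
    if not isinstance(nct_id, str):
        return None
    # Require the "NCT" prefix followed by exactly 8 digits.
    match = re.fullmatch(r"NCT([0-9]{8})", nct_id)
    if match is None:
        return None
    block, offset = divmod(int(match.group(1)), 1_000_000)
    if block >= 6:
        # The docstring lists NCT06000000 and above as ~2024+.
        return 2024
    start, end = _NCT_YEAR_RANGES[block]
    span = end - start + 1
    # Assumption: IDs are spread evenly over the years in each range.
    return start + (offset * span) // 1_000_000


print(sketch_infer_nct_year("NCT00500000"))
print(sketch_infer_nct_year("NCT06123456"))
print(sketch_infer_nct_year("NCT123"))
```

Explicit `print` calls are used because a bare expression that evaluates to `None` displays nothing at the interactive prompt.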

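
The three-tier target assignment described in the `molecules` docstring can be sketched in plain Python. This is only an illustration of the fallback order. The function `assign_targets`, its arguments, and the reading of "most-studied" as "most activity records" are assumptions, and the real function works on ChEMBL tables rather than lists.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def assign_targets(
    parent_mechanisms: Sequence[str],
    child_mechanisms: Sequence[str],
    activity_targets: Sequence[str],
    top_n_activities: int = 1,
) -> List[Tuple[str, str]]:
    """Hypothetical sketch: return (target_id, target_source) pairs for one parent molecule."""
    # Tier 1: the parent's own DRUG_MECHANISM entries win outright.
    if parent_mechanisms:
        return [(t, "DRUG_MECHANISM") for t in dict.fromkeys(parent_mechanisms)]
    # Tier 2: inherit mechanisms from child molecules (e.g. salt forms).
    if child_mechanisms:
        return [(t, "DRUG_MECHANISM_CHILD") for t in dict.fromkeys(child_mechanisms)]
    # Tier 3: fall back to the most frequently assayed targets, unless disabled.
    if top_n_activities <= 0:
        return []
    counts = Counter(activity_targets)
    return [(t, "ACTIVITIES") for t, _ in counts.most_common(top_n_activities)]


# A molecule with no documented mechanism falls back to bioassay data.
print(assign_targets([], [], ["CHEMBL240", "CHEMBL203", "CHEMBL203"]))
```

Because each tier returns early, a molecule never mixes sources, so every row for a given molecule carries the same `target_source` value.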