130 changes: 130 additions & 0 deletions docs/how-to/validate-clinical-trials.md
@@ -0,0 +1,130 @@
# Validating ClinicalTrials.gov References

This guide shows how to validate supporting text against clinical trial records from ClinicalTrials.gov.

## Overview

The ClinicalTrials.gov source fetches trial data from the [ClinicalTrials.gov API v2](https://clinicaltrials.gov/data-api/api). It extracts:

- **Title**: Official title (falls back to brief title)
- **Content**: Brief summary (falls back to detailed description)
- **Metadata**: Trial status and lead sponsor

The source uses the [bioregistry standard prefix](https://bioregistry.io/registry/clinicaltrials) `clinicaltrials` with identifiers following the pattern `NCT` followed by 8 digits (e.g., `NCT00000001`).
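As a quick illustration of the identifier pattern, the sketch below checks candidate strings against the `^NCT\d{8}$` regex. The helper name `is_nct_id` is hypothetical and not part of the validator's API; case-insensitive matching is assumed here to mirror how the source treats bare identifiers.

```python
import re

# Pattern described above: "NCT" followed by exactly 8 digits.
NCT_ID_PATTERN = re.compile(r"^NCT\d{8}$", re.IGNORECASE)


def is_nct_id(candidate: str) -> bool:
    """Return True if candidate is a bare NCT identifier (hypothetical helper)."""
    return bool(NCT_ID_PATTERN.match(candidate))


print(is_nct_id("NCT00000001"))  # True
print(is_nct_id("NCT1234"))      # False: too few digits
print(is_nct_id("PMID:12345"))   # False: different identifier scheme
```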

## Basic Usage

Validate text against a clinical trial using its NCT identifier:

```bash
linkml-reference-validator validate text \
"A randomized controlled trial investigating..." \
clinicaltrials:NCT00000001
```

## Accepted Identifier Formats

You can use the bioregistry standard prefix or bare NCT identifiers:

```
clinicaltrials:NCT00000001
clinicaltrials:NCT12345678
NCT00000001
NCT12345678
```

The prefix is case-insensitive:

```
clinicaltrials:NCT00000001
CLINICALTRIALS:NCT00000001
```

## Prefix Aliases and Normalization

If your data uses alternate prefix styles (e.g., the legacy `NCT:` prefix), configure normalization in `.linkml-reference-validator.yaml`:

```yaml
validation:
  reference_prefix_map:
    NCT: clinicaltrials
    nct: clinicaltrials
    ct: clinicaltrials
    ClinicalTrials: clinicaltrials

Or programmatically:

```python
from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(
    reference_prefix_map={
        "NCT": "clinicaltrials",
        "ct": "clinicaltrials",
    }
)
```
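To make the effect of a prefix map concrete, here is a minimal sketch of the kind of rewrite it describes. The function `normalize_reference_id` is a hypothetical stand-in written for this guide; the validator's own normalization may differ in details such as case handling.

```python
def normalize_reference_id(reference_id: str, prefix_map: dict[str, str]) -> str:
    """Rewrite the prefix of a CURIE-style ID using prefix_map (hypothetical helper)."""
    prefix, sep, local_id = reference_id.partition(":")
    if not sep:
        # No prefix present (e.g., a bare NCT identifier): leave unchanged.
        return reference_id
    # Unknown prefixes map to themselves.
    canonical = prefix_map.get(prefix, prefix)
    return f"{canonical}:{local_id}"


prefix_map = {"NCT": "clinicaltrials", "ct": "clinicaltrials"}
print(normalize_reference_id("NCT:NCT00000001", prefix_map))  # clinicaltrials:NCT00000001
print(normalize_reference_id("ct:NCT00000001", prefix_map))   # clinicaltrials:NCT00000001
print(normalize_reference_id("PMID:12345", prefix_map))       # PMID:12345
```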

## Pre-caching Clinical Trial Records

To cache trial data for offline validation or faster repeated access:

```bash
linkml-reference-validator cache reference clinicaltrials:NCT00000001
```

Cached references are stored in `references_cache/` as markdown files with YAML frontmatter containing metadata like trial status and sponsor.
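If you want to inspect a cached file yourself, the sketch below splits the frontmatter from the body. It assumes the flat `key: value` layout seen in this project's test fixture; `split_frontmatter` is a hypothetical helper, and a real YAML parser would be needed for nested or quoted values.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a cached markdown document into (frontmatter, body).

    Handles only flat `key: value` frontmatter lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing '---' delimiter.
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


cached = """---
reference_id: clinicaltrials:NCT00000001
title: A Phase III Study of Drug X for Treatment of Disease Y
content_type: summary
---

# A Phase III Study of Drug X for Treatment of Disease Y

## Content

This study evaluates the efficacy and safety of Drug X in patients with Disease Y.
"""

meta, body = split_frontmatter(cached)
print(meta["reference_id"])  # clinicaltrials:NCT00000001
print(meta["content_type"])  # summary
```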

## Rate Limiting

The ClinicalTrials.gov API has rate limits. The default `rate_limit_delay` of 0.5 seconds between requests should be sufficient for most use cases:

```python
from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig(
    rate_limit_delay=0.5,  # default
)
```
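The delay is a fixed pause before each request. The sketch below shows that pattern in isolation; `fetch_all` and its `fetch_one` and `sleep` parameters are hypothetical names used for illustration, and the example records the pauses rather than sleeping or calling the real API.

```python
import time
from typing import Callable, Iterable, Optional


def fetch_all(
    identifiers: Iterable[str],
    fetch_one: Callable[[str], Optional[dict]],
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Optional[dict]]:
    """Call fetch_one for each identifier, pausing `delay` seconds before each call."""
    results = {}
    for identifier in identifiers:
        sleep(delay)  # fixed pause before every request
        results[identifier] = fetch_one(identifier)
    return results


pauses = []
results = fetch_all(
    ["NCT00000001", "NCT12345678"],
    fetch_one=lambda nct_id: {"id": nct_id},  # stand-in for a real API call
    delay=0.5,
    sleep=pauses.append,  # record pauses instead of actually sleeping
)
print(pauses)  # [0.5, 0.5]
```

With a fixed delay, validating N uncached trials takes at least N × `rate_limit_delay` seconds, so pre-caching is worth considering for large batches.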

## Content Availability

The source validates against a trial's brief summary and falls back to the detailed description only when no brief summary is present. Trials with neither field return `content_type: unavailable`.
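The fallback can be sketched on a plain dictionary shaped like an API v2 study record. The field names (`protocolSection`, `descriptionModule`, `briefSummary`, `detailedDescription`) are the ones the source reads; `extract_content` itself is a hypothetical helper for illustration.

```python
from typing import Optional


def extract_content(study: dict) -> tuple[Optional[str], str]:
    """Return (content, content_type) from a study record (hypothetical helper)."""
    description = study.get("protocolSection", {}).get("descriptionModule", {})
    # Prefer the brief summary; use the detailed description only as a fallback.
    content = description.get("briefSummary") or description.get("detailedDescription")
    return content, ("summary" if content else "unavailable")


with_summary = {"protocolSection": {"descriptionModule": {"briefSummary": "A randomized trial."}}}
only_detailed = {"protocolSection": {"descriptionModule": {"detailedDescription": "Full protocol text."}}}
empty = {"protocolSection": {}}

print(extract_content(with_summary))   # ('A randomized trial.', 'summary')
print(extract_content(only_detailed))  # ('Full protocol text.', 'summary')
print(extract_content(empty))          # (None, 'unavailable')
```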

## Example: Validating Trial Descriptions

```python
from linkml_reference_validator.etl.sources import ClinicalTrialsSource
from linkml_reference_validator.models import ReferenceValidationConfig

config = ReferenceValidationConfig()
source = ClinicalTrialsSource()

# Fetch trial content
content = source.fetch("NCT00000001", config)

if content:
    print(f"Reference ID: {content.reference_id}")  # clinicaltrials:NCT00000001
    print(f"Title: {content.title}")
    print(f"Summary: {content.content}")
    print(f"Status: {content.metadata.get('status')}")
    print(f"Sponsor: {content.metadata.get('sponsor')}")
```

## Bioregistry Standard

This source follows the [bioregistry standard](https://bioregistry.io/registry/clinicaltrials) for ClinicalTrials.gov identifiers:

- **Prefix**: `clinicaltrials`
- **Pattern**: `^NCT\d{8}$`
- **Example CURIE**: `clinicaltrials:NCT00222573`

Alternative prefixes recognized by bioregistry include `clinicaltrial`, `NCT`, and `ctgov`. Use the `reference_prefix_map` configuration to normalize these to the standard prefix.

## See Also

- [Validating Entrez Accessions](validate-entrez.md) - Similar pattern for NCBI databases
- [Adding a New Reference Source](add-reference-source.md) - How the plugin system works
- [Quickstart](../quickstart.md) - Getting started guide
- [CLI Reference](../reference/cli.md) - Complete command documentation
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -33,6 +33,7 @@ nav:
  - How-To Guides:
    - Validating OBO Files: how-to/validate-obo-files.md
    - Validating Entrez Accessions: how-to/validate-entrez.md
    - Validating Clinical Trials: how-to/validate-clinical-trials.md
    - Validating DOIs: how-to/validate-dois.md
    - Validating URLs: how-to/validate-urls.md
    - Using Local Files and URLs: how-to/use-local-files-and-urls.md
6 changes: 4 additions & 2 deletions src/linkml_reference_validator/etl/sources/__init__.py
@@ -1,12 +1,12 @@
"""Reference source plugins.

This package provides pluggable reference sources for fetching content
from various origins (PubMed, Crossref, local files, URLs).
from various origins (PubMed, Crossref, local files, URLs, ClinicalTrials.gov).

Examples:
    >>> from linkml_reference_validator.etl.sources import ReferenceSourceRegistry
    >>> sources = ReferenceSourceRegistry.list_sources()
    >>> len(sources) >= 7
    >>> len(sources) >= 8
    True
"""

@@ -25,6 +25,7 @@
    BioProjectSource,
    BioSampleSource,
)
from linkml_reference_validator.etl.sources.clinicaltrials import ClinicalTrialsSource

__all__ = [
    "ReferenceSource",
@@ -36,4 +37,5 @@
    "GEOSource",
    "BioProjectSource",
    "BioSampleSource",
    "ClinicalTrialsSource",
]
180 changes: 180 additions & 0 deletions src/linkml_reference_validator/etl/sources/clinicaltrials.py
@@ -0,0 +1,180 @@
"""ClinicalTrials.gov reference source.

Provides access to clinical trial data via the ClinicalTrials.gov API.

Uses the bioregistry standard prefix 'clinicaltrials' with pattern NCT followed by 8 digits.
See: https://bioregistry.io/registry/clinicaltrials

Examples:
    >>> from linkml_reference_validator.etl.sources.clinicaltrials import ClinicalTrialsSource
    >>> ClinicalTrialsSource.prefix()
    'clinicaltrials'
    >>> ClinicalTrialsSource.can_handle("clinicaltrials:NCT00000001")
    True
    >>> ClinicalTrialsSource.can_handle("NCT00000001")
    True
"""

import logging
import re
import time
from typing import Optional

import requests # type: ignore

from linkml_reference_validator.models import ReferenceContent, ReferenceValidationConfig
from linkml_reference_validator.etl.sources.base import ReferenceSource, ReferenceSourceRegistry

logger = logging.getLogger(__name__)

# ClinicalTrials.gov API v2 endpoint
CLINICALTRIALS_API_URL = "https://clinicaltrials.gov/api/v2/studies/{nct_id}"

# NCT ID pattern: NCT followed by 8 digits (bioregistry standard)
NCT_ID_PATTERN = re.compile(r"^NCT\d{8}$", re.IGNORECASE)


@ReferenceSourceRegistry.register
class ClinicalTrialsSource(ReferenceSource):
    """Fetch clinical trial data from ClinicalTrials.gov.

    Uses the bioregistry standard prefix 'clinicaltrials'.
    Supports NCT identifiers (e.g., NCT00000001) with or without prefix.

    Examples:
        >>> ClinicalTrialsSource.prefix()
        'clinicaltrials'
        >>> ClinicalTrialsSource.can_handle("clinicaltrials:NCT00000001")
        True
        >>> ClinicalTrialsSource.can_handle("NCT00000001")
        True
        >>> ClinicalTrialsSource.can_handle("PMID:12345")
        False
    """

    @classmethod
    def prefix(cls) -> str:
        """Return the prefix this source handles.

        Uses bioregistry standard prefix 'clinicaltrials'.

        Examples:
            >>> ClinicalTrialsSource.prefix()
            'clinicaltrials'
        """
        return "clinicaltrials"

    @classmethod
    def can_handle(cls, reference_id: str) -> bool:
        """Check if this source can handle the given reference ID.

        Supports:
        - clinicaltrials:NCT00000001 (bioregistry standard)
        - NCT00000001 (bare NCT ID)

        Examples:
            >>> ClinicalTrialsSource.can_handle("clinicaltrials:NCT00000001")
            True
            >>> ClinicalTrialsSource.can_handle("clinicaltrials:NCT12345678")
            True
            >>> ClinicalTrialsSource.can_handle("NCT00000001")
            True
            >>> ClinicalTrialsSource.can_handle("PMID:12345")
            False
        """
        # Check for prefix (clinicaltrials:...)
        if super().can_handle(reference_id):
            return True
        # Check for bare NCT ID (NCT followed by 8 digits)
        return bool(NCT_ID_PATTERN.match(reference_id))

    def fetch(
        self, identifier: str, config: ReferenceValidationConfig
    ) -> Optional[ReferenceContent]:
        """Fetch clinical trial data from ClinicalTrials.gov API.

        Args:
            identifier: NCT identifier (e.g., NCT00000001)
            config: Configuration including rate limiting

        Returns:
            ReferenceContent if successful, None otherwise

        Examples:
            >>> source = ClinicalTrialsSource()
            >>> # This would require network access in real usage
            >>> source.prefix()
            'clinicaltrials'
        """
        time.sleep(config.rate_limit_delay)

        # Normalize identifier - ensure it starts with NCT
        nct_id = identifier.upper()
        if not nct_id.startswith("NCT"):
            nct_id = f"NCT{nct_id}"

        url = CLINICALTRIALS_API_URL.format(nct_id=nct_id)

        try:
            response = requests.get(url, timeout=30)
        except requests.RequestException as exc:
            logger.warning(f"Failed to fetch clinical trial {nct_id}: {exc}")
            return None

        if response.status_code != 200:
            logger.warning(
                f"ClinicalTrials.gov API returned status {response.status_code} for {nct_id}"
            )
            return None

        try:
            data = response.json()
        except ValueError as exc:
            logger.warning(f"Failed to parse JSON response for {nct_id}: {exc}")
            return None

        return self._parse_response(nct_id, data)

    def _parse_response(self, nct_id: str, data: dict) -> Optional[ReferenceContent]:
        """Parse the ClinicalTrials.gov API response into ReferenceContent.

        Args:
            nct_id: The NCT identifier
            data: The JSON response from the API

        Returns:
            ReferenceContent with trial information
        """
        protocol_section = data.get("protocolSection", {})
        identification = protocol_section.get("identificationModule", {})
        description = protocol_section.get("descriptionModule", {})
        status_module = protocol_section.get("statusModule", {})
        sponsor_module = protocol_section.get("sponsorCollaboratorsModule", {})

        # Extract title (prefer officialTitle, fall back to briefTitle)
        title = identification.get("officialTitle") or identification.get("briefTitle")

        # Extract content (prefer briefSummary, fall back to detailedDescription)
        content = description.get("briefSummary") or description.get("detailedDescription")

        # Build metadata
        metadata: dict = {}

        status = status_module.get("overallStatus")
        if status:
            metadata["status"] = status

        lead_sponsor = sponsor_module.get("leadSponsor", {})
        sponsor_name = lead_sponsor.get("name")
        if sponsor_name:
            metadata["sponsor"] = sponsor_name

        content_type = "summary" if content else "unavailable"

        return ReferenceContent(
            reference_id=f"{self.prefix()}:{nct_id}",
            title=title,
            content=content,
            content_type=content_type,
            metadata=metadata,
        )
4 changes: 3 additions & 1 deletion tests/conftest.py
@@ -25,9 +25,11 @@ def test_config(tmp_path, fixtures_dir):
    cache_dir = tmp_path / "cache"
    cache_dir.mkdir()

    # Copy test fixtures to cache
    # Copy test fixtures to cache (both .txt and .md formats)
    for fixture_file in fixtures_dir.glob("*.txt"):
        (cache_dir / fixture_file.name).write_text(fixture_file.read_text())
    for fixture_file in fixtures_dir.glob("*.md"):
        (cache_dir / fixture_file.name).write_text(fixture_file.read_text())

    return ReferenceValidationConfig(
        cache_dir=cache_dir,
11 changes: 11 additions & 0 deletions tests/fixtures/CLINICALTRIALS_NCT00000001.md
@@ -0,0 +1,11 @@
---
reference_id: clinicaltrials:NCT00000001
title: A Phase III Study of Drug X for Treatment of Disease Y
content_type: summary
---

# A Phase III Study of Drug X for Treatment of Disease Y

## Content

This study evaluates the efficacy and safety of Drug X in patients with Disease Y. The primary endpoint is overall survival at 12 months. Secondary endpoints include progression-free survival, quality of life measures, and adverse event profiles. Participants will be randomized 1:1 to receive either Drug X or placebo in addition to standard of care treatment.