A machine learning-based system for validating whether PubMed abstracts support given research triples using Natural Language Inference (NLI).
Given a subject–predicate–object triple and a set of PMIDs, the system uses natural language inference (NLI) models to check whether each abstract supports the triple.
- CURIE-Based Input: Accepts CURIE IDs instead of entity names as input
- Automatic Name Resolution: Converts CURIEs to equivalent names via node normalization
- NLI-Based Validation: Uses pretrained or fine-tuned transformer models for natural language inference
- Multiple Model Support: Pre-trained (BART, DeBERTa, RoBERTa) and fine-tuned biomedical models (BioBERT, PubMedBERT, BioLinkBERT, SciBERT, SapBERT)
- Two-Stage Training Pipeline: MNLI pre-training followed by custom biomedical data fine-tuning
- Binary Classification: Trained models use binary (supported/not_supported) classification for better domain alignment
- Local Caching: SQLite-based caching for PubMed abstracts (see the illustrative sketch after this list)
- Evidence Extraction: Returns supporting sentences from abstracts
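For illustration only, a PMID-keyed SQLite cache for abstracts could look like the sketch below; the project's actual cache module and schema may differ.

```python
# Illustrative only: a minimal SQLite cache for PubMed abstracts keyed by PMID.
# The project's actual cache layer and schema may differ.
import sqlite3

def open_cache(path="pubmed_cache.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstracts (pmid TEXT PRIMARY KEY, abstract TEXT)"
    )
    return conn

def get_or_fetch(conn, pmid, fetch_fn):
    row = conn.execute(
        "SELECT abstract FROM abstracts WHERE pmid = ?", (pmid,)
    ).fetchone()
    if row:
        return row[0]                      # cache hit: skip the network call
    abstract = fetch_fn(pmid)              # cache miss: fetch from PubMed
    conn.execute("INSERT OR REPLACE INTO abstracts VALUES (?, ?)", (pmid, abstract))
    conn.commit()
    return abstract
```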
```bash
# Activate conda environment
conda activate llm_pmid_lm_version_env
# Install dependencies
pip install -r requirements.txt
```

Create a .env file:
```bash
# Edit .env with your credentials
nano .env
```

Required configuration:
[email protected]
NCBI_API_KEY=your_ncbi_api_key_here
AVAILABLE_NLI_MODELS=bart,deberta,roberta,biobert-nli,pubmedbert-nli,biolinkbert-nli,scibert-nli,sapbert-nli
```
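For reference, these variables could be read at startup roughly as follows (a minimal sketch assuming python-dotenv; main.py's actual configuration handling may differ):

```python
# Hypothetical configuration loading; shown only to illustrate the .env variables above.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

NCBI_EMAIL = os.environ["NCBI_EMAIL"]                      # contact email for NCBI E-utilities
NCBI_API_KEY = os.environ.get("NCBI_API_KEY")              # optional; raises NCBI rate limits
AVAILABLE_NLI_MODELS = os.environ.get("AVAILABLE_NLI_MODELS", "").split(",")
```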
```bash
# Basic usage with CURIEs (CPU)
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 16488997
# Use GPU
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--use_gpu --gpu 0
# Use a fine-tuned biomedical model
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--nli_model biobert-nli \
--use_gpu
# Use pre-trained general NLI model
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--nli_model deberta \
--use_gpu
# Multiple PMIDs
python main.py \
--triple_curie 'NCBIGene:7157' 'regulates' 'GO:0006915' \
--pmids 12345678 87654321 11223344
# Custom threshold (stricter validation)
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--threshold 0.8
# Performance tuning: more names for better coverage
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--max_names 10 \
--use_gpu
# Performance tuning: larger batch size for speed (requires more GPU memory)
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--batch_size 64 \
--use_gpu
```

Python API usage:

```python
from src.nli_checker import NLIChecker
from src.pmid_extractor import PMIDExtractor
from src.node_normalization import NodeNormalizationClient
# Initialize with custom parameters
extractor = PMIDExtractor(email="[email protected]", api_key="your_key")
checker = NLIChecker(
    model_name="biobert-nli",  # Use fine-tuned biomedical model
    use_gpu=True,
    gpu_id="0",
    threshold=0.7,             # Stricter validation
    batch_size=64              # Larger batches for speed
)
normalizer = NodeNormalizationClient()
# Resolve CURIEs to names
subject_names = normalizer.get_equivalent_names(curie="NCBIGene:6495")
object_names = normalizer.get_equivalent_names(curie="UMLS:C0596290")
# Extract abstracts
abstracts = extractor.extract_abstracts(["34513929"])
# Check support
triple = (subject_names[0], "affects", object_names[0])
for pmid, data in abstracts.items():
    result = checker.check_support(triple, data.abstract)
    print(f"PMID {pmid}: {result['supported']} (confidence: {result['confidence']:.4f}, threshold: {result['threshold']:.2f})")
```
Available pre-trained general NLI models:

| Model | Description |
|---|---|
| `bart` | BART-large-mnli (default) |
| `deberta` | DeBERTa-v3-base-mnli |
| `roberta` | RoBERTa-large-mnli |
These models have been fine-tuned using a two-stage training pipeline on custom biomedical NLI data (an illustrative sketch of such a pipeline follows the table below):
| Model | Base Model | Training | Classification |
|---|---|---|---|
| `biobert-nli` | BioBERT v1.2 | MNLI → Custom | Binary (supported/not_supported) |
| `pubmedbert-nli` | PubMedBERT | MNLI → Custom | Binary (supported/not_supported) |
| `biolinkbert-nli` | BioLinkBERT | MNLI → Custom | Binary (supported/not_supported) |
| `scibert-nli` | SciBERT | MNLI → Custom | Binary (supported/not_supported) |
| `sapbert-nli` | SapBERT | MNLI → Custom | Binary (supported/not_supported) |
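The training scripts are not reproduced here, but a two-stage MNLI → custom fine-tune with Hugging Face transformers might look roughly like the sketch below. The base checkpoint, custom dataset file name, and hyperparameters are illustrative assumptions, not the project's actual configuration.

```python
# Illustrative two-stage fine-tuning sketch (not the project's actual training script).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "dmis-lab/biobert-base-cased-v1.2"   # BioBERT v1.2; the other bases are analogous
tok = AutoTokenizer.from_pretrained(base)

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"], truncation=True, max_length=256)

# Stage 1: MNLI pre-training (3-way: entailment / neutral / contradiction).
mnli = load_dataset("multi_nli").map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)
Trainer(
    model=model,
    args=TrainingArguments("stage1-mnli", num_train_epochs=1, per_device_train_batch_size=32),
    train_dataset=mnli["train"],
    tokenizer=tok,
).train()
model.save_pretrained("stage1-mnli")

# Stage 2: custom biomedical pairs with binary labels (supported / not_supported).
# "custom_biomedical_nli.jsonl" is a placeholder with premise/hypothesis/label fields.
custom = load_dataset("json", data_files="custom_biomedical_nli.jsonl").map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "stage1-mnli", num_labels=2, ignore_mismatched_sizes=True  # swap in a 2-way head
)
Trainer(
    model=model,
    args=TrainingArguments("stage2-binary", num_train_epochs=3, per_device_train_batch_size=32),
    train_dataset=custom["train"],
    tokenizer=tok,
).train()
```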
Example output:

```
=== Result for PMID 34513929 ===
Triple (CURIEs): (NCBIGene:6495, affects, UMLS:C0596290)
Best Match: six1 affects cell proliferation.
Status: SUPPORTED (Confidence: 0.9889, Threshold: 0.50)
Evidence: We demonstrated that SIX1 promoted cell proliferation by upregulating cyclin...
============================
```
The validation pipeline proceeds as follows:

- Resolve CURIEs to Multiple Names
  - Input: (NCBIGene:6495, affects, UMLS:C0596290)
  - Resolved: (['six1', 'six1 gene', 'six1 (hsap)', ...], 'affects', ['cell proliferation'])
  - Uses ARAX node normalization to get all equivalent names (see the name-resolution sketch after this list)
  - Names are sorted by length (shortest first) for efficient matching
  - The top 5 names are used for each entity to balance accuracy and performance
- Test All Name Combinations
  - For each combination of subject and object names:
    - Generate a hypothesis, e.g. "six1 affects cell proliferation." or "six1 gene affects cell proliferation."
    - Test it against all sentences in the abstract
  - The best-matching hypothesis is selected based on the highest support score
  - This handles cases where abstracts use different name variations
- Extract Abstract Sentences
  - Split the abstract into individual sentences using NLTK
- Parallel Batch Processing (see the scoring sketch after this list)
  - Create all (sentence, hypothesis) pairs from the name combinations
  - Process pairs in batches (batch size: 32) for efficiency
  - For each batch:
    - Feed the batch to the NLI model in parallel
    - The model classifies each pair as supported vs. not_supported (binary) or entailment/neutral/contradiction (3-way)
    - Extract support confidence scores
  - Track the hypothesis with the highest support score across all batches
- Binary Classification
  - SUPPORTED: the best support confidence exceeds the threshold (default: 0.5)
  - NOT SUPPORTED: no hypothesis meets the threshold
  - The threshold is user-configurable (0.0-1.0)
  - Return the best-matching hypothesis and supporting sentence
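For reference, the name-resolution step might look roughly like the sketch below, assuming the SRI Node Normalizer endpoint; the project's NodeNormalizationClient may use a different ARAX service or response handling.

```python
# Illustrative name resolution; the endpoint and response parsing are assumptions.
import requests

NODE_NORM_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"

def get_equivalent_names(curie, max_names=5):
    resp = requests.get(NODE_NORM_URL, params={"curie": curie}, timeout=30)
    resp.raise_for_status()
    entry = resp.json().get(curie) or {}
    names = {eq["label"] for eq in entry.get("equivalent_identifiers", []) if eq.get("label")}
    # Shortest names first, then keep only the top few to bound the number of hypotheses.
    return sorted(names, key=len)[:max_names]

print(get_equivalent_names("NCBIGene:6495"))  # e.g. ['six1', ...]
```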
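The combination, batch-scoring, and thresholding steps might be implemented roughly like the scoring sketch below, shown here with the pre-trained `bart` model; the project's NLIChecker may differ in details such as the hypothesis template and label handling.

```python
# Illustrative sketch of hypothesis generation + batched NLI scoring + thresholding.
from itertools import product

import nltk
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt", quiet=True)

MODEL = "facebook/bart-large-mnli"          # the `bart` option; fine-tuned models are analogous
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()
ENTAIL = model.config.label2id["entailment"]

def check_support(subject_names, predicate, object_names, abstract,
                  threshold=0.5, batch_size=32):
    sentences = nltk.sent_tokenize(abstract)                 # sentence splitting with NLTK
    hypotheses = [f"{s} {predicate} {o}."                    # all subject/object name combinations
                  for s, o in product(subject_names, object_names)]
    pairs = [(sent, hyp) for sent in sentences for hyp in hypotheses]

    best = {"supported": False, "confidence": 0.0, "hypothesis": None, "evidence": None}
    for i in range(0, len(pairs), batch_size):               # batched scoring
        batch = pairs[i:i + batch_size]
        enc = tok([p for p, _ in batch], [h for _, h in batch],
                  return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            probs = model(**enc).logits.softmax(dim=-1)[:, ENTAIL]
        j = int(probs.argmax())
        if float(probs[j]) > best["confidence"]:
            best.update(confidence=float(probs[j]),
                        evidence=batch[j][0], hypothesis=batch[j][1])
    best["supported"] = best["confidence"] > threshold       # final threshold decision
    return best
```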