RTXteam/LLM_PMID_Checker_NLI_VERSION


LLM PMID Support - NLI Version

A machine-learning system for validating whether PubMed abstracts support given research triples, using natural language inference (NLI).

Requires Python 3.8+.

Overview

This project uses natural language inference (NLI) models to validate research triples against PubMed abstracts.

Key Features

  • CURIE-Based Input: Accepts CURIE identifiers (e.g. NCBIGene:6495) rather than entity names
  • Automatic Name Resolution: Converts CURIEs to equivalent names via node normalization
  • NLI-Based Validation: Uses pretrained or fine-tuned transformer models for natural language inference
  • Multiple Model Support: Pre-trained (BART, DeBERTa, RoBERTa) and fine-tuned biomedical models (BioBERT, PubMedBERT, BioLinkBERT, SciBERT, SapBERT)
  • Two-Stage Training Pipeline: MNLI pre-training followed by custom biomedical data fine-tuning
  • Binary Classification: Trained models use binary (supported/not_supported) classification for better domain alignment
  • Local Caching: SQLite-based caching for PubMed abstracts
  • Evidence Extraction: Returns supporting sentences from abstracts

Quick Start

1. Installation

```bash
# Activate conda environment
conda activate llm_pmid_lm_version_env

# Install dependencies
pip install -r requirements.txt
```

2. Configuration

Create a .env file:

```bash
# Edit .env with your credentials
nano .env
```

Required configuration:

[email protected]
NCBI_API_KEY=your_ncbi_api_key_here
AVAILABLE_NLI_MODELS=bart,deberta,roberta,biobert-nli,pubmedbert-nli,biolinkbert-nli,scibert-nli,sapbert-nli

3. Usage

Command Line

```bash
# Basic usage with CURIEs (CPU)
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 16488997

# Use GPU
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 \
  --use_gpu --gpu 0

# Use a fine-tuned biomedical model
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 \
  --nli_model biobert-nli \
  --use_gpu

# Use a pre-trained general NLI model
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 \
  --nli_model deberta \
  --use_gpu

# Multiple PMIDs
python main.py \
  --triple_curie 'NCBIGene:7157' 'regulates' 'GO:0006915' \
  --pmids 12345678 87654321 11223344

# Custom threshold (stricter validation)
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 \
  --threshold 0.8

# Performance tuning: more names for better coverage
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 \
  --max_names 10 \
  --use_gpu

# Performance tuning: larger batch size for speed (requires more GPU memory)
python main.py \
  --triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
  --pmids 34513929 \
  --batch_size 64 \
  --use_gpu
```

Python API

```python
from src.nli_checker import NLIChecker
from src.pmid_extractor import PMIDExtractor
from src.node_normalization import NodeNormalizationClient

# Initialize with custom parameters
extractor = PMIDExtractor(email="[email protected]", api_key="your_key")
checker = NLIChecker(
    model_name="biobert-nli",  # Use fine-tuned biomedical model
    use_gpu=True,
    gpu_id="0",
    threshold=0.7,      # Stricter validation
    batch_size=64       # Larger batches for speed
)
normalizer = NodeNormalizationClient()

# Resolve CURIEs to names
subject_names = normalizer.get_equivalent_names(curie="NCBIGene:6495")
object_names = normalizer.get_equivalent_names(curie="UMLS:C0596290")

# Extract abstracts
abstracts = extractor.extract_abstracts(["34513929"])

# Check support
triple = (subject_names[0], "affects", object_names[0])
for pmid, data in abstracts.items():
    result = checker.check_support(triple, data.abstract)
    print(f"PMID {pmid}: {result['supported']} (confidence: {result['confidence']:.4f}, threshold: {result['threshold']:.2f})")
```

Available Models

Pre-trained General NLI Models

| Model | Description |
|-------|-------------|
| `bart` | BART-large-mnli (default) |
| `deberta` | DeBERTa-v3-base-mnli |
| `roberta` | RoBERTa-large-mnli |

Fine-tuned Biomedical Models

These models have been fine-tuned using a two-stage training pipeline on custom biomedical NLI data:

| Model | Base Model | Training | Classification |
|-------|------------|----------|----------------|
| `biobert-nli` | BioBERT v1.2 | MNLI → Custom | Binary (supported/not_supported) |
| `pubmedbert-nli` | PubMedBERT | MNLI → Custom | Binary (supported/not_supported) |
| `biolinkbert-nli` | BioLinkBERT | MNLI → Custom | Binary (supported/not_supported) |
| `scibert-nli` | SciBERT | MNLI → Custom | Binary (supported/not_supported) |
| `sapbert-nli` | SapBERT | MNLI → Custom | Binary (supported/not_supported) |

Example Output

```text
=== Result for PMID 34513929 ===
Triple (CURIEs): (NCBIGene:6495, affects, UMLS:C0596290)
Best Match: six1 affects cell proliferation.
Status: SUPPORTED (Confidence: 0.9889, Threshold: 0.50)
Evidence: We demonstrated that SIX1 promoted cell proliferation by upregulating cyclin...
============================
```

How It Works

  1. Resolve CURIEs to Multiple Names

    • Input: (NCBIGene:6495, affects, UMLS:C0596290)
    • Resolved: (['six1', 'six1 gene', 'six1 (hsap)', ...], 'affects', ['cell proliferation'])
    • Uses ARAX node normalization to get all equivalent names
    • Names sorted by length (shortest first) for efficient matching
    • Top 5 names used for each entity to balance accuracy and performance
  2. Test All Name Combinations

    • For each combination of subject and object names:
      • Generate hypothesis: e.g., "six1 affects cell proliferation.", "six1 gene affects cell proliferation."
      • Test against all sentences in the abstract
    • Best matching hypothesis is selected based on highest support score
    • This handles cases where abstracts use different name variations
  3. Extract Abstract Sentences

    • Split abstract into individual sentences using NLTK
  4. Parallel Batch Processing

    • Create all (sentence, hypothesis) pairs from combinations
    • Process pairs in batches (batch size: 32) for efficiency
    • For each batch:
      • Feed batch to NLI model in parallel
      • Model classifies as: supported vs not_supported (binary) or entailment/neutral/contradiction (3-way)
      • Extract support confidence scores
    • Track the hypothesis with highest support score across all batches
  5. Binary Classification

    • SUPPORTED: If best support confidence > threshold (default: 0.5)
    • NOT SUPPORTED: If no hypothesis meets the threshold
    • Threshold is user-configurable (0.0-1.0)
    • Return best matching hypothesis and supporting sentence
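The steps above can be condensed into a single loop. In the sketch below, `support_score` is a toy lexical-overlap stand-in for the real NLI model, so only the control flow (name combinations, best-hypothesis tracking, threshold decision) mirrors the pipeline; sentence splitting and batched model calls are omitted for brevity:

```python
from itertools import product

def support_score(sentence, hypothesis):
    """Toy placeholder for the NLI model's 'supported' probability:
    fraction of hypothesis words that appear in the sentence."""
    words = set(hypothesis.lower().rstrip(".").split())
    overlap = words & set(sentence.lower().split())
    return len(overlap) / len(words)

def check_support(subject_names, predicate, object_names, sentences, threshold=0.5):
    best = {"supported": False, "confidence": 0.0, "hypothesis": None, "evidence": None}
    # Step 2: test every combination of subject and object names.
    for subj, obj in product(subject_names, object_names):
        hypothesis = f"{subj} {predicate} {obj}."
        # Step 4: in the real system these pairs are scored in batches.
        for sentence in sentences:
            score = support_score(sentence, hypothesis)
            if score > best["confidence"]:
                best.update(confidence=score, hypothesis=hypothesis, evidence=sentence)
    # Step 5: binary decision against the user-configurable threshold.
    best["supported"] = best["confidence"] > threshold
    return best

result = check_support(
    ["six1", "six1 gene"], "affects", ["cell proliferation"],
    ["SIX1 affects cell proliferation in tumour tissue.",
     "Unrelated sentence about methods."],
)
# result["hypothesis"] is the best-matching phrasing,
# result["evidence"] the sentence that supported it.
```

The same best-score-wins structure explains why shorter names are tried first and why only the top few names per entity are used: each extra name multiplies the number of (sentence, hypothesis) pairs the model must score.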
