A machine learning-based system for validating whether PubMed abstracts support given research triples using Natural Language Inference (NLI).
Given a subject–predicate–object triple and a set of PMIDs, the system uses natural language inference (NLI) models to check whether each abstract supports the triple.
- CURIE-Based Input: Accepts CURIE IDs instead of entity names as input
- Automatic Name Resolution: Converts CURIEs to equivalent names via node normalization
- NLI-Based Validation: Uses pretrained or fine-tuned transformer models for natural language inference
- Multiple Model Support: Pre-trained (BART, DeBERTa, RoBERTa) and fine-tuned biomedical models (BioBERT, PubMedBERT, BioLinkBERT, SciBERT, SapBERT)
- Two-Stage Training Pipeline: MNLI pre-training followed by custom biomedical data fine-tuning
- Binary Classification: Trained models use binary (supported/not_supported) classification for better domain alignment
- Local Caching: SQLite-based caching for PubMed abstracts (see the illustrative sketch after this list)
- Evidence Extraction: Returns supporting sentences from abstracts
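For illustration only, a PMID-keyed SQLite cache for abstracts could look like the sketch below; the project's actual cache module and schema may differ.

```python
# Illustrative only: a minimal SQLite cache for PubMed abstracts keyed by PMID.
# The project's actual cache layer and schema may differ.
import sqlite3

def open_cache(path="pubmed_cache.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abstracts (pmid TEXT PRIMARY KEY, abstract TEXT)"
    )
    return conn

def get_or_fetch(conn, pmid, fetch_fn):
    row = conn.execute(
        "SELECT abstract FROM abstracts WHERE pmid = ?", (pmid,)
    ).fetchone()
    if row:
        return row[0]                      # cache hit: skip the network call
    abstract = fetch_fn(pmid)              # cache miss: fetch from PubMed
    conn.execute("INSERT OR REPLACE INTO abstracts VALUES (?, ?)", (pmid, abstract))
    conn.commit()
    return abstract
```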
```bash
# Activate conda environment
conda activate llm_pmid_lm_version_env
# Install dependencies
pip install -r requirements.txt
```

Create a .env file:
```bash
# Edit .env with your credentials
nano .env
```

Required configuration:
[email protected]
NCBI_API_KEY=your_ncbi_api_key_here
AVAILABLE_NLI_MODELS=bart,deberta,roberta,biobert-nli,pubmedbert-nli,biolinkbert-nli,scibert-nli,sapbert-nli
```
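For reference, these variables could be read at startup roughly as follows (a minimal sketch assuming python-dotenv; main.py's actual configuration handling may differ):

```python
# Hypothetical configuration loading; shown only to illustrate the .env variables above.
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

NCBI_EMAIL = os.environ["NCBI_EMAIL"]                      # contact email for NCBI E-utilities
NCBI_API_KEY = os.environ.get("NCBI_API_KEY")              # optional; raises NCBI rate limits
AVAILABLE_NLI_MODELS = os.environ.get("AVAILABLE_NLI_MODELS", "").split(",")
```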
```bash
# Basic usage with CURIEs (CPU)
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 16488997
# Use GPU
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--use_gpu --gpu 0
# Use a fine-tuned biomedical model
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--nli_model biobert-nli \
--use_gpu
# Use pre-trained general NLI model
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--nli_model deberta \
--use_gpu
# Multiple PMIDs
python main.py \
--triple_curie 'NCBIGene:7157' 'regulates' 'GO:0006915' \
--pmids 12345678 87654321 11223344
# Custom threshold (stricter validation)
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--threshold 0.8
# Performance tuning: more names for better coverage
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--max_names 10 \
--use_gpu
# Performance tuning: larger batch size for speed (requires more GPU memory)
python main.py \
--triple_curie 'NCBIGene:6495' 'affects' 'UMLS:C0596290' \
--pmids 34513929 \
--batch_size 64 \
--use_gpu
```

Python API usage:

```python
from src.nli_checker import NLIChecker
from src.pmid_extractor import PMIDExtractor
from src.node_normalization import NodeNormalizationClient
# Initialize with custom parameters
extractor = PMIDExtractor(email="[email protected]", api_key="your_key")
checker = NLIChecker(
    model_name="biobert-nli",  # Use fine-tuned biomedical model
    use_gpu=True,
    gpu_id="0",
    threshold=0.7,             # Stricter validation
    batch_size=64              # Larger batches for speed
)
normalizer = NodeNormalizationClient()
# Resolve CURIEs to names
subject_names = normalizer.get_equivalent_names(curie="NCBIGene:6495")
object_names = normalizer.get_equivalent_names(curie="UMLS:C0596290")
# Extract abstracts
abstracts = extractor.extract_abstracts(["34513929"])
# Check support
triple = (subject_names[0], "affects", object_names[0])
for pmid, data in abstracts.items():
    result = checker.check_support(triple, data.abstract)
    print(f"PMID {pmid}: {result['supported']} (confidence: {result['confidence']:.4f}, threshold: {result['threshold']:.2f})")
```
Available pre-trained general NLI models:

| Model | Description |
|---|---|
| `bart` | BART-large-mnli (default) |
| `deberta` | DeBERTa-v3-base-mnli |
| `roberta` | RoBERTa-large-mnli |
These models have been fine-tuned using a two-stage training pipeline on custom biomedical NLI data (an illustrative sketch of such a pipeline follows the table below):
| Model | Base Model | Training | Classification |
|---|---|---|---|
| `biobert-nli` | BioBERT v1.2 | MNLI → Custom | Binary (supported/not_supported) |
| `pubmedbert-nli` | PubMedBERT | MNLI → Custom | Binary (supported/not_supported) |
| `biolinkbert-nli` | BioLinkBERT | MNLI → Custom | Binary (supported/not_supported) |
| `scibert-nli` | SciBERT | MNLI → Custom | Binary (supported/not_supported) |
| `sapbert-nli` | SapBERT | MNLI → Custom | Binary (supported/not_supported) |
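The training scripts are not reproduced here, but a two-stage MNLI → custom fine-tune with Hugging Face transformers might look roughly like the sketch below. The base checkpoint, custom dataset file name, and hyperparameters are illustrative assumptions, not the project's actual configuration.

```python
# Illustrative two-stage fine-tuning sketch (not the project's actual training script).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "dmis-lab/biobert-base-cased-v1.2"   # BioBERT v1.2; the other bases are analogous
tok = AutoTokenizer.from_pretrained(base)

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"], truncation=True, max_length=256)

# Stage 1: MNLI pre-training (3-way: entailment / neutral / contradiction).
mnli = load_dataset("multi_nli").map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)
Trainer(
    model=model,
    args=TrainingArguments("stage1-mnli", num_train_epochs=1, per_device_train_batch_size=32),
    train_dataset=mnli["train"],
    tokenizer=tok,
).train()
model.save_pretrained("stage1-mnli")

# Stage 2: custom biomedical pairs with binary labels (supported / not_supported).
# "custom_biomedical_nli.jsonl" is a placeholder with premise/hypothesis/label fields.
custom = load_dataset("json", data_files="custom_biomedical_nli.jsonl").map(encode, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "stage1-mnli", num_labels=2, ignore_mismatched_sizes=True  # swap in a 2-way head
)
Trainer(
    model=model,
    args=TrainingArguments("stage2-binary", num_train_epochs=3, per_device_train_batch_size=32),
    train_dataset=custom["train"],
    tokenizer=tok,
).train()
```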
Example output:

```
=== Result for PMID 34513929 ===
Triple (CURIEs): (NCBIGene:6495, affects, UMLS:C0596290)
Best Match: six1 affects cell proliferation.
Status: SUPPORTED (Confidence: 0.9889, Threshold: 0.50)
Evidence: We demonstrated that SIX1 promoted cell proliferation by upregulating cyclin...
============================
```
The validation pipeline proceeds as follows:

- Resolve CURIEs to Multiple Names
  - Input: (NCBIGene:6495, affects, UMLS:C0596290)
  - Resolved: (['six1', 'six1 gene', 'six1 (hsap)', ...], 'affects', ['cell proliferation'])
  - Uses ARAX node normalization to get all equivalent names (see the name-resolution sketch after this list)
  - Names are sorted by length (shortest first) for efficient matching
  - The top 5 names are used for each entity to balance accuracy and performance
- Test All Name Combinations
  - For each combination of subject and object names:
    - Generate a hypothesis, e.g. "six1 affects cell proliferation." or "six1 gene affects cell proliferation."
    - Test it against all sentences in the abstract
  - The best-matching hypothesis is selected based on the highest support score
  - This handles cases where abstracts use different name variations
- Extract Abstract Sentences
  - Split the abstract into individual sentences using NLTK
- Parallel Batch Processing (see the scoring sketch after this list)
  - Create all (sentence, hypothesis) pairs from the name combinations
  - Process pairs in batches (batch size: 32) for efficiency
  - For each batch:
    - Feed the batch to the NLI model in parallel
    - The model classifies each pair as supported vs. not_supported (binary) or entailment/neutral/contradiction (3-way)
    - Extract support confidence scores
  - Track the hypothesis with the highest support score across all batches
- Binary Classification
  - SUPPORTED: the best support confidence exceeds the threshold (default: 0.5)
  - NOT SUPPORTED: no hypothesis meets the threshold
  - The threshold is user-configurable (0.0-1.0)
  - Return the best-matching hypothesis and supporting sentence
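For reference, the name-resolution step might look roughly like the sketch below, assuming the SRI Node Normalizer endpoint; the project's NodeNormalizationClient may use a different ARAX service or response handling.

```python
# Illustrative name resolution; the endpoint and response parsing are assumptions.
import requests

NODE_NORM_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"

def get_equivalent_names(curie, max_names=5):
    resp = requests.get(NODE_NORM_URL, params={"curie": curie}, timeout=30)
    resp.raise_for_status()
    entry = resp.json().get(curie) or {}
    names = {eq["label"] for eq in entry.get("equivalent_identifiers", []) if eq.get("label")}
    # Shortest names first, then keep only the top few to bound the number of hypotheses.
    return sorted(names, key=len)[:max_names]

print(get_equivalent_names("NCBIGene:6495"))  # e.g. ['six1', ...]
```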
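The combination, batch-scoring, and thresholding steps might be implemented roughly like the scoring sketch below, shown here with the pre-trained `bart` model; the project's NLIChecker may differ in details such as the hypothesis template and label handling.

```python
# Illustrative sketch of hypothesis generation + batched NLI scoring + thresholding.
from itertools import product

import nltk
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt", quiet=True)

MODEL = "facebook/bart-large-mnli"          # the `bart` option; fine-tuned models are analogous
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()
ENTAIL = model.config.label2id["entailment"]

def check_support(subject_names, predicate, object_names, abstract,
                  threshold=0.5, batch_size=32):
    sentences = nltk.sent_tokenize(abstract)                 # sentence splitting with NLTK
    hypotheses = [f"{s} {predicate} {o}."                    # all subject/object name combinations
                  for s, o in product(subject_names, object_names)]
    pairs = [(sent, hyp) for sent in sentences for hyp in hypotheses]

    best = {"supported": False, "confidence": 0.0, "hypothesis": None, "evidence": None}
    for i in range(0, len(pairs), batch_size):               # batched scoring
        batch = pairs[i:i + batch_size]
        enc = tok([p for p, _ in batch], [h for _, h in batch],
                  return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            probs = model(**enc).logits.softmax(dim=-1)[:, ENTAIL]
        j = int(probs.argmax())
        if float(probs[j]) > best["confidence"]:
            best.update(confidence=float(probs[j]),
                        evidence=batch[j][0], hypothesis=batch[j][1])
    best["supported"] = best["confidence"] > threshold       # final threshold decision
    return best
```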