An ML Engineering Capstone Project
This project develops an NLP framework for automated validation of citations and claims, ensuring references accurately support stated information. In an era where scientific misinformation can have serious consequences, verifying that citations properly support the claims they reference is crucial for maintaining the integrity of scientific literature and preventing the spread of false information.
The Problem: Studies show that 10-20% or more of citations in scientific literature are inaccurate, failing to support the claims they reference. This undermines scientific credibility and can perpetuate misinformation. Our solution uses state-of-the-art transformer models to automatically classify each claim-evidence pair as SUPPORT, REFUTE, or NEI (Not Enough Information).
Our fine-tuned DeBERTa model achieves a 7 percentage point increase in average F1 over the best baseline model:
Macro F1 on test split:

Model | SciFact | Citation-Integrity | Average |
---|---|---|---|
SciFact baseline [1] | 0.81 | 0.15 | 0.48 |
Citation-Integrity baseline [2] | 0.74 | 0.44 | 0.59 |
Fine-tuned DeBERTa [3] | 0.84 | 0.47 | 0.66 |
- 🏋 PyVers Python Package: Comprehensive framework for model training with multi-dataset ingestion, 🤗 Hugging Face integration, and ⚡ PyTorch Lightning for scalable training (see the training sketch after this list)
- 🔀 Fine-tuned Model: Publicly available model ready for inference
- 🌐 AI4Citations Web Application: Live application on Hugging Face Spaces where users can input claims and evidence to get verification results and provide feedback for model improvement
- </> Application Repository: Complete source code for Gradio frontend and evidence retrieval modules
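The training stack pairs 🤗 Hugging Face models with ⚡ PyTorch Lightning. Below is a minimal, generic sketch of that combination; the class, argument, and checkpoint names are illustrative placeholders, not the pyvers API, so see the pyvers repository for the real interfaces.

```python
# Generic sketch: fine-tuning a Hugging Face classifier with PyTorch Lightning.
# Names are placeholders, not the pyvers API.
import torch
import pytorch_lightning as pl
from transformers import AutoModelForSequenceClassification

class ClaimVerifier(pl.LightningModule):
    def __init__(self, model_name="microsoft/deberta-v3-base", num_labels=3, lr=2e-5):
        super().__init__()
        # Three classes: SUPPORT, REFUTE, NEI
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch is a dict with input_ids, attention_mask, and labels
        out = self.model(**batch)
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=5, accelerator="auto")
# trainer.fit(ClaimVerifier(), train_dataloaders=train_loader)  # DataLoader of tokenized pairs
```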
- Improvement over state-of-the-art baselines that use the Longformer-based MultiVerS model
- Multi-dataset training approach combining SciFact and Citation-Integrity datasets
- Evidence retrieval from PDFs using text similarity (BM25-based), semantic search (BERT-based), or LLMs (OpenAI API); see the BM25 sketch after this list
- Comprehensive evaluation framework with detailed performance metrics
- Feedback functionality implemented with Hugging Face Datasets
- API access to inference through the Gradio app
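The Gradio app's endpoint can be called programmatically. A minimal sketch, assuming the gradio_client package; the Space identifier, argument order, and api_name below are placeholders, so check the app's "Use via API" page for the exact signature.

```python
# Calling the deployed Gradio app programmatically (placeholder Space and endpoint names).
from gradio_client import Client

client = Client("user/AI4citations")  # hypothetical Space identifier
result = client.predict(
    "Mitochondria are the powerhouse of the cell.",   # claim
    "Mitochondria generate most of the cell's ATP.",  # evidence
    api_name="/predict",                              # placeholder endpoint name
)
print(result)  # e.g. predicted label (SUPPORT/REFUTE/NEI) with scores
```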
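For the PDF evidence retrieval feature above, here is a sketch of the BM25 option, assuming the pypdf and rank_bm25 packages; the app's actual retriever may split sentences and tokenize differently.

```python
# Sketch: BM25 (text-similarity) retrieval over sentences extracted from a PDF.
from pypdf import PdfReader
from rank_bm25 import BM25Okapi

reader = PdfReader("paper.pdf")
sentences = [
    s.strip()
    for page in reader.pages
    for s in (page.extract_text() or "").split(". ")
    if s.strip()
]

bm25 = BM25Okapi([s.lower().split() for s in sentences])
claim = "Vitamin D supplementation reduces the risk of respiratory infections"
top_evidence = bm25.get_top_n(claim.lower().split(), sentences, n=3)
print(top_evidence)  # candidate evidence sentences to pair with the claim
```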
This project follows a systematic approach covering all aspects of the machine learning engineering lifecycle:
Phase | Component | Description |
---|---|---|
🎯 | Problem Definition | Initial project ideas identifying citation accuracy as a critical scientific integrity issue |
📊 | Data Collection | Data directory with curated datasets from SciFact and Citation-Integrity |
📝 | Project Proposal | Detailed proposal outlining approach and deliverables |
🔍 | Literature Review | Research survey reproducing Citation-Integrity methodology |
🧹 | Data Wrangling | Notebooks for Citation-Integrity and SciFact preprocessing |
🔬 | Data Exploration | Analysis notebooks for Citation-Integrity and SciFact |
📍 | Baseline Models | MultiVerS baseline and checkpoint analysis with custom evaluation metrics |
🧪 | Model Experimentation | Blog post on fine-tuning DeBERTa across multiple datasets |
🪜 | Scaling Prototype | Scaling implementation for production readiness |
📐 | Deployment Planning | Engineering plan and architecture design |
💫 | Production Deployment | Blog post documenting deployment process |
❇ | Project Sharing | Public repositories (this one, pyvers, AI4citations) and live application |
The project combines two high-quality biomedical citation datasets with consistent labeling and preprocessing:

SciFact
- Scope: 1,409 scientific claims verified against 5,183 abstracts
- Source: GitHub Repository | Research Paper
- Main Topics: gene expression, cancer, treatment, infection
- Data Quality: Enhanced test fold with labels and abstract IDs from scifact_10

Citation-Integrity
- Scope: 3,063 citation instances from biomedical publications
- Source: GitHub Repository | Research Paper
- Main Topics: cells, cancer, COVID-19, patients
Technical Approach: Both datasets follow the MultiVerS data format, enabling consistent model training and evaluation across different domains.
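For illustration, a claim record and corpus record in this SciFact/MultiVerS-style format might look like the following; the values are invented and some key names may differ slightly between variants, so consult the MultiVerS repository for the authoritative schema.

```python
# Illustrative claim and corpus records in the SciFact/MultiVerS-style JSONL format.
# Values are invented; some variants name the cited-document field differently.
claim_record = {
    "id": 42,
    "claim": "Drug X reduces tumor growth in mice.",
    "doc_ids": [101],  # abstracts the claim is verified against ("cited_doc_ids" in SciFact)
    "evidence": {
        "101": [{"sentences": [2], "label": "SUPPORT"}],  # rationale sentence indices
    },
}

# Abstracts live in a separate corpus file, one JSON object per line:
corpus_record = {
    "doc_id": 101,
    "title": "Effects of Drug X on tumor growth in murine models",
    "abstract": ["Background sentence.", "Methods sentence.", "Drug X shrank tumors."],
}
```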
This project uncovered several findings that weren't documented in existing literature or course materials:
- Discovery: The order of sentence pairs in transformer tokenization significantly impacts performance
- Investigation: Model documentation and papers lack clarity on proper ordering for natural language inference
- Solution: Experiments revealed DeBERTa was trained with evidence-before-claim ordering
- Impact: Maintaining consistent ordering between fine-tuning and inference improved classification accuracy
- Implementation: The pyvers package enforces this ordering, enhancing model reliability
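A minimal illustration of the evidence-before-claim ordering with the Hugging Face sentence-pair tokenizer; the checkpoint name here is only an example, not necessarily the model fine-tuned in this project.

```python
# Evidence-before-claim ordering with a sentence-pair tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

evidence = "Mitochondria generate most of the cell's ATP."
claim = "Mitochondria are the powerhouse of the cell."

# Evidence goes in the first segment and the claim in the second, matching the
# premise/hypothesis convention the model saw during NLI training.
inputs = tokenizer(evidence, claim, truncation=True, return_tensors="pt")
```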
- Conventional Wisdom: Classical ML teaches that overfitting hurts generalization
- Surprising Finding: Fine-tuning pretrained transformers on small datasets shows apparent overfitting after 1-2 epochs, yet continued training improves test accuracy
- Insight: The bias-variance tradeoff behaves differently for heavily overparameterized models
- Documentation: Detailed analysis in blog post with connections to "benign overfitting" research
- Practical Impact: Optimized training schedules based on test performance rather than traditional early stopping
- Class Imbalance Handling: Implement loss function reweighting similar to MultiVerS approach
- Data Augmentation: Integrate libraries like TextAttack, TextAugment, or nlpaug for synthetic data generation
- Efficient Fine-tuning: Explore Low-rank Adaptation (LoRA) for faster training and overfitting mitigation
- Expanded Domains: Extend beyond biomedical literature to other scientific disciplines
- Real-time Processing: Optimize inference speed for large-scale document processing
Special thanks to Divya Vellanki, my mentor, for invaluable guidance and encouragement throughout this project.
The Springboard MLE bootcamp provided the foundational knowledge and structured approach that made this project possible.
This work builds upon significant contributions from the research community:
- Citation-Integrity dataset by Sarol et al. (2024)
- DeBERTa model by He et al. (2021)
- MultiVerS model by Wadden et al. (2021)
- SciFact dataset by Wadden et al. (2020)
- Longformer model by Beltagy et al. (2020)
References:
- [1] MultiVerS pretrained on FeverSci and fine-tuned on SciFact by Wadden et al. (2021)
- [2] MultiVerS pretrained on HealthVer and fine-tuned on Citation-Integrity by Sarol et al. (2024)
- [3] DeBERTa v3 pretrained on multiple NLI datasets and fine-tuned on shuffled data from SciFact and Citation-Integrity in this project