Automated Citation Verification

An ML Engineering Capstone Project

This project develops an NLP framework for automated validation of citations and claims, ensuring references accurately support stated information. In an era where scientific misinformation can have serious consequences, verifying that citations properly support the claims they reference is crucial for maintaining the integrity of scientific literature and preventing the spread of false information.

The Problem: Studies show that 10-20% or more of citations in scientific literature are inaccurate, failing to support the claims they reference. This undermines scientific credibility and can perpetuate misinformation. Our solution uses state-of-the-art transformer models to automatically classify citation accuracy as SUPPORT, REFUTE, or NEI (Not Enough Information).

MLE Capstone Project Diagram

Key Achievements

Superior Model Performance

Our fine-tuned DeBERTa model achieves a 7-percentage-point increase in average macro F1 over the best baseline model:

Macro F1 on test split

  Model                            SciFact   Citation-Integrity   Average
  SciFact baseline [1]             0.81      0.15                 0.48
  Citation-Integrity baseline [2]  0.74      0.44                 0.59
  Fine-tuned DeBERTa [3]           0.84      0.47                 0.66
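
Macro F1 averages the per-class F1 scores, so each of the three classes counts equally regardless of how often it occurs. A minimal sketch of how such a score can be computed with scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import f1_score

# Made-up gold and predicted labels for a handful of claims
y_true = ["SUPPORT", "REFUTE", "NEI", "SUPPORT", "NEI", "REFUTE"]
y_pred = ["SUPPORT", "NEI", "NEI", "SUPPORT", "REFUTE", "REFUTE"]

# average="macro" gives each class equal weight in the final score
print(f1_score(y_true, y_pred, average="macro"))
```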

Production-Ready Deliverables

  • 🏋 PyVers Python Package: Comprehensive framework for model training with multi-dataset ingestion, 🤗 Hugging Face integration, and ⚡ PyTorch Lightning for scalable training
  • 🔀 Fine-tuned Model: Publicly available model ready for inference (see the inference sketch after this list)
  • 🌐 AI4Citations Web Application: Live application on Hugging Face Spaces where users can input claims and evidence to get verification results and provide feedback for model improvement
  • </> Application Repository: Complete source code for Gradio frontend and evidence retrieval modules
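
Because the fine-tuned model is published on Hugging Face, inference only needs the standard transformers pipeline. The sketch below uses a publicly available DeBERTa NLI checkpoint as a stand-in; substitute the project's model ID from its model card, and note that the label names (e.g. entailment/neutral/contradiction vs. SUPPORT/NEI/REFUTE) depend on the checkpoint:

```python
from transformers import pipeline

# Stand-in DeBERTa NLI checkpoint; replace with the project's fine-tuned model ID
clf = pipeline("text-classification",
               model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")

# Evidence is passed first and the claim second (see the sentence-pair
# ordering discussion further down in this README)
result = clf({"text": "Mitochondria generate most of the cell's supply of ATP.",
              "text_pair": "Mitochondria are the main source of cellular ATP."})
print(result)  # a label such as "entailment" with a confidence score
```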

Technical Innovation

  • Improvement over state-of-the-art baselines that use the MultiVerS model (based on Longformer)
  • Multi-dataset training approach combining SciFact and Citation-Integrity datasets
  • Evidence retrieval from PDFs using text similarity (BM25-based), semantic search (BERT-based), or LLMs (OpenAI API); a short BM25 sketch follows this list
  • Comprehensive evaluation framework with detailed performance metrics
  • Feedback functionality implemented with Hugging Face Datasets
  • API access to inference through the Gradio app
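
The evidence-retrieval bullet above mentions a BM25-based text-similarity option; a minimal sketch of that idea using the rank_bm25 package (the actual retrieval code lives in the AI4citations repository and may differ):

```python
from rank_bm25 import BM25Okapi

# Toy stand-in for sentences extracted from a PDF
sentences = [
    "Vitamin D supplementation did not reduce fracture incidence in this cohort.",
    "Participants were randomized to vitamin D or placebo for five years.",
    "Adverse events were similar between the two groups.",
]
claim = "Vitamin D supplementation reduces fracture risk."

# Simple whitespace tokenization keeps the sketch short
bm25 = BM25Okapi([s.lower().split() for s in sentences])
print(bm25.get_top_n(claim.lower().split(), sentences, n=1))
```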

Project Development Timeline

This project follows a systematic approach covering all aspects of the machine learning engineering lifecycle:

  • 🎯 Problem Definition: Initial project ideas identifying citation accuracy as a critical scientific integrity issue
  • 📊 Data Collection: Data directory with curated datasets from SciFact and Citation-Integrity
  • 📝 Project Proposal: Detailed proposal outlining approach and deliverables
  • 🔍 Literature Review: Research survey reproducing the Citation-Integrity methodology
  • 🧹 Data Wrangling: Notebooks for Citation-Integrity and SciFact preprocessing
  • 🔬 Data Exploration: Analysis notebooks for Citation-Integrity and SciFact
  • 📍 Baseline Models: MultiVerS baseline and checkpoint analysis with custom evaluation metrics
  • 🧪 Model Experimentation: Blog post on fine-tuning DeBERTa across multiple datasets
  • 🪜 Scaling Prototype: Scaling implementation for production readiness
  • 📐 Deployment Planning: Engineering plan and architecture design
  • 💫 Production Deployment: Blog post documenting the deployment process
  • Project Sharing: Public repositories (this one, pyvers, AI4citations) and live application

Data Sources

The project combines two high-quality datasets for biomedical citations with consistent labeling and preprocessing:

SciFact Dataset

  • Scope: 1,409 scientific claims verified against 5,183 abstracts
  • Source: GitHub Repository | Research Paper
  • Main Topics: gene expression, cancer, treatment, infection
  • Data Quality: Enhanced test fold with labels and abstract IDs from scifact_10

Citation-Integrity Dataset

Technical Approach: Both datasets follow the MultiVerS data format, enabling consistent model training and evaluation across different domains.
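
For orientation, the MultiVerS-style claims file is JSON Lines, with each record pairing a claim with evidence sentences from a cited abstract. The snippet below is an approximation of the general shape rather than an exact record from either dataset:

```python
import json

# Approximate shape of one claims.jsonl record (field names follow the
# SciFact/MultiVerS convention; values are invented for illustration)
record = {
    "id": 1,
    "claim": "Vitamin D supplementation reduces fracture risk.",
    "doc_ids": [42],
    "evidence": {"42": [{"sentences": [3, 4], "label": "REFUTE"}]},
}
print(json.dumps(record))
```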

Novel Insights and Lessons Learned

This project uncovered several findings that weren't documented in existing literature or course materials:

🔄 Sentence Pair Ordering Matters

  • Discovery: The order of sentence pairs in transformer tokenization significantly impacts performance
  • Investigation: Model documentation and papers lack clarity on proper ordering for natural language inference
  • Solution: Experiments revealed DeBERTa was trained with evidence-before-claim ordering
  • Impact: Maintaining consistent ordering between fine-tuning and inference improved classification accuracy
  • Implementation: The pyvers package enforces this ordering, enhancing model reliability
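
A minimal sketch of what evidence-before-claim ordering looks like at the tokenizer level (the checkpoint below is a generic DeBERTa model, not necessarily the one used in this project):

```python
from transformers import AutoTokenizer

# Generic DeBERTa tokenizer, used only to illustrate segment ordering
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

evidence = "In this trial, vitamin D supplementation did not lower fracture incidence."
claim = "Vitamin D supplementation reduces fracture risk."

# Evidence is the first segment, the claim the second; pyvers keeps this
# ordering consistent between fine-tuning and inference
inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
print(tokenizer.decode(inputs["input_ids"][0]))
```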

📈 Rethinking Overfitting in Deep Learning

  • Conventional Wisdom: Classical ML teaches that overfitting hurts generalization
  • Surprising Finding: Fine-tuning pretrained transformers on small datasets shows apparent overfitting after 1-2 epochs, yet continued training improves test accuracy
  • Insight: The bias-variance tradeoff behaves differently for large parameter models
  • Documentation: Detailed analysis in blog post with connections to "benign overfitting" research
  • Practical Impact: Optimized training schedules based on test performance rather than traditional early stopping
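
In practice this amounted to dropping the usual early-stopping callback and training for a fixed number of epochs chosen from held-out performance. A minimal PyTorch Lightning sketch, where `model` and `dm` stand in for a pyvers classifier and data module:

```python
import pytorch_lightning as pl

# Conventional recipe: halt as soon as validation loss starts rising
# trainer = pl.Trainer(
#     max_epochs=20,
#     callbacks=[pl.callbacks.EarlyStopping(monitor="val_loss", patience=2)],
# )

# Schedule used instead: a fixed epoch budget, since accuracy kept improving
# after validation loss suggested overfitting (the number 10 is illustrative)
trainer = pl.Trainer(max_epochs=10)
# trainer.fit(model, datamodule=dm)  # model/dm are hypothetical placeholders
```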

Future Development Opportunities

  • Class Imbalance Handling: Implement loss function reweighting similar to the MultiVerS approach (see the sketch after this list)
  • Data Augmentation: Integrate libraries like TextAttack, TextAugment, or nlpaug for synthetic data generation
  • Efficient Fine-tuning: Explore Low-rank Adaptation (LoRA) for faster training and overfitting mitigation
  • Expanded Domains: Extend beyond biomedical literature to other scientific disciplines
  • Real-time Processing: Optimize inference speed for large-scale document processing
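
One straightforward way to implement the loss reweighting mentioned in the first bullet is a class-weighted cross-entropy loss. The sketch below uses invented class counts and is not the MultiVerS weighting scheme itself:

```python
import torch
import torch.nn as nn

# Invented class counts for SUPPORT, NEI, REFUTE in a training split
class_counts = torch.tensor([700.0, 500.0, 120.0])

# Inverse-frequency weights, normalized so the mean weight is 1
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Weighted cross-entropy up-weights whichever classes are rare
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 3)           # dummy model outputs for a batch of 4
labels = torch.tensor([0, 2, 1, 2])  # dummy gold labels
print(loss_fn(logits, labels))
```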

Acknowledgments

Special thanks to Divya Vellanki, my mentor, for invaluable guidance and encouragement throughout this project.

The Springboard MLE bootcamp provided the foundational knowledge and structured approach that made this project possible.

This work builds upon significant contributions from the research community:


References:
