An ML Engineering Capstone Project
This project develops an NLP framework for automated validation of citations and claims, ensuring references accurately support stated information. In an era where scientific misinformation can have serious consequences, verifying that citations properly support the claims they reference is crucial for maintaining the integrity of scientific literature and preventing the spread of false information.
The Problem: Studies show that 10-20% or more of citations in scientific literature are inaccurate, failing to support the claims they reference. This undermines scientific credibility and can perpetuate misinformation. Our solution uses state-of-the-art transformer models to automatically classify each claim-evidence pair as SUPPORT, REFUTE, or NEI (Not Enough Information).
Our fine-tuned DeBERTa model achieves a 7 percentage point increase in average F1 over the best baseline model:
Macro F1 on test split:

Model | SciFact | Citation-Integrity | Average |
---|---|---|---|
SciFact baseline [1] | 0.81 | 0.15 | 0.48 |
Citation-Integrity baseline [2] | 0.74 | 0.44 | 0.59 |
Fine-tuned DeBERTa [3] | 0.84 | 0.47 | 0.66 |
- 🏋 PyVers Python Package: Comprehensive framework for model training with multi-dataset ingestion, 🤗 Hugging Face integration, and ⚡ PyTorch Lightning for scalable training (see the training sketch after this list)
- 🔀 Fine-tuned Model: Publicly available model ready for inference
- 🌐 AI4Citations Web Application: Live application on Hugging Face Spaces where users can input claims and evidence to get verification results and provide feedback for model improvement
- </> Application Repository: Complete source code for Gradio frontend and evidence retrieval modules
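The training stack pairs 🤗 Hugging Face models with ⚡ PyTorch Lightning. Below is a minimal, generic sketch of that combination; the class, argument, and checkpoint names are illustrative placeholders, not the pyvers API, so see the pyvers repository for the real interfaces.

```python
# Generic sketch: fine-tuning a Hugging Face classifier with PyTorch Lightning.
# Names are placeholders, not the pyvers API.
import torch
import pytorch_lightning as pl
from transformers import AutoModelForSequenceClassification

class ClaimVerifier(pl.LightningModule):
    def __init__(self, model_name="microsoft/deberta-v3-base", num_labels=3, lr=2e-5):
        super().__init__()
        # Three classes: SUPPORT, REFUTE, NEI
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch is a dict with input_ids, attention_mask, and labels
        out = self.model(**batch)
        self.log("train_loss", out.loss)
        return out.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# trainer = pl.Trainer(max_epochs=5, accelerator="auto")
# trainer.fit(ClaimVerifier(), train_dataloaders=train_loader)  # DataLoader of tokenized pairs
```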
- Improvement over state-of-the-art baselines that use the Longformer-based MultiVerS model
- Multi-dataset training approach combining SciFact and Citation-Integrity datasets
- Evidence retrieval from PDFs using text similarity (BM25-based), semantic search (BERT-based), or LLMs (OpenAI API); see the BM25 sketch after this list
- Comprehensive evaluation framework with detailed performance metrics
- Feedback functionality implemented with Hugging Face Datasets
- API access to inference through the Gradio app
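The Gradio app's endpoint can be called programmatically. A minimal sketch, assuming the gradio_client package; the Space identifier, argument order, and api_name below are placeholders, so check the app's "Use via API" page for the exact signature.

```python
# Calling the deployed Gradio app programmatically (placeholder Space and endpoint names).
from gradio_client import Client

client = Client("user/AI4citations")  # hypothetical Space identifier
result = client.predict(
    "Mitochondria are the powerhouse of the cell.",   # claim
    "Mitochondria generate most of the cell's ATP.",  # evidence
    api_name="/predict",                              # placeholder endpoint name
)
print(result)  # e.g. predicted label (SUPPORT/REFUTE/NEI) with scores
```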
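For the PDF evidence retrieval feature above, here is a sketch of the BM25 option, assuming the pypdf and rank_bm25 packages; the app's actual retriever may split sentences and tokenize differently.

```python
# Sketch: BM25 (text-similarity) retrieval over sentences extracted from a PDF.
from pypdf import PdfReader
from rank_bm25 import BM25Okapi

reader = PdfReader("paper.pdf")
sentences = [
    s.strip()
    for page in reader.pages
    for s in (page.extract_text() or "").split(". ")
    if s.strip()
]

bm25 = BM25Okapi([s.lower().split() for s in sentences])
claim = "Vitamin D supplementation reduces the risk of respiratory infections"
top_evidence = bm25.get_top_n(claim.lower().split(), sentences, n=3)
print(top_evidence)  # candidate evidence sentences to pair with the claim
```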
This project follows a systematic approach covering all aspects of the machine learning engineering lifecycle:
Phase | Component | Description |
---|---|---|
🎯 | Problem Definition | Initial project ideas identifying citation accuracy as a critical scientific integrity issue |
📊 | Data Collection | Data directory with curated datasets from SciFact and Citation-Integrity |
📝 | Project Proposal | Detailed proposal outlining approach and deliverables |
🔍 | Literature Review | Research survey reproducing Citation-Integrity methodology |
🧹 | Data Wrangling | Notebooks for Citation-Integrity and SciFact preprocessing |
🔬 | Data Exploration | Analysis notebooks for Citation-Integrity and SciFact |
📍 | Baseline Models | MultiVerS baseline and checkpoint analysis with custom evaluation metrics |
🧪 | Model Experimentation | Blog post on fine-tuning DeBERTa across multiple datasets |
🪜 | Scaling Prototype | Scaling implementation for production readiness |
📐 | Deployment Planning | Engineering plan and architecture design |
💫 | Production Deployment | Blog post documenting deployment process |
❇ | Project Sharing | Public repositories (this one, pyvers, AI4citations) and live application |
The project combines two high-quality biomedical citation datasets with consistent labeling and preprocessing:

SciFact
- Scope: 1,409 scientific claims verified against 5,183 abstracts
- Source: GitHub Repository | Research Paper
- Main Topics: gene expression, cancer, treatment, infection
- Data Quality: Enhanced test fold with labels and abstract IDs from scifact_10

Citation-Integrity
- Scope: 3,063 citation instances from biomedical publications
- Source: GitHub Repository | Research Paper
- Main Topics: cells, cancer, COVID-19, patients
Technical Approach: Both datasets follow the MultiVerS data format, enabling consistent model training and evaluation across different domains.
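For illustration, a claim record and corpus record in this SciFact/MultiVerS-style format might look like the following; the values are invented and some key names may differ slightly between variants, so consult the MultiVerS repository for the authoritative schema.

```python
# Illustrative claim and corpus records in the SciFact/MultiVerS-style JSONL format.
# Values are invented; some variants name the cited-document field differently.
claim_record = {
    "id": 42,
    "claim": "Drug X reduces tumor growth in mice.",
    "doc_ids": [101],  # abstracts the claim is verified against ("cited_doc_ids" in SciFact)
    "evidence": {
        "101": [{"sentences": [2], "label": "SUPPORT"}],  # rationale sentence indices
    },
}

# Abstracts live in a separate corpus file, one JSON object per line:
corpus_record = {
    "doc_id": 101,
    "title": "Effects of Drug X on tumor growth in murine models",
    "abstract": ["Background sentence.", "Methods sentence.", "Drug X shrank tumors."],
}
```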
This project uncovered several findings that weren't documented in existing literature or course materials:
- Discovery: The order of sentence pairs in transformer tokenization significantly impacts performance
- Investigation: Model documentation and papers lack clarity on proper ordering for natural language inference
- Solution: Experiments revealed DeBERTa was trained with evidence-before-claim ordering
- Impact: Maintaining consistent ordering between fine-tuning and inference improved classification accuracy
- Implementation: The pyvers package enforces this ordering, enhancing model reliability
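A minimal illustration of the evidence-before-claim ordering with the Hugging Face sentence-pair tokenizer; the checkpoint name here is only an example, not necessarily the model fine-tuned in this project.

```python
# Evidence-before-claim ordering with a sentence-pair tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

evidence = "Mitochondria generate most of the cell's ATP."
claim = "Mitochondria are the powerhouse of the cell."

# Evidence goes in the first segment and the claim in the second, matching the
# premise/hypothesis convention the model saw during NLI training.
inputs = tokenizer(evidence, claim, truncation=True, return_tensors="pt")
```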
- Conventional Wisdom: Classical ML teaches that overfitting hurts generalization
- Surprising Finding: Fine-tuning pretrained transformers on small datasets shows apparent overfitting after 1-2 epochs, yet continued training improves test accuracy
- Insight: The bias-variance tradeoff behaves differently for heavily overparameterized models
- Documentation: Detailed analysis in blog post with connections to "benign overfitting" research
- Practical Impact: Optimized training schedules based on test performance rather than traditional early stopping
- Class Imbalance Handling: Implement loss function reweighting similar to MultiVerS approach
- Data Augmentation: Integrate libraries like TextAttack, TextAugment, or nlpaug for synthetic data generation
- Efficient Fine-tuning: Explore Low-rank Adaptation (LoRA) for faster training and overfitting mitigation
- Expanded Domains: Extend beyond biomedical literature to other scientific disciplines
- Real-time Processing: Optimize inference speed for large-scale document processing
Special thanks to Divya Vellanki, my mentor, for invaluable guidance and encouragement throughout this project.
The Springboard MLE bootcamp provided the foundational knowledge and structured approach that made this project possible.
This work builds upon significant contributions from the research community:
- Citation-Integrity dataset by Sarol et al. (2024)
- DeBERTa model by He et al. (2021)
- MultiVerS model by Wadden et al. (2021)
- SciFact dataset by Wadden et al. (2020)
- Longformer model by Beltagy et al. (2020)
References:
- [1] MultiVerS pretrained on FeverSci and fine-tuned on SciFact by Wadden et al. (2021)
- [2] MultiVerS pretrained on HealthVer and fine-tuned on Citation-Integrity by Sarol et al. (2024)
- [3] DeBERTa v3 pretrained on multiple NLI datasets and fine-tuned on shuffled data from SciFact and Citation-Integrity in this project