# ValiRef: AI-Powered Citation Validation for Academic Papers
Features • Installation • Usage • How It Works • Benchmark
ValiRef is a tool for detecting hallucinated citations in academic papers. With the rise of AI-generated content, Large Language Models (LLMs) sometimes produce plausible-sounding but non-existent references. ValiRef helps researchers, reviewers, and publishers verify the authenticity of citations in PDF documents.
ValiRef targets four types of citation hallucination:

| Hallucination Type | Description | Example |
|---|---|---|
| 🔮 Fabrication | Completely fake paper that doesn't exist | A paper with a convincing title but no actual publication |
| 👤 Attribution Error | Real paper, wrong authors | Citing "Attention is All You Need" by someone other than Vaswani et al. |
| 📄 Irrelevance | Real paper, but claim doesn't match content | Citing a paper about NLP for a claim about computer vision |
| 🔄 Counterfactual | Real paper, opposite conclusion | Claiming a paper supports X when it actually argues against X |
## Features

- 🔍 Multi-Source Verification - Cross-references citations against ArXiv, Google Scholar, Semantic Scholar, OpenReview, OpenAlex, and DuckDuckGo
- 🤖 AI-Powered Detection - Uses DeepSeek LLM with ReAct reasoning to analyze search results
- ⚡ Async-First Architecture - Concurrent validation of multiple references for optimal performance
- 📊 Rich CLI Output - Beautiful terminal interface with progress bars, real-time metrics, and detailed reports
- 📈 Benchmark Suite - Built-in dataset generation and evaluation framework
- 🛡️ Resilient API Handling - Token bucket rate limiting + circuit breaker pattern for reliable external API calls
- 🎯 High Accuracy - 72%+ accuracy on a 100-sample benchmark, with confidence scoring and detailed reasoning
## Installation

Prerequisites:

- Python 3.12 or higher
- uv package manager (recommended) or pip
Install from PyPI:

```bash
pip install valiref
```

Or install from source:

```bash
# Clone the repository
git clone https://github.com/Gianthard-cyh/ValiRef.git
cd ValiRef

# Install dependencies
uv sync

# Set up environment variables
cp .env.example .env
# Edit .env and add your DeepSeek API key
```

Create a `.env` file with your API keys:
```
DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional: for enhanced search capabilities
SERPAPI_API_KEY=your_serpapi_key
SEMANTIC_SCHOLAR_API_KEY=your_semantic_scholar_key

# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
LANGCHAIN_API_KEY=your_langchain_key
LANGCHAIN_PROJECT=ValiRef
```

## Usage

```bash
# Basic usage
uv run python -m src.cli validate paper.pdf

# With concurrent workers (default: 5)
uv run python -m src.cli validate paper.pdf --workers 10

# Output as JSON
uv run python -m src.cli validate paper.pdf --json

# Enable verbose logging
uv run python -m src.cli validate paper.pdf --verbose
```

Example output:

```
Validation Summary for paper.pdf

Total References: 12
Validated: 12
Duration: 15.34s

┌─────────────────────────────────────────────────────────────────────┐
│ ✅ Reference #1 - REAL REFERENCE                                    │
├─────────────────────────────────────────────────────────────────────┤
│ Title: Attention Is All You Need                                    │
│ Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, et al.          │
│ Confidence: 0.98                                                    │
│                                                                     │
│ Reasoning:                                                          │
│ Found exact match on ArXiv (arxiv.org/abs/1706.03762). Title,       │
│ authors, and venue (NIPS 2017) all match the citation.              │
│                                                                     │
│ Evidence / Sources:                                                 │
│ - https://arxiv.org/abs/1706.03762                                  │
└─────────────────────────────────────────────────────────────────────┘
```
## How It Works

ValiRef uses a multi-step validation pipeline:

```
┌─────────────┐   ┌──────────────┐   ┌──────────────┐   ┌─────────────┐
│  PDF Input  │ → │   Extract    │ → │    Search    │ → │  Validate   │
│             │   │  References  │   │ Multi-Source │   │  with LLM   │
└─────────────┘   └──────────────┘   └──────────────┘   └─────────────┘
                                                               │
                                                               ▼
                                                        ┌─────────────┐
                                                        │   Report    │
                                                        │   Results   │
                                                        └─────────────┘
```
Reference extraction:

- Parses PDF documents using PyMuPDF
- Uses LLM to intelligently extract structured reference data from bibliography sections
- Handles various citation formats (APA, MLA, Chicago, etc.)
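As a loose illustration of the extraction step, the toy splitter below breaks a bibliography string on `[n]` markers. The real pipeline hands this text to an LLM for structured parsing, so `split_references` is purely a hypothetical sketch:

```python
import re

def split_references(bibliography: str) -> list[str]:
    """Split a plain-text bibliography into individual entries.

    Assumes entries are introduced by markers like "[1]" -- one common
    convention; the actual tool uses an LLM to handle APA/MLA/Chicago/etc.
    """
    # Split on "[n]" markers and drop empty fragments
    parts = re.split(r"\[\d+\]\s*", bibliography)
    return [p.strip() for p in parts if p.strip()]

bib = ("[1] Vaswani et al. Attention Is All You Need. NIPS 2017. "
       "[2] Devlin et al. BERT. NAACL 2019.")
entries = split_references(bib)
# One entry per citation marker
```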
Multi-source search simultaneously queries several academic databases:
- ArXiv - Preprint server with full-text access
- Google Scholar - Broad academic search
- Semantic Scholar - AI-powered academic search
- OpenReview - Peer-reviewed conference papers
- OpenAlex - Open academic graph
- DuckDuckGo - Web search fallback
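A minimal sketch of the concurrent multi-source search, assuming hypothetical `search_source`/`search_all` helpers; the real tool layers rate limiting and circuit breaking on top of each source:

```python
import asyncio

async def search_source(source: str, query: str) -> dict:
    """Stand-in for one academic-search backend (ArXiv, OpenAlex, ...)."""
    await asyncio.sleep(0)  # placeholder for the real HTTP request
    return {"source": source, "query": query, "hits": []}

async def search_all(query: str) -> list[dict]:
    # Query every source concurrently; return_exceptions=True means a
    # failed source yields an exception object instead of cancelling
    # the other searches.
    sources = ["arxiv", "semantic_scholar", "openalex", "openreview"]
    results = await asyncio.gather(
        *(search_source(s, query) for s in sources),
        return_exceptions=True,
    )
    return [r for r in results if not isinstance(r, Exception)]

results = asyncio.run(search_all("Attention Is All You Need"))
```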
The `HallucinationDetector` uses a ReAct (Reasoning + Acting) agent powered by the DeepSeek LLM:
- Analyzes search results from all sources
- Compares paper metadata (title, authors, abstract, venue)
- Evaluates claims against actual paper content
- Provides confidence scores with detailed reasoning
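As a rough stand-in for the LLM's judgment, the sketch below shows the kind of metadata comparison involved; `Verdict` and `compare_metadata` are hypothetical names, and the actual detector reasons over full search results rather than string matching:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    is_real: bool
    confidence: float
    reasoning: str

def compare_metadata(cited: dict, found: dict) -> Verdict:
    """Toy metadata check: title equality plus last-name overlap."""
    title_match = cited["title"].strip().lower() == found["title"].strip().lower()
    cited_authors = {a.split()[-1].lower() for a in cited["authors"]}
    found_authors = {a.split()[-1].lower() for a in found["authors"]}
    overlap = len(cited_authors & found_authors) / max(len(cited_authors), 1)
    if title_match and overlap >= 0.5:
        return Verdict(True, 0.9, "Title and authors match an indexed paper.")
    if title_match:
        return Verdict(False, 0.8, "Title exists but authors differ (attribution error).")
    return Verdict(False, 0.7, "No indexed paper matches this title.")

verdict = compare_metadata(
    {"title": "Attention Is All You Need", "authors": ["A. Vaswani", "N. Shazeer"]},
    {"title": "Attention Is All You Need", "authors": ["Ashish Vaswani", "Noam Shazeer"]},
)
```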
ValiRef implements a production-grade resilience layer for external API calls:
```
┌──────────────┐     ┌──────────────────┐     ┌───────────────┐
│  SearchTool  │────▶│ ToolRequestQueue │────▶│  Token Bucket │
│ (per source) │     │  (rate limiter)  │     │ (smooth flow) │
└──────────────┘     └──────────────────┘     └───────────────┘
                                                      │
                                                      ▼
                                             ┌─────────────────┐
                                             │ Circuit Breaker │
                                             │ (fail-fast for  │
                                             │ unhealthy APIs) │
                                             └─────────────────┘
```
Features:
- Token Bucket Rate Limiting - Smooth request flow with configurable burst capacity per source
- Circuit Breaker Pattern - Automatically stops requests to failing services (3 failures → OPEN, 15s recovery timeout)
- Real-time Metrics - Live display of API call statistics, active requests, and circuit states
- Graceful Degradation - Failed sources are marked unavailable but don't block other sources
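A condensed sketch of the two mechanisms, using the thresholds quoted above (3 failures to open, 15 s recovery); the class names are illustrative, not the project's actual API:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling `rate` tokens per second."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        # Refill based on elapsed time, then spend one token if available
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `timeout` seconds."""
    def __init__(self, threshold: int = 3, timeout: float = 15.0):
        self.threshold, self.timeout = threshold, timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.timeout:
            # Half-open: let a single probe request through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

bucket = TokenBucket(rate=2.0, capacity=5)
breaker = CircuitBreaker()
for _ in range(3):
    breaker.record(success=False)  # three failures trip the breaker open
```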
## Benchmark

ValiRef includes a benchmark suite for evaluating hallucination detection performance.
On a 100-sample mixed dataset:
| Metric | Value |
|---|---|
| Accuracy | 72.0% |
| Precision | 1.0000 |
| Recall | 0.2800 (Counterfactual) / 1.0000 (Fabrication) |
| F1 Score | 0.4375 (Counterfactual) / 1.0000 (Fabrication) |
| Throughput | ~0.09 samples/sec |
| Duration | ~18 min (100 samples) |
Results by hallucination type:

| Hallucination Type | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| Fabrication | 100% | 1.0000 | 1.0000 | 1.0000 | 19 |
| AttributionError | 100% | 1.0000 | 1.0000 | 1.0000 | 19 |
| Irrelevance | 74% | 1.0000 | 0.7368 | 0.8485 | 19 |
| Counterfactual | 28% | 1.0000 | 0.2800 | 0.4375 | 25 |
| Real Papers | 72% | 0.0000 | 0.0000 | 0.0000 | 18 |
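The per-type F1 numbers follow directly from precision and recall; for example:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the per-type table above
counterfactual_f1 = f1(1.0, 0.28)  # 0.4375, as reported
fabrication_f1 = f1(1.0, 1.0)      # 1.0
```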
Generate a dataset:

```bash
uv run python scripts/generate_dataset.py \
    --topic cs.CL \
    --count 1000 \
    --output data/dataset.csv
```

The benchmark dataset combines real ArXiv papers with synthetic hallucinations:
| Category | Description | Percentage |
|---|---|---|
| Real | Genuine papers from ArXiv | 50% |
| Fabrication | AI-generated fake papers | 12.5% |
| Attribution Error | Real papers with wrong authors | 12.5% |
| Irrelevance | Real papers with mismatched claims | 12.5% |
| Counterfactual | Real papers with inverted claims | 12.5% |
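Hallucination injection can be sketched as below; the function and field names are hypothetical, not the project's `dataset.py` API:

```python
import random

def inject_attribution_error(paper: dict, author_pool: list[str],
                             rng: random.Random) -> dict:
    """AttributionError sample: real paper, wrong authors."""
    fake = dict(paper)
    fake["authors"] = rng.sample(author_pool, k=min(3, len(author_pool)))
    fake["label"] = "AttributionError"
    return fake

def inject_counterfactual(paper: dict) -> dict:
    """Counterfactual sample: keep the paper, invert the cited claim."""
    fake = dict(paper)
    fake["claim"] = f"Contrary to the paper's conclusion, {paper['claim']}"
    fake["label"] = "Counterfactual"
    return fake

rng = random.Random(0)
paper = {"title": "Attention Is All You Need",
         "authors": ["Ashish Vaswani"],
         "claim": "attention alone suffices for sequence transduction"}
attribution_sample = inject_attribution_error(
    paper, ["J. Doe", "R. Roe", "K. Lee", "M. Poe"], rng)
counterfactual_sample = inject_counterfactual(paper)
```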
Running tests:

```bash
# Run unit tests (fast, no external APIs)
uv run pytest

# Run integration tests (slow, requires API keys)
uv run pytest -m integration

# Run specific test
uv run pytest tests/core/test_tools.py -v
```

Project structure:

```
valiref/
├── src/
│   ├── cli.py              # Typer-based CLI interface
│   ├── cli_callbacks.py    # Progress callbacks and Live display
│   ├── core/               # Core validation engine
│   │   ├── pipeline.py     # Async validation orchestration
│   │   ├── detector.py     # LLM-based hallucination detection
│   │   ├── extract.py      # PDF/text extraction
│   │   ├── tools.py        # Academic search tools with rate limiting
│   │   ├── search_queue.py # Token bucket + circuit breaker
│   │   ├── tool_monitor.py # Real-time metrics via blinker signals
│   │   ├── config.py       # Configuration management
│   │   └── logger.py       # Rich-based logging
│   ├── bench/              # Benchmark framework
│   │   ├── crawler.py      # ArXiv paper crawler
│   │   ├── dataset.py      # Hallucination injection
│   │   ├── bench.py        # Benchmark runner with live metrics
│   │   └── schema.py       # Pydantic data models
│   └── api/                # API interface (future)
├── scripts/
│   └── generate_dataset.py # Dataset generation script
├── tests/                  # Test suite
└── data/                   # Benchmark datasets
```
Key settings in `src/core/config.py`:

| Setting | Default | Description |
|---|---|---|
| `LLM_MODEL` | `deepseek-chat` | LLM for validation |
| `LLM_TEMPERATURE` | `0.7` | Creativity vs determinism |
| `DETECTOR_TEMPERATURE` | `0.1` | Lower for consistent reasoning |
| `EXTRACTION_CHAR_LIMIT` | `20000` | Max chars from PDF references |
| `MAX_WORKERS` | `5` | Concurrent validation threads |
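A minimal sketch of how such env-overridable settings might be modeled; the project's actual `config.py` may differ:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    llm_model: str = "deepseek-chat"
    llm_temperature: float = 0.7
    detector_temperature: float = 0.1
    extraction_char_limit: int = 20000
    max_workers: int = 5

def load_settings() -> Settings:
    # Environment variables override the defaults above
    return Settings(
        llm_model=os.environ.get("LLM_MODEL", "deepseek-chat"),
        max_workers=int(os.environ.get("MAX_WORKERS", "5")),
    )

settings = load_settings()
```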
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development setup:

```bash
# Install dev dependencies
uv sync --dev

# Run linting
uv run ruff check .
uv run ruff format .

# Run tests
uv run pytest
```

This project is licensed under the MIT License; see the LICENSE file for details.
- Built with LangChain and LangGraph
- Powered by DeepSeek LLM
- Academic search via ArXiv, Semantic Scholar, OpenReview, and OpenAlex
- CLI powered by Typer and Rich
Built with ❤️ for the research community