# ValiRef: AI-Powered Citation Validation for Academic Papers
Features • Installation • Usage • How It Works • Benchmark
ValiRef is a tool for detecting hallucinated citations in academic papers. With the rise of AI-generated content, Large Language Models (LLMs) sometimes produce plausible-sounding but non-existent references. ValiRef helps researchers, reviewers, and publishers verify the authenticity of citations in PDF documents.
ValiRef targets four types of citation hallucination:

| Hallucination Type | Description | Example |
|---|---|---|
| 🔮 Fabrication | Completely fake paper that doesn't exist | A paper with a convincing title but no actual publication |
| 👤 Attribution Error | Real paper, wrong authors | Citing "Attention is All You Need" by someone other than Vaswani et al. |
| 📄 Irrelevance | Real paper, but claim doesn't match content | Citing a paper about NLP for a claim about computer vision |
| 🔄 Counterfactual | Real paper, opposite conclusion | Claiming a paper supports X when it actually argues against X |
## Features

- 🔍 Multi-Source Verification - Cross-references citations against ArXiv, Google Scholar, Semantic Scholar, OpenReview, OpenAlex, and DuckDuckGo
- 🤖 AI-Powered Detection - Uses DeepSeek LLM with ReAct reasoning to analyze search results
- ⚡ Async-First Architecture - Concurrent validation of multiple references for optimal performance
- 📊 Rich CLI Output - Beautiful terminal interface with progress bars, real-time metrics, and detailed reports
- 📈 Benchmark Suite - Built-in dataset generation and evaluation framework
- 🛡️ Resilient API Handling - Token bucket rate limiting + circuit breaker pattern for reliable external API calls
- 🎯 High Accuracy - 72%+ accuracy on a 100-sample benchmark, with confidence scoring and detailed reasoning
## Installation

Prerequisites:

- Python 3.12 or higher
- uv package manager (recommended) or pip
Install from PyPI:

```bash
pip install valiref
```

Or install from source:

```bash
# Clone the repository
git clone https://github.com/Gianthard-cyh/ValiRef.git
cd ValiRef

# Install dependencies
uv sync

# Set up environment variables
cp .env.example .env
# Edit .env and add your DeepSeek API key
```

Create a `.env` file with your API keys:
```
DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional: for enhanced search capabilities
SERPAPI_API_KEY=your_serpapi_key
SEMANTIC_SCHOLAR_API_KEY=your_semantic_scholar_key

# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
LANGCHAIN_API_KEY=your_langchain_key
LANGCHAIN_PROJECT=ValiRef
```

## Usage

```bash
# Basic usage
uv run python -m src.cli validate paper.pdf

# With concurrent workers (default: 5)
uv run python -m src.cli validate paper.pdf --workers 10

# Output as JSON
uv run python -m src.cli validate paper.pdf --json

# Enable verbose logging
uv run python -m src.cli validate paper.pdf --verbose
```

Example output:

```
Validation Summary for paper.pdf

Total References: 12
Validated: 12
Duration: 15.34s

┌─────────────────────────────────────────────────────────────────────┐
│ ✅ Reference #1 - REAL REFERENCE                                    │
├─────────────────────────────────────────────────────────────────────┤
│ Title: Attention Is All You Need                                    │
│ Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, et al.          │
│ Confidence: 0.98                                                    │
│                                                                     │
│ Reasoning:                                                          │
│ Found exact match on ArXiv (arxiv.org/abs/1706.03762). Title,       │
│ authors, and venue (NIPS 2017) all match the citation.              │
│                                                                     │
│ Evidence / Sources:                                                 │
│ - https://arxiv.org/abs/1706.03762                                  │
└─────────────────────────────────────────────────────────────────────┘
```
## How It Works

ValiRef uses a multi-step validation pipeline:

```
┌─────────────┐   ┌──────────────┐   ┌──────────────┐   ┌─────────────┐
│  PDF Input  │ → │   Extract    │ → │    Search    │ → │  Validate   │
│             │   │  References  │   │ Multi-Source │   │  with LLM   │
└─────────────┘   └──────────────┘   └──────────────┘   └─────────────┘
                                                               │
                                                               ▼
                                                        ┌─────────────┐
                                                        │   Report    │
                                                        │   Results   │
                                                        └─────────────┘
```
Reference extraction:

- Parses PDF documents using PyMuPDF
- Uses LLM to intelligently extract structured reference data from bibliography sections
- Handles various citation formats (APA, MLA, Chicago, etc.)
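As a loose illustration of the extraction step, the toy splitter below breaks a bibliography string on `[n]` markers. The real pipeline hands this text to an LLM for structured parsing, so `split_references` is purely a hypothetical sketch:

```python
import re

def split_references(bibliography: str) -> list[str]:
    """Split a plain-text bibliography into individual entries.

    Assumes entries are introduced by markers like "[1]" -- one common
    convention; the actual tool uses an LLM to handle APA/MLA/Chicago/etc.
    """
    # Split on "[n]" markers and drop empty fragments
    parts = re.split(r"\[\d+\]\s*", bibliography)
    return [p.strip() for p in parts if p.strip()]

bib = ("[1] Vaswani et al. Attention Is All You Need. NIPS 2017. "
       "[2] Devlin et al. BERT. NAACL 2019.")
entries = split_references(bib)
# One entry per citation marker
```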
Multi-source search simultaneously queries several academic databases:
- ArXiv - Preprint server with full-text access
- Google Scholar - Broad academic search
- Semantic Scholar - AI-powered academic search
- OpenReview - Peer-reviewed conference papers
- OpenAlex - Open academic graph
- DuckDuckGo - Web search fallback
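A minimal sketch of the concurrent multi-source search, assuming hypothetical `search_source`/`search_all` helpers; the real tool layers rate limiting and circuit breaking on top of each source:

```python
import asyncio

async def search_source(source: str, query: str) -> dict:
    """Stand-in for one academic-search backend (ArXiv, OpenAlex, ...)."""
    await asyncio.sleep(0)  # placeholder for the real HTTP request
    return {"source": source, "query": query, "hits": []}

async def search_all(query: str) -> list[dict]:
    # Query every source concurrently; return_exceptions=True means a
    # failed source yields an exception object instead of cancelling
    # the other searches.
    sources = ["arxiv", "semantic_scholar", "openalex", "openreview"]
    results = await asyncio.gather(
        *(search_source(s, query) for s in sources),
        return_exceptions=True,
    )
    return [r for r in results if not isinstance(r, Exception)]

results = asyncio.run(search_all("Attention Is All You Need"))
```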
The `HallucinationDetector` uses a ReAct (Reasoning + Acting) agent powered by the DeepSeek LLM:
- Analyzes search results from all sources
- Compares paper metadata (title, authors, abstract, venue)
- Evaluates claims against actual paper content
- Provides confidence scores with detailed reasoning
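As a rough stand-in for the LLM's judgment, the sketch below shows the kind of metadata comparison involved; `Verdict` and `compare_metadata` are hypothetical names, and the actual detector reasons over full search results rather than string matching:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    is_real: bool
    confidence: float
    reasoning: str

def compare_metadata(cited: dict, found: dict) -> Verdict:
    """Toy metadata check: title equality plus last-name overlap."""
    title_match = cited["title"].strip().lower() == found["title"].strip().lower()
    cited_authors = {a.split()[-1].lower() for a in cited["authors"]}
    found_authors = {a.split()[-1].lower() for a in found["authors"]}
    overlap = len(cited_authors & found_authors) / max(len(cited_authors), 1)
    if title_match and overlap >= 0.5:
        return Verdict(True, 0.9, "Title and authors match an indexed paper.")
    if title_match:
        return Verdict(False, 0.8, "Title exists but authors differ (attribution error).")
    return Verdict(False, 0.7, "No indexed paper matches this title.")

verdict = compare_metadata(
    {"title": "Attention Is All You Need", "authors": ["A. Vaswani", "N. Shazeer"]},
    {"title": "Attention Is All You Need", "authors": ["Ashish Vaswani", "Noam Shazeer"]},
)
```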
ValiRef implements a production-grade resilience layer for external API calls:
```
┌──────────────┐     ┌──────────────────┐     ┌───────────────┐
│  SearchTool  │────▶│ ToolRequestQueue │────▶│  Token Bucket │
│ (per source) │     │  (rate limiter)  │     │ (smooth flow) │
└──────────────┘     └──────────────────┘     └───────────────┘
                                                      │
                                                      ▼
                                             ┌─────────────────┐
                                             │ Circuit Breaker │
                                             │ (fail-fast for  │
                                             │ unhealthy APIs) │
                                             └─────────────────┘
```
Features:
- Token Bucket Rate Limiting - Smooth request flow with configurable burst capacity per source
- Circuit Breaker Pattern - Automatically stops requests to failing services (3 failures → OPEN, 15s recovery timeout)
- Real-time Metrics - Live display of API call statistics, active requests, and circuit states
- Graceful Degradation - Failed sources are marked unavailable but don't block other sources
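A condensed sketch of the two mechanisms, using the thresholds quoted above (3 failures to open, 15 s recovery); the class names are illustrative, not the project's actual API:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling `rate` tokens per second."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        # Refill based on elapsed time, then spend one token if available
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class CircuitBreaker:
    """Open after `threshold` consecutive failures; retry after `timeout` seconds."""
    def __init__(self, threshold: int = 3, timeout: float = 15.0):
        self.threshold, self.timeout = threshold, timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.timeout:
            # Half-open: let a single probe request through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

bucket = TokenBucket(rate=2.0, capacity=5)
breaker = CircuitBreaker()
for _ in range(3):
    breaker.record(success=False)  # three failures trip the breaker open
```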
## Benchmark

ValiRef includes a benchmark suite for evaluating hallucination detection performance.
On a 100-sample mixed dataset:
| Metric | Value |
|---|---|
| Accuracy | 72.0% |
| Precision | 1.0000 |
| Recall | 0.2800 (Counterfactual) / 1.0000 (Fabrication) |
| F1 Score | 0.4375 (Counterfactual) / 1.0000 (Fabrication) |
| Throughput | ~0.09 samples/sec |
| Duration | ~18 min (100 samples) |
Results by hallucination type:

| Hallucination Type | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| Fabrication | 100% | 1.0000 | 1.0000 | 1.0000 | 19 |
| AttributionError | 100% | 1.0000 | 1.0000 | 1.0000 | 19 |
| Irrelevance | 74% | 1.0000 | 0.7368 | 0.8485 | 19 |
| Counterfactual | 28% | 1.0000 | 0.2800 | 0.4375 | 25 |
| Real Papers | 72% | 0.0000 | 0.0000 | 0.0000 | 18 |
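The per-type F1 numbers follow directly from precision and recall; for example:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values from the per-type table above
counterfactual_f1 = f1(1.0, 0.28)  # 0.4375, as reported
fabrication_f1 = f1(1.0, 1.0)      # 1.0
```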
Generate a dataset:

```bash
uv run python scripts/generate_dataset.py \
    --topic cs.CL \
    --count 1000 \
    --output data/dataset.csv
```

The benchmark dataset combines real ArXiv papers with synthetic hallucinations:
| Category | Description | Percentage |
|---|---|---|
| Real | Genuine papers from ArXiv | 50% |
| Fabrication | AI-generated fake papers | 12.5% |
| Attribution Error | Real papers with wrong authors | 12.5% |
| Irrelevance | Real papers with mismatched claims | 12.5% |
| Counterfactual | Real papers with inverted claims | 12.5% |
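Hallucination injection can be sketched as below; the function and field names are hypothetical, not the project's `dataset.py` API:

```python
import random

def inject_attribution_error(paper: dict, author_pool: list[str],
                             rng: random.Random) -> dict:
    """AttributionError sample: real paper, wrong authors."""
    fake = dict(paper)
    fake["authors"] = rng.sample(author_pool, k=min(3, len(author_pool)))
    fake["label"] = "AttributionError"
    return fake

def inject_counterfactual(paper: dict) -> dict:
    """Counterfactual sample: keep the paper, invert the cited claim."""
    fake = dict(paper)
    fake["claim"] = f"Contrary to the paper's conclusion, {paper['claim']}"
    fake["label"] = "Counterfactual"
    return fake

rng = random.Random(0)
paper = {"title": "Attention Is All You Need",
         "authors": ["Ashish Vaswani"],
         "claim": "attention alone suffices for sequence transduction"}
attribution_sample = inject_attribution_error(
    paper, ["J. Doe", "R. Roe", "K. Lee", "M. Poe"], rng)
counterfactual_sample = inject_counterfactual(paper)
```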
Running tests:

```bash
# Run unit tests (fast, no external APIs)
uv run pytest

# Run integration tests (slow, requires API keys)
uv run pytest -m integration

# Run specific test
uv run pytest tests/core/test_tools.py -v
```

Project structure:

```
valiref/
├── src/
│   ├── cli.py              # Typer-based CLI interface
│   ├── cli_callbacks.py    # Progress callbacks and Live display
│   ├── core/               # Core validation engine
│   │   ├── pipeline.py     # Async validation orchestration
│   │   ├── detector.py     # LLM-based hallucination detection
│   │   ├── extract.py      # PDF/text extraction
│   │   ├── tools.py        # Academic search tools with rate limiting
│   │   ├── search_queue.py # Token bucket + circuit breaker
│   │   ├── tool_monitor.py # Real-time metrics via blinker signals
│   │   ├── config.py       # Configuration management
│   │   └── logger.py       # Rich-based logging
│   ├── bench/              # Benchmark framework
│   │   ├── crawler.py      # ArXiv paper crawler
│   │   ├── dataset.py      # Hallucination injection
│   │   ├── bench.py        # Benchmark runner with live metrics
│   │   └── schema.py       # Pydantic data models
│   └── api/                # API interface (future)
├── scripts/
│   └── generate_dataset.py # Dataset generation script
├── tests/                  # Test suite
└── data/                   # Benchmark datasets
```
Key settings in `src/core/config.py`:

| Setting | Default | Description |
|---|---|---|
| `LLM_MODEL` | `deepseek-chat` | LLM for validation |
| `LLM_TEMPERATURE` | `0.7` | Creativity vs determinism |
| `DETECTOR_TEMPERATURE` | `0.1` | Lower for consistent reasoning |
| `EXTRACTION_CHAR_LIMIT` | `20000` | Max chars from PDF references |
| `MAX_WORKERS` | `5` | Concurrent validation threads |
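A minimal sketch of how such env-overridable settings might be modeled; the project's actual `config.py` may differ:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    llm_model: str = "deepseek-chat"
    llm_temperature: float = 0.7
    detector_temperature: float = 0.1
    extraction_char_limit: int = 20000
    max_workers: int = 5

def load_settings() -> Settings:
    # Environment variables override the defaults above
    return Settings(
        llm_model=os.environ.get("LLM_MODEL", "deepseek-chat"),
        max_workers=int(os.environ.get("MAX_WORKERS", "5")),
    )

settings = load_settings()
```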
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development setup:

```bash
# Install dev dependencies
uv sync --dev

# Run linting
uv run ruff check .
uv run ruff format .

# Run tests
uv run pytest
```

This project is licensed under the MIT License; see the LICENSE file for details.
- Built with LangChain and LangGraph
- Powered by DeepSeek LLM
- Academic search via ArXiv, Semantic Scholar, OpenReview, and OpenAlex
- CLI powered by Typer and Rich
Built with ❤️ for the research community