A powerful document deduplication tool for paperless-ngx that identifies duplicate documents using advanced fuzzy matching and MinHash/LSH algorithms, designed to handle large document collections efficiently.
- 🌐 Modern Web UI: React TypeScript frontend with real-time updates
- ⚡ Scalable Architecture: Handles 13,000+ documents efficiently using MinHash/LSH algorithms
- 🧠 Smart Deduplication: Multi-factor similarity scoring with OCR-aware fuzzy matching
- 🚀 High Performance: Efficient SQLite storage with optimized indexing
- ⚙️ Flexible Configuration: Web-based configuration with connection testing
- 📊 Detailed Analytics: Confidence scores and space-saving calculations
- 🔄 Real-time Updates: WebSocket integration for live progress tracking
- 🐳 Container Ready: Full Docker support with docker-compose
If you're using paperless-ngx to manage your documents, you might have:
- Duplicate scans from re-scanning documents
- Multiple versions of the same document with slight OCR differences
- Similar documents that are hard to identify manually
- Large collections where manual duplicate checking is impractical
This tool helps you:
- Save storage space by identifying redundant documents
- Clean up your archive with confidence scores for each duplicate
- Process large collections efficiently (tested with 13,000+ documents)
- Maintain data integrity - only identifies duplicates, doesn't delete automatically
- Download docker-compose.yml:

  ```bash
  curl -O https://raw.githubusercontent.com/rknightion/paperless-ngx-dedupe/main/docker-compose.yml
  ```

- Start the services:

  ```bash
  docker compose up -d
  ```

- Access the application:
  - Web UI: http://localhost:30002
  - API Documentation: http://localhost:30001/docs
- Configure paperless-ngx connection:
  - Navigate to Settings in the web UI
  - Enter your paperless-ngx URL and API token
  - Click "Test Connection" to verify
That's it! The application will automatically pull the latest images from GitHub Container Registry.
To use a specific version instead of `latest`:

```bash
# Edit docker-compose.yml and replace :latest with :v1.0.0
sed -i 's/:latest/:v1.0.0/g' docker-compose.yml
docker compose up -d
```
For detailed development setup and contribution guidelines, see CONTRIBUTING.md.
```bash
# Clone the repository
git clone https://github.com/rknightion/paperless-ngx-dedupe.git
cd paperless-ngx-dedupe

# Option 1: Start both frontend and backend with hot-reloading (Recommended)
uv run python dev.py

# Option 2: Use Docker for development
docker compose -f docker-compose.dev.yml up -d

# Option 3: Manual setup
uv sync --dev
cd frontend && npm install
# Then run: uv run uvicorn paperless_dedupe.main:app --reload --port 30001
# And in another terminal: cd frontend && npm run dev
```
The `uv run python dev.py` script:
- Starts backend API on http://localhost:30001 (with hot-reloading)
- Starts frontend UI on http://localhost:3000 (with hot-reloading)
- Shows full backend logs with proper INFO/DEBUG output
- Handles all dependencies automatically via uv
- Shows color-coded logs for easy debugging
- Uses uv for proper Python environment isolation
- Automatically restarts on code changes for rapid development
The application now includes a modern React TypeScript frontend with:
- 📊 Dashboard: Overview with statistics and system status
- 📄 Documents: Virtual scrolling list for large document collections
- 🔍 Duplicates: Visual duplicate group management with confidence scores
- ⚙️ Processing: Real-time analysis control with progress tracking
- 🛠️ Settings: Connection configuration and system preferences
- Access the Web Interface: Navigate to http://localhost:3000
- Configure Connection: Go to Settings → Connection to configure your paperless-ngx API
- Test Connection: Use the "Test Connection" button to verify settings
- Sync Documents: Navigate to Documents and click "Sync from Paperless"
- Run Analysis: Go to Processing and start the deduplication analysis
- Review Duplicates: Check the Duplicates page for results
- Configure Paperless Connection:

  ```bash
  curl -X PUT http://localhost:30001/api/v1/config/ \
    -H "Content-Type: application/json" \
    -d '{
      "paperless_url": "http://your-paperless:8000",
      "paperless_api_token": "your-api-token"
    }'
  ```

- Test Connection:

  ```bash
  curl -X POST http://localhost:30001/api/v1/config/test-connection
  ```

- Sync Documents:

  ```bash
  curl -X POST http://localhost:30001/api/v1/documents/sync
  ```

- Run Deduplication Analysis:

  ```bash
  curl -X POST http://localhost:30001/api/v1/processing/analyze
  ```
| Variable | Description | Default |
|---|---|---|
| `PAPERLESS_DEDUPE_DATABASE_URL` | SQLite database file path | `sqlite:///data/paperless_dedupe.db` |
| `PAPERLESS_DEDUPE_PAPERLESS_URL` | Paperless-ngx API URL | `http://localhost:8000` |
| `PAPERLESS_DEDUPE_PAPERLESS_API_TOKEN` | API token for authentication | None |
| `PAPERLESS_DEDUPE_FUZZY_MATCH_THRESHOLD` | Similarity threshold (0-100) | 80 |
| `PAPERLESS_DEDUPE_MAX_OCR_LENGTH` | Max OCR text to store | 10000 |
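For example, these variables can be set in the `environment:` block of docker-compose.yml. This is an illustrative sketch only: the service name and values below are assumptions, not taken from the project's actual compose file.

```yaml
services:
  paperless-dedupe:   # hypothetical service name
    environment:
      PAPERLESS_DEDUPE_PAPERLESS_URL: "http://your-paperless:8000"
      PAPERLESS_DEDUPE_PAPERLESS_API_TOKEN: "your-api-token"
      PAPERLESS_DEDUPE_FUZZY_MATCH_THRESHOLD: "80"
```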
Interactive API documentation is available at http://localhost:30001/docs
- Documents
  - `GET /api/v1/documents/` - List all documents
  - `POST /api/v1/documents/sync` - Sync from paperless-ngx
  - `GET /api/v1/documents/{id}/duplicates` - Get document duplicates
- Duplicates
  - `GET /api/v1/duplicates/groups` - List duplicate groups
  - `GET /api/v1/duplicates/statistics` - Get deduplication statistics
  - `POST /api/v1/duplicates/groups/{id}/review` - Mark group as reviewed
- Processing
  - `POST /api/v1/processing/analyze` - Start deduplication analysis
  - `GET /api/v1/processing/status` - Get processing status
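As a sketch, the endpoints above can also be driven from Python using only the standard library. The base URL assumes the Docker quick-start port mapping; the helper names here are illustrative, and the response shapes are whatever the API returns.

```python
import json
import urllib.request

BASE = "http://localhost:30001/api/v1"  # backend port from the Docker quick start

def endpoint(path: str) -> str:
    """Build a full API URL from a relative path like 'duplicates/groups'."""
    return f"{BASE}/{path.lstrip('/')}"

def get_json(path: str):
    """GET an endpoint and decode its JSON body."""
    with urllib.request.urlopen(endpoint(path)) as resp:
        return json.load(resp)

def post(path: str):
    """POST with an empty body, e.g. to trigger a sync or an analysis run."""
    req = urllib.request.Request(endpoint(path), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage, with the stack running:
#   post("documents/sync")                  # sync from paperless-ngx
#   post("processing/analyze")              # start deduplication analysis
#   groups = get_json("duplicates/groups")  # inspect the resulting groups
```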
- Document Sync: Fetches documents and OCR content from paperless-ngx
- MinHash Generation: Creates compact signatures for each document
- LSH Indexing: Builds locality-sensitive hash tables for fast similarity search
- Fuzzy Matching: Applies text similarity algorithms for refined scoring
- Confidence Scoring: Calculates weighted scores based on multiple factors:
- Jaccard similarity (40%)
- Fuzzy text ratio (30%)
- Metadata matching (20%)
- Filename similarity (10%)
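The weighted scoring above can be sketched as a simple linear combination. Only the weights come from this list; the factor key names and function are hypothetical, not the project's actual code.

```python
# Weights from the list above; the keys are illustrative names only.
WEIGHTS = {
    "jaccard": 0.40,      # Jaccard similarity
    "fuzzy_ratio": 0.30,  # fuzzy text ratio
    "metadata": 0.20,     # metadata matching
    "filename": 0.10,     # filename similarity
}

def confidence(scores: dict) -> float:
    """Combine per-factor similarities (each in 0.0-1.0) into one score."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())

# A pair matching perfectly on every factor scores 1.0; a pair matching
# on text alone (Jaccard + fuzzy ratio) tops out at 0.70.
```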
- Scalability: O(n log n) complexity using LSH instead of O(n²)
- Memory Efficient: ~50MB for 13K document metadata
- Storage Strategy: File-based SQLite database for simplicity and portability
- Processing Speed: ~1000 documents/minute on modern hardware
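To illustrate why MinHash keeps this cheap: each document is reduced to a small fixed-size signature, and the fraction of matching signature slots estimates Jaccard similarity without comparing full OCR texts. A minimal pure-Python sketch follows (the real implementation uses the datasketch library credited below; the tokenization and hash choice here are illustrative):

```python
import hashlib

NUM_PERM = 64  # signature length: more "permutations" give a better estimate

def minhash(tokens: set, num_perm: int = NUM_PERM) -> list:
    """For each seeded hash function, keep the minimum hash over all tokens."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(2, "big")  # distinct salt simulates a permutation
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return sig

def estimate_jaccard(a: list, b: list) -> float:
    """Fraction of matching slots approximates the true Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = set("invoice 2024 acme corp total 42.00 eur".split())
doc2 = set("invoice 2024 acme corp total 42.00 usd".split())
print(round(estimate_jaccard(minhash(doc1), minhash(doc2)), 2))
```

LSH then bands these signatures into hash tables so that only documents sharing at least one band are compared in detail, which is what replaces the O(n²) all-pairs scan.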
```
paperless-ngx-dedupe/
├── frontend/              # React TypeScript frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── pages/         # Application pages
│   │   ├── services/      # API client and utilities
│   │   ├── store/         # Redux state management
│   │   └── hooks/         # Custom React hooks
│   ├── package.json       # Frontend dependencies
│   └── dist/              # Built frontend (served by backend)
├── src/paperless_dedupe/
│   ├── api/v1/            # REST API endpoints + WebSocket
│   ├── core/              # Configuration and settings
│   ├── models/            # Database models
│   ├── services/          # Business logic
│   └── main.py            # FastAPI application with frontend serving
├── docker-compose.yml     # Container orchestration
├── Dockerfile             # Container definition
├── pyproject.toml         # Python dependencies and build config
└── CLAUDE.md              # LLM development context
```
```bash
# Run the test suite
uv run pytest

# Run with coverage
uv run pytest --cov=paperless_dedupe
```
- Web UI with React - ✅ Complete (Phase 1)
- Enhanced Deduplication Features (Phase 2)
  - Image-based similarity with perceptual hashing
  - Custom field matching and extraction
  - ML-based detection with sentence transformers
- Performance Optimizations (Phase 3)
  - Parallel processing implementation
  - Database query optimization
  - Incremental processing with checkpoints
- Paperless Integration (Phase 4)
  - Webhook support for real-time sync
  - Automated document deletion
  - Batch resolution operations
  - Document preview and merge functionality
- Infrastructure & DevOps (Phase 5)
  - CI/CD pipeline with GitHub Actions
  - Monitoring and observability
  - Authentication and multi-tenancy
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Security: See SECURITY.md for reporting vulnerabilities
We welcome contributions! Please see CONTRIBUTING.md for:
- Development setup instructions
- Code style guidelines
- How to submit pull requests
- Testing requirements
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- paperless-ngx team for the excellent document management system
- datasketch for MinHash implementation
- rapidfuzz for fast fuzzy string matching
If you find this project useful, please consider giving it a ⭐ on GitHub!