Paperless-NGX Deduplication Tool

A powerful document deduplication tool for paperless-ngx that identifies duplicate documents using advanced fuzzy matching and MinHash/LSH algorithms, designed to handle large document collections efficiently.

Features

  • 🌐 Modern Web UI: React TypeScript frontend with real-time updates
  • ⚡ Scalable Architecture: Handles 13,000+ documents efficiently using MinHash/LSH algorithms
  • 🧠 Smart Deduplication: Multi-factor similarity scoring with OCR-aware fuzzy matching
  • 🚀 High Performance: Efficient SQLite storage with optimized indexing
  • ⚙️ Flexible Configuration: Web-based configuration with connection testing
  • 📊 Detailed Analytics: Confidence scores and space-saving calculations
  • 🔄 Real-time Updates: WebSocket integration for live progress tracking
  • 🐳 Container Ready: Full Docker support with docker-compose

Why Use This?

If you're using paperless-ngx to manage your documents, you might have:

  • Duplicate scans from re-scanning documents
  • Multiple versions of the same document with slight OCR differences
  • Similar documents that are hard to identify manually
  • Large collections where manual duplicate checking is impractical

This tool helps you:

  • Save storage space by identifying redundant documents
  • Clean up your archive with confidence scores for each duplicate
  • Process large collections efficiently (tested with 13,000+ documents)
  • Maintain data integrity: the tool only identifies duplicates and never deletes documents automatically

Quick Start

Using Docker (Recommended)

  1. Download docker-compose.yml:
curl -O https://raw.githubusercontent.com/rknightion/paperless-ngx-dedupe/main/docker-compose.yml
  2. Start the services:
docker compose up -d
  3. Access the application (see Initial Setup via Web UI below for the address)
  4. Configure paperless-ngx connection:
    • Navigate to Settings in the web UI
    • Enter your paperless-ngx URL and API token
    • Click "Test Connection" to verify

That's it! The application will automatically pull the latest images from GitHub Container Registry.

Alternative: Using Specific Version

To use a specific version instead of latest:

# Edit docker-compose.yml and replace :latest with :v1.0.0
sed -i 's/:latest/:v1.0.0/g' docker-compose.yml
docker compose up -d

Development

For detailed development setup and contribution guidelines, see CONTRIBUTING.md.

Quick Local Development Setup

# Clone the repository
git clone https://github.com/rknightion/paperless-ngx-dedupe.git
cd paperless-ngx-dedupe

# Option 1: Start both frontend and backend with hot-reloading (Recommended)
uv run python dev.py

# Option 2: Use Docker for development
docker compose -f docker-compose.dev.yml up -d

# Option 3: Manual setup
uv sync --dev
cd frontend && npm install
# Then run: uv run uvicorn paperless_dedupe.main:app --reload --port 30001
# And in another terminal: cd frontend && npm run dev

The uv run python dev.py script:

  • Starts backend API on http://localhost:30001 (with hot-reloading)
  • Starts frontend UI on http://localhost:3000 (with hot-reloading)
  • Shows full backend logs with proper INFO/DEBUG output, color-coded for easy debugging
  • Uses uv to handle all dependencies automatically and to isolate the Python environment
  • Automatically restarts on code changes for rapid development

Web Interface

The application now includes a modern React TypeScript frontend with:

  • 📊 Dashboard: Overview with statistics and system status
  • 📄 Documents: Virtual scrolling list for large document collections
  • 🔍 Duplicates: Visual duplicate group management with confidence scores
  • ⚙️ Processing: Real-time analysis control with progress tracking
  • 🛠️ Settings: Connection configuration and system preferences

Initial Setup via Web UI

  1. Access the Web Interface: Navigate to http://localhost:3000
  2. Configure Connection: Go to Settings → Connection to configure your paperless-ngx API
  3. Test Connection: Use the "Test Connection" button to verify settings
  4. Sync Documents: Navigate to Documents and click "Sync from Paperless"
  5. Run Analysis: Go to Processing and start the deduplication analysis
  6. Review Duplicates: Check the Duplicates page for results

Configuration via API

Manual API Setup (Alternative)

  1. Configure Paperless Connection:
curl -X PUT http://localhost:8000/api/v1/config/ \
  -H "Content-Type: application/json" \
  -d '{
    "paperless_url": "http://your-paperless:8000",
    "paperless_api_token": "your-api-token"
  }'
  2. Test Connection:
curl -X POST http://localhost:8000/api/v1/config/test-connection
  3. Sync Documents:
curl -X POST http://localhost:8000/api/v1/documents/sync
  4. Run Deduplication Analysis:
curl -X POST http://localhost:8000/api/v1/processing/analyze
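
The same four calls can be scripted instead of typed by hand. The following is a minimal sketch using only the Python standard library. The endpoint paths, port, and JSON field names are taken from the curl commands above; the helper names (build_request, setup_requests, run_setup) are hypothetical and not part of the project.

```python
import json
import urllib.request

# Same host and port as the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:8000/api/v1"


def build_request(method, path, payload=None):
    """Build (but do not send) a request for one of the documented endpoints."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if payload is not None else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def setup_requests(paperless_url, api_token):
    """Return the four setup requests in the order the steps above run them."""
    config = {"paperless_url": paperless_url, "paperless_api_token": api_token}
    return [
        build_request("PUT", "/config/", config),
        build_request("POST", "/config/test-connection"),
        build_request("POST", "/documents/sync"),
        build_request("POST", "/processing/analyze"),
    ]


def run_setup(paperless_url, api_token):
    """Send each request in turn; urlopen raises HTTPError on a 4xx/5xx reply."""
    for req in setup_requests(paperless_url, api_token):
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
```

Calling run_setup("http://your-paperless:8000", "your-api-token") against a running instance performs the whole sequence and stops at the first failing step, since urlopen raises on an HTTP error status.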

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| PAPERLESS_DEDUPE_DATABASE_URL | SQLite database file path | sqlite:///data/paperless_dedupe.db |
| PAPERLESS_DEDUPE_PAPERLESS_URL | Paperless-ngx API URL | http://localhost:8000 |
| PAPERLESS_DEDUPE_PAPERLESS_API_TOKEN | API token for authentication | None |
| PAPERLESS_DEDUPE_FUZZY_MATCH_THRESHOLD | Similarity threshold (0-100) | 80 |
| PAPERLESS_DEDUPE_MAX_OCR_LENGTH | Max OCR text to store | 10000 |
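
To illustrate how these variables and defaults fit together, here is a small sketch of a settings loader. It is not the project's actual configuration code, which may well be implemented differently; the Settings class and load_settings function are hypothetical names, and only the variable names and default values come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "PAPERLESS_DEDUPE_"


@dataclass
class Settings:
    # Defaults mirror the documented ones.
    database_url: str = "sqlite:///data/paperless_dedupe.db"
    paperless_url: str = "http://localhost:8000"
    paperless_api_token: Optional[str] = None
    fuzzy_match_threshold: int = 80
    max_ocr_length: int = 10000


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Overlay PAPERLESS_DEDUPE_* variables from env onto the defaults."""
    settings = Settings()
    for name in ("database_url", "paperless_url", "paperless_api_token"):
        value = env.get(PREFIX + name.upper())
        if value is not None:
            setattr(settings, name, value)
    for name in ("fuzzy_match_threshold", "max_ocr_length"):
        value = env.get(PREFIX + name.upper())
        if value is not None:
            setattr(settings, name, int(value))  # numeric variables arrive as strings
    if not 0 <= settings.fuzzy_match_threshold <= 100:
        raise ValueError("fuzzy_match_threshold must be between 0 and 100")
    return settings
```

Passing a plain dict instead of os.environ, for example load_settings({"PAPERLESS_DEDUPE_FUZZY_MATCH_THRESHOLD": "90"}), makes it easy to check how an override behaves without touching the real environment.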

API Documentation

Interactive API documentation is available at http://localhost:8000/docs

Key Endpoints

  • Documents

    • GET /api/v1/documents/ - List all documents
    • POST /api/v1/documents/sync - Sync from paperless-ngx
    • GET /api/v1/documents/{id}/duplicates - Get document duplicates
  • Duplicates

    • GET /api/v1/duplicates/groups - List duplicate groups
    • GET /api/v1/duplicates/statistics - Get deduplication statistics
    • POST /api/v1/duplicates/groups/{id}/review - Mark group as reviewed
  • Processing

    • POST /api/v1/processing/analyze - Start deduplication analysis
    • GET /api/v1/processing/status - Get processing status

How It Works

  1. Document Sync: Fetches documents and OCR content from paperless-ngx
  2. MinHash Generation: Creates compact signatures for each document
  3. LSH Indexing: Builds locality-sensitive hash tables for fast similarity search
  4. Fuzzy Matching: Applies text similarity algorithms for refined scoring
  5. Confidence Scoring: Calculates weighted scores based on multiple factors:
    • Jaccard similarity (40%)
    • Fuzzy text ratio (30%)
    • Metadata matching (20%)
    • Filename similarity (10%)
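
The weighting in step 5 amounts to a weighted sum. The sketch below assumes each factor has already been normalised to the range 0 to 1; the function name and the normalisation are illustrative assumptions, and only the four weights come from the list above.

```python
# Weights as listed above; they sum to 1.0.
WEIGHTS = {
    "jaccard": 0.40,
    "fuzzy": 0.30,
    "metadata": 0.20,
    "filename": 0.10,
}


def confidence_score(jaccard, fuzzy, metadata, filename):
    """Combine four per-factor similarities (each in [0, 1]) into one score in [0, 1]."""
    factors = {
        "jaccard": jaccard,
        "fuzzy": fuzzy,
        "metadata": metadata,
        "filename": filename,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(WEIGHTS[name] * value for name, value in factors.items())
```

For example, a pair with Jaccard similarity 0.9, fuzzy ratio 0.8, fully matching metadata (1.0), and filename similarity 0.5 scores 0.36 + 0.24 + 0.20 + 0.05 = 0.85, which would clear the default threshold of 80 if the score is compared on a 0-100 scale.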

Performance

  • Scalability: O(n log n) complexity using LSH instead of O(n²)
  • Memory Efficient: ~50MB for 13K document metadata
  • Storage Strategy: File-based SQLite database for simplicity and portability
  • Processing Speed: ~1000 documents/minute on modern hardware
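
To show why LSH avoids comparing every pair of documents, here is a self-contained sketch of MinHash signatures with banded LSH in plain Python. It is an illustration of the general technique, not the project's implementation: the signature length, band layout, word-shingle size, and all function names are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64              # signature length (assumed value)
BANDS = 16                 # 16 bands of 4 rows each (assumed layout)
ROWS = NUM_PERM // BANDS


def shingles(text, k=3):
    """Set of overlapping k-word shingles from whitespace-tokenised text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle; the seed selects one of NUM_PERM hash functions."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set):
    """Signature: for each seeded hash function, the minimum hash over the set."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty document")
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_PERM)]


def candidate_pairs(signatures):
    """Bucket signatures band by band; documents sharing any bucket become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Building the buckets takes one pass over the documents, and only documents that collide in at least one band are compared further, so the expensive fuzzy matching runs on a small candidate set rather than on all n(n-1)/2 pairs. More bands with fewer rows each catch less similar pairs at the cost of more false candidates.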

Development

Project Structure

paperless-ngx-dedupe/
├── frontend/            # React TypeScript frontend
│   ├── src/
│   │   ├── components/  # React components
│   │   ├── pages/       # Application pages
│   │   ├── services/    # API client and utilities
│   │   ├── store/       # Redux state management
│   │   └── hooks/       # Custom React hooks
│   ├── package.json     # Frontend dependencies
│   └── dist/            # Built frontend (served by backend)
├── src/paperless_dedupe/
│   ├── api/v1/          # REST API endpoints + WebSocket
│   ├── core/            # Configuration and settings
│   ├── models/          # Database models
│   ├── services/        # Business logic
│   └── main.py          # FastAPI application with frontend serving
├── docker-compose.yml   # Container orchestration
├── Dockerfile           # Container definition
├── pyproject.toml       # Python dependencies and build config
└── CLAUDE.md            # LLM development context

Running Tests

uv run pytest
uv run pytest --cov=paperless_dedupe

Roadmap

  • Web UI with React - ✅ Complete (Phase 1)
  • Enhanced Deduplication Features (Phase 2)
    • Image-based similarity with perceptual hashing
    • Custom field matching and extraction
    • ML-based detection with sentence transformers
  • Performance Optimizations (Phase 3)
    • Parallel processing implementation
    • Database query optimization
    • Incremental processing with checkpoints
  • Paperless Integration (Phase 4)
    • Webhook support for real-time sync
    • Automated document deletion
    • Batch resolution operations
    • Document preview and merge functionality
  • Infrastructure & DevOps (Phase 5)
    • CI/CD pipeline with GitHub Actions
    • Monitoring and observability
    • Authentication and multi-tenancy

Support

Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Development setup instructions
  • Code style guidelines
  • How to submit pull requests
  • Testing requirements

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Star History

If you find this project useful, please consider giving it a ⭐ on GitHub!
