s-revanth/duplicate-hunter

dupl - Secure Duplicate File Hunter

dupl is a duplicate file detection tool with built-in safety protections and performance optimizations. By default it focuses on large files (50MB and up), where removing duplicates recovers the most storage space.

Key Features

  • Advanced Security: Never touches system files; validates permissions and assesses risk before any deletion
  • Large File Focus: Defaults to 50MB+ files for maximum impact
  • High Performance: Multi-core parallel processing with smart hashing
  • Automatic Backups: Safe deletion with automatic backup creation
  • Interactive Mode: Asks for confirmation before deleting recent files
  • Detailed Reports: Comprehensive analytics and reporting
  • Clean Architecture: Modular, maintainable, secure code

Quick Start

Installation

# Clone the repository
git clone https://github.com/revanthsuddala/dupl.git
cd dupl

# Run installation script
./install.sh

Basic Usage

# Find duplicates in Downloads (dry run)
dupl ~/Downloads --min-size 50MB --dry-run

# Interactive cleanup
dupl ~/Downloads --min-size 50MB --interactive

# Generate detailed report
dupl ~/Documents --min-size 100MB --report-json "report.json"
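
The --min-size values above are human-readable strings such as 50MB. How dupl parses them is not shown in this README, so the following is only a minimal sketch of one way to do it. The helper name parse_size and the choice of binary multiples (1MB = 1024 * 1024 bytes) are assumptions, not dupl's documented behavior.

```python
import re

# Hypothetical helper: dupl's real parser and unit convention are not shown in the README.
# Binary multiples are assumed here (1 KB = 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text: str) -> int:
    """Convert a string such as '50MB' or '1.5 gb' into a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])
```

A value such as parse_size("50MB") can then be compared directly against a file's st_size.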

Documentation

Use Cases

Personal File Management

# Clean up duplicate downloads
dupl ~/Downloads --min-size 100MB --interactive

# Find duplicate photos
dupl ~/Pictures --file-types jpg,png,raw --min-size 10MB

# Clean documents folder
dupl ~/Documents --min-size 50MB --auto-delete

System Administration

# Scan entire user directory
dupl ~ --min-size 100MB --dry-run --limit 20

# Generate system-wide report
dupl ~ --min-size 500MB --report-json "system_scan.json"

Development Workflows

# Clean up project dependencies
dupl ./node_modules --min-size 1MB --dry-run

# Find duplicate build artifacts
dupl ./build --min-size 10MB --interactive

Architecture

Built around four of the SOLID design principles:

  • Single Responsibility: Each module has one clear purpose
  • Open/Closed: Easy to extend with new features
  • Dependency Inversion: High-level modules depend on abstractions
  • Interface Segregation: Focused, minimal interfaces

Module Structure

src/
├── models.py      # Data structures and types
├── scanner.py     # File discovery and collection
├── hasher.py      # Hash calculation strategies
├── cleaner.py     # Safe file deletion and backup
├── reporter.py    # Report generation and display
├── safety.py      # System protection and risk analysis
└── utils.py       # Common utilities and helpers

Performance

  • Scanning Speed: 14,000+ files/second on multi-core systems
  • Memory Usage: < 100MB for 100,000+ file scans
  • Large File Optimization: 90%+ faster than traditional full-hash approaches
  • Parallel Processing: Utilizes all available CPU cores
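
The README does not say how the speedup over full hashing is achieved. A common technique, sketched below as an assumption rather than as dupl's actual algorithm, is to narrow candidates in stages: group by file size, then by a hash of the first few kilobytes, and fully hash only the files that still collide. The function names and the 64 KiB sample size are illustrative.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict
from pathlib import Path

SAMPLE = 64 * 1024  # bytes hashed in the cheap pass (assumed value)


def _sha256(path: Path, limit: int | None = None) -> str:
    """SHA-256 of the whole file, or of its first `limit` bytes."""
    h = hashlib.sha256()
    remaining = limit
    with path.open("rb") as f:
        while remaining is None or remaining > 0:
            size = 1 << 20 if remaining is None else min(1 << 20, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def _regroup(groups, key):
    """Split each group by `key`, keeping only sub-groups with two or more files."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_duplicates(root: Path, min_size: int = 50 * 1024**2) -> list[list[Path]]:
    """Return groups of files under `root` whose contents hash identically."""
    files = [
        p for p in root.rglob("*")
        if p.is_file() and not p.is_symlink() and p.stat().st_size >= min_size
    ]
    groups = _regroup([files], lambda p: p.stat().st_size)   # stage 1: same size
    groups = _regroup(groups, lambda p: _sha256(p, SAMPLE))  # stage 2: same head
    return _regroup(groups, _sha256)                         # stage 3: full hash
```

Most large files are eliminated in the first two stages without being read in full, which is where an approach like this saves time compared with hashing every file end to end.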

Safety Features

  1. System file protection - the tool never touches OS files
  2. Permission validation - checks file access rights before operations
  3. Automatic backups - creates timestamped backups before deletion
  4. Risk assessment - analyzes files for potential deletion risks
  5. Confirmation for recent files - prevents accidental deletion
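
As an illustration of how the protection, confirmation, and backup steps above might fit together, here is a minimal sketch. The protected path list, the seven-day "recent" threshold, and all function names are assumptions made for this example; dupl's real rules are not listed in this README. Path.is_relative_to requires Python 3.9 or later.

```python
from __future__ import annotations

import shutil
import time
from datetime import datetime
from pathlib import Path

# Illustrative values only; the paths and threshold dupl actually uses are not documented here.
PROTECTED_ROOTS = [Path(p) for p in ("/System", "/usr", "/bin", "/sbin", "/etc")]
RECENT_SECONDS = 7 * 24 * 3600


def is_protected(path: Path, roots=PROTECTED_ROOTS) -> bool:
    """True if `path` resolves to a location inside any protected root."""
    resolved = path.resolve()
    return any(resolved.is_relative_to(Path(r).resolve()) for r in roots)


def is_recent(path: Path, now: float | None = None) -> bool:
    """True if the file was modified within the last RECENT_SECONDS."""
    now = time.time() if now is None else now
    return now - path.stat().st_mtime < RECENT_SECONDS


def delete_with_backup(path: Path, backup_dir: Path, confirm=lambda p: False) -> Path | None:
    """Back up `path` into a timestamped folder, then delete it.

    Returns the backup location, or None if the file was skipped.
    """
    if is_protected(path):
        return None
    if is_recent(path) and not confirm(path):
        return None
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target_dir = backup_dir / stamp
    target_dir.mkdir(parents=True, exist_ok=True)
    backup = target_dir / path.name  # note: same-named files in one run would collide
    shutil.copy2(path, backup)       # copy2 preserves timestamps where possible
    path.unlink()
    return backup
```

The confirm callback defaults to refusing, so a recent file is only removed when the caller explicitly approves it.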

Advanced Options

# Performance tuning
dupl ~/Downloads --parallel 8 --min-size 100MB

# File type filtering
dupl ~/Pictures --file-types jpg,png,raw,tiff --min-size 10MB

# Limit results
dupl ~/Documents --min-size 50MB --limit 5 --dry-run

# Custom backup location
dupl ~/Downloads --min-size 100MB --backup-dir ~/dupl_backups
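
For a sense of what a worker count like --parallel 8 could control, the sketch below hashes a batch of files with a pool of workers. Using threads is an assumption made for this example (hashlib releases the GIL while hashing large buffers, so threads can overlap both I/O and hashing); dupl may well use processes or a different scheme. The names sha256_file and hash_many are hypothetical.

```python
from __future__ import annotations

import hashlib
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_file(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_many(paths, workers: int | None = None) -> dict[Path, str]:
    """Hash every path using `workers` threads (defaults to the CPU count)."""
    paths = list(paths)
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(paths, pool.map(sha256_file, paths)))
```

Calling hash_many(files, workers=8) would correspond to the --parallel 8 example above under these assumptions.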

Reporting

Text Reports

Detailed human-readable reports with:

  • File type distribution
  • Space waste analysis
  • Risk assessment
  • Performance metrics

JSON Reports

Machine-readable reports for:

  • Integration with other tools
  • Data analysis
  • Automation workflows
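
The JSON schema dupl writes is not documented in this README. The sketch below shows one plausible shape for such a report, built from groups of duplicate paths, with a simple space waste figure per group. Every field name here (group_count, total_wasted_bytes, groups, size_bytes, files, wasted_bytes) is hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def build_report(groups: list[list[Path]]) -> dict:
    """Summarize duplicate groups; one copy per group is kept, the rest counts as waste."""
    entries = []
    for group in groups:
        size = group[0].stat().st_size
        entries.append({
            "size_bytes": size,
            "files": sorted(str(p) for p in group),
            "wasted_bytes": size * (len(group) - 1),
        })
    return {
        "group_count": len(entries),
        "total_wasted_bytes": sum(e["wasted_bytes"] for e in entries),
        "groups": entries,
    }


def write_report(groups: list[list[Path]], destination: Path) -> None:
    """Write the summary as indented JSON so other tools can load it."""
    destination.write_text(json.dumps(build_report(groups), indent=2), encoding="utf-8")
```

A downstream script could then call json.loads on the file and, for example, sort groups by wasted_bytes to decide what to review first.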

Testing

# Run tests
python -m pytest tests/

# Run with coverage
python -m pytest --cov=src tests/

# Performance testing
python -m pytest tests/test_performance.py

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone and setup
git clone https://github.com/revanthsuddala/dupl.git
cd dupl

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run tests
python -m pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with dedication and lots of coffee
  • Inspired by the need to clean up duplicate files safely
  • Thanks to the Python community for amazing libraries and tools

Support


Star this repository if it helped you!

Built with security and performance in mind.
