A professional-grade duplicate file detection tool with advanced security features and high-performance optimizations. It focuses on large files, where deduplication recovers the most storage space.
- Advanced Security: Comprehensive safeguards ensure system files are never touched
- Large File Focus: Defaults to 50MB+ files for maximum impact
- High Performance: Multi-core parallel processing with smart hashing
- Automatic Backups: A timestamped backup is created before every deletion
- Interactive Mode: Prompts for confirmation before deleting recently modified files
- Detailed Reports: Comprehensive analytics and reporting
- Clean Architecture: Modular, maintainable, secure code
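The large-file focus and "smart hashing" combine naturally into a two-stage pipeline: group candidates by exact byte size first, then hash only the sizes that collide. The sketch below is a minimal illustration under those assumptions, not the tool's actual implementation; the `find_duplicates` name and the 50MB default are invented for the example.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths, min_size=50 * 1024 * 1024):
    """Group candidate files by size, then confirm matches with SHA-256.

    Hypothetical sketch of a size-first strategy: only files that share
    an exact byte size are ever hashed, so most files are never read.
    """
    by_size = defaultdict(list)
    for path in paths:
        size = os.path.getsize(path)
        if size >= min_size:
            by_size[size].append(path)

    duplicates = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size cannot have duplicates
        for path in group:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            duplicates[h.hexdigest()].append(path)
    return {k: v for k, v in duplicates.items() if len(v) > 1}
```

Because a unique file size rules out duplication outright, the expensive hashing step runs only on the small fraction of files whose sizes collide.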
# Clone the repository
git clone https://github.com/revanthsuddala/dupl.git
cd dupl
# Run installation script
./install.sh

# Find duplicates in Downloads (dry run)
dupl ~/Downloads --min-size 50MB --dry-run
# Interactive cleanup
dupl ~/Downloads --min-size 50MB --interactive
# Generate detailed report
dupl ~/Documents --min-size 100MB --report-json "report.json"

- Storyline - Complete development journey and technical deep-dive
- Installation Guide - Step-by-step setup instructions
- Usage Examples - Common use cases and commands
# Clean up duplicate downloads
dupl ~/Downloads --min-size 100MB --interactive
# Find duplicate photos
dupl ~/Pictures --file-types jpg,png,raw --min-size 10MB
# Clean documents folder
dupl ~/Documents --min-size 50MB --auto-delete

# Scan entire user directory
dupl ~ --min-size 100MB --dry-run --limit 20
# Generate system-wide report
dupl ~ --min-size 500MB --report-json "system_scan.json"

# Clean up project dependencies
dupl ./node_modules --min-size 1MB --dry-run
# Find duplicate build artifacts
dupl ./build --min-size 10MB --interactive

Built with secure software engineering principles:
- Single Responsibility: Each module has one clear purpose
- Open/Closed: Easy to extend with new features
- Dependency Inversion: High-level modules depend on abstractions
- Interface Segregation: Focused, minimal interfaces
src/
├── models.py # Data structures and types
├── scanner.py # File discovery and collection
├── hasher.py # Hash calculation strategies
├── cleaner.py # Safe file deletion and backup
├── reporter.py # Report generation and display
├── safety.py # System protection and risk analysis
└── utils.py # Common utilities and helpers
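As a hedged sketch of how the Dependency Inversion principle above could play out in a module like hasher.py, high-level code can depend on an abstract hashing strategy rather than a concrete one. The class names here (`HashStrategy`, `FullHash`, `SampledHash`) are invented for illustration and need not match the real module:

```python
from abc import ABC, abstractmethod
import hashlib

class HashStrategy(ABC):
    """Abstraction the high-level scanner depends on (Dependency Inversion)."""

    @abstractmethod
    def digest(self, path: str) -> str: ...

class FullHash(HashStrategy):
    """Hash the entire file: exact, but slow on very large files."""
    def digest(self, path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

class SampledHash(HashStrategy):
    """Hash only the first N bytes: a fast pre-filter for large files."""
    def __init__(self, sample_bytes: int = 1 << 16):
        self.sample_bytes = sample_bytes

    def digest(self, path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read(self.sample_bytes)).hexdigest()
```

Swapping strategies then requires no change to the calling code, which is the Open/Closed property the list above describes.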
- Scanning Speed: 14,000+ files/second on multi-core systems
- Memory Usage: < 100MB for 100,000+ file scans
- Large File Optimization: 90%+ faster than traditional full-hash approaches
- Parallel Processing: Utilizes all available CPU cores
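The parallel-processing claim above can be sketched with a thread pool: CPython's hashlib releases the GIL while hashing buffers larger than a couple of kilobytes, so threads can keep multiple cores busy while also overlapping disk I/O. The `hash_many` helper below is an assumption for this example, not the tool's real API:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def hash_one(path):
    """SHA-256 of a whole file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_many(paths, workers=None):
    """Hash files concurrently across available cores (hypothetical sketch)."""
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return dict(pool.map(hash_one, paths))
```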
- System file protection - tool will never touch OS files
- Permission validation - checks file access rights before operations
- Automatic backups - creates timestamped backups before deletion
- Risk assessment - analyzes files for potential deletion risks
- Recent files require confirmation - prevents accidental deletion
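The backup-before-delete flow in the list above can be sketched as: refuse protected paths, copy the file to a timestamped backup, and only then delete. This is a minimal illustration; `safe_delete` and the `PROTECTED_PREFIXES` list are assumptions here, and the real tool's risk analysis is more involved.

```python
import shutil
import time
from pathlib import Path

# Hypothetical deny-list; the real tool's protection is broader.
PROTECTED_PREFIXES = ("/bin", "/boot", "/etc", "/usr", "/System", "/Windows")

def safe_delete(path, backup_dir):
    """Refuse system paths, back the file up with a timestamp, then delete."""
    path = Path(path).resolve()
    if str(path).startswith(PROTECTED_PREFIXES):
        raise PermissionError(f"refusing to touch system path: {path}")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = backup_dir / f"{path.name}.{stamp}.bak"
    shutil.copy2(path, backup)   # backup first ...
    path.unlink()                # ... only then delete
    return backup
```

Ordering matters: the copy happens before the unlink, so a crash between the two steps can leave an extra backup but never lose data.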
# Performance tuning
dupl ~/Downloads --parallel 8 --min-size 100MB
# File type filtering
dupl ~/Pictures --file-types jpg,png,raw,tiff --min-size 10MB
# Limit results
dupl ~/Documents --min-size 50MB --limit 5 --dry-run
# Custom backup location
dupl ~/Downloads --min-size 100MB --backup-dir ~/dupl_backups

Detailed human-readable reports with:
- File type distribution
- Space waste analysis
- Risk assessment
- Performance metrics
Machine-readable reports for:
- Integration with other tools
- Data analysis
- Automation workflows
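A downstream script can consume the `--report-json` output directly. The field names below (`duplicate_groups`, `wasted_bytes`, `files`) are a hypothetical schema for illustration; check an actual report before relying on them.

```python
import json

def reclaimable_bytes(report_json: str) -> int:
    """Sum wasted bytes across duplicate groups in a dupl JSON report.

    Assumes a hypothetical schema with a top-level "duplicate_groups"
    list whose entries carry a "wasted_bytes" field.
    """
    report = json.loads(report_json)
    return sum(g["wasted_bytes"] for g in report["duplicate_groups"])
```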
# Run tests
python -m pytest tests/
# Run with coverage
python -m pytest --cov=src tests/
# Performance testing
python -m pytest tests/test_performance.py

We welcome contributions! Please see our Contributing Guidelines for details.
# Clone and setup
git clone https://github.com/revanthsuddala/dupl.git
cd dupl
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run tests
python -m pytest

This project is licensed under the MIT License - see the LICENSE file for details.
- Built with dedication and lots of coffee
- Inspired by the need to clean up duplicate files safely
- Thanks to the Python community for amazing libraries and tools
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Wiki: Project Wiki
Star this repository if it helped you!
Built with security and performance in mind.