v2.0.0 — Deduplication-based archiver built for cold storage.
Tesseract scans a directory, detects duplicate files using multi-stage content-aware matching, and compresses only unique data into a single .tesseract archive. Designed for archiving large drives (12TB+) where duplicate files waste significant space.
- Content-aware deduplication — 3-stage pipeline: metadata grouping → partial hash (first + last 64KB using BLAKE3) → full BLAKE3
- Zstandard compression — modern zstd compression with adaptive levels (fast for small files, full level for large files)
- Failsafe staged encoding — files are compressed into verified multi-file shards (~500MB each) before atomic assembly; source files are never modified
- BLAKE3 hashing — used for all partial and full hashing; ~6x faster than SHA-256 with strong cryptographic guarantees
- Persistent hash cache — SQLite-backed cache for scan results, partial hashes, and full hashes; survives interruptions
- AES-256-GCM encryption — password-based with PBKDF2-HMAC-SHA256 (600K iterations)
- Solid compression — optional single continuous compressed stream for better ratios
- Recovery records — XOR parity-based self-repair (1-30% redundancy)
- Multi-volume splitting — split archives for FAT32 or media size limits
- Archive comments — embed text metadata in archives
- File permission storage — optional preservation of file permissions
- Archive locking — mark archives as finalized
- Polished terminal UI — rich-powered progress, cleaner summaries, and automatic fallback for plain terminals
- Fully parallelized — multi-threaded hashing, deduplication, compression, verification, and preflight checks
- Python ≥ 3.9
- tqdm ≥ 4.60.0
- rich ≥ 13.7.0
- cryptography ≥ 41.0.0
- blake3 ≥ 0.3.0
- zstandard ≥ 0.19.0
pip install -e .
For development (includes pytest):
pip install -e ".[dev]"
# Compress a directory
tesseract encode "D:\MyFiles" "D:\Backups\myfiles.tesseract"
# Restore it
tesseract decode "D:\Backups\myfiles.tesseract" "D:\Restored"

Create a .tesseract archive from a directory.
The CLI now prefers a single live progress region with structured summaries instead of stacking multiple tqdm bars. Set the environment variable TESSERACT_PLAIN=1 if you want plain terminal output.
tesseract encode <source> <output> [options]
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --workers | -w | int | CPU count - 1 | Number of CPU cores to use |
| --compression-level | -c | 1-22 | 9 | zstd compression level |
| --exclude | -e | str | none | Glob pattern to exclude (repeatable) |
| --solid | -s | flag | off | Solid compression mode (better ratio, slower random access) |
| --password | -p | str | none | Encrypt with this password |
| --encrypt | | flag | off | Encrypt (prompts for password securely) |
| --recovery | -r | 1-30 | 0 | Add recovery records (% of archive size) |
| --comment | -m | str | none | Embed a text comment in the archive |
| --permissions | | flag | off | Store file permissions |
| --lock | | flag | off | Mark archive as finalized |
| --verbose | -v | flag | off | Verbose logging |

Examples:
# Basic encode (the drive root is left unquoted: a trailing backslash before a closing quote would escape the quote)
tesseract encode H:\ "X:\Backup\h_drive.tesseract"
# 30 threads, verbose
tesseract encode H:\ "X:\Backup\h_drive.tesseract" -w 30 -v
# Encrypted with 5% recovery and a comment
tesseract encode "D:\Photos" "E:\archive.tesseract" --encrypt -r 5 -m "Photos backup 2026"
# Solid mode, max compression, exclude temp files
tesseract encode "D:\Projects" "E:\projects.tesseract" -s -c 22 -e "*.tmp" -e "node_modules" -e "__pycache__"

Extract a .tesseract archive to a directory.
tesseract decode <archive> <output> [options]
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --workers | -w | int | CPU count - 1 | Number of CPU cores |
| --password | -p | str | none | Password for encrypted archives |
| --extract | -x | str | none | Extract only files matching this glob (repeatable) |
| --no-verify | | flag | off | Skip post-extraction hash verification |
| --overwrite | | flag | off | Overwrite existing output files |
| --verbose | -v | flag | off | Verbose logging |

Examples:
# Basic decode
tesseract decode "E:\archive.tesseract" "D:\Restored"
# Encrypted archive
tesseract decode "E:\archive.tesseract" "D:\Restored" -p mypassword
# Extract only photos
tesseract decode "E:\archive.tesseract" "D:\Restored" -x "photos/*" -x "*.jpg"
# Overwrite existing files
tesseract decode "E:\archive.tesseract" "D:\Restored" --overwrite

Display archive metadata without extracting.
tesseract info <archive> [options]
| Flag | Short | Type | Description |
|---|---|---|---|
| --password | -p | str | Password for encrypted archives |
| --list-files | -l | flag | List all files in the archive |
| --list-groups | -g | flag | List duplicate groups |

Verify archive integrity without extracting.
tesseract verify <archive> [options]
| Flag | Short | Type | Description |
|---|---|---|---|
| --password | -p | str | Password for encrypted archives |
| --verbose | -v | flag | Verbose logging |

Split an archive into multi-volume parts.
tesseract split <archive> [options]
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| --size | -s | int (MB) | 100 | Size of each volume in megabytes |

Output files are named archive.001, archive.002, etc.
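
The naming scheme above can be illustrated with a small sketch. This is not Tesseract's actual implementation: `split_file` and `join_files` are hypothetical helpers, the sketch assumes 1 MB means 1024 × 1024 bytes, and it assumes the volume name is built from the archive's stem. The real command may also write volume metadata that is ignored here.

```python
from pathlib import Path

DEFAULT_VOLUME_BYTES = 100 * 1024 * 1024  # assumed meaning of the 100 MB default


def split_file(archive: Path, volume_bytes: int = DEFAULT_VOLUME_BYTES) -> list:
    """Write archive.001, archive.002, ... next to `archive` (hypothetical helper)."""
    parts = []
    with archive.open("rb") as src:
        index = 1
        while True:
            chunk = src.read(volume_bytes)  # one whole volume is held in memory
            if not chunk:
                break
            part = archive.with_name(f"{archive.stem}.{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts: list, output: Path) -> Path:
    """Concatenate the volumes in numeric-suffix order (hypothetical helper)."""
    with output.open("wb") as dst:
        for part in sorted(parts):
            dst.write(part.read_bytes())
    return output
```

Zero-padded suffixes keep the volumes in the right order under a plain lexicographic sort, which is why `join_files` can rely on `sorted`.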
Reassemble multi-volume archive parts.
tesseract join <first_volume> [options]
| Flag | Short | Type | Description |
|---|---|---|---|
| --output | -o | str | Output path (default: auto-named alongside volumes) |

Attempt to repair a damaged archive using embedded recovery records.
tesseract repair <archive> [options]
Requires recovery records (the -r flag during encode). Without them, repair is not possible.
Display the comment embedded in an archive.
tesseract comment <archive>
1. Scan: recursively finds all files (cached in SQLite for subsequent runs)
2. Deduplicate: 3-stage content matching (metadata → partial BLAKE3 over the first + last 64KB → full BLAKE3)
3. Hash: computes full BLAKE3 for all unique files (parallel, cached)
4. Preflight: verifies every source file is readable and snapshots hashes
5. Stage shards: compresses files into ~500MB multi-file shards (parallel with zstd)
6. Verify shards: re-reads every shard from disk and verifies CRC32 integrity
7. Verify source: re-checks all source files haven't changed since step 4
8. Assemble: streams verified shards into a .tmp archive file
9. Verify archive: validates the assembled archive structure and manifest
10. Finalize: atomic rename from .tmp to .tesseract
11. Cleanup: removes staging directory and any temp files

If any step fails, source files remain completely untouched. The staging directory and .tmp file are cleaned up automatically.
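
The verify, assemble, and finalize steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in safeguard.py: `crc32_of` and `assemble_atomically` are hypothetical names, shards are concatenated as-is, and the temporary file is assumed to be the final name with `.tmp` appended. The point is the pattern: check every shard first, write to a temporary file, and only then rename it into place.

```python
import os
import zlib
from pathlib import Path


def crc32_of(path: Path, chunk_size: int = 1 << 20) -> int:
    """Stream a file from disk and return its CRC32."""
    crc = 0
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def assemble_atomically(shards: list, final_path: Path) -> None:
    """shards is a list of (path, expected_crc32) pairs (hypothetical layout)."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        # Re-read every shard before anything is written to the output.
        for shard, expected in shards:
            if crc32_of(shard) != expected:
                raise ValueError(f"shard failed CRC32 check: {shard}")
        with tmp_path.open("wb") as out:
            for shard, _ in shards:
                out.write(shard.read_bytes())
            out.flush()
            os.fsync(out.fileno())
        # os.replace is atomic when both paths are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On any failure, remove the partial output; inputs are never touched.
        tmp_path.unlink(missing_ok=True)
        raise
```

Because the final name only appears through the rename, a reader never sees a half-written archive under that name.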
Tesseract maintains a .hashcache SQLite database alongside the output archive with three tables:
- Full hashes — keyed on (filepath, size, mtime_ns), auto-invalidated when files change
- Partial hashes — 64KB BLAKE3 for deduplication, also auto-invalidated
- Scan cache — directory scan results stored as JSON, validated against file metadata
On subsequent runs, cached hashes are loaded instantly — only new or modified files need to be re-hashed.
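
A minimal sketch of the full-hash table shows how keying on (filepath, size, mtime_ns) gives invalidation for free: a changed file produces a different key, so the stale row simply never matches. The table and column names are assumptions, and `hashlib.blake2b` is used as a standard-library stand-in for BLAKE3 so the example runs without extra packages.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path: str) -> sqlite3.Connection:
    """Open (or create) the cache; the schema here is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS full_hashes ("
        " filepath TEXT, size INTEGER, mtime_ns INTEGER, digest TEXT,"
        " PRIMARY KEY (filepath, size, mtime_ns))"
    )
    return conn


def cached_hash(conn: sqlite3.Connection, path: str):
    """Return (digest, was_cached) for `path`."""
    st = os.stat(path)
    key = (os.path.abspath(path), st.st_size, st.st_mtime_ns)
    row = conn.execute(
        "SELECT digest FROM full_hashes"
        " WHERE filepath = ? AND size = ? AND mtime_ns = ?",
        key,
    ).fetchone()
    if row is not None:
        return row[0], True
    # Cache miss: hash the file (BLAKE2b stands in for BLAKE3 here).
    with open(path, "rb") as f:
        digest = hashlib.blake2b(f.read()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO full_hashes VALUES (?, ?, ?, ?)",
        (*key, digest),
    )
    conn.commit()
    return digest, False
```

Committing after each insert is slow for millions of files; a real cache would likely batch writes, but per-row commits make the "survives interruptions" property easy to see.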
┌─────────────────────────────┐
│ Header (128 bytes) │ Magic, flags, offsets, salt, password check
├─────────────────────────────┤
│ Comment (variable) │ UTF-8 text, up to 64KB
├─────────────────────────────┤
│ Data Blocks │ Compressed file data (normal or solid mode)
│ Normal: per-file blocks │ 80-byte block header + zstd stream [+ AES-GCM]
│ Solid: single stream │ 16-byte solid header + continuous zstd [+ AES-GCM]
├─────────────────────────────┤
│ Manifest (gzip JSON) │ File metadata, dedup groups, offsets [+ AES-GCM]
├─────────────────────────────┤
│ Footer (8 bytes) │ Magic bytes
├─────────────────────────────┤
│ Recovery Records (optional) │ XOR parity blocks for self-repair
└─────────────────────────────┘
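
To make the manifest section of the layout concrete, here is a small round trip through gzip-compressed JSON. Only the "gzip JSON" encoding comes from the layout above; the field names in `example_manifest` are invented for illustration and are not Tesseract's real schema, and the optional AES-GCM layer is left out.

```python
import gzip
import json


def pack_manifest(manifest: dict) -> bytes:
    """Serialize a manifest dict to gzip-compressed UTF-8 JSON."""
    raw = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return gzip.compress(raw)


def unpack_manifest(blob: bytes) -> dict:
    """Inverse of pack_manifest."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical manifest: the duplicate entry stores no data of its own and
# points at the same data offset as the master copy.
example_manifest = {
    "files": [
        {"path": "photos/a.jpg", "size": 1024, "offset": 128},
        {"path": "backup/a_copy.jpg", "size": 1024, "offset": 128},
    ]
}

blob = pack_manifest(example_manifest)
restored = unpack_manifest(blob)
```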
python -m pytest tests/ -q

145 tests covering the full pipeline, including safety, encryption, recovery, deduplication, encoding/decoding roundtrips, and archive format validation.
MIT License — see LICENSE.
Files are grouped by (size, extension) → partial hash (first + last 64KB using BLAKE3) → full BLAKE3.
Only one copy of each unique file is stored. Duplicates reference the master copy's data offset in the manifest.
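
The three stages can be sketched as successive regrouping passes, each one more expensive and applied only to files that still collide. This is an illustration, not the code in deduplicator.py: the function names are hypothetical, and `hashlib.blake2b` stands in for BLAKE3 so the sketch needs only the standard library.

```python
import hashlib
import os
from collections import defaultdict

PARTIAL = 64 * 1024  # bytes read from each end of the file


def partial_digest(path: str) -> str:
    """Hash the first and last 64KB (BLAKE2b as a stand-in for BLAKE3)."""
    size = os.path.getsize(path)
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        h.update(f.read(PARTIAL))
        if size > PARTIAL:
            # Start the tail read after the head so the two never overlap.
            f.seek(max(PARTIAL, size - PARTIAL))
            h.update(f.read(PARTIAL))
    return h.hexdigest()


def full_digest(path: str, chunk: int = 1 << 20) -> str:
    """Hash the entire file in 1MB chunks."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()


def _regroup(groups, key_fn):
    """Split each group by key_fn and drop groups with a single member."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key_fn(path)].append(path)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out


def find_duplicates(paths):
    """Return lists of paths whose contents are identical."""
    groups = [list(paths)]
    stages = (
        lambda p: (os.path.getsize(p), os.path.splitext(p)[1]),  # metadata
        partial_digest,                                          # first + last 64KB
        full_digest,                                             # whole file
    )
    for key_fn in stages:
        groups = _regroup(groups, key_fn)
    return groups
```

One consequence of grouping on (size, extension) first is that identical content stored under different extensions is never compared, so it would not be treated as a duplicate in this sketch.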
- AES-256-GCM authenticated encryption
- Key derived via PBKDF2-HMAC-SHA256 with 600,000 iterations
- Random 16-byte salt per archive
- Each encryption operation uses a unique nonce
- Manifest is also encrypted
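
The key-derivation parameters listed above map directly onto the standard library's `hashlib.pbkdf2_hmac`. The sketch below covers only that step; `derive_key` is a hypothetical helper, and how Tesseract stores the salt and password check in the header is not shown.

```python
import hashlib
import os
from typing import Optional, Tuple

ITERATIONS = 600_000  # PBKDF2 iteration count from the list above
SALT_LEN = 16         # random salt per archive
KEY_LEN = 32          # 32 bytes = 256-bit key for AES-256


def derive_key(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (key, salt); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(SALT_LEN)
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=KEY_LEN
    )
    return key, salt
```

When decoding, the salt read back from the archive is passed in so the same password yields the same key. The derived key would then be handed to an AES-256-GCM implementation, for example the `AESGCM` class in the `cryptography` package, with a fresh nonce for each operation.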
- XOR parity computed over 512KB slices of the data region
- Configurable redundancy (1-30% of archive size)
- Can repair single-slice corruption per parity group
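
XOR parity works because XOR-ing the parity block with every intact slice in a group leaves exactly the missing slice. The sketch below demonstrates this on tiny slices; the helper names are hypothetical, the last slice is assumed to be zero-padded, and how Tesseract sizes parity groups for a given redundancy percentage is not covered.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    n = len(a)
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(n, "big")


def split_slices(data: bytes, slice_size: int) -> list:
    """Cut data into fixed-size slices, zero-padding the last one."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    if slices:
        slices[-1] = slices[-1].ljust(slice_size, b"\x00")
    return slices


def parity_of(slices: list) -> bytes:
    """XOR all slices of one parity group together."""
    parity = bytes(len(slices[0]))
    for s in slices:
        parity = xor_bytes(parity, s)
    return parity


def repair_slice(slices: list, parity: bytes, bad_index: int) -> bytes:
    """Rebuild the slice at bad_index from the parity and the other slices."""
    recovered = parity
    for i, s in enumerate(slices):
        if i != bad_index:
            recovered = xor_bytes(recovered, s)
    return recovered


# Tiny demonstration with 4-byte slices instead of 512KB ones.
group = split_slices(b"abcdefghij", 4)
parity = parity_of(group)
damaged = list(group)
damaged[1] = b"????"  # simulate corruption of one slice
fixed = repair_slice(damaged, parity, 1)
```

If two slices in the same group are damaged, the XOR no longer isolates either one, which matches the single-slice limit stated above.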
tesseract/
├── __init__.py # Package exports, version
├── __main__.py # Entry point (python -m tesseract)
├── cli.py # CLI argument parsing and command handlers
├── encoder.py # Staged encoding pipeline
├── decoder.py # Archive extraction
├── safeguard.py # Failsafe staging, CRC verification, preflight checks
├── scanner.py # Recursive file discovery with exclusion patterns
├── deduplicator.py # 3-stage duplicate detection
├── hasher.py # BLAKE3 partial and full hashing
├── manifest.py # Archive manifest (gzip JSON)
├── archive_format.py # Binary format v2 pack/unpack
├── encryption.py # AES-256-GCM + PBKDF2 key derivation
├── recovery.py # XOR parity recovery records
└── volume.py # Multi-volume split/join
pip install -e ".[dev]"
python -m pytest tests/ -v

145 tests covering all modules, pipeline roundtrips, encryption, recovery, staging safety, and edge cases.
