| 1 | +# 🔁 Deduplication in Sietch Vault |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +Sietch Vault uses **content-defined chunking** and **cryptographic deduplication** to minimize redundant data storage. |
| 6 | +Instead of storing a whole new copy of a file whenever it changes, Sietch splits each file into fixed- or variable-sized chunks (default: 4 MB), computes a cryptographic hash of each chunk, and stores only the chunks it has not stored before. |
| 7 | + |
| 8 | +This makes syncing and storage efficient: identical data across files and folders is stored only once, which reduces disk usage and sync time. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## How Deduplication Works |
| 13 | + |
| 14 | +Deduplication in Sietch happens at the **chunk level**, not the file level. |
| 15 | + |
| 16 | +1. **Chunking** |
| 17 | + - Every file added to a vault is split into smaller data chunks. |
| 18 | + - Chunk size is configurable (`--chunk-size` flag, default 4 MB). |
| 19 | + - Chunk boundaries can be static or computed using a **rolling hash**, which helps detect identical regions even when an insertion or deletion shifts the data that follows. |
| 20 | + |
| 21 | +2. **Hashing** |
| 22 | + - Each chunk is assigned two hashes: |
| 23 | + - **Content Hash:** A cryptographic hash (e.g., SHA-256) of the unencrypted chunk. |
| 24 | + - **Storage Hash:** A hash of the encrypted chunk (post-encryption). |
| 25 | + |
| 26 | +3. **Deduplication Check** |
| 27 | + - When a new chunk is processed, Sietch checks if the **content hash** already exists in the vault index. |
| 28 | + - If found, it reuses the existing chunk reference instead of storing a duplicate. |
| 29 | + |
| 30 | +4. **Encryption** |
| 31 | + - Chunks are encrypted **after** the deduplication check. |
| 32 | + - Because the content hash is computed on the plaintext, identical chunks always yield the same content hash (so dedup works), while the **storage hash** identifies the resulting encrypted blob. |
| 33 | + - Supported encryption modes: |
| 34 | + - **AES-256-GCM** |
| 35 | + - **ChaCha20-Poly1305** |
| 36 | + |
| 37 | +5. **Manifest Tracking** |
| 38 | + - Each file's metadata (its list of chunks, sizes, hashes) is stored in a **manifest**. |
| 39 | + - The manifest maps logical files to their deduplicated chunks in the storage backend. |
| 40 | + - Manifests are versioned, which supports rollback and integrity verification. |
| 41 | + |
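The chunk, hash, check, and manifest steps above can be sketched in a few lines of Python. This is a minimal illustration, not Sietch's implementation: `DedupStore`, `split_fixed`, and the 4-byte chunk size are hypothetical names and values chosen to keep the example small, chunking is fixed-size only, and encryption is left out.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny (the documented default is 4 MB)


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical in-memory stand-in for a vault's chunk index and manifests."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}         # content hash -> chunk bytes
        self.manifests: dict[str, list[str]] = {}  # file name -> ordered content hashes

    def add_file(self, name: str, data: bytes) -> int:
        """Store a file and return how many new chunks had to be written."""
        new_chunks = 0
        refs = []
        for chunk in split_fixed(data):
            content_hash = hashlib.sha256(chunk).hexdigest()
            if content_hash not in self.chunks:    # deduplication check
                self.chunks[content_hash] = chunk  # a real vault would encrypt here
                new_chunks += 1
            refs.append(content_hash)              # reuse the reference either way
        self.manifests[name] = refs
        return new_chunks

    def read_file(self, name: str) -> bytes:
        """Rebuild a file by following its manifest."""
        return b"".join(self.chunks[h] for h in self.manifests[name])


store = DedupStore()
first = store.add_file("a.txt", b"AAAABBBBCCCC")
second = store.add_file("b.txt", b"AAAABBBBDDDD")
print(first, second, len(store.chunks))
```

The second file shares its first two chunks with the first, so only one new chunk is written for it, and both files can still be rebuilt from their manifests.
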
| 42 | +--- |
| 43 | + |
| 44 | +## Content Hash vs Storage Hash |
| 45 | + |
| 46 | +| Type | Definition | Purpose | When Used | |
| 47 | +|------|-------------|----------|--------| |
| 48 | +| **Content Hash** | Hash of the raw (unencrypted) chunk data. | Used to identify identical content for deduplication before encryption. | Computed once during ingestion. | |
| 49 | +| **Storage Hash** | Hash of the encrypted chunk data. | Used to verify integrity and locate stored encrypted blobs. | Used internally during sync and retrieval. | |
| 50 | + |
| 51 | +**In short:** |
| 52 | +- **Content Hash = Dedup identity** |
| 53 | +- **Storage Hash = Integrity + storage mapping** |
| 54 | + |
| 55 | +By separating these two, Sietch maintains **efficient deduplication** while ensuring **secure encryption** and **data integrity**. |
| 56 | + |
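The sketch below illustrates why the dedup lookup has to use the content hash. It assumes each encryption uses a fresh random nonce (AEAD modes such as AES-256-GCM and ChaCha20-Poly1305 require a unique nonce per key, but how Sietch generates its nonces is not stated here). `toy_encrypt` is a hypothetical XOR keystream used only so the example runs with the standard library; it is not secure and is not one of the supported ciphers.

```python
import hashlib
import os


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Toy XOR keystream cipher for illustration only; NOT secure, NOT AES-GCM/ChaCha20."""
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    # Prepend the nonce so the blob is self-describing, as encrypted formats commonly do.
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def hashes_for(key: bytes, chunk: bytes) -> tuple[str, str]:
    """Return (content hash, storage hash) for one chunk."""
    content_hash = hashlib.sha256(chunk).hexdigest()  # hash of the plaintext
    blob = toy_encrypt(key, os.urandom(12), chunk)    # assumed: fresh random nonce
    storage_hash = hashlib.sha256(blob).hexdigest()   # hash of the encrypted blob
    return content_hash, storage_hash


key = os.urandom(32)
chunk = b"the same plaintext chunk"
c1, s1 = hashes_for(key, chunk)
c2, s2 = hashes_for(key, chunk)
print("content hashes equal:", c1 == c2)
print("storage hashes equal:", s1 == s2)
```

Under these assumptions the content hash is stable across runs while the storage hash changes with every encryption, so only the content hash can answer "have we seen this data before?".
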
| 57 | +--- |
| 58 | + |
| 59 | +## Migration Guide — Enabling Dedup on an Existing Vault |
| 60 | + |
| 61 | +If you created a vault before deduplication was enabled, follow this step-by-step process to migrate safely. |
| 62 | + |
| 63 | +### 🧩 Step 1: Backup Existing Vault |
| 64 | +Before making any changes, back up the vault: |
| 65 | +```bash |
| 66 | +sietch backup --output ./vault-backup |
| 67 | +``` |
| 68 | +This ensures you can roll back if migration fails. |
| 69 | + |
| 70 | +### ⚙️ Step 2: Enable Deduplication |
| 71 | +Enable dedup in the configuration: |
| 72 | + |
| 73 | +```bash |
| 74 | +sietch config set dedup.enabled true |
| 75 | +sietch config set dedup.chunk-size 4MB |
| 76 | +``` |
| 77 | +Or manually in the config file (`~/.sietch/config.json`): |
| 78 | + |
| 79 | +```json |
| 80 | +{ |
| 81 | + "dedup": { |
| 82 | + "enabled": true, |
| 83 | + "chunk_size": "4MB" |
| 84 | + } |
| 85 | +} |
| 86 | +``` |
| 87 | + |
| 88 | +### 🔍 Step 3: Re-index Existing Files |
| 89 | +Run the re-indexing tool to compute chunk hashes for existing data: |
| 90 | + |
| 91 | +```bash |
| 92 | +sietch dedup reindex |
| 93 | +``` |
| 94 | +This step scans all files, computes content hashes, and builds a deduplication index. |
| 95 | + |
| 96 | +### 🧼 Step 4: Garbage Collect Old Chunks |
| 97 | +Once the dedup index is ready: |
| 98 | + |
| 99 | +```bash |
| 100 | +sietch dedup gc |
| 101 | +``` |
| 102 | +This removes orphaned or redundant chunks that are not referenced by any manifest. |
| 103 | + |
| 104 | +### 🧠 Step 5: Optimize Storage Layout |
| 105 | +To finalize: |
| 106 | + |
| 107 | +```bash |
| 108 | +sietch dedup optimize |
| 109 | +``` |
| 110 | +This reorganizes chunks and manifests for better read/write performance. |
| 111 | + |
| 112 | +**Note:** For very large vaults, perform these steps on a local copy or use the `--dry-run` flag first to estimate changes. |
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +## Performance Tuning & Chunk Size Recommendations |
| 117 | + |
| 118 | +Chunk size directly affects both storage efficiency and CPU performance: |
| 119 | + |
| 120 | +| Chunk Size | Use Case | Storage Efficiency | CPU Cost | |
| 121 | +|------------|----------|-------------------|----------| |
| 122 | +| 1 MB | Rapidly changing files, e.g., source code, logs | High | High | |
| 123 | +| 4 MB (default) | Balanced general purpose | Medium | Medium | |
| 124 | +| 8–16 MB | Large static files (media, backups) | Low | Low |
| 125 | + |
| 126 | +**Tips:** |
| 127 | +- Smaller chunks → better deduplication but slower processing. |
| 128 | +- Larger chunks → faster sync and less overhead but fewer dedup hits. |
| 129 | +- For mixed workloads, keep 4 MB or use `--adaptive-chunking` (if available). |
| 130 | + |
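One way to reason about the trade-off is to count how many chunks, and therefore how many hash computations and index entries, a given amount of data produces. The helper below is a back-of-the-envelope sketch: `chunk_count` is a hypothetical name, and "MB" is treated as 2^20 bytes, which is an assumption about how Sietch interprets the setting.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of fixed-size chunks needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


vault_size = 100 * GIB
for size_mib in (1, 4, 16):
    n = chunk_count(vault_size, size_mib * MIB)
    print(f"{size_mib:>2} MiB chunks -> {n:,} chunks to hash and index")
```

For 100 GiB of data this gives 102,400 chunks at 1 MiB, 25,600 at 4 MiB, and 6,400 at 16 MiB, so each step up in chunk size cuts per-chunk overhead by the same factor while making every dedup match coarser.
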
| 131 | +--- |
| 132 | + |
| 133 | +## Best Practices |
| 134 | + |
| 135 | +- Run `sietch dedup stats` regularly to monitor chunk reuse and storage savings. |
| 136 | +- Avoid changing the chunk size after initial vault creation: chunks cut at a different size hash differently, so this can break dedup references to existing chunks. |
| 137 | +- Use `sietch dedup optimize` monthly to defragment storage. |
| 138 | +- Keep your manifests backed up; they're critical for mapping files to chunks. |
| 139 | +- Use `--dry-run` with dedup operations before running them in production. |
| 140 | + |
| 141 | +--- |
| 142 | + |
| 143 | +## Example Workflow |
| 144 | + |
| 145 | +```bash |
| 146 | +# Initialize vault with ChaCha20 encryption |
| 147 | +sietch init --name research --key-type chacha20 |
| 148 | + |
| 149 | +# Add files |
| 150 | +sietch add ./datasets ./vault/data |
| 151 | + |
| 152 | +# Check dedup stats |
| 153 | +sietch dedup stats |
| 154 | + |
| 155 | +# Clean up unreferenced chunks |
| 156 | +sietch dedup gc |
| 157 | + |
| 158 | +# Optimize layout |
| 159 | +sietch dedup optimize |
| 160 | +``` |
| 161 | + |
| 162 | +Output might look like: |
| 163 | + |
| 164 | +```text |
| 165 | +Deduplication Statistics |
| 166 | +------------------------ |
| 167 | +Total Chunks: 12,843 |
| 168 | +Unique Chunks: 9,557 |
| 169 | +Space Saved: 4.21 GB (32%) |
| 170 | +Garbage Collected: 152 chunks |
| 171 | +Optimization Complete: OK |
| 172 | +``` |
| 173 | + |
| 174 | +--- |
| 175 | + |
| 176 | +## Future Improvements (Planned) |
| 177 | + |
| 178 | +- **Adaptive Chunking:** variable chunk sizes based on content entropy. |
| 179 | +- **Cross-Vault Dedup:** share dedup indices securely across multiple vaults. |
| 180 | +- **Dedup Metrics API:** expose storage savings via REST/CLI metrics. |