Commit 4fff24c

Added DEDUPLICATION.md and updated README link (#109)
* Added DEDUPLICATION.md and updated README link
* Updated Deduplication.md
1 parent e0c0806 commit 4fff24c

2 files changed

Lines changed: 181 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@ sietch get thumper-plans.pdf ./retrieved/
### Chunking & Deduplication
* Files are split into configurable chunks (default: 4MB)
* Identical chunks across files are deduplicated to save space
+* See the [deduplication documentation](internal/deduplication/README.md) to understand how deduplication works.

### Encryption
Each chunk is encrypted before storage using:

internal/deduplication/README.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# 🔁 Deduplication in Sietch Vault

## Overview

Sietch Vault uses **content-defined chunking** and **cryptographic deduplication** to minimize redundant data storage.
Instead of storing a whole new copy of a file whenever it changes, Sietch splits each file into fixed- or variable-sized chunks (default: 4 MB), computes a cryptographic hash of each chunk, and stores only the chunks it has not stored before.

This makes syncing and storage efficient: identical data across files and folders is stored only once, which reduces disk usage and sync time.
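The idea can be sketched in a few lines of Python. This is a minimal illustration, not Sietch's implementation: the `ChunkStore` class, its method names, and the 4-byte chunk size are hypothetical and chosen only to keep the example readable.

```python
import hashlib

CHUNK_SIZE = 4  # bytes, for readability; the documented default is 4 MB


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical in-memory store keyed by the SHA-256 of each chunk."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def add_file(self, data: bytes) -> list[str]:
        """Store unseen chunks only; return the file's ordered chunk hashes."""
        hashes = []
        for chunk in split_fixed(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            hashes.append(digest)
        return hashes

    def read_file(self, hashes: list[str]) -> bytes:
        """Rebuild a file from its ordered list of chunk hashes."""
        return b"".join(self.chunks[h] for h in hashes)


store = ChunkStore()
v1 = store.add_file(b"AAAABBBBCCCC")
v2 = store.add_file(b"AAAAXXXXCCCC")  # only the middle chunk changed
print(len(v1) + len(v2), "chunk references,", len(store.chunks), "chunks stored")
```

The two files reference six chunks in total, but only four distinct chunks are kept, because the first and last chunks are shared.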

---

## How Deduplication Works

Deduplication in Sietch happens at the **chunk level**, not the file level.

1. **Chunking**
   - Every file added to a vault is split into smaller data chunks.
   - Chunk size is configurable (`--chunk-size` flag, default 4 MB).
   - Chunk boundaries can be fixed or computed using a **rolling hash**, which lets Sietch detect identical regions even when an insertion or deletion shifts the data that follows.

2. **Hashing**
   - Each chunk is assigned two hashes:
     - **Content Hash:** A cryptographic hash (e.g., SHA-256) of the unencrypted chunk.
     - **Storage Hash:** A hash of the encrypted chunk (post-encryption).

3. **Deduplication Check**
   - When a new chunk is processed, Sietch checks whether its **content hash** already exists in the vault index.
   - If it does, Sietch reuses the existing chunk reference instead of storing a duplicate.

4. **Encryption**
   - Chunks are encrypted **after** the deduplication check.
   - Because the content hash is computed on plaintext, identical chunks always yield the same content hash (so dedup works), while the **storage hash** identifies the encrypted blob that is actually written to storage.
   - Supported encryption modes:
     - **AES-256-GCM**
     - **ChaCha20-Poly1305**

5. **Manifest Tracking**
   - Each file's metadata (its list of chunks, sizes, and hashes) is stored in a **manifest**.
   - The manifest maps logical files to their deduplicated chunks in the storage backend.
   - Manifests are versioned, which makes it possible to roll back to an earlier version or verify integrity.
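
To see why rolling-hash boundaries help, the sketch below compares fixed-size chunking with a very simple content-defined chunker. It is an illustration under stated assumptions, not Sietch's algorithm: the additive rolling checksum, the 16-byte window, the cut mask, and the function names are all hypothetical, and production chunkers typically use stronger rolling hashes and enforce minimum and maximum chunk sizes.

```python
import random


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Cut wherever the checksum of the last `window` bytes matches `mask`."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        # Decide only on a full window, so a cut depends on content alone.
        if i >= window - 1 and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


rng = random.Random(42)
data = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"!" + data  # one byte inserted at the front

cdc_a, cdc_b = cdc_chunks(data), cdc_chunks(shifted)
fix_a, fix_b = fixed_chunks(data), fixed_chunks(shifted)

print("content-defined chunks reused:", len(set(cdc_a) & set(cdc_b)), "of", len(cdc_a))
print("fixed-size chunks reused:", len(set(fix_a) & set(fix_b)), "of", len(fix_a))
```

Because each cut depends only on the bytes inside the window, the boundaries realign right after the inserted byte and every chunk past the first is found again. With fixed-size chunks, the insertion shifts every boundary, so no chunk matches.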

---

## Content Hash vs Storage Hash

| Type | Definition | Purpose | When Used |
|------|------------|---------|-----------|
| **Content Hash** | Hash of the raw (unencrypted) chunk data. | Identifies identical content for deduplication before encryption. | Computed once during ingestion. |
| **Storage Hash** | Hash of the encrypted chunk data. | Verifies integrity and locates stored encrypted blobs. | Used internally during sync and retrieval. |

**In short:**
- **Content Hash = Dedup identity**
- **Storage Hash = Integrity + storage mapping**

By separating the two, Sietch maintains **efficient deduplication** while ensuring **secure encryption** and **data integrity**.
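
The following sketch shows why the dedup lookup has to use the content hash. It assumes that encryption uses a fresh random nonce per chunk and that the index maps content hash to storage hash; both are assumptions, as is every name in the code. `toy_encrypt` is a deliberately insecure stand-in for AES-256-GCM or ChaCha20-Poly1305, used only so the example runs with the standard library.

```python
import hashlib
import os


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Stand-in for an AEAD cipher: random nonce + XOR keystream. NOT secure."""
    nonce = os.urandom(12)
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    return nonce + body


def ingest(key: bytes, chunk: bytes, index: dict[str, str]) -> tuple[str, str, bool]:
    """Return (content_hash, storage_hash, stored_new_blob) for one chunk."""
    content_hash = hashlib.sha256(chunk).hexdigest()
    if content_hash in index:  # dedup hit: skip encryption and storage
        return content_hash, index[content_hash], False
    blob = toy_encrypt(key, chunk)
    storage_hash = hashlib.sha256(blob).hexdigest()
    index[content_hash] = storage_hash
    return content_hash, storage_hash, True


key = b"k" * 32
index: dict[str, str] = {}
first = ingest(key, b"same bytes", index)
second = ingest(key, b"same bytes", index)
print("second copy stored again:", second[2])

# Encrypting the same plaintext twice gives different blobs, hence different
# storage hashes, so the storage hash alone could not detect the duplicate.
other_blob_hash = hashlib.sha256(toy_encrypt(key, b"same bytes")).hexdigest()
print("storage hashes match:", other_blob_hash == first[1])
```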

---

## Migration Guide: Enabling Dedup on an Existing Vault

If you created a vault before deduplication was enabled, follow these steps to migrate safely.

### 🧩 Step 1: Back Up the Existing Vault
Before any operation:
```bash
sietch backup --output ./vault-backup
```
This ensures you can roll back if the migration fails.

### ⚙️ Step 2: Enable Deduplication
Enable dedup in the configuration:

```bash
sietch config set dedup.enabled true
sietch config set dedup.chunk-size 4MB
```
Or manually in the config file (`~/.sietch/config.json`):

```json
{
  "dedup": {
    "enabled": true,
    "chunk_size": "4MB"
  }
}
```
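
If you script against this file, the `chunk_size` string has to be turned into a byte count. The helper below is a hypothetical sketch, not part of Sietch; it assumes binary units (1 MB = 1024 × 1024 bytes), which the configuration format shown here does not confirm.

```python
import json

# Assumption: sizes use binary multiples; Sietch may interpret them differently.
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a string such as '4MB' into a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)  # no suffix: treat the value as plain bytes


config = json.loads('{"dedup": {"enabled": true, "chunk_size": "4MB"}}')
chunk_bytes = parse_size(config["dedup"]["chunk_size"])
print(chunk_bytes)
```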

### 🔍 Step 3: Re-index Existing Files
Run the re-indexing tool to compute chunk hashes for existing data:

```bash
sietch dedup reindex
```
This step scans all files, computes content hashes, and builds a deduplication index.

### 🧼 Step 4: Garbage Collect Old Chunks
Once the dedup index is ready:

```bash
sietch dedup gc
```
This removes orphaned or redundant chunks that are not referenced in any manifest.
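
Conceptually, this kind of garbage collection is a mark-and-sweep pass over the manifests. The sketch below illustrates the idea only: the data structures (a dict of manifests and a dict of stored chunks) and the function name are hypothetical and say nothing about how `sietch dedup gc` is actually implemented.

```python
def collect_garbage(manifests: dict[str, list[str]], stored: dict[str, bytes]) -> list[str]:
    """Delete stored chunks that no manifest references; return removed hashes."""
    # Mark: every hash that appears in at least one manifest is live.
    live = {h for chunk_hashes in manifests.values() for h in chunk_hashes}
    # Sweep: anything stored but not live is an orphan.
    orphans = sorted(h for h in stored if h not in live)
    for h in orphans:
        del stored[h]
    return orphans


manifests = {"a.txt": ["h1", "h2"], "b.txt": ["h2", "h3"]}
stored = {"h1": b"...", "h2": b"...", "h3": b"...", "h4": b"..."}
removed = collect_garbage(manifests, stored)
print(removed)
```

Here `h2` is kept even though two files share it, and only `h4`, which no manifest mentions, is removed. This is also why manifests must be complete and backed up before garbage collection runs.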

### 🧠 Step 5: Optimize Storage Layout
To finalize:

```bash
sietch dedup optimize
```
This reorganizes chunks and manifests for better read/write performance.

**Note:** For very large vaults, perform these steps on a local copy, or use the `--dry-run` flag first to estimate changes.

---

## Performance Tuning & Chunk Size Recommendations

Chunk size directly affects both storage efficiency and CPU cost:

| Chunk Size | Use Case | Storage Efficiency | CPU Cost |
|------------|----------|--------------------|----------|
| 1 MB | Rapidly changing files (e.g., source code, logs) | High | High |
| 4 MB (default) | Balanced general purpose | Medium | Medium |
| 8–16 MB | Large static files (e.g., media, backups) | Low | Low |

**Tips:**
- Smaller chunks → better deduplication but slower processing.
- Larger chunks → faster sync and less overhead but fewer dedup hits.
- For mixed workloads, keep 4 MB or use `--adaptive-chunking` (if your version supports it; adaptive chunking is listed under planned improvements).
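
A quick back-of-the-envelope calculation makes the trade-off concrete. The sketch assumes fixed-size chunks, binary units, and one 32-byte SHA-256 digest per chunk in the index; the 10 GiB file size is an arbitrary example, not a figure from Sietch.

```python
MIB = 1024 * 1024
DIGEST_BYTES = 32  # size of one SHA-256 digest


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover file_size bytes."""
    return (file_size + chunk_size - 1) // chunk_size


file_size = 10 * 1024 * MIB  # a 10 GiB file
counts = {mib: chunk_count(file_size, mib * MIB) for mib in (1, 4, 16)}
index_bytes = {mib: n * DIGEST_BYTES for mib, n in counts.items()}

for mib in counts:
    print(f"{mib:>2} MiB chunks: {counts[mib]:>6} hashes, "
          f"{index_bytes[mib] // 1024} KiB of digests")
```

Going from 16 MiB to 1 MiB chunks means 16 times as many hashes to compute, look up, and track. In exchange, a small in-place edit forces roughly 1 MiB to be stored again instead of 16 MiB.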

---

## Best Practices

- Run `sietch dedup stats` regularly to monitor chunk reuse and storage savings.
- Avoid changing the chunk size after initial vault creation, as this can break dedup references.
- Use `sietch dedup optimize` monthly to defragment storage.
- Keep your manifests backed up; they're critical for mapping files to chunks.
- Use `--dry-run` with dedup operations before running them in production.

---

## Example Workflow

```bash
# Initialize vault with ChaCha20 encryption
sietch init --name research --key-type chacha20

# Add files
sietch add ./datasets ./vault/data

# Check dedup stats
sietch dedup stats

# Clean up unreferenced chunks
sietch dedup gc

# Optimize layout
sietch dedup optimize
```

Output might look like:

```text
Deduplication Statistics
------------------------
Total Chunks: 12,843
Unique Chunks: 9,557
Space Saved: 4.21 GB (32%)
Garbage Collected: 152 chunks
Optimization Complete: OK
```
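
Figures like these can be derived from the manifests alone. The sketch below assumes each manifest entry carries a content hash and a chunk size; the `dedup_stats` function and its field names are hypothetical. Savings are measured in bytes, so the percentage saved need not equal the share of duplicate chunks when chunk sizes differ.

```python
def dedup_stats(manifests: dict[str, list[tuple[str, int]]]) -> dict[str, float]:
    """Summarize chunk reuse from manifests of (content_hash, size_in_bytes)."""
    total_chunks = 0
    logical_bytes = 0
    unique: dict[str, int] = {}
    for entries in manifests.values():
        for content_hash, size in entries:
            total_chunks += 1
            logical_bytes += size
            unique[content_hash] = size  # each distinct chunk is stored once
    stored_bytes = sum(unique.values())
    saved = logical_bytes - stored_bytes
    return {
        "total_chunks": total_chunks,
        "unique_chunks": len(unique),
        "bytes_saved": saved,
        "percent_saved": 100 * saved / logical_bytes if logical_bytes else 0.0,
    }


stats = dedup_stats({
    "a.bin": [("h1", 4), ("h2", 4)],
    "b.bin": [("h1", 4), ("h3", 2)],
})
print(stats)
```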

---

## Future Improvements (Planned)

- **Adaptive Chunking:** variable chunk sizes based on content entropy.
- **Cross-Vault Dedup:** share dedup indices securely across multiple vaults.
- **Dedup Metrics API:** expose storage savings via REST/CLI metrics.
