Deduplication System
- Overview
- How It Works
- Hash Calculation
- Normal Operation Scenarios
- Extraordinary Scenarios
- First Load Optimization
- Performance Characteristics
- Troubleshooting
- Best Practices
LogLynx uses a cryptographic hash-based deduplication system to prevent duplicate log entries from being stored in the database. This ensures data integrity and prevents storage waste when:
- Log files are reprocessed after crashes
- Log rotation causes log re-reading
- Multiple LogLynx instances process the same logs
- Logs are manually re-imported
- ✅ SHA256 Hash-Based: Cryptographically secure, collision-resistant
- ✅ Nanosecond Precision: Uses Duration and StartUTC for ultra-precise uniqueness
- ✅ Database-Enforced: UNIQUE index guarantees no duplicates
- ✅ Idempotent Processing: Safe to reprocess logs multiple times
- ✅ First-Load Optimized: Skips deduplication checks on empty database (10-100x faster)
- ✅ Automatic: No configuration required
```
1. LOG ENTRY READ
   Source: Traefik JSON/CLF log file
                ↓
2. PARSE LOG ENTRY
   Extract: timestamp, IP, method, path, status, etc.
                ↓
3. CALCULATE HASH (SHA256)
   Input:   timestamp|IP|method|host|path|query|status|duration|startUTC
   Output:  64-character hex hash
   Example: a3f5b8c9d2e1f4a7c6b5d8e7f9a0c1d2...
                ↓
4. CHECK FOR DUPLICATES
   Database: SELECT COUNT(*) WHERE request_hash = ?
   Result:   EXISTS or NOT EXISTS
                ↓
        ┌───────┴───────┐
        ↓               ↓
  HASH EXISTS        HASH NEW
  (Duplicate)        (Unique)
        ↓               ↓
  SKIP INSERT        INSERT RECORD
  Log: "Skipped      Success!
  duplicate"
```
The request_hash field has a UNIQUE INDEX:

```sql
CREATE UNIQUE INDEX idx_request_hash ON http_requests(request_hash);
```

This means:
- Database guarantees no two records can have the same hash
- Attempted duplicates are rejected at the database level
- No duplicate can slip through, even in race conditions
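For illustration, here is a minimal GORM sketch of how this can look in Go. The struct, field, and function names are assumptions for the example (the wiki's own snippets reference an `HTTPRequest` model, but the exact LogLynx schema may differ); the key points are the `uniqueIndex` tag, which creates the index above during AutoMigrate, and the check-then-insert step from the diagram.

```go
package storage

import (
	"log"
	"time"

	"gorm.io/gorm"
)

// HTTPRequest is an illustrative model, not the actual LogLynx schema.
// The uniqueIndex tag makes GORM create idx_request_hash during AutoMigrate.
type HTTPRequest struct {
	ID          uint      `gorm:"primaryKey"`
	Timestamp   time.Time `gorm:"index"`
	ClientIP    string
	Method      string
	Host        string
	Path        string
	QueryString string
	StatusCode  int
	Duration    int64  // nanoseconds
	StartUTC    string // nanosecond-precision timestamp string
	RequestHash string `gorm:"type:char(64);uniqueIndex:idx_request_hash"`
}

// insertIfNew sketches step 4 of the diagram and its branch: check the hash,
// then insert or skip. The UNIQUE index remains the final safety net either way.
func insertIfNew(db *gorm.DB, entry *HTTPRequest) error {
	var count int64
	if err := db.Model(&HTTPRequest{}).
		Where("request_hash = ?", entry.RequestHash).
		Count(&count).Error; err != nil {
		return err
	}
	if count > 0 {
		log.Printf("Skipped duplicate (request_hash=%s...)", entry.RequestHash[:8])
		return nil
	}
	return db.Create(entry).Error
}
```

Even if the application-level check is skipped or races with another writer, the `db.Create` call fails on a duplicate hash because of the index.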
The deduplication hash is calculated using 9 fields to create a unique identifier:

```
hashInput = timestamp.Unix() + "|" +
            clientIP + "|" +
            method + "|" +
            host + "|" +
            path + "|" +
            queryString + "|" +
            statusCode + "|" +
            duration + "|" +    // Nanosecond precision
            startUTC            // Timestamp with nanosecond precision

hash = SHA256(hashInput)
```

Log Entry:
```json
{
  "time": "2025-11-07T10:30:45.123456789Z",
  "ClientAddr": "103.4.250.66:48952",
  "RequestMethod": "GET",
  "request_Host": "api.example.com",
  "RequestPath": "/users?page=1",
  "DownstreamStatus": 200,
  "Duration": 299425702,
  "StartUTC": "2025-11-07T10:30:45.123456789Z"
}
```

Hash Input String:
```
1762511445|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
```

SHA256 Hash:

```
a3f5b8c9d2e1f4a7c6b5d8e7f9a0c1d2b3e4f5a6c7d8e9f0a1b2c3d4e5f6a7b8
```
This hash is stored in the request_hash field.
| Field | Purpose | Precision |
|---|---|---|
| timestamp.Unix() | Identifies the second when request occurred | Second |
| clientIP | Distinguishes requests from different clients | Exact |
| method | Differentiates GET, POST, PUT, DELETE, etc. | Exact |
| host | Separates requests to different services | Exact |
| path | Identifies the endpoint | Exact |
| queryString | Differentiates query parameters | Exact |
| statusCode | Allows same request with different results | Exact |
| duration | Nanosecond-level uniqueness | Nanosecond |
| startUTC | Timestamp with nanosecond precision | Nanosecond |
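The snippet below is a runnable Go sketch of this calculation. The exact string formatting of each field (for example, how duration and statusCode are rendered) is an assumption for illustration and may differ slightly from LogLynx's actual implementation, but it follows the 9-field, pipe-separated input described above.

```go
package dedup

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// calculateRequestHash builds the pipe-separated input described above and
// returns its SHA256 digest as a 64-character hex string. Field formatting
// is illustrative; the real LogLynx code may differ in minor details.
func calculateRequestHash(ts time.Time, clientIP, method, host, path, query string,
	status int, durationNs int64, startUTC string) string {

	input := strings.Join([]string{
		fmt.Sprintf("%d", ts.Unix()), // second-level timestamp
		clientIP,
		method,
		host,
		path,
		query,
		fmt.Sprintf("%d", status),
		fmt.Sprintf("%d", durationNs), // nanosecond precision
		startUTC,                      // nanosecond-precision timestamp string
	}, "|")

	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])
}
```

The output is deterministic: the same 9 input values always produce the same 64-character digest, which is what makes re-processing idempotent.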
| Format | Precision | Collision Risk |
|---|---|---|
| Traefik JSON | Nanosecond | Virtually 0% (1 in 10^18) |
| Traefik CLF | Millisecond | Very low (~0.001%) |
| Generic CLF | Second | Low (~1%) |
Situation: LogLynx reads new log entries for the first time.
Time: 10:30:45.123456789
Request: GET /api/users from 103.4.250.66
Status: 200
Duration: 299425702 ns
Process:
- ✅ Parse log entry
- ✅ Calculate hash: a3f5b8c9...
- ✅ Check database: Hash NOT found
- ✅ Insert record → Success!
Result: Record stored successfully.
Situation: Two identical requests in the same second.
Request 1: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
Request 2: GET /api/users at 10:30:45.987654321 → Duration: 299425999 ns
Hash Inputs:
Hash 1: ...10:30:45.123456789|299425702|...
Hash 2: ...10:30:45.987654321|299425999|...
↑ ↑
Different nanosecond precision
Process:
- ✅ Calculate hash 1: a3f5b8c9...
- ✅ Calculate hash 2: b4f6c0d1... (different!)
- ✅ Both hashes unique
- ✅ Both records inserted
Result: Both requests stored (they're actually different).
Situation: Log line appears twice (file re-read, rotation, etc.).
Request 1: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
Request 2: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
↑
Exact duplicate
Hash Inputs:
Hash 1: 1762511445|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
Hash 2: 1762511445|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
↑
Identical
Process:
- ✅ Calculate hash 1: a3f5b8c9... → Insert → Success
- ✅ Calculate hash 2: a3f5b8c9... (same!)
- ✅ Check database: Hash EXISTS
- ⚠️ Skip insert → Log: "Skipped duplicate"
Result: Second record NOT stored (duplicate detected).
Situation: Same request, different outcomes.
Request 1: GET /api/users at 10:30:45.123456789 → Status: 200
Request 2: GET /api/users at 10:30:45.123456789 → Status: 500
Hash Inputs:
Hash 1: ...|200|...
Hash 2: ...|500|...
↑
Different status code
Process:
- ✅ Calculate hash 1: a3f5b8c9...
- ✅ Calculate hash 2: c7d9e1f3... (different!)
- ✅ Both hashes unique
- ✅ Both records inserted
Result: Both stored (different status codes = different events).
Situation: Same path, different query strings.
Request 1: GET /api/users?page=1 at 10:30:45
Request 2: GET /api/users?page=2 at 10:30:45
Hash Inputs:
Hash 1: ...|/api/users|page=1|...
Hash 2: ...|/api/users|page=2|...
↑
Different query parameter
Process:
- ✅ Calculate hash 1: a3f5b8c9...
- ✅ Calculate hash 2: d8e0f2a4... (different!)
- ✅ Both hashes unique
- ✅ Both records inserted
Result: Both stored (different query params = different requests).
Situation: LogLynx crashes while processing a batch of 1000 logs. 500 were inserted, 500 were not.
What Happens:
Batch 1 (logs 1-500): ✅ Inserted (position NOT updated)
Batch 2 (logs 501-1000): ❌ Crashed before insert
Position in database: ⚠️ Still at log 0
On Restart:
- ✅ LogLynx reads from last position (0)
- ✅ Re-reads logs 1-1000
- ✅ Calculates hashes for all logs
- ✅ Logs 1-500: Hash EXISTS → Skip (duplicates)
- ✅ Logs 501-1000: Hash NOT found → Insert
- ✅ Position updated to 1000
Result: ✅ No data loss, no duplicates. Crash-safe!
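A sketch of why this is crash-safe, reusing the illustrative insertIfNew helper from earlier plus a hypothetical FilePosition table (both are assumptions, not the actual LogLynx code): the read position is persisted only after a batch has been handled, and re-handling a batch is harmless because repeated hashes are skipped.

```go
package storage

import "gorm.io/gorm"

// FilePosition is a hypothetical position-tracking model for this sketch.
type FilePosition struct {
	Source string `gorm:"primaryKey"`
	Offset int64
}

// processBatch is an illustrative sketch, not the actual LogLynx code.
// If the process crashes anywhere before the final Update, the same lines
// are re-read on restart and insertIfNew (from the earlier sketch) turns
// the repeated inserts into no-ops.
func processBatch(db *gorm.DB, source string, entries []*HTTPRequest, newOffset int64) error {
	for _, e := range entries {
		if err := insertIfNew(db, e); err != nil {
			return err
		}
	}
	// Persist the read position only after the whole batch was handled.
	return db.Model(&FilePosition{}).
		Where("source = ?", source).
		Update("offset", newOffset).Error
}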
Situation: Log file rotated while LogLynx is running.
Before rotation:
access.log (1000 lines) ← LogLynx reading at line 800
After rotation:
access.log.1 (1000 lines) ← Old file renamed
access.log (0 lines) ← New empty file
What Happens:
- ✅ LogLynx detects rotation (file inode changed or size < position)
- ✅ Resets position to 0
- ✅ Starts reading the new access.log (empty)
- ✅ No logs to read yet
- ✅ Continues monitoring
If old logs are re-imported later:
- ✅ Calculate hashes for old logs
- ✅ All hashes EXIST (already in database)
- ✅ All skipped → No duplicates
Result: ✅ No duplicates, automatic handling.
Situation: Log file truncated (common with copytruncate rotation).
Before:
access.log (10 MB, position at 8 MB)
After truncation:
access.log (0 MB, position still thinks it's at 8 MB)
What Happens:
- ✅ LogLynx detects truncation (file size < last position)
- ✅ Resets position to 0
- ✅ Starts reading from beginning
- ✅ File is empty (just truncated)
- ✅ Waits for new logs
Result: ✅ Automatic recovery, no errors.
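The sketch below shows one way the detection used in the two scenarios above can be implemented (size shrinking below the saved position, or the inode changing). It is an illustration of the technique rather than the actual LogLynx code, and the inode check is Unix-specific.

```go
package tail

import (
	"os"
	"syscall"
)

// needsReset is an illustrative check, not the actual LogLynx code. It reports
// whether the saved read position should be reset to 0 because the file was
// rotated (inode changed) or truncated (size < saved position).
func needsReset(path string, savedOffset int64, savedInode uint64) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		if os.IsNotExist(err) {
			return true, nil // file replaced/removed: start over when it reappears
		}
		return false, err
	}

	// Truncation (e.g. copytruncate rotation): the file is now smaller than
	// the position we previously reached.
	if info.Size() < savedOffset {
		return true, nil
	}

	// Rotation: same path, different underlying inode (Unix-specific).
	if st, ok := info.Sys().(*syscall.Stat_t); ok && st.Ino != savedInode {
		return true, nil
	}
	return false, nil
}
```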
Situation: Administrator manually re-imports old logs.
```bash
# Copy old logs back
cp access.log.old access.log
systemctl restart loglynx
```

What Happens:
- ✅ LogLynx starts reading the restored access.log
- ✅ Calculates hashes for all entries
- ✅ Checks database for each hash
- ✅ All hashes EXIST (already imported)
- ✅ All entries skipped
- ✅ Log: "Skipped 10000 duplicates"
Result: ✅ No duplicates stored, idempotent import.
Situation: Two LogLynx instances reading the same log file (NOT RECOMMENDED, but handled).
Instance 1: Reading access.log (position 0-1000)
Instance 2: Reading access.log (position 0-1000)
What Happens:
- ✅ Instance 1 calculates hashes, inserts records
- ✅ Instance 2 calculates same hashes
- ✅ Database rejects duplicates (UNIQUE constraint)
- ✅ Instance 2 logs: "Skipped duplicates"
Result: ✅ No duplicates, but wasteful (avoid this scenario).
Best Practice: Use different log sources per instance.
Situation: Database file corrupted, restore from backup.
Current database: 1 million records (Jan 1 - Dec 31)
Backup database: 500k records (Jan 1 - Jun 30)
What Happens:
- ✅ Restore backup (500k records)
- ✅ Restart LogLynx
- ✅ LogLynx reads logs from last position
- ✅ Processes logs from Jul 1 - Dec 31
- ✅ Hashes NOT in database (backup was old)
- ✅ All records inserted (no duplicates)
Result: ✅ Database restored, missing data re-imported.
Situation: Two different requests generate the same SHA256 hash.
Probability: ~1 in 2^256 (virtually impossible)
What Would Happen (theoretical):
- ⚠️ Request 1 inserted with hash a3f5b8c9...
- ⚠️ Request 2 generates the same hash
- ⚠️ Database rejects the insert (UNIQUE constraint)
- ⚠️ Request 2 skipped as a "duplicate"
Impact: One request lost (but probability is ~0%)
Mitigation: Not needed (SHA256 is collision-resistant).
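For intuition, a standard birthday-bound estimate (added here for illustration, not taken from the LogLynx source): with N stored hashes, the probability of any SHA256 collision is roughly

```math
P(\text{collision}) \approx \frac{N^2}{2 \cdot 2^{256}}, \qquad N = 10^9 \;\Rightarrow\; P \approx \frac{10^{18}}{2.3 \times 10^{77}} \approx 4 \times 10^{-60}
```

so even a database holding a billion requests is nowhere near a meaningful collision risk.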
When the database is empty (first load), LogLynx skips deduplication checks for maximum performance.
```go
// Check if database is empty
var count int64
db.Model(&HTTPRequest{}).Count(&count)
isFirstLoad := (count == 0)

if isFirstLoad {
    // Skip UNIQUE constraint handling
    // Insert all records in batch (10-100x faster)
} else {
    // Normal deduplication (check for duplicates)
}
```

| Records | First Load | Normal Load | Speedup |
|---|---|---|---|
| 1K | 0.1s | 0.5s | 5x |
| 10K | 1s | 8s | 8x |
| 100K | 10s | 120s | 12x |
| 1M | 2 min | 20 min | 10x |
| 10M | 20 min | 6 hours | 18x |
The optimization applies:

- ✅ On first install (database is empty)
- ✅ After a database reset (all tables dropped)
- ❌ Not during normal operation (database has records)
Is deduplication still safe during first load?

- ✅ Yes! The UNIQUE index still enforces uniqueness at the database level.
- ✅ Even on first load, duplicates are impossible (database constraint).
- ✅ Only the application-level duplicate check is skipped, for speed.
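As an illustration of how a first-load batch insert can stay safe while skipping the per-row check, here is a hedged GORM sketch (an assumed approach, not necessarily the actual LogLynx code) that leans entirely on the UNIQUE index via an ON CONFLICT DO NOTHING clause:

```go
package storage

import (
	"gorm.io/gorm"
	"gorm.io/gorm/clause"
)

// bulkInsertFirstLoad is an illustrative sketch, not the actual LogLynx code.
// It inserts records in batches without the per-row hash lookup; any duplicate
// that does appear is silently dropped by the database's UNIQUE index thanks
// to the ON CONFLICT DO NOTHING clause.
func bulkInsertFirstLoad(db *gorm.DB, records []*HTTPRequest) error {
	return db.Clauses(clause.OnConflict{DoNothing: true}).
		CreateInBatches(records, 500).Error
}
```

With this clause the database drops any row whose request_hash already exists, so correctness never depends on the application-level check.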
| Scenario | Speed | Duplicates Detected |
|---|---|---|
| First load (empty DB) | 5,000-10,000 logs/sec | None (N/A) |
| Normal load (no duplicates) | 500-2,000 logs/sec | 0% |
| Normal load (50% duplicates) | 400-1,500 logs/sec | 50% |
| All duplicates | 300-1,000 logs/sec | 100% |
Deduplication hash lookups are extremely fast due to the UNIQUE index:
```sql
-- Hash lookup (to check for duplicate)
SELECT COUNT(*) FROM http_requests WHERE request_hash = 'a3f5b8c9...';
```

Performance: <1ms (index lookup)
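If the backing database is SQLite (an assumption here; adapt the statement to your engine), you can confirm the lookup uses the index rather than a full table scan:

```sql
-- Assumes SQLite; the output wording varies slightly between versions.
EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM http_requests WHERE request_hash = 'a3f5b8c9...';
-- Expected plan mentions something like:
--   SEARCH http_requests USING COVERING INDEX idx_request_hash (request_hash=?)
```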
| Component | Size per Record |
|---|---|
| Hash field | 64 bytes (fixed) |
| Index overhead | ~50 bytes |
| Total | ~114 bytes/record |
For 1 million records: ~114 MB storage for deduplication.
Symptom: LogLynx logs show thousands of duplicates being skipped.
Possible Causes:
- Log rotation caused re-reading of old logs
- Multiple LogLynx instances reading same logs
- Manual log re-import
Solution:

```bash
# Check LogLynx logs
tail -f loglynx.log | grep "Skipped duplicate"
```

- If expected (log rotation): ✅ No action needed, working as designed
- If unexpected (multiple instances): ⚠️ Stop the duplicate instances and use different log sources

Symptom: Valid requests not appearing in database.
Possible Causes:
- Hash collision (extremely rare, ~0%)
- Two requests with identical nanosecond timestamp (very rare)
Diagnosis:

```sql
-- Find potential collisions
SELECT request_hash, COUNT(*) as count
FROM http_requests
GROUP BY request_hash
HAVING count > 1;
```

Solution:

```sql
-- Check if these are truly different requests
SELECT * FROM http_requests
WHERE request_hash = 'a3f5b8c9...'
ORDER BY timestamp;
```

- If truly different (hash collision): ⚠️ Report to developers (should never happen with SHA256)
- If identical (not a collision): ✅ Working as designed

Symptom: First load is slow despite empty database.
Diagnosis:

```sql
-- Check if database is truly empty
SELECT COUNT(*) FROM http_requests;
```

Solution:

```bash
# If count > 0:
#   Database not empty, first-load optimization won't activate

# If count = 0 but still slow, check logs for the "First load detected" message:
tail -f loglynx.log | grep "First load detected"

# If the message is not present, report to developers.
```

Symptom: Database errors about UNIQUE constraint violations.
Cause: Normal! These errors are expected when duplicates are detected.
Solution: ✅ No action needed. LogLynx silently handles these errors.
The error is logged at DEBUG level, not ERROR level:
DEBUG: Skipped duplicate entries (total=1000, inserted=950, duplicates=50)
- Let LogLynx handle duplicates automatically - No configuration needed
- Monitor "Skipped duplicates" count - High numbers may indicate log rotation
- Use one LogLynx instance per log source - Avoid multiple readers
- Trust the system - SHA256 collision is virtually impossible
- Enable first-load optimization - Automatically enabled on empty DB
- Don't disable deduplication - It's always active (database constraint)
- Don't worry about hash collisions - Probability is ~0% with SHA256
- Don't manually delete duplicates - System handles it automatically
- Don't run multiple instances on same logs - Wasteful, though safe
- Don't modify request_hash values - Breaks deduplication
- ✅ Deduplication is automatic - No configuration required
- ✅ Cryptographically secure - SHA256 prevents collisions
- ✅ Database-enforced - UNIQUE index guarantees no duplicates
- ✅ Crash-safe - Safe to reprocess logs after crashes
- ✅ Idempotent - Safe to re-import logs multiple times
- ✅ High performance - 500-10,000 logs/sec depending on duplicates
- ✅ First-load optimized - 10-100x faster on empty database
```
Hash = SHA256(
    timestamp.Unix() + "|" +
    clientIP         + "|" +
    method           + "|" +
    host             + "|" +
    path             + "|" +
    queryString      + "|" +
    statusCode       + "|" +
    duration         + "|" +
    startUTC
)
```
- Traefik JSON: Nanosecond precision (virtually no collisions)
- Traefik CLF: Millisecond precision (very low collision risk)
- Generic CLF: Second precision (low collision risk)
- Normal operation: 500-2,000 logs/sec
- First load: 5,000-10,000 logs/sec (10-100x faster)
- Hash lookup: <1ms (indexed)
Last Updated: November 2025
- Home - Introduction and overview
- API Documentation - Complete API reference
- Environment Variables - Configuration guide
- Standalone Deployment - Native installation
- Docker Deployment - Container deployment