Skip to content

Deduplication System

Kolin edited this page Nov 6, 2025 · 2 revisions

LogLynx Deduplication System - Complete Guide


Table of Contents


Overview

LogLynx uses a cryptographic hash-based deduplication system to prevent duplicate log entries from being stored in the database. This ensures data integrity and prevents storage waste when:

  • Log files are reprocessed after crashes
  • Log rotation causes log re-reading
  • Multiple LogLynx instances process the same logs
  • Logs are manually re-imported

Key Features

  • SHA256 Hash-Based: Cryptographically secure, collision-resistant ``
  • Nanosecond Precision: Uses Duration and StartUTC for ultra-precise uniqueness
  • Database-Enforced: UNIQUE index guarantees no duplicates
  • Idempotent Processing: Safe to reprocess logs multiple times
  • First-Load Optimized: Skips deduplication checks on empty database (10-100x faster)
  • Automatic: No configuration required

How It Works

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│  1. LOG ENTRY READ                                          │
│     Source: Traefik JSON/CLF log file                       │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  2. PARSE LOG ENTRY                                         │
│     Extract: timestamp, IP, method, path, status, etc.      │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  3. CALCULATE HASH (SHA256)                                 │
│     Input: timestamp|IP|method|host|path|query|status|      │
│            duration|startUTC                                 │
│     Output: 64-character hex hash                           │
│     Example: a3f5b8c9d2e1f4a7c6b5d8e7f9a0c1d2...           │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  4. CHECK FOR DUPLICATES                                    │
│     Database: SELECT COUNT(*) WHERE request_hash = ?        │
│     Result: EXISTS or NOT EXISTS                            │
└────────────────┬────────────────────────────────────────────┘
                 ↓
         ┌───────┴───────┐
         ↓               ↓
┌─────────────────┐  ┌─────────────────┐
│  HASH EXISTS    │  │  HASH NEW       │
│  (Duplicate)    │  │  (Unique)       │
└────────┬────────┘  └────────┬────────┘
         ↓                    ↓
┌─────────────────┐  ┌─────────────────┐
│  SKIP INSERT    │  │  INSERT RECORD  │
│  Log: "Skipped  │  │  Success!       │
│   duplicate"    │  │                 │
└─────────────────┘  └─────────────────┘

Database Enforcement

The request_hash field has a UNIQUE INDEX:

CREATE UNIQUE INDEX idx_request_hash ON http_requests(request_hash);

This means:

  • Database guarantees no two records can have the same hash
  • Attempted duplicates are rejected at the database level
  • No duplicate can slip through, even in race conditions

Hash Calculation

Formula

The deduplication hash is calculated using 9 fields to create a unique identifier:

hashInput = timestamp.Unix() + "|" +
            clientIP + "|" +
            method + "|" +
            host + "|" +
            path + "|" +
            queryString + "|" +
            statusCode + "|" +
            duration + "|" +        // Nanosecond precision
            startUTC                // Timestamp with nanosecond precision

hash = SHA256(hashInput)

Example Calculation

Log Entry:

{
  "time": "2025-11-07T10:30:45.123456789Z",
  "ClientAddr": "103.4.250.66:48952",
  "RequestMethod": "GET",
  "request_Host": "api.example.com",
  "RequestPath": "/users?page=1",
  "DownstreamStatus": 200,
  "Duration": 299425702,
  "StartUTC": "2025-11-07T10:30:45.123456789Z"
}

Hash Input String:

1730976645|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z

SHA256 Hash:

a3f5b8c9d2e1f4a7c6b5d8e7f9a0c1d2b3e4f5a6c7d8e9f0a1b2c3d4e5f6a7b8

This hash is stored in the request_hash field.

Why These 9 Fields?

Field Purpose Precision
timestamp.Unix() Identifies the second when request occurred Second
clientIP Distinguishes requests from different clients Exact
method Differentiates GET, POST, PUT, DELETE, etc. Exact
host Separates requests to different services Exact
path Identifies the endpoint Exact
queryString Differentiates query parameters Exact
statusCode Allows same request with different results Exact
duration Nanosecond-level uniqueness Nanosecond
startUTC Timestamp with nanosecond precision Nanosecond

Precision Levels by Log Format

Format Precision Collision Risk
Traefik JSON Nanosecond Virtually 0% (1 in 10^18)
Traefik CLF Millisecond Very low (~0.001%)
Generic CLF Second Low (~1%)

Normal Operation Scenarios

Scenario 1: Normal Log Processing

Situation: LogLynx reads new log entries for the first time.

Time: 10:30:45.123456789
Request: GET /api/users from 103.4.250.66
Status: 200
Duration: 299425702 ns

Process:

  1. ✅ Parse log entry
  2. ✅ Calculate hash: a3f5b8c9...
  3. ✅ Check database: Hash NOT found
  4. Insert record → Success!

Result: Record stored successfully.


Scenario 2: Exact Duplicate (Same Second)

Situation: Two identical requests in the same second.

Request 1: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
Request 2: GET /api/users at 10:30:45.987654321 → Duration: 299425999 ns

Hash Inputs:

Hash 1: ...10:30:45.123456789|299425702|...
Hash 2: ...10:30:45.987654321|299425999|...
         ↑                     ↑
         Different nanosecond precision

Process:

  1. ✅ Calculate hash 1: a3f5b8c9...
  2. ✅ Calculate hash 2: b4f6c0d1... (different!)
  3. ✅ Both hashes unique
  4. Both records inserted

Result: Both requests stored (they're actually different).


Scenario 3: True Duplicate (Same Nanosecond)

Situation: Log line appears twice (file re-read, rotation, etc.).

Request 1: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
Request 2: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
                                                    ↑
                                              Exact duplicate

Hash Inputs:

Hash 1: 1730976645|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
Hash 2: 1730976645|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
        ↑
        Identical

Process:

  1. ✅ Calculate hash 1: a3f5b8c9... → Insert → Success
  2. ✅ Calculate hash 2: a3f5b8c9... (same!)
  3. ✅ Check database: Hash EXISTS
  4. ⚠️ Skip insert → Log: "Skipped duplicate"

Result: Second record NOT stored (duplicate detected).


Scenario 4: Different Status Codes (Same Request)

Situation: Same request, different outcomes.

Request 1: GET /api/users at 10:30:45.123456789 → Status: 200
Request 2: GET /api/users at 10:30:45.123456789 → Status: 500

Hash Inputs:

Hash 1: ...|200|...
Hash 2: ...|500|...
            ↑
            Different status code

Process:

  1. ✅ Calculate hash 1: a3f5b8c9...
  2. ✅ Calculate hash 2: c7d9e1f3... (different!)
  3. ✅ Both hashes unique
  4. Both records inserted

Result: Both stored (different status codes = different events).


Scenario 5: Different Query Parameters

Situation: Same path, different query strings.

Request 1: GET /api/users?page=1 at 10:30:45
Request 2: GET /api/users?page=2 at 10:30:45

Hash Inputs:

Hash 1: ...|/api/users|page=1|...
Hash 2: ...|/api/users|page=2|...
                        ↑
                        Different query parameter

Process:

  1. ✅ Calculate hash 1: a3f5b8c9...
  2. ✅ Calculate hash 2: d8e0f2a4... (different!)
  3. ✅ Both hashes unique
  4. Both records inserted

Result: Both stored (different query params = different requests).


Extraordinary Scenarios

Scenario 1: Application Crash During Processing

Situation: LogLynx crashes while processing a batch of 1000 logs. 500 were inserted, 500 were not.

What Happens:

Batch 1 (logs 1-500):   ✅ Inserted (position NOT updated)
Batch 2 (logs 501-1000): ❌ Crashed before insert
Position in database:    ⚠️ Still at log 0

On Restart:

  1. ✅ LogLynx reads from last position (0)
  2. ✅ Re-reads logs 1-1000
  3. ✅ Calculates hashes for all logs
  4. ✅ Logs 1-500: Hash EXISTS → Skip (duplicates)
  5. ✅ Logs 501-1000: Hash NOT found → Insert
  6. ✅ Position updated to 1000

Result: ✅ No data loss, no duplicates. Crash-safe!


Scenario 2: Log Rotation (File Renamed)

Situation: Log file rotated while LogLynx is running.

Before rotation:
  access.log (1000 lines) ← LogLynx reading at line 800

After rotation:
  access.log.1 (1000 lines) ← Old file renamed
  access.log (0 lines) ← New empty file

What Happens:

  1. ✅ LogLynx detects rotation (file inode changed or size < position)
  2. ✅ Resets position to 0
  3. ✅ Starts reading new access.log (empty)
  4. ✅ No logs to read yet
  5. ✅ Continues monitoring

If old logs are re-imported later:

  1. ✅ Calculate hashes for old logs
  2. ✅ All hashes EXIST (already in database)
  3. All skipped → No duplicates

Result: ✅ No duplicates, automatic handling.


Scenario 3: Log Rotation (File Truncated)

Situation: Log file truncated (common with copytruncate rotation).

Before:
  access.log (10 MB, position at 8 MB)

After truncation:
  access.log (0 MB, position still thinks it's at 8 MB)

What Happens:

  1. ✅ LogLynx detects truncation (file size < last position)
  2. ✅ Resets position to 0
  3. ✅ Starts reading from beginning
  4. ✅ File is empty (just truncated)
  5. ✅ Waits for new logs

Result: ✅ Automatic recovery, no errors.


Scenario 4: Manual Log Re-Import

Situation: Administrator manually re-imports old logs.

# Copy old logs back
cp access.log.old access.log
systemctl restart loglynx

What Happens:

  1. ✅ LogLynx starts reading access.log.old
  2. ✅ Calculates hashes for all entries
  3. ✅ Checks database for each hash
  4. All hashes EXIST (already imported)
  5. All entries skipped
  6. ✅ Log: "Skipped 10000 duplicates"

Result: ✅ No duplicates stored, idempotent import.


Scenario 5: Multiple LogLynx Instances (Same Logs)

Situation: Two LogLynx instances reading the same log file (NOT RECOMMENDED, but handled).

Instance 1: Reading access.log (position 0-1000)
Instance 2: Reading access.log (position 0-1000)

What Happens:

  1. ✅ Instance 1 calculates hashes, inserts records
  2. ✅ Instance 2 calculates same hashes
  3. ✅ Database rejects duplicates (UNIQUE constraint)
  4. ✅ Instance 2 logs: "Skipped duplicates"

Result: ✅ No duplicates, but wasteful (avoid this scenario).

Best Practice: Use different log sources per instance.


Scenario 6: Database Corruption Recovery

Situation: Database file corrupted, restore from backup.

Current database: 1 million records (Jan 1 - Dec 31)
Backup database: 500k records (Jan 1 - Jun 30)

What Happens:

  1. ✅ Restore backup (500k records)
  2. ✅ Restart LogLynx
  3. ✅ LogLynx reads logs from last position
  4. ✅ Processes logs from Jul 1 - Dec 31
  5. ✅ Hashes NOT in database (backup was old)
  6. All records inserted (no duplicates)

Result: ✅ Database restored, missing data re-imported.


Scenario 7: Hash Collision (Extremely Rare)

Situation: Two different requests generate the same SHA256 hash.

Probability: ~1 in 2^256 (virtually impossible)

What Would Happen (theoretical):

  1. ⚠️ Request 1 inserted with hash a3f5b8c9...
  2. ⚠️ Request 2 generates same hash
  3. ⚠️ Database rejects insert (UNIQUE constraint)
  4. ⚠️ Request 2 skipped as "duplicate"

Impact: One request lost (but probability is ~0%)

Mitigation: Not needed (SHA256 is collision-resistant).


First Load Optimization

What Is It?

When the database is empty (first load), LogLynx skips deduplication checks for maximum performance.

How It Works

// Check if database is empty
var count int64
db.Model(&HTTPRequest{}).Count(&count)
isFirstLoad := (count == 0)

if isFirstLoad {
    // Skip UNIQUE constraint handling
    // Insert all records in batch (10-100x faster)
} else {
    // Normal deduplication (check for duplicates)
}

Performance Comparison

Records First Load Normal Load Speedup
1K 0.1s 0.5s 5x
10K 1s 8s 8x
100K 10s 120s 12x
1M 2 min 20 min 10x
10M 20 min 6 hours 18x

When Does It Activate?

First install (database is empty) ✅ After database reset (all tables dropped) ❌ Normal operation (database has records)

Is It Safe?

Yes! The UNIQUE index still enforces uniqueness at the database level. ✅ Even on first load, duplicates are impossible (database constraint). ✅ Only skips the application-level duplicate handling for speed.


Performance Characteristics

Insert Performance

Scenario Speed Duplicates Detected
First load (empty DB) 5,000-10,000 logs/sec None (N/A)
Normal load (no duplicates) 500-2,000 logs/sec 0%
Normal load (50% duplicates) 400-1,500 logs/sec 50%
All duplicates 300-1,000 logs/sec 100%

Query Performance

Deduplication hash lookups are extremely fast due to the UNIQUE index:

-- Hash lookup (to check for duplicate)
SELECT COUNT(*) FROM http_requests WHERE request_hash = 'a3f5b8c9...';

Performance: <1ms (index lookup)

Storage Overhead

Component Size per Record
Hash field 64 bytes (fixed)
Index overhead ~50 bytes
Total ~114 bytes/record

For 1 million records: ~114 MB storage for deduplication.


Troubleshooting

Issue 1: "Too Many Duplicates Being Skipped"

Symptom: LogLynx logs show thousands of duplicates being skipped.

Possible Causes:

  1. Log rotation caused re-reading of old logs
  2. Multiple LogLynx instances reading same logs
  3. Manual log re-import

Solution:

# Check LogLynx logs
tail -f loglynx.log | grep "Skipped duplicate"

# If expected (log rotation):
✅ No action needed, working as designed

# If unexpected (multiple instances):
⚠️ Stop duplicate instances, use different log sources

Issue 2: "Legitimate Requests Being Skipped"

Symptom: Valid requests not appearing in database.

Possible Causes:

  1. Hash collision (extremely rare, ~0%)
  2. Two requests with identical nanosecond timestamp (very rare)

Diagnosis:

-- Find potential collisions
SELECT request_hash, COUNT(*) as count
FROM http_requests
GROUP BY request_hash
HAVING count > 1;

Solution:

-- Check if these are truly different requests
SELECT * FROM http_requests
WHERE request_hash = 'a3f5b8c9...'
ORDER BY timestamp;

-- If truly different (hash collision):
⚠️ Report to developers (should never happen with SHA256)

-- If identical (not a collision):
✅ Working as designed

Issue 3: "First Load Not Optimizing"

Symptom: First load is slow despite empty database.

Diagnosis:

-- Check if database is truly empty
SELECT COUNT(*) FROM http_requests;

Solution:

# If count > 0:
# Database not empty, first-load optimization won't activate

# If count = 0 but still slow:
# Check logs for "First load detected" message
tail -f loglynx.log | grep "First load detected"

# If message not present:
# Report to developers

Issue 4: "UNIQUE Constraint Errors in Logs"

Symptom: Database errors about UNIQUE constraint violations.

Cause: Normal! These errors are expected when duplicates are detected.

Solution: ✅ No action needed. LogLynx silently handles these errors.

The error is logged at DEBUG level, not ERROR level:

DEBUG: Skipped duplicate entries (total=1000, inserted=950, duplicates=50)

Best Practices

✅ DO

  1. Let LogLynx handle duplicates automatically - No configuration needed
  2. Monitor "Skipped duplicates" count - High numbers may indicate log rotation
  3. Use one LogLynx instance per log source - Avoid multiple readers
  4. Trust the system - SHA256 collision is virtually impossible
  5. Enable first-load optimization - Automatically enabled on empty DB

❌ DON'T

  1. Don't disable deduplication - It's always active (database constraint)
  2. Don't worry about hash collisions - Probability is ~0% (with a correct configuration of the password manager logs)
  3. Don't manually delete duplicates - System handles it automatically
  4. Don't run multiple instances on same logs - Wasteful, though safe
  5. Don't modify request_hash values - Breaks deduplication

Summary

Key Takeaways

  • Deduplication is automatic - No configuration required
  • Cryptographically secure - SHA256 prevents collisions
  • Database-enforced - UNIQUE index guarantees no duplicates
  • Crash-safe - Safe to reprocess logs after crashes
  • Idempotent - Safe to re-import logs multiple times
  • High performance - 500-10,000 logs/sec depending on duplicates
  • First-load optimized - 10-100x faster on empty database

Deduplication Formula

Hash = SHA256(
    timestamp.Unix() +
    clientIP +
    method +
    host +
    path +
    queryString +
    statusCode +
    duration +
    startUTC
)

Hash Precision

  • Traefik JSON: Nanosecond precision (virtually no collisions)
  • Traefik CLF: Millisecond precision (very low collision risk)
  • Generic CLF: Second precision (low collision risk)

Performance

  • Normal operation: 500-2,000 logs/sec
  • First load: 5,000-10,000 logs/sec (10-100x faster)
  • Hash lookup: <1ms (indexed)

Last Updated: November 2025

LogLynx Wiki

Getting Started

Deployment

Reverse proxy recommended configuration

Key features

Quick Links

API Endpoints

Configuration

Deployment

Support

Clone this wiki locally