Deduplication System

LogLynx Deduplication System - Complete Guide

Overview

LogLynx uses a cryptographic hash-based deduplication system to prevent duplicate log entries from being stored in the database. This ensures data integrity and prevents storage waste when:

Log files are reprocessed after crashes
Log rotation causes log re-reading
Multiple LogLynx instances process the same logs
Logs are manually re-imported

Key Features

✅ SHA256 Hash-Based: Cryptographically secure, collision-resistant ``
✅ Nanosecond Precision: Uses Duration and StartUTC for ultra-precise uniqueness
✅ Database-Enforced: UNIQUE index guarantees no duplicates
✅ Idempotent Processing: Safe to reprocess logs multiple times
✅ First-Load Optimized: Skips deduplication checks on empty database (10-100x faster)
✅ Automatic: No configuration required

How It Works

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│  1. LOG ENTRY READ                                          │
│     Source: Traefik JSON/CLF log file                       │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  2. PARSE LOG ENTRY                                         │
│     Extract: timestamp, IP, method, path, status, etc.      │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  3. CALCULATE HASH (SHA256)                                 │
│     Input: timestamp|IP|method|host|path|query|status|      │
│            duration|startUTC                                 │
│     Output: 64-character hex hash                           │
│     Example: a3f5b8c9d2e1f4a7c6b5d8e7f9a0c1d2...           │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  4. CHECK FOR DUPLICATES                                    │
│     Database: SELECT COUNT(*) WHERE request_hash = ?        │
│     Result: EXISTS or NOT EXISTS                            │
└────────────────┬────────────────────────────────────────────┘
                 ↓
         ┌───────┴───────┐
         ↓               ↓
┌─────────────────┐  ┌─────────────────┐
│  HASH EXISTS    │  │  HASH NEW       │
│  (Duplicate)    │  │  (Unique)       │
└────────┬────────┘  └────────┬────────┘
         ↓                    ↓
┌─────────────────┐  ┌─────────────────┐
│  SKIP INSERT    │  │  INSERT RECORD  │
│  Log: "Skipped  │  │  Success!       │
│   duplicate"    │  │                 │
└─────────────────┘  └─────────────────┘

Database Enforcement

The request_hash field has a UNIQUE INDEX:

CREATE UNIQUE INDEX idx_request_hash ON http_requests(request_hash);

This means:

Database guarantees no two records can have the same hash
Attempted duplicates are rejected at the database level
No duplicate can slip through, even in race conditions

Hash Calculation

Formula

The deduplication hash is calculated using 9 fields to create a unique identifier:

hashInput = timestamp.Unix() + "|" +
            clientIP + "|" +
            method + "|" +
            host + "|" +
            path + "|" +
            queryString + "|" +
            statusCode + "|" +
            duration + "|" +        // Nanosecond precision
            startUTC                // Timestamp with nanosecond precision

hash = SHA256(hashInput)

Example Calculation

Log Entry:

{
  "time": "2025-11-07T10:30:45.123456789Z",
  "ClientAddr": "103.4.250.66:48952",
  "RequestMethod": "GET",
  "request_Host": "api.example.com",
  "RequestPath": "/users?page=1",
  "DownstreamStatus": 200,
  "Duration": 299425702,
  "StartUTC": "2025-11-07T10:30:45.123456789Z"
}

Hash Input String:

1730976645|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z

SHA256 Hash:

a3f5b8c9d2e1f4a7c6b5d8e7f9a0c1d2b3e4f5a6c7d8e9f0a1b2c3d4e5f6a7b8

This hash is stored in the request_hash field.

Why These 9 Fields?

Field	Purpose	Precision
timestamp.Unix()	Identifies the second when request occurred	Second
clientIP	Distinguishes requests from different clients	Exact
method	Differentiates GET, POST, PUT, DELETE, etc.	Exact
host	Separates requests to different services	Exact
path	Identifies the endpoint	Exact
queryString	Differentiates query parameters	Exact
statusCode	Allows same request with different results	Exact
duration	Nanosecond-level uniqueness	Nanosecond
startUTC	Timestamp with nanosecond precision	Nanosecond

Precision Levels by Log Format

Format	Precision	Collision Risk
Traefik JSON	Nanosecond	Virtually 0% (1 in 10^18)
Traefik CLF	Millisecond	Very low (~0.001%)
Generic CLF	Second	Low (~1%)

Normal Operation Scenarios

Scenario 1: Normal Log Processing

Situation: LogLynx reads new log entries for the first time.

Time: 10:30:45.123456789
Request: GET /api/users from 103.4.250.66
Status: 200
Duration: 299425702 ns

Process:

✅ Parse log entry
✅ Calculate hash: a3f5b8c9...
✅ Check database: Hash NOT found
✅ Insert record → Success!

Result: Record stored successfully.

Scenario 2: Exact Duplicate (Same Second)

Situation: Two identical requests in the same second.

Request 1: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
Request 2: GET /api/users at 10:30:45.987654321 → Duration: 299425999 ns

Hash Inputs:

Hash 1: ...10:30:45.123456789|299425702|...
Hash 2: ...10:30:45.987654321|299425999|...
         ↑                     ↑
         Different nanosecond precision

Process:

✅ Calculate hash 1: a3f5b8c9...
✅ Calculate hash 2: b4f6c0d1... (different!)
✅ Both hashes unique
✅ Both records inserted

Result: Both requests stored (they're actually different).

Scenario 3: True Duplicate (Same Nanosecond)

Situation: Log line appears twice (file re-read, rotation, etc.).

Request 1: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
Request 2: GET /api/users at 10:30:45.123456789 → Duration: 299425702 ns
                                                    ↑
                                              Exact duplicate

Hash Inputs:

Hash 1: 1730976645|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
Hash 2: 1730976645|103.4.250.66|GET|api.example.com|/users|page=1|200|299425702|2025-11-07T10:30:45.123456789Z
        ↑
        Identical

Process:

✅ Calculate hash 1: a3f5b8c9... → Insert → Success
✅ Calculate hash 2: a3f5b8c9... (same!)
✅ Check database: Hash EXISTS
⚠️ Skip insert → Log: "Skipped duplicate"

Result: Second record NOT stored (duplicate detected).

Scenario 4: Different Status Codes (Same Request)

Situation: Same request, different outcomes.

Request 1: GET /api/users at 10:30:45.123456789 → Status: 200
Request 2: GET /api/users at 10:30:45.123456789 → Status: 500

Hash Inputs:

Hash 1: ...|200|...
Hash 2: ...|500|...
            ↑
            Different status code

Process:

✅ Calculate hash 1: a3f5b8c9...
✅ Calculate hash 2: c7d9e1f3... (different!)
✅ Both hashes unique
✅ Both records inserted

Result: Both stored (different status codes = different events).

Scenario 5: Different Query Parameters

Situation: Same path, different query strings.

Request 1: GET /api/users?page=1 at 10:30:45
Request 2: GET /api/users?page=2 at 10:30:45

Hash Inputs:

Hash 1: ...|/api/users|page=1|...
Hash 2: ...|/api/users|page=2|...
                        ↑
                        Different query parameter

Process:

✅ Calculate hash 1: a3f5b8c9...
✅ Calculate hash 2: d8e0f2a4... (different!)
✅ Both hashes unique
✅ Both records inserted

Result: Both stored (different query params = different requests).

Extraordinary Scenarios

Scenario 1: Application Crash During Processing

Situation: LogLynx crashes while processing a batch of 1000 logs. 500 were inserted, 500 were not.

What Happens:

Batch 1 (logs 1-500):   ✅ Inserted (position NOT updated)
Batch 2 (logs 501-1000): ❌ Crashed before insert
Position in database:    ⚠️ Still at log 0

On Restart:

✅ LogLynx reads from last position (0)
✅ Re-reads logs 1-1000
✅ Calculates hashes for all logs
✅ Logs 1-500: Hash EXISTS → Skip (duplicates)
✅ Logs 501-1000: Hash NOT found → Insert
✅ Position updated to 1000

Result: ✅ No data loss, no duplicates. Crash-safe!

Scenario 2: Log Rotation (File Renamed)

Situation: Log file rotated while LogLynx is running.

Before rotation:
  access.log (1000 lines) ← LogLynx reading at line 800

After rotation:
  access.log.1 (1000 lines) ← Old file renamed
  access.log (0 lines) ← New empty file

What Happens:

✅ LogLynx detects rotation (file inode changed or size < position)
✅ Resets position to 0
✅ Starts reading new access.log (empty)
✅ No logs to read yet
✅ Continues monitoring

If old logs are re-imported later:

✅ Calculate hashes for old logs
✅ All hashes EXIST (already in database)
✅ All skipped → No duplicates

Result: ✅ No duplicates, automatic handling.

Scenario 3: Log Rotation (File Truncated)

Situation: Log file truncated (common with copytruncate rotation).

Before:
  access.log (10 MB, position at 8 MB)

After truncation:
  access.log (0 MB, position still thinks it's at 8 MB)

What Happens:

✅ LogLynx detects truncation (file size < last position)
✅ Resets position to 0
✅ Starts reading from beginning
✅ File is empty (just truncated)
✅ Waits for new logs

Result: ✅ Automatic recovery, no errors.

Scenario 4: Manual Log Re-Import

Situation: Administrator manually re-imports old logs.

# Copy old logs back
cp access.log.old access.log
systemctl restart loglynx

What Happens:

✅ LogLynx starts reading access.log.old
✅ Calculates hashes for all entries
✅ Checks database for each hash
✅ All hashes EXIST (already imported)
✅ All entries skipped
✅ Log: "Skipped 10000 duplicates"

Result: ✅ No duplicates stored, idempotent import.

Scenario 5: Multiple LogLynx Instances (Same Logs)

Situation: Two LogLynx instances reading the same log file (NOT RECOMMENDED, but handled).

Instance 1: Reading access.log (position 0-1000)
Instance 2: Reading access.log (position 0-1000)

What Happens:

✅ Instance 1 calculates hashes, inserts records
✅ Instance 2 calculates same hashes
✅ Database rejects duplicates (UNIQUE constraint)
✅ Instance 2 logs: "Skipped duplicates"

Result: ✅ No duplicates, but wasteful (avoid this scenario).

Best Practice: Use different log sources per instance.

Scenario 6: Database Corruption Recovery

Situation: Database file corrupted, restore from backup.

Current database: 1 million records (Jan 1 - Dec 31)
Backup database: 500k records (Jan 1 - Jun 30)

What Happens:

✅ Restore backup (500k records)
✅ Restart LogLynx
✅ LogLynx reads logs from last position
✅ Processes logs from Jul 1 - Dec 31
✅ Hashes NOT in database (backup was old)
✅ All records inserted (no duplicates)

Result: ✅ Database restored, missing data re-imported.

Scenario 7: Hash Collision (Extremely Rare)

Situation: Two different requests generate the same SHA256 hash.

Probability: ~1 in 2^256 (virtually impossible)

What Would Happen (theoretical):

⚠️ Request 1 inserted with hash a3f5b8c9...
⚠️ Request 2 generates same hash
⚠️ Database rejects insert (UNIQUE constraint)
⚠️ Request 2 skipped as "duplicate"

Impact: One request lost (but probability is ~0%)

Mitigation: Not needed (SHA256 is collision-resistant).

First Load Optimization

What Is It?

When the database is empty (first load), LogLynx skips deduplication checks for maximum performance.

How It Works

// Check if database is empty
var count int64
db.Model(&HTTPRequest{}).Count(&count)
isFirstLoad := (count == 0)

if isFirstLoad {
    // Skip UNIQUE constraint handling
    // Insert all records in batch (10-100x faster)
} else {
    // Normal deduplication (check for duplicates)
}

Performance Comparison

Records	First Load	Normal Load	Speedup
1K	0.1s	0.5s	5x
10K	1s	8s	8x
100K	10s	120s	12x
1M	2 min	20 min	10x
10M	20 min	6 hours	18x

When Does It Activate?

✅ First install (database is empty) ✅ After database reset (all tables dropped) ❌ Normal operation (database has records)

Is It Safe?

✅ Yes! The UNIQUE index still enforces uniqueness at the database level. ✅ Even on first load, duplicates are impossible (database constraint). ✅ Only skips the application-level duplicate handling for speed.

Performance Characteristics

Insert Performance

Scenario	Speed	Duplicates Detected
First load (empty DB)	5,000-10,000 logs/sec	None (N/A)
Normal load (no duplicates)	500-2,000 logs/sec	0%
Normal load (50% duplicates)	400-1,500 logs/sec	50%
All duplicates	300-1,000 logs/sec	100%

Query Performance

Deduplication hash lookups are extremely fast due to the UNIQUE index:

-- Hash lookup (to check for duplicate)
SELECT COUNT(*) FROM http_requests WHERE request_hash = 'a3f5b8c9...';

Performance: <1ms (index lookup)

Storage Overhead

Component	Size per Record
Hash field	64 bytes (fixed)
Index overhead	~50 bytes
Total	~114 bytes/record

For 1 million records: ~114 MB storage for deduplication.

Troubleshooting

Issue 1: "Too Many Duplicates Being Skipped"

Symptom: LogLynx logs show thousands of duplicates being skipped.

Possible Causes:

Log rotation caused re-reading of old logs
Multiple LogLynx instances reading same logs
Manual log re-import

Solution:

# Check LogLynx logs
tail -f loglynx.log | grep "Skipped duplicate"

# If expected (log rotation):
✅ No action needed, working as designed

# If unexpected (multiple instances):
⚠️ Stop duplicate instances, use different log sources

Issue 2: "Legitimate Requests Being Skipped"

Symptom: Valid requests not appearing in database.

Possible Causes:

Hash collision (extremely rare, ~0%)
Two requests with identical nanosecond timestamp (very rare)

Diagnosis:

-- Find potential collisions
SELECT request_hash, COUNT(*) as count
FROM http_requests
GROUP BY request_hash
HAVING count > 1;

Solution:

-- Check if these are truly different requests
SELECT * FROM http_requests
WHERE request_hash = 'a3f5b8c9...'
ORDER BY timestamp;

-- If truly different (hash collision):
⚠️ Report to developers (should never happen with SHA256)

-- If identical (not a collision):
✅ Working as designed

Issue 3: "First Load Not Optimizing"

Symptom: First load is slow despite empty database.

Diagnosis:

-- Check if database is truly empty
SELECT COUNT(*) FROM http_requests;

Solution:

# If count > 0:
# Database not empty, first-load optimization won't activate

# If count = 0 but still slow:
# Check logs for "First load detected" message
tail -f loglynx.log | grep "First load detected"

# If message not present:
# Report to developers

Issue 4: "UNIQUE Constraint Errors in Logs"

Symptom: Database errors about UNIQUE constraint violations.

Cause: Normal! These errors are expected when duplicates are detected.

Solution: ✅ No action needed. LogLynx silently handles these errors.

The error is logged at DEBUG level, not ERROR level:

DEBUG: Skipped duplicate entries (total=1000, inserted=950, duplicates=50)

Best Practices

✅ DO

Let LogLynx handle duplicates automatically - No configuration needed
Monitor "Skipped duplicates" count - High numbers may indicate log rotation
Use one LogLynx instance per log source - Avoid multiple readers
Trust the system - SHA256 collision is virtually impossible
Enable first-load optimization - Automatically enabled on empty DB

❌ DON'T

Don't disable deduplication - It's always active (database constraint)
Don't worry about hash collisions - Probability is ~0% (with a correct configuration of the password manager logs)
Don't manually delete duplicates - System handles it automatically
Don't run multiple instances on same logs - Wasteful, though safe
Don't modify request_hash values - Breaks deduplication

Summary

Key Takeaways

✅ Deduplication is automatic - No configuration required
✅ Cryptographically secure - SHA256 prevents collisions
✅ Database-enforced - UNIQUE index guarantees no duplicates
✅ Crash-safe - Safe to reprocess logs after crashes
✅ Idempotent - Safe to re-import logs multiple times
✅ High performance - 500-10,000 logs/sec depending on duplicates
✅ First-load optimized - 10-100x faster on empty database

Deduplication Formula

Hash = SHA256(
    timestamp.Unix() +
    clientIP +
    method +
    host +
    path +
    queryString +
    statusCode +
    duration +
    startUTC
)

Deduplication System

LogLynx Deduplication System - Complete Guide

Table of Contents

Overview

Key Features

How It Works

Architecture Overview

Database Enforcement

Hash Calculation

Formula

Example Calculation

Why These 9 Fields?

Precision Levels by Log Format

Normal Operation Scenarios

Scenario 1: Normal Log Processing

Scenario 2: Exact Duplicate (Same Second)

Scenario 3: True Duplicate (Same Nanosecond)

Scenario 4: Different Status Codes (Same Request)

Scenario 5: Different Query Parameters

Extraordinary Scenarios

Scenario 1: Application Crash During Processing

Scenario 2: Log Rotation (File Renamed)

Scenario 3: Log Rotation (File Truncated)

Scenario 4: Manual Log Re-Import

Scenario 5: Multiple LogLynx Instances (Same Logs)

Scenario 6: Database Corruption Recovery

Scenario 7: Hash Collision (Extremely Rare)

First Load Optimization

What Is It?

How It Works

Performance Comparison

When Does It Activate?

Is It Safe?

Performance Characteristics

Insert Performance

Query Performance

Storage Overhead

Troubleshooting

Issue 1: "Too Many Duplicates Being Skipped"

Issue 2: "Legitimate Requests Being Skipped"

Issue 3: "First Load Not Optimizing"

Issue 4: "UNIQUE Constraint Errors in Logs"

Best Practices

✅ DO

❌ DON'T

Summary

Key Takeaways

Deduplication Formula

Hash Precision

Performance

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

LogLynx Wiki

Getting Started

Deployment

Reverse proxy recommended configuration

Key features

Quick Links

API Endpoints

Configuration

Deployment

Support

Clone this wiki locally