link-foundation/deduplino

Deduplino

A CLI tool that deduplicates lino-format files: it identifies patterns in repeated link references and replaces them with numbered references, improving readability and reducing file size.

Installation

Using Bun (Recommended)

# Install globally with bun
bun install -g deduplino

# Or from source
git clone <repository-url>
cd deduplino
bun install
bun run build

Using NPM (Fallback)

npm install -g deduplino

Quick Start

# Basic usage
deduplino -i input.lino -o output.lino

# From stdin to stdout
printf '(test link)\n(test link)\n' | deduplino --piped-input

# Process with different threshold
deduplino --deduplication-threshold 0.5 -i input.lino

How It Works

Deduplino analyzes lino files to find patterns in link references and creates optimized representations using three pattern types.

Auto-Escape Feature

The --auto-escape option automatically converts non-lino text (like logs) into valid lino format:

  1. First attempt: Escape only references containing colons (timestamps, URLs, field names)
  2. Second attempt: Escape references with special characters (!@#$%^&*+=|\\:;?/<>.,)
  3. Final fallback: Escape all references except simple punctuation and quoted strings

Example log processing:

Input:  2025-07-25T21:32:46Z updateReferences id: a43fad436d79
Output: '2025-07-25T21:32:46Z' updateReferences 'id:' a43fad436d79
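
The first attempt described above can be sketched in a few lines. This is only an illustration, not deduplino's actual code: the name escapeColonReferences is hypothetical, and the sketch assumes references are separated by whitespace and contain no single quotes.

```typescript
// Hypothetical helper sketching the first auto-escape attempt only:
// wrap every whitespace-separated reference that contains a colon in single quotes.
// Assumption: references contain no single quotes themselves.
function escapeColonReferences(line: string): string {
  return line
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => (token.includes(":") ? `'${token}'` : token))
    .join(" ");
}

const escaped = escapeColonReferences(
  "2025-07-25T21:32:46Z updateReferences id: a43fad436d79",
);
// escaped === "'2025-07-25T21:32:46Z' updateReferences 'id:' a43fad436d79"
```

The later attempts would widen the set of characters that trigger quoting; they are not sketched here.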

Pattern Types

1. Exact Duplicates

Links that appear identically multiple times.

Input:

(first second)
(first second)
(first second)

Output:

1: first second
1
1
1
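
A minimal sketch of how exact duplicates could be counted, assuming each link is already available as its content string. The function name countExactDuplicates is illustrative and not part of deduplino's API.

```typescript
// Hypothetical sketch: count identical link contents with a Map
// and keep only the contents that occur more than once.
function countExactDuplicates(links: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const link of links) {
    counts.set(link, (counts.get(link) ?? 0) + 1);
  }
  const duplicates = new Map<string, number>();
  counts.forEach((count, content) => {
    if (count > 1) duplicates.set(content, count);
  });
  return duplicates;
}

const duplicates = countExactDuplicates([
  "first second",
  "first second",
  "first second",
  "other link",
]);
// duplicates.get("first second") === 3; "other link" is not a duplicate
```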

2. Prefix Patterns

Links that share common beginnings.

Input:

(this is a link of cat)
(this is a link of tree)

Output:

1: this is a link of
1 cat
1 tree
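
As a rough illustration of reference-level prefix matching, the sketch below compares two links reference by reference. The name commonPrefix is hypothetical, and the links are assumed to be pre-split into arrays of references.

```typescript
// Hypothetical sketch: longest shared run of leading references in two links.
function commonPrefix(a: string[], b: string[]): string[] {
  const prefix: string[] = [];
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit && a[i] === b[i]; i++) {
    prefix.push(a[i]);
  }
  return prefix;
}

const shared = commonPrefix(
  "this is a link of cat".split(" "),
  "this is a link of tree".split(" "),
);
// shared.join(" ") === "this is a link of"
```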

3. Suffix Patterns

Links that share common endings.

Input:

(foo ends here)
(bar ends here)

Output:

1: ends here
foo 1
bar 1
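
The suffix case mirrors the prefix case, walking from the end of each link. The sketch below also shows how a link might be rewritten once a suffix has been assigned a number. Both commonSuffix and applySuffix are hypothetical names, and the rewrite step is an assumption about how the output above could be produced.

```typescript
// Hypothetical sketch: longest shared run of trailing references in two links.
function commonSuffix(a: string[], b: string[]): string[] {
  const suffix: string[] = [];
  let i = a.length - 1;
  let j = b.length - 1;
  while (i >= 0 && j >= 0 && a[i] === b[j]) {
    suffix.unshift(a[i]);
    i--;
    j--;
  }
  return suffix;
}

// Replace the trailing `suffix` references of `link` with the numbered reference `id`.
// Assumes `link` really ends with `suffix`.
function applySuffix(link: string[], suffix: string[], id: number): string {
  const head = link.slice(0, link.length - suffix.length);
  return [...head, String(id)].join(" ");
}

const foo = ["foo", "ends", "here"];
const bar = ["bar", "ends", "here"];
const suffix = commonSuffix(foo, bar); // ["ends", "here"]
const rewritten = [applySuffix(foo, suffix, 1), applySuffix(bar, suffix, 1)];
// rewritten is ["foo 1", "bar 1"]
```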

Advanced Pattern Detection

The tool handles complex nested structures and can identify patterns in structured links:

Input:

(this is) a link
(this is) a link

Output:

1: this is
1 a link
1 a link

Algorithm

  1. Parse input using the Protocols.Lino parser
  2. Keep only links with 2+ references (deduplicatable content)
  3. Identify Patterns:
    • Exact duplicates
    • Common prefixes between link pairs
    • Common suffixes between link pairs
    • Special handling for structured links
  4. Score & Select patterns by (frequency × pattern_length)
  5. Apply top patterns based on threshold
  6. Format output using the library's formatLinks function

CLI Options

| Option | Short | Description | Default |
| [input-file] | | Input file as positional argument | - |
| --input | -i | Input file path (alternative to positional argument) | - |
| --output | -o | Output file path (smart naming if not provided) | - |
| --deduplication-threshold | -p | Percentage of patterns to apply (0-1) | 0.2 |
| --auto-escape | | Automatically escape input to make it valid lino format | false |
| --piped-input | | Read from stdin (use when piping data) | false |
| --fail-on-parse-error | | Exit with code 1 if input cannot be parsed as lino format | false |
| --detect-auto-escape-edge-cases | | Analyze log file line-by-line to find cases that auto-escape cannot fix | false |
| --help | -h | Show help information | - |

Examples

Basic File Processing

# Deduplicate a file (smart output naming)
deduplino document.lino
# Creates document.deduped.lino

# Deduplicate with custom output
deduplino document.lino -o compressed.lino

# Traditional flag syntax
deduplino -i document.lino -o compressed.lino

# Process from pipeline
cat document.lino | deduplino --piped-input > compressed.lino

# Quick stdin processing
printf '(test)\n(test)\n' | deduplino --piped-input

Smart Output Naming

When you don't specify an output file, deduplino automatically generates one:

# File with .lino extension
deduplino input.lino           # → input.deduped.lino

# File without .lino extension  
deduplino server.log          # → server.log.deduped.lino
deduplino data.txt            # → data.txt.deduped.lino
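
The naming rule shown above can be expressed as a small function. This is a sketch that reproduces the listed examples; defaultOutputPath is a hypothetical name, not necessarily what the CLI uses internally.

```typescript
// Hypothetical sketch of the smart naming rule:
// "x.lino" becomes "x.deduped.lino"; any other name gets ".deduped.lino" appended.
function defaultOutputPath(inputPath: string): string {
  const extension = ".lino";
  return inputPath.endsWith(extension)
    ? `${inputPath.slice(0, -extension.length)}.deduped.lino`
    : `${inputPath}.deduped.lino`;
}

const outputs = ["input.lino", "server.log", "data.txt"].map(defaultOutputPath);
// outputs is ["input.deduped.lino", "server.log.deduped.lino", "data.txt.deduped.lino"]
```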

Threshold Control

# Conservative (default) - top 20% of patterns
deduplino document.lino

# More aggressive - top 50% of patterns
deduplino --deduplication-threshold 0.5 -i document.lino

# Maximum deduplication - all patterns
deduplino --deduplication-threshold 1.0 -i document.lino

Auto-Escape for Logs

# Process log files that aren't valid lino format
deduplino --auto-escape -i server.log -o processed.lino

# Handle timestamps and special characters
echo "2025-07-25T21:32:46Z error: connection failed" | deduplino --auto-escape --piped-input
# Output: '2025-07-25T21:32:46Z' 'error:' connection failed

Pipeline Usage

# Chain with other tools
some-tool | deduplino --piped-input | other-tool

# Multiple processing steps
cat input.lino | deduplino --piped-input -p 0.3 | tee intermediate.lino | final-processor

Error Handling and Validation

# Validate lino format - exit with code 1 if invalid
deduplino --fail-on-parse-error -i document.lino

# Auto-escape with validation - useful for CI/CD pipelines
deduplino --auto-escape --fail-on-parse-error -i log.txt
# This will attempt auto-escape, but fail if it still can't parse the result

# Check if auto-escape worked properly
echo "problematic: input" | deduplino --piped-input --auto-escape --fail-on-parse-error

Edge Case Detection and Analysis

# Analyze a log file to find problematic lines
deduplino --detect-auto-escape-edge-cases -i server.log

# Find edge cases in piped input
cat application.log | deduplino --piped-input --detect-auto-escape-edge-cases

# Example output:
# 🔍 Found 3 edge case(s) that auto-escape cannot fix:
# 
# 📂 Unbalanced Parentheses (2 cases):
#    Line 42: "))((("
#    Line 156: "))((()))(("
# 
# 📂 Only Punctuation (1 cases):  
#    Line 89: "( ( ( ) )"
#
# 📊 Statistics:
#    Total lines processed: 1000
#    Failed lines: 3
#    Success rate: 99.7%
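
To make the report above more concrete, here is a sketch of one check such an analysis might perform, plus the success-rate arithmetic. How deduplino actually classifies lines is not documented here; hasBalancedParentheses and successRate are hypothetical helpers written for illustration.

```typescript
// Hypothetical check: true when every ")" closes an earlier "(" and nothing is left open.
function hasBalancedParentheses(line: string): boolean {
  let depth = 0;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closing parenthesis with no opener
    }
  }
  return depth === 0;
}

// Success rate as a percentage string with one decimal place.
function successRate(totalLines: number, failedLines: number): string {
  return (((totalLines - failedLines) / totalLines) * 100).toFixed(1);
}

const balanced = hasBalancedParentheses("(first second)"); // true
const unbalanced = hasBalancedParentheses("))((("); // false
const rate = successRate(1000, 3); // "99.7"
```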

Pattern Selection Strategy

The --deduplication-threshold parameter controls which patterns are applied:

  • 0.2 (default): Apply top 20% of patterns for optimal readability/compression balance
  • 0.5: More aggressive deduplication, may impact readability
  • 1.0: Maximum deduplication, applies all found patterns

Patterns are ranked by: frequency × pattern_length
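
The threshold can be read as "keep the top fraction of the ranked patterns". The sketch below assumes the cut-off is rounded up with Math.ceil, which is a guess; the tool may round differently. ScoredPattern and selectTopPatterns are illustrative names.

```typescript
// Hypothetical shape for a ranked pattern.
interface ScoredPattern {
  pattern: string;
  score: number;
}

// Keep the top `threshold` fraction (0-1) of patterns by score.
// Assumption: the number kept is rounded up.
function selectTopPatterns(
  patterns: ScoredPattern[],
  threshold: number,
): ScoredPattern[] {
  const ranked = [...patterns].sort((a, b) => b.score - a.score);
  const keep = Math.ceil(ranked.length * threshold);
  return ranked.slice(0, keep);
}

const candidates: ScoredPattern[] = [
  { pattern: "a b", score: 4 },
  { pattern: "this is a link of", score: 10 },
  { pattern: "first second", score: 6 },
  { pattern: "x y", score: 2 },
  { pattern: "ends here", score: 8 },
];
const conservative = selectTopPatterns(candidates, 0.2); // 1 pattern
const everything = selectTopPatterns(candidates, 1.0); // all 5 patterns
```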

Development

Setup

bun install

Testing

# Run all tests
bun test

# Watch mode
bun test --watch

Building

# Build for production
bun run build

# Development mode with file watching
bun run dev

Project Structure

src/
├── index.ts          # CLI interface and argument parsing
└── deduplicator.ts   # Core deduplication algorithm
tests/
└── deduplicator.test.ts  # Comprehensive test suite (27 tests)

Algorithm Details

Pattern Finding

  • Exact: Map-based counting of identical content
  • Prefix/Suffix: Pairwise comparison with reference-level matching
  • Structured: Special handling for nested link structures like (this is) a link

Pattern Scoring

Patterns are scored by count × pattern.split(' ').length to favor:

  • High-frequency patterns (appear many times)
  • Longer patterns (more compression benefit)
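
The scoring formula is small enough to show directly. The wrapper name scorePattern is hypothetical; the expression itself is the one quoted above.

```typescript
// Score = number of occurrences × number of space-separated references in the pattern.
function scorePattern(pattern: string, count: number): number {
  return count * pattern.split(" ").length;
}

const prefixScore = scorePattern("this is a link of", 2); // 2 × 5 = 10
const exactScore = scorePattern("first second", 3); // 3 × 2 = 6
const suffixScore = scorePattern("ends here", 2); // 2 × 2 = 4
```

With these numbers, the five-reference prefix outranks the more frequent two-reference duplicate, which shows how length is weighted against frequency.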

Overlap Prevention

Selected patterns are filtered to prevent overlap - each link content can only be part of one pattern.
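
One way to enforce this rule is a greedy pass over the patterns in score order, skipping any pattern that touches a link already claimed. The greedy strategy, the Candidate shape, and the name selectNonOverlapping are all assumptions made for this sketch; the actual implementation may differ.

```typescript
// Hypothetical shape: a pattern, its score, and the indices of the links it covers.
interface Candidate {
  pattern: string;
  score: number;
  links: number[];
}

// Greedy selection: take patterns from highest to lowest score,
// skipping any whose links are already used by a selected pattern.
function selectNonOverlapping(candidates: Candidate[]): Candidate[] {
  const used = new Set<number>();
  const selected: Candidate[] = [];
  const ranked = [...candidates].sort((a, b) => b.score - a.score);
  for (const candidate of ranked) {
    if (candidate.links.some((index) => used.has(index))) continue;
    candidate.links.forEach((index) => used.add(index));
    selected.push(candidate);
  }
  return selected;
}

const selected = selectNonOverlapping([
  { pattern: "first second", score: 6, links: [1, 2] },
  { pattern: "this is a link of", score: 10, links: [0, 1] },
  { pattern: "ends here", score: 4, links: [3, 4] },
]);
// selected patterns: "this is a link of", then "ends here";
// "first second" is skipped because link 1 is already taken
```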

Dependencies

License

This is free and unencumbered software released into the public domain.

See LICENSE for full details or visit https://unlicense.org

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass: bun test
  5. Submit a pull request

Links
