A CLI tool for deduplicating lino format files by identifying patterns in repeated link references and replacing them with numbered references for improved readability and reduced file size.
# Install globally with bun
bun install -g deduplino
# Or from source
git clone <repository-url>
cd deduplino
bun install
bun run build
npm install -g deduplino
# Basic usage
deduplino -i input.lino -o output.lino
# From stdin to stdout
printf '(test link)\n(test link)\n' | deduplino --piped-input
# Process with different threshold
deduplino --deduplication-threshold 0.5 -i input.lino
Deduplino analyzes lino files to find patterns in link references and creates optimized representations using three pattern types.
The --auto-escape option automatically converts non-lino text (like logs) into valid lino format:
- First attempt: Escape only references containing colons (timestamps, URLs, field names)
- Second attempt: Escape references with special characters (!@#$%^&*+=|\\:;?/<>.,)
- Final fallback: Escape all references except simple punctuation and quoted strings
Example log processing:
Input: 2025-07-25T21:32:46Z updateReferences id: a43fad436d79
Output: '2025-07-25T21:32:46Z' updateReferences 'id:' a43fad436d79
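The first escaping attempt can be pictured with a small sketch. This is not deduplino's actual implementation: the function name escapeColonTokens is hypothetical, and it assumes references are separated by whitespace and do not already contain single quotes.

```typescript
// Hypothetical helper (not from deduplino's source): wrap every
// whitespace-separated token that contains a colon in single quotes.
function escapeColonTokens(line: string): string {
  return line
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => (token.includes(":") ? `'${token}'` : token))
    .join(" ");
}

const escapedLine = escapeColonTokens(
  "2025-07-25T21:32:46Z updateReferences id: a43fad436d79"
);
console.log(escapedLine);
// '2025-07-25T21:32:46Z' updateReferences 'id:' a43fad436d79
```

Tokens without a colon pass through unchanged, which is why updateReferences and the trailing id stay unquoted in this sketch.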
Links that appear identically multiple times.
Input:
(first second)
(first second)
(first second)
Output:
1: first second
1
1
1
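A minimal sketch of how exact duplicates could be counted with a Map. It treats each link as a plain string of space-separated references rather than a parsed lino object, and the function name findExactDuplicates is an assumption made for illustration.

```typescript
// Illustrative only: links are plain strings here, not parsed lino links.
// Returns each link content that occurs at least twice, with its count.
function findExactDuplicates(links: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const link of links) {
    counts.set(link, (counts.get(link) ?? 0) + 1);
  }
  const duplicates = new Map<string, number>();
  for (const [content, count] of counts) {
    if (count >= 2) {
      duplicates.set(content, count);
    }
  }
  return duplicates;
}

const exactDuplicates = findExactDuplicates([
  "first second",
  "first second",
  "first second",
]);
console.log(exactDuplicates.get("first second")); // 3
```

Each entry in the returned map is a candidate for a numbered definition such as 1: first second.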
Links that share common beginnings.
Input:
(this is a link of cat)
(this is a link of tree)
Output:
1: this is a link of
1 cat
1 tree
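The shared beginning of two links can be found by comparing them reference by reference. The sketch below assumes references are split on single spaces; commonPrefix is a hypothetical name, not a function exported by deduplino.

```typescript
// Hypothetical helper: longest run of equal references at the start
// of both links, compared position by position.
function commonPrefix(a: string[], b: string[]): string[] {
  const prefix: string[] = [];
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    const token = a[i];
    if (token === undefined || token !== b[i]) {
      break;
    }
    prefix.push(token);
  }
  return prefix;
}

const sharedPrefix = commonPrefix(
  "this is a link of cat".split(" "),
  "this is a link of tree".split(" ")
);
console.log(sharedPrefix.join(" ")); // this is a link of
```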
Links that share common endings.
Input:
(foo ends here)
(bar ends here)
Output:
1: ends here
foo 1
bar 1
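Suffix detection mirrors the prefix case but walks from the end of each link. As before, this is a sketch under the assumption that references are space-separated, and commonSuffix is a hypothetical name.

```typescript
// Hypothetical helper: longest run of equal references at the end
// of both links, compared from the last reference backwards.
function commonSuffix(a: string[], b: string[]): string[] {
  const suffix: string[] = [];
  let i = a.length - 1;
  let j = b.length - 1;
  while (i >= 0 && j >= 0) {
    const token = a[i];
    if (token === undefined || token !== b[j]) {
      break;
    }
    suffix.unshift(token);
    i--;
    j--;
  }
  return suffix;
}

const sharedSuffix = commonSuffix(
  "foo ends here".split(" "),
  "bar ends here".split(" ")
);
console.log(sharedSuffix.join(" ")); // ends here
```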
The tool handles complex nested structures and can identify patterns in structured links:
Input:
(this is) a link
(this is) a link
Output:
1: this is
1 a link
1 a link
- Parse input using the Protocols.Lino parser
- Filter links with 2+ references (deduplicatable content)
- Identify Patterns:
  - Exact duplicates
  - Common prefixes between link pairs
  - Common suffixes between link pairs
  - Special handling for structured links
- Score & Select patterns by (frequency × pattern_length)
- Apply top patterns based on threshold
- Format output using library's formatLinks function
| Option | Short | Description | Default |
|---|---|---|---|
| [input-file] | | Input file as positional argument | - |
| --input | -i | Input file path (alternative to positional argument) | - |
| --output | -o | Output file path (smart naming if not provided) | - |
| --deduplication-threshold | -p | Percentage of patterns to apply (0-1) | 0.2 |
| --auto-escape | | Automatically escape input to make it valid lino format | false |
| --piped-input | | Read from stdin (use when piping data) | false |
| --fail-on-parse-error | | Exit with code 1 if input cannot be parsed as lino format | false |
| --detect-auto-escape-edge-cases | | Analyze log file line-by-line to find cases that auto-escape cannot fix | false |
| --help | -h | Show help information | - |
# Deduplicate a file (smart output naming)
deduplino document.lino
# Creates document.deduped.lino
# Deduplicate with custom output
deduplino document.lino -o compressed.lino
# Traditional flag syntax
deduplino -i document.lino -o compressed.lino
# Process from pipeline
cat document.lino | deduplino --piped-input > compressed.lino
# Quick stdin processing
printf '(test)\n(test)\n' | deduplino --piped-input
When you don't specify an output file, deduplino automatically generates one:
# File with .lino extension
deduplino input.lino # → input.deduped.lino
# File without .lino extension
deduplino server.log # → server.log.deduped.lino
deduplino data.txt # → data.txt.deduped.lino
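The naming rule shown in these examples can be written as a short function. This is a sketch that reproduces the three cases above; defaultOutputPath is a hypothetical name, and the real CLI may handle further cases (directories, uppercase extensions) differently.

```typescript
// Hypothetical helper reproducing the documented naming examples:
// a trailing ".lino" is replaced, any other name gets the suffix appended.
function defaultOutputPath(inputPath: string): string {
  const extension = ".lino";
  if (inputPath.endsWith(extension)) {
    return inputPath.slice(0, -extension.length) + ".deduped.lino";
  }
  return inputPath + ".deduped.lino";
}

console.log(defaultOutputPath("input.lino")); // input.deduped.lino
console.log(defaultOutputPath("server.log")); // server.log.deduped.lino
console.log(defaultOutputPath("data.txt")); // data.txt.deduped.lino
```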
# Conservative (default) - top 20% of patterns
deduplino document.lino
# More aggressive - top 50% of patterns
deduplino --deduplication-threshold 0.5 -i document.lino
# Maximum deduplication - all patterns
deduplino --deduplication-threshold 1.0 -i document.lino
# Process log files that aren't valid lino format
deduplino --auto-escape -i server.log -o processed.lino
# Handle timestamps and special characters
echo "2025-07-25T21:32:46Z error: connection failed" | deduplino --auto-escape --piped-input
# Output: '2025-07-25T21:32:46Z' 'error:' connection failed
# Chain with other tools
some-tool | deduplino --piped-input | other-tool
# Multiple processing steps
cat input.lino | deduplino --piped-input -p 0.3 | tee intermediate.lino | final-processor
# Validate lino format - exit with code 1 if invalid
deduplino --fail-on-parse-error -i document.lino
# Auto-escape with validation - useful for CI/CD pipelines
deduplino --auto-escape --fail-on-parse-error -i log.txt
# This will attempt auto-escape, but fail if it still can't parse the result
# Check if auto-escape worked properly
echo "problematic: input" | deduplino --piped-input --auto-escape --fail-on-parse-error
# Analyze a log file to find problematic lines
deduplino --detect-auto-escape-edge-cases -i server.log
# Find edge cases in piped input
cat application.log | deduplino --piped-input --detect-auto-escape-edge-cases
# Example output:
# 🔍 Found 3 edge case(s) that auto-escape cannot fix:
#
# 📂 Unbalanced Parentheses (2 cases):
# Line 42: "))((("
# Line 156: "))((()))(("
#
# 📂 Only Punctuation (1 cases):
# Line 89: "( ( ( ) )"
#
# 📊 Statistics:
# Total lines processed: 1000
# Failed lines: 3
# Success rate: 99.7%
The --deduplication-threshold parameter controls which patterns are applied:
- 0.2 (default): Apply top 20% of patterns for optimal readability/compression balance
- 0.5: More aggressive deduplication, may impact readability
- 1.0: Maximum deduplication, applies all found patterns
Patterns are ranked by: frequency × pattern_length
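To make the ranking concrete, here is a sketch of threshold-based selection. The Pattern shape and the selectTopPatterns name are assumptions, and rounding the cutoff up with Math.ceil is a guess; the tool may round differently.

```typescript
// Illustrative pattern record: the shared text and how often it occurs.
interface Pattern {
  text: string;
  count: number;
}

// Score as described above: frequency × number of references in the pattern.
function patternScore(pattern: Pattern): number {
  return pattern.count * pattern.text.split(" ").length;
}

// Keep the top fraction of patterns by score.
// Assumption: the cutoff is rounded up; the real tool may round otherwise.
function selectTopPatterns(patterns: Pattern[], threshold: number): Pattern[] {
  const ranked = [...patterns].sort(
    (a, b) => patternScore(b) - patternScore(a)
  );
  const keep = Math.ceil(ranked.length * threshold);
  return ranked.slice(0, keep);
}

const candidates: Pattern[] = [
  { text: "first second", count: 3 }, // score 6
  { text: "this is a link of", count: 2 }, // score 10
  { text: "ends here", count: 2 }, // score 4
  { text: "this is", count: 2 }, // score 4
  { text: "test", count: 5 }, // score 5
];

console.log(selectTopPatterns(candidates, 0.2).map((p) => p.text));
console.log(selectTopPatterns(candidates, 1.0).length);
```

With five candidates, a threshold of 0.2 keeps only the highest-scoring pattern in this sketch, and 1.0 keeps all five.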
bun install
# Run all tests
bun test
# Watch mode
bun test --watch
# Build for production
bun run build
# Development mode with file watching
bun run dev
src/
├── index.ts # CLI interface and argument parsing
├── deduplicator.ts # Core deduplication algorithm
tests/
└── deduplicator.test.ts # Comprehensive test suite (27 tests)
- Exact: Map-based counting of identical content
- Prefix/Suffix: Pairwise comparison with reference-level matching
- Structured: Special handling for nested link structures like (this is) a link
Patterns are scored by count × pattern.split(' ').length to favor:
- High-frequency patterns (appear many times)
- Longer patterns (more compression benefit)
Selected patterns are filtered to prevent overlap - each link content can only be part of one pattern.
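One way to enforce the no-overlap rule is a greedy pass over the candidates in score order. The sketch below is an assumption about how such filtering could work, not a description of deduplino's code: the Candidate shape is hypothetical, links are identified by their index in the input, and a candidate is dropped entirely if any of its links is already claimed.

```typescript
// Hypothetical candidate record: pattern text, a precomputed score,
// and the indices of the input links the pattern would rewrite.
interface Candidate {
  text: string;
  score: number;
  links: number[];
}

// Greedy selection: visit candidates from highest to lowest score and
// accept one only if none of its links belongs to an accepted pattern.
function selectNonOverlapping(candidates: Candidate[]): Candidate[] {
  const claimed = new Set<number>();
  const selected: Candidate[] = [];
  const ranked = [...candidates].sort((a, b) => b.score - a.score);
  for (const candidate of ranked) {
    if (candidate.links.some((index) => claimed.has(index))) {
      continue;
    }
    for (const index of candidate.links) {
      claimed.add(index);
    }
    selected.push(candidate);
  }
  return selected;
}

const nonOverlapping = selectNonOverlapping([
  { text: "this is", score: 4, links: [0, 1] },
  { text: "this is a link of", score: 10, links: [0, 1] },
  { text: "ends here", score: 4, links: [2, 3] },
]);
console.log(nonOverlapping.map((c) => c.text));
// [ 'this is a link of', 'ends here' ]
```

The shorter pattern "this is" is skipped here because both of its links were already claimed by the higher-scoring prefix.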
- @linksplatform/protocols-lino: Lino format parsing and formatting
- yargs: Command-line argument parsing
This is free and unencumbered software released into the public domain.
See LICENSE for full details or visit https://unlicense.org
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: bun test
- Submit a pull request