tokendiff

A Go library and CLI for token-level diffing with delimiter support.

tokendiff uses a histogram diff algorithm that groups semantically related changes together, producing more readable output than traditional Myers-based approaches for complex structural changes.

Motivation

Traditional diff tools operate at the line level. Word-based tools like wdiff improve on this but can produce suboptimal results when comparing code. For example, when comparing:

void someFunction(SomeType var)
void someFunction(SomeOtherType var)

wdiff reports that someFunction(SomeType changed to someFunction(SomeOtherType, grouping the function name with the parameter type because it splits only on whitespace.

tokendiff treats delimiter characters like ( as separate tokens, correctly identifying that only SomeType changed to SomeOtherType.
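
The difference between the two tokenizations can be illustrated with a small standalone sketch. It uses only the standard library, and `splitWithDelimiters` is a hypothetical helper written for this illustration, not the library's `Tokenize` implementation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitWithDelimiters is a hypothetical helper for illustration only.
// It splits text on whitespace and emits every character found in delims
// as a separate single-character token.
func splitWithDelimiters(text, delims string) []string {
	var tokens []string
	var word strings.Builder

	// flush appends the word collected so far, if any.
	flush := func() {
		if word.Len() > 0 {
			tokens = append(tokens, word.String())
			word.Reset()
		}
	}

	for _, r := range text {
		switch {
		case unicode.IsSpace(r):
			flush()
		case strings.ContainsRune(delims, r):
			flush()
			tokens = append(tokens, string(r))
		default:
			word.WriteRune(r)
		}
	}
	flush()
	return tokens
}

func main() {
	line := "void someFunction(SomeType var)"

	// Whitespace-only splitting keeps "someFunction(SomeType" together.
	fmt.Printf("%q\n", strings.Fields(line))

	// Delimiter-aware splitting isolates "SomeType" as its own token.
	fmt.Printf("%q\n", splitWithDelimiters(line, "()"))
}
```

With whitespace-only splitting, the second token is `someFunction(SomeType`, so any change to the parameter type drags the function name along with it. Once `(` and `)` are tokens of their own, `SomeType` can be compared in isolation.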

Algorithm

This library uses the histogram diff algorithm via diffx. The histogram algorithm is a variant of the patience diff algorithm that performs well on real-world text by:

Finding unique tokens that appear exactly once in each input (strong anchors)
Using frequency analysis to avoid matching common tokens that would create confusing output
Recursively diffing the regions between anchors

This approach produces output that groups semantically related changes together, making diffs easier to read than traditional Myers-based algorithms when comparing files with significant structural changes.
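
The first step, finding tokens that occur exactly once in each input, can be sketched as follows. This is a simplified illustration and not the diffx implementation: `uniqueAnchors` is a hypothetical name, and a real histogram diff also handles low-frequency tokens and recursion between anchors.

```go
package main

import "fmt"

// uniqueAnchors is a hypothetical helper for illustration only.
// It returns the tokens that occur exactly once in a and exactly once in b,
// in the order they appear in a.
func uniqueAnchors(a, b []string) []string {
	countA := make(map[string]int)
	countB := make(map[string]int)
	for _, t := range a {
		countA[t]++
	}
	for _, t := range b {
		countB[t]++
	}

	var anchors []string
	for _, t := range a {
		if countA[t] == 1 && countB[t] == 1 {
			anchors = append(anchors, t)
		}
	}
	return anchors
}

func main() {
	a := []string{"void", "f", "(", "int", "x", ",", "int", "y", ")"}
	b := []string{"void", "f", "(", "long", "x", ",", "int", "y", ")"}

	// "int" appears twice in a, so it is not used as an anchor.
	fmt.Println(uniqueAnchors(a, b))
}
```

Repeated tokens such as `int` are poor anchors because there is more than one plausible way to pair them up, which is the ambiguity the frequency analysis is meant to avoid.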

Installation

Library

go get github.com/dacharyc/tokendiff

CLI Tool

go install github.com/dacharyc/tokendiff/cmd/tokendiff@latest

CLI Usage

tokendiff [options] file1 file2
tokendiff [options] -stdin file2

Options

Input/Output:

Flag	Description
`-d "..."`	Custom delimiter characters
`-P, --punctuation`	Use Unicode punctuation as delimiters
`-W, --white-space "..."`	Custom whitespace characters
`--line-mode`	Compare files line by line
`-C N`	Show N lines of context (implies --line-mode)
`-L N, --line-numbers N`	Show line numbers with width N (0 for auto)
`-stdin`	Read first input from stdin
`--diff-input`	Read unified diff from stdin and apply token-level diff

Output Formatting:

Flag	Description
`-w "..."`	String to mark start of deleted text (default: `[-`)
`-x "..."`	String to mark end of deleted text (default: `-]`)
`-y "..."`	String to mark start of inserted text (default: `{+`)
`-z "..."`	String to mark end of inserted text (default: `+}`)
`-c, --color SPEC`	Set colors (format: `del_fg[:bg],ins_fg[:bg]`, or `list`)
`--no-color`	Disable colored output
`-l, --less-mode`	Use overstrike for `less -r` viewing
`-p, --printer`	Use overstrike for printing
`-R, --repeat-markers`	Repeat markers at line boundaries
`-a, --aggregate-changes`	Combine adjacent insertions/deletions

Output Suppression:

Flag	Description
`-1`	Suppress deleted words
`-2`	Suppress inserted words
`-3`	Suppress common words

Comparison:

Flag	Description
`-i, --ignore-case`	Case-insensitive comparison
`-m N, --match-context N`	Minimum matching words between changes

Other:

Flag	Description
`-s, --statistics`	Print diff statistics
`--profile NAME`	Use settings from `~/.tokendiffrc.<NAME>`
`-v, --version`	Show version
`-h`	Show help

The CLI respects the NO_COLOR environment variable.

Configuration Files

tokendiff supports configuration files to set default options:

~/.tokendiffrc - Default configuration (loaded automatically)
~/.config/tokendiff/config - XDG-compliant location (fallback)
~/.tokendiffrc.<profile> - Named profile (use with --profile)

Config file format:

# Comment
option-name
option-name=value

Example ~/.tokendiffrc.html:

# HTML output profile
start-delete=<del>
stop-delete=</del>
start-insert=<ins>
stop-insert=</ins>
no-color

Usage:

tokendiff --profile=html old.txt new.txt

Command-line options override configuration file settings.
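
For readers who want to generate or validate these files from their own tooling, the format described above is simple enough to parse in a few lines. The sketch below is not tokendiff's own loader; `parseConfig` is a hypothetical helper, and it assumes that blank lines are ignored and that a bare `option-name` can be represented as the value "true".

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseConfig is a hypothetical helper for illustration only.
// It reads "option-name" and "option-name=value" lines, skipping blank
// lines and lines that start with '#'. Bare options are stored as "true"
// (an assumption made for this sketch).
func parseConfig(text string) map[string]string {
	opts := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, value, hasValue := strings.Cut(line, "=")
		if !hasValue {
			value = "true"
		}
		opts[strings.TrimSpace(name)] = value
	}
	return opts
}

func main() {
	profile := "# HTML output profile\nstart-delete=<del>\nstop-delete=</del>\nno-color\n"
	opts := parseConfig(profile)
	fmt.Println(opts["start-delete"], opts["stop-delete"], opts["no-color"])
}
```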

Exit Codes

Code	Meaning
0	Files are identical
1	Files differ
2	Error occurred
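
When calling the CLI from another Go program, the exit codes in the table can be mapped onto a boolean and an error. The following is a minimal sketch that assumes a `tokendiff` binary is available on `PATH`; `interpretExit` and `filesDiffer` are hypothetical helpers and not part of the library.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// interpretExit maps an exit code from the table above to a result:
// 0 means identical, 1 means the files differ, anything else is an error.
func interpretExit(code int) (differ bool, err error) {
	switch code {
	case 0:
		return false, nil
	case 1:
		return true, nil
	default:
		return false, fmt.Errorf("tokendiff failed with exit code %d", code)
	}
}

// filesDiffer runs the tokendiff binary (assumed to be on PATH) and
// discards its output, reporting only whether the files differ.
func filesDiffer(file1, file2 string) (bool, error) {
	err := exec.Command("tokendiff", file1, file2).Run()
	if err == nil {
		return interpretExit(0)
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return interpretExit(exitErr.ExitCode())
	}
	// The binary was not found or could not be started.
	return false, err
}

func main() {
	differ, err := filesDiffer("old.txt", "new.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("files differ:", differ)
}
```

Treating exit code 1 as a normal result rather than a failure matters here, because `exec.Cmd.Run` returns a non-nil error for any non-zero exit status.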

Examples

# Compare two files
tokendiff old.txt new.txt

# Line-by-line with context
tokendiff --line-mode -C 3 old.go new.go

# Compare git versions
git show HEAD~1:file.go | tokendiff -stdin file.go

# Custom delimiters
tokendiff -d "(){}[]" file1.txt file2.txt

# Case-insensitive comparison with statistics
tokendiff -i -s old.txt new.txt

# HTML-style markers
tokendiff -w '<del>' -x '</del>' -y '<ins>' -z '</ins>' old.txt new.txt

# View in less with overstrike highlighting
tokendiff -l old.txt new.txt | less -r

# Apply token-level diff to a unified diff
git diff | tokendiff --diff-input
diff -u old.txt new.txt | tokendiff --diff-input

Library Usage

Basic Usage

package main

import (
    "fmt"
    "github.com/dacharyc/tokendiff"
)

func main() {
    // Avoid naming a variable "new": it shadows Go's built-in new function.
    oldText := "void someFunction(SomeType var)"
    newText := "void someFunction(SomeOtherType var)"

    diffs := tokendiff.DiffStrings(oldText, newText, tokendiff.DefaultOptions())
    fmt.Println(tokendiff.FormatDiff(diffs))
    // Output: void someFunction([-SomeType-]{+SomeOtherType+} var)
}

Working with Tokens

// Tokenize text with delimiter awareness
opts := tokendiff.DefaultOptions()
tokens1 := tokendiff.Tokenize("foo(bar, baz)", opts)
// tokens1 = ["foo", "(", "bar", ",", "baz", ")"]
tokens2 := tokendiff.Tokenize("foo(bar, qux)", opts)

// Diff pre-tokenized content
diffs := tokendiff.DiffTokens(tokens1, tokens2)

Custom Delimiters

opts := tokendiff.Options{
    Delimiters: "|:-",  // Custom delimiter set
}
diffs := tokendiff.DiffStrings(text1, text2, opts)

Preserving Whitespace

opts := tokendiff.Options{
    Delimiters:         tokendiff.DefaultDelimiters,
    PreserveWhitespace: true,  // Include whitespace as tokens
}

API

Types

type Operation int
const (
    Equal  Operation = iota  // Token unchanged
    Insert                   // Token was added
    Delete                   // Token was removed
)

type Diff struct {
    Type  Operation
    Token string
}

type Options struct {
    Delimiters         string  // Characters to treat as separate tokens
    Whitespace         string  // Characters to treat as whitespace
    UsePunctuation     bool    // Use Unicode punctuation as delimiters
    PreserveWhitespace bool    // Include whitespace as tokens
    IgnoreCase         bool    // Case-insensitive comparison
}

type FormatOptions struct {
    StartDelete string  // Marker for start of deleted text (default: "[-")
    StopDelete  string  // Marker for end of deleted text (default: "-]")
    StartInsert string  // Marker for start of inserted text (default: "{+")
    StopInsert  string  // Marker for end of inserted text (default: "+}")
    NoDeleted   bool    // Suppress deleted tokens
    NoInserted  bool    // Suppress inserted tokens
    NoCommon    bool    // Suppress unchanged tokens
}
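
A `[]Diff` is a flat sequence of operations, so it can be consumed with a plain loop. The sketch below redeclares `Operation` and `Diff` locally, mirroring the definitions above, so that it runs without the library; in real code the `tokendiff.Diff` values returned by `DiffStrings` would be used instead. `countOps` is a hypothetical helper.

```go
package main

import "fmt"

// Local copies of the documented types, declared here only so the
// sketch is self-contained.
type Operation int

const (
	Equal Operation = iota
	Insert
	Delete
)

type Diff struct {
	Type  Operation
	Token string
}

// countOps is a hypothetical helper that tallies each operation type.
func countOps(diffs []Diff) (equal, inserted, deleted int) {
	for _, d := range diffs {
		switch d.Type {
		case Equal:
			equal++
		case Insert:
			inserted++
		case Delete:
			deleted++
		}
	}
	return equal, inserted, deleted
}

func main() {
	diffs := []Diff{
		{Equal, "void"},
		{Equal, "someFunction"},
		{Equal, "("},
		{Delete, "SomeType"},
		{Insert, "SomeOtherType"},
		{Equal, "var"},
		{Equal, ")"},
	}
	equal, inserted, deleted := countOps(diffs)
	fmt.Printf("equal=%d inserted=%d deleted=%d\n", equal, inserted, deleted)
}
```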

Functions

Tokenizing and Diffing:

Tokenize(text string, opts Options) []string - Split text into tokens
DiffTokens(tokens1, tokens2 []string) []Diff - Diff two token slices
DiffStrings(text1, text2 string, opts Options) []Diff - Tokenize and diff two strings
DefaultOptions() Options - Get default options

Diff Transformations:

AggregateDiffs(diffs []Diff) []Diff - Combine adjacent same-type operations
ApplyMatchContext(diffs []Diff, minContext int) []Diff - Require minimum matching words between changes

Formatting:

FormatDiff(diffs []Diff) string - Format diff with default markers
FormatDiffWithOptions(diffs []Diff, opts FormatOptions) string - Format with custom markers
DefaultFormatOptions() FormatOptions - Get default format options
HasChanges(diffs []Diff) bool - Check if diff contains any changes
NeedsSpaceBefore(token string) bool - Check if space should precede token
NeedsSpaceAfter(token string) bool - Check if space should follow token
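
To show how start and stop markers wrap changed tokens, here is a deliberately simplified formatter. It is not `FormatDiffWithOptions`: the types are redeclared locally, `formatWithMarkers` is a hypothetical helper, and it joins every token with a single space, whereas the library decides spacing per token (see `NeedsSpaceBefore` and `NeedsSpaceAfter`).

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented types, declared here only so the
// sketch is self-contained.
type Operation int

const (
	Equal Operation = iota
	Insert
	Delete
)

type Diff struct {
	Type  Operation
	Token string
}

// formatWithMarkers is a hypothetical helper. It wraps deleted and inserted
// tokens in the given markers and joins all tokens with single spaces,
// which is a simplification of the library's spacing rules.
func formatWithMarkers(diffs []Diff, startDel, stopDel, startIns, stopIns string) string {
	parts := make([]string, 0, len(diffs))
	for _, d := range diffs {
		switch d.Type {
		case Delete:
			parts = append(parts, startDel+d.Token+stopDel)
		case Insert:
			parts = append(parts, startIns+d.Token+stopIns)
		default:
			parts = append(parts, d.Token)
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	diffs := []Diff{
		{Equal, "return"},
		{Delete, "oldValue"},
		{Insert, "newValue"},
	}
	fmt.Println(formatWithMarkers(diffs, "<del>", "</del>", "<ins>", "</ins>"))
}
```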

Unified Diff Parsing:

ParseUnifiedDiff(input string) ([]UnifiedDiff, error) - Parse unified diff format
ApplyWordDiff(hunk DiffHunk, opts Options) []Diff - Apply token-level diff to a hunk

Default Delimiters

(){}[]<>,.;:!?"'`@#$%^&*+-=/\|~

Performance

Benchmarks on Apple M1:

BenchmarkTokenize      ~2.5 µs/op
BenchmarkDiffStrings   ~10.5 µs/op

License

MIT
