rvl reveals the smallest set of numeric changes that explain what actually changed between two datasets — or confidently tells you nothing changed.

rvl

CI License: MIT GitHub release

Reveal the smallest set of numeric changes that explain what actually changed.

No AI. No inference. Pure deterministic arithmetic.

brew install cmdrvl/tap/rvl

TL;DR

The Problem: Comparing CSV exports by hand is slow and noisy — Excel hell, brittle scripts, eyeballing numbers. When two files differ, you need to know what actually changed and whether it matters.

The Solution: One command, one verdict. rvl finds the smallest ranked set of numeric deltas that explain the change — or proves nothing changed — using deterministic arithmetic. Never probabilistic. Never ambiguous.

Why Use rvl?

| Feature | What It Does |
| --- | --- |
| Ranked explanations | Finds the fewest cells that account for ≥95% of total numeric change |
| Three clear outcomes | REAL CHANGE, NO REAL CHANGE, or REFUSAL; never a partial answer |
| Tolerance-aware | Ignores floating-point noise at or below your tolerance, so rounding noise is not reported as change |
| Machine-readable | --json output for pipelines, CI gates, and automation |
| Zero config | Auto-detects delimiters, numeric formats, currency symbols, accounting parens |
| Deterministic | Same inputs always produce the same output: no sampling, no heuristics |

Quick Example

$ rvl old.csv new.csv --key id
RVL

REAL CHANGE

Compared: old.csv -> new.csv
Alignment: key=id
Columns: common=15 old_only=2 new_only=1
Checked: 4,183 rows, 12 numeric columns (50,196 cells)
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none
Ranking: abs(delta) (unscaled)
Settings: threshold=95.0% tolerance=1e-9

3 cells explain 95.2% of total numeric change (threshold 95.0%):

1. NVDA.market_value  +1842100  (123 -> 1842223)
2. UST10Y.price       -0.37     (4.21 -> 3.84)
3. EURUSD.fx_rate     +0.0013   (1.0842 -> 1.0855)

Everything else in common numeric columns is <= tolerance or in the tail (not required to reach threshold).

Out of 50,196 cells, 3 cells explain 95.2% of all numeric change. That's the whole answer.

# No change? Proof:
$ rvl old.csv old_copy.csv
# → NO REAL CHANGE (exit 0), max delta 7e-10

# Machine-readable:
$ rvl old.csv new.csv --json | jq '.contributors[0]'

# Exit code only (for scripts):
$ rvl old.csv new.csv > /dev/null 2>&1
$ echo $?  # 0 = no change, 1 = changed, 2 = refused

The Three Outcomes

rvl always produces exactly one of three outcomes. There are no partial results, "and N more" buckets, or probabilistic scores.

1. REAL CHANGE

Printed when the top contributors (up to 25) explain at least the threshold share (default 95%) of total numeric change.

RVL

REAL CHANGE

Compared: old.csv -> new.csv
Alignment: key=id
Columns: common=15 old_only=2 new_only=1
Checked: 4,183 rows, 12 numeric columns (50,196 cells)
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none
Ranking: abs(delta) (unscaled)
Settings: threshold=95.0% tolerance=1e-9

3 cells explain 95.2% of total numeric change (threshold 95.0%):

1. NVDA.market_value  +1842100  (123 -> 1842223)
2. UST10Y.price       -0.37     (4.21 -> 3.84)
3. EURUSD.fx_rate     +0.0013   (1.0842 -> 1.0855)

Everything else in common numeric columns is <= tolerance or in the tail (not required to reach threshold).

How to read this:

  • 3 cells explain 95.2% — only 3 numeric cells (out of 50,196) account for 95.2% of all numeric change.
  • Contributors — ranked by abs(delta), largest first. Each shows the cell label (row_id.column), signed delta, and old → new values.
  • Coverage — cumulative share of total change (L1 distance). rvl prints the smallest prefix of contributors whose cumulative coverage reaches the threshold.
  • Threshold — if the top 25 contributors can't reach 95%, rvl refuses (E_DIFFUSE) instead of printing a misleading partial list.

2. NO REAL CHANGE

Printed when all numeric deltas are within tolerance.

RVL

NO REAL CHANGE

Compared: old.csv -> new.csv
Alignment: row-order (no key)
Columns: common=15 old_only=2 new_only=1
Checked: 4,183 rows, 12 numeric columns (50,196 cells)
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none
Ranking: abs(delta) (unscaled)
Settings: threshold=95.0% tolerance=1e-9
Max abs delta: 7e-10 (<= tolerance 1e-9).
No numeric deltas above tolerance in common numeric columns.

How to read this:

  • Max abs delta — the largest absolute difference observed across all cells (before tolerance zeroing). Proves nothing slipped through.
  • This is a deterministic guarantee: every common numeric cell was checked.

3. REFUSAL

Printed when rvl cannot produce a deterministic verdict. Always includes a concrete next step.

RVL ERROR (E_KEY_DUP)

Compared: old.csv -> new.csv
Alignment: key=id
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none
Settings: threshold=95.0% tolerance=1e-9

Cannot align rows: key "id" is not unique in old.csv (first duplicate: "A123" at data record 184).
Next: choose a unique key column or dedupe the data, then rerun.

How to read this:

  • Error code — machine-stable identifier (e.g., E_KEY_DUP). See Refusal Codes.
  • Example — first concrete instance of the problem (file, record number, value).
  • Next — a concrete rerun command or remediation step. Refusals are operator handoffs, never dead ends.

How It Works

Alignment

Row-order mode (no --key): rows align by position. Requires identical non-blank row counts. If rvl detects that rows are shuffled (via key discovery), it refuses with E_NEED_KEY and suggests a --key to use.

Key mode (--key <column>): rows align by matching key values. Key values are ASCII-trimmed, must be non-empty and unique within each file, and must match exactly between files. Any violation produces a specific refusal (E_NO_KEY, E_KEY_EMPTY, E_KEY_DUP, E_KEY_MISMATCH).
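The key-mode rules above can be sketched in a few lines of Python. This is an illustration only, not rvl's implementation: the function name `align_by_key`, the dict-per-row input shape, the exact set of characters trimmed, and the order in which violations are checked are all assumptions made for the example.

```python
ASCII_WHITESPACE = " \t\r\n\x0b\x0c"  # assumed meaning of "ASCII-trimmed"

def align_by_key(old_rows, new_rows, key):
    """Return (pairs, None) on success or (None, refusal_code) on failure.

    old_rows / new_rows: lists of dicts, one per data record (hypothetical shape).
    """
    def index(rows):
        seen = {}
        for row in rows:
            if key not in row:
                return None, "E_NO_KEY"          # key column missing
            value = row[key].strip(ASCII_WHITESPACE)
            if value == "":
                return None, "E_KEY_EMPTY"       # empty key in a data row
            if value in seen:
                return None, "E_KEY_DUP"         # key not unique within the file
            seen[value] = row
        return seen, None

    old_index, err = index(old_rows)
    if err:
        return None, err
    new_index, err = index(new_rows)
    if err:
        return None, err
    if old_index.keys() != new_index.keys():
        return None, "E_KEY_MISMATCH"            # key sets differ between files
    # Pair rows by key; sorting is only for a stable order in this sketch.
    return [(k, old_index[k], new_index[k]) for k in sorted(old_index)], None
```

Because rows are paired by key value, the two files may list records in different orders and still align.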

Numeric Columns

Only columns present in both files are compared, and only numeric columns are diffed. A column is numeric if, in every aligned row, the cell is either missing on both sides or parses as a finite number on both sides.

Supported numeric formats:

  • Plain: 123, -123.45, 1e6, -1.2E-3
  • Thousands separators: 1,234, -1,234,567.89 (US-style, 3-digit groups)
  • Currency prefix: $123.45, -$1,234.56, $-100
  • Accounting parentheses: (123.45) → parsed as -123.45
  • Leading + is allowed: +123, +$1,234.56

Missing tokens (case-insensitive): empty string, -, NA, N/A, NULL, NAN, NONE.
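As a rough illustration of the formats and missing tokens listed above, here is a small Python parser. It is a sketch under stated assumptions, not rvl's parser: the names `parse_cell` and `MISSING` are hypothetical, surrounding whitespace is assumed to be trimmed, and behavior on inputs the list does not mention (for example a sign inside accounting parentheses) is a guess.

```python
import math
import re

MISSING = {"", "-", "NA", "N/A", "NULL", "NAN", "NONE"}

# Optional sign, optional "$", optional sign, then digits (plain or in
# US-style 3-digit groups), optional fraction, optional exponent.
_NUMBER = re.compile(
    r"^(?P<sign1>[+-]?)\$?(?P<sign2>[+-]?)"
    r"(?P<int>\d{1,3}(?:,\d{3})+|\d+)"
    r"(?P<frac>\.\d+)?(?P<exp>[eE][+-]?\d+)?$"
)

def parse_cell(text):
    """Return None for a missing token, a float for a number, else raise ValueError."""
    token = text.strip()
    if token.upper() in MISSING:
        return None
    negate = False
    if token.startswith("(") and token.endswith(")"):
        negate = True                      # accounting parentheses mean negative
        token = token[1:-1]
    m = _NUMBER.match(token)
    if m is None or (m["sign1"] and m["sign2"]):
        raise ValueError(f"not numeric: {text!r}")
    value = float(m["int"].replace(",", "") + (m["frac"] or "") + (m["exp"] or ""))
    if (m["sign1"] or m["sign2"]) == "-":
        value = -value
    if negate:
        value = -value
    if not math.isfinite(value):
        raise ValueError(f"not finite: {text!r}")
    return value
```

A value such as 12,34 is rejected here because the comma does not separate a 3-digit group, which matches the US-style grouping rule stated above.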

Tolerance

Absolute noise floor applied per-cell. If abs(new - old) <= tolerance, the delta is treated as zero (no contribution). Default: 1e-9. There is no relative/percentage tolerance in v0.

max_abs_delta in the output tracks the largest raw delta observed (before zeroing) for transparency.
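The tolerance rule and the max_abs_delta bookkeeping can be illustrated as follows. This is a minimal sketch of the arithmetic described above; the function names `cell_contribution` and `summarize_deltas` are made up for the example and do not come from rvl.

```python
def cell_contribution(old, new, tolerance=1e-9):
    """Return (contribution, raw_abs_delta) for one aligned numeric cell."""
    raw = abs(new - old)
    # Deltas at or below the tolerance contribute nothing.
    contribution = 0.0 if raw <= tolerance else raw
    return contribution, raw

def summarize_deltas(pairs, tolerance=1e-9):
    """Return (total_change, max_abs_delta) over an iterable of (old, new) pairs."""
    total_change = 0.0
    max_abs_delta = 0.0
    for old, new in pairs:
        contribution, raw = cell_contribution(old, new, tolerance)
        total_change += contribution
        max_abs_delta = max(max_abs_delta, raw)   # tracked before zeroing
    return total_change, max_abs_delta
```

When total_change comes back as zero, max_abs_delta still shows how large the largest ignored delta was.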

Threshold and Coverage

  • Total change = sum of all abs(delta) values above tolerance (L1 distance across all common numeric cells).
  • Contribution = abs(delta) for a single cell (after tolerance).
  • Coverage = (sum of the top contributors' contributions) / total change.
  • Threshold (default 0.95) = minimum coverage required for a REAL CHANGE verdict.
  • MAX_CONTRIBUTORS = 25 (hard cap, not configurable in v0).

If the top 25 contributors can't reach the threshold, rvl refuses with E_DIFFUSE rather than printing an incomplete explanation. Lower the threshold explicitly if needed: --threshold 0.80.
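The definitions above translate into a short prefix search. The Python below is a sketch of that arithmetic only; the function name `explain` and its return shape are assumptions for illustration, and the input is assumed to be the per-cell contributions after tolerance zeroing.

```python
MAX_CONTRIBUTORS = 25  # hard cap stated above

def explain(contributions, threshold=0.95):
    """Return (verdict, prefix) for a list of abs(delta) contributions.

    verdict is "NO_REAL_CHANGE", "REAL_CHANGE", or "E_DIFFUSE".
    """
    nonzero = sorted((c for c in contributions if c > 0), reverse=True)
    total_change = sum(nonzero)
    if total_change == 0:
        return "NO_REAL_CHANGE", []
    running = 0.0
    for k, contribution in enumerate(nonzero[:MAX_CONTRIBUTORS], start=1):
        running += contribution
        if running / total_change >= threshold:
            return "REAL_CHANGE", nonzero[:k]    # smallest prefix reaching threshold
    return "E_DIFFUSE", []                        # top 25 fall short: refuse
```

For example, contributions of 90, 6, 2, and 2 sum to 100, so the first cell alone covers 90% and the first two cover 96%, which is the smallest prefix reaching a 95% threshold. One hundred equal contributions cannot be explained by 25 cells at 95%, so the sketch returns E_DIFFUSE.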

Contributor Ranking

Contributors are ranked by abs(delta) descending (unscaled — large-magnitude columns dominate by design). Ties are broken by row ID ascending, then column name ascending (byte order). rvl prints only the smallest prefix of contributors whose cumulative coverage reaches the threshold.
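The ordering can be expressed as a single sort key. This Python sketch assumes contributors are (row_id, column, delta) tuples and that row IDs, like column names, compare in byte order; both the tuple shape and the name `rank` are illustrative assumptions.

```python
def rank(contributors):
    """Sort (row_id, column, delta) tuples: largest abs(delta) first,
    then row ID ascending, then column name ascending (UTF-8 byte order)."""
    return sorted(
        contributors,
        key=lambda c: (-abs(c[2]), c[0].encode("utf-8"), c[1].encode("utf-8")),
    )
```

Because the key uses abs(delta), a change of -5 and a change of +5 tie on magnitude and fall through to the row ID and column name tie-breaks.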


How rvl Compares

| Capability | rvl | Excel / Sheets | diff / csvdiff | Custom pandas script |
| --- | --- | --- | --- | --- |
| Ranked numeric explanation | ✅ Top-K with coverage proof | ❌ Manual | ❌ Row-level only | ⚠️ You write it |
| Deterministic verdict | ✅ Always | ❌ Human judgment | ⚠️ Text diff only | ⚠️ You write it |
| Tolerance handling | ✅ Built-in | ❌ Manual rounding | ❌ None | ⚠️ You write it |
| Refusal on ambiguity | ✅ Never wrong, refuses instead | ❌ Silent errors | ❌ Garbage in/out | ❌ Crashes |
| Auto-detects delimiters | | N/A | | |
| Setup time | ✅ One curl command | N/A | ⚠️ Minutes | ❌ Hours |
| Machine-readable output | --json | | ⚠️ Text only | |

When to use rvl:

  • Monthly/quarterly reconciliation of CSV exports (holdings, positions, balances)
  • CI gate: did the pipeline output actually change?
  • Audit trail: prove what changed and by how much

When rvl might not be ideal:

  • Non-numeric diffs (text columns, schema changes) — use shape for structural checks first
  • Files that don't fit in memory
  • Diffs where you need relative (percentage) tolerance — not yet supported in v0

Installation

Homebrew (Recommended)

brew install cmdrvl/tap/rvl

Shell Script

curl -fsSL https://raw.githubusercontent.com/cmdrvl/rvl/main/scripts/install.sh | bash

Windows (PowerShell)

Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://raw.githubusercontent.com/cmdrvl/rvl/main/scripts/install.ps1'))

From Source

cargo build --release
./target/release/rvl --help

Prebuilt binaries are available for Linux and macOS (x86_64 and ARM64) and for Windows (x86_64). Each release includes SHA256 checksums, cosign signatures, and an SBOM.


CLI Reference

rvl <old.csv> <new.csv> [OPTIONS]

Flags

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --key <column> | string | (none) | Align rows by key column value. Without this, rows align by position (1st↔1st, 2nd↔2nd, etc.). |
| --threshold <float> | float | 0.95 | Coverage target (0 < x ≤ 1.0). The minimum fraction of total numeric change that the top contributors must explain. |
| --tolerance <float> | float | 1e-9 | Per-cell noise floor (x ≥ 0). Absolute deltas ≤ this value are treated as zero. |
| --delimiter <delim> | string | (auto-detect) | Force CSV delimiter for both files. See Delimiter. |
| --capsule-out <dir> | string | (disabled) | Write deterministic replay capsule artifacts (manifest.json, old.csv, new.csv, output.txt, replay.sh) to <dir>/capsule-<id>/. |
| --json | flag | false | Emit a single JSON object on stdout instead of human-readable output. |

Invalid --threshold or --tolerance values are CLI argument errors (exit 2).

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | NO REAL CHANGE |
| 1 | REAL CHANGE |
| 2 | REFUSAL or CLI error |

Output Routing

| Mode | REAL CHANGE | NO REAL CHANGE | REFUSAL |
| --- | --- | --- | --- |
| Human (default) | stdout | stdout | stderr |
| --json | stdout | stdout | stdout |

In --json mode, stderr is reserved for process-level failures only (CLI parse errors, panics).
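For callers that prefer Python to shell, the exit codes and routing rules above suggest a small wrapper. This is a hedged sketch: `interpret` and `run_rvl` are hypothetical helper names, and the wrapper assumes an `rvl` binary on PATH. It relies only on the documented behavior that --json puts verdicts and refusals on stdout, while CLI errors may produce no JSON at all.

```python
import json
import subprocess

VERDICTS = {0: "NO_REAL_CHANGE", 1: "REAL_CHANGE", 2: "REFUSAL_OR_CLI_ERROR"}

def interpret(returncode, stdout):
    """Map an `rvl --json` run to (verdict, payload_or_None)."""
    verdict = VERDICTS.get(returncode, "UNKNOWN")
    try:
        payload = json.loads(stdout) if stdout.strip() else None
    except json.JSONDecodeError:
        payload = None          # e.g. a CLI parse error that emitted no JSON
    return verdict, payload

def run_rvl(old_path, new_path, *extra_args):
    """Run rvl in --json mode and interpret the result (requires rvl on PATH)."""
    proc = subprocess.run(
        ["rvl", old_path, new_path, "--json", *extra_args],
        capture_output=True,
        text=True,
    )
    return interpret(proc.returncode, proc.stdout)
```

With exit code 2, a non-empty payload indicates a refusal with a code to act on, and a payload of None points to a process-level failure reported on stderr.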


Delimiter

Auto-Detection (default)

Each file's delimiter is detected independently by sampling the header plus up to 200 data records (or ~64KB). Candidate delimiters are tried in order: comma (,), tab (\t), semicolon (;), pipe (|), caret (^). The candidate with the best score (most records parsed, most consistent field count, most fields) wins.

If multiple candidates tie and produce different parsed output, rvl refuses with E_DIALECT. If they produce identical output, the tie breaks by candidate order (comma first).

If auto-detection yields only 1 column, rvl refuses with E_DIALECT (the file may use an unsupported delimiter).

sep= Directive

If the first non-blank line of a file is sep=<char> (e.g., sep=;), rvl uses that delimiter for the file (unless --delimiter overrides it). The sep= line is skipped during parsing.
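A sketch of how the sep= directive and its precedence could be handled is shown below. It is illustrative only: the helper names `read_sep_directive` and `choose_delimiter` are hypothetical, and details such as case sensitivity of `sep=` and the treatment of trailing whitespace are assumptions the text above does not settle.

```python
def read_sep_directive(lines):
    """Return (delimiter_or_None, lines_without_directive).

    Only the first non-blank line is inspected, as described above.
    """
    for i, line in enumerate(lines):
        content = line.rstrip("\r\n")
        if content.strip() == "":
            continue                               # skip leading blank lines
        if content.startswith("sep=") and len(content) == 5:
            return content[4], lines[:i] + lines[i + 1:]
        return None, lines                         # first real line is not a directive
    return None, lines

def choose_delimiter(forced, directive, detected):
    """Precedence: --delimiter, then sep=, then auto-detection."""
    if forced is not None:
        return forced
    if directive is not None:
        return directive
    return detected
```

The directive line is dropped from the returned lines, so the header row is the first line the CSV parser sees.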

--delimiter (forced)

Overrides both auto-detection and sep= directives for both files. Accepted values:

  • Named: comma, tab, semicolon, pipe, caret (case-insensitive)
  • Hex: 0x09 (tab), 0x1f (unit separator), 0x2c (comma)
  • Single ASCII char: , or | or ;

Valid range: ASCII 0x01-0x7F, excluding " (0x22), \r (0x0D), \n (0x0A). Invalid values are CLI argument errors (exit 2). Use tab or 0x09, not \t (no escape sequences).
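The accepted --delimiter forms (named, hex, single ASCII character) and the valid range of 0x01 to 0x7F can be checked with a few lines of Python. This is a sketch of the stated rules rather than rvl's argument parser; `parse_delimiter` is a hypothetical name, and raising ValueError stands in for the CLI argument error.

```python
NAMED = {"comma": ",", "tab": "\t", "semicolon": ";", "pipe": "|", "caret": "^"}
FORBIDDEN = {0x22, 0x0D, 0x0A}   # double quote, carriage return, line feed

def parse_delimiter(arg):
    """Return the delimiter character for a --delimiter value, or raise ValueError."""
    lowered = arg.lower()
    if lowered in NAMED:                      # named form, case-insensitive
        code = ord(NAMED[lowered])
    elif lowered.startswith("0x") and len(arg) > 2:
        code = int(arg[2:], 16)               # hex form; bad hex raises ValueError
    elif len(arg) == 1:
        code = ord(arg)                       # single character
    else:
        raise ValueError(f"unsupported delimiter: {arg!r}")
    if not (0x01 <= code <= 0x7F) or code in FORBIDDEN:
        raise ValueError(f"delimiter out of range: {arg!r}")
    return chr(code)
```

A literal backslash followed by t is two characters and matches none of the three forms, so it is rejected, consistent with the note that escape sequences are not interpreted.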


Agent / CI Integration

Both rvl and shape are designed to be consumed by agents and pipelines, not just humans.

Agent workflow: shape → rvl

# 1. Structural gate (is comparison even valid?)
shape old.csv new.csv --key id --json > shape.json
if [ $? -ne 0 ]; then
  # INCOMPATIBLE or REFUSAL — read .reasons or .refusal for why
  jq '.reasons // .refusal' shape.json
  exit 1
fi

# 2. Numeric explanation (only if structurally compatible)
rvl old.csv new.csv --key id --json > rvl.json

# 3. Agent extracts the verdict
outcome=$(jq -r '.outcome' rvl.json)
if [ "$outcome" = "REAL_CHANGE" ]; then
  jq '.contributors[] | "\(.row_id).\(.column): \(.delta)"' rvl.json
fi

What makes this agent-friendly

  • Exit codes: 0/1/2 map directly to pass/fail/error branching
  • --json — structured output an agent can parse without regex
  • Refusals have next steps — an agent can read .refusal.code and decide whether to retry with different flags or escalate
  • shape --describe — prints the tool's operator.json contract so an agent can discover invocation, flags, and exit codes without reading docs

Capsule replay workflow (agent swarms)

Use capsules when you need a deterministic handoff between agents, CI jobs, or debugging sessions:

# 1. Produce the normal verdict and write a replay capsule sidecar
rvl old.csv new.csv --key id --json --capsule-out ./capsules > run.json

# 2. Inspect generated capsule
ls ./capsules/capsule-*/
# manifest.json old.csv new.csv output.txt replay.sh

# 3. Re-run exactly from the capsule payload
cd ./capsules/capsule-<id>
./replay.sh > replay.json

manifest.json includes:

  • original invocation args (key, threshold, tolerance, delimiter, json)
  • outcome and refusal code (if any)
  • contributor summary for REAL_CHANGE
  • replay command plus artifact hashes for integrity checks

For troubleshooting, compare run.json vs replay.json outcome/refusal code first; if they differ, the environment or binary changed.


Scripting Examples

Check if files changed (exit code only):

rvl old.csv new.csv > /dev/null 2>&1
echo $?  # 0 = no change, 1 = changed, 2 = refused

Extract top contributor from JSON:

rvl old.csv new.csv --json | jq '.contributors[0]'

Get total change magnitude:

rvl old.csv new.csv --json | jq '.metrics.total_change'

Handle refusals programmatically:

rvl old.csv new.csv --json | jq 'select(.outcome == "REFUSAL") | .refusal'

Force a tab-delimited comparison with relaxed threshold:

rvl old.tsv new.tsv --delimiter tab --key account_id --threshold 0.80

Gate a pipeline (shape before rvl):

shape old.csv new.csv --key loan_id --json > shape.json \
  && rvl old.csv new.csv --key loan_id --json > rvl.json

Refusal Codes

Every refusal includes the error code, first concrete example, and a Next: remediation step.

| Code | Meaning | Next Step |
| --- | --- | --- |
| E_IO | File read error | Check file path and permissions |
| E_ENCODING | Unsupported encoding (UTF-16/32 BOM or NUL bytes) | Convert/re-export as UTF-8 |
| E_CSV_PARSE | CSV parse failure (invalid quoting/escaping) | Re-export as standard RFC4180 CSV |
| E_HEADERS | Missing header, duplicate headers, or rows wider than header | Fix headers or re-export |
| E_DIALECT | Delimiter ambiguous or undetectable | Use --delimiter <delim> or add sep=<char> to file |
| E_NO_KEY | --key column not found in one or both files | Use a column name that exists in both files |
| E_KEY_EMPTY | Empty key value in a non-blank row | Choose a key column with no empty values, or fill missing keys |
| E_KEY_DUP | Duplicate key values within a file | Choose a unique key column or dedupe the data |
| E_KEY_MISMATCH | Key sets differ between files (missing/extra keys) | Export comparable scopes or fix the join key |
| E_ROWCOUNT | Row count mismatch (row-order mode) | Use --key <column> for a missing/extra-keys report |
| E_NEED_KEY | Detected row reorder without --key | Use --key <suggested> (rvl prints candidates) |
| E_MIXED_TYPES | Column has both numeric and non-numeric values | Normalize column values to numeric or exclude the column |
| E_NO_NUMERIC | No numeric columns in common | Ensure both files share at least one numeric column |
| E_MISSINGNESS | Numeric value vs. missing token in aligned cell | Fill missing values or exclude the column |
| E_DIFFUSE | Top 25 contributors can't reach threshold | Use --threshold 0.80 (or lower) to accept less coverage |

Troubleshooting

"E_NEED_KEY" even though rows look the same

Your rows are in a different order between files. rvl detected this and refuses rather than silently comparing wrong row pairs. Use the --key it suggests:

rvl old.csv new.csv --key loan_id

"E_DIFFUSE" — can't reach threshold

Changes are spread across too many cells for the top 25 to explain 95%. This usually means a broad recalculation (e.g., FX revaluation). Lower the threshold:

rvl old.csv new.csv --threshold 0.80

"E_MIXED_TYPES" on a column that looks numeric

A cell in that column has a value rvl can't parse as a number (check for stray text, #N/A variants not in the missing list, or locale-specific formatting). The error message shows the first offending cell.

"E_DIALECT" — delimiter detection failed

Your file uses an uncommon delimiter or has inconsistent field counts. Force the delimiter:

rvl old.csv new.csv --delimiter pipe      # for |
rvl old.csv new.csv --delimiter 0x09      # for tab
rvl old.csv new.csv --delimiter semicolon # for ;

Large files are slow

rvl loads both files into memory. For very large files (millions of rows), ensure sufficient RAM. There is no streaming mode in v0.


Limitations

| Limitation | Detail |
| --- | --- |
| Numeric columns only | rvl compares numbers. Text column changes are ignored; use diff or shape for structural checks. |
| Absolute tolerance only | No relative/percentage tolerance in v0. A $0.01 delta on a $1M balance and a $0.01 balance are treated identically. |
| MAX_CONTRIBUTORS = 25 | Hard cap, not configurable in v0. If change is spread across >25 cells, rvl refuses (E_DIFFUSE). |
| In-memory | Both files are loaded fully into memory. No streaming mode yet. |
| Two files only | No multi-file or directory comparison. |
| No column filtering | All common numeric columns are compared. You can't exclude specific columns in v0. |

FAQ

Why "rvl"?

Short for reveal. The tool reveals what actually changed, cutting through the noise.

Is this just diff for CSVs?

No. diff shows you every line that's different. rvl tells you which numeric changes matter — the smallest set that explains the change. It's an explanation, not a diff.

What if my files have different columns?

rvl compares only columns present in both files. Extra columns on either side are reported in the header but don't affect the verdict.

Can I use this in CI/CD?

Yes. Exit codes (0/1/2) and --json output are designed for automation. Gate on exit code, or parse the JSON for richer assertions.

What about non-US number formats (e.g., 1.234,56)?

Not supported in v0. rvl assumes US-style formatting (comma as thousands separator, period as decimal).

How does rvl relate to shape?

shape checks structural compatibility (do columns match? is the key valid?). rvl checks numeric content (what changed and by how much?). Run shape first to validate structure, then rvl to explain changes.


JSON Output Reference

A single JSON object on stdout. If the process fails before domain evaluation (e.g., invalid CLI args), JSON may not be emitted.

{
  "version": "rvl.v0",
  "outcome": "REAL_CHANGE",            // "REAL_CHANGE" | "NO_REAL_CHANGE" | "REFUSAL"
  "files": {
    "old": "old.csv",
    "new": "new.csv"
  },
  "alignment": {
    "mode": "key",                      // "key" | "row_order"
    "key_column": "u8:id"              // encoded identifier, or null
  },
  "dialect": {
    "old": { "delimiter": ",", "quote": "\"", "escape": null },
    "new": { "delimiter": ",", "quote": "\"", "escape": null }
  },
  "threshold": 0.95,
  "tolerance": 1e-9,
  "counts": {
    "rows_old": 4183,
    "rows_new": 4183,
    "rows_aligned": 4183,
    "columns_old": 17,
    "columns_new": 16,
    "columns_common": 15,
    "columns_old_only": 2,
    "columns_new_only": 1,
    "numeric_columns": 12,
    "numeric_cells_checked": 50196,
    "numeric_cells_changed": 3
  },
  "metrics": {
    "total_change": 1842100.3713,       // L1 distance (sum of abs deltas above tolerance)
    "max_abs_delta": 1842100.0,         // largest abs(delta) observed (pre-zeroing)
    "top_k_coverage": 0.952             // coverage of top MAX_CONTRIBUTORS
  },
  "limits": {
    "max_contributors": 25
  },
  "contributors": [                     // empty unless REAL_CHANGE
    {
      "row_id": "u8:NVDA",
      "column": "u8:market_value",
      "old": 123.0,
      "new": 1842223.0,
      "delta": 1842100.0,
      "contribution": 1842100.0,
      "share": 0.9998,                  // contribution / total_change
      "cumulative_share": 0.9998
    }
    // ... more contributors, ranked by contribution desc
  ],
  "refusal": null                       // null unless REFUSAL
  // When REFUSAL:
  // "refusal": {
  //   "code": "E_KEY_DUP",
  //   "message": "duplicate key values",
  //   "detail": { "file": "old.csv", "key_samples": ["A123"], ... }
  // }
}

Identifier Encoding (JSON)

Row IDs and column names in JSON use unambiguous encoding:

  • u8:<string> — valid UTF-8 with no ASCII control bytes (e.g., u8:NVDA, u8:market_value)
  • hex:<hex-bytes> — anything else (e.g., hex:ff00ab)

Copy the encoded identifier directly into --key to avoid ambiguity.
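The two prefixes can be produced and reversed with a pair of small functions. This Python sketch follows the rule as stated above; the function names are hypothetical, and treating bytes 0x00-0x1F and 0x7F as the "ASCII control bytes" is an assumption.

```python
def encode_identifier(raw: bytes) -> str:
    """Encode raw identifier bytes as u8:<string> or hex:<hex-bytes>."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "hex:" + raw.hex()                 # not valid UTF-8
    if any(b < 0x20 or b == 0x7F for b in raw):
        return "hex:" + raw.hex()                 # contains an ASCII control byte
    return "u8:" + text

def decode_identifier(encoded: str) -> bytes:
    """Inverse of encode_identifier."""
    if encoded.startswith("u8:"):
        return encoded[3:].encode("utf-8")
    if encoded.startswith("hex:"):
        return bytes.fromhex(encoded[4:])
    raise ValueError(f"unknown identifier encoding: {encoded!r}")
```

A consumer that needs the original column name or row ID can strip the prefix this way before displaying it.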

Nullable Fields

On REFUSAL, counts and metrics fields may be null if they couldn't be computed (e.g., rows_aligned is null for E_ROWCOUNT; all metrics are null for E_NEED_KEY).
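A consumer of the JSON should therefore tolerate nulls on the refusal path. The Python sketch below reads only fields shown in the reference above (outcome, refusal.code, metrics.max_abs_delta, contributors); the function name `summarize` and the output wording are illustrative assumptions.

```python
def summarize(report: dict) -> str:
    """Produce a one-paragraph summary from a parsed rvl JSON object."""
    outcome = report["outcome"]
    if outcome == "REFUSAL":
        refusal = report.get("refusal") or {}       # defensive: tolerate null
        return f"refused: {refusal.get('code')}"
    if outcome == "NO_REAL_CHANGE":
        metrics = report.get("metrics") or {}       # fields may be null or absent
        return f"no real change (max abs delta {metrics.get('max_abs_delta')})"
    lines = [
        f"{c['row_id']}.{c['column']}: {c['delta']:+}"
        for c in report.get("contributors") or []
    ]
    return "real change\n" + "\n".join(lines)
```

Using `or {}` rather than indexing directly keeps the function from raising when a refusal leaves metrics or counts unset.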


Spec

The full specification is docs/PLAN_RVL.md. This README covers everything needed to use the tool; the spec adds implementation details, edge-case definitions, and testing requirements.

Development

cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
