Commit d7497de

Zac and claude committed
Release v0.3.0 — document canon org identity resolution
Bump version to 0.3.0 across Cargo.toml and operator.json. Update README with the full org pipeline (block, edge, solve, audit, promote, explain), org CLI reference, E_ORG_* refusal codes, updated limitations, and revised FAQ reflecting the new entity resolution capability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 04e6240 commit d7497de

4 files changed: 122 additions, 11 deletions

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "canon"
-version = "0.2.2"
+version = "0.3.0"
 edition = "2024"
 
 [[bin]]

README.md

Lines changed: 119 additions & 8 deletions
@@ -24,6 +24,7 @@ The same loan appears as CUSIP `037833100` in one system, ISIN `US0378331005` in
 - **Pipeline composable**: `canon --emit csv` appends a `<column>__canon` column to your CSV. Pipe the output directly into `rvl` or `shape`: `canon nov.csv --column cusip --emit csv | rvl - dec.canon.csv --key cusip__canon`.
 - **Full traceability** — every mapping includes `rule_id`, `canonical_type`, and `confidence`. Every unresolved entry includes the reason. Every result is auditable.
 - **Deduplication built in** — input values are deduplicated before lookup. 500 unique CUSIPs produce 500 mapping entries whether your file has 500 rows or 500,000.
+- **Org identity resolution**: `canon org` resolves entities that appear under different names across documents via a deterministic multi-stage pipeline: block, score evidence, solve clusters, audit against evaluation suites, and promote into the registry.
 
 ---
 
@@ -208,9 +209,9 @@ canon tape.csv --registry registries/cusip-isin/ --column cusip \
 - Building audit trails for regulatory mappings (every resolution traceable)
 
 **When canon might not be ideal:**
-- Fuzzy entity matching (address variants, phonetic matching) — deferred to v1
+- Fuzzy entity matching (address variants, phonetic matching)
 - Master data management at enterprise scale
-- Record linkage / entity clustering
+- Probabilistic record linkage requiring ML models
 
 ---
 
@@ -241,9 +242,11 @@ cargo build --release
 
 ```bash
 canon <INPUT> --registry <REGISTRY> --column <COLUMN> [OPTIONS]
-canon registry build --source <SOURCE> --seed <SEED> --seed-column <COLUMN> --output <DIR> --version <VER> [--incremental] [--max-rows <N>] [--max-bytes <N>] [--batch-size <N>] [--rate-limit-ms <MS>] [--provider-config <KEY=VALUE>]
+canon registry build --source <SOURCE> --seed <SEED> --seed-column <COLUMN> --output <DIR> --version <VER> [OPTIONS]
 canon registry diff --old <OLD_REGISTRY> --new <NEW_REGISTRY> [--emit json|summary]
-canon registry audit <SEED> --registry <REGISTRY> --column <COLUMN> [--emit json|summary] [--max-rows <N>] [--max-bytes <N>]
+canon registry audit <SEED> --registry <REGISTRY> --column <COLUMN> [--emit json|summary]
+canon org run <ROWS> --strategy <YAML> --registry <DIR> [--suite <DIR>] [--emit json|summary]
+canon org block|edge|solve|audit|promote|explain [OPTIONS]
 ```
 
 ### Arguments
@@ -275,6 +278,13 @@ canon registry audit <SEED> --registry <REGISTRY> --column <COLUMN> [--emit json
 | `registry build --source <NAME> --seed <PATH> --seed-column <COLUMN> --output <DIR> --version <VER>` | Materialize a standard canon registry directory from a provider-backed seed corpus, with optional repeatable `--provider-config key=value` overrides. |
 | `registry diff --old <PATH> --new <PATH> [--emit json\|summary]` | Compare two versions of the same registry ID and report added, removed, changed, and unchanged effective mappings. |
 | `registry audit <SEED> --registry <PATH> --column <COLUMN> [--emit json\|summary]` | Audit a seed corpus against a registry and emit resolved/unresolved entries plus aggregate canonical-target and rule-hit counts. |
+| `org run <ROWS> --strategy <YAML> --registry <DIR> [--suite <DIR>] [--emit json\|summary]` | Run the full deterministic org-identity pipeline (block → edge → solve, optional audit + promote). |
+| `org block <ROWS> --strategy <YAML> --registry <DIR> [--emit jsonl\|summary]` | Generate candidate neighborhoods via blocking operators. |
+| `org edge <ROWS> --strategy <YAML> --candidates <JSONL> --registry <DIR> [--emit jsonl\|summary]` | Score typed evidence edges for blocked candidate pairs. |
+| `org solve <ROWS> --strategy <YAML> --edges <JSONL> --registry <DIR> [--emit json\|summary]` | Solve deterministic identity assignments from evidence edges. |
+| `org audit <RESULT> --suite <DIR> [--emit json\|summary]` | Validate a solve/run artifact against a frozen evaluation suite. |
+| `org promote <RESULT> --audit <JSON> --registry <DIR> --next-version <VER> [--emit json\|summary]` | Write audited results into registry aliases and escrow sidecars. |
+| `org explain <RESULT> --row <ID>\|--canon-id <ID>\|--escrow-id <ID> [--emit json\|summary]` | Proof trace for one row, canonical entity, or escrow entity. |
 
 ### Exit Codes
 
@@ -398,6 +408,12 @@ Every refusal includes the error code, a concrete message, and a recovery path.
 | `E_TOO_LARGE` | Exceeds `--max-rows` or `--max-bytes` | Increase limits or reduce input |
 | `E_EMIT_FORMAT` | `--emit csv` with JSONL input | Use `--emit json` or provide CSV input |
 | `E_COLUMN_EXISTS` | Canonical column name already in header | Choose a different `--canon-column` |
+| `E_ORG_INPUT_CONTRACT` | Org input rows violate the strategy contract | Check required fields and side-field JSON |
+| `E_ORG_BAD_STRATEGY` | Org strategy YAML is malformed or invalid | Fix the strategy file |
+| `E_ORG_BAD_SUITE` | Evaluation suite missing or profile-mismatched | Check suite directory and strategy profile |
+| `E_ORG_FIXTURE_INVALID` | Suite fixture references are inconsistent | Fix fixture row catalog or expected pairs |
+| `E_ORG_VERSION_BUMP_REQUIRED` | Promotion requires an explicit next version | Pass `--next-version` |
+| `E_ORG_STALE_REGISTRY` | Registry changed since the audited snapshot | Re-run org against the current registry |
 
 ---
 
@@ -426,14 +442,109 @@ canon tape.csv --registry registries/cusip-isin/ --column cusip \
 
 ---
 
+## Organization Identity Resolution (`canon org`)
+
+The same entity appears as "Wells Fargo & Company" in one document, "Wells Fargo Bank, N.A." in another, and "WFB" in a third. Three names, one issuer. `canon org` resolves these via a deterministic multi-stage pipeline — no ML models, no probabilistic matching, no black boxes.
+
+The pipeline is YAML-driven: a **strategy file** defines which fields to observe, how to normalize names, which blocking operators generate candidates, how to score evidence, and what thresholds the solver uses to merge or abstain. Same strategy + same input + same registry = same output, every time.
+
+```bash
+# Full pipeline in one command:
+$ canon org run rows.csv \
+  --strategy strategy.yaml \
+  --registry registries/org/ \
+  --suite eval/holdout/ \
+  --emit summary
+
+org_run: 847 rows → 312 canonical entities, 4 escrow (pending), 0 escrow (conflict)
+audit: holdout 98/98 pass, perturbation stability 0.998
+```
+
+Or run stages individually for inspection:
+
+```bash
+$ canon org block rows.csv --strategy strategy.yaml --registry registries/org/ > blocks.jsonl
+$ canon org edge rows.csv --strategy strategy.yaml --candidates blocks.jsonl --registry registries/org/ > edges.jsonl
+$ canon org solve rows.csv --strategy strategy.yaml --edges edges.jsonl --registry registries/org/ > result.json
+$ canon org audit result.json --suite eval/holdout/
+$ canon org promote result.json --audit audit.json --registry registries/org/ --next-version 2.1.0
+$ canon org explain result.json --canon-id IC-00042
+```
+
+---
+
+## The Org Pipeline
+
+### Strategy
+
+A YAML file that configures the entire pipeline. Defines observation fields (`name_fields`, `anchor_fields`, `context_fields`), normalization views (lowercase, strip legal suffixes, extract initials), blocking operators, evidence rules, solver thresholds, reconciliation policy, and promotion gates.
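
The three normalization views named above (lowercase, strip legal suffixes, extract initials) can be pictured with a minimal Rust sketch. This is an illustration only, not canon's implementation: the function names and the `LEGAL_SUFFIXES` list are hypothetical, and a real strategy file would configure the suffixes.

```rust
/// Hypothetical suffix list for illustration; a real strategy would configure this.
const LEGAL_SUFFIXES: &[&str] = &["inc", "corp", "co", "company", "llc", "ltd", "na", "plc"];

/// Lowercase view: lowercase the name, drop periods (so "N.A." becomes "na"),
/// turn any other punctuation into spaces, and collapse whitespace.
fn lowercase_view(name: &str) -> String {
    let cleaned: String = name
        .to_lowercase()
        .chars()
        .filter(|c| *c != '.')
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Suffix-stripped view: remove trailing legal-suffix tokens,
/// but never strip a name down to nothing.
fn strip_legal_suffixes(view: &str) -> String {
    let mut tokens: Vec<&str> = view.split_whitespace().collect();
    while tokens.len() > 1 {
        let last = tokens[tokens.len() - 1];
        if LEGAL_SUFFIXES.contains(&last) {
            tokens.pop();
        } else {
            break;
        }
    }
    tokens.join(" ")
}

/// Initials view: first character of each remaining token.
fn initials_view(view: &str) -> String {
    view.split_whitespace()
        .filter_map(|token| token.chars().next())
        .collect()
}
```

Under these assumptions, "Wells Fargo Bank, N.A." normalizes to `wells fargo bank`, whose initials view is `wfb`, which is how an acronym such as "WFB" could surface as a candidate for the longer name.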
+
+### Block
+
+Candidate neighborhood generation. Blocking operators reduce the O(n²) comparison space to plausible pairs:
+
+| Operator | What it does |
+|----------|-------------|
+| `exact_view` | Blocks on exact match of a normalized name view |
+| `rare_token_overlap` | Blocks on shared rare tokens weighted by IDF |
+| `shared_anchor` | Blocks on shared anchor values (LEI, CIK, FIGI) |
+| `registry_alias_match` | Blocks on existing registry alias matches |
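
To show how a blocking operator avoids the all-pairs comparison, here is a hedged Rust sketch in the spirit of `exact_view`: rows are grouped by a precomputed name view and candidate pairs are emitted only within a group. The `(row_id, view)` input shape and the function name are assumptions made for this example, not canon's actual row or JSONL format.

```rust
use std::collections::BTreeMap;

/// Group row ids by a precomputed normalized view and emit candidate pairs
/// only inside each group. `rows` holds hypothetical `(row_id, view)` tuples.
fn exact_view_candidates(rows: &[(&str, &str)]) -> Vec<(String, String)> {
    // BTreeMap iterates in key order, so the output order is reproducible.
    let mut blocks: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (row_id, view) in rows {
        blocks.entry(*view).or_default().push(*row_id);
    }

    let mut pairs = Vec::new();
    for ids in blocks.values_mut() {
        // Sort and dedup so each pair is emitted once, in a stable order.
        ids.sort_unstable();
        ids.dedup();
        for i in 0..ids.len() {
            for j in (i + 1)..ids.len() {
                pairs.push((ids[i].to_string(), ids[j].to_string()));
            }
        }
    }
    pairs
}
```

With four rows where three share the view `wells fargo`, this yields three candidate pairs instead of the six an all-pairs comparison would produce.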
+
+### Edge
+
+Typed evidence scoring. Each candidate pair receives evidence edges:
+
+- **Must-link** — strong deterministic evidence (shared trusted anchor, registry alias match)
+- **Support** — scored positive evidence (exact name view match, acronym-plus-token, categorical field equality)
+- **Cannot-link** — negative evidence (conflicting anchor values in the same namespace)
+
+### Solve
+
+Staged deterministic solver:
+
+1. **Seed** — build initial components from must-link edges using union-find
+2. **Backbone** — merge clusters via reciprocal best scoring pairs (requires positive name evidence, respects max cluster diameter)
+3. **Attachment** — attach singletons to backbone clusters (requires winner margin, attachments don't chain)
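
The Seed stage can be sketched with a small union-find in Rust. This is a generic illustration of the technique, assuming rows are addressed by index and must-link edges arrive as index pairs; it does not model cannot-link edges or the later Backbone and Attachment stages, and none of the names come from canon's source.

```rust
/// Minimal union-find (disjoint-set) with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        // Walk up to the root.
        let mut root = x;
        while self.parent[root] != root {
            root = self.parent[root];
        }
        // Path compression: point every node on the path at the root.
        let mut cur = x;
        while self.parent[cur] != root {
            let next = self.parent[cur];
            self.parent[cur] = root;
            cur = next;
        }
        root
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            // The smaller root index always wins, so the representative of a
            // component does not depend on the order edges are processed in.
            let (lo, hi) = if ra < rb { (ra, rb) } else { (rb, ra) };
            self.parent[hi] = lo;
        }
    }
}

/// Returns, for each row index, the representative of its seed component.
fn seed_components(n_rows: usize, must_link: &[(usize, usize)]) -> Vec<usize> {
    let mut uf = UnionFind::new(n_rows);
    for &(a, b) in must_link {
        uf.union(a, b);
    }
    (0..n_rows).map(|i| uf.find(i)).collect()
}
```

For five rows with must-link edges `(0, 1)`, `(3, 4)`, and `(1, 2)`, the result is `[0, 0, 0, 3, 3]`: two seed components, each labeled by its smallest row index.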
+
+**Reconciliation** then classifies each cluster:
+
+- Single incumbent overlap → inherit existing canonical ID
+- Multiple incumbent overlap → abstain with conflict escrow
+- No incumbent → mint new canonical ID
+- Low evidence → abstain with pending escrow
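
The four reconciliation outcomes map naturally onto an enum. The Rust sketch below is illustrative only: the text does not state which rule takes precedence, so the order used here (conflict first, then low evidence, then inherit or mint) is an assumption, as are the type names and the single `evidence` score compared against `min_evidence`.

```rust
#[derive(Debug, PartialEq)]
enum Reconciliation {
    /// Exactly one incumbent canonical ID overlaps the cluster.
    Inherit(String),
    /// No incumbent overlaps: a new canonical ID would be minted.
    Mint,
    /// More than one incumbent overlaps: abstain, conflict escrow.
    ConflictEscrow,
    /// Evidence below the threshold: abstain, pending escrow.
    PendingEscrow,
}

/// Classify one cluster. Precedence is an assumption of this sketch:
/// conflict, then low evidence, then inherit or mint.
fn reconcile(incumbent_ids: &[&str], evidence: f64, min_evidence: f64) -> Reconciliation {
    // Distinct incumbents only; the same ID seen twice is not a conflict.
    let mut ids: Vec<&str> = incumbent_ids.to_vec();
    ids.sort_unstable();
    ids.dedup();

    if ids.len() > 1 {
        return Reconciliation::ConflictEscrow;
    }
    if evidence < min_evidence {
        return Reconciliation::PendingEscrow;
    }
    match ids.first() {
        Some(id) => Reconciliation::Inherit(id.to_string()),
        None => Reconciliation::Mint,
    }
}
```

Modeling the abstentions as explicit variants keeps "we declined to merge" distinguishable from "we minted a new ID", which is what the escrow counts in the summary output report.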
+
+### Audit
+
+Validate results against frozen evaluation suites. Checks holdout fixture pass rates and perturbation stability (strategy-configurable threshold, e.g. ≥ 0.995). Promotion requires a passing audit.
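
A promotion gate of this kind can be expressed as a small predicate. The Rust sketch below assumes that every holdout fixture must pass and that stability is compared against a single threshold; the text only says that pass rates and stability are checked, so both the rule and the `AuditReport` field names are hypothetical.

```rust
/// Hypothetical summary of an audit run; field names are illustrative.
struct AuditReport {
    holdout_passed: u32,
    holdout_total: u32,
    perturbation_stability: f64,
}

/// Assumed gate: all holdout fixtures pass and stability meets the
/// strategy-configured threshold. An empty suite never passes.
fn audit_passes(report: &AuditReport, min_stability: f64) -> bool {
    report.holdout_total > 0
        && report.holdout_passed == report.holdout_total
        && report.perturbation_stability >= min_stability
}
```

With the figures from the summary output shown earlier (98/98 holdout, stability 0.998) and a threshold of 0.995, this predicate returns `true`.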
+
+### Promote
+
+Write audited results back to the registry:
+
+- Resolved entities get alias entries added to registry mapping files
+- Escrow sidecars are written for entities that need human review
+- Requires an explicit `--next-version` bump
+
+### Explain
+
+Proof traces for any row, entity, or escrow decision:
+
+```bash
+$ canon org explain result.json --row src-row-42
+$ canon org explain result.json --canon-id IC-00042
+$ canon org explain result.json --escrow-id ESC-00007
+```
+
+Returns the full evidence chain: which blocking operator surfaced the pair, which evidence edges were scored, which solver stage produced the merge or abstention, and why.
+
+---
+
 ## Limitations
 
 | Limitation | Detail |
 |------------|--------|
-| **Exact match only** | No fuzzy, phonetic, or normalized matching in v0. Registry must contain all variants. |
+| **Exact match only (core lookup)** | Core `canon` lookup uses exact byte match after ASCII-trim. `canon org` adds multi-field deterministic resolution but not fuzzy/phonetic matching. |
 | **Flat registries** | No subdirectories in v0. All mapping files must be at the registry root. |
-| **No multi-column matching** | Entity resolution (address + name + coordinates) is deferred to v1. |
-| **No suggestions** | Probabilistic/fuzzy suggestions (`canon suggest`) are deferred to v1. |
 | **CSV-only for `--emit csv`** | JSONL input cannot use `--emit csv` mode. |
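
What "exact byte match after ASCII-trim" implies for callers can be seen in a short Rust sketch. The in-memory `HashMap` registry, the `lookup` name, and the sample mapping are assumptions for illustration; canon's registry format and the exact set of trimmed characters may differ.

```rust
use std::collections::HashMap;

/// Trim ASCII whitespace from both ends, then require an exact key match.
/// The registry here is a hypothetical in-memory map, not canon's on-disk format.
fn lookup<'a>(registry: &'a HashMap<String, String>, raw: &str) -> Option<&'a str> {
    let key = raw.trim_matches(|c: char| c.is_ascii_whitespace());
    registry.get(key).map(|value| value.as_str())
}
```

Under this reading, `"  037833100\n"` resolves because only ASCII whitespace surrounds the key, while a value prefixed with a non-breaking space (U+00A0) or differing in case would stay unresolved, so the registry must list every such variant.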
 
 ---
@@ -446,7 +557,7 @@ Short for *canonical*. The tool produces canonical identifiers — one true ID f
 
 ### Is this entity resolution?
 
-Not in v0. `canon` v0 resolves identifiers and aliases via exact lookup. Multi-column entity resolution (property addresses, counterparty matching with fuzzy logic) is planned for v1.
+Yes — as of v0.3.0, `canon org` performs deterministic multi-field org-identity resolution. It resolves entities that appear under different names across documents using a YAML-driven pipeline of blocking, evidence scoring, and cluster solving. Core `canon` (without `org`) still resolves identifiers via exact lookup against versioned registries.
 
 ### How does canon relate to rvl?
 
operator.json

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 {
   "schema_version": "operator.v0",
   "name": "canon",
-  "version": "0.2.2",
+  "version": "0.3.0",
   "description": "Canonical identifier resolution and org-identity orchestration using versioned registries",
   "repository": "https://github.com/cmdrvl/canon",
   "license": "MIT",
