Bump version to 0.3.0 across Cargo.toml and operator.json. Update
README with the full org pipeline (block, edge, solve, audit, promote,
explain), org CLI reference, E_ORG_* refusal codes, updated limitations,
and revised FAQ reflecting the new entity resolution capability.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README.md (+119 -8): 119 additions, 8 deletions
@@ -24,6 +24,7 @@ The same loan appears as CUSIP `037833100` in one system, ISIN `US0378331005` in
 - **Pipeline composable** — `canon --emit csv` appends a `<column>__canon` column to your CSV. Pipe the output directly into `rvl` or `shape`: `canon nov.csv --column cusip --emit csv | rvl - dec.canon.csv --key cusip__canon`.
 - **Full traceability** — every mapping includes `rule_id`, `canonical_type`, and `confidence`. Every unresolved entry includes the reason. Every result is auditable.
 - **Deduplication built in** — input values are deduplicated before lookup. 500 unique CUSIPs produce 500 mapping entries whether your file has 500 rows or 500,000.
+- **Org identity resolution** — `canon org` resolves entities that appear under different names across documents via a deterministic multi-stage pipeline: block, score evidence, solve clusters, audit against evaluation suites, and promote into the registry.
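
The "Deduplication built in" bullet can be illustrated with a minimal Python sketch. This is not canon's implementation: `resolve_unique`, the dict-shaped registry, and the canonical ID `loan-001` are hypothetical stand-ins, used only to show why the number of mapping entries tracks unique values rather than row count.

```python
from typing import Dict, Iterable, List, Tuple


def resolve_unique(
    values: Iterable[str], registry: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Deduplicate input values, then look each unique value up once.

    Hypothetical helper: `registry` is a plain dict standing in for a
    canon registry; the real on-disk format is not shown here.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(values))
    mappings = {v: registry[v] for v in unique if v in registry}
    unresolved = [v for v in unique if v not in registry]
    return mappings, unresolved


# 500,000 rows that contain only two distinct identifiers.
rows = ["037833100", "US0378331005"] * 250_000
registry = {"037833100": "loan-001"}  # made-up canonical ID
mappings, unresolved = resolve_unique(rows, registry)
```

Under these assumptions the result holds one mapping entry and one unresolved entry, however many times each value repeats in the input.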
 | `registry build --source <NAME> --seed <PATH> --seed-column <COLUMN> --output <DIR> --version <VER>` | Materialize a standard canon registry directory from a provider-backed seed corpus, with optional repeatable `--provider-config key=value` overrides. |
 | `registry diff --old <PATH> --new <PATH> [--emit json\|summary]` | Compare two versions of the same registry ID and report added, removed, changed, and unchanged effective mappings. |
 | `registry audit <SEED> --registry <PATH> --column <COLUMN> [--emit json\|summary]` | Audit a seed corpus against a registry and emit resolved/unresolved entries plus aggregate canonical-target and rule-hit counts. |
+| `org run <ROWS> --strategy <YAML> --registry <DIR> [--suite <DIR>] [--emit json\|summary]` | Run the full deterministic org-identity pipeline (block → edge → solve, optional audit + promote). |
The same entity appears as "Wells Fargo & Company" in one document, "Wells Fargo Bank, N.A." in another, and "WFB" in a third. Three names, one issuer. `canon org` resolves these via a deterministic multi-stage pipeline — no ML models, no probabilistic matching, no black boxes.
+
+The pipeline is YAML-driven: a **strategy file** defines which fields to observe, how to normalize names, which blocking operators generate candidates, how to score evidence, and what thresholds the solver uses to merge or abstain. Same strategy + same input + same registry = same output, every time.
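
The block, score, and solve stages described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `STRATEGY` dict stands in for the YAML strategy file (whose schema this diff does not show), the field names `lei` and `domain`, the weights, and the threshold are invented, and the union-find solver is one simple deterministic choice rather than canon's actual algorithm.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical strategy; in canon this would be loaded from a YAML file.
STRATEGY = {
    "weights": {"lei": 0.7, "domain": 0.4},  # made-up evidence weights
    "merge_threshold": 0.6,                  # made-up solver threshold
}

Row = Dict[str, str]


def normalize(name: str) -> str:
    """Lowercase, keep only alphanumerics and spaces, collapse spacing."""
    kept = "".join(c for c in name.lower() if c.isalnum() or c == " ")
    return " ".join(kept.split())


def block(rows: List[Row]) -> List[Tuple[int, int]]:
    """Candidate pairs: rows whose normalized names share a first token."""
    buckets: Dict[str, List[int]] = {}
    for i, row in enumerate(rows):
        tokens = normalize(row["name"]).split()
        if tokens:
            buckets.setdefault(tokens[0], []).append(i)
    pairs: List[Tuple[int, int]] = []
    for key in sorted(buckets):  # sorted keys give a stable pair order
        pairs.extend(combinations(buckets[key], 2))
    return pairs


def score(a: Row, b: Row) -> float:
    """Sum the weights of evidence fields that are present and equal."""
    total = 0.0
    for field, weight in STRATEGY["weights"].items():
        if a.get(field) and a.get(field) == b.get(field):
            total += weight
    return total


def solve(rows: List[Row]) -> List[List[int]]:
    """Merge pairs at or above the threshold; abstain on the rest."""
    parent = list(range(len(rows)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in block(rows):
        if score(rows[i], rows[j]) >= STRATEGY["merge_threshold"]:
            ri, rj = find(i), find(j)
            parent[max(ri, rj)] = min(ri, rj)  # smallest index is the root
    clusters: Dict[int, List[int]] = {}
    for i in range(len(rows)):
        clusters.setdefault(find(i), []).append(i)
    return [clusters[k] for k in sorted(clusters)]


rows = [
    {"name": "Wells Fargo & Company", "lei": "LEI-A"},   # placeholder LEI
    {"name": "Wells Fargo Bank, N.A.", "lei": "LEI-A"},
    {"name": "Wells Real Estate", "lei": "LEI-B"},
]
clusters = solve(rows)
```

All three rows land in the same block, but only the first two share enough evidence to merge; the third pair is left unmerged, which mirrors the "merge or abstain" behavior. No randomness or learned model is involved, so repeated runs on the same input return the same clusters.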
Returns the full evidence chain: which blocking operator surfaced the pair, which evidence edges were scored, which solver stage produced the merge or abstention, and why.
+
+---
+
 ## Limitations

 | Limitation | Detail |
 |------------|--------|
-| **Exact match only** | No fuzzy, phonetic, or normalized matching in v0. Registry must contain all variants. |
+| **Exact match only (core lookup)** | Core `canon` lookup uses exact byte match after ASCII-trim. `canon org` adds multi-field deterministic resolution but not fuzzy/phonetic matching. |
 | **Flat registries** | No subdirectories in v0. All mapping files must be at the registry root. |
-| **No multi-column matching** | Entity resolution (address + name + coordinates) is deferred to v1. |
-| **No suggestions** | Probabilistic/fuzzy suggestions (`canon suggest`) are deferred to v1. |
 | **CSV-only for `--emit csv`** | JSONL input cannot use `--emit csv` mode. |

 ---
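
To make the "exact byte match after ASCII-trim" limitation concrete, here is a small Python sketch. It is an illustration under stated assumptions, not canon's code: `exact_lookup`, the dict-shaped registry, the ID `org-1`, and the exact set of bytes treated as ASCII whitespace are all hypothetical.

```python
from typing import Dict, Optional

# Assumed trim set: space, tab, LF, CR, vertical tab, form feed.
ASCII_WHITESPACE = b" \t\n\r\x0b\x0c"


def exact_lookup(raw: bytes, registry: Dict[bytes, str]) -> Optional[str]:
    """Trim ASCII whitespace, then require a byte-for-byte key match."""
    key = raw.strip(ASCII_WHITESPACE)
    return registry.get(key)


registry = {b"Wells Fargo & Company": "org-1"}  # made-up canonical ID

padded = exact_lookup(b"  Wells Fargo & Company\t\n", registry)
lowercase = exact_lookup(b"wells fargo & company", registry)
# A UTF-8 non-breaking space is not ASCII whitespace, so it is not trimmed.
nbsp = exact_lookup("\u00a0Wells Fargo & Company".encode("utf-8"), registry)
```

Only the padded value resolves. Differences in case or non-ASCII spacing produce a miss, which is why the registry must list every variant for core lookup.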
@@ -446,7 +557,7 @@ Short for *canonical*. The tool produces canonical identifiers — one true ID f
 ### Is this entity resolution?

-Not in v0. `canon` v0 resolves identifiers and aliases via exact lookup. Multi-column entity resolution (property addresses, counterparty matching with fuzzy logic) is planned for v1.
+Yes — as of v0.3.0, `canon org` performs deterministic multi-field org-identity resolution. It resolves entities that appear under different names across documents using a YAML-driven pipeline of blocking, evidence scoring, and cluster solving. Core `canon` (without `org`) still resolves identifiers via exact lookup against versioned registries.