beads: add 3 DSL enhancement beads + CMBS-WATL test fixtures

saltmachine · claude · saltmachine · commit 5f5facea2a1e · 2026-02-27T11:50:57.000-05:00
File enhancement beads derived from CMBS commentary fingerprinting:
- bd-3xa: sheet_name_regex bind (capture matched sheet name for downstream assertions)
- bd-3xb: column_search assertion (scan column range for regex match)
- bd-3xc: header_row_match assertion (detect header row by column pattern matching)

Add test fixtures in tests/fixtures/cmbs-watl/:
- README.md documenting observed WATL variants and Lambda detection chain
- cmbs-watl-current.fp.yaml (v1, 3/6 match with hardcoded sheet name)
- cmbs-watl-desired.fp.yaml (v2 target using proposed bind/column_search/header_row_match)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl
@@ -69,3 +69,6 @@
 {"id":"bd-w2c","title":"PRM-FP-0105b: Witness ledger + query","description":"Fill in witness/ledger.rs (append to EPISTEMIC_WITNESS or ~/.epistemic/witness.jsonl, --no-witness suppression, non-fatal failures) and witness/query.rs (witness query/last/count subcommands with filters). FILES: src/witness/ledger.rs, src/witness/query.rs (2 files). VERIFY: cargo test --lib -- witness::ledger && cargo test --lib -- witness::query","status":"closed","priority":1,"issue_type":"feature","created_at":"2026-02-26T11:47:32.096634Z","created_by":"zac","updated_at":"2026-02-27T01:05:57.059384Z","closed_at":"2026-02-27T01:05:57.059232Z","close_reason":"Implemented witness ledger append/path and query/last/count with witness module tests","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-w2c","depends_on_id":"bd-1bm","type":"blocks","created_at":"2026-02-26T11:48:29.229538Z","created_by":"zac"},{"issue_id":"bd-w2c","depends_on_id":"bd-1pi","type":"blocks","created_at":"2026-02-26T11:48:39.434373Z","created_by":"zac"}]}
 {"id":"bd-w5b","title":"PRM-FP-0003: Add base release workflow (build + publish artifacts)","description":"Implement base release workflow parity with rvl: build target matrix archives and publish GitHub release assets. Keep this issue focused on build/publish mechanics only.\n\n## Acceptance Criteria\n- Add .github/workflows/release.yml; workflow derives version from Cargo.toml and ensures tag exists; builds release archives for target matrix and uploads them to GitHub release; generated artifacts are downloadable from release page; no signing/SBOM/provenance steps in this bead (handled separately).\n\n## Notes\nNon-goal: cosign, SHA256SUMS signing, SBOM/provenance generation, or Homebrew formula update.","acceptance_criteria":"Add .github/workflows/release.yml; workflow derives version from Cargo.toml and ensures tag exists; builds release archives for target matrix and uploads them to GitHub release; generated artifacts are downloadable from release page; no signing/SBOM/provenance steps in this bead (handled separately).","notes":"Non-goal: cosign, SHA256SUMS signing, SBOM/provenance generation, or Homebrew formula update.","status":"closed","priority":1,"issue_type":"feature","created_at":"2026-02-25T03:58:30.864567Z","created_by":"zac","updated_at":"2026-02-27T03:18:25.302280Z","closed_at":"2026-02-27T03:18:25.302184Z","close_reason":"Added .github/workflows/release.yml with Cargo.toml version derivation, tag existence/match checks, release target matrix build archives, artifact upload, and GitHub release publish step; YAML validated","compaction_level":0,"original_size":0,"labels":["infra","release","workflow"],"dependencies":[{"issue_id":"bd-w5b","depends_on_id":"bd-tdc","type":"blocks","created_at":"2026-02-26T11:21:23.452677Z","created_by":"zac"}]}
 {"id":"bd-zg8","title":"PRM-FP-0104c: Progress reporter (progress/reporter.rs)","description":"Fill in progress/reporter.rs: emit structured progress JSONL to stderr when --progress is set. Warning emission for skipped files. FILE: src/progress/reporter.rs (1 file). VERIFY: cargo test --lib -- progress::reporter","status":"closed","priority":1,"issue_type":"feature","created_at":"2026-02-26T11:47:21.744035Z","created_by":"zac","updated_at":"2026-02-27T00:54:26.968422Z","closed_at":"2026-02-27T00:54:26.968181Z","close_reason":"Implemented structured progress/warning stderr JSONL reporter with progress::reporter tests","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-zg8","depends_on_id":"bd-1bm","type":"blocks","created_at":"2026-02-26T11:48:29.176622Z","created_by":"zac"}]}
+{"id":"bd-3xa","title":"PRM-FP-0228: sheet_name_regex bind — capture matched sheet name for downstream assertions","description":"## Problem\n\nSpreadsheet assertions (`cell_regex`, `cell_eq`, `range_non_null`, `sheet_min_rows`) require a hardcoded `sheet` parameter. When a workbook contains sheets with varying names across producers (e.g., CMBS watchlist sheets named `Watchlist` vs `Servicer Watch List` vs `Watch List Report for {DEAL}`), `sheet_name_regex` can prove a matching sheet exists but downstream assertions cannot reference the matched name.\n\n## Proposal\n\nAdd an optional `bind` field to `sheet_name_regex`. When present, the matched sheet name is captured as a named variable (e.g., `$watl_sheet`) that downstream assertions can reference in their `sheet` parameter.\n\n```yaml\nassertions:\n  - sheet_name_regex:\n      pattern: \"(?i)watch\\\\s?list|WATL\"\n      bind: \"$watl_sheet\"\n\n  - cell_regex:\n      sheet: \"$watl_sheet\"\n      cell: A2\n      pattern: \"(?i)CREFC\"\n\n  - sheet_min_rows:\n      sheet: \"$watl_sheet\"\n      min_rows: 5\n```\n\n## Implementation\n\n1. Add `bind: Option<String>` to `SheetNameRegex` assertion variant\n2. Add a `bindings: HashMap<String, String>` to assertion evaluation context\n3. When `sheet_name_regex` passes with `bind`, store `{bind_name → matched_sheet_name}`\n4. Before evaluating any assertion with a `sheet` param starting with `$`, resolve from bindings\n5. 
If a `$`-prefixed sheet name has no binding, fail with descriptive error\n\n## Acceptance\n\n- `bind` field parses in YAML and compiles to Rust crate\n- Bound name resolves correctly in downstream `cell_eq`, `cell_regex`, `range_non_null`, `sheet_min_rows` assertions\n- Unresolved `$` references produce clear error messages\n- When multiple sheets match, the first match is used (consistent with assertion short-circuit semantics)\n- Unit tests in `assertions.rs` covering: basic bind+resolve, unresolved reference, multiple sheets matching\n\n## Test Case\n\nSee `tests/fixtures/cmbs-watl/` — CMBS Watchlist files where sheet name varies by servicer. The `cmbs-watl-desired.fp.yaml` shows the target definition using `bind`.","status":"pending","priority":1,"issue_type":"feature","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac","updated_at":"2026-02-27T09:35:00.000000Z","labels":["dsl","enhancement","cmbs-watl"],"compaction_level":0,"original_size":0}
+{"id":"bd-3xb","title":"PRM-FP-0229: column_search assertion — search a column range for a regex match","description":"## Problem\n\nCMBS Excel workbooks have boilerplate text (CREFC headers, report titles, servicer names) in varying row positions. The current DSL requires `cell_regex` with a fixed cell reference, but the target row varies by servicer and format. The Lambda parser handles this by scanning the first ~30 rows; the fingerprint DSL has no equivalent.\n\n## Proposal\n\nAdd a `column_search` assertion that searches a single column over a row range for any cell matching a regex pattern.\n\n```yaml\nassertions:\n  - column_search:\n      sheet: \"Watchlist\"       # or \"$bound_name\"\n      column: A\n      row_range: \"1:20\"\n      pattern: \"(?i)CREFC Investor Reporting|SERVICER WATCHLIST\"\n```\n\nSemantics: scan cells A1 through A20; pass if ANY cell matches the pattern. This is the column-axis analog of `text_contains` for markdown.\n\n## Implementation\n\n1. Add `ColumnSearch { sheet, column, row_range, pattern }` to `Assertion` enum\n2. Parse `row_range` as `start:end` (1-based, inclusive)\n3. Iterate cells `{column}{start}` through `{column}{end}`, apply regex\n4. Pass on first match; fail if no cell matches\n5. In diagnose mode, report nearest partial matches and actual cell values scanned\n\n## Acceptance\n\n- Parses in YAML, compiles to crate\n- Scans specified column/row range and matches regex\n- Works with `$`-bound sheet names (from bd-3xa)\n- Diagnose mode shows what was scanned on failure\n- Unit tests: match found, no match, empty cells, out-of-range rows\n\n## Test Case\n\nSee `tests/fixtures/cmbs-watl/` — CREFC boilerplate appears at row 2 (BCMS RSRV), row 3 (MSC RSRV), or row 1 (CGCMT supp). 
`column_search` on A1:A10 catches all variants.","status":"pending","priority":1,"issue_type":"feature","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac","updated_at":"2026-02-27T09:35:00.000000Z","labels":["dsl","enhancement","cmbs-watl"],"compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-3xb","depends_on_id":"bd-3xa","type":"depends_on","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac"}]}
+{"id":"bd-3xc","title":"PRM-FP-0230: header_row_match assertion — detect header row by column pattern matching","description":"## Problem\n\nThe Lambda OSDA parser uses a NAME strategy that scans the first ~30 rows of a CSV/sheet looking for a row where enough column headers match known `name_variations` regex patterns (30-50% threshold). This is the core detection mechanism for WATL, DLSR, and all NAME-strategy schemas. The fingerprint DSL has no equivalent — `cell_regex` checks a single fixed cell.\n\n## Proposal\n\nAdd a `header_row_match` assertion that scans rows in a range and passes if any single row has at least `min_match` cells matching the provided column patterns.\n\n```yaml\nassertions:\n  - header_row_match:\n      sheet: \"$watl_sheet\"\n      row_range: \"1:30\"\n      min_match: 5\n      columns:\n        - pattern: \"(?i)Trans(action)?\\\\s*ID|^L1|^1(\\\\.0)?$\"\n        - pattern: \"(?i)Group\\\\s*ID|^L2|^2(\\\\.0)?$\"\n        - pattern: \"(?i)^Loan\\\\s*ID|^L3|^3(\\\\.0)?$\"\n        - pattern: \"(?i)Prospectus\\\\s*Loan\\\\s*ID|^L4|^4(\\\\.0)?$\"\n        - pattern: \"(?i)Property\\\\s*Name|^S55$|^5(\\\\.0)?$\"\n        - pattern: \"(?i)Property\\\\s*Type|^S61$|^6(\\\\.0)?$\"\n        - pattern: \"(?i)Comments.*Watchlist|^19(\\\\.0)?$\"\n```\n\nSemantics: for each row in `row_range`, test each cell against each column pattern. A cell matches at most one pattern (first match wins, no double-counting). If any row achieves `>= min_match` distinct pattern matches, the assertion passes.\n\n## Implementation\n\n1. Add `HeaderRowMatch { sheet, row_range, min_match, columns: Vec<ColumnPattern> }` to `Assertion` enum\n2. `ColumnPattern` has `pattern: String` (compiled to regex)\n3. For each row: iterate cells, match against patterns, count distinct matches\n4. Pass on first row achieving `min_match`; fail if no row qualifies\n5. 
In diagnose mode: report the best-matching row (highest count) and which patterns matched\n\n## Acceptance\n\n- Parses in YAML, compiles to crate\n- Correctly scans rows and counts distinct column pattern matches per row\n- Works with `$`-bound sheet names (from bd-3xa)\n- Diagnose mode reports best candidate row and match count\n- `min_match` threshold is respected (partial match = fail)\n- Unit tests: exact threshold match, below threshold, multiple qualifying rows, empty sheet\n- Integration test using CMBS-WATL fixture definitions\n\n## Test Case\n\nSee `tests/fixtures/cmbs-watl/` — the `cmbs-watl-desired.fp.yaml` uses `header_row_match` with 7 column patterns and `min_match: 5`. This mirrors the Lambda's 30% threshold on 22 columns. Header row is at row 11 (BCMS RSRV), row 4 (CGCMT supp), or row 12 (MSC RSRV).","status":"pending","priority":1,"issue_type":"feature","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac","updated_at":"2026-02-27T09:35:00.000000Z","labels":["dsl","enhancement","cmbs-watl"],"compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-3xc","depends_on_id":"bd-3xa","type":"depends_on","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac"},{"issue_id":"bd-3xc","depends_on_id":"bd-3xb","type":"depends_on","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac"}]}
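The `bind` resolution proposed in bd-3xa (implementation steps 2 to 5) could look roughly like the sketch below. It is only an illustration under stated assumptions: `resolve_sheet` and `bind_first_match` are hypothetical names, and a plain predicate closure stands in for the compiled `pattern` regex so the example needs nothing beyond the standard library.

```rust
use std::collections::HashMap;

/// Resolve a `sheet` parameter against captured bindings (hypothetical helper).
/// Literal names pass through unchanged; `$`-prefixed names must have been
/// bound by an earlier `sheet_name_regex` assertion.
fn resolve_sheet(sheet: &str, bindings: &HashMap<String, String>) -> Result<String, String> {
    if !sheet.starts_with('$') {
        return Ok(sheet.to_string());
    }
    bindings
        .get(sheet)
        .cloned()
        .ok_or_else(|| format!("unresolved sheet binding `{sheet}`: no earlier assertion bound it"))
}

/// Store the first sheet name accepted by `matches` under `bind_name`.
/// `matches` is an assumption standing in for the compiled regex.
/// Returns whether any sheet matched (i.e. whether the assertion passes).
fn bind_first_match(
    sheet_names: &[&str],
    matches: impl Fn(&str) -> bool,
    bind_name: &str,
    bindings: &mut HashMap<String, String>,
) -> bool {
    for &name in sheet_names {
        if matches(name) {
            // First match wins, as the acceptance criteria require.
            bindings.insert(bind_name.to_string(), name.to_string());
            return true;
        }
    }
    false
}
```

Keying the map by the full `$watl_sheet` token keeps the lookup a single `get`; stripping the `$` before storing would work equally well as long as both sides agree.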
diff --git a/tests/fixtures/cmbs-watl/README.md b/tests/fixtures/cmbs-watl/README.md
@@ -0,0 +1,53 @@
+# CMBS-WATL Test Case
+
+Test case derived from CredIQ CMBS pipeline (`lambda-osda-s3-to-db`).
+
+## Problem
+
+The OSDA Lambda parser successfully identifies and extracts Watchlist (WATL) data
+from CMBS Excel workbooks with varying sheet names and header positions. The
+fingerprint DSL cannot express the same detection because:
+
+1. `cell_regex` / `sheet_min_rows` / `range_non_null` require a hardcoded sheet name
+2. `sheet_name_regex` proves a matching sheet exists but doesn't capture the name
+3. There is no way to search a column range for a pattern (header row varies by servicer)
+
+## Observed Variants
+
+### Sheet Name Variants
+| Servicer | Sheet Name |
+|----------|-----------|
+| BCMS (Barclays) | `Watchlist` |
+| BMO | `Watchlist` |
+| CGCMT (supp) | `Watchlist` |
+| MSC (KeyBank) | `Servicer Watch List` |
+| BMO (duplicate) | `Watchlist (2)` |
+
+### Header Row Position
+| Format | Header Row | Field ID Row | CREFC Boilerplate |
+|--------|-----------|-------------|-------------------|
+| BCMS RSRV .xls | Row 11 | Row 9 | Row 2 |
+| CGCMT supp .xlsx | Row 4 | Row 3 | Row 1 (title only) |
+| MSC RSRV .xls | Row 12 | Row 11 | Row 3 |
+
+### Lambda Detection Chain
+1. S3 key matches filepath_pattern: `(?:.*Watch\s?list.*|_WATL)\.(?:csv|txt)$`
+2. Excel sheets converted to CSV; generated filename matched against pattern
+3. NAME strategy scans first ~30 rows for header row using column `name_variations` regex
+4. Requires 30-50% of 22 columns to match (threshold depends on column count)
+5. Row-level metadata filtering skips field number rows, CREFC code rows, etc.
+
+### Key WATL Column Patterns (from watl.py)
+```
+transaction_id:              Transaction ID | Trans? ID | ^L1, S1, D1$ | ^1(\.0)?$
+loan_id:                     ^Loan ID | ^L3, S3, D3$ | ^3(\.0)?$
+prospectus_loan_id:          Prospectus Loan ID | ^L4, D4, S4$ | ^4(\.0)?$
+property_name:               Property Name | ^S55$ | ^5(\.0)?$
+comments_servicer_watchlist: Comments - Servicer Watchlist | Comment/Action to be taken | Watchlist Comments | ^19(\.0)?$
+```
+
+## DSL Enhancements Required
+
+1. **Sheet name binding** (`sheet_name_regex` captures → downstream assertions reference)
+2. **Column search** (search column A, rows 1-N for a regex pattern)
+3. **Row search / header detection** (find a row where N cells match column patterns)
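As a rough illustration of enhancement 2, the sketch below parses a `row_range` such as `"1:20"` (1-based, inclusive, as bd-3xb specifies) and scans one column for the first matching cell. The function names, the `Option<&str>` cell representation, and the predicate closure used in place of a regex are all assumptions made for this example, not the crate's actual API.

```rust
/// Parse a `row_range` such as "1:20" into 1-based inclusive bounds.
fn parse_row_range(spec: &str) -> Result<(usize, usize), String> {
    let (start, end) = spec
        .split_once(':')
        .ok_or_else(|| format!("row_range `{spec}` is not of the form start:end"))?;
    let start: usize = start
        .trim()
        .parse()
        .map_err(|e| format!("bad start row in `{spec}`: {e}"))?;
    let end: usize = end
        .trim()
        .parse()
        .map_err(|e| format!("bad end row in `{spec}`: {e}"))?;
    if start == 0 || end < start {
        return Err(format!("row_range `{spec}` must satisfy 1 <= start <= end"));
    }
    Ok((start, end))
}

/// Return the 1-based row of the first cell accepted by `matches` within
/// `row_range`. Empty cells (`None`) and rows past the end of the column
/// are skipped rather than treated as errors.
fn column_search(
    column: &[Option<&str>],
    row_range: &str,
    matches: impl Fn(&str) -> bool,
) -> Result<Option<usize>, String> {
    let (start, end) = parse_row_range(row_range)?;
    let last = end.min(column.len());
    for row in start..=last {
        if let Some(text) = column[row - 1] {
            if matches(text) {
                return Ok(Some(row));
            }
        }
    }
    Ok(None)
}
```

Returning the matched row rather than a bare boolean is a design choice of this sketch: it would let a diagnose mode report where the boilerplate was found.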
diff --git a/tests/fixtures/cmbs-watl/cmbs-watl-current.fp.yaml b/tests/fixtures/cmbs-watl/cmbs-watl-current.fp.yaml
@@ -0,0 +1,39 @@
+# Current best-effort WATL fingerprint (v1 — hardcoded sheet name)
+# Matches BCMS/BMO RSRV workbooks but fails on MSC (different sheet name)
+# and CGCMT supp (different boilerplate layout). See README.md for details.
+
+fingerprint_id: cmbs-watl.v1
+format: xlsx
+
+assertions:
+  - name: watl_sheet_present
+    sheet_name_regex:
+      pattern: "(?i)watch\\s?list|WATL"
+
+  - name: crefc_boilerplate
+    cell_regex:
+      sheet: "Watchlist"
+      cell: A2
+      pattern: "(?i)CREFC Investor Reporting Package"
+
+  - name: field_ids_or_header_row9
+    cell_regex:
+      sheet: "Watchlist"
+      cell: A9
+      pattern: "(?i)^L1|^Trans|^1\\.0$"
+
+  - name: has_data_rows
+    sheet_min_rows:
+      sheet: "Watchlist"
+      min_rows: 10
+
+extract:
+  - name: watl_header_area
+    type: range
+    sheet: "Watchlist"
+    range: "A8:V12"
+
+content_hash:
+  algorithm: blake3
+  over:
+    - watl_header_area
diff --git a/tests/fixtures/cmbs-watl/cmbs-watl-desired.fp.yaml b/tests/fixtures/cmbs-watl/cmbs-watl-desired.fp.yaml
@@ -0,0 +1,60 @@
+# Desired WATL fingerprint — uses proposed DSL enhancements.
+# This is the target definition once the sheet_name_regex bind, column_search,
+# and header_row_match are implemented.
+#
+# NOT a valid definition for the current fingerprint DSL version; serves as a spec for the enhancements.
+
+fingerprint_id: cmbs-watl.v2
+format: xlsx
+
+assertions:
+  # Enhancement 1: sheet_name_regex with bind
+  # Captures the matched sheet name as $watl_sheet for downstream assertions.
+  - name: watl_sheet_present
+    sheet_name_regex:
+      pattern: "(?i)watch\\s?list|WATL"
+      bind: "$watl_sheet"
+
+  # Enhancement 2: column_search
+  # Searches column A rows 1-20 for any cell matching the pattern.
+  # Proves CREFC-standard content exists without knowing the exact row.
+  - name: crefc_or_watl_label
+    column_search:
+      sheet: "$watl_sheet"
+      column: A
+      row_range: "1:20"
+      pattern: "(?i)CREFC Investor Reporting|Watch\\s?List|SERVICER WATCHLIST"
+
+  # Enhancement 3: header_row_match
+  # Searches rows 1-30 for a row where >= min_match cells match the given column patterns.
+  # This mirrors the Lambda's NAME strategy column matching.
+  - name: watl_header_row
+    header_row_match:
+      sheet: "$watl_sheet"
+      row_range: "1:30"
+      min_match: 5
+      columns:
+        - pattern: "(?i)Trans(action)?\\s*ID|^L1|^1(\\.0)?$"
+        - pattern: "(?i)Group\\s*ID|^L2|^2(\\.0)?$"
+        - pattern: "(?i)^Loan\\s*ID|^L3|^3(\\.0)?$"
+        - pattern: "(?i)Prospectus\\s*Loan\\s*ID|^L4|^4(\\.0)?$"
+        - pattern: "(?i)Property\\s*Name|^S55$|^5(\\.0)?$"
+        - pattern: "(?i)Property\\s*Type|^S61$|^6(\\.0)?$"
+        - pattern: "(?i)Comments.*Servicer.*Watchlist|Comment.*Action.*taken|Watchlist Comments|^19(\\.0)?$"
+
+  # Standard assertion using bound sheet name
+  - name: has_data_rows
+    sheet_min_rows:
+      sheet: "$watl_sheet"
+      min_rows: 5
+
+extract:
+  - name: watl_headers
+    type: range
+    sheet: "$watl_sheet"
+    range: "A1:V30"
+
+content_hash:
+  algorithm: blake3
+  over:
+    - watl_headers
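The counting rule behind `header_row_match` can be sketched as follows. This is one possible reading of the bd-3xc semantics, not a confirmed implementation: each cell claims the first pattern it matches, each pattern counts at most once per row, and the first row reaching `min_match` wins. `CellPredicate`, `count_row_matches`, and the `max_rows` parameter are hypothetical, and plain function pointers stand in for compiled regexes.

```rust
/// Stand-in for a compiled column regex (assumption: the real DSL uses regex patterns).
type CellPredicate = fn(&str) -> bool;

/// Count distinct patterns matched in one row. Each cell claims at most one
/// pattern (the first it matches); a pattern is counted once per row.
fn count_row_matches(row: &[&str], patterns: &[CellPredicate]) -> usize {
    let mut hit = vec![false; patterns.len()];
    for &cell in row {
        if let Some(i) = patterns.iter().position(|p| p(cell)) {
            hit[i] = true;
        }
    }
    hit.iter().filter(|&&h| h).count()
}

/// Return the 1-based index of the first row with at least `min_match`
/// distinct pattern matches, scanning at most `max_rows` rows from the top.
fn header_row_match(
    rows: &[Vec<&str>],
    max_rows: usize,
    min_match: usize,
    patterns: &[CellPredicate],
) -> Option<usize> {
    rows.iter()
        .take(max_rows)
        .position(|row| count_row_matches(row, patterns) >= min_match)
        .map(|i| i + 1)
}
```

With patterns that accept both header text and field IDs, a low `min_match` can stop at the field-ID row before the header row is reached, so the threshold determines which of the two rows is reported.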