
Commit 5f5face

saltmachine and claude committed
beads: add 3 DSL enhancement beads + CMBS-WATL test fixtures
File enhancement beads derived from CMBS commentary fingerprinting:
- bd-3xa: sheet_name_regex bind (capture matched sheet name for downstream assertions)
- bd-3xb: column_search assertion (scan column range for regex match)
- bd-3xc: header_row_match assertion (detect header row by column pattern matching)

Add test fixtures in tests/fixtures/cmbs-watl/:
- README.md documenting observed WATL variants and Lambda detection chain
- cmbs-watl-current.fp.yaml (v1, 3/6 match with hardcoded sheet name)
- cmbs-watl-desired.fp.yaml (v2 target using proposed bind/column_search/header_row_match)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 71ecfe4 commit 5f5face

4 files changed: +155 -0 lines changed


.beads/issues.jsonl

Lines changed: 3 additions & 0 deletions
@@ -69,3 +69,6 @@
{"id":"bd-w2c","title":"PRM-FP-0105b: Witness ledger + query","description":"Fill in witness/ledger.rs (append to EPISTEMIC_WITNESS or ~/.epistemic/witness.jsonl, --no-witness suppression, non-fatal failures) and witness/query.rs (witness query/last/count subcommands with filters). FILES: src/witness/ledger.rs, src/witness/query.rs (2 files). VERIFY: cargo test --lib -- witness::ledger && cargo test --lib -- witness::query","status":"closed","priority":1,"issue_type":"feature","created_at":"2026-02-26T11:47:32.096634Z","created_by":"zac","updated_at":"2026-02-27T01:05:57.059384Z","closed_at":"2026-02-27T01:05:57.059232Z","close_reason":"Implemented witness ledger append/path and query/last/count with witness module tests","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-w2c","depends_on_id":"bd-1bm","type":"blocks","created_at":"2026-02-26T11:48:29.229538Z","created_by":"zac"},{"issue_id":"bd-w2c","depends_on_id":"bd-1pi","type":"blocks","created_at":"2026-02-26T11:48:39.434373Z","created_by":"zac"}]}
{"id":"bd-w5b","title":"PRM-FP-0003: Add base release workflow (build + publish artifacts)","description":"Implement base release workflow parity with rvl: build target matrix archives and publish GitHub release assets. Keep this issue focused on build/publish mechanics only.\n\n## Acceptance Criteria\n- Add .github/workflows/release.yml; workflow derives version from Cargo.toml and ensures tag exists; builds release archives for target matrix and uploads them to GitHub release; generated artifacts are downloadable from release page; no signing/SBOM/provenance steps in this bead (handled separately).\n\n## Notes\nNon-goal: cosign, SHA256SUMS signing, SBOM/provenance generation, or Homebrew formula update.","acceptance_criteria":"Add .github/workflows/release.yml; workflow derives version from Cargo.toml and ensures tag exists; builds release archives for target matrix and uploads them to GitHub release; generated artifacts are downloadable from release page; no signing/SBOM/provenance steps in this bead (handled separately).","notes":"Non-goal: cosign, SHA256SUMS signing, SBOM/provenance generation, or Homebrew formula update.","status":"closed","priority":1,"issue_type":"feature","created_at":"2026-02-25T03:58:30.864567Z","created_by":"zac","updated_at":"2026-02-27T03:18:25.302280Z","closed_at":"2026-02-27T03:18:25.302184Z","close_reason":"Added .github/workflows/release.yml with Cargo.toml version derivation, tag existence/match checks, release target matrix build archives, artifact upload, and GitHub release publish step; YAML validated","compaction_level":0,"original_size":0,"labels":["infra","release","workflow"],"dependencies":[{"issue_id":"bd-w5b","depends_on_id":"bd-tdc","type":"blocks","created_at":"2026-02-26T11:21:23.452677Z","created_by":"zac"}]}
{"id":"bd-zg8","title":"PRM-FP-0104c: Progress reporter (progress/reporter.rs)","description":"Fill in progress/reporter.rs: emit structured progress JSONL to stderr when --progress is set. Warning emission for skipped files. FILE: src/progress/reporter.rs (1 file). VERIFY: cargo test --lib -- progress::reporter","status":"closed","priority":1,"issue_type":"feature","created_at":"2026-02-26T11:47:21.744035Z","created_by":"zac","updated_at":"2026-02-27T00:54:26.968422Z","closed_at":"2026-02-27T00:54:26.968181Z","close_reason":"Implemented structured progress/warning stderr JSONL reporter with progress::reporter tests","compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-zg8","depends_on_id":"bd-1bm","type":"blocks","created_at":"2026-02-26T11:48:29.176622Z","created_by":"zac"}]}
+{"id":"bd-3xa","title":"PRM-FP-0228: sheet_name_regex bind — capture matched sheet name for downstream assertions","description":"## Problem\n\nSpreadsheet assertions (`cell_regex`, `cell_eq`, `range_non_null`, `sheet_min_rows`) require a hardcoded `sheet` parameter. When a workbook contains sheets with varying names across producers (e.g., CMBS watchlist sheets named `Watchlist` vs `Servicer Watch List` vs `Watch List Report for {DEAL}`), `sheet_name_regex` can prove a matching sheet exists but downstream assertions cannot reference the matched name.\n\n## Proposal\n\nAdd an optional `bind` field to `sheet_name_regex`. When present, the matched sheet name is captured as a named variable (e.g., `$watl_sheet`) that downstream assertions can reference in their `sheet` parameter.\n\n```yaml\nassertions:\n  - sheet_name_regex:\n      pattern: \"(?i)watch\\\\s?list|WATL\"\n      bind: \"$watl_sheet\"\n\n  - cell_regex:\n      sheet: \"$watl_sheet\"\n      cell: A2\n      pattern: \"(?i)CREFC\"\n\n  - sheet_min_rows:\n      sheet: \"$watl_sheet\"\n      min_rows: 5\n```\n\n## Implementation\n\n1. Add `bind: Option<String>` to `SheetNameRegex` assertion variant\n2. Add a `bindings: HashMap<String, String>` to assertion evaluation context\n3. When `sheet_name_regex` passes with `bind`, store `{bind_name → matched_sheet_name}`\n4. Before evaluating any assertion with a `sheet` param starting with `$`, resolve from bindings\n5. If a `$`-prefixed sheet name has no binding, fail with descriptive error\n\n## Acceptance\n\n- `bind` field parses in YAML and compiles to Rust crate\n- Bound name resolves correctly in downstream `cell_eq`, `cell_regex`, `range_non_null`, `sheet_min_rows` assertions\n- Unresolved `$` references produce clear error messages\n- When multiple sheets match, the first match is used (consistent with assertion short-circuit semantics)\n- Unit tests in `assertions.rs` covering: basic bind+resolve, unresolved reference, multiple sheets matching\n\n## Test Case\n\nSee `tests/fixtures/cmbs-watl/` — CMBS Watchlist files where sheet name varies by servicer. The `cmbs-watl-desired.fp.yaml` shows the target definition using `bind`.","status":"pending","priority":1,"issue_type":"feature","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac","updated_at":"2026-02-27T09:35:00.000000Z","labels":["dsl","enhancement","cmbs-watl"],"compaction_level":0,"original_size":0}
+{"id":"bd-3xb","title":"PRM-FP-0229: column_search assertion — search a column range for a regex match","description":"## Problem\n\nCMBS Excel workbooks have boilerplate text (CREFC headers, report titles, servicer names) in varying row positions. The current DSL requires `cell_regex` with a fixed cell reference, but the target row varies by servicer and format. The Lambda parser handles this by scanning the first ~30 rows; the fingerprint DSL has no equivalent.\n\n## Proposal\n\nAdd a `column_search` assertion that searches a single column over a row range for any cell matching a regex pattern.\n\n```yaml\nassertions:\n  - column_search:\n      sheet: \"Watchlist\" # or \"$bound_name\"\n      column: A\n      row_range: \"1:20\"\n      pattern: \"(?i)CREFC Investor Reporting|SERVICER WATCHLIST\"\n```\n\nSemantics: scan cells A1 through A20; pass if ANY cell matches the pattern. This is the column-axis analog of `text_contains` for markdown.\n\n## Implementation\n\n1. Add `ColumnSearch { sheet, column, row_range, pattern }` to `Assertion` enum\n2. Parse `row_range` as `start:end` (1-based, inclusive)\n3. Iterate cells `{column}{start}` through `{column}{end}`, apply regex\n4. Pass on first match; fail if no cell matches\n5. In diagnose mode, report nearest partial matches and actual cell values scanned\n\n## Acceptance\n\n- Parses in YAML, compiles to crate\n- Scans specified column/row range and matches regex\n- Works with `$`-bound sheet names (from bd-3xa)\n- Diagnose mode shows what was scanned on failure\n- Unit tests: match found, no match, empty cells, out-of-range rows\n\n## Test Case\n\nSee `tests/fixtures/cmbs-watl/` — CREFC boilerplate appears at row 2 (BCMS RSRV), row 3 (MSC RSRV), or row 1 (CGCMT supp). `column_search` on A1:A10 catches all variants.","status":"pending","priority":1,"issue_type":"feature","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac","updated_at":"2026-02-27T09:35:00.000000Z","labels":["dsl","enhancement","cmbs-watl"],"compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-3xb","depends_on_id":"bd-3xa","type":"depends_on","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac"}]}
+{"id":"bd-3xc","title":"PRM-FP-0230: header_row_match assertion — detect header row by column pattern matching","description":"## Problem\n\nThe Lambda OSDA parser uses a NAME strategy that scans the first ~30 rows of a CSV/sheet looking for a row where enough column headers match known `name_variations` regex patterns (30-50% threshold). This is the core detection mechanism for WATL, DLSR, and all NAME-strategy schemas. The fingerprint DSL has no equivalent — `cell_regex` checks a single fixed cell.\n\n## Proposal\n\nAdd a `header_row_match` assertion that scans rows in a range and passes if any single row has at least `min_match` cells matching the provided column patterns.\n\n```yaml\nassertions:\n  - header_row_match:\n      sheet: \"$watl_sheet\"\n      row_range: \"1:30\"\n      min_match: 5\n      columns:\n        - pattern: \"(?i)Trans(action)?\\\\s*ID|^L1|^1(\\\\.0)?$\"\n        - pattern: \"(?i)Group\\\\s*ID|^L2|^2(\\\\.0)?$\"\n        - pattern: \"(?i)^Loan\\\\s*ID|^L3|^3(\\\\.0)?$\"\n        - pattern: \"(?i)Prospectus\\\\s*Loan\\\\s*ID|^L4|^4(\\\\.0)?$\"\n        - pattern: \"(?i)Property\\\\s*Name|^S55$|^5(\\\\.0)?$\"\n        - pattern: \"(?i)Property\\\\s*Type|^S61$|^6(\\\\.0)?$\"\n        - pattern: \"(?i)Comments.*Watchlist|^19(\\\\.0)?$\"\n```\n\nSemantics: for each row in `row_range`, test each cell against each column pattern. A cell matches at most one pattern (first match wins, no double-counting). If any row achieves `>= min_match` distinct pattern matches, the assertion passes.\n\n## Implementation\n\n1. Add `HeaderRowMatch { sheet, row_range, min_match, columns: Vec<ColumnPattern> }` to `Assertion` enum\n2. `ColumnPattern` has `pattern: String` (compiled to regex)\n3. For each row: iterate cells, match against patterns, count distinct matches\n4. Pass on first row achieving `min_match`; fail if no row qualifies\n5. In diagnose mode: report the best-matching row (highest count) and which patterns matched\n\n## Acceptance\n\n- Parses in YAML, compiles to crate\n- Correctly scans rows and counts distinct column pattern matches per row\n- Works with `$`-bound sheet names (from bd-3xa)\n- Diagnose mode reports best candidate row and match count\n- `min_match` threshold is respected (partial match = fail)\n- Unit tests: exact threshold match, below threshold, multiple qualifying rows, empty sheet\n- Integration test using CMBS-WATL fixture definitions\n\n## Test Case\n\nSee `tests/fixtures/cmbs-watl/` — the `cmbs-watl-desired.fp.yaml` uses `header_row_match` with 7 column patterns and `min_match: 5`. This mirrors the Lambda's 30% threshold on 22 columns. Header row is at row 11 (BCMS RSRV), row 4 (CGCMT supp), or row 12 (MSC RSRV).","status":"pending","priority":1,"issue_type":"feature","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac","updated_at":"2026-02-27T09:35:00.000000Z","labels":["dsl","enhancement","cmbs-watl"],"compaction_level":0,"original_size":0,"dependencies":[{"issue_id":"bd-3xc","depends_on_id":"bd-3xa","type":"depends_on","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac"},{"issue_id":"bd-3xc","depends_on_id":"bd-3xb","type":"depends_on","created_at":"2026-02-27T09:35:00.000000Z","created_by":"zac"}]}
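
The implementation steps listed in bd-3xa and bd-3xb can be sketched as follows. This is an illustrative Python model, not the crate's Rust code: the function names, the `bindings` dict, and the in-memory `workbook` layout (sheet name mapped to a dict of cell references) are all assumptions made for the example, and a `ValueError` stands in for the "descriptive error" the bead asks for.

```python
import re

def sheet_name_regex(sheet_names, pattern, bind=None, bindings=None):
    """Pass if any sheet name matches; record the first match under `bind` if given."""
    rx = re.compile(pattern)
    for name in sheet_names:
        if rx.search(name):
            if bind is not None and bindings is not None:
                bindings[bind] = name  # first match wins
            return True
    return False

def resolve_sheet(sheet, bindings):
    """Resolve a `$`-prefixed sheet reference; plain names pass through unchanged."""
    if sheet.startswith("$"):
        if sheet not in bindings:
            raise ValueError(f"unresolved sheet binding: {sheet}")
        return bindings[sheet]
    return sheet

def column_search(workbook, sheet, column, row_range, pattern, bindings):
    """Pass if any cell in `column` over the inclusive, 1-based `row_range` matches."""
    cells = workbook[resolve_sheet(sheet, bindings)]
    start, end = (int(part) for part in row_range.split(":"))
    rx = re.compile(pattern)
    for row in range(start, end + 1):
        value = cells.get(f"{column}{row}")  # missing cells are treated as empty
        if value is not None and rx.search(str(value)):
            return True
    return False

# Hypothetical workbook: sheet name -> {cell reference -> value}.
workbook = {
    "Summary": {"A1": "Deal summary"},
    "Servicer Watch List": {
        "A3": "CREFC Investor Reporting Package",
        "A12": "Trans ID",
    },
}

bindings = {}
bound = sheet_name_regex(
    list(workbook), r"(?i)watch\s?list|WATL", bind="$watl_sheet", bindings=bindings
)
found = column_search(
    workbook, "$watl_sheet", "A", "1:20",
    r"(?i)CREFC Investor Reporting|SERVICER WATCHLIST", bindings,
)
print(bound, bindings, found)
```

In this model the sheet name is bound once and every later assertion goes through `resolve_sheet`, so a definition never has to hardcode `Watchlist` versus `Servicer Watch List`.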

tests/fixtures/cmbs-watl/README.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+# CMBS-WATL Test Case
+
+Test case derived from CredIQ CMBS pipeline (`lambda-osda-s3-to-db`).
+
+## Problem
+
+The OSDA Lambda parser successfully identifies and extracts Watchlist (WATL) data
+from CMBS Excel workbooks with varying sheet names and header positions. The
+fingerprint DSL cannot express the same detection because:
+
+1. `cell_regex` / `sheet_min_rows` / `range_non_null` require a hardcoded sheet name
+2. `sheet_name_regex` proves a matching sheet exists but doesn't capture the name
+3. There is no way to search a column range for a pattern (header row varies by servicer)
+
+## Observed Variants
+
+### Sheet Name Variants
+| Servicer | Sheet Name |
+|----------|-----------|
+| BCMS (Barclays) | `Watchlist` |
+| BMO | `Watchlist` |
+| CGCMT (supp) | `Watchlist` |
+| MSC (KeyBank) | `Servicer Watch List` |
+| BMO (duplicate) | `Watchlist (2)` |
+
+### Header Row Position
+| Format | Header Row | Field ID Row | CREFC Boilerplate |
+|--------|-----------|-------------|-------------------|
+| BCMS RSRV .xls | Row 11 | Row 9 | Row 2 |
+| CGCMT supp .xlsx | Row 4 | Row 3 | Row 1 (title only) |
+| MSC RSRV .xls | Row 12 | Row 11 | Row 3 |
+
+### Lambda Detection Chain
+1. S3 key matches filepath_pattern: `(?:.*Watch\s?list.*|_WATL)\.(?:csv|txt)$`
+2. Excel sheets converted to CSV; generated filename matched against pattern
+3. NAME strategy scans first ~30 rows for header row using column `name_variations` regex
+4. Requires 30-50% of 22 columns to match (threshold depends on column count)
+5. Row-level metadata filtering skips field number rows, CREFC code rows, etc.
+
+### Key WATL Column Patterns (from watl.py)
+```
+transaction_id: Transaction ID | Trans? ID | ^L1, S1, D1$ | ^1(\.0)?$
+loan_id: ^Loan ID | ^L3, S3, D3$ | ^3(\.0)?$
+prospectus_loan_id: Prospectus Loan ID | ^L4, D4, S4$ | ^4(\.0)?$
+property_name: Property Name | ^S55$ | ^5(\.0)?$
+comments_servicer_watchlist: Comments - Servicer Watchlist | Comment/Action to be taken | Watchlist Comments | ^19(\.0)?$
+```
+
+## DSL Enhancements Required
+
+1. **Sheet name binding** (`sheet_name_regex` captures → downstream assertions reference)
+2. **Column search** (search column A, rows 1-N for a regex pattern)
+3. **Row search / header detection** (find a row where N cells match column patterns)
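
Step 1 of the Lambda detection chain in the README can be exercised directly. The sketch below is Python (the language of `watl.py`) and uses only the `filepath_pattern` quoted in the README; the generated CSV names, including the `deal_` prefix, are hypothetical and not taken from the pipeline. As quoted, the pattern carries no `(?i)` flag, so the sketch evaluates it both as written and with `re.IGNORECASE`. Which of the two the Lambda actually applies is not stated in the README.

```python
import re

# Pattern copied from the README's "Lambda Detection Chain", step 1.
FILEPATH_PATTERN = r"(?:.*Watch\s?list.*|_WATL)\.(?:csv|txt)$"

# Hypothetical generated filenames: one per observed sheet name, plus controls.
candidates = [
    "deal_Watchlist.csv",
    "deal_Servicer Watch List.csv",
    "deal_Watchlist (2).csv",
    "deal_WATL.txt",
    "deal_Summary.csv",
    "deal_Watchlist.xlsx",
]

def matching(names, flags=0):
    """Return the names accepted by the filepath pattern under the given regex flags."""
    rx = re.compile(FILEPATH_PATTERN, flags)
    return [name for name in names if rx.search(name)]

as_written = matching(candidates)
ignoring_case = matching(candidates, re.IGNORECASE)

for name in candidates:
    print(f"{name!r}: as written={name in as_written}, "
          f"ignore case={name in ignoring_case}")
```

Under these assumptions, a file named after the `Servicer Watch List` sheet is accepted only in the case-insensitive run, because the capital `L` in `List` does not match the lowercase `list` in the pattern. That is worth checking against the real Lambda configuration before encoding the pattern in a fixture.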

tests/fixtures/cmbs-watl/cmbs-watl-current.fp.yaml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# Current best-effort WATL fingerprint (v1 — hardcoded sheet name)
+# Matches BCMS/BMO RSRV workbooks but fails on MSC (different sheet name)
+# and CGCMT supp (different boilerplate layout). See README.md for details.
+
+fingerprint_id: cmbs-watl.v1
+format: xlsx
+
+assertions:
+  - name: watl_sheet_present
+    sheet_name_regex:
+      pattern: "(?i)watch\\s?list|WATL"
+
+  - name: crefc_boilerplate
+    cell_regex:
+      sheet: "Watchlist"
+      cell: A2
+      pattern: "(?i)CREFC Investor Reporting Package"
+
+  - name: field_ids_or_header_row9
+    cell_regex:
+      sheet: "Watchlist"
+      cell: A9
+      pattern: "(?i)^L1|^Trans|^1\\.0$"
+
+  - name: has_data_rows
+    sheet_min_rows:
+      sheet: "Watchlist"
+      min_rows: 10
+
+extract:
+  - name: watl_header_area
+    type: range
+    sheet: "Watchlist"
+    range: "A8:V12"
+
+content_hash:
+  algorithm: blake3
+  over:
+    - watl_header_area

tests/fixtures/cmbs-watl/cmbs-watl-desired.fp.yaml

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+# Desired WATL fingerprint — uses proposed DSL enhancements.
+# This is the target definition once sheet_ref binding, column_search,
+# and header_row_match are implemented.
+#
+# NOT valid YAML for current fingerprint version — serves as spec for enhancements.
+
+fingerprint_id: cmbs-watl.v2
+format: xlsx
+
+assertions:
+  # Enhancement 1: sheet_name_regex with bind
+  # Captures the matched sheet name as $watl_sheet for downstream assertions.
+  - name: watl_sheet_present
+    sheet_name_regex:
+      pattern: "(?i)watch\\s?list|WATL"
+      bind: "$watl_sheet"
+
+  # Enhancement 2: column_search
+  # Searches column A rows 1-20 for any cell matching the pattern.
+  # Proves CREFC-standard content exists without knowing the exact row.
+  - name: crefc_or_watl_label
+    column_search:
+      sheet: "$watl_sheet"
+      column: A
+      row_range: "1:20"
+      pattern: "(?i)CREFC Investor Reporting|Watch\\s?List|SERVICER WATCHLIST"
+
+  # Enhancement 3: header_row_match
+  # Searches rows 1-30 for a row where >= min_match cells match the given column patterns.
+  # This mirrors the Lambda's NAME strategy column matching.
+  - name: watl_header_row
+    header_row_match:
+      sheet: "$watl_sheet"
+      row_range: "1:30"
+      min_match: 5
+      columns:
+        - pattern: "(?i)Trans(action)?\\s*ID|^L1|^1(\\.0)?$"
+        - pattern: "(?i)Group\\s*ID|^L2|^2(\\.0)?$"
+        - pattern: "(?i)^Loan\\s*ID|^L3|^3(\\.0)?$"
+        - pattern: "(?i)Prospectus\\s*Loan\\s*ID|^L4|^4(\\.0)?$"
+        - pattern: "(?i)Property\\s*Name|^S55$|^5(\\.0)?$"
+        - pattern: "(?i)Property\\s*Type|^S61$|^6(\\.0)?$"
+        - pattern: "(?i)Comments.*Servicer.*Watchlist|Comment.*Action.*taken|Watchlist Comments|^19(\\.0)?$"
+
+  # Standard assertion using bound sheet name
+  - name: has_data_rows
+    sheet_min_rows:
+      sheet: "$watl_sheet"
+      min_rows: 5
+
+extract:
+  - name: watl_headers
+    type: range
+    sheet: "$watl_sheet"
+    range: "A1:V30"
+
+content_hash:
+  algorithm: blake3
+  over:
+    - watl_headers
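
The `header_row_match` semantics used by this v2 definition can be modeled in a few lines. The sketch below is an illustrative Python version, not the crate's implementation: the `rows` layout (row number mapped to a list of cell values) and the sample cells are hypothetical, and it assumes "no double-counting" means a cell is credited to the first pattern that matches it and has not already been matched in that row. The patterns are the seven from the definition above, with the YAML escapes decoded.

```python
import re

# The seven column patterns from cmbs-watl-desired.fp.yaml (YAML escapes decoded).
PATTERNS = [
    r"(?i)Trans(action)?\s*ID|^L1|^1(\.0)?$",
    r"(?i)Group\s*ID|^L2|^2(\.0)?$",
    r"(?i)^Loan\s*ID|^L3|^3(\.0)?$",
    r"(?i)Prospectus\s*Loan\s*ID|^L4|^4(\.0)?$",
    r"(?i)Property\s*Name|^S55$|^5(\.0)?$",
    r"(?i)Property\s*Type|^S61$|^6(\.0)?$",
    r"(?i)Comments.*Servicer.*Watchlist|Comment.*Action.*taken|Watchlist Comments|^19(\.0)?$",
]

def header_row_match(rows, row_range, min_match, patterns):
    """Return (passed, row_number, match_count).

    On success, row_number is the first qualifying row. On failure, it is the
    best candidate row (highest count), which is what diagnose mode would report.
    """
    start, end = (int(part) for part in row_range.split(":"))
    compiled = [re.compile(p) for p in patterns]
    best_row, best_count = None, 0
    for row_number in range(start, end + 1):
        matched = set()
        for cell in rows.get(row_number, []):
            for index, rx in enumerate(compiled):
                # A cell is credited to at most one not-yet-matched pattern.
                if index not in matched and rx.search(str(cell)):
                    matched.add(index)
                    break
        if len(matched) > best_count:
            best_row, best_count = row_number, len(matched)
        if len(matched) >= min_match:
            return True, row_number, len(matched)
    return False, best_row, best_count

# Hypothetical sheet content: row number -> cell values.
rows = {
    2: ["CREFC Investor Reporting Package"],
    9: ["L1", "L2", "L3", "L4"],
    11: ["Trans ID", "Group ID", "Loan ID", "Prospectus Loan ID",
         "Property Name", "Property Type", "Comments - Servicer Watchlist"],
}

print(header_row_match(rows, "1:30", 5, PATTERNS))
print(header_row_match(rows, "1:10", 5, PATTERNS))
```

Because the patterns also accept field IDs such as `L1` or `S55`, a complete field ID row that appears above the header row would qualify first. In this sample the field ID row carries only four matching cells, so it stays below `min_match: 5` and the header row at row 11 is the one reported.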
