Commit e71116b

Example of the eval working
1 parent 597ab79 commit e71116b

2 files changed: +276 -10 lines changed
config_library/pattern-2/fcc-invoices/README.md

Lines changed: 68 additions & 10 deletions
@@ -141,19 +141,48 @@ idp-cli download-results \
 
 ## Step 4: Run Evaluation
 
-Evaluate the extraction results against ground truth using the **simplified evaluation script** (recommended):
+### Option A: Single Source of Truth (Recommended)
+
+Use the IDP config directly - no separate Stickler config needed:
 
 ```bash
 cd config_library/pattern-2/fcc-invoices
 
+python bulk_evaluate_from_idp_config.py \
+--results-dir ../../../fcc_results/cli-batch-20251017-190516 \
+--csv-path sample_labels_3.csv \
+--idp-config-path sr_FCC_config.json \
+--output-dir evaluation_output
+```
+
+**Benefits:**
+- Single source of truth - evaluation config comes from IDP config
+- Extracts Stickler settings from `x-aws-stickler-*` extensions in JSON Schema
+- No need to maintain separate `stickler_config.json`
+- Guarantees evaluation matches deployment configuration
+
+### Option B: Separate Stickler Config
+
+Use the simplified script with standalone Stickler config:
+
+```bash
 python bulk_evaluate_fcc_invoices_simple.py \
 --results-dir ../../../fcc_results/cli-batch-20251017-190516 \
 --csv-path sample_labels_3.csv \
 --config-path stickler_config.json \
 --output-dir evaluation_output
 ```
 
-**Alternative**: Use the legacy script (more complex, same results):
+**Benefits:**
+- 260 lines vs 671 lines (61% less code)
+- Easier to understand and modify
+- No temporary file overhead
+- Direct integration with SticklerEvaluationService
+
+### Option C: Legacy Script
+
+Use the original complex script (not recommended):
+
 ```bash
 python bulk_evaluate_fcc_invoices.py \
 --results-dir ../../../fcc_results/cli-batch-20251017-190516 \
@@ -164,21 +193,14 @@ python bulk_evaluate_fcc_invoices.py \
 
 **Note**: The `sample_labels_3.csv` contains ground truth for 3 sample documents. For full dataset evaluation, use `sr_refactor_labels_5_5_25.csv`.
 
-**What this does:**
+**What evaluation does:**
 - Loads ground truth labels from CSV
 - Matches documents by doc_id
 - Performs doc-by-doc comparison using SticklerEvaluationService
 - Saves individual comparison results
 - Aggregates metrics across all documents
 - Generates comprehensive evaluation report
 
-**Why use the simplified script?**
-- 260 lines vs 671 lines (61% less code)
-- Easier to understand and modify
-- No temporary file overhead
-- Direct integration with SticklerEvaluationService
-- Same accurate results
-
 **Expected output:**
 ```
 ================================================================================
@@ -416,12 +438,48 @@ python bulk_evaluate_fcc_invoices_simple.py \
 --output-dir evaluation_output-2
 ```
 
+### Evaluation with IDP Config
+
+New evaluation script that uses IDP config directly:
+
+```bash
+python bulk_evaluate_from_idp_config.py \
+--results-dir ../../../fcc_results-updated-2/cli-batch-20251031-164416 \
+--csv-path sample_labels_3.csv \
+--idp-config-path sr_FCC_config.json \
+--output-dir evaluation_output-idp-config
+```
+
+**Results:**
+```
+📈 Overall Metrics:
+Precision: 0.5185
+Recall: 1.0000
+F1 Score: 0.6829
+Accuracy: 0.5185
+
+Confusion Matrix:
+TP: 14 | FP: 13
+FN: 0 | TN: 0
+FP1: 2 | FP2: 11
+
+📋 Field-Level Metrics (Top Fields):
+agency F1: 0.8000
+gross_total F1: 0.8000
+net_amount_due F1: 0.8000
+line_item__days F1: 0.8000
+line_item__start_date F1: 0.8000
+line_item__end_date F1: 0.8000
+```
+
 ### Notes
 
 - Multiple deploy/inference cycles were run to iterate on the configuration
 - Final batch ID: `cli-batch-20251031-164416`
 - Evaluation successfully produced results with the simplified script
 - Configuration now properly uses `{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}` placeholder for automatic schema injection
+- New `bulk_evaluate_from_idp_config.py` extracts Stickler config from `x-aws-stickler-*` extensions
+- Single source of truth: IDP config contains both extraction schema and evaluation settings
 
 -region us-west-2 \
 --wait \
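Reviewer note: the README changes above say that `bulk_evaluate_from_idp_config.py` pulls Stickler settings out of `x-aws-stickler-*` extensions in the JSON Schema, but the script itself is not part of this diff. The sketch below only illustrates the general idea of collecting prefixed keys per schema property. The function name `collect_stickler_settings`, the sample schema, and the specific suffixes (`comparator`, `threshold`, `weight`) are assumptions made for illustration, not the script's actual keys or code.

```python
import json

# Prefix named in the README; the suffixes used in the sample below are hypothetical.
PREFIX = "x-aws-stickler-"


def collect_stickler_settings(schema: dict) -> dict:
    """Return {field_name: {setting: value}} for properties that carry prefixed keys."""
    fields = {}
    for name, prop in schema.get("properties", {}).items():
        # Keep only the extension keys and strip the shared prefix.
        settings = {
            key[len(PREFIX):]: value
            for key, value in prop.items()
            if key.startswith(PREFIX)
        }
        if settings:
            fields[name] = settings
    return fields


# Hypothetical schema fragment, not taken from sr_FCC_config.json.
example_schema = {
    "type": "object",
    "properties": {
        "agency": {
            "type": "string",
            "x-aws-stickler-comparator": "FuzzyComparator",
            "x-aws-stickler-threshold": 0.8,
            "x-aws-stickler-weight": 2.0,
        },
        "notes": {"type": "string"},  # no extensions, so it is skipped
    },
}

if __name__ == "__main__":
    print(json.dumps(collect_stickler_settings(example_schema), indent=2))
```

Properties without any prefixed key are left out of the result, so a field only participates in evaluation if the schema says how to compare it. Whether the real script behaves this way is not confirmed by this commit.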
Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
+{
+  "summary": {
+    "documents_processed": 3,
+    "errors": 0
+  },
+  "overall_metrics": {
+    "precision": 0.5185185185185185,
+    "recall": 1.0,
+    "f1_score": 0.6829268292682926,
+    "accuracy": 0.5185185185185185,
+    "tp": 14,
+    "fp": 13,
+    "tn": 0,
+    "fn": 0,
+    "fp1": 2,
+    "fp2": 11,
+    "total": 27
+  },
+  "field_metrics": {
+    "agency": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 1,
+      "fp2": 0,
+      "total": 3
+    },
+    "advertiser": {
+      "precision": 0.3333333333333333,
+      "recall": 1.0,
+      "f1_score": 0.5,
+      "accuracy": 0.3333333333333333,
+      "tp": 1,
+      "fp": 2,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 2,
+      "total": 3
+    },
+    "gross_total": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    },
+    "net_amount_due": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 1,
+      "fp2": 0,
+      "total": 3
+    },
+    "line_item__description": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1_score": 0.0,
+      "accuracy": 0.0,
+      "tp": 0,
+      "fp": 3,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 3,
+      "total": 3
+    },
+    "line_item__days": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    },
+    "line_item__rate": {
+      "precision": 0.3333333333333333,
+      "recall": 1.0,
+      "f1_score": 0.5,
+      "accuracy": 0.3333333333333333,
+      "tp": 1,
+      "fp": 2,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 2,
+      "total": 3
+    },
+    "line_item__start_date": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    },
+    "line_item__end_date": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    }
+  },
+  "errors": [],
+  "stickler_config_used": {
+    "model_name": "FCCInvoice",
+    "match_threshold": 0.7,
+    "fields": {
+      "agency": {
+        "type": "list",
+        "comparator": "FuzzyComparator",
+        "threshold": 0.8,
+        "weight": 2.0,
+        "description": "The advertising agency or media buyer handling the political advertising purchase."
+      },
+      "advertiser": {
+        "type": "list",
+        "comparator": "FuzzyComparator",
+        "threshold": 0.8,
+        "weight": 2.0,
+        "description": "The political advertiser or campaign purchasing the broadcast time."
+      },
+      "gross_total": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 3.0,
+        "description": "The total gross amount for all line items before any discounts or adjustments."
+      },
+      "net_amount_due": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 3.0,
+        "description": "The final net amount due after any discounts or adjustments have been applied."
+      },
+      "line_item__description": {
+        "type": "list",
+        "comparator": "LevenshteinComparator",
+        "threshold": 0.7,
+        "weight": 1.5,
+        "description": "List of broadcast time slot descriptions (e.g., 'M-F 11a-12p' for Monday through Friday 11am to 12pm)."
+      },
+      "line_item__days": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 1.0,
+        "description": "List of days of the week for each broadcast slot (e.g., 'MTWTF--' where each position represents a day)."
+      },
+      "line_item__rate": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 2.0,
+        "description": "List of rates or costs for each broadcast time slot (may include commas for thousands separator)."
+      },
+      "line_item__start_date": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 2.0,
+        "description": "List of start dates for each line item's broadcast schedule (typically in MM/DD/YY format)."
+      },
+      "line_item__end_date": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 2.0,
+        "description": "List of end dates for each line item's broadcast schedule (typically in MM/DD/YY format)."
+      }
+    }
+  }
+}
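Reviewer note: the headline numbers in this results file can be cross-checked from the confusion counts alone. The sketch below assumes the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 as their harmonic mean, accuracy = (TP+TN)/total) and a convention that a zero denominator yields 0.0. That convention is an assumption chosen because it is consistent with the reported `line_item__description` values; it is not taken from the Stickler source. The helper name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive precision, recall, F1 and accuracy from raw confusion counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1_score = ratio(2 * precision * recall, precision + recall)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "accuracy": accuracy,
    }


# Counts copied from "overall_metrics" and "line_item__description" in the file above.
overall = metrics_from_counts(tp=14, fp=13, tn=0, fn=0)
description = metrics_from_counts(tp=0, fp=3, tn=0, fn=0)

if __name__ == "__main__":
    print({name: round(value, 4) for name, value in overall.items()})
    print(description)
```

With TP = 14 and FP = 13, precision and accuracy both come out to 14/27 ≈ 0.5185 and F1 to 28/41 ≈ 0.6829, matching the reported values. Recall of 1.0 here only reflects that no false negatives were counted across the 3 sample documents.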
