
Commit e433829

Adding samples to the FCC benchmark branch to demo how this works.

1 parent 1d6ca16 · commit e433829
File tree: 5 files changed, +1524 -9 lines

Lines changed: 333 additions & 0 deletions
# FCC Invoice Processing - End-to-End Example

This directory contains a complete end-to-end example for processing FCC (Federal Communications Commission) political advertising invoices using the IDP accelerator with Stickler-based evaluation.
## Overview

This example demonstrates:

1. **Deployment** - Deploy the IDP stack with the FCC invoice configuration
2. **Inference** - Run inference on sample FCC invoices
3. **Evaluation** - Evaluate extraction results using Stickler
4. **Review** - Analyze individual and aggregated metrics
## Directory Contents

```
config_library/pattern-2/fcc-invoices/
├── README.md                        # This file
├── config.yaml                      # Base IDP configuration
├── fcc_configured.yaml              # Deployed stack configuration
├── stickler_config.json             # Stickler evaluation rules
├── bulk_evaluate_fcc_invoices.py    # Evaluation script
└── sr_refactor_labels_5_5_25.csv    # Ground truth labels (full dataset)
```
## Sample Data

Sample documents are located in `samples/fcc-invoices/`:

- 3 sample PDF invoices
- `fcc_invoices_sample_3.csv` - Manifest for the 3 samples
## Prerequisites

1. **AWS Credentials**: Valid AWS credentials with appropriate permissions
2. **Python Environment**: Python 3.12+ with required packages
3. **IDP CLI**: Installed and configured
4. **Stickler**: Installed with `pip install -e "./stickler[dev]"`
5. **Dependencies**: `pip install pandas`
## Step 1: Deploy the Stack

Deploy the IDP stack with the FCC invoice configuration:

```bash
idp-cli deploy \
  --stack-name fcc-inf-test \
  --custom-config config_library/pattern-2/fcc-invoices/config.yaml \
  --region us-west-2 \
  --wait \
  --template-url https://s3.us-west-2.amazonaws.com/bobs-artifacts-us-west-2/idp-wip/idp-main.yaml \
  --admin-email [email protected] \
  --pattern pattern-2
```

**What this does:**
- Creates CloudFormation stack with Lambda functions, S3 buckets, and DynamoDB tables
- Configures extraction model (Claude Sonnet 4)
- Sets up OCR with Textract (LAYOUT + TABLES features)
- Deploys with FCC-specific prompts and schema

**Expected output:**
- Stack creation takes ~5-10 minutes
- Stack status: `CREATE_COMPLETE`
## Step 2: Run Inference

Run inference on the sample documents:

```bash
idp-cli run-inference \
  --stack-name fcc-inf-test \
  --manifest samples/fcc-invoices/fcc_invoices_sample_3.csv \
  --region us-west-2
```

**What this does:**
- Uploads documents to the S3 input bucket
- Triggers the Lambda processing pipeline
- Performs OCR with Textract
- Extracts structured data using Claude
- Stores results in the S3 output bucket

**Expected output:**
```
Validating manifest...
✓ Manifest validated successfully
Initializing batch processor for stack: fcc-inf-test
✓ Batch submitted successfully
Batch ID: batch-20251017-140000
Processing 3 documents...
```

**Monitor progress:**
```bash
idp-cli status \
  --stack-name fcc-inf-test \
  --batch-id <batch-id> \
  --region us-west-2 \
  --wait
```
## Step 3: Download Results

Download the inference results locally:

```bash
idp-cli download-results \
  --stack-name fcc-inf-test \
  --batch-id cli-batch-20251017-190516 \
  --output-dir fcc_results \
  --region us-west-2
```

**Note**: Replace `cli-batch-20251017-190516` with your actual batch ID from the inference step. You can specify any output directory name.

**What this does:**
- Downloads all result files from S3
- Creates the directory structure `fcc_results/<doc_id>/sections/1/result.json`
- Each result contains extracted fields and metadata

**Result structure:**
```json
{
  "document_class": {
    "type": "FCC-Invoice"
  },
  "inference_result": {
    "agency": "Agency Name",
    "advertiser": "Advertiser Name",
    "gross_total": "1,234.56",
    "net_amount_due": "1,234.56",
    "line_item__description": ["M-F 11a-12p", "M-F 12n-1p"],
    "line_item__days": ["MTWTF--", "MTWTF--"],
    "line_item__rate": ["100.00", "150.00"],
    "line_item__start_date": ["11/01/21", "11/01/21"],
    "line_item__end_date": ["11/07/21", "11/07/21"]
  }
}
```
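The line-item fields are flattened into parallel lists (one list per column). If it helps to inspect them row by row, a small sketch like the following can zip them back into per-row records; the path below follows the layout shown above, with `<doc_id>` as a placeholder for one of your documents.

```python
import json
from pathlib import Path

# Path follows the layout above; replace <doc_id> with one of your document IDs.
result_path = Path("fcc_results") / "<doc_id>" / "sections" / "1" / "result.json"
fields = json.loads(result_path.read_text())["inference_result"]

# Gather the parallel line_item__* lists and zip them into row-wise records.
line_item_cols = {k: v for k, v in fields.items() if k.startswith("line_item__")}
n_rows = max((len(v) for v in line_item_cols.values()), default=0)
rows = [
    {col: (vals[i] if i < len(vals) else None) for col, vals in line_item_cols.items()}
    for i in range(n_rows)
]
for row in rows:
    print(row)
```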
## Step 4: Run Evaluation

Evaluate the extraction results against ground truth:

```bash
cd config_library/pattern-2/fcc-invoices

python bulk_evaluate_fcc_invoices.py \
  --results-dir ../../../fcc_results \
  --csv-path sr_refactor_labels_5_5_25.csv \
  --output-dir evaluation_output
```

**What this does:**
- Loads ground truth labels from CSV
- Matches documents by doc_id
- Performs doc-by-doc comparison using Stickler
- Saves individual comparison results
- Aggregates metrics across all documents
- Generates comprehensive evaluation report

**Expected output:**
```
================================================================================
BULK FCC INVOICE EVALUATION
================================================================================

📊 Loading ground truth from sr_refactor_labels_5_5_25.csv...
✓ Loaded 221 documents with ground truth labels

📁 Loading inference results from ../../../fcc_results...
✓ Loaded 3 inference results

🔗 Matching ground truth to inference results...
✓ Matched 3 document pairs

⚙️ Evaluating 3 documents...
✓ Completed evaluation
  Individual results saved to: evaluation_output

================================================================================
AGGREGATED EVALUATION RESULTS
================================================================================

📊 Processing Summary:
  Documents processed: 3
  Errors encountered: 0
  Non-matches found: 23

📈 Overall Metrics:
  Precision: 0.7341
  Recall:    0.4637
  F1 Score:  0.5684
  Accuracy:  0.3993

  Confusion Matrix:
    TP: 530  | FP: 192
    FN: 613  | TN: 5
    FP1 (False Alarm): 11
    FP2 (Wrong Value): 181

📋 Field-Level Metrics (Top 10 by F1 Score):
  Field                                    Precision  Recall     F1
  ---------------------------------------- ---------- ---------- ----------
  line_item__description                   0.9236     0.8261     0.8721
  gross_total                              1.0000     0.7500     0.8571
  net_amount_due                           1.0000     0.7500     0.8571
  line_item__rate                          0.8169     0.7117     0.7607
  ...

💾 Aggregated results saved to evaluation_output/aggregated_metrics.json

================================================================================
✅ Evaluation complete!
  Individual results: evaluation_output
  Aggregated metrics: evaluation_output/aggregated_metrics.json
================================================================================
```
## Step 5: Review Results

### Individual Document Results

Each document has a detailed comparison result:

```bash
cat evaluation_output/0492b95bc342870920c480040bc33513.json | python -m json.tool | less
```

**Contains:**
- Field-by-field scores
- Confusion matrix (overall and per-field)
- Non-matches with details
- Similarity scores

### Aggregated Metrics

View the overall performance:

```bash
cat evaluation_output/aggregated_metrics.json | python -m json.tool | less
```

**Contains:**
- Overall precision, recall, F1, accuracy
- Per-field performance metrics
- Confusion matrix breakdown
- Non-match summary
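Beyond pretty-printing, a short script can flag the weakest fields in the aggregated file. The key names used below (`field_metrics`, `f1`) are hypothetical guesses based on the console report in Step 4, so print the top-level keys first and adjust to the file's actual structure:

```python
import json

with open("evaluation_output/aggregated_metrics.json") as f:
    metrics = json.load(f)

# Inspect what the file actually contains before assuming a structure.
print("Top-level keys:", sorted(metrics.keys()))

# Hypothetical keys below -- rename to match the real structure of the file.
field_metrics = metrics.get("field_metrics", {})
worst = sorted(field_metrics.items(), key=lambda kv: kv[1].get("f1", 0.0))[:5]
for field, scores in worst:
    print(f"{field:40s} F1={scores.get('f1', 0.0):.4f}")
```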
## Understanding the Results

### Confusion Matrix Metrics

- **TP (True Positive)**: Correctly extracted a field with the correct value
- **FP (False Positive)**: Extracted a field with an incorrect value, or extracted a field that shouldn't exist
- **TN (True Negative)**: Correctly omitted a field that shouldn't exist
- **FN (False Negative)**: Failed to extract a field that should exist
- **FP1 (False Alarm)**: Extracted a field that shouldn't exist
- **FP2 (Wrong Value)**: Extracted a field with the wrong value

### Derived Metrics

- **Precision**: TP / (TP + FP) - the fraction of extracted values that are correct
- **Recall**: TP / (TP + FN) - the fraction of ground truth values that were found
- **F1 Score**: Harmonic mean of precision and recall
- **Accuracy**: (TP + TN) / Total - overall correctness
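As a sanity check, the overall numbers in the sample run above follow directly from its confusion matrix (TP = 530, FP = FP1 + FP2 = 11 + 181 = 192, FN = 613, TN = 5):

```python
# Recompute the overall metrics from the confusion matrix in the sample run above.
tp, fp, fn, tn = 530, 192, 613, 5            # fp = fp1 + fp2 = 11 + 181

precision = tp / (tp + fp)                              # 530 / 722  ≈ 0.7341
recall    = tp / (tp + fn)                              # 530 / 1143 ≈ 0.4637
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.5684
accuracy  = (tp + tn) / (tp + fp + fn + tn)             # 535 / 1340 ≈ 0.3993

print(f"precision={precision:.4f}  recall={recall:.4f}  f1={f1:.4f}  accuracy={accuracy:.4f}")
```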
## Stickler Configuration

The `stickler_config.json` defines validation rules:

### Simple Fields (Lists)
- `agency`: FuzzyComparator (threshold 0.8) - Allows minor name variations
- `advertiser`: FuzzyComparator (threshold 0.8)
- `gross_total`: ExactComparator (threshold 1.0) - Requires exact match
- `net_amount_due`: ExactComparator (threshold 1.0)

### Line Item Fields (Lists)
- `line_item__description`: LevenshteinComparator (threshold 0.7) - Allows typos
- `line_item__days`: ExactComparator (threshold 1.0)
- `line_item__rate`: ExactComparator (threshold 1.0)
- `line_item__start_date`: ExactComparator (threshold 1.0)
- `line_item__end_date`: ExactComparator (threshold 1.0)

**Note**: All fields are configured as lists to match the flat format used by both ground truth and predictions.
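If you want to tweak thresholds programmatically rather than by hand, a sketch like the following could regenerate a draft config. The dictionary layout here (a field-to-comparator/threshold mapping) is only a guess at the schema, not Stickler's documented format; open the bundled `stickler_config.json` to see the real structure before overwriting anything.

```python
import json

# Illustrative only: the key names below are assumed, not Stickler's documented schema.
# Mirror whatever structure the existing stickler_config.json actually uses.
comparators = {
    "agency":                 ("FuzzyComparator", 0.8),
    "advertiser":             ("FuzzyComparator", 0.8),
    "gross_total":            ("ExactComparator", 1.0),
    "net_amount_due":         ("ExactComparator", 1.0),
    "line_item__description": ("LevenshteinComparator", 0.7),
    "line_item__days":        ("ExactComparator", 1.0),
    "line_item__rate":        ("ExactComparator", 1.0),
    "line_item__start_date":  ("ExactComparator", 1.0),
    "line_item__end_date":    ("ExactComparator", 1.0),
}

config = {
    name: {"comparator": comp, "threshold": thr}
    for name, (comp, thr) in comparators.items()
}

with open("stickler_config_draft.json", "w") as f:   # write a draft, not the real file
    json.dump(config, f, indent=2)
```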
## Data Format

### Ground Truth (CSV)
The `sr_refactor_labels_5_5_25.csv` contains:
- `doc_id`: Document identifier (without the .pdf extension)
- `refactored_labels`: JSON string with ground truth in flat list format

### Inference Results
Directory structure: `results_dir/{doc_id}.pdf/sections/1/result.json`

The flat format uses a `line_item__` prefix for list fields, where each field is a list of values.
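Given that description, loading and parsing the ground truth with pandas might look like the sketch below. The column names come from the CSV description above; the assumption that `refactored_labels` parses directly with `json.loads` should be verified against the actual file.

```python
import json
import pandas as pd

# Each row pairs a doc_id with a JSON string of flat-format labels.
labels = pd.read_csv("sr_refactor_labels_5_5_25.csv")

ground_truth = {
    row["doc_id"]: json.loads(row["refactored_labels"])
    for _, row in labels.iterrows()
}

example_id = next(iter(ground_truth))
print(example_id, list(ground_truth[example_id])[:5])   # first few field names
```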
## Troubleshooting

### No matched pairs found
- Verify that `doc_id` in the CSV matches the directory names in the results
- Check for a `.pdf` extension mismatch in the doc_id (see the sketch below)
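A quick way to spot mismatches is to diff the two sets of identifiers directly. This sketch strips any `.pdf` suffix from the result directory names before comparing; the paths assume the layout used in the earlier steps.

```python
import pandas as pd
from pathlib import Path

results_dir = Path("../../../fcc_results")   # same results directory as in Step 4
csv_ids = set(pd.read_csv("sr_refactor_labels_5_5_25.csv")["doc_id"].astype(str))

# Result directories may carry a .pdf suffix; strip it before comparing.
result_ids = {p.name.removesuffix(".pdf") for p in results_dir.iterdir() if p.is_dir()}

print("In results but not in CSV:", sorted(result_ids - csv_ids))
print("In CSV but not in results:", sorted(csv_ids - result_ids)[:10])
```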
### AWS Token Expired
```bash
# Refresh your AWS credentials
aws sso login --profile your-profile
```

### Stack not found
```bash
# Verify stack exists
idp-cli list-stacks --region us-west-2
```

### Large matrix warnings
- Normal for documents with many line items (>100)
- Stickler uses the Hungarian algorithm for optimal matching (illustrated below)
- Matching may be slower but produces accurate results
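To see what optimal matching means here, the sketch below pairs predicted and ground-truth line-item descriptions with `scipy.optimize.linear_sum_assignment`. This is only an illustration of the idea, not Stickler's implementation, and the `difflib` ratio is a stand-in similarity function.

```python
from difflib import SequenceMatcher

import numpy as np
from scipy.optimize import linear_sum_assignment

predicted    = ["M-F 11a-12p", "M-F 12n-1p", "SAT 9a-10a"]
ground_truth = ["M-F 12n-1p", "M-F 11a-12p"]

# Cost matrix: lower cost means more similar strings.
cost = np.array([
    [1.0 - SequenceMatcher(None, p, g).ratio() for g in ground_truth]
    for p in predicted
])

# The Hungarian algorithm picks the pairing with the minimum total cost.
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"predicted[{r}] {predicted[r]!r} -> ground_truth[{c}] {ground_truth[c]!r}")
```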
## Next Steps

1. **Scale Up**: Process more documents by creating a larger manifest (see the sketch below)
2. **Tune Configuration**: Adjust Stickler thresholds based on results
3. **Analyze Errors**: Review non-matches to identify extraction issues
4. **Iterate**: Update prompts or schema based on evaluation findings
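For scaling up, a larger manifest can be generated by globbing a directory of PDFs. The manifest's column layout is not documented here, so the single `document_path` column below is a hypothetical placeholder; print the sample manifest's header first and mirror whatever columns it actually uses.

```python
import csv
from pathlib import Path

# Check the sample manifest's header first and mirror its column layout.
sample = Path("samples/fcc-invoices/fcc_invoices_sample_3.csv")
print(sample.read_text().splitlines()[0])    # actual header row of the sample

# Hypothetical single-column layout below -- adjust to match the sample's columns.
pdf_dir = Path("samples/fcc-invoices")
with open("fcc_invoices_full.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["document_path"])       # guessed column name
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        writer.writerow([str(pdf)])
```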
## Additional Resources

- [IDP CLI Documentation](../../README.md)
- [Stickler Documentation](../../../stickler/README.md)
- [Pattern 2 Architecture](../README.md)
- [Evaluation Guide](../../../lib/idp_common_pkg/idp_common/evaluation/README_STICKLER.md)
