# FCC Invoice Processing - End-to-End Example

This directory contains a complete end-to-end example for processing FCC (Federal Communications Commission) political advertising invoices using the IDP accelerator with Stickler-based evaluation.

## Overview

This example demonstrates:
1. **Deployment** - Deploy the IDP stack with FCC invoice configuration
2. **Inference** - Run inference on sample FCC invoices
3. **Evaluation** - Evaluate extraction results using Stickler
4. **Review** - Analyze individual and aggregated metrics

## Directory Contents

```
config_library/pattern-2/fcc-invoices/
├── README.md                        # This file
├── config.yaml                      # Base IDP configuration
├── fcc_configured.yaml              # Deployed stack configuration
├── stickler_config.json             # Stickler evaluation rules
├── bulk_evaluate_fcc_invoices.py    # Evaluation script
└── sr_refactor_labels_5_5_25.csv    # Ground truth labels (full dataset)
```

## Sample Data

Sample documents are located in `samples/fcc-invoices/`:
- 3 sample PDF invoices
- `fcc_invoices_sample_3.csv` - Manifest for the 3 samples

## Prerequisites

1. **AWS Credentials**: Valid AWS credentials with appropriate permissions
2. **Python Environment**: Python 3.12+ with required packages
3. **IDP CLI**: Installed and configured
4. **Stickler**: Installed with `pip install -e "./stickler[dev]"`
5. **Dependencies**: `pip install pandas`

## Step 1: Deploy the Stack

Deploy the IDP stack with FCC invoice configuration:

```bash
idp-cli deploy \
  --stack-name fcc-inf-test \
  --custom-config config_library/pattern-2/fcc-invoices/config.yaml \
  --region us-west-2 \
  --wait \
  --template-url https://s3.us-west-2.amazonaws.com/bobs-artifacts-us-west-2/idp-wip/idp-main.yaml \
  --pattern pattern-2
```

**What this does:**
- Creates CloudFormation stack with Lambda functions, S3 buckets, and DynamoDB tables
- Configures extraction model (Claude Sonnet 4)
- Sets up OCR with Textract (LAYOUT + TABLES features)
- Deploys with FCC-specific prompts and schema

**Expected output:**
- Stack creation takes ~5-10 minutes
- Stack status: `CREATE_COMPLETE`

## Step 2: Run Inference

Run inference on the sample documents:

```bash
idp-cli run-inference \
  --stack-name fcc-inf-test \
  --manifest samples/fcc-invoices/fcc_invoices_sample_3.csv \
  --region us-west-2
```

**What this does:**
- Uploads documents to S3 input bucket
- Triggers Lambda processing pipeline
- Performs OCR with Textract
- Extracts structured data using Claude
- Stores results in S3 output bucket

**Expected output:**
```
Validating manifest...
✓ Manifest validated successfully
Initializing batch processor for stack: fcc-inf-test
✓ Batch submitted successfully
Batch ID: batch-20251017-140000
Processing 3 documents...
```

**Monitor progress:**
```bash
idp-cli status \
  --stack-name fcc-inf-test \
  --batch-id <batch-id> \
  --region us-west-2 \
  --wait
```

## Step 3: Download Results

Download the inference results locally:

```bash
idp-cli download-results \
  --stack-name fcc-inf-test \
  --batch-id cli-batch-20251017-190516 \
  --output-dir fcc_results \
  --region us-west-2
```

**Note**: Replace `cli-batch-20251017-190516` with your actual batch ID from the inference step. You can specify any output directory name.

**What this does:**
- Downloads all result files from S3
- Creates directory structure: `fcc_results/<doc_id>/sections/1/result.json`
- Each result contains extracted fields and metadata

**Result structure:**
```json
{
  "document_class": {
    "type": "FCC-Invoice"
  },
  "inference_result": {
    "agency": "Agency Name",
    "advertiser": "Advertiser Name",
    "gross_total": "1,234.56",
    "net_amount_due": "1,234.56",
    "line_item__description": ["M-F 11a-12p", "M-F 12n-1p"],
    "line_item__days": ["MTWTF--", "MTWTF--"],
    "line_item__rate": ["100.00", "150.00"],
    "line_item__start_date": ["11/01/21", "11/01/21"],
    "line_item__end_date": ["11/07/21", "11/07/21"]
  }
}
```

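If you prefer to inspect the downloaded results programmatically rather than opening each JSON file by hand, a small sketch like the one below can walk the output directory. It assumes the `fcc_results/<doc_id>/sections/1/result.json` layout and the result keys shown above; adjust the paths and keys if yours differ.

```python
import json
from pathlib import Path

# Walk the downloaded results (layout: fcc_results/<doc_id>/sections/1/result.json).
results_dir = Path("fcc_results")

for result_file in sorted(results_dir.glob("*/sections/1/result.json")):
    doc_id = result_file.parts[-4]  # the <doc_id> directory name
    with result_file.open() as f:
        result = json.load(f)

    doc_class = result.get("document_class", {}).get("type", "unknown")
    fields = result.get("inference_result", {})
    print(f"{doc_id}: class={doc_class}, {len(fields)} fields extracted")
    print(f"  gross_total={fields.get('gross_total')}, "
          f"line items={len(fields.get('line_item__description', []))}")
```
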
## Step 4: Run Evaluation

Evaluate the extraction results against ground truth:

```bash
cd config_library/pattern-2/fcc-invoices

python bulk_evaluate_fcc_invoices.py \
  --results-dir ../../../fcc_results \
  --csv-path sr_refactor_labels_5_5_25.csv \
  --output-dir evaluation_output
```

**What this does:**
- Loads ground truth labels from CSV
- Matches documents by doc_id
- Performs doc-by-doc comparison using Stickler
- Saves individual comparison results
- Aggregates metrics across all documents
- Generates comprehensive evaluation report

**Expected output:**
```
================================================================================
BULK FCC INVOICE EVALUATION
================================================================================

📊 Loading ground truth from sr_refactor_labels_5_5_25.csv...
✓ Loaded 221 documents with ground truth labels

📁 Loading inference results from ../../../fcc_results...
✓ Loaded 3 inference results

🔗 Matching ground truth to inference results...
✓ Matched 3 document pairs

⚙️ Evaluating 3 documents...
✓ Completed evaluation
  Individual results saved to: evaluation_output

================================================================================
AGGREGATED EVALUATION RESULTS
================================================================================

📊 Processing Summary:
  Documents processed: 3
  Errors encountered: 0
  Non-matches found: 23

📈 Overall Metrics:
  Precision: 0.7341
  Recall: 0.4637
  F1 Score: 0.5684
  Accuracy: 0.3993

  Confusion Matrix:
    TP: 530 | FP: 192
    FN: 613 | TN: 5
    FP1 (False Alarm): 11
    FP2 (Wrong Value): 181

📋 Field-Level Metrics (Top 10 by F1 Score):
  Field                                     Precision     Recall         F1
  ---------------------------------------- ---------- ---------- ----------
  line_item__description                       0.9236     0.8261     0.8721
  gross_total                                  1.0000     0.7500     0.8571
  net_amount_due                               1.0000     0.7500     0.8571
  line_item__rate                              0.8169     0.7117     0.7607
  ...

💾 Aggregated results saved to evaluation_output/aggregated_metrics.json

================================================================================
✅ Evaluation complete!
  Individual results: evaluation_output
  Aggregated metrics: evaluation_output/aggregated_metrics.json
================================================================================
```

## Step 5: Review Results

### Individual Document Results

Each document has a detailed comparison result:

```bash
cat evaluation_output/0492b95bc342870920c480040bc33513.json | python -m json.tool | less
```

**Contains:**
- Field-by-field scores
- Confusion matrix (overall and per-field)
- Non-matches with details
- Similarity scores

### Aggregated Metrics

View the overall performance:

```bash
cat evaluation_output/aggregated_metrics.json | python -m json.tool | less
```

**Contains:**
- Overall precision, recall, F1, accuracy
- Per-field performance metrics
- Confusion matrix breakdown
- Non-match summary

## Understanding the Results

### Confusion Matrix Metrics

- **TP (True Positive)**: Correctly extracted a field with the correct value
- **FP (False Positive)**: Extracted a field with an incorrect value, or extracted a field that shouldn't exist
- **TN (True Negative)**: Correctly did not extract a field that shouldn't exist
- **FN (False Negative)**: Failed to extract a field that should exist
- **FP1 (False Alarm)**: Extracted a field that shouldn't exist
- **FP2 (Wrong Value)**: Extracted a field with the wrong value

### Derived Metrics

- **Precision**: TP / (TP + FP) - the fraction of extracted values that are correct
- **Recall**: TP / (TP + FN) - the fraction of ground truth values that were found
- **F1 Score**: Harmonic mean of precision and recall
- **Accuracy**: (TP + TN) / Total - overall correctness

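As a sanity check, the aggregate metrics can be recomputed directly from the confusion-matrix counts reported in the example output above (TP 530, FP 192, FN 613, TN 5):

```python
# Recompute the derived metrics from the confusion-matrix counts shown above.
tp, fp, fn, tn = 530, 192, 613, 5

precision = tp / (tp + fp)                          # 530 / 722  ≈ 0.7341
recall = tp / (tp + fn)                             # 530 / 1143 ≈ 0.4637
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.5684
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 535 / 1340 ≈ 0.3993

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```
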
## Stickler Configuration

The `stickler_config.json` defines validation rules:

### Simple Fields (Lists)
- `agency`: FuzzyComparator (threshold 0.8) - Allows minor name variations
- `advertiser`: FuzzyComparator (threshold 0.8)
- `gross_total`: ExactComparator (threshold 1.0) - Requires exact match
- `net_amount_due`: ExactComparator (threshold 1.0)

### Line Item Fields (Lists)
- `line_item__description`: LevenshteinComparator (threshold 0.7) - Allows typos
- `line_item__days`: ExactComparator (threshold 1.0)
- `line_item__rate`: ExactComparator (threshold 1.0)
- `line_item__start_date`: ExactComparator (threshold 1.0)
- `line_item__end_date`: ExactComparator (threshold 1.0)

**Note**: All fields are configured as lists to match the flat format used by both ground truth and predictions.

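To build intuition for what the thresholds mean, the standalone snippet below compares strings with an exact check and with a ratio-based similarity (from `difflib`, used here as a rough stand-in for edit-distance similarity). This only illustrates the comparator semantics; it is not Stickler's own code.

```python
from difflib import SequenceMatcher

def exact_match(expected: str, actual: str) -> bool:
    # ExactComparator-style check (threshold 1.0): the strings must be identical.
    return expected == actual

def similar_enough(expected: str, actual: str, threshold: float) -> bool:
    # Rough stand-in for an edit-distance similarity: ratio() is 1.0 for identical
    # strings and drops as the strings diverge.
    return SequenceMatcher(None, expected, actual).ratio() >= threshold

print(exact_match("100.00", "100.00"))                    # True
print(exact_match("100.00", "100"))                       # False -> would count as FP2
print(similar_enough("M-F 11a-12p", "MF 11a -12p", 0.7))  # True: small OCR-style typo tolerated
print(similar_enough("M-F 11a-12p", "Sa-Su 8p-9p", 0.7))  # False: genuinely different value
```
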
## Data Format

### Ground Truth (CSV)
The `sr_refactor_labels_5_5_25.csv` contains:
- `doc_id`: Document identifier (without .pdf extension)
- `refactored_labels`: JSON string with ground truth in flat list format

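A quick way to peek at the ground truth is to load the CSV with pandas and parse the `refactored_labels` JSON for one document (only the two columns described above are assumed here):

```python
import json
import pandas as pd

# Each row pairs a doc_id with a JSON string of ground truth labels.
labels = pd.read_csv("sr_refactor_labels_5_5_25.csv")

first = labels.iloc[0]
ground_truth = json.loads(first["refactored_labels"])

print(first["doc_id"])
print({key: ground_truth[key] for key in list(ground_truth)[:5]})  # first few fields
```
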
### Inference Results
Directory structure: `results_dir/{doc_id}.pdf/sections/1/result.json`

The flat format uses a `line_item__` prefix for list fields, where each field is a list of values.

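For illustration, here is one way such a flat record can be built from per-row line items (the field names and values mirror the result example in Step 3):

```python
# Convert per-row line items into the flat, list-per-field format.
line_items = [
    {"description": "M-F 11a-12p", "days": "MTWTF--", "rate": "100.00"},
    {"description": "M-F 12n-1p",  "days": "MTWTF--", "rate": "150.00"},
]

flat = {
    f"line_item__{field}": [item[field] for item in line_items]
    for field in ("description", "days", "rate")
}
print(flat)
# {'line_item__description': ['M-F 11a-12p', 'M-F 12n-1p'],
#  'line_item__days': ['MTWTF--', 'MTWTF--'],
#  'line_item__rate': ['100.00', '150.00']}
```
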
## Troubleshooting

### No matched pairs found
- Verify `doc_id` in the CSV matches the directory names in the results
- Check for a `.pdf` extension mismatch in the doc_id (see the quick check below)

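A quick check is to compare the identifiers in the CSV against the directory names under the results folder (the paths below assume the layout used earlier; adjust to yours):

```python
from pathlib import Path
import pandas as pd

# doc_ids from the ground truth CSV (stored without the .pdf extension).
csv_ids = set(pd.read_csv("sr_refactor_labels_5_5_25.csv")["doc_id"].astype(str))

# Directory names under the results folder, with any .pdf suffix stripped.
results_dir = Path("../../../fcc_results")
result_ids = {p.name.removesuffix(".pdf") for p in results_dir.iterdir() if p.is_dir()}

print("In results but not in CSV:", sorted(result_ids - csv_ids))
print("In CSV but not in results:", len(csv_ids - result_ids), "documents")
```
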
### AWS Token Expired
```bash
# Refresh your AWS credentials
aws sso login --profile your-profile
```

### Stack not found
```bash
# Verify stack exists
idp-cli list-stacks --region us-west-2
```

### Large matrix warnings
- Normal for documents with many line items (>100)
- Stickler uses the Hungarian algorithm for optimal matching (illustrated below)
- Matching may be slower but produces accurate results

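For intuition, matching predicted line items to ground truth line items can be framed as an assignment problem. The sketch below uses `scipy.optimize.linear_sum_assignment` on a tiny similarity matrix; it illustrates the idea of optimal matching, not Stickler's internals.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Similarity between 3 predicted line items (rows) and 3 ground truth items (columns).
similarity = np.array([
    [0.95, 0.10, 0.20],
    [0.15, 0.05, 0.90],
    [0.30, 0.85, 0.10],
])

# The Hungarian algorithm finds the pairing that maximizes total similarity
# (minimizing the negated matrix is equivalent).
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(f"predicted item {r} -> ground truth item {c} (similarity {similarity[r, c]:.2f})")
```
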
## Next Steps

1. **Scale Up**: Process more documents by creating a larger manifest
2. **Tune Configuration**: Adjust Stickler thresholds based on results
3. **Analyze Errors**: Review non-matches to identify extraction issues
4. **Iterate**: Update prompts or schema based on evaluation findings

## Additional Resources

- [IDP CLI Documentation](../../README.md)
- [Stickler Documentation](../../../stickler/README.md)
- [Pattern 2 Architecture](../README.md)
- [Evaluation Guide](../../../lib/idp_common_pkg/idp_common/evaluation/README_STICKLER.md)