Example of the eval working

sromoam · sromoam · commit e71116bfaf69 · 2025-10-31T12:17:59.000-05:00
diff --git a/config_library/pattern-2/fcc-invoices/README.md b/config_library/pattern-2/fcc-invoices/README.md
@@ -141,19 +141,48 @@ idp-cli download-results \
 
 ## Step 4: Run Evaluation
 
-Evaluate the extraction results against ground truth using the **simplified evaluation script** (recommended):
+### Option A: Single Source of Truth (Recommended)
+
+Use the IDP config directly; no separate Stickler config is needed:
 
 ```bash
 cd config_library/pattern-2/fcc-invoices
 
+python bulk_evaluate_from_idp_config.py \
+  --results-dir ../../../fcc_results/cli-batch-20251017-190516 \
+  --csv-path sample_labels_3.csv \
+  --idp-config-path sr_FCC_config.json \
+  --output-dir evaluation_output
+```
+
+**Benefits:**
+- Single source of truth - evaluation config comes from IDP config
+- Extracts Stickler settings from `x-aws-stickler-*` extensions in JSON Schema
+- No need to maintain separate `stickler_config.json`
+- Guarantees evaluation matches deployment configuration
+
+### Option B: Separate Stickler Config
+
+Use the simplified script with a standalone Stickler config:
+
+```bash
 python bulk_evaluate_fcc_invoices_simple.py \
   --results-dir ../../../fcc_results/cli-batch-20251017-190516 \
   --csv-path sample_labels_3.csv \
   --config-path stickler_config.json \
   --output-dir evaluation_output
 ```
 
-**Alternative**: Use the legacy script (more complex, same results):
+**Benefits:**
+- 260 lines vs 671 lines in the legacy script (61% less code)
+- Easier to understand and modify
+- No temporary file overhead
+- Direct integration with SticklerEvaluationService
+
+### Option C: Legacy Script
+
+Use the original complex script (not recommended):
+
 ```bash
 python bulk_evaluate_fcc_invoices.py \
   --results-dir ../../../fcc_results/cli-batch-20251017-190516 \
@@ -164,21 +193,14 @@ python bulk_evaluate_fcc_invoices.py \
 
 **Note**: The `sample_labels_3.csv` contains ground truth for 3 sample documents. For full dataset evaluation, use `sr_refactor_labels_5_5_25.csv`.
 
-**What this does:**
+**What evaluation does:**
 - Loads ground truth labels from CSV
 - Matches documents by doc_id
 - Performs doc-by-doc comparison using SticklerEvaluationService
 - Saves individual comparison results
 - Aggregates metrics across all documents
 - Generates comprehensive evaluation report
 
-**Why use the simplified script?**
-- 260 lines vs 671 lines (61% less code)
-- Easier to understand and modify
-- No temporary file overhead
-- Direct integration with SticklerEvaluationService
-- Same accurate results
-
 **Expected output:**
 ```
 ================================================================================
@@ -416,12 +438,48 @@ python bulk_evaluate_fcc_invoices_simple.py \
   --output-dir evaluation_output-2
 ```
 
+### Evaluation with IDP Config
+
+The new `bulk_evaluate_from_idp_config.py` script uses the IDP config directly:
+
+```bash
+python bulk_evaluate_from_idp_config.py \
+  --results-dir ../../../fcc_results-updated-2/cli-batch-20251031-164416 \
+  --csv-path sample_labels_3.csv \
+  --idp-config-path sr_FCC_config.json \
+  --output-dir evaluation_output-idp-config
+```
+
+**Results:**
+```
+📈 Overall Metrics:
+  Precision: 0.5185
+  Recall:    1.0000
+  F1 Score:  0.6829
+  Accuracy:  0.5185
+
+  Confusion Matrix:
+    TP:     14  |  FP:     13
+    FN:      0  |  TN:      0
+    FP1:      2  |  FP2:     11
+
+📋 Field-Level Metrics (Top Fields):
+  agency                 F1: 0.8000
+  gross_total            F1: 0.8000
+  net_amount_due         F1: 0.8000
+  line_item__days        F1: 0.8000
+  line_item__start_date  F1: 0.8000
+  line_item__end_date    F1: 0.8000
+```
+
 ### Notes
 
 - Multiple deploy/inference cycles were run to iterate on the configuration
 - Final batch ID: `cli-batch-20251031-164416`
 - Evaluation successfully produced results with the simplified script
 - Configuration now properly uses `{ATTRIBUTE_NAMES_AND_DESCRIPTIONS}` placeholder for automatic schema injection
+- New `bulk_evaluate_from_idp_config.py` extracts Stickler config from `x-aws-stickler-*` extensions
+- Single source of truth: IDP config contains both extraction schema and evaluation settings
 
 -region us-west-2 \
   --wait \
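
The overall metrics reported in the README hunk above can be cross-checked from the confusion-matrix counts alone. The sketch below assumes the standard definitions of precision, recall, F1, and accuracy, and assumes that a zero denominator yields 0.0 (which is consistent with the `line_item__description` entry in the metrics file). The function name `metrics_from_counts` is hypothetical and is not taken from the evaluation scripts.

```python
def metrics_from_counts(tp, fp, fn, tn=0):
    """Derive precision, recall, F1, and accuracy from confusion-matrix counts.

    Assumption: any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1_score = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
        "accuracy": accuracy,
        "total": total,
    }


# Counts copied from the reported confusion matrix: TP=14, FP=13, FN=0, TN=0.
overall = metrics_from_counts(tp=14, fp=13, fn=0, tn=0)

# FP1 and FP2 are reported separately; here they sum to the FP count.
fp1, fp2 = 2, 11
fp_breakdown_matches = (fp1 + fp2 == 13)

for name in ("precision", "recall", "f1_score", "accuracy"):
    print(f"{name}: {overall[name]:.4f}")
```

With these counts, precision and accuracy are both 14/27, which is why the two values coincide in the report: there are no true negatives or false negatives across the 3 sample documents.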
diff --git a/evaluation_output-idp-config/aggregated_metrics.json b/evaluation_output-idp-config/aggregated_metrics.json
@@ -0,0 +1,208 @@
+{
+  "summary": {
+    "documents_processed": 3,
+    "errors": 0
+  },
+  "overall_metrics": {
+    "precision": 0.5185185185185185,
+    "recall": 1.0,
+    "f1_score": 0.6829268292682926,
+    "accuracy": 0.5185185185185185,
+    "tp": 14,
+    "fp": 13,
+    "tn": 0,
+    "fn": 0,
+    "fp1": 2,
+    "fp2": 11,
+    "total": 27
+  },
+  "field_metrics": {
+    "agency": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 1,
+      "fp2": 0,
+      "total": 3
+    },
+    "advertiser": {
+      "precision": 0.3333333333333333,
+      "recall": 1.0,
+      "f1_score": 0.5,
+      "accuracy": 0.3333333333333333,
+      "tp": 1,
+      "fp": 2,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 2,
+      "total": 3
+    },
+    "gross_total": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    },
+    "net_amount_due": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 1,
+      "fp2": 0,
+      "total": 3
+    },
+    "line_item__description": {
+      "precision": 0.0,
+      "recall": 0.0,
+      "f1_score": 0.0,
+      "accuracy": 0.0,
+      "tp": 0,
+      "fp": 3,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 3,
+      "total": 3
+    },
+    "line_item__days": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    },
+    "line_item__rate": {
+      "precision": 0.3333333333333333,
+      "recall": 1.0,
+      "f1_score": 0.5,
+      "accuracy": 0.3333333333333333,
+      "tp": 1,
+      "fp": 2,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 2,
+      "total": 3
+    },
+    "line_item__start_date": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    },
+    "line_item__end_date": {
+      "precision": 0.6666666666666666,
+      "recall": 1.0,
+      "f1_score": 0.8,
+      "accuracy": 0.6666666666666666,
+      "tp": 2,
+      "fp": 1,
+      "tn": 0,
+      "fn": 0,
+      "fp1": 0,
+      "fp2": 1,
+      "total": 3
+    }
+  },
+  "errors": [],
+  "stickler_config_used": {
+    "model_name": "FCCInvoice",
+    "match_threshold": 0.7,
+    "fields": {
+      "agency": {
+        "type": "list",
+        "comparator": "FuzzyComparator",
+        "threshold": 0.8,
+        "weight": 2.0,
+        "description": "The advertising agency or media buyer handling the political advertising purchase."
+      },
+      "advertiser": {
+        "type": "list",
+        "comparator": "FuzzyComparator",
+        "threshold": 0.8,
+        "weight": 2.0,
+        "description": "The political advertiser or campaign purchasing the broadcast time."
+      },
+      "gross_total": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 3.0,
+        "description": "The total gross amount for all line items before any discounts or adjustments."
+      },
+      "net_amount_due": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 3.0,
+        "description": "The final net amount due after any discounts or adjustments have been applied."
+      },
+      "line_item__description": {
+        "type": "list",
+        "comparator": "LevenshteinComparator",
+        "threshold": 0.7,
+        "weight": 1.5,
+        "description": "List of broadcast time slot descriptions (e.g., 'M-F 11a-12p' for Monday through Friday 11am to 12pm)."
+      },
+      "line_item__days": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 1.0,
+        "description": "List of days of the week for each broadcast slot (e.g., 'MTWTF--' where each position represents a day)."
+      },
+      "line_item__rate": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 2.0,
+        "description": "List of rates or costs for each broadcast time slot (may include commas for thousands separator)."
+      },
+      "line_item__start_date": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 2.0,
+        "description": "List of start dates for each line item's broadcast schedule (typically in MM/DD/YY format)."
+      },
+      "line_item__end_date": {
+        "type": "list",
+        "comparator": "ExactComparator",
+        "threshold": 1.0,
+        "weight": 2.0,
+        "description": "List of end dates for each line item's broadcast schedule (typically in MM/DD/YY format)."
+      }
+    }
+  }
+}
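
The `stickler_config_used` block above is described in the README as being extracted from `x-aws-stickler-*` extensions in the IDP config's JSON Schema. The following is a minimal sketch of how such an extraction could work, not the actual implementation of `bulk_evaluate_from_idp_config.py`. The specific extension keys (`x-aws-stickler-comparator`, `x-aws-stickler-threshold`, `x-aws-stickler-weight`), the schema layout, and the function name `extract_stickler_fields` are assumptions made for illustration.

```python
PREFIX = "x-aws-stickler-"


def extract_stickler_fields(schema):
    """Collect per-field settings from keys that start with the extension prefix.

    Assumption: each field lives under the schema's top-level "properties" and
    carries its evaluation settings as prefixed extension keys.
    """
    fields = {}
    for name, prop in schema.get("properties", {}).items():
        settings = {
            key[len(PREFIX):]: value
            for key, value in prop.items()
            if key.startswith(PREFIX)
        }
        if not settings:
            continue  # field has no evaluation settings attached
        if "description" in prop:
            settings["description"] = prop["description"]
        fields[name] = settings
    return fields


# Hypothetical schema fragment; key names are illustrative only.
example_schema = {
    "type": "object",
    "properties": {
        "gross_total": {
            "type": "string",
            "description": "The total gross amount for all line items.",
            "x-aws-stickler-comparator": "ExactComparator",
            "x-aws-stickler-threshold": 1.0,
            "x-aws-stickler-weight": 3.0,
        },
        "internal_note": {"type": "string"},
    },
}

stickler_fields = extract_stickler_fields(example_schema)
print(stickler_fields)
```

Keeping the evaluation settings next to each field definition is what makes the IDP config a single source of truth: a field cannot be renamed in the extraction schema without its comparator settings moving with it.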