Commit 1d6ca16

Fixing bug with the CLI.
Adding in a config baseline.
1 parent 7a1530b commit 1d6ca16

2 files changed: +132 -5 lines changed

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+# SPDX-License-Identifier: MIT-0
+notes: Configuration for FCC invoice information extraction (no classification)
+ocr:
+  backend: "textract"
+  model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
+  system_prompt: "You are an expert OCR system. Extract all text from the provided image accurately, preserving layout where possible."
+  task_prompt: "Extract all text from this document image. Preserve the layout, including paragraphs, tables, and formatting."
+  features:
+    - name: LAYOUT
+    - name: TABLES
+  image:
+    target_width: ""
+    target_height: ""
+
+classes:
+  - name: FCC-Invoice
+    description: >-
+      Federal Communications Commission (FCC) political advertising invoice showing broadcast
+      time purchases, including line items with descriptions, dates, rates, and totals for
+      political advertising campaigns.
+    attributes:
+      - name: agency
+        description: >-
+          The advertising agency or media buyer handling the political advertising purchase.
+        evaluation_method: EXACT
+        attributeType: simple
+
+      - name: advertiser
+        description: >-
+          The political advertiser or campaign purchasing the broadcast time.
+        evaluation_method: EXACT
+        attributeType: simple
+
+      - name: gross_total
+        description: >-
+          The total gross amount for all line items before any discounts or adjustments.
+        evaluation_method: NUMERIC_EXACT
+        attributeType: simple
+
+      - name: net_amount_due
+        description: >-
+          The final net amount due after any discounts or adjustments have been applied.
+        evaluation_method: NUMERIC_EXACT
+        attributeType: simple
+
+      - name: line_items
+        listItemTemplate:
+          itemAttributes:
+            - name: description
+              description: >-
+                The broadcast time slot description, typically showing days of week and time range
+                (e.g., "M-F 11a-12p" for Monday through Friday 11am to 12pm).
+              evaluation_method: EXACT
+
+            - name: days
+              description: >-
+                The days of the week for this broadcast slot, often in format like "MTWTF--"
+                where each position represents a day (Monday, Tuesday, Wednesday, Thursday, Friday,
+                Saturday, Sunday) with dashes for non-broadcast days.
+              evaluation_method: EXACT
+
+            - name: rate
+              description: >-
+                The rate or cost for this specific broadcast time slot, may include commas
+                for thousands separator.
+              evaluation_method: NUMERIC_EXACT
+
+            - name: start_date
+              description: >-
+                The start date for this line item's broadcast schedule, typically in MM/DD/YY format.
+              evaluation_method: EXACT
+
+            - name: end_date
+              description: >-
+                The end date for this line item's broadcast schedule, typically in MM/DD/YY format.
+              evaluation_method: EXACT
+
+          itemDescription: >-
+            Each item represents a specific broadcast time slot purchase with its schedule,
+            rate, and date range.
+
+        description: >-
+          List of line items detailing each broadcast time slot purchase, including the time
+          description, days of week, rate, and date range for the advertising schedule.
+        evaluation_method: LLM
+        attributeType: list
+
+extraction:
+  model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
+  temperature: 0.0
+  top_p: 0.9
+  max_tokens: 4096
+  system_prompt: |
+    You are an expert at extracting structured information from FCC political advertising invoices.
+    Extract all requested fields accurately, paying special attention to:
+    - Line item details including time slots, days, rates, and date ranges
+    - Monetary amounts (preserve formatting with commas and decimals)
+    - Date formats (typically MM/DD/YY)
+    - Agency and advertiser names
+
+    For line items, ensure you capture all rows from any tables showing broadcast schedules.
+    Days of week are often encoded as 7-character strings where each position represents a day.
+
+  task_prompt: |
+    Extract the following information from this FCC invoice:
+
+    1. Agency name
+    2. Advertiser name
+    3. Gross total amount
+    4. Net amount due
+    5. All line items with:
+       - Description (time slot)
+       - Days (day of week encoding)
+       - Rate (cost)
+       - Start date
+       - End date
+
+    Return the information in the specified JSON schema format.
+
+classification:
+  enabled: false
+  # No classification needed - all documents are FCC invoices
+
+evaluation:
+  enabled: true
+  methods:
+    - EXACT
+    - NUMERIC_EXACT
+    - LLM
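
The `days` and `rate` attribute descriptions above imply some post-processing before extracted values can be compared. The following is a minimal sketch of that normalization, not the accelerator's own evaluation code: `decode_days` and `parse_amount` are hypothetical helper names, and stripping a leading `$` is an assumption that the config does not state.

```python
from decimal import Decimal
from typing import List

# Day order taken from the `days` attribute description: Monday through Sunday.
DAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
)


def decode_days(encoded: str) -> List[str]:
    """Return the day names whose position in the 7-character string is not a dash."""
    if len(encoded) != len(DAY_NAMES):
        raise ValueError(f"expected 7 characters, got {len(encoded)}: {encoded!r}")
    return [name for name, ch in zip(DAY_NAMES, encoded) if ch != "-"]


def parse_amount(text: str) -> Decimal:
    """Parse a monetary string that may contain thousands separators.

    Removing a '$' sign is an assumption for illustration only.
    """
    cleaned = text.replace(",", "").replace("$", "").strip()
    return Decimal(cleaned)
```

With helpers like these, `decode_days("MTWTF--")` yields the five weekdays, and `parse_amount("1,250.00")` compares equal to `Decimal("1250.00")`, which is one plausible reading of what a numeric comparison needs.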

idp_cli/idp_cli/cli.py

Lines changed: 2 additions & 5 deletions
@@ -7,7 +7,9 @@
 Command-line tool for batch document processing with the IDP Accelerator.
 """

+import fnmatch
 import logging
+import os
 import sys
 import time
 from typing import Optional
@@ -860,7 +862,6 @@ def generate_manifest(

         # Import scan method directly
         import glob as glob_module
-        import os

         dir_path = os.path.abspath(directory)
         if recursive:
@@ -884,8 +885,6 @@ def generate_manifest(
         prefix = uri_parts[1] if len(uri_parts) > 1 else ""

         # List S3 objects
-        import fnmatch
-
         import boto3

         s3 = boto3.client("s3", region_name=region)
@@ -934,8 +933,6 @@ def generate_manifest(
             f"[bold blue]Matching baselines from: {baseline_dir}[/bold blue]"
         )

-        import os
-
         baseline_path = os.path.abspath(baseline_dir)

         # Scan for baseline subdirectories
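
The diff moves `fnmatch` to a module-level import but does not show how `generate_manifest` applies it to the S3 listing. As a rough illustration only, filtering object keys by a glob pattern might look like the sketch below. `filter_keys` is a hypothetical helper, and matching on the final path component is an assumption rather than the CLI's confirmed behavior.

```python
import fnmatch
from typing import Iterable, List


def filter_keys(keys: Iterable[str], pattern: str = "*.pdf") -> List[str]:
    """Keep keys whose final path component matches the glob pattern.

    fnmatchcase is used so matching is case-sensitive on every platform.
    """
    return [
        key for key in keys
        if fnmatch.fnmatchcase(key.rsplit("/", 1)[-1], pattern)
    ]
```

For example, `filter_keys(["docs/a.pdf", "docs/b.txt", "c.pdf"])` keeps only the two `.pdf` keys.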
