
Commit 4d33d79

Merge branch 'feature/integrate-realkie-dataset' into 'develop'
Auto-Deploy RealKIE-FCC-Verified Dataset for Test Studio

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!460
2 parents a67f11d + 88ea587 commit 4d33d79

6 files changed: +404 / -1 lines changed

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
@@ -5,6 +5,23 @@ SPDX-License-Identifier: MIT-0

## [Unreleased]

### Added

- **RealKIE-FCC-Verified Dataset Auto-Deployment for Test Studio**
  - Added fully automatic deployment of the public RealKIE-FCC-Verified dataset from HuggingFace during stack deployment, with zero manual steps
  - **Lightweight Implementation**: Uses the `hf_hub_download()` API for both parquet metadata and PDF files, with `pyarrow` for efficient parquet reading - total package size ~20MB (well under the 250MB Lambda limit)
  - **Direct File Download**: Downloads the original PDF files from the HuggingFace repository's `/pdfs` directory and the parquet metadata from its `/data` directory using a unified `hf_hub_download()` approach
  - **Complete Dataset Deployment**: 75 FCC invoice documents (PDFs + ground truth) automatically deployed to the TestSetBucket and registered in Test Studio
  - **Zero User Effort**: Test set immediately available in the Test Studio UI post-deployment - no manual downloads, no local files, no additional scripts
  - **Version Control**: Dataset version pinned via a CloudFormation CustomResource property, enabling controlled updates when new dataset versions are released
  - **Efficient Updates**: Skips re-download on stack updates unless the dataset version changes, preventing unnecessary deployment time
  - **Ground Truth Included**: Complete baseline data extracted from the HuggingFace parquet `json_response` field in accelerator format (Agency, Advertiser, GrossTotal, PaymentTerms, AgencyCommission, NetAmountDue, LineItems)
  - **S3 Structure**: Organized in the TestSetBucket with the expected `input/{doc_id}.pdf` and `baseline/{doc_id}.pdf/sections/1/result.json` structure
  - **Lambda Implementation**: Custom Resource Lambda function (900s timeout, 2GB memory) with minimal dependencies (huggingface-hub, pyarrow, boto3, cfnresponse)
  - **Single Data Source**: Everything sourced from the public HuggingFace dataset - fully reproducible and deployable anywhere
  - **Use Cases**: Immediate testing capability after deployment, benchmark dataset for evaluating extraction performance, training and demonstration purposes
  - **Configuration**: Controlled by the `FccDatasetDeployment` CustomResource with a configurable `DatasetVersion` property (default: "1.0")
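
For illustration, the accelerator-format ground truth described above is written per document as a small JSON file wrapped in an `inference_result` object. A sketch of one deployed `baseline/{doc_id}.pdf/sections/1/result.json`, shown here as a Python dict with placeholder values (the attribute names come from the dataset deployment; the values below are illustrative, not taken from the dataset):

```python
# Sketch of one deployed baseline file - the deployed artifact is JSON;
# only the inference_result wrapper and attribute names are real, values are placeholders
baseline_result = {
    "inference_result": {
        "Agency": "Example Media Agency",
        "Advertiser": "Example Advertiser LLC",
        "GrossTotal": "12,500.00",
        "PaymentTerms": "Net 30",
        "AgencyCommission": "1,875.00",
        "NetAmountDue": "10,625.00",
        "LineItems": [],  # line items as published in the dataset's json_response field
    }
}
```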

## [0.4.7]

### Added

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.4.7
+0.4.8-wip1

docs/test-studio.md

Lines changed: 49 additions & 0 deletions
@@ -8,6 +8,55 @@ The Test Studio consists of two main tabs:

1. **Test Sets**: Create and manage reusable collections of test documents
2. **Test Executions**: Execute tests, view results, and compare test runs

## Pre-Deployed Test Set: RealKIE-FCC-Verified

The accelerator automatically deploys the **RealKIE-FCC-Verified** dataset from HuggingFace (https://huggingface.co/datasets/amazon-agi/RealKIE-FCC-Verified) as a ready-to-use test set during stack deployment. This public dataset contains 75 invoice documents sourced from the Federal Communications Commission (FCC).

### Fully Automatic Deployment

During stack deployment, the system automatically:
1. **Downloads Metadata** from HuggingFace (the dataset's parquet file covering all 75 documents)
2. **Downloads PDFs** directly from the repository's `/pdfs` directory (original files, no conversion step)
3. **Uploads PDFs** to `s3://TestSetBucket/realkie-fcc-verified/input/`
4. **Extracts Ground Truth** from the `json_response` field (already in accelerator format)
5. **Uploads Baselines** to `s3://TestSetBucket/realkie-fcc-verified/baseline/`
6. **Registers Test Set** in DynamoDB with metadata

**Zero Manual Steps Required** - Everything is sourced from the public HuggingFace dataset and deployed automatically; the resulting bucket layout can be spot-checked as shown below.
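
A minimal sketch of such a spot-check (the TestSetBucket name is stack-specific, so the value below is a placeholder):

```python
# List a few of the objects the automatic deployment created.
# Replace the placeholder with your stack's TestSetBucket name.
import boto3

s3 = boto3.client("s3")
bucket = "<your-testset-bucket>"  # placeholder, not a real bucket name

for prefix in ("realkie-fcc-verified/input/", "realkie-fcc-verified/baseline/"):
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=5)
    print(f"{prefix}: {resp.get('KeyCount', 0)} object(s) on the first page")
    for obj in resp.get("Contents", []):
        print("  ", obj["Key"])
```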

### Key Features

- **Fully Automatic**: Complete deployment during stack creation with zero user effort
- **Direct PDF Download**: Uses the original PDF files published in the dataset's `/pdfs` directory - no image-to-PDF conversion required
- **Complete Ground Truth**: Structured invoice attributes (Agency, Advertiser, GrossTotal, PaymentTerms, AgencyCommission, NetAmountDue, LineItems)
- **Version Control**: Dataset version pinned in CloudFormation (DatasetVersion: "1.0"), updateable via the CustomResource property (see the event sketch below)
- **Smart Updates**: Skips re-download on stack updates unless the version changes
- **Single Public Source**: Everything comes from HuggingFace - fully reproducible anywhere
- **Benchmark Ready**: 75 FCC invoice documents ideal for extraction evaluation
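
The version pinning noted above works through the CloudFormation custom resource that invokes the deployment Lambda. A sketch of the slice of the custom-resource event the Lambda actually reads (field values are illustrative):

```python
# Abridged custom-resource event as seen by the deployment Lambda.
# Standard fields such as ResponseURL, StackId and RequestId are omitted.
example_event = {
    "RequestType": "Update",  # "Create", "Update", or "Delete"
    "ResourceProperties": {
        "DatasetVersion": "1.1",  # bumping this value triggers a fresh download
        "DatasetDescription": "RealKIE-FCC-Verified FCC invoices",  # illustrative text
    },
}
```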

### Deployment Time

- **First Deployment**: Adds ~5-10 minutes to stack deployment (downloads the dataset files and uploads them to S3)
- **Stack Updates**: Near-instant (skips the download if the version is unchanged)
- **Version Updates**: Re-downloads and re-processes when DatasetVersion changes

### Usage

The RealKIE-FCC-Verified test set is immediately available after stack deployment:

1. Navigate to the **Test Executions** tab
2. Select "RealKIE-FCC-Verified" from the **Select Test Set** dropdown
3. Enter a description in the **Context** field
4. Click **Run Test** to start processing
5. Monitor progress and view results when complete

This dataset provides an excellent benchmark for:

- Evaluating extraction accuracy on invoice documents
- Comparing different model configurations
- Testing prompt engineering improvements
- Training and demonstration purposes

https://github.com/user-attachments/assets/7c5adf30-8d5c-4292-93b0-0149506322c7

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
"""
Lambda function to deploy the RealKIE-FCC-Verified dataset from HuggingFace
to the TestSetBucket during stack deployment.
"""

import json
import os
import logging
import boto3
from datetime import datetime
from typing import Dict, Any
import cfnresponse

# Set HuggingFace cache to /tmp (Lambda's writable directory)
os.environ['HF_HOME'] = '/tmp/huggingface'
os.environ['HUGGINGFACE_HUB_CACHE'] = '/tmp/huggingface/hub'

# Lightweight HuggingFace access
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq

# Configure logging
logger = logging.getLogger()
logger.setLevel(os.environ.get('LOG_LEVEL', 'INFO'))

# AWS clients
s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

# Environment variables
TESTSET_BUCKET = os.environ.get('TESTSET_BUCKET')
TRACKING_TABLE = os.environ.get('TRACKING_TABLE')

# Constants
DATASET_NAME = 'RealKIE-FCC-Verified'
DATASET_PREFIX = 'realkie-fcc-verified/'
TEST_SET_ID = 'realkie-fcc-verified'


def handler(event, context):
    """
    Main Lambda handler for deploying the FCC dataset.
    """
    logger.info(f"Event: {json.dumps(event)}")

    try:
        request_type = event['RequestType']

        if request_type == 'Delete':
            # On stack deletion, we leave the data in place
            logger.info("Delete request - keeping dataset in place")
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
            return

        # Extract properties
        properties = event['ResourceProperties']
        dataset_version = properties.get('DatasetVersion', '1.0')
        dataset_description = properties.get('DatasetDescription', '')

        logger.info(f"Processing dataset version: {dataset_version}")

        # Check if dataset already exists with this version
        if check_existing_version(dataset_version):
            logger.info(f"Dataset version {dataset_version} already deployed, skipping")
            cfnresponse.send(event, context, cfnresponse.SUCCESS, {
                'Message': f'Dataset version {dataset_version} already exists'
            })
            return

        # Download and deploy the dataset
        result = deploy_dataset(dataset_version, dataset_description)

        logger.info(f"Dataset deployment completed: {result}")
        cfnresponse.send(event, context, cfnresponse.SUCCESS, result)

    except Exception as e:
        logger.error(f"Error deploying dataset: {str(e)}", exc_info=True)
        cfnresponse.send(event, context, cfnresponse.FAILED, {},
                         reason=f"Error deploying dataset: {str(e)}")


def check_existing_version(version: str) -> bool:
    """
    Check if the dataset with the specified version already exists.
    """
    try:
        table = dynamodb.Table(TRACKING_TABLE)
        response = table.get_item(
            Key={
                'PK': f'testset#{TEST_SET_ID}',
                'SK': 'metadata'
            }
        )

        if 'Item' in response:
            existing_version = response['Item'].get('datasetVersion', '')
            logger.info(f"Found existing dataset version: {existing_version}")

            # Check if version matches and files exist
            if existing_version == version:
                # Verify at least some files exist in S3
                try:
                    response = s3_client.list_objects_v2(
                        Bucket=TESTSET_BUCKET,
                        Prefix=f'{DATASET_PREFIX}input/',
                        MaxKeys=1
                    )
                    if response.get('KeyCount', 0) > 0:
                        logger.info("Files exist in S3, skipping deployment")
                        return True
                except Exception as e:
                    logger.warning(f"Error checking S3 files: {e}")

        return False

    except Exception as e:
        logger.warning(f"Error checking existing version: {e}")
        return False


def deploy_dataset(version: str, description: str) -> Dict[str, Any]:
    """
    Deploy the dataset by downloading PDFs and ground truth from HuggingFace
    using lightweight hf_hub_download and pyarrow.
    """
    try:
        # Ensure cache directory exists in /tmp (Lambda's writable directory)
        cache_dir = '/tmp/huggingface/hub'
        os.makedirs(cache_dir, exist_ok=True)
        logger.info(f"Using cache directory: {cache_dir}")

        logger.info("Downloading dataset from HuggingFace: amazon-agi/RealKIE-FCC-Verified")

        # Download the parquet file with metadata using hf_hub_download
        parquet_path = hf_hub_download(
            repo_id="amazon-agi/RealKIE-FCC-Verified",
            filename="data/test-00000-of-00001.parquet",
            repo_type="dataset",
            cache_dir=cache_dir
        )

        logger.info("Downloaded parquet metadata file")

        # Read parquet file with pyarrow
        table = pq.read_table(parquet_path)
        data_dict = table.to_pydict()

        num_documents = len(data_dict['id'])
        logger.info(f"Loaded {num_documents} documents from parquet")

        # Process and upload each document
        file_count = 0
        skipped_count = 0

        for idx in range(num_documents):
            document_id = None  # defined up front so the error log below is safe
            try:
                document_id = data_dict['id'][idx]
                json_response = data_dict['json_response'][idx]

                if not json_response:
                    logger.warning(f"Skipping {document_id}: no json_response")
                    skipped_count += 1
                    continue

                logger.info(f"Processing {document_id}")

                # Download PDF file from HuggingFace repository using hf_hub_download
                try:
                    pdf_path = hf_hub_download(
                        repo_id="amazon-agi/RealKIE-FCC-Verified",
                        filename=f"pdfs/{document_id}",
                        repo_type="dataset",
                        cache_dir=cache_dir
                    )

                    # Read the downloaded PDF
                    with open(pdf_path, 'rb') as f:
                        pdf_bytes = f.read()

                    logger.info(f"Downloaded PDF for {document_id} ({len(pdf_bytes):,} bytes)")

                    # Upload PDF to input folder
                    pdf_key = f'{DATASET_PREFIX}input/{document_id}'
                    s3_client.put_object(
                        Bucket=TESTSET_BUCKET,
                        Key=pdf_key,
                        Body=pdf_bytes,
                        ContentType='application/pdf'
                    )

                except Exception as e:
                    logger.error(f"Error downloading/uploading PDF for {document_id}: {e}")
                    skipped_count += 1
                    continue

                # Upload ground truth baseline (wrap in inference_result)
                result_json = {"inference_result": json_response}
                result_key = f'{DATASET_PREFIX}baseline/{document_id}/sections/1/result.json'
                s3_client.put_object(
                    Bucket=TESTSET_BUCKET,
                    Key=result_key,
                    Body=json.dumps(result_json, indent=2),
                    ContentType='application/json'
                )

                file_count += 1

                if file_count % 10 == 0:
                    logger.info(f"Processed {file_count}/{num_documents} documents...")

            except Exception as e:
                logger.error(f"Error processing document {idx} ({document_id}): {e}")
                skipped_count += 1
                continue

        logger.info(f"Successfully deployed {file_count} documents (skipped {skipped_count})")

        # Create test set record in DynamoDB
        create_testset_record(version, description, file_count)

        return {
            'DatasetVersion': version,
            'FileCount': file_count,
            'SkippedCount': skipped_count,
            'Message': f'Successfully deployed {file_count} documents with PDFs and baseline files'
        }

    except Exception as e:
        logger.error(f"Error deploying dataset: {e}", exc_info=True)
        raise


def create_testset_record(version: str, description: str, file_count: int):
    """
    Create or update the test set record in DynamoDB.
    """
    table = dynamodb.Table(TRACKING_TABLE)
    timestamp = datetime.utcnow().isoformat() + 'Z'

    item = {
        'PK': f'testset#{TEST_SET_ID}',
        'SK': 'metadata',
        'id': TEST_SET_ID,
        'name': DATASET_NAME,
        'filePattern': '',
        'fileCount': file_count,
        'status': 'COMPLETED',
        'createdAt': timestamp,
        'datasetVersion': version,
        'source': 'huggingface:amazon-agi/RealKIE-FCC-Verified'
    }

    table.put_item(Item=item)
    logger.info(f"Created test set record in DynamoDB: {TEST_SET_ID}")
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Lightweight HuggingFace file download (no heavy datasets library)
huggingface-hub>=0.20.0

# For reading parquet metadata files efficiently
pyarrow>=20.0.0

# AWS SDK (boto3 is already available in the Lambda runtime, but specified for local testing)
boto3>=1.34.0

# For CloudFormation custom resource responses
cfnresponse
