You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CHANGELOG.md
+4-3Lines changed: 4 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -9,14 +9,15 @@ SPDX-License-Identifier: MIT-0
9
9
10
10
-**RealKIE-FCC-Verified Dataset Auto-Deployment for Test Studio**
11
11
- Added fully automatic deployment of the public RealKIE-FCC-Verified dataset from HuggingFace during stack deployment with zero manual steps
12
-
-**Direct PDF Download**: Downloads original PDF files from HuggingFace repository's `/pdfs` directory using `hf_hub_download` API
12
+
-**Lightweight Implementation**: Uses `hf_hub_download()` API for both parquet metadata and PDF files, with `pyarrow` for efficient parquet reading - total package size ~20MB (well under 250MB Lambda limit)
13
+
-**Direct File Download**: Downloads original PDF files from HuggingFace repository's `/pdfs` directory and parquet metadata from `/data` directory using unified `hf_hub_download()` approach
13
14
-**Complete Dataset Deployment**: 75 FCC invoice documents (PDFs + ground truth) automatically deployed to TestSetBucket and registered in Test Studio
14
15
-**Zero User Effort**: Test set immediately available in Test Studio UI post-deployment - no manual downloads, no local files, no additional scripts
15
16
-**Version Control**: Dataset version pinned to CloudFormation CustomResource property enabling controlled updates when new dataset versions are released
16
17
-**Efficient Updates**: Skips re-download on stack updates unless dataset version changes, preventing unnecessary deployment time
17
-
-**Ground Truth Included**: Complete baseline data extracted from HuggingFace `json_response` field in accelerator format (Agency, Advertiser, GrossTotal, PaymentTerms, AgencyCommission, NetAmountDue, LineItems)
18
+
-**Ground Truth Included**: Complete baseline data extracted from HuggingFace parquet `json_response` field in accelerator format (Agency, Advertiser, GrossTotal, PaymentTerms, AgencyCommission, NetAmountDue, LineItems)
18
19
-**S3 Structure**: Organized in TestSetBucket with proper `input/{doc_id}.pdf` and `baseline/{doc_id}.pdf/sections/1/result.json` structure
19
-
-**Lambda Implementation**: Custom Resource Lambda function (900s timeout, 2GB memory) with HuggingFace datasets library and Hub API for direct PDF access
20
+
-**Lambda Implementation**: Custom Resource Lambda function (900s timeout, 2GB memory) with minimal dependencies (huggingface-hub, pyarrow, boto3, crhelper)
20
21
-**Single Data Source**: Everything sourced from the public HuggingFace dataset - fully reproducible and deployable anywhere
21
22
-**Use Cases**: Immediate testing capability after deployment, benchmark dataset for evaluating extraction performance, training and demonstration purposes
22
23
-**Configuration**: Controlled by `FccDatasetDeployment` CustomResource with configurable `DatasetVersion` property (default: "1.0")
0 commit comments