You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CHANGELOG.md
+3-13Lines changed: 3 additions & 13 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -8,19 +8,9 @@ SPDX-License-Identifier: MIT-0
8
8
### Added
9
9
10
10
-**RealKIE-FCC-Verified Dataset Auto-Deployment for Test Studio**
11
-
- Added fully automatic deployment of the public RealKIE-FCC-Verified dataset from HuggingFace during stack deployment with zero manual steps
12
-
-**Lightweight Implementation**: Uses `hf_hub_download()` API for both parquet metadata and PDF files, with `pyarrow` for efficient parquet reading - total package size ~20MB (well under 250MB Lambda limit)
13
-
-**Direct File Download**: Downloads original PDF files from HuggingFace repository's `/pdfs` directory and parquet metadata from `/data` directory using unified `hf_hub_download()` approach
14
-
-**Complete Dataset Deployment**: 75 FCC invoice documents (PDFs + ground truth) automatically deployed to TestSetBucket and registered in Test Studio
15
-
-**Zero User Effort**: Test set immediately available in Test Studio UI post-deployment - no manual downloads, no local files, no additional scripts
16
-
-**Version Control**: Dataset version pinned to CloudFormation CustomResource property enabling controlled updates when new dataset versions are released
17
-
-**Efficient Updates**: Skips re-download on stack updates unless dataset version changes, preventing unnecessary deployment time
18
-
-**Ground Truth Included**: Complete baseline data extracted from HuggingFace parquet `json_response` field in accelerator format (Agency, Advertiser, GrossTotal, PaymentTerms, AgencyCommission, NetAmountDue, LineItems)
19
-
-**S3 Structure**: Organized in TestSetBucket with proper `input/{doc_id}.pdf` and `baseline/{doc_id}.pdf/sections/1/result.json` structure
20
-
-**Lambda Implementation**: Custom Resource Lambda function (900s timeout, 2GB memory) with minimal dependencies (huggingface-hub, pyarrow, boto3, crhelper)
21
-
-**Single Data Source**: Everything sourced from the public HuggingFace dataset - fully reproducible and deployable anywhere
22
-
-**Use Cases**: Immediate testing capability after deployment, benchmark dataset for evaluating extraction performance, training and demonstration purposes
23
-
-**Configuration**: Controlled by `FccDatasetDeployment` CustomResource with configurable `DatasetVersion` property (default: "1.0")
11
+
- Automatically deploys 75 FCC invoice documents from HuggingFace public dataset during stack deployment - zero manual steps required
12
+
- Test set immediately available in Test Studio UI with complete ground truth for benchmarking extraction accuracy
13
+
- Version controlled via CloudFormation property - skips re-download on stack updates unless version changes
0 commit comments