Commit a1627a3

Bob Strahan committed
> Update CHANGELOG.md to simplify RealKIE-FCC dataset description
1 parent 59e56b3 commit a1627a3

1 file changed, +3 -13 lines changed

CHANGELOG.md

Lines changed: 3 additions & 13 deletions
@@ -8,19 +8,9 @@ SPDX-License-Identifier: MIT-0
  ### Added

  - **RealKIE-FCC-Verified Dataset Auto-Deployment for Test Studio**
- - Added fully automatic deployment of the public RealKIE-FCC-Verified dataset from HuggingFace during stack deployment with zero manual steps
- - **Lightweight Implementation**: Uses `hf_hub_download()` API for both parquet metadata and PDF files, with `pyarrow` for efficient parquet reading - total package size ~20MB (well under 250MB Lambda limit)
- - **Direct File Download**: Downloads original PDF files from HuggingFace repository's `/pdfs` directory and parquet metadata from `/data` directory using unified `hf_hub_download()` approach
- - **Complete Dataset Deployment**: 75 FCC invoice documents (PDFs + ground truth) automatically deployed to TestSetBucket and registered in Test Studio
- - **Zero User Effort**: Test set immediately available in Test Studio UI post-deployment - no manual downloads, no local files, no additional scripts
- - **Version Control**: Dataset version pinned to CloudFormation CustomResource property enabling controlled updates when new dataset versions are released
- - **Efficient Updates**: Skips re-download on stack updates unless dataset version changes, preventing unnecessary deployment time
- - **Ground Truth Included**: Complete baseline data extracted from HuggingFace parquet `json_response` field in accelerator format (Agency, Advertiser, GrossTotal, PaymentTerms, AgencyCommission, NetAmountDue, LineItems)
- - **S3 Structure**: Organized in TestSetBucket with proper `input/{doc_id}.pdf` and `baseline/{doc_id}.pdf/sections/1/result.json` structure
- - **Lambda Implementation**: Custom Resource Lambda function (900s timeout, 2GB memory) with minimal dependencies (huggingface-hub, pyarrow, boto3, crhelper)
- - **Single Data Source**: Everything sourced from the public HuggingFace dataset - fully reproducible and deployable anywhere
- - **Use Cases**: Immediate testing capability after deployment, benchmark dataset for evaluating extraction performance, training and demonstration purposes
- - **Configuration**: Controlled by `FccDatasetDeployment` CustomResource with configurable `DatasetVersion` property (default: "1.0")
+ - Automatically deploys 75 FCC invoice documents from HuggingFace public dataset during stack deployment - zero manual steps required
+ - Test set immediately available in Test Studio UI with complete ground truth for benchmarking extraction accuracy
+ - Version controlled via CloudFormation property - skips re-download on stack updates unless version changes

  ## [0.4.7]
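
The removed bullets describe two concrete rules that the shortened entry no longer spells out: the TestSetBucket key layout (`input/{doc_id}.pdf` and `baseline/{doc_id}.pdf/sections/1/result.json`) and the skip-unless-version-changes behavior on stack updates. A minimal sketch of both follows. The helper names (`input_key`, `baseline_key`, `needs_download`) are hypothetical and not taken from the repository, and the changelog does not say how the previously deployed version is stored, so it is simply passed in as an argument here.

```python
from typing import Optional


def input_key(doc_id: str) -> str:
    """S3 key for a source PDF, following the `input/{doc_id}.pdf` layout."""
    return f"input/{doc_id}.pdf"


def baseline_key(doc_id: str) -> str:
    """S3 key for the ground-truth JSON that belongs to the same document."""
    return f"baseline/{doc_id}.pdf/sections/1/result.json"


def needs_download(deployed_version: Optional[str], requested_version: str) -> bool:
    """Return True when the dataset should be (re-)downloaded.

    Assumption: a first deployment has no recorded version (None), and a
    stack update re-downloads only if the requested version differs.
    """
    return deployed_version != requested_version
```

For example, `needs_download("1.0", "1.0")` is false, so an unchanged stack update would skip the download, while `needs_download(None, "1.0")` is true for a first deployment.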
