Commit 6cb432b

Merge branch 'feature/doc-reporting' into 'develop'
Feature/doc reporting

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!216
2 parents f5ee92f + d03d59b commit 6cb432b

18 files changed, +1364 -10360 lines changed

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -5,6 +5,8 @@ SPDX-License-Identifier: MIT-0
55

66
## [Unreleased]
77

8+
### Added
9+
810
### Fixed
911

1012

@@ -41,6 +43,20 @@ SPDX-License-Identifier: MIT-0
4143
- **Backward Compatibility**: Maintains same interface as standard assessment service with seamless migration path
4244
- **Enhanced Documentation**: Comprehensive documentation in `docs/assessment.md` and example notebooks for both standard and granular approaches
4345

- **Reporting Database now has Document Sections Tables to enable querying across document fields**
  - Added comprehensive document sections storage system that automatically creates tables for each section type (classification)
  - **Dynamic Table Creation**: AWS Glue Crawler automatically discovers new section types and creates corresponding tables (e.g., `invoice`, `receipt`, `bank_statement`)
  - **Configurable Crawler Schedule**: Support for manual, every 15 minutes, hourly (default), or daily crawler execution via `DocumentSectionsCrawlerFrequency` parameter
  - **Partitioned Storage**: Data organized by section type and date for efficient querying with Amazon Athena

- **Partition Projections for Evaluation and Metering tables**
  - **Automated Partition Management**: Eliminates need for `MSCK REPAIR TABLE` operations with projection-based partition discovery
  - **Performance Benefits**: Athena can efficiently prune partitions based on date ranges without manual partition loading
  - **Backward Compatibility Warning**: The partition structure change from `year=2024/month=03/day=15/` to `date=2024-03-15/` means that data saved in the evaluation or metering tables prior to v0.3.7 will not be visible in Athena queries after updating. To retain access to historical data, you can either:
    - Manually reorganize existing S3 data to match the new partition structure
    - Create separate Athena tables pointing to the old partition structure for historical queries

4460
- **Optimize the classification process for single class configurations in Pattern-2**
4561
- Detects when only a single document class is defined in the configuration
4662
- Automatically classifies all document pages as that single class

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
0.3.7-beta
1+
0.3.7-gamma

docs/reporting-database.md

Lines changed: 123 additions & 4 deletions
@@ -12,6 +12,9 @@ The GenAI IDP Accelerator includes a comprehensive reporting database that captu
1212
- [Section Evaluations](#section-evaluations)
1313
- [Attribute Evaluations](#attribute-evaluations)
1414
- [Metering Table](#metering-table)
15+
- [Document Sections Tables](#document-sections-tables)
16+
- [Dynamic Section Tables](#dynamic-section-tables)
17+
- [Crawler Configuration](#crawler-configuration)
1518
- [Using the Reporting Database with Athena](#using-the-reporting-database-with-athena)
1619
- [Sample Queries](#sample-queries)
1720
- [Creating Dashboards](#creating-dashboards)
@@ -37,7 +40,7 @@ The `document_evaluations` table contains document-level evaluation metrics:
3740
| false_discovery_rate | double | False discovery rate (0-1) |
3841
| execution_time | double | Time taken to evaluate (seconds) |
3942

40-
This table is partitioned by year, month (YYYY-MM format), day (YYYY-MM-DD format), and document ID.
43+
This table is partitioned by date (YYYY-MM-DD format).
4144

4245
### Section Evaluations
4346

@@ -56,7 +59,7 @@ The `section_evaluations` table contains section-level evaluation metrics:
5659
| false_discovery_rate | double | Section false discovery rate (0-1) |
5760
| evaluation_date | timestamp | When the evaluation was performed |
5861

59-
This table is partitioned by year, month (YYYY-MM format), day (YYYY-MM-DD format), and document ID.
62+
This table is partitioned by date (YYYY-MM-DD format).
6063

6164
### Attribute Evaluations
6265

@@ -78,7 +81,7 @@ The `attribute_evaluations` table contains attribute-level evaluation metrics:
7881
| confidence_threshold | string | Confidence threshold used |
7982
| evaluation_date | timestamp | When the evaluation was performed |
8083

81-
This table is partitioned by year, month (YYYY-MM format), day (YYYY-MM-DD format), and document ID.
84+
This table is partitioned by date (YYYY-MM-DD format).
8285

8386
## Metering Table
8487

@@ -94,14 +97,65 @@ The `metering` table captures detailed usage metrics for each document processin
9497
| number_of_pages | int | Number of pages in the document |
9598
| timestamp | timestamp | When the operation was performed |
9699

97-
This table is partitioned by year, month (YYYY-MM format), day (YYYY-MM-DD format), and document ID.
100+
This table is partitioned by date (YYYY-MM-DD format).
98101

99102
The metering table is particularly valuable for:
100103
- Cost analysis and allocation
101104
- Usage pattern identification
102105
- Resource optimization
103106
- Performance benchmarking across different document types and sizes
104107

## Document Sections Tables

The document sections tables store the data extracted from document sections in a structured format suitable for analytics. An AWS Glue crawler discovers these tables automatically, and they are organized by section type (classification).

### Dynamic Section Tables

Document sections are stored in dynamically created tables based on the section classification. Each section type gets its own table (e.g., `invoice`, `receipt`, `bank_statement`) with the following characteristics:

**Common Metadata Columns:**
| Column | Type | Description |
|--------|------|-------------|
| section_id | string | Unique identifier for the section |
| document_id | string | Unique identifier for the document |
| section_classification | string | Type/class of the section |
| section_confidence | double | Confidence score for the section classification |
| timestamp | timestamp | When the document was processed |

**Dynamic Data Columns:**
The remaining columns are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
- Nested JSON objects are flattened using dot notation (e.g., `customer.name`, `customer.address.street`)
- Arrays are converted to JSON strings
- Primitive values (strings, numbers, booleans) are preserved as their native types
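
The three flattening rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the accelerator's actual implementation: the function name `flatten_section` and the sample record are hypothetical, and the real code may handle edge cases (empty objects, nulls, key collisions) differently.

```python
import json
from typing import Any, Dict


def flatten_section(data: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dot-notation keys; serialize lists as JSON strings.

    Hypothetical helper, written only to illustrate the rules described above.
    """
    flat: Dict[str, Any] = {}
    for key, value in data.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted prefix.
            flat.update(flatten_section(value, column))
        elif isinstance(value, list):
            # Array: store as a JSON string in a single column.
            flat[column] = json.dumps(value)
        else:
            # Primitive: keep the native type.
            flat[column] = value
    return flat


# Made-up extraction result for an invoice section.
example = {
    "customer": {"name": "Acme Corp", "address": {"street": "1 Main St"}},
    "line_items": [{"sku": "A1", "qty": 2}],
    "total_amount": 125.5,
    "paid": True,
}
flat_example = flatten_section(example)
```

Because the resulting column names contain dots, they need to be double-quoted when referenced in Athena, as in `"customer.name"`.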

**Partitioning:**
Each section type table is partitioned by date (YYYY-MM-DD format) for efficient querying.

**File Organization:**
```
document_sections/
├── invoice/
│   └── date=2024-01-15/
│       ├── doc-123_section_1.parquet
│       └── doc-456_section_3.parquet
├── receipt/
│   └── date=2024-01-15/
│       └── doc-789_section_2.parquet
└── bank_statement/
    └── date=2024-01-15/
        └── doc-abc_section_1.parquet
```
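
As a rough illustration of the layout above, the object key for one section file can be assembled from the section type, processing date, document ID, and section ID. The function `section_key` is hypothetical and only mirrors the tree shown. Whether the accelerator normalizes section type names (for example lowercasing or replacing spaces) is not stated here, so the sketch passes them through unchanged.

```python
from datetime import date


def section_key(section_type: str, document_id: str, section_id: str, processed: date) -> str:
    """Build a key matching the layout shown above (hypothetical helper).

    The partition folder uses the Hive-style `date=YYYY-MM-DD` convention.
    """
    return (
        f"document_sections/{section_type}/"
        f"date={processed.isoformat()}/"
        f"{document_id}_section_{section_id}.parquet"
    )


key = section_key("invoice", "doc-123", "1", date(2024, 1, 15))
```

Keeping the date in a single `date=` folder is what lets Athena prune partitions with a plain `WHERE date BETWEEN ...` predicate.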

### Crawler Configuration

The AWS Glue Crawler automatically discovers new section types and creates corresponding tables. The schedule is set through the `DocumentSectionsCrawlerFrequency` parameter, and the crawler can be configured to run:
- Manually (on-demand)
- Every 15 minutes
- Every hour (default)
- Daily

With any of the scheduled options, new section types become available for querying after the next crawler run, without manual intervention.
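
One way to picture the four options is as a mapping from a frequency setting to an AWS Glue schedule expression. The sketch below is an assumption-heavy illustration: the key names (`manual`, `15min`, `hourly`, `daily`) are hypothetical and may not match the values the `DocumentSectionsCrawlerFrequency` parameter actually accepts, and the stack may express the schedules differently. The cron strings themselves follow the six-field Glue schedule syntax.

```python
from typing import Optional

# Hypothetical frequency names mapped to Glue cron expressions (UTC).
# None means no schedule: the crawler only runs when started on demand.
CRAWLER_SCHEDULES = {
    "manual": None,
    "15min": "cron(0/15 * * * ? *)",
    "hourly": "cron(0 * * * ? *)",
    "daily": "cron(0 0 * * ? *)",
}


def crawler_schedule(frequency: str = "hourly") -> Optional[str]:
    """Return the schedule expression for a frequency name, defaulting to hourly."""
    try:
        return CRAWLER_SCHEDULES[frequency]
    except KeyError:
        allowed = ", ".join(sorted(CRAWLER_SCHEDULES))
        raise ValueError(f"unknown frequency {frequency!r}; expected one of: {allowed}") from None
```

A shorter interval makes new section types queryable sooner at the cost of more crawler runs, so the hourly default is a middle ground.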
158+
105159
## Using the Reporting Database with Athena
106160

107161
Amazon Athena provides a serverless query service to analyze data directly in Amazon S3. The reporting database tables are automatically registered in the AWS Glue Data Catalog, making them immediately available for querying in Athena.
@@ -190,6 +244,71 @@ ORDER BY
190244
avg_tokens_per_page DESC;
191245
```
192246

**Document sections analysis by type:**
```sql
-- Query invoice sections for customer analysis
SELECT
  document_id,
  section_id,
  "customer.name" as customer_name,
  "customer.address.city" as customer_city,
  "total_amount" as invoice_total,
  date
FROM
  invoice
WHERE
  date BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY
  date DESC;
```

**Section processing volume by date:**
```sql
-- Count sections processed by type and date
SELECT
  date,
  section_classification,
  COUNT(*) as section_count,
  COUNT(DISTINCT document_id) as document_count
FROM (
  SELECT date, section_classification, document_id FROM invoice
  UNION ALL
  SELECT date, section_classification, document_id FROM receipt
  UNION ALL
  SELECT date, section_classification, document_id FROM bank_statement
)
GROUP BY
  date, section_classification
ORDER BY
  date DESC, section_count DESC;
```

**Date range queries with new partition structure:**
```sql
-- Efficient date range query using single date partition
SELECT
  COUNT(*) as total_documents,
  AVG(accuracy) as avg_accuracy
FROM
  document_evaluations
WHERE
  date BETWEEN '2024-01-01' AND '2024-01-31';

-- Monthly aggregation
SELECT
  SUBSTR(date, 1, 7) as month,
  COUNT(*) as document_count,
  AVG(accuracy) as avg_accuracy
FROM
  document_evaluations
WHERE
  date >= '2024-01-01'
GROUP BY
  SUBSTR(date, 1, 7)
ORDER BY
  month;
```

193312
### Creating Dashboards
194313

195314
For more advanced visualization and dashboarding:
