CHANGELOG.md: 16 additions & 0 deletions
@@ -5,6 +5,8 @@ SPDX-License-Identifier: MIT-0

 ## [Unreleased]

+### Added
+
 ### Fixed

@@ -41,6 +43,20 @@ SPDX-License-Identifier: MIT-0
   - **Backward Compatibility**: Maintains same interface as standard assessment service with seamless migration path
   - **Enhanced Documentation**: Comprehensive documentation in `docs/assessment.md` and example notebooks for both standard and granular approaches

+- **Reporting Database now has Document Sections Tables to enable querying across document fields**
+  - Added comprehensive document sections storage system that automatically creates tables for each section type (classification)
+  - **Dynamic Table Creation**: AWS Glue Crawler automatically discovers new section types and creates corresponding tables (e.g., `invoice`, `receipt`, `bank_statement`)
+  - **Configurable Crawler Schedule**: Support for manual, every 15 minutes, hourly (default), or daily crawler execution via `DocumentSectionsCrawlerFrequency` parameter
+  - **Partitioned Storage**: Data organized by section type and date for efficient querying with Amazon Athena
+
+- **Partition Projections for Evaluation and Metering tables**
+  - **Automated Partition Management**: Eliminates need for `MSCK REPAIR TABLE` operations with projection-based partition discovery
+  - **Performance Benefits**: Athena can efficiently prune partitions based on date ranges without manual partition loading
+  - **Backward Compatibility Warning**: The partition structure change from `year=2024/month=03/day=15/` to `date=2024-03-15/` means that data saved in the evaluation or metering tables prior to v0.3.7 will not be visible in Athena queries after updating. To retain access to historical data, you can either:
+    - Manually reorganize existing S3 data to match the new partition structure
+    - Create separate Athena tables pointing to the old partition structure for historical queries
+
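Reorganizing existing S3 data, the first option in the compatibility warning, comes down to rewriting each object key from the old three-level layout to the single `date=` partition. The following is a minimal sketch of that key mapping only, assuming keys contain a `year=YYYY/month=MM/day=DD/` segment exactly as quoted in the changelog entry. The function name and the sample key are hypothetical, and the actual copying of objects in S3 is not shown.

```python
import re

# Matches the old layout quoted in the changelog: year=2024/month=03/day=15/
# (assumption: two-digit month and day, as in that example).
OLD_PARTITION = re.compile(r"year=(\d{4})/month=(\d{2})/day=(\d{2})/")


def to_date_partition(key: str) -> str:
    """Rewrite an old-style object key to the new single `date=` partition.

    Keys that do not contain the old layout are returned unchanged.
    """
    return OLD_PARTITION.sub(r"date=\1-\2-\3/", key, count=1)


# Hypothetical key, for illustration only.
old_key = "metering/year=2024/month=03/day=15/doc-123.parquet"
print(to_date_partition(old_key))  # metering/date=2024-03-15/doc-123.parquet
```

If the deployed layout differs from the changelog example (for instance, an extra document ID level), the pattern would need adjusting before any data is moved.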
 - **Optimize the classification process for single class configurations in Pattern-2**
   - Detects when only a single document class is defined in the configuration
   - Automatically classifies all document pages as that single class
 | confidence_threshold | string | Confidence threshold used |
 | evaluation_date | timestamp | When the evaluation was performed |

-This table is partitioned by year, month (YYYY-MM format), day (YYYY-MM-DD format), and document ID.
+This table is partitioned by date (YYYY-MM-DD format).

 ## Metering Table

@@ -94,14 +97,65 @@ The `metering` table captures detailed usage metrics for each document processin
 | number_of_pages | int | Number of pages in the document |
 | timestamp | timestamp | When the operation was performed |

-This table is partitioned by year, month (YYYY-MM format), day (YYYY-MM-DD format), and document ID.
+This table is partitioned by date (YYYY-MM-DD format).

 The metering table is particularly valuable for:
 - Cost analysis and allocation
 - Usage pattern identification
 - Resource optimization
 - Performance benchmarking across different document types and sizes

+## Document Sections Tables
+
+The document sections tables store the actual extracted data from document sections in a structured format suitable for analytics. These tables are automatically discovered by AWS Glue Crawler and are organized by section type (classification).
+
+### Dynamic Section Tables
+
+Document sections are stored in dynamically created tables based on the section classification. Each section type gets its own table (e.g., `invoice`, `receipt`, `bank_statement`, etc.) with the following characteristics:
+
+**Common Metadata Columns:**
+| Column | Type | Description |
+|--------|------|-------------|
+| section_id | string | Unique identifier for the section |
+| document_id | string | Unique identifier for the document |
+| section_classification | string | Type/class of the section |
+| section_confidence | double | Confidence score for the section classification |
+| timestamp | timestamp | When the document was processed |
+
+**Dynamic Data Columns:**
+The remaining columns are dynamically inferred from the JSON extraction results and vary by section type. Common patterns include:
+- Nested JSON objects are flattened using dot notation (e.g., `customer.name`, `customer.address.street`)
+- Arrays are converted to JSON strings
+- Primitive values (strings, numbers, booleans) are preserved as their native types
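The three flattening rules listed above can be illustrated with a short sketch. This is not the project's actual implementation, only one plausible reading of the rules. The function name `flatten_section` and the sample record are hypothetical.

```python
import json
from typing import Any


def flatten_section(data: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts with dot notation; serialize lists as JSON strings."""
    flat: dict[str, Any] = {}
    for key, value in data.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse, extending the dotted column name.
            flat.update(flatten_section(value, column))
        elif isinstance(value, list):
            # Array: stored as a single JSON string column.
            flat[column] = json.dumps(value)
        else:
            # Primitive: kept with its native type.
            flat[column] = value
    return flat


# Hypothetical extraction result, for illustration only.
extracted = {
    "customer": {"name": "Acme", "address": {"street": "1 Main St"}},
    "line_items": [{"sku": "A1", "qty": 2}],
    "total_amount": 42.5,
}
print(flatten_section(extracted))
```

Under these assumptions the sample yields the columns `customer.name`, `customer.address.street`, `line_items` (a JSON string), and `total_amount` (still a float).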
+
+**Partitioning:**
+Each section type table is partitioned by date (YYYY-MM-DD format) for efficient querying.
+
+**File Organization:**
+```
+document_sections/
+├── invoice/
+│   └── date=2024-01-15/
+│       ├── doc-123_section_1.parquet
+│       └── doc-456_section_3.parquet
+├── receipt/
+│   └── date=2024-01-15/
+│       └── doc-789_section_2.parquet
+└── bank_statement/
+    └── date=2024-01-15/
+        └── doc-abc_section_1.parquet
+```
+
+### Crawler Configuration
+
+The AWS Glue Crawler automatically discovers new section types and creates corresponding tables. The crawler can be configured to run:
+- Manually (on-demand)
+- Every 15 minutes
+- Every hour (default)
+- Daily
+
+This ensures that new section types are automatically available for querying without manual intervention.
+
 ## Using the Reporting Database with Athena

 Amazon Athena provides a serverless query service to analyze data directly in Amazon S3. The reporting database tables are automatically registered in the AWS Glue Data Catalog, making them immediately available for querying in Athena.
@@ -190,6 +244,71 @@ ORDER BY
   avg_tokens_per_page DESC;
 ```

+**Document sections analysis by type:**
+```sql
+-- Query invoice sections for customer analysis
+SELECT
+  document_id,
+  section_id,
+  "customer.name" as customer_name,
+  "customer.address.city" as customer_city,
+  "total_amount" as invoice_total,
+  date
+FROM
+  invoice
+WHERE
+  date BETWEEN '2024-01-01' AND '2024-01-31'
+ORDER BY
+  date DESC;
+```
+
+**Section processing volume by date:**
+```sql
+-- Count sections processed by type and date
+SELECT
+  date,
+  section_classification,
+  COUNT(*) as section_count,
+  COUNT(DISTINCT document_id) as document_count
+FROM (
+  SELECT date, section_classification, document_id FROM invoice
+  UNION ALL
+  SELECT date, section_classification, document_id FROM receipt
+  UNION ALL
+  SELECT date, section_classification, document_id FROM bank_statement
+)
+GROUP BY
+  date, section_classification
+ORDER BY
+  date DESC, section_count DESC;
+```
+
+**Date range queries with new partition structure:**
+```sql
+-- Efficient date range query using single date partition