CHANGELOG.md: 42 additions & 5 deletions
@@ -9,17 +9,54 @@ SPDX-License-Identifier: MIT-0

 ### Added

-- **Agentic extraction preview with Strands agents** delivering structured field validation, configurable review flows, and sample notebooks/assets for Pattern-2 lending documents.
-  - Agentic extraction utilises Strands Agent framework to produce structured outputs in an iterative and self reviewing agent loop. It utilises tools for interacting with the output which will be extended in the future.
-  - Through the library this already allows users to utilise the `structured_output` function from the `lib/idp_common_pkg/idp_common/extraction/agentic_idp.py` to request extractions using Pydantic Models with custom validators defined. We intend to enable deeper validation customizations through the UI as well in the future.
-- **Containerized Pattern-2 deployment pipeline** that builds and pushes all Lambda images via CodeBuild using the new Dockerfile, plus automated ECR cleanup and tests.
+- **Agentic extraction preview with Strands agents (experimental)** introducing intelligent, self-correcting document extraction with improved schema compliance and accuracy improvements over traditional methods.
+  - Leverages the Strands Agent framework with iterative validation loops and automatic error correction to deliver schema compliance
+  - Provides structured output through Pydantic models with built-in validators, automatic retry handling, and superior handling of complex nested structures and date standardization
+  - Includes sample notebooks and configuration assets demonstrating agentic extraction for Pattern-2 lending documents
+  - Programmatic access available via `structured_output` function in `lib/idp_common_pkg/idp_common/extraction/agentic_idp.py`
+  - Currently this is an experimental feature. Future extensibility includes UI-based validation customization, code generation, and Model Context Protocol (MCP) integration for external data enrichment during extraction
+
+- **IDP CLI - Command Line Interface for Batch Document Processing**
+  - Added CLI tool (`idp_cli/`) for programmatic batch document processing and stack management
+  - **Key Features**: Deploy/update/delete CloudFormation stacks, process and reprocess documents from local directories or S3 URIs, live progress monitoring with rich terminal UI, download processing results locally, validate manifests before processing, generate manifests from directories with automatic baseline matching
+  - **Selective Reprocessing**: New `rerun-inference` command to reprocess documents from specific pipeline steps (classification or extraction) while leveraging existing OCR data for cost/time optimization
+  - **Evaluation Framework**: Workflow for accuracy testing including initial processing, manual validation, baseline creation, and automated evaluation with detailed metrics
+  - **Analytics Integration**: Query aggregated results via Athena SQL or use Agent Analytics in Web UI for visual analysis
   - **Documentation**: README with Quick Start, Commands Reference, Evaluation Workflow, and troubleshooting guides
+
+- **Extraction Results Integration in Summarization Service**
+  - Integrates extraction results from the extraction service into summarization module for context-aware summaries
+  - **Features**: Fully backward compatible (works with or without extraction results), automatic section handling, error resilient with graceful continuation, comprehensive logging
+  - **Configuration**: Enable by adding `{EXTRACTION_RESULTS}` placeholder to `task_prompt` in config.yaml
 - Updated Pattern-2 templates, `publish.py`, and documentation to detect container builds, surface new prerequisites (Docker/ECR), and detail enabling agentic extraction.
+- **Containerized Pattern-2 deployment pipeline** that builds and pushes all Lambda images via CodeBuild using the new Dockerfile, plus automated ECR cleanup and tests.
   - Lambda docker image deployments have a 10 GB image size limit compared to the 250 MB zip limit of regular deployment. This however doesn't allow for viewing the code in the AWS console.
     The change was introduced to accommodate the increased package size of introducing Strands into the package dependencies.

+### Fixed
+- **Discovery function times out when processing large documents.**
+  - increase lambda discovery processor timeout to 900s
+- **Corrected baseline directory structure documentation in evaluation.md**
   - Fixed invalid GovCloud template generation where ProcessChanges AppSync resources were not being removed, causing "Fn::GetAtt references undefined resource GraphQLApi" errors
+  - Updated `scripts/generate_govcloud_template.py` to remove all ProcessChanges-related resources and extend AppSync parameter cleanup to all pattern stacks
+  - Fixed InvalidClientTokenId validation error by ensuring CloudFormation client uses the correct region when validating templates (commercial vs GovCloud)
+- **Enhanced Processing Flow Visualization for Disabled Steps**
+  - Fixed UX issue where disabled processing steps (when `summarization.enabled: false` or `assessment.enabled: false` in configuration) appeared visually identical to active steps in the "View Processing Flow" display
+  - **Key Benefit**: Users can now immediately see which steps are actually processing data vs. steps that execute but skip processing based on configuration settings, preventing confusion about whether summarization or assessment ran
+  - Limitation: the new visual indicators are driven from the current config, which may have been altered since the document was processed. We will address this in a later release. See Issue #86.
+
+### Known Issues
+- **GovCloud Deployments fail, due to lack of ARM support for CodeBuild. Fix targeted for next release.**
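
The changelog entry for the summarization service says the `{EXTRACTION_RESULTS}` placeholder is optional and that prompts without it keep working. A minimal Python sketch of how such a substitution could behave is shown here. The helper name `build_task_prompt`, its parameters, and the choice to render extraction results as indented JSON are assumptions made for illustration; this is not the project's actual implementation.

```python
import json
import re
from typing import Any, Mapping, Optional

# Only these two placeholder names are recognised; any other braces in the
# template (for example a JSON output sample) are left untouched.
_PLACEHOLDER = re.compile(r"\{(DOCUMENT_TEXT|EXTRACTION_RESULTS)\}")


def build_task_prompt(
    template: str,
    document_text: str,
    extraction_results: Optional[Mapping[str, Any]] = None,
) -> str:
    """Fill a summarization task_prompt template (hypothetical helper)."""
    values = {
        "DOCUMENT_TEXT": document_text,
        # Missing or empty results are rendered as an empty JSON object,
        # so the prompt stays well-formed when extraction did not run.
        "EXTRACTION_RESULTS": json.dumps(dict(extraction_results or {}), indent=2),
    }
    # A single regex pass means substituted text is never re-scanned, so a
    # document that happens to contain "{EXTRACTION_RESULTS}" is not expanded.
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


legacy_template = "<document-text>\n{DOCUMENT_TEXT}\n</document-text>"
new_template = (
    legacy_template
    + "\n<extracted-attributes>\n{EXTRACTION_RESULTS}\n</extracted-attributes>"
)

# A template without the new placeholder ignores the extraction results.
print(build_task_prompt(legacy_template, "Loan application text", {"loan_amount": 1000}))
# A template with the placeholder receives the results as JSON.
print(build_task_prompt(new_template, "Loan application text", {"loan_amount": 1000}))
```

Using a targeted regex rather than `str.format` is a deliberate choice in this sketch: the sample `task_prompt` contains a literal JSON object, whose braces `str.format` would try to interpret as fields.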
config_library/pattern-2/lending-package-sample/config.yaml: 32 additions & 20 deletions
@@ -1082,12 +1082,14 @@ summarization:
   top_k: "5"
   task_prompt: >-
     <document-text>
-
     {DOCUMENT_TEXT}
-
     </document-text>

-    Analyze the provided document (<document-text>) and create a comprehensive summary.
+    <extracted-attributes>
+    {EXTRACTION_RESULTS}
+    </extracted-attributes>
+
+    Analyze the provided document (<document-text>) along with the extracted attributes (<extracted-attributes>) to create a comprehensive and accurate summary.

     CRITICAL INSTRUCTION: You MUST return your response as valid JSON with the
     EXACT structure shown at the end of these instructions. Do not include any
@@ -1096,29 +1098,39 @@ summarization:
     Create a summary that captures the essential information from the document.
     Your summary should:

-    1. Extract key information, main points, and important details
+    1. **Integrate Extracted Attributes**: Begin with a "Key Information" section that highlights the most important extracted attributes in a structured format (use tables or lists as appropriate)
+
+    2. **Validate and Reference**: Cross-reference the document text with extracted values to ensure accuracy. When mentioning specific values, prefer the extracted attributes when they are available
+
+    3. **Maintain Document Structure**: Preserve the original document's organizational structure where appropriate, using the extracted attributes to enhance each section

-    2. Maintain the original document's organizational structure where
-       appropriate
+    4. **Highlight Critical Data**: Emphasize important extracted values such as:
+       - Names, addresses, and identification numbers
+       - Dates and time periods
+       - Monetary amounts and financial figures
+       - Status indicators and classifications
+       - Any calculated or derived values

-    3. Preserve important facts, figures, dates, and entities
+    5. **Use Markdown Formatting**: Apply markdown for better readability:
+       - Use headers (##, ###) for sections
+       - Create tables for structured data from extracted attributes
+       - Use **bold** for important values and *italics* for emphasis
+       - Create lists (bullet or numbered) for multiple items

-    4. Reduce the length while retaining all critical information
+    6. **Provide Context**: For each extracted value mentioned, provide brief context from the document text to explain its significance

-    5. Use markdown formatting for better readability (headings, lists,
-       emphasis, etc.)
+    7. **Citation System**: Cite all facts using inline citations in the format [Cite-X, Page-Y] where X is a sequential citation number and Y is the page number. Format citations as markdown links: [[Cite-1, Page-3]](#cite-1-page-3)

-    6. Cite all relevant facts from the source document using inline citations
-       in the format [Cite-X, Page-Y] where X is a sequential citation number and Y
-       is the page number
+    8. **Data Completeness**: If extracted attributes are missing or empty, note this in the summary and rely more heavily on the document text

-    7. Format citations as markdown links that reference the full citation list
-       at the bottom of the summary
-       Example: [[Cite-1, Page-3]](#cite-1-page-3)
+    9. **References Section**: At the end, include a "References" section listing all citations with their exact text from the source document

-    8. At the end of the summary, include a "References" section that lists all
-       citations with their exact text from the source document in the format:
-       [Cite-X, Page-Y]: Exact text from the document
+    Structure your summary as follows:
+    - **Key Information** (from extracted attributes)
+    - **Document Overview**
+    - **Detailed Sections** (based on document structure)
+    - **Summary and Conclusions**
+    - **References**

     Output Format:

@@ -1127,7 +1139,7 @@ summarization:

     ```json
     {
-      "summary": "A comprehensive summary in markdown format with inline citations linked to a references section at the bottom"
+      "summary": "A comprehensive summary in markdown format that integrates extracted attributes with document text, includes inline citations linked to a references section at the bottom"