Commit 4eed2aa (2 parents: dffe70a + e469351)

Merge branch 'feature/modify-class-rerun-extraction' into 'develop'

Edit Sections Feature for Modifying Class/Type and Reprocessing Extraction

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!329

File tree: 25 files changed, +1874 −344 lines


CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,14 @@ SPDX-License-Identifier: MIT-0
 ## [Unreleased]
 
 ### Added
+
+- **Edit Sections Feature for Modifying Class/Type and Reprocessing Extraction**
+  - Added Edit Sections interface for Pattern-2 and Pattern-3 workflows with reprocessing optimization
+  - **Key Features**: Section management (create, update, delete), classification updates, page reassignment with overlap detection, real-time validation
+  - **Selective Reprocessing**: Only modified sections are reprocessed while preserving existing data for unmodified sections
+  - **Processing Pipeline**: All functions (OCR/Classification/Extraction/Assessment) automatically skip redundant operations based on data presence
+  - **Pattern Compatibility**: Full functionality for Pattern-2/Pattern-3; informative modal for Pattern-1 explaining that BDA is not yet supported
 - **Analytics Agent Schema Optimization for Improved Performance**
   - **Embedded Database Overview**: Complete table listing and guidance embedded directly in system prompt (no tool call needed)
   - **On-Demand Detailed Schemas**: `get_table_info(['specific_tables'])` loads detailed column information only for tables actually needed by the query
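The "skip redundant operations based on data presence" behavior in the changelog entry above can be sketched as follows. This is a minimal illustration, not the accelerator's actual pipeline code; the stage names and field names are assumptions.

```python
# Minimal sketch (assumed, not the accelerator's code) of data-presence-based
# skipping: each stage runs only when its output is missing, so unmodified
# sections pass through the pipeline untouched.
def process_section(section):
    steps_run = []
    if not section.get("ocr_data"):
        section["ocr_data"] = "ocr-result"          # placeholder work
        steps_run.append("ocr")
    if not section.get("classification"):
        section["classification"] = "invoice"       # placeholder work
        steps_run.append("classification")
    if not section.get("extraction"):
        section["extraction"] = {"total": 100}      # placeholder work
        steps_run.append("extraction")
    if not section.get("assessment"):
        section["assessment"] = {"confidence": 0.9} # placeholder work
        steps_run.append("assessment")
    return steps_run

# A modified section had its extraction cleared, so only later stages run.
section = {"ocr_data": "x", "classification": "invoice"}
print(process_section(section))  # ['extraction', 'assessment']
```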

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.16
+0.3.17-rc1

docs/web-ui.md

Lines changed: 60 additions & 0 deletions
@@ -26,6 +26,66 @@ The solution includes a responsive web-based user interface built with React tha
 - **Document Process Flow visualization** for detailed workflow execution monitoring and troubleshooting
 - **Document Analytics** for querying and visualizing processed document data
 
+## Edit Sections
+
+The Edit Sections feature provides an intelligent interface for modifying document section classifications and page assignments, with automatic reprocessing optimization for Pattern-2 and Pattern-3 workflows.
+
+### Key Capabilities
+
+- **Section Management**: Create, update, and delete document sections with validation
+- **Classification Updates**: Change section document types with real-time validation
+- **Page Reassignment**: Move pages between sections with overlap detection
+- **Intelligent Reprocessing**: Only modified sections are reprocessed, preserving existing data
+- **Immediate Feedback**: Status updates appear instantly in the UI
+- **Pattern Compatibility**: Available for Pattern-2 and Pattern-3, with informative guidance for Pattern-1
+
+### How to Use
+
+1. Navigate to a completed document's detail page
+2. In the "Document Sections" panel, click the "Edit Sections" button
+3. **For Pattern-2/Pattern-3**: Enter edit mode with inline editing capabilities
+4. **For Pattern-1**: View an informative modal explaining BDA architecture differences
+
+#### Editing Workflow (Pattern-2/Pattern-3)
+
+1. **Edit Section Classifications**: Use dropdowns to change document types
+2. **Modify Page Assignments**: Edit comma-separated page IDs (e.g., "1, 2, 3")
+3. **Add New Sections**: Click "Add Section" for new document boundaries
+4. **Delete Sections**: Use remove buttons to delete unnecessary sections
+5. **Validation**: Real-time validation prevents overlapping pages and invalid configurations
+6. **Submit Changes**: Click "Save & Process Changes" to trigger selective reprocessing
+
+### Processing Optimization
+
+The Edit Sections feature uses a two-phase optimization:
+
+#### Phase 1: Frontend
+- **Selective Payload**: Only sends sections that actually changed
+- **Validation Engine**: Prevents invalid configurations before submission
+
+#### Phase 2: Backend
+- **Pipeline**: Processing functions automatically skip redundant operations
+- **OCR**: Skips pages that already have OCR data
+- **Classification**: Skips pages that are already classified
+- **Extraction**: Skips sections that already have extraction data
+- **Assessment**: Skips when extraction results already contain assessment data
+- **Selective Reprocessing**: Only modified sections lose their data and get reprocessed
+
+### Pattern Compatibility
+
+#### Pattern-2 and Pattern-3 Support
+- **Full Functionality**: Complete edit capabilities with intelligent reprocessing
+- **Performance Optimization**: Automatic selective processing for efficiency
+- **Data Preservation**: Unmodified sections retain all processing results
+
+#### Pattern-1 Information
+Pattern-1 uses **Bedrock Data Automation (BDA)** with automatic section management. When Edit Sections is clicked, users see an informative modal explaining:
+
+- **Architecture Differences**: BDA handles section boundaries automatically
+- **Alternative Workflows**: Available options such as "View/Edit Data", configuration updates, and document reprocessing
+- **Future Considerations**: Guidance on using Pattern-2/Pattern-3 for fine-grained section control
+
 ## Document Analytics
 
 The Document Analytics feature allows users to query their processed documents using natural language and receive results in various formats including charts, tables, and text responses.
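The overlap detection mentioned in the Edit Sections editing workflow (comma-separated page IDs, no page in two sections) could be implemented along these lines. This is a hypothetical sketch; the function name and section representation are illustrative, not the accelerator's actual frontend code.

```python
# Hypothetical sketch of page-overlap validation: parse each section's
# comma-separated page IDs and flag any page claimed by more than one section.
def find_overlapping_pages(sections):
    seen = {}       # page id -> index of the first section that claimed it
    overlaps = []   # (page id, first section index, conflicting section index)
    for idx, page_ids in enumerate(sections):
        for raw in page_ids.split(","):
            page = raw.strip()
            if not page:
                continue  # tolerate trailing commas and extra spaces
            if page in seen:
                overlaps.append((page, seen[page], idx))
            else:
                seen[page] = idx
    return overlaps

# Section 0 claims pages 1-3, section 1 claims 3-4: page "3" overlaps.
print(find_overlapping_pages(["1, 2, 3", "3, 4"]))  # [('3', 0, 1)]
```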

lib/idp_common_pkg/idp_common/appsync/client.py

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ def execute_mutation(
     request = AWSRequest(
         method="POST",
         url=self.api_url,
-        data=json.dumps(data).encode(),
+        data=json.dumps(data, default=str).encode(),
         headers={
             "Content-Type": "application/json",
             "Accept": "application/json",
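The `default=str` change above matters because values read back from DynamoDB arrive as `Decimal`, which the stock JSON encoder rejects. A minimal standalone illustration:

```python
import json
from decimal import Decimal

# DynamoDB returns numeric attributes as Decimal, which json.dumps
# cannot serialize by default.
payload = {"confidence": Decimal("0.95")}

try:
    json.dumps(payload)
except TypeError as e:
    print(f"stock encoder fails: {e}")

# default=str stringifies any value the encoder cannot handle.
print(json.dumps(payload, default=str))  # {"confidence": "0.95"}
```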

lib/idp_common_pkg/idp_common/appsync/service.py

Lines changed: 11 additions & 3 deletions
@@ -149,10 +149,18 @@ def _document_to_update_input(self, document: Document) -> Dict[str, Any]:
         if section.confidence_threshold_alerts:
             alerts_data = []
             for alert in section.confidence_threshold_alerts:
+                # Convert Decimal values to float to avoid serialization issues
+                confidence_value = alert.get("confidence")
+                confidence_threshold_value = alert.get("confidence_threshold")
+
                 alert_data = {
                     "attributeName": alert.get("attribute_name"),
-                    "confidence": alert.get("confidence"),
-                    "confidenceThreshold": alert.get("confidence_threshold"),
+                    "confidence": float(confidence_value)
+                    if confidence_value is not None
+                    else None,
+                    "confidenceThreshold": float(confidence_threshold_value)
+                    if confidence_threshold_value is not None
+                    else None,
                 }
                 alerts_data.append(alert_data)
             section_data["ConfidenceThresholdAlerts"] = alerts_data
@@ -164,7 +172,7 @@ def _document_to_update_input(self, document: Document) -> Dict[str, Any]:
 
     # Add metering data if available
     if document.metering:
-        input_data["Metering"] = json.dumps(document.metering)
+        input_data["Metering"] = json.dumps(document.metering, default=str)
 
     # Add evaluation status & report if available
     if document.evaluation_status:
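The None-preserving Decimal-to-float conversion in the diff above can be factored into a tiny helper. The helper name here is illustrative only, not part of the codebase:

```python
from decimal import Decimal

# Sketch of the None-preserving conversion applied to the confidence alert
# fields above: Decimals become plain floats, missing values stay None.
def to_float_or_none(value):
    return float(value) if value is not None else None

print(to_float_or_none(Decimal("0.87")))  # 0.87
print(to_float_or_none(None))             # None
```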

lib/idp_common_pkg/idp_common/dynamodb/service.py

Lines changed: 22 additions & 7 deletions
@@ -286,7 +286,7 @@ def _dynamodb_item_to_document(self, item: Dict[str, Any]) -> Document:
     doc = Document(
         id=item.get("ObjectKey"),
         input_key=item.get("ObjectKey"),
-        num_pages=item.get("PageCount", 0),
+        num_pages=int(item.get("PageCount", 0)),  # Ensure PageCount is an integer
         queued_time=item.get("QueuedTime"),
         start_time=item.get("WorkflowStartTime"),
         completion_time=item.get("CompletionTime"),
@@ -304,23 +304,38 @@ def _dynamodb_item_to_document(self, item: Dict[str, Any]) -> Document:
         logger.warning(f"Unknown status '{object_status}', using QUEUED")
         doc.status = Status.QUEUED
 
-    # Convert metering data
-    metering_json = item.get("Metering")
-    if metering_json:
+    # Convert metering data - handle both JSON string and native dict formats
+    metering_data = item.get("Metering")
+    if metering_data:
         try:
-            doc.metering = json.loads(metering_json)
+            if isinstance(metering_data, str):
+                # It's a JSON string; only parse non-empty strings
+                if metering_data.strip():
+                    doc.metering = json.loads(metering_data)
+                else:
+                    doc.metering = {}
+            else:
+                # It's already a dict (native DynamoDB format), use it directly
+                doc.metering = metering_data
         except json.JSONDecodeError:
-            logger.warning("Failed to parse metering data")
+            logger.warning("Failed to parse metering JSON string, using empty dict")
+            doc.metering = {}
+        except Exception as e:
+            logger.warning(f"Error processing metering data: {e}, using empty dict")
+            doc.metering = {}
 
     # Convert pages
     pages_data = item.get("Pages", [])
     if pages_data is not None:  # Ensure pages_data is not None before iterating
         for page_data in pages_data:
             page_id = str(page_data.get("Id"))
+            text_uri = page_data.get("TextUri")
             doc.pages[page_id] = Page(
                 page_id=page_id,
                 image_uri=page_data.get("ImageUri"),
-                raw_text_uri=page_data.get("TextUri"),
+                raw_text_uri=text_uri,
+                parsed_text_uri=text_uri,  # Set both raw and parsed to the same URI
+                text_confidence_uri=page_data.get("TextConfidenceUri"),
                 classification=page_data.get("Class"),
             )
lib/idp_common_pkg/idp_common/models.py

Lines changed: 2 additions & 2 deletions
@@ -277,7 +277,7 @@ def from_dict(cls, data: Dict[str, Any]) -> "Document":
     input_bucket=data.get("input_bucket"),
     input_key=data.get("input_key"),
     output_bucket=data.get("output_bucket"),
-    num_pages=data.get("num_pages", 0),
+    num_pages=int(data.get("num_pages", 0)),  # Ensure num_pages is an integer
     initial_event_time=data.get("initial_event_time"),
     queued_time=data.get("queued_time"),
     start_time=data.get("start_time"),
@@ -356,7 +356,7 @@ def from_s3_event(cls, event: Dict[str, Any], output_bucket: str) -> "Document":
 
 def to_json(self) -> str:
     """Convert document to JSON string."""
-    return json.dumps(self.to_dict())
+    return json.dumps(self.to_dict(), default=str)
 
 @classmethod
 def from_json(cls, json_str: str) -> "Document":
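The two changes above work together: `int(...)` guards against string page counts, and `default=str` lets `to_json` handle field types the stock encoder rejects. A standalone illustration with assumed, representative field values (not the actual `Document` model):

```python
import json
from datetime import datetime
from decimal import Decimal

# Representative document fields (assumed for illustration): datetimes and
# Decimals would make a plain json.dumps raise TypeError.
doc_dict = {
    "num_pages": int("3"),  # int() guards against string-typed counts
    "queued_time": datetime(2024, 1, 1, 12, 0),
    "confidence": Decimal("0.9"),
}

# default=str stringifies the non-serializable values instead of raising.
print(json.dumps(doc_dict, default=str))
```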

lib/idp_common_pkg/idp_common/utils/__init__.py

Lines changed: 16 additions & 2 deletions
@@ -89,7 +89,21 @@ def merge_metering_data(existing_metering: Dict[str, Any],
     for unit, value in metrics.items():
         if service_api not in merged:
             merged[service_api] = {}
-        merged[service_api][unit] = merged[service_api].get(unit, 0) + value
+
+        # Convert both values to numbers to handle string vs int mismatch
+        try:
+            existing_value = merged[service_api].get(unit, 0)
+            # Handle both string and numeric values
+            if isinstance(existing_value, str):
+                existing_value = float(existing_value)
+            if isinstance(value, str):
+                value = float(value)
+
+            merged[service_api][unit] = existing_value + value
+        except (ValueError, TypeError) as e:
+            logger.warning(f"Error converting metering values for {service_api}.{unit}: existing={merged[service_api].get(unit)}, new={value}, error={e}")
+            # Fall back to the new value if conversion fails
+            merged[service_api][unit] = value
 else:
     logger.warning(f"Unexpected metering data format for {service_api}: {metrics}")
 
@@ -632,4 +646,4 @@ def check_token_limit(document_text: str, extraction_results: Dict[str, Any], co
 else:
     logger.info(f"This document is configured with {int(configured_max_tokens)} max_tokens, "
                 f" requires approximately {int(estimated_tokens)} tokens.")
-    return None
+    return None
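The string-vs-number coercion in `merge_metering_data` above reduces to a small helper, sketched here with an illustrative name:

```python
# Sketch of the tolerant metering merge: coerce string values to float
# before summing, and fall back to the new value if coercion fails.
def merge_unit(existing, value):
    try:
        if isinstance(existing, str):
            existing = float(existing)
        if isinstance(value, str):
            value = float(value)
        return existing + value
    except (ValueError, TypeError):
        return value  # keep the new value when conversion fails

print(merge_unit("3", 2))    # 5.0
print(merge_unit(3, "2.5"))  # 5.5
print(merge_unit("n/a", 7))  # 7  (conversion failed, keep new value)
```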
