Commit 13d0da6

Merge pull request #80 from kazmer97/feature/json-schema-extraction-model
Feature/json schema extraction model
2 parents 702a67d + 085b22c commit 13d0da6

93 files changed: +15734 -6339 lines changed

ANALYSIS_JSON_SCHEMA_LIBRARIES.md

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
# JSON Schema Library Utilization Analysis

## Current State

### Backend (Python)
**Library Available:** `jsonschema` (Draft202012Validator)
**Currently Used In:**
- `src/lambda/update_configuration/index.py:88-135` - Validates extractionSchema on upload
- `lib/idp_common_pkg/idp_common/extraction/agentic_idp.py` - Some validation

**Schema Definitions Available:**
- `EXTRACTION_CLASS_SCHEMA` in `schema_definition.py`
- `EXTRACTION_SCHEMA_ARRAY` (referenced, but its definition still needs to be verified)
- ✅ Comprehensive schema with AWS extensions defined

### Frontend (JavaScript)
**Library Available:** `ajv` + `ajv-formats`
**Currently Used In:**
- `src/ui/src/hooks/useSchemaValidation.js` - Custom validation logic
- ⚠️ Duplicates validation that could use AJV's built-in features

## Opportunities for Improvement
23+
24+
### HIGH PRIORITY: Backend Migration Validation
25+
26+
**File:** `lib/idp_common_pkg/idp_common/config_schema/migration.py`
27+
**Issue:** NO validation after migration
28+
**Risk:** Can produce invalid schemas that break downstream
29+
30+
**Current:**
31+
```python
32+
def migrate_legacy_to_schema(legacy_classes):
33+
# ... migration logic ...
34+
return _convert_classes_to_json_schema(migrated_classes)
35+
# NO VALIDATION!
36+
```
37+
38+
**Proposed:**
39+
```python
40+
def migrate_legacy_to_schema(legacy_classes, validate=True):
41+
result = _convert_classes_to_json_schema(migrated_classes)
42+
43+
if validate:
44+
from jsonschema import Draft202012Validator, ValidationError
45+
from .schema_definition import EXTRACTION_CLASS_SCHEMA
46+
47+
validator = Draft202012Validator(EXTRACTION_CLASS_SCHEMA)
48+
try:
49+
if isinstance(result, list):
50+
for schema in result:
51+
validator.validate(schema)
52+
else:
53+
validator.validate(result)
54+
except ValidationError as e:
55+
raise ValueError(f"Migration produced invalid schema: {e.message}")
56+
57+
return result
58+
```
59+
60+
**Impact:** Prevents invalid schemas from being created

---

### HIGH PRIORITY: Validate AWS Extensions

**File:** `lib/idp_common_pkg/idp_common/config_schema/migration.py:56-67`
**Issue:** No validation of AWS extension values
**Risk:** Invalid `evaluation_method` or `confidence_threshold` values can be stored

**Current:**
```python
if "evaluation_method" in attr:
    schema_attr["x-aws-idp-evaluation-method"] = attr["evaluation_method"]
    # No validation!

if "confidence_threshold" in attr:
    threshold = attr["confidence_threshold"]
    # Weak string-to-float conversion
```

**Proposed:** Use schema definition to validate
```python
from typing import Any, Dict


def _validate_aws_extensions(extensions: Dict[str, Any]) -> None:
    """Validate AWS IDP extensions against schema."""
    from jsonschema import Draft202012Validator, ValidationError

    # Extension constraints, written inline here; ideally these would be
    # extracted from EXTRACTION_CLASS_SCHEMA so they are defined only once
    aws_extension_schema = {
        "type": "object",
        "properties": {
            "x-aws-idp-evaluation-method": {
                "type": "string",
                "enum": ["EXACT", "NUMERIC_EXACT", "FUZZY", "SEMANTIC"]
            },
            "x-aws-idp-confidence-threshold": {
                "type": "number",
                "minimum": 0,
                "maximum": 1
            }
        }
    }

    validator = Draft202012Validator(aws_extension_schema)
    try:
        validator.validate(extensions)
    except ValidationError as e:
        raise ValueError(f"Invalid AWS extension: {e.message}") from e
```

---

### MEDIUM PRIORITY: Frontend - Use AJV for All Validation

**File:** `src/ui/src/hooks/useSchemaValidation.js:67-133`
**Issue:** Manual validation logic duplicates what AJV can do

**Current:** Manual checks for minLength/maxLength, etc.
```javascript
if (attribute.type === 'string') {
  if (attribute.minLength !== undefined && attribute.maxLength !== undefined) {
    if (attribute.minLength > attribute.maxLength) {
      errors.push({ path: '/minLength', message: 'minLength cannot be greater than maxLength' });
    }
  }
}
```

**Proposed:** Define meta-schema and use AJV
```javascript
const ATTRIBUTE_META_SCHEMA = {
  type: 'object',
  properties: {
    type: { enum: ['string', 'number', 'integer', 'boolean', 'object', 'array', 'null'] },
    // AJV validates every keyword declared in this meta-schema
  },
  // Add custom formats for AWS extensions
  // `required` keeps the condition from matching attributes that have no `type`
  if: { properties: { type: { const: 'string' } }, required: ['type'] },
  then: {
    properties: {
      minLength: { type: 'integer', minimum: 0 },
      maxLength: { type: 'integer', minimum: 0 }
    },
    // The minLength <= maxLength relationship cannot be expressed in plain
    // JSON Schema; it needs a custom keyword or AJV's `$data` option:
    if: {
      required: ['minLength', 'maxLength']
    },
    then: {
      // Custom keyword or use ajv-keywords plugin
    }
  }
};

const validateAttribute = useCallback((attribute) => {
  const validate = ajv.compile(ATTRIBUTE_META_SCHEMA);
  const valid = validate(attribute);

  if (!valid) {
    return {
      valid: false,
      errors: validate.errors.map(err => ({
        path: err.instancePath,
        message: err.message
      }))
    };
  }

  return { valid: true, errors: [] };
}, [ajv]);
```

**Benefits:**
- Eliminate 70+ lines of manual validation
- Leverage AJV's optimized validation
- Automatically handle new JSON Schema keywords

---

### MEDIUM PRIORITY: Backend - Validate Configuration on Read

**File:** `lib/idp_common_pkg/idp_common/config/configuration_manager.py`
**Issue:** No validation when reading from DynamoDB

**Proposed:**
```python
def get_configuration(self, configuration_type: str, validate=True) -> Dict[str, Any]:
    """Get configuration with optional validation."""
    config = ...  # ... fetch from DynamoDB (elided) ...

    if validate and 'classes' in config:
        from idp_common.config_schema import validate_extraction_schema
        try:
            validate_extraction_schema(config['classes'])
        except Exception as e:
            # The invalid config is only logged here and is still returned
            logger.error(f"Invalid config in DynamoDB: {e}")
            # Could return default or raise

    return config
```

---

### LOW PRIORITY: Frontend - Share Schema Definition

**Issue:** Schema definition duplicated between frontend and backend

**Current:**
- Backend: `lib/idp_common_pkg/idp_common/config_schema/schema_definition.py`
- Frontend: Partial schema in `useSchemaValidation.js`

**Proposed:**
- Generate JSON file from Python schema definition
- Import in both frontend and backend
- Single source of truth
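
One way the "generate JSON file" step could look is sketched below. This is only an illustration: `SCHEMA_DEFINITION`, `export_schema`, and the output path are assumptions, and the dictionary is a placeholder for the real definition in `schema_definition.py`.

```python
import json
import tempfile
from pathlib import Path

# Placeholder for the Python schema definition (assumption: the real one is a
# plain dict that serializes to JSON without custom types).
SCHEMA_DEFINITION = {
    "type": "object",
    "required": ["type", "properties"],
    "properties": {
        "type": {"const": "object"},
        "x-aws-idp-evaluation-method": {
            "enum": ["EXACT", "NUMERIC_EXACT", "FUZZY", "SEMANTIC"]
        },
    },
}


def export_schema(schema, target: Path) -> Path:
    """Write the schema as deterministic, human-readable JSON."""
    target.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(schema, indent=2, sort_keys=True) + "\n"
    target.write_text(text, encoding="utf-8")
    return target


# Write to a temporary directory so the sketch has no side effects
with tempfile.TemporaryDirectory() as tmp:
    path = export_schema(
        SCHEMA_DEFINITION, Path(tmp) / "generated" / "extraction_class.schema.json"
    )
    round_tripped = json.loads(path.read_text(encoding="utf-8"))

print(round_tripped == SCHEMA_DEFINITION)
```

Sorting keys keeps the generated file stable across runs, which keeps diffs small if the file is committed. On the frontend, a Vite-based build can import a `.json` file directly, so the same file could feed AJV.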

---

## Implementation Priority

### Phase 1 (HIGH - Immediate)
1. ✅ Add validation to `migrate_legacy_to_schema()`
2. ✅ Add AWS extension validation in migration
3. ✅ Validate after migration in configuration_resolver

### Phase 2 (MEDIUM - This Sprint)
4. ⚠️ Use AJV meta-schema in useSchemaValidation
5. ⚠️ Add validation on config read in ConfigurationManager

### Phase 3 (LOW - Future)
6. ⬜ Share schema definition between frontend/backend
7. ⬜ Add JSON Schema $ref resolution using library features

## Code Savings Estimate

- Backend validation: +50 lines (new code for safety)
- Frontend AJV improvements: -70 lines (eliminate manual validation)
- **Net:** -20 lines, with much broader validation coverage

CHANGELOG.md

Lines changed: 11 additions & 18 deletions
@@ -13,22 +13,15 @@ SPDX-License-Identifier: MIT-0
  - Added support for Claude Haiku 4.5
  - Available for configuration across all document processing steps

- - **X-Ray Integration for Error Analyzer Agent**
- - Integrated AWS X-Ray tracing tools to enhance diagnostic capabilities of the error analyzer agent
- - X-Ray context enables better distinction between infrastructure issues and application logic failures
- - Added trace ID persistence in DynamoDB alongside document status for complete traceability
- - Enhanced CloudWatch error log filtering for more targeted error analysis
- - Simplified CloudWatch results structure for improved readability and analysis
- - Updated error analyzer recommendations to leverage X-Ray insights for more accurate root cause identification
-
- - **EU Region Support with Automatic Model Mapping**
- - Added support for deploying the solution in EU regions (eu-central-1, eu-west-1, etc.)
- - Automatic model endpoint mapping between US and EU regions for seamless deployment
- - Comprehensive model mapping table covering Amazon Nova and Anthropic Claude models
- - Intelligent fallback mappings when direct EU equivalents are unavailable
- - Quick Launch button for eu-central-1 region in README and deployment documentation
- - IDP CLI now supports eu-central-1 deployment with automatic template URL selection
- - Complete technical documentation in `docs/eu-region-model-support.md` with best practices and troubleshooting
+ - **JSON Schema Format for Class Definitions** - [docs/json-schema-migration.md](./docs/json-schema-migration.md)
+ - Document class definitions now use industry-standard JSON Schema Draft 2020-12 format for improved flexibility and tooling integration
+ - **Standards-Based Validation**: Leverage standard JSON Schema validators and tooling ecosystem for better configuration validation
+ - **Enhanced Extensibility**: Custom IDP properties use standard JSON Schema extension pattern (`x-aws-idp-*` prefix) for clean separation of concerns
+ - **Modern Data Contract**: Define document structures using widely-adopted JSON Schema format with robust type system (`string`, `number`, `boolean`, `object`, `array`)
+ - **Nested Structure Support**: Natural representation of complex documents with nested objects and arrays using JSON Schema's native `properties` and `items` keywords
+ - **Automatic Migration**: Existing legacy configurations automatically migrate to JSON Schema format on first load - completely transparent to users
+ - **Backward Compatible**: Legacy format remains supported through automatic migration - no manual configuration updates required
+ - **Comprehensive Documentation**: New migration guide with format comparison, field mapping table, and best practices

  ### Changed

@@ -42,9 +35,9 @@ SPDX-License-Identifier: MIT-0


  - **Migrated UI Build System from Create React App to Vite**
- - Upgraded to Vite 7 for faster build times
+ - Upgraded to Vite 7 for faster build times and improved developer experience
  - Updated to React 18, AWS Amplify v6, react-router-dom v6, and Cloudscape Design System
- - Reduced dependencies and node_modules size
+ - Reduced dependencies and node_modules size for faster installs
  - Implemented strategic code splitting for improved performance
  - Environment variables now use `VITE_` prefix instead of `REACT_APP_` for local development

README.md

Lines changed: 1 addition & 0 deletions
@@ -161,6 +161,7 @@ For detailed deployment and testing instructions, see the [Deployment Guide](./d
  - [Agent Analysis](./docs/agent-analysis.md) - Natural language analytics and data visualization feature
  - [Custom MCP Agent](./docs/custom-MCP-agent.md) - Integrating external MCP servers for custom tools and capabilities
  - [Configuration](./docs/configuration.md) - Configuration and customization options
+ - [JSON Schema Migration](./docs/json-schema-migration.md) - JSON Schema format guide and legacy migration details
  - [Discovery](./docs/discovery.md) - Pattern-neutral discovery process and BDA blueprint automation
  - [Classification](./docs/classification.md) - Customizing document classification
  - [Extraction](./docs/extraction.md) - Customizing information extraction

docs/configuration.md

Lines changed: 4 additions & 2 deletions
@@ -5,12 +5,14 @@ SPDX-License-Identifier: MIT-0

  The GenAIIDP solution provides multiple configuration approaches to customize document processing behavior to suit your specific needs.

+ > **📝 Note:** Starting with version 0.3.21, document class definitions use **JSON Schema** format instead of the legacy custom format. See [json-schema-migration.md](json-schema-migration.md) for migration details and format comparison. Legacy configurations are automatically migrated on first use.
+
  ## Pattern Configuration via Web UI

  The web interface allows real-time configuration updates without stack redeployment:

- - **Document Classes**: Define and modify document categories and their descriptions
- - **Extraction Attributes**: Configure fields to extract for each document class
+ - **Document Classes**: Define and modify document categories and their descriptions (using JSON Schema format)
+ - **Extraction Attributes**: Configure fields to extract for each document class (defined as JSON Schema properties)
  - **Few Shot Examples**: Upload and configure example documents to improve accuracy (supported in Pattern 2)
  - **Model Selection**: Choose between available Bedrock models for classification and extraction
  - **Prompt Engineering**: Customize system and task prompts for optimal results
