Add DocumentAI reference implementation using DDE module #252
laurencegoolsby wants to merge 80 commits into main
- Add enable_document_data_extraction boolean
- Add environment variables and bucket policies to service
- Add additional outputs for BDA project/blueprint

- Move infra/modules/document-data-extraction/resources/bedrock-data-automation/* to infra/modules/document-data-extraction/resources
- Rename infra/app-flask/service/blueprints to document-data-extraction-blueprints
- Remove bda_ prefix from infra/modules/document-data-extraction/resources/variables.tf and infra/app-flask/service/document_data_extraction.tf
- Update infra/app-flask/app-config/env-config/document_data_extraction.tf name and path

- Add local.prefix to bucket names and BDA project name
- Update environment variables to use prefixed bucket names

- Create KMS key
- Remove KMS configuration parameters from module interface
- Add Bedrock Data Automation access_policy_arn output for service integration
- Update blueprints_map to use map(string) tags instead of complex objects
- Remove enabled_blueprint logic in favor of reading blueprints directory
- Update README

…ment data extraction integration
- Create /test-document-data-extraction endpoint in Flask app
- Update boto3/botocore to support latest BDA API (1.38.0+)
- Fix Terraform outputs to include bucket name prefixes
- Add DDE environment variables to local.env template
…ts.tf, infra/modules/document-data-extraction/resources/variables.tf
…into bda-module
- Remove Bedrock Data Automation KMS key alias
- Remove unused Document Data Extraction outputs
- Format code via make format

- Upgrade moto from version 4.0.2 to 5.1.18
- Switch to moto.mock_aws, because moto.mock_s3 was removed in moto 5.x

- Upgrade urllib3 to ^2.6.0 (fixes GHSA-2xpw-w6gg-jr37, GHSA-gm62-xv2j-4w53, GHSA-pq67-6m6q-mj2v)
- Upgrade filelock to ^3.20.1 (fixes GHSA-w853-jp5j-5j7f)
- Addresses high and medium severity vulnerabilities found by grype scan
navapbc/template-infra#990 is about adding DDE module test support to
- Add externalReferenceId attribute and GSI for querying documents by external reference id (e.g. caseId)
- Add DDE_EXTERNAL_REF_ID_INDEX_NAME environment variable
- GSI enables grouping documents by case/client identifier

- Add metrics.tf with Glue database, table, SQS queue, and Athena workgroup
- Configure S3 buckets for raw metrics and Athena results
- Add EventBridge schedules for metrics processor and aggregator
- Configure partition projection for efficient Athena queries
- Add IAM policies for metrics Lambda functions

- Add kms:Encrypt to storage module IAM policy for Lambda encryption access
- Export kms_key_arn from storage module for workgroup configuration
- Configure Athena workgroup with SSE-KMS encryption using bucket's KMS key
- Add s3:GetBucketLocation to metrics_aggregator policy for bucket verification

- Add multipage-upload-session table with sessionId (primary key) and page_number (sort key)
- Configure GSI for querying sessions by sessionId
- Supports multipage document upload feature endpoints

- Rename clientId to tenantId in DynamoDB tables and GSIs
- Update document_metadata table partition key index name
- Add tenantId GSI to document_metadata table
- Add documentai_batches table with tenantId support
- Add DOCUMENTAI_BATCH_TABLE_NAME environment variable

- Rename DynamoDB table from multipage_upload_sessions to document_builds
- Change primary key from sessionId to buildId
- Update environment variable from DOCUMENTAI_MULTIPAGE_UPLOAD_SESSIONS_TABLE_NAME to DOCUMENTAI_DOCUMENT_BUILDS_TABLE_NAME

- Add KMS encryption for DynamoDB tables (document_metadata, document_builds, documentai_batches)
- Enable point-in-time recovery on all DynamoDB tables for disaster recovery
- Configure X-Ray tracing on all Lambda functions for distributed debugging
- Create shared dead letter queue for Lambda failure handling
- Set Lambda concurrent execution limit to 100 to prevent runaway costs
- Migrate lambda_artifacts bucket to storage module with KMS encryption

- Enable KMS encryption for Lambda DLQ using shared DynamoDB KMS key
- Add checkov skip for CKV_AWS_117 (VPC not needed for S3/DynamoDB access)
- Add checkov skip for CKV_AWS_272 (code signing not required)
- Add checkov skip for CKV_AWS_173 (env vars don't contain sensitive data)
- There's no app code? There should be a /app-docai directory?
- This seems to include a lot of extra infrastructure for functionality that doesn't exist in the DocumentAI API app: SQS, Athena, batches, etc.
- Missing the template state tracking in general
This PR should reflect adding a new application based on main of navapbc/strata-template-documentai-api and the current version of the infra template (which reflects main on navapbc/template-infra). Please follow https://navapbc.github.io/platform-cli/adding-an-app/ for adding an app-docai instance (i.e., run infra add-app and app install with appropriate flags).
Thank you, @doshitan. Good point. My plan is to proceed as follows:
This way the platform-test PR shows a proper installation of an approved template, not a work in progress. Objections?
…EventBridge S3 events.
- Remove Lambda functions, layers, and IAM roles
- Add KMS decrypt permissions to dynamodb_read_write IAM policy for encrypted tables
- Configure ephemeral_write_volumes=["/tmp"] for ECS tasks (required by pdf2image)
- Rename environment variables from DDE_* to DOCUMENTAI_* prefix
- Apply workspace prefix to EventBridge source bucket names
- Add file upload jobs:
  - document_processor: watches input_bucket_name with "input/" prefix
  - bda_result_processor: watches output_bucket_name with "processed/" prefix

- Update environment variables to include full S3 paths with prefixes
- Update ECS task commands to use .cli modules instead of .main
- DOCUMENTAI_INPUT_LOCATION now includes /input prefix
- DOCUMENTAI_OUTPUT_LOCATION now includes /processed prefix
- Add DOCUMENTAI_BATCH_INPUT_LOCATION, DOCUMENTAI_BUILD_INPUT_LOCATION to environment variables
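The two file upload jobs above are each tied to one bucket and one key prefix. The routing rule they describe can be sketched as follows. This is only an illustration, not the EventBridge configuration from this PR: the bucket names and the `route_s3_event` helper are hypothetical, and only the job names and the "input/" and "processed/" prefixes come from the commit messages.

```python
from typing import Optional

# Hypothetical bucket names; the deployed buckets carry a workspace prefix.
INPUT_BUCKET = "example-input-bucket"
OUTPUT_BUCKET = "example-output-bucket"

# (bucket, key prefix) -> job name, mirroring the two file upload jobs.
JOB_RULES = {
    (INPUT_BUCKET, "input/"): "document_processor",
    (OUTPUT_BUCKET, "processed/"): "bda_result_processor",
}


def route_s3_event(bucket: str, key: str) -> Optional[str]:
    """Return the job that should handle an uploaded object, or None."""
    for (rule_bucket, prefix), job in JOB_RULES.items():
        if bucket == rule_bucket and key.startswith(prefix):
            return job
    # Objects outside the watched prefixes trigger nothing.
    return None
```

Under these assumptions, an upload to `input/form.pdf` in the input bucket maps to document_processor, while the same key in the output bucket matches no rule.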
…ION environment variables
- Replace the DOCUMENTAI_ prefix with BDA_ in DOCUMENTAI_PROJECT_ARN, DOCUMENTAI_PROFILE_ARN, DOCUMENTAI_REGION
- These variables are specific to Bedrock Data Automation (BDA), so the broader DOCUMENTAI_ prefix does not fit them

…CUMENTAI_BATCH_INPUT_LOCATION environment variable

- Add max_batch_size, max_bda_invoke_retry_attempts environment variables
- Rename DOCUMENTAI_BUILD_INPUT_LOCATION environment variable to a more generic DOCUMENTAI_PREPROCESSING_LOCATION
- Increase service memory from 512MB to 2048MB; resolves batch process crashes caused by OOM exceptions
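A minimal sketch of how settings like max_batch_size and max_bda_invoke_retry_attempts are typically applied: split the work into bounded batches and retry a failing invocation a limited number of times. The `chunk` and `invoke_with_retries` helpers, the exponential backoff, and the `base_delay` parameter are assumptions for illustration; the application's actual batching and retry logic is not shown in this PR.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


def invoke_with_retries(
    invoke: Callable[[], T],
    max_attempts: int,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call invoke() up to max_attempts times, backing off between failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Assumed policy: exponential backoff (0.5s, 1s, 2s, ...).
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real delays. A production version would likely retry only on throttling or transient errors rather than on every exception.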
- Add AWS-managed birth certificate, 1040, and 1099-INT blueprints
- Add custom-employment-termination-letter.json
- Add custom-employment-verification-letter.json
- Add custom-i-766-work-authorization.json
- Add custom-i20-student-visa.json
- Add custom-i94-arrival-and-departure.json
- Add custom-ira-account-document.json
- Add custom-prof-of-lost-health-coverage.json
- Add custom-tax-reporting-statement.json
- Add custom-unemployment-insurance-claim.json
Ticket
Resolves #253
Changes
Add a complete DocumentAI reference implementation demonstrating how to build document processing workflows using the Document Data Extraction (DDE) module, which is backed by Bedrock Data Automation (BDA).
Infrastructure
Application
Context for reviewers
This PR demonstrates a complete DocumentAI application built on top of the DDE module from navapbc/template-infra#989. It shows how to:
The infrastructure uses the updated DDE module tested in #237.
Testing
Deployed to docai-test workspace: https://app-docai.platform-test-dev.navateam.com
GET config
GET v1/schemas
GET v1/documents/{job_id}
Successfully processes documents using BDA with custom and AWS-managed blueprints.
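The three endpoints listed above can be exercised with a small smoke check along these lines. This is a sketch under assumptions: the paths are taken as relative to the site root, a 200 status is assumed to mean success, and `endpoint_urls`, `smoke_test`, and the `get_status` callback are hypothetical helpers, not part of the application.

```python
from typing import Callable, Dict
from urllib.parse import quote, urljoin

# Deployment URL from the Testing section; the trailing slash matters for urljoin.
BASE_URL = "https://app-docai.platform-test-dev.navateam.com/"


def endpoint_urls(job_id: str, base_url: str = BASE_URL) -> Dict[str, str]:
    """Build the URLs for the three GET endpoints listed under Testing."""
    return {
        "config": urljoin(base_url, "config"),
        "schemas": urljoin(base_url, "v1/schemas"),
        # Percent-encode the job id so it is safe as a single path segment.
        "document": urljoin(base_url, "v1/documents/" + quote(job_id, safe="")),
    }


def smoke_test(job_id: str, get_status: Callable[[str], int]) -> Dict[str, bool]:
    """Map each endpoint name to True when it returned HTTP 200 (assumed success)."""
    return {
        name: get_status(url) == 200
        for name, url in endpoint_urls(job_id).items()
    }
```

`get_status` is injected so the check can be run against a stub or wired to any HTTP client; any authentication the API requires would need to be handled inside it.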
Preview environment for app
Preview environment for app-flask
Preview environment for app-nextjs
Preview environment for app-rails