Add DocumentAI reference implementation using DDE module by laurencegoolsby · Pull Request #252 · navapbc/platform-test

laurencegoolsby · 2026-01-23T17:07:49Z

Ticket

Resolves #253

Changes

Add a complete DocumentAI reference implementation demonstrating how to build document processing workflows using the Document Data Extraction (DDE) module, which is built on Bedrock Data Automation (BDA).

Infrastructure

Add Lambda functions for document processing workflow
Add DynamoDB table for job tracking with GSI for job ID queries
Add EventBridge rules for BDA event handling
Add S3 event notifications for automated processing
Configure DDE module with custom and AWS-managed blueprints
Add IAM policies for Lambda execution and BDA invocation

Application

Integrate with DDE module for document data extraction
Support for 8 AWS-managed blueprints (invoice, receipt, W2, driver license, passport, etc.)
Support for custom blueprints (social security card)
Document processing workflow with async job tracking

Context for reviewers

This PR demonstrates a complete DocumentAI application built on top of the DDE module from navapbc/template-infra#989. It shows how to:

Use the DDE module in a real application
Handle BDA events and job tracking
Integrate custom and AWS-managed blueprints
Build serverless document processing workflows

The infrastructure uses the updated DDE module tested in #237.

Testing

Deployed to docai-test workspace: https://app-docai.platform-test-dev.navateam.com

GET config

GET v1/schemas

GET v1/documents/{job_id}

Successfully processes documents using BDA with custom and AWS-managed blueprints.
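
For anyone repeating the checks above, here is a minimal sketch of how the three endpoint URLs could be built against the docai-test deployment. It assumes the routes are served from the root of the service URL; the helper name `endpoint_urls` and the sample job ID are hypothetical and not part of this PR.

```python
from typing import Dict
from urllib.parse import quote, urljoin

# Service URL of the docai-test workspace, taken from the Testing section.
BASE_URL = "https://app-docai.platform-test-dev.navateam.com"


def endpoint_urls(base_url: str, job_id: str) -> Dict[str, str]:
    """Build the URLs for the three GET endpoints listed under Testing.

    Assumption: the routes hang directly off the service root.
    """
    base = base_url.rstrip("/") + "/"
    return {
        "config": urljoin(base, "config"),
        "schemas": urljoin(base, "v1/schemas"),
        # Percent-encode the job ID so unusual characters cannot alter the path.
        "document": urljoin(base, "v1/documents/" + quote(job_id, safe="")),
    }


if __name__ == "__main__":
    for name, url in endpoint_urls(BASE_URL, "example-job-id").items():
        print(f"GET {name}: {url}")
```

The URLs can then be requested with any HTTP client; authentication, if the deployment requires it, is not covered here.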

Preview environment for app

Service endpoint: https://p-252-app-dev-1606277548.us-east-1.elb.amazonaws.com
Deployed commit: 97f0bd6

Preview environment for app-flask

Service endpoint: https://p-252-app-flask-dev-1989389458.us-east-1.elb.amazonaws.com
Deployed commit: 2489b62

Preview environment for app-nextjs

Service endpoint: https://p-252-app-nextjs-dev-1700365542.us-east-1.elb.amazonaws.com
Deployed commit: 97f0bd6

Preview environment for app-rails

Service endpoint: https://p-252-app-rails-dev-1841623302.us-east-1.elb.amazonaws.com
Deployed commit: 97f0bd6

- Move infra/modules/document-data-extraction/resources/bedrock-data-automation/* to infra/modules/document-data-extraction/resources
- Rename infra/app-flask/service/blueprints to document-data-extraction-blueprints
- Remove bda_ prefix from infra/modules/document-data-extraction/resources/variables.tf and infra/app-flask/service/document_data_extraction.tf
- Update infra/app-flask/app-config/env-config/document_data_extraction.tf name and path

- Create KMS key
- Remove KMS configuration parameters from module interface
- Add Bedrock Data Automation access_policy_arn output for service integration
- Update blueprints_map to use map(string) tags instead of complex objects
- Remove enabled_blueprint logic in favor of reading blueprints directory
- Update README

…ment data extraction integration
- Create /test-document-data-extraction endpoint in Flask app
- Update boto3/botocore to support latest BDA API (1.38.0+)
- Fix Terraform outputs to include bucket name prefixes
- Add DDE environment variables to local.env template

doshitan · 2026-01-29T01:10:06Z

navapbc/template-infra#990 is about adding DDE module test support to template-infra's example app (template-only-app). This work, which adds a new DocumentAI API instance, is different, so I removed the mention of navapbc/template-infra#990 from here and added #253.

- Add metrics.tf with Glue database, table, SQS queue, and Athena workgroup
- Configure S3 buckets for raw metrics and Athena results
- Add EventBridge schedules for metrics processor and aggregator
- Configure partition projection for efficient Athena queries
- Add IAM policies for metrics Lambda functions

- Add kms:Encrypt to storage module IAM policy for Lambda encryption access
- Export kms_key_arn from storage module for workgroup configuration
- Configure Athena workgroup with SSE-KMS encryption using bucket's KMS key
- Add s3:GetBucketLocation to metrics_aggregator policy for bucket verification

- Rename clientId to tenantId in DynamoDB tables and GSIs
- Update document_metadata table partition key index name
- Add tenantId GSI to document_metadata table
- Add documentai_batches table with tenantId support
- Add DOCUMENTAI_BATCH_TABLE_NAME environment variable

- Renamed DynamoDB table from multipage_upload_sessions to document_builds
- Changed primary key from sessionId to buildId
- Updated environment variable from DOCUMENTAI_MULTIPAGE_UPLOAD_SESSIONS_TABLE_NAME to DOCUMENTAI_DOCUMENT_BUILDS_TABLE_NAME

- Add KMS encryption for DynamoDB tables (document_metadata, document_builds, documentai_batches)
- Enable point-in-time recovery on all DynamoDB tables for disaster recovery
- Configure X-Ray tracing on all Lambda functions for distributed debugging
- Create shared dead letter queue for Lambda failure handling
- Set Lambda concurrent execution limit to 100 to prevent runaway costs
- Migrate lambda_artifacts bucket to storage module with KMS encryption

- Enable KMS encryption for Lambda DLQ using shared DynamoDB KMS key
- Add checkov skip for CKV_AWS_117 (VPC not needed for S3/DynamoDB access)
- Add checkov skip for CKV_AWS_272 (code signing not required)
- Add checkov skip for CKV_AWS_173 (env vars don't contain sensitive data)

doshitan

There's no app code? There should be a /app-docai directory?
This seems to include a lot of extra infrastructure for functionality that doesn't exist in the DocumentAI API app: SQS, Athena, batches, etc.?
Missing the template state tracking in general

This PR should reflect adding a new application based on main of navapbc/strata-template-documentai-api and the current version of the infra template (which reflects main on navapbc/template-infra). Please follow https://navapbc.github.io/platform-cli/adding-an-app/ for adding an app-docai instance (i.e., run infra add-app and app install with appropriate flags).

laurencegoolsby · 2026-03-03T18:18:48Z

There's no app code? There should be a /app-docai directory?

This seems to include a lot of extra infrastructure for functionality that doesn't exist in the DocumentAI API app: SQS, Athena, batches, etc.?

Missing the template state tracking in general

This PR should reflect adding a new application based on main of navapbc/strata-template-documentai-api and the current version of the infra template (which reflects main on navapbc/template-infra). Please follow https://navapbc.github.io/platform-cli/adding-an-app/ for adding an app-docai instance (i.e., run infra add-app and app install with appropriate flags).

Thank you, @doshitan. Good point.

The strata-template-documentai-api has unmerged branches with the features (batches, metrics, document builds) that this infrastructure supports.

My plan is to proceed as follows:

Get the strata-template PRs reviewed and merged to main first
Then run app install from the merged template to create /app-docai/ in platform-test
Update this PR to match the infrastructure to the installed app

This way the platform-test PR shows a proper installation of an approved template, not a work-in-progress.

Objections?

…EventBridge S3 events.
- Remove Lambda functions, layers, and IAM roles
- Add KMS decrypt permissions to dynamodb_read_write IAM policy for encrypted tables
- Configure ephemeral_write_volumes=["/tmp"] for ECS tasks (required by pdf2image)
- Rename environment variables from DDE_* to DOCUMENTAI_* prefix
- Apply workspace prefix to EventBridge source bucket names
- Add file upload jobs:
  - document_processor: watches input_bucket_name with "input/" prefix
  - bda_result_processor: watches output_bucket_name with "processed/" prefix
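
The two file upload jobs above amount to a routing rule on bucket and key prefix. The following is a minimal sketch of that rule in plain Python, for illustration only: in this PR the matching is configured in infrastructure, not in application code, and the function name `route_upload` and the bucket names in the usage lines are hypothetical.

```python
from typing import Optional


def route_upload(bucket: str, key: str, input_bucket: str, output_bucket: str) -> Optional[str]:
    """Return the name of the job that should handle an uploaded object.

    Mirrors the two watchers described in the commit message:
    - document_processor handles objects under "input/" in the input bucket
    - bda_result_processor handles objects under "processed/" in the output bucket
    Anything else is ignored (None).
    """
    if bucket == input_bucket and key.startswith("input/"):
        return "document_processor"
    if bucket == output_bucket and key.startswith("processed/"):
        return "bda_result_processor"
    return None


if __name__ == "__main__":
    # Hypothetical bucket names, used only to exercise the rule.
    print(route_upload("docs-in", "input/w2.pdf", "docs-in", "docs-out"))
    print(route_upload("docs-out", "processed/job-1/result.json", "docs-in", "docs-out"))
    print(route_upload("docs-in", "other/readme.txt", "docs-in", "docs-out"))
```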

- Update environment variables to include full S3 paths with prefixes
- Update ECS task commands to use .cli modules instead of .main
- DOCUMENTAI_INPUT_LOCATION now includes /input prefix
- DOCUMENTAI_OUTPUT_LOCATION now includes /processed prefix
- Add DOCUMENTAI_BATCH_INPUT_LOCATION, DOCUMENTAI_BUILD_INPUT_LOCATION to environment variables
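
Since the location variables now carry a prefix as well as a bucket, consumers have to split the two apart. Below is a minimal sketch of one way to do that, assuming the values are `s3://bucket/prefix` URIs; the exact format is not shown in this PR, and the names `S3Location`, `parse_s3_location`, and `object_uri` are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class S3Location(NamedTuple):
    bucket: str
    prefix: str  # no leading or trailing slash; may be empty


def parse_s3_location(location: str) -> S3Location:
    """Split an s3://bucket/prefix value into bucket and prefix.

    Assumption: DOCUMENTAI_INPUT_LOCATION and friends hold s3:// URIs.
    """
    parts = urlsplit(location)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {location!r}")
    return S3Location(bucket=parts.netloc, prefix=parts.path.strip("/"))


def object_uri(location: str, filename: str) -> str:
    """Build the full URI of an object stored under a location."""
    loc = parse_s3_location(location)
    key = f"{loc.prefix}/{filename}" if loc.prefix else filename
    return f"s3://{loc.bucket}/{key}"


if __name__ == "__main__":
    print(parse_s3_location("s3://example-bucket/input"))
    print(object_uri("s3://example-bucket/input/", "invoice.pdf"))
```

Normalizing slashes in one place avoids keys with doubled or missing separators when a location is written with or without a trailing slash.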

…ION environment variables
- Replace the DOCUMENTAI_ prefix with BDA_ in DOCUMENTAI_PROJECT_ARN, DOCUMENTAI_PROFILE_ARN, DOCUMENTAI_REGION
- These variables are specific to Bedrock Data Automation (BDA), so the BDA_ prefix fits better than the broader DOCUMENTAI_ prefix

- Add max_batch_size, max_bda_invoke_retry_attempts environment variables
- Rename DOCUMENTAI_BUILD_INPUT_LOCATION environment variable to the more generic DOCUMENTAI_PREPROCESSING_LOCATION
- Increase service memory from 512 MB to 2048 MB; resolves batch process crashes caused by out-of-memory (OOM) errors
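
To illustrate what the two new limits are for, here is a minimal sketch of batch chunking and bounded retries driven by environment variables. The upper-case variable names, the default values, and all function names are assumptions for illustration; the PR only names the settings as max_batch_size and max_bda_invoke_retry_attempts, and the app's actual behavior may differ.

```python
import os
from typing import Callable, Iterator, List, Mapping, Sequence, Tuple, TypeVar

T = TypeVar("T")


def read_limits(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read both limits. Variable names and defaults here are hypothetical."""
    max_batch_size = int(env.get("MAX_BATCH_SIZE", "10"))
    max_attempts = int(env.get("MAX_BDA_INVOKE_RETRY_ATTEMPTS", "3"))
    if max_batch_size < 1 or max_attempts < 1:
        raise ValueError("both limits must be at least 1")
    return max_batch_size, max_attempts


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def invoke_with_retry(invoke: Callable[[], T], max_attempts: int) -> T:
    """Call `invoke` until it succeeds or `max_attempts` calls have failed."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return invoke()
        except Exception as error:  # a real caller would catch narrower errors
            last_error = error
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    batch_size, attempts = read_limits()
    for batch in chunked(["a.pdf", "b.pdf", "c.pdf"], batch_size):
        print(invoke_with_retry(lambda: f"submitted {len(batch)} documents", attempts))
```

The sketch retries immediately; a production version would normally wait between attempts, for example with exponential backoff.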

- Add AWS-managed birth certificate, 1040, and 1099-INT blueprints
- Add custom-employment-termination-letter.json
- Add custom-employment-verification-letter.json
- Add custom-i-766-work-authorization.json
- Add custom-i20-student-visa.json
- Add custom-i94-arrival-and-departure.json
- Add custom-ira-account-document.json
- Add custom-prof-of-lost-health-coverage.json
- Add custom-tax-reporting-statement.json
- Add custom-unemployment-insurance-claim.json

damianj and others added 30 commits November 6, 2025 13:38

test out bda module

e54746e

attach bucket policies from storage module to role

8352efa

format

e8cb3d0

use list and convert to set

40aee44

use list and convert to set

ab7d8c4

typo

f0199c4

use map instead of set

86ff682

add cloudformation to list of services CICD role is allowed to manage

ac77968

trigger pipeline

fd3f33d

delete resources

a2a48b5

delete resources

96ad262

delete resources

1e1c897

delete resources

7856f79

format

50ff1c9

recreate resources

6d586a9

add readme

c1baf3b

Update document data extraction module

4af7042

- Add enable_document_data_extraction boolean
- Add environment variables and bucket policies to service
- Add additional outputs for BDA project/blueprint

Update resource naming to include local.prefix

cc7a87f

- Added local.prefix to bucket names and BDA project name
- Updated environment variables to use prefixed bucket names

Fix override_configuration variable naming

102a032

Update OpenAPI spec

4ef26a5

Update infra/app-flask/service/main.tf, infra/app-flask/service/outputs.tf, infra/modules/document-data-extraction/resources/variables.tf

7aa1a33

Merge branch 'bda-module' of https://github.com/navapbc/platform-test into bda-module

a78c0a7

Update per PR feedback.

5c4c7f2

- Remove Bedrock Data Automation KMS key alias
- Remove unused Document Data Extraction outputs
- Format code via make format

Address lint findings

89e12e9

Fix failing test_create_user_csv.py test.

ec68315

- Upgrade moto from version 4.0.2 to 5.1.18
- Update to use moto.mock_aws, as moto.mock_s3 was removed in moto 5.x

Fix security vulnerabilities in urllib3 and filelock

4886a25

- Upgrade urllib3 to ^2.6.0 (fixes GHSA-2xpw-w6gg-jr37, GHSA-gm62-xv2j-4w53, GHSA-pq67-6m6q-mj2v)
- Upgrade filelock to ^3.20.1 (fixes GHSA-w853-jp5j-5j7f)
- Addresses high and medium severity vulnerabilities found by grype scan

Update local.env

2e2e3c1

laurencegoolsby added 8 commits January 22, 2026 21:59

Format terraform files

2699ca3

Add checkov skip for AWS-managed encryption

bc0c355

Rename blueprint to avoid KMS key conflict with existing blueprint

5970af4

Fix S3 URI construction in DDE endpoint

2faedf7

Attach bedrock access policy to app role for DDE invocation

afd08ac

Add InvokeDataAutomationAsync permission and profile resource access

ea72d79

Merge bda-module to get updated DDE and storage modules

c533d3e

Move custom blueprint to service directory and update path

ec4b2ac

laurencegoolsby added 12 commits February 12, 2026 10:03

Add externalReferenceId GSI to document metadata table

8c0cadd

- Add externalReferenceId attribute and GSI for querying documents by external reference id (e.g. caseId)
- Add DDE_EXTERNAL_REF_ID_INDEX_NAME environment variable
- GSI enables grouping documents by case/client identifier
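
As a rough illustration of how the new GSI could be used, the sketch below builds the keyword arguments for a DynamoDB `Query` that looks up documents by external reference id. It assumes `externalReferenceId` is a string attribute and the partition key of the index; the function name, table name, and index name in the usage lines are hypothetical.

```python
import os
from typing import Any, Dict, Mapping


def external_ref_query(
    table_name: str,
    external_reference_id: str,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Any]:
    """Build DynamoDB Query parameters for the externalReferenceId GSI.

    The index name comes from DDE_EXTERNAL_REF_ID_INDEX_NAME, the environment
    variable added in this commit. Assumption: the attribute is of type string.
    """
    return {
        "TableName": table_name,
        "IndexName": env["DDE_EXTERNAL_REF_ID_INDEX_NAME"],
        "KeyConditionExpression": "externalReferenceId = :ref",
        "ExpressionAttributeValues": {":ref": {"S": external_reference_id}},
    }


if __name__ == "__main__":
    # Hypothetical table and index names, for demonstration only.
    params = external_ref_query(
        "example-document-metadata",
        "case-42",
        env={"DDE_EXTERNAL_REF_ID_INDEX_NAME": "example-external-ref-index"},
    )
    print(params)
```

The resulting dictionary is shaped for the low-level boto3 client, e.g. `boto3.client("dynamodb").query(**params)`; paging through results with `LastEvaluatedKey` is left out of the sketch.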

Merge branch 'main' into lgoolsby/app-documentai

ea8a115

Increase documentai_metrics_aggregator lambda timeout to 300 seconds

b4913f4

Add metrics environment variables to main.tf

b4ee75b

Add DynamoDB table for multipage upload sessions

328c195

- Add multipage-upload-session table with sessionId (primary key) and page_number (sort key)
- Configure GSI for querying sessions by sessionId
- Supports multipage document upload feature endpoints

Enable KMS encryption for metrics SQS queue

0a08dbf

laurencegoolsby requested a review from doshitan March 3, 2026 16:29

doshitan requested changes Mar 3, 2026

laurencegoolsby added 6 commits March 3, 2026 23:47

Update file_upload_jobs to use main rather than cli, remove unused DOCUMENTAI_BATCH_INPUT_LOCATION environment variable

e22d368

laurencegoolsby wants to merge 80 commits into main from lgoolsby/app-documentai

5 participants
