
Add DocumentAI reference implementation using DDE module #252

Open
laurencegoolsby wants to merge 80 commits into main from lgoolsby/app-documentai

Conversation

@laurencegoolsby laurencegoolsby commented Jan 23, 2026

Ticket

Resolves #253

Changes

Add a complete DocumentAI reference implementation demonstrating how to build document processing workflows using the Document Data Extraction (DDE) module, which is built on Bedrock Data Automation (BDA).

Infrastructure

  • Add Lambda functions for document processing workflow
  • Add DynamoDB table for job tracking with GSI for job ID queries
  • Add EventBridge rules for BDA event handling
  • Add S3 event notifications for automated processing
  • Configure DDE module with custom and AWS-managed blueprints
  • Add IAM policies for Lambda execution and BDA invocation

Application

  • Integrate with DDE module for document data extraction
  • Support for 8 AWS-managed blueprints (invoice, receipt, W2, driver license, passport, etc.)
  • Support for custom blueprints (social security card)
  • Document processing workflow with async job tracking
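
The async job tracking listed above can be pictured as a small state machine keyed by job ID. The sketch below is illustrative only: it uses an in-memory dict in place of the DynamoDB table, and the `JobStatus` values, the `JobStore` class, and its method names are hypothetical rather than taken from this PR.

```python
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class JobStatus(Enum):
    # Hypothetical status values; the real table may use different ones.
    SUBMITTED = "submitted"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Transitions allowed in this sketch: a job only ever moves forward.
_ALLOWED = {
    JobStatus.SUBMITTED: {JobStatus.PROCESSING, JobStatus.FAILED},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


@dataclass
class Job:
    job_id: str
    object_key: str
    status: JobStatus = JobStatus.SUBMITTED
    result: Optional[dict] = None


class JobStore:
    """In-memory stand-in for the job tracking table (an assumption)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def submit(self, object_key: str) -> Job:
        # Record a new job as soon as the document is accepted.
        job = Job(job_id=str(uuid.uuid4()), object_key=object_key)
        self._jobs[job.job_id] = job
        return job

    def advance(
        self, job_id: str, new_status: JobStatus, result: Optional[dict] = None
    ) -> Job:
        # Called when a processing event arrives for the job.
        job = self._jobs[job_id]
        if new_status not in _ALLOWED[job.status]:
            raise ValueError(
                f"cannot move {job.status.value} -> {new_status.value}"
            )
        job.status = new_status
        job.result = result
        return job

    def get(self, job_id: str) -> Optional[Job]:
        # Lookup by job ID, as an endpoint such as GET v1/documents/{job_id} might do.
        return self._jobs.get(job_id)
```

A caller would `submit` when a document is uploaded, `advance` as events arrive, and `get` when a client polls for the result.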

Context for reviewers

This PR demonstrates a complete DocumentAI application built on top of the DDE module from navapbc/template-infra#989. It shows how to:

  • Use the DDE module in a real application
  • Handle BDA events and job tracking
  • Integrate custom and AWS-managed blueprints
  • Build serverless document processing workflows

The infrastructure uses the updated DDE module tested in #237.

Testing

Deployed to docai-test workspace: https://app-docai.platform-test-dev.navateam.com

GET config

[image]

GET v1/schemas

[image]

GET v1/documents/{job_id}

[image]

Successfully processes documents using BDA with custom and AWS-managed blueprints.

Preview environment for app

Preview environment for app-flask

Preview environment for app-nextjs

Preview environment for app-rails

damianj and others added 30 commits November 6, 2025 13:38
- Add enable_document_data_extraction boolean
- Add environment variables and bucket policies to service
- Add additional outputs for BDA project/blueprint
- Move infra/modules/document-data-extraction/resources/bedrock-data-automation/* to infra/modules/document-data-extraction/resources
- Rename infra/app-flask/service/blueprints to document-data-extraction-blueprints
- Remove bda_ prefix from infra/modules/document-data-extraction/resources/variables.tf and infra/app-flask/service/document_data_extraction.tf
- Update infra/app-flask/app-config/env-config/document_data_extraction.tf name and path
- Added local.prefix to bucket names and BDA project name
- Updated environment variables to use prefixed bucket names
- Create KMS key
- Remove KMS configuration parameters from module interface
- Add Bedrock Data Automation access_policy_arn output for service integration
- Update blueprints_map to use map(string) tags instead of complex objects
- Remove enabled_blueprint logic in favor of reading blueprints directory
- Update README
…ment data extraction integration

- Create /test-document-data-extraction endpoint in Flask app
- Update boto3/botocore to support latest BDA API (1.38.0+)
- Fix Terraform outputs to include bucket name prefixes
- Add DDE environment variables to local.env template
…ts.tf, infra/modules/document-data-extraction/resources/variables.tf
- Remove Bedrock Data Automation KMS key alias
- Remove unused Document Data Extraction outputs
- Format code via make format
- Upgrade moto from version 4.0.2 to 5.1.18
- Update to use moto.mock_aws as moto.mock_s3 was removed from moto version 5x
- Upgrade urllib3 to ^2.6.0 (fixes GHSA-2xpw-w6gg-jr37, GHSA-gm62-xv2j-4w53, GHSA-pq67-6m6q-mj2v)
- Upgrade filelock to ^3.20.1 (fixes GHSA-w853-jp5j-5j7f)
- Addresses high and medium severity vulnerabilities found by grype scan
@doshitan
Contributor

navapbc/template-infra#990 is about adding DDE module test support to template-infra's example app (template-only-app). This work, adding a new DocumentAI API instance, is different, so I removed the mention of navapbc/template-infra#990 from here and added #253.

- Add externalReferenceId attribute and GSI for querying documents by external reference id (e.g. caseId)
- Add DDE_EXTERNAL_REF_ID_INDEX_NAME environment variable
- GSI enables grouping documents by case/client identifier
- Add metrics.tf with Glue database, table, SQS queue, and Athena workgroup
- Configure S3 buckets for raw metrics and Athena results
- Add EventBridge schedules for metrics processor and aggregator
- Configure partition projection for efficient Athena queries
- Add IAM policies for metrics Lambda functions
- Add kms:Encrypt to storage module IAM policy for Lambda encryption access
- Export kms_key_arn from storage module for workgroup configuration
- Configure Athena workgroup with SSE-KMS encryption using bucket's KMS key
- Add s3:GetBucketLocation to metrics_aggregator policy for bucket verification
- Add multipage-upload-session table with sessionId (primary key) and page_number (sort key)
- Configure GSI for querying sessions by sessionId
- Supports multipage document upload feature endpoints
- Rename clientId to tenantId in DynamoDB tables and GSIs
- Update document_metadata table partition key index name
- Add tenantId GSI to document_metadata table
- Add documentai_batches table with tenantId support
- Add DOCUMENTAI_BATCH_TABLE_NAME environment variable
- Renamed DynamoDB table from multipage_upload_sessions to document_builds
- Changed primary key from sessionId to buildId
- Updated environment variable from DOCUMENTAI_MULTIPAGE_UPLOAD_SESSIONS_TABLE_NAME to DOCUMENTAI_DOCUMENT_BUILDS_TABLE_NAME
- Add KMS encryption for DynamoDB tables (document_metadata, document_builds, documentai_batches)
- Enable point-in-time recovery on all DynamoDB tables for disaster recovery
- Configure X-Ray tracing on all Lambda functions for distributed debugging
- Create shared dead letter queue for Lambda failure handling
- Set Lambda concurrent execution limit to 100 to prevent runaway costs
- Migrate lambda_artifacts bucket to storage module with KMS encryption
- Enable KMS encryption for Lambda DLQ using shared DynamoDB KMS key
- Add checkov skip for CKV_AWS_117 (VPC not needed for S3/DynamoDB access)
- Add checkov skip for CKV_AWS_272 (code signing not required)
- Add checkov skip for CKV_AWS_173 (env vars don't contain sensitive data)
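
The externalReferenceId GSI described in the commits above exists so that documents can be grouped by a case or client identifier. As a rough illustration of that access pattern, the sketch below builds an in-memory index over a list of records. Only the `externalReferenceId` attribute name comes from the commits; the `build_index` helper, the `jobId` field, and the sample records are hypothetical, and a real query would go to DynamoDB instead.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(items: List[dict], key: str) -> Dict[str, List[dict]]:
    """Group items by the value of `key`, skipping items that lack it.

    Skipping mirrors how a DynamoDB GSI is sparse: items without the
    index key attribute do not appear in the index.
    """
    index: Dict[str, List[dict]] = defaultdict(list)
    for item in items:
        value = item.get(key)
        if value is not None:
            index[value].append(item)
    return dict(index)


# Hypothetical records for illustration only.
documents = [
    {"jobId": "job-1", "externalReferenceId": "case-100"},
    {"jobId": "job-2", "externalReferenceId": "case-200"},
    {"jobId": "job-3", "externalReferenceId": "case-100"},
    {"jobId": "job-4"},  # no external reference, so it is not indexed
]

by_case = build_index(documents, "externalReferenceId")
case_100_jobs = [doc["jobId"] for doc in by_case["case-100"]]
```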
@laurencegoolsby laurencegoolsby requested a review from doshitan March 3, 2026 16:29
Contributor

@doshitan doshitan left a comment

  • There's no app code? There should be a /app-docai directory?
  • This seems to include a lot of extra infrastructure for functionality that doesn't exist in the DocumentAI API app: SQS, Athena, batches, etc.
  • Missing the template state tracking in general

This PR should reflect adding a new application based on main of navapbc/strata-template-documentai-api and the current version of the infra template (which reflects main on navapbc/template-infra). Please follow https://navapbc.github.io/platform-cli/adding-an-app/ for adding an app-docai instance (i.e., run infra add-app and app install with appropriate flags).

@laurencegoolsby
Author

laurencegoolsby commented Mar 3, 2026

  • There's no app code? There should be a /app-docai directory?
  • This seems to include a lot of extra infrastructure for functionality that doesn't exist in the DocumentAI API app: SQS, Athena, batches, etc.
  • Missing the template state tracking in general

This PR should reflect adding a new application based on main of navapbc/strata-template-documentai-api and the current version of the infra template (which reflects main on navapbc/template-infra). Please follow https://navapbc.github.io/platform-cli/adding-an-app/ for adding an app-docai instance (i.e., run infra add-app and app install with appropriate flags).

Thank you, @doshitan. Good point.

The strata-template-documentai-api has unmerged branches with the features (batches, metrics, document builds) that this infrastructure supports.

My plan is to proceed as follows:

  1. Get the strata-template PRs reviewed and merged to main first
  2. Then run app install from the merged template to create /app-docai/ in platform-test
  3. Update this PR to match the infrastructure to the installed app

This way the platform-test PR shows a proper installation of an approved template, not a work-in-progress.

Objections?

…EventBridge S3 events.

- Remove Lambda functions, layers, and IAM roles
- Add KMS decrypt permissions to dynamodb_read_write IAM policy for encrypted tables
- Configure ephemeral_write_volumes=["/tmp"] for ECS tasks (required by pdf2image)
- Rename environment variables from DDE_* to DOCUMENTAI_* prefix
- Apply workspace prefix to EventBridge source bucket names

- Add file upload jobs:
  - document_processor: watches input_bucket_name with "input/" prefix
  - bda_result_processor: watches output_bucket_name with "processed/" prefix
- Update environment variables to include full S3 paths with prefixes
- Update ECS task commands to use .cli modules instead of .main
- DOCUMENTAI_INPUT_LOCATION now includes /input prefix
- DOCUMENTAI_OUTPUT_LOCATION now includes /processed prefix
- Add DOCUMENTAI_BATCH_INPUT_LOCATION, DOCUMENTAI_BUILD_INPUT_LOCATION to environment variables
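
The two file upload jobs above are distinguished by bucket and key prefix. A minimal sketch of that routing decision follows. The job names and the `input/` and `processed/` prefixes come from the commit message; the bucket names, the `ROUTES` table, and the `route_s3_event` function are hypothetical, and in the actual infrastructure this matching is presumably done by EventBridge rules rather than application code.

```python
from typing import List, Optional, Tuple

# (bucket, key prefix, job name). Bucket names are placeholders.
ROUTES: List[Tuple[str, str, str]] = [
    ("example-input-bucket", "input/", "document_processor"),
    ("example-output-bucket", "processed/", "bda_result_processor"),
]


def route_s3_event(bucket: str, key: str) -> Optional[str]:
    """Return the job that should handle an object, or None if no rule matches."""
    for route_bucket, prefix, job_name in ROUTES:
        if bucket == route_bucket and key.startswith(prefix):
            return job_name
    return None
```

Objects written outside the watched prefixes match no route, which is what keeps one job's output from accidentally triggering another.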
…ION environment variables

- Rename: replace the DOCUMENTAI_ prefix with BDA_ in DOCUMENTAI_PROJECT_ARN, DOCUMENTAI_PROFILE_ARN, DOCUMENTAI_REGION
- Variables are specific to Bedrock Data Automation (BDA) rather than the broader DOCUMENTAI_ prefix
…CUMENTAI_BATCH_INPUT_LOCATION environment variable
- Add max_batch_size, max_bda_invoke_retry_attempts environment variables
- Rename DOCUMENTAI_BUILD_INPUT_LOCATION environment variable to a more generic DOCUMENTAI_PREPROCESSING_LOCATION
- Increase service memory to 2048MB from 512MB; resolves batch process crash due to OOM exceptions
- Add AWS-managed birth certificate, 1040, and 1099-INT blueprints
- Add custom-employment-termination-letter.json
- Add custom-employment-verification-letter.json
- Add custom-i-766-work-authorization.json
- Add custom-i20-student-visa.json
- Add custom-i94-arrival-and-departure.json
- Add custom-ira-account-document.json
- Add custom-prof-of-lost-health-coverage.json
- Add custom-tax-reporting-statement.json
- Add custom-unemployment-insurance-claim.json
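
The max_batch_size and max_bda_invoke_retry_attempts variables added in the commits above suggest two common patterns: splitting work into bounded batches and retrying a failed invocation a limited number of times. The sketch below shows one way those could look. It is an assumption about intent, not the app's code: the `chunk` and `invoke_with_retry` helpers are hypothetical, and the retry limit is interpreted here as the total number of attempts.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunk(items: Sequence[T], max_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most max_batch_size items."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


def invoke_with_retry(
    invoke: Callable[[], R],
    max_attempts: int,
    base_delay_seconds: float = 0.0,
) -> R:
    """Call invoke() up to max_attempts times with exponential backoff.

    The last failure is re-raised so the caller can mark the job as failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception:
            if attempt == max_attempts:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            time.sleep(base_delay_seconds * (2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Bounding the batch size also limits how much data is held in memory at once, which is relevant given the OOM crash mentioned in the same commit.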

Development

Successfully merging this pull request may close these issues.

Create DocumentAI API instance

5 participants