Feature: Automatic Document Language Detection and Locale Parametrization #11

@dgomesbr

Summary

Implement automatic language detection for uploaded PDF documents and dynamically apply the appropriate locale settings to Adobe PDF Services API calls for optimal processing results across multiple languages.

🎯 Motivation

Currently, the PDF accessibility processing pipeline uses a hardcoded English locale (en-US) for all documents. This limits the effectiveness of Adobe PDF Services' autotagging and extraction capabilities for non-English documents, particularly for languages like Spanish, Catalan, French, German, and others that have specific linguistic rules and accessibility requirements.

✨ Features Implemented

1. Automatic Language Detection

  • AWS Comprehend Integration: Utilizes AWS Comprehend's DetectDominantLanguage API to analyze document content
  • Smart Text Sampling: Extracts text from the first 5 pages of the PDF for language analysis
  • Confidence Thresholding: Only applies detected language if confidence score ≥ 70%
  • Graceful Fallbacks: Defaults to English (en-US) for low-confidence detections or errors
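
A minimal sketch of the thresholding and fallback behaviour listed above, not the code in autotag.py. The helper name `detect_language_code` and the injected `comprehend_client` argument are assumptions made for illustration. The response keys (`Languages`, `LanguageCode`, `Score`) follow the shape returned by Comprehend's DetectDominantLanguage. The sketch returns `None` whenever the caller should fall back to the default locale.

```python
import logging

logger = logging.getLogger(__name__)

MIN_CONFIDENCE = 0.70  # threshold stated in this issue


def detect_language_code(text, comprehend_client, min_confidence=MIN_CONFIDENCE):
    """Return the dominant language code, or None if the caller should fall back.

    `comprehend_client` is any object exposing detect_dominant_language(Text=...),
    for example boto3.client("comprehend"). Hypothetical helper, for illustration.
    """
    try:
        response = comprehend_client.detect_dominant_language(Text=text)
    except Exception as exc:  # API failures, throttling, network issues
        logger.warning("Language detection failed: %s", exc)
        return None

    languages = response.get("Languages", [])
    if not languages:
        return None

    # Pick the candidate with the highest confidence score.
    best = max(languages, key=lambda lang: lang["Score"])
    logger.info(
        "Detected language: %s (confidence: %.2f)",
        best["LanguageCode"],
        best["Score"],
    )
    if best["Score"] < min_confidence:
        return None
    return best["LanguageCode"]
```

Passing the client in as an argument keeps the function easy to exercise with a stub in the error scenarios, without network access or AWS credentials.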

2. Comprehensive Language Support

Supports 30+ languages with proper locale mapping:

Language     AWS Code   Adobe Locale   Region
English      en         en-US          United States
Spanish      es         es-ES          Spain
Catalan      ca         ca-ES          Spain
French       fr         fr-FR          France
German       de         de-DE          Germany
Italian      it         it-IT          Italy
Portuguese   pt         pt-BR          Brazil
Japanese     ja         ja-JP          Japan
Chinese      zh         zh-CN          China (Simplified)
And 20+ more...

3. Integrated Processing Pipeline

  • Autotagging: Applies detected locale to AutotagPDFParams for language-aware accessibility tagging
  • Text Extraction: Uses detected locale in ExtractPDFParams for improved text and table extraction
  • PDF Metadata: Sets document language metadata consistently across the pipeline

4. Enhanced Error Handling & Logging

  • Comprehensive logging of detection process and confidence scores
  • Truncates sampled text to 5000 bytes before calling AWS Comprehend, to stay within API size limits
  • Manages insufficient text scenarios gracefully
  • Detailed error reporting for troubleshooting
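
A sketch of how the size limit and the insufficient-text case could be handled before calling Comprehend. The helper name `prepare_sample` and the 20-character minimum are assumptions. The 5000-byte cap is the figure quoted in this issue. Truncation is done on UTF-8 bytes rather than characters, so accented or CJK text cannot exceed the cap.

```python
from typing import Optional

MAX_BYTES = 5000  # cap quoted in this issue
MIN_CHARS = 20    # assumed minimum for a meaningful detection


def prepare_sample(
    text: str, max_bytes: int = MAX_BYTES, min_chars: int = MIN_CHARS
) -> Optional[str]:
    """Return text trimmed to at most max_bytes of UTF-8, or None if too short."""
    cleaned = " ".join(text.split())  # collapse runs of whitespace and newlines
    if len(cleaned) < min_chars:
        return None  # e.g. scanned/image PDFs with little extractable text

    encoded = cleaned.encode("utf-8")
    if len(encoded) <= max_bytes:
        return cleaned

    # Cutting bytes can split a multi-byte character; errors="ignore"
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

A `None` result signals the caller to skip detection and use the default locale.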

🔧 Technical Implementation

Core Components Added:

1. Language Detection Function

def detect_document_language(pdf_path, filename):
    """
    Detect the dominant language in a PDF document using AWS Comprehend.
    Returns Adobe PDF Services locale code (e.g., 'es-ES', 'ca-ES', 'en-US')
    """

2. Updated API Functions

  • autotag_pdf_with_options() - Now accepts detected_locale parameter
  • extract_api() - Now accepts detected_locale parameter
  • set_language_comprehend() - Enhanced to use detected locale for PDF metadata

3. Language-to-Locale Mapping

Comprehensive mapping dictionary from AWS Comprehend language codes to Adobe PDF Services locale codes.
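
As an illustration, the mapping could look like the following sketch. It contains only the nine pairs from the table in this issue; the full 30+ entry dictionary is not reproduced here. The names `LANGUAGE_TO_LOCALE` and `to_adobe_locale` are hypothetical, and falling back to en-US for unmapped codes is an assumption that mirrors the low-confidence fallback.

```python
DEFAULT_LOCALE = "en-US"

# AWS Comprehend language code -> Adobe PDF Services locale.
# Only the pairs listed in this issue; the real mapping has more entries.
LANGUAGE_TO_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "ca": "ca-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "it": "it-IT",
    "pt": "pt-BR",
    "ja": "ja-JP",
    "zh": "zh-CN",
}


def to_adobe_locale(language_code, default=DEFAULT_LOCALE):
    """Map a Comprehend language code to a locale, falling back to the default."""
    if not language_code:
        return default
    return LANGUAGE_TO_LOCALE.get(language_code, default)
```

Adding a language is then a one-line change to the dictionary.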

Infrastructure Changes:

AWS CDK Updates (app.py):

  • IAM Permissions: Added comprehend:DetectDominantLanguage permission to ECS task role
  • Environment Variables: Removed hardcoded PDF_LOCALE environment variable
  • Backward Compatibility: Maintains support for manual locale override via the PDF_LOCALE environment variable
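
One way the override could be resolved is sketched below. The function name `resolve_locale` and the precedence rule (a non-empty PDF_LOCALE wins over the detected locale) are assumptions based on the backward-compatibility note above, not the actual implementation.

```python
import os


def resolve_locale(detected_locale, environ=None):
    """Return PDF_LOCALE if it is set and non-empty, otherwise the detected locale."""
    env = os.environ if environ is None else environ
    override = env.get("PDF_LOCALE", "").strip()
    return override or detected_locale
```

With this precedence, setting PDF_LOCALE=en-US restores the previous fixed behaviour regardless of what detection returns.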

📊 Processing Flow

graph TD
    A[PDF Upload] --> B[Download from S3]
    B --> C[Extract Text from First 5 Pages]
    C --> D[AWS Comprehend Language Detection]
    D --> E{Confidence ≥ 70%?}
    E -->|Yes| F[Map to Adobe Locale]
    E -->|No| G[Default to en-US]
    F --> H[Apply Locale to Adobe APIs]
    G --> H
    H --> I[Autotagging with Locale]
    H --> J[Text Extraction with Locale]
    I --> K[Set PDF Language Metadata]
    J --> K
    K --> L[Upload Processed PDF]

🧪 Testing Scenarios

Test Cases to Validate:

  1. Spanish Documents: Verify es-ES locale detection and application
  2. Catalan Documents: Verify ca-ES locale detection and application
  3. Mixed Language Documents: Test confidence thresholding
  4. Scanned/Image PDFs: Handle insufficient text scenarios
  5. Very Short Documents: Test minimum text requirements
  6. Error Scenarios: AWS Comprehend API failures, network issues
  7. Backward Compatibility: Manual locale override still works

Expected Improvements:

  • Better Accessibility Tagging: Language-specific heading detection and structure analysis
  • Improved Text Extraction: Better handling of language-specific characters and formatting
  • Enhanced Metadata: Proper language metadata in final PDF documents
  • Compliance: Better WCAG 2.1 compliance for non-English documents

📈 Benefits

For Users:

  • Automatic Processing: No manual language configuration required
  • Better Accuracy: Language-aware processing improves accessibility tagging quality
  • Multi-language Support: Seamless handling of documents in 30+ languages
  • Consistent Results: Standardized locale application across all processing steps

For Developers:

  • Maintainable Code: Clean separation of language detection logic
  • Extensible Design: Easy to add new language mappings
  • Comprehensive Logging: Detailed insights into language detection process
  • Error Resilience: Robust fallback mechanisms

🔍 Monitoring & Observability

Key Metrics to Track:

  • Language detection confidence scores
  • Distribution of detected languages
  • Fallback to default locale frequency
  • AWS Comprehend API usage and costs
  • Processing time impact

Log Messages Added:

  • Detected language: {code} (confidence: {score})
  • Using locale for autotagging: {locale}
  • Using locale for extraction: {locale}
  • Language set to {code} (from detected locale: {locale})

🚀 Deployment Notes

Prerequisites:

  • AWS Comprehend service availability in deployment region
  • Updated IAM permissions for ECS task role
  • No additional environment variables required

Rollback Plan:

  • Set PDF_LOCALE environment variable to force specific locale
  • Previous hardcoded behavior can be restored by setting PDF_LOCALE=en-US

🔮 Future Enhancements

Potential Improvements:

  1. Language Detection Caching: Cache results for similar documents
  2. Multi-language Documents: Handle documents with multiple languages
  3. Custom Language Models: Support for domain-specific language detection
  4. User Override Interface: Allow manual language selection in frontend
  5. Language-specific Processing Rules: Customize processing based on detected language
  6. Analytics Dashboard: Visualize language distribution and processing metrics

📝 Files Modified

Core Changes:

  • docker_autotag/autotag.py: Added language detection and locale parametrization
  • app.py: Updated IAM permissions and removed hardcoded locale

Key Functions Added/Modified:

  • detect_document_language() - New function for language detection
  • autotag_pdf_with_options() - Added locale parameter
  • extract_api() - Added locale parameter
  • set_language_comprehend() - Enhanced with locale support
  • main() - Integrated language detection workflow

🏷️ Labels

enhancement language-support aws-comprehend adobe-pdf-services accessibility internationalization i18n

🔗 Related Issues

  • Addresses need for multi-language document processing
  • Improves accessibility compliance for non-English documents
  • Enhances Adobe PDF Services API utilization

Priority: High
Complexity: Medium
Impact: High - Significantly improves processing quality for non-English documents
