Description
Summary
Implement automatic language detection for uploaded PDF documents and dynamically apply the appropriate locale settings to Adobe PDF Services API calls for optimal processing results across multiple languages.
Motivation
Currently, the PDF accessibility processing pipeline uses a hardcoded English locale (en-US) for all documents. This limits the effectiveness of Adobe PDF Services' autotagging and extraction capabilities for non-English documents, particularly for languages like Spanish, Catalan, French, German, and others that have specific linguistic rules and accessibility requirements.
Features Implemented
1. Automatic Language Detection
- AWS Comprehend Integration: Utilizes AWS Comprehend's `DetectDominantLanguage` API to analyze document content
- Smart Text Sampling: Extracts text from the first 5 pages of the PDF for language analysis
- Confidence Thresholding: Only applies detected language if confidence score ≥ 70%
- Graceful Fallbacks: Defaults to English (`en-US`) for low-confidence detections or errors
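The thresholding and fallback rules above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes a boto3 Comprehend client (or any object exposing `detect_dominant_language(Text=...)`), and the names `detect_language_code`, `CONFIDENCE_THRESHOLD`, and `StubComprehend` are hypothetical. It returns `None` whenever the caller should fall back to `en-US`.

```python
CONFIDENCE_THRESHOLD = 0.70  # assumed constant name; value taken from the 70% rule


def detect_language_code(client, text, threshold=CONFIDENCE_THRESHOLD):
    """Return the dominant language code, or None to signal a fallback.

    `client` is expected to behave like boto3.client("comprehend").
    """
    try:
        response = client.detect_dominant_language(Text=text)
    except Exception:
        # Any API or network failure results in the default locale.
        return None

    languages = response.get("Languages", [])
    if not languages:
        return None

    # Pick the candidate with the highest confidence score.
    best = max(languages, key=lambda lang: lang["Score"])
    if best["Score"] < threshold:
        return None
    return best["LanguageCode"]


class StubComprehend:
    """Hypothetical stand-in for the real client, for local experiments only."""

    def __init__(self, languages):
        self._languages = languages

    def detect_dominant_language(self, Text):
        return {"Languages": self._languages}
```

Passing the client in as an argument keeps the threshold logic testable without AWS credentials; a stub returning a 0.55 score for Spanish, for example, yields `None` and therefore the English default.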
2. Comprehensive Language Support
Supports 30+ languages with proper locale mapping:
| Language | AWS Code | Adobe Locale | Region |
|---|---|---|---|
| English | en | `en-US` | United States |
| Spanish | es | `es-ES` | Spain |
| Catalan | ca | `ca-ES` | Spain |
| French | fr | `fr-FR` | France |
| German | de | `de-DE` | Germany |
| Italian | it | `it-IT` | Italy |
| Portuguese | pt | `pt-BR` | Brazil |
| Japanese | ja | `ja-JP` | Japan |
| Chinese | zh | `zh-CN` | China (Simplified) |
| And 20+ more... | | | |
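The table above could be expressed as a plain dictionary lookup. The sketch below includes only the rows listed here; the names `LANGUAGE_TO_LOCALE`, `DEFAULT_LOCALE`, and `to_adobe_locale` are illustrative assumptions, and falling back to `en-US` for unmapped codes mirrors the fallback behavior described earlier.

```python
DEFAULT_LOCALE = "en-US"

# Only the rows shown in the table; the remaining mappings are omitted here.
LANGUAGE_TO_LOCALE = {
    "en": "en-US",
    "es": "es-ES",
    "ca": "ca-ES",
    "fr": "fr-FR",
    "de": "de-DE",
    "it": "it-IT",
    "pt": "pt-BR",
    "ja": "ja-JP",
    "zh": "zh-CN",
}


def to_adobe_locale(language_code):
    """Map an AWS Comprehend language code to an Adobe locale string."""
    if not language_code:
        return DEFAULT_LOCALE
    # Unknown codes fall back to the default rather than raising.
    return LANGUAGE_TO_LOCALE.get(language_code.lower(), DEFAULT_LOCALE)
```

Adding a language then only requires a new dictionary entry, which keeps the mapping separate from the detection logic.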
3. Integrated Processing Pipeline
- Autotagging: Applies detected locale to `AutotagPDFParams` for language-aware accessibility tagging
- Text Extraction: Uses detected locale in `ExtractPDFParams` for improved text and table extraction
- PDF Metadata: Sets document language metadata consistently across the pipeline
4. Enhanced Error Handling & Logging
- Comprehensive logging of detection process and confidence scores
- Handles AWS Comprehend API limits (5000 bytes max text)
- Manages insufficient text scenarios gracefully
- Detailed error reporting for troubleshooting
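Handling the size cap and the insufficient-text case could look like the sketch below. It assumes the page texts have already been extracted into a list of strings; the helper names and the `MIN_TEXT_CHARS` value are assumptions for illustration, while the 5000-byte cap and the 5-page sample come from the description above.

```python
MAX_TEXT_BYTES = 5000  # cap stated in the list above
MIN_TEXT_CHARS = 20    # assumed minimum; adjust to the real requirement
MAX_PAGES = 5          # sample size from the text-sampling feature


def truncate_utf8(text, max_bytes=MAX_TEXT_BYTES):
    """Trim text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def build_sample(page_texts, max_pages=MAX_PAGES):
    """Join the first pages into a sample, or return None if there is too little text."""
    parts = [t.strip() for t in page_texts[:max_pages] if t and t.strip()]
    sample = truncate_utf8(" ".join(parts))
    if len(sample) < MIN_TEXT_CHARS:
        # Scanned or near-empty PDFs end up here; the caller should use the default locale.
        return None
    return sample
```

Truncating on the encoded bytes matters for accented or non-Latin text, where a character can occupy several bytes and a character-count limit would overshoot the cap.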
Technical Implementation
Core Components Added:
1. Language Detection Function
```python
def detect_document_language(pdf_path, filename):
    """
    Detect the dominant language in a PDF document using AWS Comprehend.
    Returns Adobe PDF Services locale code (e.g., 'es-ES', 'ca-ES', 'en-US')
    """
```
2. Updated API Functions
- `autotag_pdf_with_options()` - Now accepts `detected_locale` parameter
- `extract_api()` - Now accepts `detected_locale` parameter
- `set_language_comprehend()` - Enhanced to use detected locale for PDF metadata
3. Language-to-Locale Mapping
Comprehensive mapping dictionary from AWS Comprehend language codes to Adobe PDF Services locale codes.
Infrastructure Changes:
AWS CDK Updates (app.py):
- IAM Permissions: Added `comprehend:DetectDominantLanguage` permission to ECS task role
- Environment Variables: Removed hardcoded `PDF_LOCALE` environment variable
- Backward Compatibility: Maintains support for manual locale override via environment variable
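The manual override could be resolved along these lines. This is a sketch under assumptions: the precedence order (override first, then detected locale, then `en-US`) and the function name `resolve_locale` are not confirmed by the description, which only states that `PDF_LOCALE` still works as an override.

```python
import os

DEFAULT_LOCALE = "en-US"


def resolve_locale(detected_locale, environ=None):
    """Prefer a manual PDF_LOCALE override, then the detected locale, then the default."""
    environ = os.environ if environ is None else environ
    # A blank or whitespace-only value is treated as "not set".
    override = environ.get("PDF_LOCALE", "").strip()
    if override:
        return override
    return detected_locale or DEFAULT_LOCALE
```

Accepting the environment as an optional argument lets tests pass a plain dictionary instead of mutating `os.environ`.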
Processing Flow
```mermaid
graph TD
    A[PDF Upload] --> B[Download from S3]
    B --> C[Extract Text from First 5 Pages]
    C --> D[AWS Comprehend Language Detection]
    D --> E{Confidence ≥ 70%?}
    E -->|Yes| F[Map to Adobe Locale]
    E -->|No| G[Default to en-US]
    F --> H[Apply Locale to Adobe APIs]
    G --> H
    H --> I[Autotagging with Locale]
    H --> J[Text Extraction with Locale]
    I --> K[Set PDF Language Metadata]
    J --> K
    K --> L[Upload Processed PDF]
```
Testing Scenarios
Test Cases to Validate:
- Spanish Documents: Verify `es-ES` locale detection and application
- Catalan Documents: Verify `ca-ES` locale detection and application
- Mixed Language Documents: Test confidence thresholding
- Scanned/Image PDFs: Handle insufficient text scenarios
- Very Short Documents: Test minimum text requirements
- Error Scenarios: AWS Comprehend API failures, network issues
- Backward Compatibility: Manual locale override still works
Expected Improvements:
- Better Accessibility Tagging: Language-specific heading detection and structure analysis
- Improved Text Extraction: Better handling of language-specific characters and formatting
- Enhanced Metadata: Proper language metadata in final PDF documents
- Compliance: Better WCAG 2.1 compliance for non-English documents
Benefits
For Users:
- Automatic Processing: No manual language configuration required
- Better Accuracy: Language-aware processing improves accessibility tagging quality
- Multi-language Support: Seamless handling of documents in 30+ languages
- Consistent Results: Standardized locale application across all processing steps
For Developers:
- Maintainable Code: Clean separation of language detection logic
- Extensible Design: Easy to add new language mappings
- Comprehensive Logging: Detailed insights into language detection process
- Error Resilience: Robust fallback mechanisms
Monitoring & Observability
Key Metrics to Track:
- Language detection confidence scores
- Distribution of detected languages
- Fallback to default locale frequency
- AWS Comprehend API usage and costs
- Processing time impact
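Several of these metrics can be derived from a simple per-document tally. The class below is a hypothetical in-memory sketch (the name `DetectionStats` and its methods are assumptions); a real deployment would more likely publish these values to a metrics service.

```python
from collections import Counter


class DetectionStats:
    """Hypothetical in-memory tally of language detection outcomes."""

    def __init__(self):
        self.languages = Counter()  # distribution of detected languages
        self.fallbacks = 0          # documents that used the default locale
        self.total = 0
        self._score_sum = 0.0

    def record(self, language_code, score, used_fallback):
        """Register one processed document."""
        self.total += 1
        self._score_sum += score
        self.languages[language_code] += 1
        if used_fallback:
            self.fallbacks += 1

    def fallback_rate(self):
        """Fraction of documents that fell back to the default locale."""
        return self.fallbacks / self.total if self.total else 0.0

    def mean_score(self):
        """Average confidence score across recorded documents."""
        return self._score_sum / self.total if self.total else 0.0
```

A rising fallback rate is a useful early signal that the sampled text is too short or that the confidence threshold is too strict for the incoming documents.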
Log Messages Added:
- `Detected language: {code} (confidence: {score})`
- `Using locale for autotagging: {locale}`
- `Using locale for extraction: {locale}`
- `Language set to {code} (from detected locale: {locale})`
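These messages could be emitted with the standard `logging` module, as in the sketch below. The logger name, the function name `log_detection`, and the two-decimal score format are assumptions; only the message templates come from the list above.

```python
import logging

logger = logging.getLogger("autotag.language")  # assumed logger name


def log_detection(code, score, locale):
    """Emit the detection and locale messages in the formats listed above."""
    # Lazy %-style arguments avoid formatting when the level is disabled.
    logger.info("Detected language: %s (confidence: %.2f)", code, score)
    logger.info("Using locale for autotagging: %s", locale)
    logger.info("Using locale for extraction: %s", locale)
    logger.info("Language set to %s (from detected locale: %s)", code, locale)
```

Keeping the message prefixes stable makes it straightforward to build log-based filters for the confidence and fallback metrics.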
Deployment Notes
Prerequisites:
- AWS Comprehend service availability in deployment region
- Updated IAM permissions for ECS task role
- No additional environment variables required
Rollback Plan:
- Set `PDF_LOCALE` environment variable to force specific locale
- Previous hardcoded behavior can be restored by setting `PDF_LOCALE=en-US`
Future Enhancements
Potential Improvements:
- Language Detection Caching: Cache results for similar documents
- Multi-language Documents: Handle documents with multiple languages
- Custom Language Models: Support for domain-specific language detection
- User Override Interface: Allow manual language selection in frontend
- Language-specific Processing Rules: Customize processing based on detected language
- Analytics Dashboard: Visualize language distribution and processing metrics
Files Modified
Core Changes:
- `docker_autotag/autotag.py`: Added language detection and locale parametrization
- `app.py`: Updated IAM permissions and removed hardcoded locale
Key Functions Added/Modified:
- `detect_document_language()` - New function for language detection
- `autotag_pdf_with_options()` - Added locale parameter
- `extract_api()` - Added locale parameter
- `set_language_comprehend()` - Enhanced with locale support
- `main()` - Integrated language detection workflow
Labels
enhancement language-support aws-comprehend adobe-pdf-services accessibility internationalization i18n
Related Issues
- Addresses need for multi-language document processing
- Improves accessibility compliance for non-English documents
- Enhances Adobe PDF Services API utilization
Priority: High
Complexity: Medium
Impact: High - Significantly improves processing quality for non-English documents