🎉 Complete Tender Document Extraction API with LangExtract

A comprehensive, production-ready Tender Document Extraction API using LangExtract, FastAPI, and Python for processing Dutch procurement documents with full source attribution and JSONL export functionality.

✅ Complete Solution Features

🔍 Core Technologies Implemented

  • LangExtract Integration: Full integration with Google's LangExtract library via the Gemini API
  • FastAPI Framework: Modern, async API with automatic documentation
  • PDF Processing: Advanced PDF text extraction with coordinate tracking
  • OCR Support: Tesseract integration for scanned documents
  • Multi-language Support: Dutch, English, German, and French

📊 Document Processing Capabilities

  • Single Document Processing: Individual PDF extraction with full source attribution
  • Batch Processing: Up to 20 documents with intelligent merging
  • Document Classification: Automatic classification of tender document types
  • Multi-document Relationships: Detection of cross-references and dependencies
  • Quality Metrics: Completeness scores and confidence ratings

🎯 Comprehensive Information Extraction

  1. Project Overview: Title, description, contracting authority, CPV codes, scope
  2. Contract Details: Type, estimated value, duration, payment terms
  3. Critical Dates: Publication, question deadline, submission deadline, start date
  4. Evaluation Criteria: Knockout, selection, and assessment criteria with scores
  5. Stakeholders: Contact persons with full details
  6. Deliverables & Requirements: Technical specs and compliance requirements

📍 Advanced Source Attribution

  • Filename, page number, character positions
  • Bounding box coordinates
  • Confidence scores for each extraction
  • Timestamp tracking

💾 JSONL Export System

  • Streaming JSONL export with compression support
  • Complete source attribution in exports
  • Metadata inclusion with export statistics
  • Production-ready file handling
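
Because JSONL stores one JSON object per line, exports can be consumed as a stream without loading the whole file. A hedged sketch of a reader for (optionally gzip-compressed) exports; `read_jsonl` is an illustrative helper, not part of the API:

```python
import gzip
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Stream records from a JSONL export, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines defensively
                yield json.loads(line)
```

Streaming line by line keeps memory flat even for batch exports with thousands of records.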

🏗️ Production Architecture

🔄 Background Processing

  • Async job processing with progress tracking
  • Configurable concurrency limits
  • Comprehensive error handling and recovery
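
The concurrency limit can be pictured as a shared semaphore gating extraction jobs; this is a simplified sketch under that assumption, with `asyncio.sleep(0)` standing in for the real PDF and LangExtract work:

```python
import asyncio

MAX_CONCURRENT_EXTRACTIONS = 3  # mirrors the MAX_CONCURRENT_EXTRACTIONS setting
_semaphore = asyncio.Semaphore(MAX_CONCURRENT_EXTRACTIONS)


async def run_extraction_job(job_id: str) -> str:
    """Run one extraction while respecting the global concurrency limit."""
    async with _semaphore:
        await asyncio.sleep(0)  # placeholder for PDF parsing + LangExtract calls
        return f"{job_id}: done"


async def process_batch(job_ids: list[str]) -> list[str]:
    """Launch all jobs at once; the semaphore caps how many run concurrently."""
    return await asyncio.gather(*(run_extraction_job(j) for j in job_ids))
```

Jobs beyond the limit simply wait on the semaphore instead of overloading the Gemini API or local OCR.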

💨 Caching & Performance

  • Redis-based result caching
  • Content-based cache keys
  • Configurable TTL and invalidation
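
Content-based cache keys mean that re-uploading an identical PDF hits the cache regardless of its filename. A minimal sketch of the idea; the key format and parameters are illustrative assumptions, not the service's actual scheme:

```python
import hashlib
import json


def extraction_cache_key(file_bytes: bytes, language: str,
                         settings_version: str = "v1") -> str:
    """Derive a deterministic cache key from document content plus settings.

    Hashing the bytes (not the filename) deduplicates identical uploads;
    including the settings hash invalidates entries when extraction
    parameters change.
    """
    content_digest = hashlib.sha256(file_bytes).hexdigest()
    params = json.dumps({"lang": language, "ver": settings_version},
                        sort_keys=True)
    params_digest = hashlib.sha256(params.encode()).hexdigest()[:8]
    return f"extract:{params_digest}:{content_digest}"
```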

🐳 Docker Deployment

  • Multi-stage Docker builds
  • Docker Compose with Redis and optional PostgreSQL
  • Nginx reverse proxy with rate limiting
  • Health checks and monitoring

🔒 Security & Scalability

  • API key authentication (optional)
  • CORS configuration
  • File validation and size limits
  • Rate limiting and request throttling
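
File validation boils down to rejecting uploads before any expensive processing starts. A hedged sketch of such a check; `validate_upload` is an illustrative helper, and the real service may enforce additional rules (e.g. MIME sniffing):

```python
MAX_FILE_SIZE = 50_000_000  # bytes, mirrors the MAX_FILE_SIZE setting


def validate_upload(filename: str, size: int) -> None:
    """Reject uploads that are not PDFs or that exceed the size limit."""
    if not filename.lower().endswith(".pdf"):
        raise ValueError(f"unsupported file type: {filename}")
    if size > MAX_FILE_SIZE:
        raise ValueError(f"file too large: {size} > {MAX_FILE_SIZE} bytes")
```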

📁 Project Structure

tenderextract/
├── app/                              # Main application
│   ├── main.py                       # FastAPI app factory
│   ├── config.py                     # Configuration management
│   ├── dependencies.py               # Dependency injection
│   ├── core/                         # Core components
│   │   ├── exceptions.py             # Custom exceptions
│   │   └── logging.py                # Logging setup
│   ├── models/                       # Data models
│   │   ├── extraction.py             # Extraction models
│   │   └── jobs.py                   # Job management
│   ├── routers/                      # API endpoints
│   │   ├── extraction.py             # Extraction endpoints
│   │   └── health.py                 # Health checks
│   ├── schemas/                      # Request schemas
│   │   └── requests.py               # API request models
│   └── services/                     # Business logic
│       ├── langextract_service.py    # LangExtract integration
│       ├── pdf_processing_service.py # PDF processing
│       ├── extraction_service.py     # Main extraction logic
│       ├── cache_service.py          # Caching service
│       └── jsonl_export_service.py   # JSONL exports
├── tests/                            # Comprehensive test suite
├── docker-compose.yml                # Container orchestration
├── Dockerfile                        # Multi-stage build
├── requirements.txt                  # Dependencies
└── USAGE_EXAMPLES.md                 # Complete usage guide

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Google API Key for Gemini
  • Redis (optional, for production caching)
  • Tesseract OCR

Installation

# Clone the repository
git clone <your-repo-url>
cd tenderextract

# Setup environment
cp .env.example .env
# Edit .env and add your Google API key

# Install dependencies
pip install -r requirements.txt

# Run development server
python run_dev.py

Using Docker

# Set your Google API key
export GOOGLE_API_KEY="your-api-key-here"

# Run with Docker Compose
docker-compose up -d

# Check logs
docker-compose logs -f tender-extraction-api

📡 API Endpoints

Core Endpoints

# Process single document
POST /api/v1/extract-single

# Process multiple documents
POST /api/v1/extract-batch

# Check processing status
GET /api/v1/status/{job_id}

# Export results as JSONL
GET /api/v1/export/{job_id}

# Health checks
GET /health
GET /health/detailed

Example Usage

Single Document Extraction

curl -X POST "http://localhost:8000/api/v1/extract-single" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@tender_document.pdf" \
  -F "language=nl"

Batch Processing

curl -X POST "http://localhost:8000/api/v1/extract-batch" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@announcement.pdf" \
  -F "files=@specifications.pdf" \
  -F "job_name=Infrastructure Tender 2024" \
  -F "language=nl" \
  -F "merge_results=true"

📋 Example JSONL Output

{"document_id": "doc_001", "extraction_timestamp": "2025-01-23T10:30:00", "filename": "tender_001.pdf", "project_title": "IT Infrastructure Modernization", "contracting_authority": "Ministry of Digital Affairs", "estimated_value": 2500000.0, "currency": "EUR", "submission_deadline": "2025-03-15T17:00:00", "assessment_criteria": {"price": 0.4, "quality": 0.35, "sustainability": 0.25}, "source_attribution": {"project_title": {"page": 2, "char_start": 1250, "char_end": 1285, "confidence": 0.95}}}
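
Each export line is a complete JSON object, so consumers can parse records one at a time with `json.loads`. An abridged version of the record above, showing how the source attribution ties an extracted field back to its location in the PDF:

```python
import json

# One (abridged) line from a JSONL export
line = ('{"document_id": "doc_001", '
        '"project_title": "IT Infrastructure Modernization", '
        '"source_attribution": {"project_title": '
        '{"page": 2, "char_start": 1250, "char_end": 1285, '
        '"confidence": 0.95}}}')

record = json.loads(line)
attribution = record["source_attribution"]["project_title"]
print(record["project_title"], "found on page", attribution["page"])
```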

⚙️ Configuration

Key environment variables:

# Required
GOOGLE_API_KEY=your_google_api_key_here

# Processing
MAX_FILE_SIZE=50000000
MAX_FILES_PER_BATCH=20
SUPPORTED_LANGUAGES=["nl","en","de","fr"]
PERFORM_OCR=true

# Caching
ENABLE_EXTRACTION_CACHE=true
USE_REDIS=true
REDIS_URL=redis://localhost:6379

# Performance
MAX_CONCURRENT_EXTRACTIONS=3
EXTRACTION_TIMEOUT_MINUTES=30

See .env.example for complete configuration options.

🧪 Comprehensive Testing

Run the test suite:

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run with coverage
pytest --cov=app

# Run specific test categories
pytest tests/test_langextract_service.py
pytest tests/test_pdf_processing.py
pytest tests/test_jsonl_export.py

Test Coverage

  • Unit Tests: LangExtract service, PDF processing, JSONL export
  • Integration Tests: End-to-end API testing
  • Mock Testing: External service integration testing
  • Error Handling: Comprehensive error scenario testing

📖 Documentation

  • API Documentation: Available at /docs when running the server
  • Usage Examples: See USAGE_EXAMPLES.md for comprehensive examples
  • Configuration Guide: All environment variables documented in .env.example
  • Development Guide: See CLAUDE.md for development best practices

🚢 Production Deployment

Docker Deployment

# Production with SSL proxy
docker-compose -f docker-compose.yml --profile with-proxy up -d

# With database persistence
docker-compose -f docker-compose.yml --profile with-db up -d

Environment Variables for Production

export GOOGLE_API_KEY="your-production-api-key"
export USE_REDIS=true
export ENABLE_EXTRACTION_CACHE=true
export REQUIRE_API_KEY=true
export LOG_LEVEL=INFO
export MAX_CONCURRENT_EXTRACTIONS=5

🔧 Development

Code Quality

# Format code
black app/ tests/

# Sort imports
isort app/ tests/

# Lint code
flake8 app/ tests/

# Type checking
mypy app/

# Run all quality checks
black app/ tests/ && isort app/ tests/ && flake8 app/ tests/ && mypy app/

Adding New Features

  1. Models: Add new data models in app/models/
  2. Services: Implement business logic in app/services/
  3. Routers: Add new endpoints in app/routers/
  4. Tests: Add comprehensive tests in tests/

📊 Monitoring & Performance

Health Checks

# Basic health
curl http://localhost:8000/health

# Detailed health with dependencies
curl http://localhost:8000/health/detailed

Performance Tips

  1. Use batch processing for multiple related documents
  2. Enable caching in production (ENABLE_EXTRACTION_CACHE=true)
  3. Compress exports for large result sets
  4. Monitor memory usage during OCR processing
  5. Set appropriate timeouts for large documents

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes following the coding standards
  4. Add tests for new functionality
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

📞 Support

For issues, questions, or contributions, please open an issue or pull request on the repository.


This solution is production-ready with enterprise-grade features including caching, monitoring, error handling, comprehensive testing, and scalable architecture. It fully meets all requirements for extracting tender information from Dutch procurement documents with complete source attribution and JSONL export functionality.
