Large-scale document archive system with full-text search capabilities.
The archive indexes more than 100,000 documents and provides:
- Full-text search with OCR fallback for scanned documents
- Metadata extraction and entity recognition
- Timeline visualization and relationship mapping
- PDF generation and document export
- RESTful API for programmatic access
URL: https://hmm.raya.li
- Full-Text Search: Elasticsearch-powered search across all documents
- OCR Support: Tesseract OCR for scanned/image-based documents
- Entity Recognition: Automatic extraction of names, dates, locations
- Timeline View: Chronological visualization of document events
- Export: PDF generation and bulk document download
- API: RESTful endpoints for research integration
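
The OCR fallback listed above could work roughly as follows: try the text already embedded in the file, and run OCR only when too little text comes back. This is a minimal sketch, not the archive's actual code. The function name `extract_text_with_fallback`, the `min_chars` threshold, and the two callables are hypothetical; in a real pipeline `run_ocr` would presumably wrap Tesseract.

```python
from typing import Callable, Tuple


def extract_text_with_fallback(
    path: str,
    extract_embedded: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 50,  # hypothetical threshold, tune per corpus
) -> Tuple[str, str]:
    """Return (text, method), where method is "embedded" or "ocr"."""
    # Try the text layer first; it is cheaper and usually more accurate.
    text = extract_embedded(path).strip()
    if len(text) >= min_chars:
        return text, "embedded"
    # Scanned or image-based files yield little or no embedded text,
    # so fall back to the OCR callable.
    return run_ocr(path).strip(), "ocr"
```

Passing the extractors in as callables keeps the decision logic independent of any particular PDF or OCR library, which also makes it easy to test with stubs.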
- Backend: Python, Flask
- Search: Elasticsearch
- Database: PostgreSQL
- OCR: Tesseract
- Frontend: Vue.js
- Hosting: Self-hosted on OVH infrastructure
# Search documents
curl "https://hmm.raya.li/api/search?q=query&limit=20"
# Get document by ID
curl "https://hmm.raya.li/api/documents/{id}"
# Export as PDF
curl "https://hmm.raya.li/api/documents/{id}/pdf"

Documents are sourced from public court records, FOIA releases, and other public domain sources.
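
For programmatic use, the curl calls shown earlier could be wrapped in a small Python helper. This is a sketch using only the standard library. The helper names are hypothetical, and it assumes the search endpoint returns JSON; the response schema is not documented here, so the result is returned unparsed beyond JSON decoding.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://hmm.raya.li/api"


def search_url(query: str, limit: int = 20) -> str:
    """Build the search URL with properly encoded query parameters."""
    return f"{BASE_URL}/search?{urlencode({'q': query, 'limit': limit})}"


def document_url(doc_id: str, pdf: bool = False) -> str:
    """Build the URL for a document, or for its PDF export."""
    url = f"{BASE_URL}/documents/{quote(str(doc_id), safe='')}"
    return url + "/pdf" if pdf else url


def search(query: str, limit: int = 20, timeout: float = 10.0):
    """Run a search and decode the response (assumed to be JSON)."""
    with urlopen(search_url(query, limit), timeout=timeout) as resp:
        return json.load(resp)
```

Building the URLs in separate functions keeps the encoding logic testable without network access; only `search` actually contacts the server.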
This archive is for research and educational purposes only. All documents are publicly available through official channels.
MIT License - See LICENSE file