Large-scale document archive with full-text search. 100k+ documents indexed.

raya-ac/epstein-archive

Epstein Archive

Large-scale document archive system with full-text search capabilities.

Overview

A comprehensive document archive indexing 100,000+ documents with:

  • Full-text search with OCR fallback for scanned documents
  • Metadata extraction and entity recognition
  • Timeline visualization and relationship mapping
  • PDF generation and document export
  • RESTful API for programmatic access
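The OCR fallback mentioned in the first bullet can be pictured roughly as follows. This is a minimal sketch, not the archive's actual pipeline: the helper name `extract_text`, the `min_chars` threshold, and the injected `ocr` callable (which could wrap a Tesseract call) are all assumptions made for illustration.

```python
from typing import Callable, Tuple


def extract_text(
    embedded_text: str,
    ocr: Callable[[], str],
    min_chars: int = 50,  # hypothetical threshold, not taken from the project
) -> Tuple[str, str]:
    """Return (text, source), preferring the document's embedded text layer.

    Hypothetical helper: OCR runs only when the embedded text is missing or
    too short to be a real text layer, as with a scanned page.
    """
    cleaned = embedded_text.strip()
    if len(cleaned) >= min_chars:
        return cleaned, "embedded"
    # Embedded text is absent or negligible, so run the slower OCR step.
    return ocr().strip(), "ocr"
```

Passing the OCR step as a callable keeps the decision logic testable without an OCR engine installed, and ensures OCR is never invoked for documents that already carry usable text.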

Live Instance

URL: https://hmm.raya.li

Features

  • Full-Text Search: Elasticsearch-powered search across all documents
  • OCR Support: Tesseract OCR for scanned/image-based documents
  • Entity Recognition: Automatic extraction of names, dates, locations
  • Timeline View: Chronological visualization of document events
  • Export: PDF generation and bulk document download
  • API: RESTful endpoints for research integration
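As an illustration of the full-text search feature, the sketch below builds an Elasticsearch query body using the standard `multi_match` query and `highlight` option. The field names `title` and `content`, the title boost, and the function name `build_search_body` are assumptions; the archive's real index mapping is not documented here.

```python
import json


def build_search_body(query: str, limit: int = 20) -> dict:
    """Build an Elasticsearch request body for a full-text search.

    Field names ("title", "content") are hypothetical placeholders.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    return {
        "size": limit,  # maximum number of hits to return
        "query": {
            "multi_match": {
                "query": query,
                # "^2" boosts matches in the (assumed) title field.
                "fields": ["title^2", "content"],
            }
        },
        # Ask Elasticsearch for matching snippets from the body text.
        "highlight": {"fields": {"content": {}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("court filing"), indent=2))
```

The resulting dictionary can be serialized to JSON and sent to an index's `_search` endpoint.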

Tech Stack

  • Backend: Python, Flask
  • Search: Elasticsearch
  • Database: PostgreSQL
  • OCR: Tesseract
  • Frontend: Vue.js
  • Hosting: Self-hosted on OVH infrastructure

API Usage

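# Search documents (URL-encode the search term passed as q)
curl "https://hmm.raya.li/api/search?q=query&limit=20"

# Get document by ID (replace {id} with the document's ID)
curl "https://hmm.raya.li/api/documents/{id}"

# Export as PDF and save it to a local file
curl -o document.pdf "https://hmm.raya.li/api/documents/{id}/pdf"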
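
For use from Python, the same three endpoints can be wrapped in a small client. This is a sketch using only the standard library: the helper names are hypothetical, and the assumption that the search and document endpoints return JSON is not confirmed by this README.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "https://hmm.raya.li/api"


def search_url(query: str, limit: int = 20) -> str:
    """Build the search URL, URL-encoding the query string."""
    return f"{BASE_URL}/search?{urlencode({'q': query, 'limit': limit})}"


def document_url(doc_id, pdf: bool = False) -> str:
    """Build the URL for a document, or for its PDF export."""
    url = f"{BASE_URL}/documents/{quote(str(doc_id), safe='')}"
    return f"{url}/pdf" if pdf else url


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def download_pdf(doc_id, path: str, timeout: float = 30.0) -> None:
    """Save a document's PDF export to a local file."""
    with urlopen(document_url(doc_id, pdf=True), timeout=timeout) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

A call such as `fetch_json(search_url("court records", limit=5))` would then perform the search request shown above.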

Data Source

Documents are sourced from public court records, FOIA releases, and other public domain sources.

Disclaimer

This archive is for research and educational purposes only. All documents are publicly available through official channels.

License

MIT License. See the LICENSE file for details.
