Analyzes text documents for readability, academic integrity, and linguistic insights via REST API.

DocumentLens

Text Analysis & Academic Intelligence Microservice

Transform text content into actionable insights through comprehensive linguistic analysis, writing quality assessment, and academic integrity checking.

🚀 Quick Start

# Docker deployment (recommended)
docker-compose up -d

# Or raw deployment
./deploy.sh

# API available at: http://localhost:8002
# Documentation: http://localhost:8002/docs

📊 API Endpoints

Core Analysis

  • GET /health - Service health check
  • POST /text - Text analysis (readability, quality, word frequency)
  • POST /academic - Academic analysis (citations, DOI resolution, integrity)
  • POST /files - File upload + analysis (PDF, DOCX, TXT, MD)
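
As a rough illustration of calling the core endpoints from Python, the sketch below builds a `POST /text` request using only the standard library. The JSON body shape (a single `text` field), the assumption that responses are JSON, and the helper names `build_text_request` and `send` are illustrative guesses rather than the service's documented schema; the interactive docs at `/docs` are the authority on the real format.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # default address from the Quick Start


def build_text_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /text request.

    The {"text": ...} body is an assumed schema; check /docs for the real one.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a request to a running service and decode a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the service to be running locally.
    print(send(urllib.request.Request(f"{BASE_URL}/health")))
    print(send(build_text_request("The quick brown fox jumps over the lazy dog.")))
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a live server.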

Advanced Text Analysis

  • POST /advanced/ngrams - N-gram extraction with optional filter terms
  • POST /advanced/ner - Named entity recognition
  • POST /advanced/search/keywords - Batch keyword search across multiple terms
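
To make "n-gram extraction with optional filter terms" concrete, here is a minimal local sketch of the general technique. It is not the service's implementation: the tokenization rule, the function name `extract_ngrams`, and its parameters are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def extract_ngrams(text, n=2, filter_terms=None, top_k=10):
    """Count word n-grams, optionally keeping only those containing a filter term."""
    # Simple lowercase word tokenizer (an assumption, not the service's rule).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of size n over the tokens.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    wanted = {term.lower() for term in filter_terms} if filter_terms else None
    counts = Counter(
        " ".join(gram)
        for gram in ngrams
        if wanted is None or wanted.intersection(gram)
    )
    return counts.most_common(top_k)


if __name__ == "__main__":
    sample = "climate risk and climate risk disclosure"
    print(extract_ngrams(sample, n=2))
    print(extract_ngrams(sample, n=2, filter_terms=["disclosure"]))
```

In this sketch a filter term matches when it equals one of the words in the n-gram; a real endpoint may match differently (for example on stems or phrases).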

Document Intelligence

  • POST /files/infer-metadata - Infer year, company, industry, document type from content
  • POST /text/infer-metadata - Metadata inference from raw text
  • Page-level text extraction (via include_extracted_text=true on /files)
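
The `include_extracted_text=true` option on `/files` might be used as in the sketch below, which builds a multipart upload request with the standard library. Several details are assumptions: that the flag is passed as a query parameter, that the upload form field is named `file`, and the helper name `build_upload_request`. Check `/docs` for the actual parameter placement and field names.

```python
import urllib.parse
import urllib.request
import uuid


def build_upload_request(filename, content, base_url="http://localhost:8002",
                         include_extracted_text=True):
    """Build (but do not send) a multipart POST /files request."""
    boundary = uuid.uuid4().hex
    # Assumed: the flag travels as a query parameter with a lowercase boolean.
    query = urllib.parse.urlencode(
        {"include_extracted_text": str(include_extracted_text).lower()}
    )
    # Assumed: the server reads the upload from a form field named "file".
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/files?{query}",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    # Requires a running service and a local file named report.pdf.
    with open("report.pdf", "rb") as handle:
        request = build_upload_request("report.pdf", handle.read())
    with urllib.request.urlopen(request, timeout=60) as response:
        print(response.read().decode("utf-8"))
```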

Integration

🎯 Use Cases

  • Text Analysis: Readability, writing quality, word frequency for any text content
  • Academic Analysis: Citation verification, DOI resolution, AI detection, integrity checking
  • Document Intelligence: Extract and analyze text from PDFs and Word documents
  • Sustainability Research: Batch keyword analysis for TCFD, GRI, SDGs, SASB frameworks
  • Corporate Report Analysis: Auto-detect metadata (year, company, industry) from annual reports
  • Multi-Service Workflows: Integrate with specialized analysis services
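
The batch keyword use case can be pictured with a small local sketch that counts whole-word matches for several frameworks at once. The keyword lists below are placeholders invented for the example, not the frameworks' official vocabularies, and `count_keywords` is a hypothetical helper rather than part of the service.

```python
import re


def count_keywords(text, frameworks):
    """Count whole-word, case-insensitive matches for each framework's terms."""
    lowered = text.lower()
    results = {}
    for framework, terms in frameworks.items():
        results[framework] = {
            term: len(re.findall(rf"\b{re.escape(term.lower())}\b", lowered))
            for term in terms
        }
    return results


if __name__ == "__main__":
    # Placeholder keyword lists for illustration only.
    frameworks = {
        "TCFD": ["climate risk", "governance"],
        "GRI": ["scope 1"],
    }
    report = (
        "Scope 1 emissions rose. "
        "Climate risk governance covers scope 1 and scope 2."
    )
    print(count_keywords(report, frameworks))
```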

Desktop Application Support

DocumentLens powers the document-lens-desktop Electron application for researchers analyzing corporate sustainability reports. Features include:

  • Smart metadata inference (company name, year, industry, document type)
  • Framework keyword analysis (TCFD, GRI, SDGs, SASB)
  • Batch processing with SQLite storage
  • Offline operation via bundled Python backend

🏗️ Microservices Ecosystem

DocumentLens is part of a focused microservices architecture:

| Service | Purpose | Repository |
|---|---|---|
| DocumentLens | Text analysis & academic intelligence | This repo |
| PresentationLens | Presentation design & structure analysis | presentation-lens |
| RecordingLens | Student recordings (video/audio) analysis | recording-lens |
| CodeLens | Source code quality & analysis | code-lens |
| SubmissionLens | Student submission router & frontend | submission-lens |

Integration Pattern

graph LR
    A[Student Submission] --> B[SubmissionLens Frontend]
    B --> C{File Type Router}
    C -->|Text/PDF/DOCX| D[DocumentLens]
    C -->|PPTX| E[PresentationLens]
    C -->|Video/Audio| F[RecordingLens]
    C -->|Source Code| G[CodeLens]
    E --> D
    F --> D
    G --> D
    D --> H[Combined Feedback]
    H --> B
    B --> I[Student Dashboard]
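
The routing step in the diagram above can be sketched as a lookup from file extension to service, with the specialized services handing their output on to DocumentLens. The extension lists and the function names `route` and `pipeline` are assumptions for illustration; the actual SubmissionLens router may decide differently.

```python
from pathlib import PurePath

# Assumed extension-to-service mapping, following the diagram's categories.
ROUTES = {
    ".txt": "DocumentLens",
    ".md": "DocumentLens",
    ".pdf": "DocumentLens",
    ".docx": "DocumentLens",
    ".pptx": "PresentationLens",
    ".mp4": "RecordingLens",
    ".mp3": "RecordingLens",
    ".wav": "RecordingLens",
    ".py": "CodeLens",
    ".js": "CodeLens",
    ".java": "CodeLens",
}


def route(filename):
    """Return the service that should receive the file first."""
    suffix = PurePath(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"No service registered for {suffix!r}") from None


def pipeline(filename):
    """Return the ordered list of services a submission passes through."""
    first = route(filename)
    # In the diagram, every specialized service feeds into DocumentLens.
    return [first] if first == "DocumentLens" else [first, "DocumentLens"]


if __name__ == "__main__":
    for name in ("essay.pdf", "slides.pptx", "talk.mp4", "main.py"):
        print(name, "->", " -> ".join(pipeline(name)))
```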

🚀 Deployment

Docker Deployment (Recommended)

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
docker-compose up -d  # Single container deployment

Raw/Native Deployment

git clone https://github.com/michael-borck/document-lens.git
cd document-lens
./deploy.sh  # Handles venv, dependencies, and production server

🧪 Testing

# Install dev dependencies
uv sync --extra dev

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/test_files.py -v

# Run only PDF tests
uv run pytest tests/ -m pdf -v

# Skip slow tests
uv run pytest tests/ -m "not slow" -v

# Run with coverage report (assumes coverage options are set in the project's pytest configuration)
uv run pytest tests/

Test Structure

  • tests/conftest.py - Shared fixtures and test client setup
  • tests/test_health.py - Health/smoke tests
  • tests/test_text_analysis.py - Text analysis endpoint tests
  • tests/test_academic_analysis.py - Academic analysis endpoint tests
  • tests/test_files.py - PDF file upload tests

Test Data

Place test files (PDF, DOCX, etc.) in the test-data/ directory. The test suite automatically discovers and uses these files for parameterized tests.
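
The discovery step described above could look roughly like the following. This is a sketch, not the repository's actual `conftest.py`: the helper name `discover_test_files` and the set of suffixes (taken from the formats listed for `POST /files`) are assumptions.

```python
from pathlib import Path

# Formats listed for POST /files; the real suite may accept others.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def discover_test_files(directory="test-data"):
    """Return supported files in the directory, sorted for stable test IDs."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


# In a test module, the list could feed a parameterized test, for example:
#
#   @pytest.mark.parametrize("path", discover_test_files(), ids=lambda p: p.name)
#   def test_upload(client, path):
#       ...

if __name__ == "__main__":
    for path in discover_test_files():
        print(path)
```

Returning an empty list when the directory is missing means the parameterized test is skipped (it has no cases) instead of failing at collection time.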

📚 Documentation

  • DEPLOYMENT.md - Deployment guide for Docker and raw installations
  • DOCUMENTLENS_SETUP.md - Setup and usage instructions
  • .env.example - Configuration template
  • docs/ - Additional architecture and integration documentation

DocumentLens: Pure text intelligence at the heart of content analysis
