Enhance testing setup and documentation

rithulkamesh · cursoragent · rithulkamesh · commit 3443794f79f1 · 2026-02-21T15:37:47.000+05:30
- Added development dependencies for testing: pytest and pytest-cov.
- Configured pytest options in `pyproject.toml` for test paths and warning filters.
- Updated `README.md` to include architecture overview and usage examples.
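
As a rough illustration of the warning filter added to `pyproject.toml`, the sketch below applies the same message pattern with the standard `warnings` module. The helper name `is_silenced` and the sample messages are hypothetical, and the sketch assumes pytest treats the message part of an ini `filterwarnings` entry as a regular expression matched against the start of the warning text.

```python
import warnings

# Same message pattern as the `filterwarnings` entry in pyproject.toml.
PATTERN = r"builtin type .* has no __module__ attribute"


def is_silenced(message: str, category: type[Warning] = DeprecationWarning) -> bool:
    """Return True if a warning with this message and category is ignored.

    Hypothetical helper for illustration; it is not part of docproc or pytest.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning unless a more specific filter ignores it.
        warnings.simplefilter("always")
        # Mirrors "ignore:<PATTERN>:DeprecationWarning" from the pytest config.
        warnings.filterwarnings("ignore", message=PATTERN, category=DeprecationWarning)
        warnings.warn(message, category)
    return len(caught) == 0


if __name__ == "__main__":
    # Illustrative message of the kind the filter is meant to hide.
    print(is_silenced("builtin type SwigPyPacked has no __module__ attribute"))
    # Unrelated deprecation warnings are still reported.
    print(is_silenced("some other deprecation"))
```

The filter is deliberately narrow: it only hides `DeprecationWarning`s whose text matches the pattern, so other deprecations still show up in test output.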

Co-authored-by: Cursor <cursoragent@cursor.com>
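
For readers who prefer Python to curl, the `/query` call documented in `docs/USAGE.md` could be sketched roughly as follows. The function names `build_query_request` and `ask` are hypothetical, and the sketch assumes the server replies with a JSON body; the response schema is not shown in this commit.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address shown in docs/USAGE.md


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /query request from the usage docs."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = BASE_URL) -> dict:
    """Send the query; needs a running docproc-serve instance.

    Assumes the response body is JSON (not confirmed by this commit).
    """
    with urllib.request.urlopen(build_query_request(question, base_url)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the payload easy to check in a unit test without starting the server.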
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,25 @@
+name: Test
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up uv
+        uses: astral-sh/setup-uv@v4
+        with:
+          version: "latest"
+
+      - name: Install dependencies
+        run: uv sync --extra dev
+
+      - name: Run tests
+        run: uv run pytest tests -v
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,32 @@
+# Contributing to DocProc
+
+## Prerequisites
+
+- [uv](https://docs.astral.sh/uv/) (recommended) or pip
+- Python 3.11+
+
+## Setup
+
+```bash
+git clone https://github.com/rithulkamesh/docproc.git
+cd docproc
+uv sync --extra dev
+```
+
+## Running tests
+
+```bash
+uv run pytest tests -v
+```
+
+## Code style
+
+No linter is enforced. Consider using [black](https://black.readthedocs.io/) or [ruff](https://docs.astral.sh/ruff/) for formatting.
+
+## Pull requests
+
+1. Fork the repo and create a branch
+2. Make your changes
+3. Run tests: `uv run pytest tests -v`
+4. Open a PR with a clear description
+5. Ensure CI passes
diff --git a/docproc/doc/loaders/__init__.py b/docproc/doc/loaders/__init__.py
@@ -1,13 +1,20 @@
 """Multi-format document loaders."""
 
 from docproc.doc.loaders.base import DocumentLoader, LoadedPage
-from docproc.doc.loaders.factory import load_document, get_full_text, get_page_count, get_supported_extensions
+from docproc.doc.loaders.factory import (
+    get_full_text,
+    get_loader,
+    get_page_count,
+    get_supported_extensions,
+    load_document,
+)
 
 __all__ = [
     "DocumentLoader",
     "LoadedPage",
-    "load_document",
     "get_full_text",
+    "get_loader",
     "get_page_count",
     "get_supported_extensions",
+    "load_document",
 ]
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
@@ -0,0 +1,40 @@
+# DocProc Architecture
+
+## Overview
+
+DocProc extracts content from documents (PDF, DOCX, PPTX, XLSX), optionally refines it with LLMs, and indexes it for RAG queries.
+
+## Pipeline flow
+
+```
+Document (PDF/DOCX/PPTX/XLSX)
+    -> Load (get_full_text or vision extract for PDF images)
+    -> Optional LLM refine (markdown, LaTeX)
+    -> Sanitize & dedupe
+    -> Output (.md for CLI) or Index (RAG for API)
+```
+
+## Modules
+
+| Module | Purpose |
+|--------|---------|
+| `docproc/doc/loaders` | Load documents, extract full text. PDF uses PyMuPDF; DOCX/PPTX/XLSX use python-docx, python-pptx, openpyxl. |
+| `docproc/extractors` | Vision LLM extraction for PDF embedded images (Azure Vision or vision-capable LLM). |
+| `docproc/refiners` | LLM refinement: clean markdown, LaTeX math, remove boilerplate. |
+| `docproc/providers` | AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. |
+| `docproc/sanitize` | Text sanitization and deduplication. |
+| `docproc/pipeline` | Shared extraction pipeline (extract_document_to_text) used by CLI and API. |
+| `docproc/api` | FastAPI server: upload, documents, query, models. |
+| `docproc/rag` | RAG backends: embedding-based or CLaRa. |
+| `docproc/stores` | Vector stores: PgVector, Qdrant, Chroma, FAISS, memory. |
+
+## Configuration
+
+- **docproc.yaml**: Single config file. One database, multiple AI providers, one primary AI.
+- **Environment overrides**: `DOCPROC_CONFIG`, `DATABASE_URL`, `OPENAI_API_KEY`, `AZURE_OPENAI_*`, etc.
+- See [CONFIGURATION.md](CONFIGURATION.md) for the full schema.
+
+## CLI vs API
+
+- **CLI** (`docproc --file input.pdf -o output.md`): Runs the pipeline locally and writes a `.md` file. No server, no RAG.
+- **API** (`docproc-serve`): Accepts uploads, runs the pipeline in the background, indexes the result into the vector store, and serves the query endpoint.
diff --git a/docs/README.md b/docs/README.md
@@ -6,6 +6,10 @@
 |----------|-------------|
 | [CONFIGURATION.md](CONFIGURATION.md) | **Configuration reference** — `docproc.yaml` schema, database providers (PgVector, Qdrant, Chroma, FAISS, memory), AI providers (OpenAI, Azure, Anthropic, Ollama, LiteLLM), ingest options (vision, LLM refinement), RAG, environment overrides |
 | [AZURE_SETUP.md](AZURE_SETUP.md) | **Azure setup** — Azure OpenAI deployments, Azure AI Vision (Computer Vision) for image extraction (Describe + Read API), credentials via env or `scripts/azure_env.sh` |
+| [ARCHITECTURE.md](ARCHITECTURE.md) | **Architecture overview** — Pipeline flow, modules, CLI vs API |
+| [USAGE.md](USAGE.md) | **Usage examples** — CLI, API, Docker, curl examples |
+
+See also [CONTRIBUTING.md](../CONTRIBUTING.md) for development setup and running tests.
 
 ## Concepts
 
diff --git a/docs/USAGE.md b/docs/USAGE.md
@@ -0,0 +1,71 @@
+# DocProc Usage Examples
+
+## CLI
+
+### Extract document to markdown
+
+```bash
+# With config
+docproc --file input.pdf -o output.md --config docproc.yaml
+
+# With DOCPROC_CONFIG env
+export DOCPROC_CONFIG=docproc.yaml
+docproc --file slides.pptx -o slides.md
+```
+
+### Supported formats
+
+PDF, DOCX, PPTX, XLSX (same as API). Use `-o output.md` for markdown output.
+
+See [docproc.cli.yaml](../docproc.cli.yaml) for an Ollama-only config example.
+
+## API
+
+### Start the server
+
+```bash
+DOCPROC_CONFIG=docproc.yaml docproc-serve
+# API at http://localhost:8000
+```
+
+### Upload a document
+
+```bash
+curl -X POST http://localhost:8000/documents/upload \
+  -F "file=@input.pdf"
+# Returns: {"id": "...", "status": "processing"}
+```
+
+### List documents
+
+```bash
+curl http://localhost:8000/documents/
+```
+
+### Get document status and content
+
+```bash
+curl http://localhost:8000/documents/{document_id}
+# Returns status, progress, full_text, regions when completed
+```
+
+### Query (RAG)
+
+```bash
+curl -X POST http://localhost:8000/query \
+  -H "Content-Type: application/json" \
+  -d '{"query": "What is the main idea?"}'
+```
+
+## Docker
+
+```bash
+docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx ghcr.io/rithulkamesh/docproc:latest
+```
+
+See [README.md](../README.md) for full Docker Compose setup.
+
+## Configuration
+
+- [CONFIGURATION.md](CONFIGURATION.md) — Config schema, database and AI providers
+- [AZURE_SETUP.md](AZURE_SETUP.md) — Azure OpenAI and Azure AI Vision setup
diff --git a/pyproject.toml b/pyproject.toml
@@ -41,6 +41,10 @@ server = [
     "sentence-transformers>=2.2",
     "streamlit>=1.28",
 ]
+dev = [
+    "pytest>=7.0",
+    "pytest-cov>=4.0",
+]
 
 [project.scripts]
 docproc = "docproc.bin.cli:main"
@@ -52,6 +56,13 @@ include = ["docproc"]
 [tool.hatch.build.targets.wheel]
 include = ["docproc"]
 
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-v"
+filterwarnings = [
+    "ignore:builtin type .* has no __module__ attribute:DeprecationWarning",
+]
+
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
diff --git a/tests/conftest.py b/tests/conftest.py
@@ -0,0 +1,39 @@
+"""Pytest fixtures for docproc tests."""
+
+import tempfile
+from pathlib import Path
+
+import pytest
+
+
+@pytest.fixture
+def tmp_config(tmp_path):
+    """Write a minimal docproc YAML config to a temp file."""
+    config_path = tmp_path / "docproc.yaml"
+    config_path.write_text(
+        """
+primary_ai: ollama
+ai_providers:
+  - provider: ollama
+    base_url: http://localhost:11434
+    default_model: llava
+    default_vision_model: llava
+ingest:
+  use_vision: false
+  use_llm_refine: false
+""",
+        encoding="utf-8",
+    )
+    return str(config_path)
+
+
+@pytest.fixture
+def sample_docx(tmp_path):
+    """Create a minimal valid DOCX file with 'Hello world' content."""
+    from docx import Document
+
+    doc = Document()
+    doc.add_paragraph("Hello world")
+    path = tmp_path / "sample.docx"
+    doc.save(path)
+    return path
diff --git a/tests/test_cli.py b/tests/test_cli.py
@@ -0,0 +1,47 @@
+"""Smoke tests for CLI."""
+
+import subprocess
+import sys
+from pathlib import Path
+
+import pytest
+
+
+def test_cli_help():
+    """docproc --help exits 0."""
+    result = subprocess.run(
+        [sys.executable, "-m", "docproc.bin.cli", "--help"],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode == 0
+    assert "output" in result.stdout.lower() or "file" in result.stdout.lower()
+
+
+def test_cli_nonexistent_file():
+    """docproc with nonexistent file exits non-zero."""
+    result = subprocess.run(
+        [sys.executable, "-m", "docproc.bin.cli", "--file", "/nonexistent/file.pdf", "-o", "out.md"],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode != 0
+
+
+def test_cli_extract_docx_to_md(sample_docx, tmp_config, tmp_path):
+    """docproc extracts DOCX to markdown with config."""
+    out_md = tmp_path / "output.md"
+    result = subprocess.run(
+        [
+            sys.executable, "-m", "docproc.bin.cli",
+            "--file", str(sample_docx),
+            "-o", str(out_md),
+            "--config", tmp_config,
+        ],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode == 0, result.stderr
+    assert out_md.exists()
+    content = out_md.read_text(encoding="utf-8")
+    assert "hello" in content.lower()
diff --git a/tests/test_config.py b/tests/test_config.py
@@ -0,0 +1,29 @@
+"""Unit tests for config loader and schema."""
+
+import pytest
+
+from docproc.config.loader import load_config
+
+
+def test_load_config_with_explicit_path(tmp_config):
+    """load_config with explicit path loads the file."""
+    cfg = load_config(tmp_config)
+    assert cfg.primary_ai == "ollama"
+    assert len(cfg.ai_providers) == 1
+    assert cfg.ai_providers[0].provider == "ollama"
+    assert cfg.config_path == tmp_config
+
+
+def test_load_config_minimal(tmp_config):
+    """load_config with minimal file uses schema defaults for missing keys."""
+    cfg = load_config(tmp_config)
+    assert cfg.rag.backend == "clara"
+    assert cfg.ingest.use_vision is False  # from our fixture
+
+
+def test_load_config_rag_schema_defaults(tmp_config):
+    """load_config applies schema defaults for rag when not in file."""
+    cfg = load_config(tmp_config)
+    assert cfg.rag.backend in ("clara", "embedding")
+    assert cfg.rag.top_k == 5
+    assert cfg.rag.chunk_size == 512
diff --git a/tests/test_loaders.py b/tests/test_loaders.py
@@ -0,0 +1,47 @@
+"""Unit tests for document loaders."""
+
+import pytest
+
+from docproc.doc.loaders import (
+    get_supported_extensions,
+    get_loader,
+    get_full_text,
+    load_document,
+)
+
+
+def test_get_supported_extensions():
+    """get_supported_extensions returns expected list."""
+    exts = get_supported_extensions()
+    assert ".docx" in exts
+    assert ".pdf" in exts
+    assert ".pptx" in exts
+    assert ".xlsx" in exts
+    assert len(exts) >= 4
+
+
+def test_get_loader_raises_unsupported():
+    """get_loader raises for unsupported format."""
+    import tempfile
+    with tempfile.NamedTemporaryFile(suffix=".xyz", delete=False) as f:
+        path = f.name
+    try:
+        from pathlib import Path
+        with pytest.raises(ValueError, match="Unsupported format"):
+            get_loader(Path(path))
+    finally:
+        import os
+        os.unlink(path)
+
+
+def test_get_full_text_docx(sample_docx):
+    """get_full_text extracts content from DOCX fixture."""
+    text = get_full_text(sample_docx)
+    assert "Hello" in text
+
+
+def test_load_document_docx(sample_docx):
+    """load_document yields pages from DOCX."""
+    pages = list(load_document(sample_docx))
+    assert len(pages) >= 1
+    assert pages[0].text or any(r.content for r in pages[0].regions)
diff --git a/tests/test_sanitize.py b/tests/test_sanitize.py
diff --git a/uv.lock b/uv.lock