Commit 3443794

Enhance testing setup and documentation
- Added development dependencies for testing: pytest and pytest-cov.
- Configured pytest options in `pyproject.toml` for test paths and warning filters.
- Updated `README.md` to include architecture overview and usage examples.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent c7cfba4 commit 3443794

13 files changed (+619, -3 lines)

.github/workflows/test.yml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+name: Test
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: astral-sh/setup-uv@v4
+        with:
+          version: "latest"
+
+      - name: Install dependencies
+        run: uv sync --extra dev
+
+      - name: Run tests
+        run: uv run pytest tests -v

CONTRIBUTING.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# Contributing to DocProc
+
+## Prerequisites
+
+- [uv](https://docs.astral.sh/uv/) (recommended) or pip
+- Python 3.11+
+
+## Setup
+
+```bash
+git clone https://github.com/rithulkamesh/docproc.git
+cd docproc
+uv sync --extra dev
+```
+
+## Running tests
+
+```bash
+uv run pytest tests -v
+```
+
+## Code style
+
+No strict linter enforced. Consider using [black](https://black.readthedocs.io/) or [ruff](https://docs.astral.sh/ruff/) for formatting.
+
+## Pull requests
+
+1. Fork the repo and create a branch
+2. Make your changes
+3. Run tests: `uv run pytest tests -v`
+4. Open a PR with a clear description
+5. Ensure CI passes

docproc/doc/loaders/__init__.py

Lines changed: 9 additions & 2 deletions
@@ -1,13 +1,20 @@
 """Multi-format document loaders."""
 
 from docproc.doc.loaders.base import DocumentLoader, LoadedPage
-from docproc.doc.loaders.factory import load_document, get_full_text, get_page_count, get_supported_extensions
+from docproc.doc.loaders.factory import (
+    get_full_text,
+    get_loader,
+    get_page_count,
+    get_supported_extensions,
+    load_document,
+)
 
 __all__ = [
     "DocumentLoader",
     "LoadedPage",
-    "load_document",
     "get_full_text",
+    "get_loader",
     "get_page_count",
     "get_supported_extensions",
+    "load_document",
 ]

docs/ARCHITECTURE.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# DocProc Architecture
+
+## Overview
+
+DocProc extracts content from documents (PDF, DOCX, PPTX, XLSX), optionally refines it with LLMs, and indexes it for RAG queries.
+
+## Pipeline flow
+
+```
+Document (PDF/DOCX/PPTX/XLSX)
+  -> Load (get_full_text or vision extract for PDF images)
+  -> Optional LLM refine (markdown, LaTeX)
+  -> Sanitize & dedupe
+  -> Output (.md for CLI) or Index (RAG for API)
+```
+
+## Modules
+
+| Module | Purpose |
+|--------|---------|
+| `docproc/doc/loaders` | Load documents, extract full text. PDF uses PyMuPDF; DOCX/PPTX/XLSX use python-docx, python-pptx, openpyxl. |
+| `docproc/extractors` | Vision LLM extraction for PDF embedded images (Azure Vision or vision-capable LLM). |
+| `docproc/refiners` | LLM refinement: clean markdown, LaTeX math, remove boilerplate. |
+| `docproc/providers` | AI providers: OpenAI, Azure, Anthropic, Ollama, LiteLLM. |
+| `docproc/sanitize` | Text sanitization and deduplication. |
+| `docproc/pipeline` | Shared extraction pipeline (extract_document_to_text) used by CLI and API. |
+| `docproc/api` | FastAPI server: upload, documents, query, models. |
+| `docproc/rag` | RAG backends: embedding-based or CLaRa. |
+| `docproc/stores` | Vector stores: PgVector, Qdrant, Chroma, FAISS, memory. |
+
+## Configuration
+
+- **docproc.yaml**: Single config file. One database, multiple AI providers, one primary AI.
+- **Environment overrides**: `DOCPROC_CONFIG`, `DATABASE_URL`, `OPENAI_API_KEY`, `AZURE_OPENAI_*`, etc.
+- See [CONFIGURATION.md](CONFIGURATION.md) for the full schema.
+
+## CLI vs API
+
+- **CLI** (`docproc --file input.pdf -o output.md`): Runs the pipeline locally, writes to .md. No server, no RAG.
+- **API** (`docproc-serve`): Accepts uploads, runs the pipeline in background, indexes to vector store, serves query endpoint.
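
The pipeline flow in `docs/ARCHITECTURE.md` (load, optional LLM refine, sanitize and dedupe, output) can be sketched roughly as follows. This is only an illustration under stated assumptions, not DocProc's implementation: `sanitize`, `dedupe`, `run_pipeline`, and the `refine` callback are hypothetical names, and the load step is replaced by a plain string that is assumed to be already extracted.

```python
import re
from typing import Callable, Optional


def sanitize(text: str) -> str:
    """Hypothetical sanitizer: drop control characters and trailing spaces."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return "\n".join(line.rstrip() for line in text.splitlines())


def dedupe(text: str) -> str:
    """Hypothetical dedupe: drop repeated non-empty lines, keeping the first."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # a non-empty line we have already emitted
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


def run_pipeline(raw_text: str, refine: Optional[Callable[[str], str]] = None) -> str:
    """Loaded text -> optional refine -> sanitize -> dedupe."""
    text = refine(raw_text) if refine is not None else raw_text
    return dedupe(sanitize(text))


if __name__ == "__main__":
    raw = "Title  \nBody line\x00\nBody line\n\nFooter"
    print(run_pipeline(raw))
```

Passing `refine=None` corresponds to the `use_llm_refine: false` path used in the test fixture; any callable that maps text to text can stand in for the LLM step.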

docs/README.md

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@
 |----------|-------------|
 | [CONFIGURATION.md](CONFIGURATION.md) | **Configuration reference** — `docproc.yaml` schema, database providers (PgVector, Qdrant, Chroma, FAISS, memory), AI providers (OpenAI, Azure, Anthropic, Ollama, LiteLLM), ingest options (vision, LLM refinement), RAG, environment overrides |
 | [AZURE_SETUP.md](AZURE_SETUP.md) | **Azure setup** — Azure OpenAI deployments, Azure AI Vision (Computer Vision) for image extraction (Describe + Read API), credentials via env or `scripts/azure_env.sh` |
+| [ARCHITECTURE.md](ARCHITECTURE.md) | **Architecture overview** — Pipeline flow, modules, CLI vs API |
+| [USAGE.md](USAGE.md) | **Usage examples** — CLI, API, Docker, curl examples |
+
+See also [CONTRIBUTING.md](../CONTRIBUTING.md) for development setup and running tests.
 
 ## Concepts
 

docs/USAGE.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+# DocProc Usage Examples
+
+## CLI
+
+### Extract document to markdown
+
+```bash
+# With config
+docproc --file input.pdf -o output.md --config docproc.yaml
+
+# With DOCPROC_CONFIG env
+export DOCPROC_CONFIG=docproc.yaml
+docproc --file slides.pptx -o slides.md
+```
+
+### Supported formats
+
+PDF, DOCX, PPTX, XLSX (same as API). Use `-o output.md` for markdown output.
+
+See [docproc.cli.yaml](../docproc.cli.yaml) for an Ollama-only config example.
+
+## API
+
+### Start the server
+
+```bash
+DOCPROC_CONFIG=docproc.yaml docproc-serve
+# API at http://localhost:8000
+```
+
+### Upload a document
+
+```bash
+curl -X POST http://localhost:8000/documents/upload \
+  -F "file=@input.pdf"
+# Returns: {"id": "...", "status": "processing"}
+```
+
+### List documents
+
+```bash
+curl http://localhost:8000/documents/
+```
+
+### Get document status and content
+
+```bash
+curl http://localhost:8000/documents/{document_id}
+# Returns status, progress, full_text, regions when completed
+```
+
+### Query (RAG)
+
+```bash
+curl -X POST http://localhost:8000/query \
+  -H "Content-Type: application/json" \
+  -d '{"query": "What is the main idea?"}'
+```
+
+## Docker
+
+```bash
+docker run -p 8000:8000 -e OPENAI_API_KEY=sk-xxx ghcr.io/rithulkamesh/docproc:latest
+```
+
+See [README.md](../README.md) for full Docker Compose setup.
+
+## Configuration
+
+- [CONFIGURATION.md](CONFIGURATION.md) — Config schema, database and AI providers
+- [AZURE_SETUP.md](AZURE_SETUP.md) — Azure OpenAI and Azure AI Vision setup
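
For readers who prefer Python to curl, the `POST /query` example from `docs/USAGE.md` could be built with the standard library as sketched below. The endpoint path, JSON body, and default address come from the usage doc; the helper name `build_query_request` is hypothetical, and the shape of the response is not documented in this commit, so the sketch stops short of parsing it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address shown in USAGE.md


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /query request from the curl example."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_query_request("What is the main idea?")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending requires a running server (started with `docproc-serve`):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the request easy to inspect or unit-test without a live server.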

pyproject.toml

Lines changed: 11 additions & 0 deletions
@@ -41,6 +41,10 @@ server = [
     "sentence-transformers>=2.2",
     "streamlit>=1.28",
 ]
+dev = [
+    "pytest>=7.0",
+    "pytest-cov>=4.0",
+]
 
 [project.scripts]
 docproc = "docproc.bin.cli:main"
@@ -52,6 +56,13 @@ include = ["docproc"]
 [tool.hatch.build.targets.wheel]
 include = ["docproc"]
 
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-v"
+filterwarnings = [
+    "ignore:builtin type .* has no __module__ attribute:DeprecationWarning",
+]
+
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
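
The `filterwarnings` entry follows the `action:message:category` form, where the message part is a regular expression matched against the start of the warning text. The sketch below reproduces the same filter with Python's `warnings` module to show which warnings it suppresses. The helper `count_surviving` and the example messages (including the type name `SwigPyObject`) are illustrative assumptions, not taken from the commit.

```python
import warnings


def count_surviving(messages):
    """Emit each message as a DeprecationWarning; count those not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record everything by default
        # Same pattern and category as the pyproject.toml entry; filters added
        # later take precedence, so this "ignore" wins over "always".
        warnings.filterwarnings(
            "ignore",
            message=r"builtin type .* has no __module__ attribute",
            category=DeprecationWarning,
        )
        for msg in messages:
            warnings.warn(msg, DeprecationWarning)
    return len(caught)


if __name__ == "__main__":
    sample = [
        "builtin type SwigPyObject has no __module__ attribute",  # suppressed
        "old_function is deprecated",  # still reported
    ]
    print(count_surviving(sample))
```

Because the filter is scoped to one message pattern and one category, other deprecation warnings remain visible in the test output.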

tests/conftest.py

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+"""Pytest fixtures for docproc tests."""
+
+import tempfile
+from pathlib import Path
+
+import pytest
+
+
+@pytest.fixture
+def tmp_config(tmp_path):
+    """Write a minimal docproc YAML config to a temp file."""
+    config_path = tmp_path / "docproc.yaml"
+    config_path.write_text(
+        """
+primary_ai: ollama
+ai_providers:
+  - provider: ollama
+    base_url: http://localhost:11434
+    default_model: llava
+    default_vision_model: llava
+ingest:
+  use_vision: false
+  use_llm_refine: false
+""",
+        encoding="utf-8",
+    )
+    return str(config_path)
+
+
+@pytest.fixture
+def sample_docx(tmp_path):
+    """Create a minimal valid DOCX file with 'Hello world' content."""
+    from docx import Document
+
+    doc = Document()
+    doc.add_paragraph("Hello world")
+    path = tmp_path / "sample.docx"
+    doc.save(path)
+    return path

tests/test_cli.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+"""Smoke tests for CLI."""
+
+import subprocess
+import sys
+from pathlib import Path
+
+import pytest
+
+
+def test_cli_help():
+    """docproc --help exits 0."""
+    result = subprocess.run(
+        [sys.executable, "-m", "docproc.bin.cli", "--help"],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode == 0
+    assert "output" in result.stdout.lower() or "file" in result.stdout.lower()
+
+
+def test_cli_nonexistent_file():
+    """docproc with nonexistent file exits non-zero."""
+    result = subprocess.run(
+        [sys.executable, "-m", "docproc.bin.cli", "--file", "/nonexistent/file.pdf", "-o", "out.md"],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode != 0
+
+
+def test_cli_extract_docx_to_md(sample_docx, tmp_config, tmp_path):
+    """docproc extracts DOCX to markdown with config."""
+    out_md = tmp_path / "output.md"
+    result = subprocess.run(
+        [
+            sys.executable, "-m", "docproc.bin.cli",
+            "--file", str(sample_docx),
+            "-o", str(out_md),
+            "--config", tmp_config,
+        ],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode == 0, result.stderr
+    assert out_md.exists()
+    content = out_md.read_text(encoding="utf-8")
+    assert "Hello" in content or "hello" in content
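
The smoke tests above run the CLI in a child interpreter and assert only on the exit code and captured output. The same pattern can be tried without installing docproc by pointing it at a stand-in script. In the sketch below, `FAKE_CLI` and `run_cli` are hypothetical and only mimic the `--file` / `-o` flags used in the tests; they say nothing about how `docproc.bin.cli` is actually written.

```python
import subprocess
import sys

# Stand-in CLI (hypothetical): exits non-zero when the input file is missing.
FAKE_CLI = """
import argparse, pathlib, sys
p = argparse.ArgumentParser()
p.add_argument("--file", required=True)
p.add_argument("-o", "--output")
args = p.parse_args()
if not pathlib.Path(args.file).exists():
    sys.exit(f"error: {args.file} not found")
"""


def run_cli(*args):
    """Run the stand-in CLI in a child interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", FAKE_CLI, *args],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    print(run_cli("--help").returncode)
    print(run_cli("--file", "/nonexistent/file.pdf", "-o", "out.md").returncode)
```

Using `sys.executable` rather than a bare `python` keeps the child process in the same environment as the test runner, which is why the commit's tests invoke the module with `-m` instead of relying on the installed `docproc` script.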

tests/test_config.py

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+"""Unit tests for config loader and schema."""
+
+import pytest
+
+from docproc.config.loader import load_config
+
+
+def test_load_config_with_explicit_path(tmp_config):
+    """load_config with explicit path loads the file."""
+    cfg = load_config(tmp_config)
+    assert cfg.primary_ai == "ollama"
+    assert len(cfg.ai_providers) == 1
+    assert cfg.ai_providers[0].provider == "ollama"
+    assert cfg.config_path == tmp_config
+
+
+def test_load_config_minimal(tmp_config):
+    """load_config with minimal file uses schema defaults for missing keys."""
+    cfg = load_config(tmp_config)
+    assert cfg.rag.backend == "clara"
+    assert cfg.ingest.use_vision is False  # from our fixture
+
+
+def test_load_config_rag_schema_defaults(tmp_config):
+    """load_config applies schema defaults for rag when not in file."""
+    cfg = load_config(tmp_config)
+    assert cfg.rag.backend in ("clara", "embedding")
+    assert cfg.rag.top_k == 5
+    assert cfg.rag.chunk_size == 512
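
These tests rely on the loader filling in schema defaults when the YAML file omits the `rag` section. One common way to get that behaviour is with dataclass defaults, sketched below. The class and function names (`RagConfig`, `IngestConfig`, `Config`, `config_from_dict`) are hypothetical and are not taken from `docproc.config`; only the default values `clara`, `5`, and `512` come from the assertions above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class RagConfig:
    # Defaults mirror the values asserted in tests/test_config.py.
    backend: str = "clara"
    top_k: int = 5
    chunk_size: int = 512


@dataclass
class IngestConfig:
    # Assumed defaults; the test fixture sets both keys explicitly.
    use_vision: bool = False
    use_llm_refine: bool = False


@dataclass
class Config:
    primary_ai: str
    rag: RagConfig = field(default_factory=RagConfig)
    ingest: IngestConfig = field(default_factory=IngestConfig)


def config_from_dict(data: Dict[str, Any]) -> Config:
    """Build a Config; dataclass defaults fill any missing section or key."""
    return Config(
        primary_ai=data["primary_ai"],
        rag=RagConfig(**data.get("rag", {})),
        ingest=IngestConfig(**data.get("ingest", {})),
    )


if __name__ == "__main__":
    cfg = config_from_dict({"primary_ai": "ollama", "ingest": {"use_vision": False}})
    print(cfg.rag)
```

With this layout a partially specified section such as `{"rag": {"top_k": 3}}` overrides one key while the rest keep their defaults.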
