DocMind AI provides local document analysis with zero cloud dependency. It combines hybrid retrieval (dense + sparse), optional knowledge graph extraction (GraphRAG), and a 5-agent coordinator to analyze PDFs, Office docs, HTML/Markdown, and image-rich PDFs. Built on LlamaIndex pipelines with LangGraph supervisor orchestration, the default vLLM profile targets Qwen/Qwen3-4B-Instruct-2507-FP8 (128 K context window) and runs entirely on your hardware with optional GPU acceleration.
Motivation: Traditional document analysis tools either send your data to the cloud (privacy risk) or provide basic keyword search (limited intelligence). DocMind AI keeps everything local while still supporting complex, multi-step queries, entity/relationship extraction (GraphRAG), and agent-coordinated synthesis.
Design goals:
- Privacy by default: remote endpoints are blocked unless explicitly allowed.
- Reproducibility: deterministic ingestion caching and snapshot manifests.
- Extensibility: RouterQueryEngine routes across vector, hybrid, and optional graph retrieval.
- Privacy-focused, local-first: Remote LLM endpoints are blocked by default; enable explicitly when needed.
- Library-first ingestion pipeline: LlamaIndex `IngestionPipeline` with `UnstructuredReader` when installed (fallback to plain text), `TokenTextSplitter`, optional spaCy enrichment, and optional `TitleExtractor`.
- Multi-format parsing: Unstructured covers common formats (PDF, DOCX, PPTX, XLSX, HTML, Markdown, TXT, CSV, JSON, RTF, MSG, ODT, EPUB) when supported; unsupported types fall back to plain text when possible.
- Hybrid retrieval with routing: RouterQueryEngine with `semantic_search`, optional `hybrid_search` (Qdrant server-side fusion), and optional `knowledge_graph` (GraphRAG).
- Qdrant server-side fusion: Query API RRF (default) or DBSF over named vectors `text-dense` and `text-sparse`; sparse queries use FastEmbed BM42/BM25 when available.
- Reranking and multimodal: Text rerank via BGE cross-encoder; SigLIP visual rerank runs when visual nodes are present; ColPali optional via the `multimodal` extra.
- Multi-agent coordination: LangGraph supervisor orchestrates five agents (router, planner, retrieval, synthesis, validation).
- Snapshots and reproducibility: DuckDB KV cache plus snapshot manifests with corpus/config hashes; graph exports as JSONL/Parquet (Parquet requires PyArrow).
- PDF page images: PyMuPDF renders page images to WebP/JPEG; optional AES-GCM encryption with `.enc` outputs and just-in-time decryption for visual scoring.
- ArtifactStore (multimodal durability): Page images/thumbnails are stored as content-addressed `ArtifactRef(sha256, suffix)` (no base64 blobs or host paths in durable stores).
- Multimodal UX: Chat renders image sources and supports query-by-image “Visual search” (SigLIP) for image-rich PDFs.
- Offline-first design: Runs fully offline once models are present; remote endpoints must be explicitly enabled.
- GPU acceleration: Optional GPU extras for local embedding/reranking acceleration (vLLM runs as an external OpenAI-compatible server).
- Robust retries and logging: Tenacity-backed retries for LLM calls and structured logging via Loguru.
- Observability and operations: Optional OTLP tracing/metrics plus JSONL telemetry; Docker and Compose included for local deployments.
- 🧠 DocMind AI: Local LLM for AI-Powered Document Analysis
- ✨ Features of DocMind AI
- Table of Contents
- Getting Started with DocMind AI
- Usage
- API Usage Examples
- Architecture
- Implementation Details
- Configuration
- Performance Defaults and Monitoring
- Offline Operation
- Troubleshooting
- How to Cite
- Contributing
- License
- Observability
- One supported LLM backend running locally: Ollama (default), vLLM OpenAI-compatible server, LM Studio, or a llama.cpp server.
- Python 3.13.11 (see `pyproject.toml`)
- (Optional) Docker and Docker Compose for containerized deployment.
- (Optional) NVIDIA GPU (e.g., RTX 4090 Laptop) with at least 16 GB VRAM for 128 K context (vLLM) and accelerated performance.
- Clone the repository:

  ```bash
  git clone https://github.com/BjornMelin/docmind-ai-llm.git
  cd docmind-ai-llm
  ```

- Install dependencies:

  ```bash
  uv sync
  ```

  Need LlamaIndex OpenTelemetry instrumentation? Install the optional observability extras as well:

  ```bash
  uv sync --extra observability
  ```

  Need GraphRAG adapters or ColPali visual reranking? Install the optional extras:

  ```bash
  uv sync --extra graph
  uv sync --extra multimodal
  ```
Key Dependencies Included:
- LlamaIndex (>=0.14.12,<0.15.0): Retrieval, RouterQueryEngine, IngestionPipeline, PropertyGraphIndex
- LangGraph (==1.0.6): 5-agent supervisor orchestration (graph-native `StateGraph`, no external supervisor wrapper)
- Streamlit (>=1.52.2,<2.0.0): Web interface framework
- Ollama (0.6.1): Local LLM integration
- Qdrant Client (>=1.15.1,<2.0.0): Vector database operations
- Unstructured (>=0.18.26,<0.19.0): Multi-format parsing (PDF/DOCX/PPTX/XLSX, etc.)
- LlamaIndex Embeddings FastEmbed (>=0.5.0,<0.6.0): Sparse query encoding (optional fastembed-gpu >=0.7.4,<0.8.0)
- Tenacity (>=9.1.2,<10.0.0): Retry strategies with exponential backoff
- Loguru (>=0.7.3,<1.0.0): Structured logging
- Pydantic (2.12.5): Data validation and settings.
- Install spaCy language model:

  spaCy is bundled for optional NLP enrichment (sentence segmentation + entity extraction during ingestion). Install a language model if you plan to use enrichment:

  ```bash
  # Install the small English model (recommended, ~15MB)
  uv run python -m spacy download en_core_web_sm
  # Optional: install larger models for better accuracy
  # Medium model (~50MB):
  uv run python -m spacy download en_core_web_md
  # Large model (~560MB):
  uv run python -m spacy download en_core_web_lg
  ```

  Note: spaCy models are downloaded and cached locally. The app does not auto-download models; install them explicitly for offline use.
  Optional configuration (defaults shown):

  ```bash
  # Enable/disable enrichment
  DOCMIND_SPACY__ENABLED=true
  # Pipeline name or path (blank fallback when missing)
  DOCMIND_SPACY__MODEL=en_core_web_sm
  # cpu|cuda|apple|auto (auto prefers CUDA, then Apple, else CPU)
  DOCMIND_SPACY__DEVICE=auto
  DOCMIND_SPACY__GPU_ID=0
  ```
  Cross-platform acceleration:

  - NVIDIA CUDA (Linux/Windows): `uv sync --extra gpu` and set `DOCMIND_SPACY__DEVICE=auto|cuda`
  - Apple Silicon (macOS arm64): `uv sync --extra apple` and set `DOCMIND_SPACY__DEVICE=auto|apple`

  See `docs/specs/spec-015-nlp-enrichment-spacy.md` and `docs/developers/gpu-setup.md`.
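The `auto` device policy above can be sketched as a small resolver. This is an illustration only (`resolve_spacy_device` is a hypothetical helper, not the repo's API); the real selection lives in the app's settings layer:

```python
def resolve_spacy_device(requested: str, cuda_ok: bool, apple_ok: bool) -> str:
    """Resolve DOCMIND_SPACY__DEVICE: 'auto' prefers CUDA, then Apple, else CPU."""
    if requested != "auto":
        return requested  # an explicit cpu|cuda|apple setting wins
    if cuda_ok:
        return "cuda"
    if apple_ok:
        return "apple"
    return "cpu"

print(resolve_spacy_device("auto", cuda_ok=False, apple_ok=True))  # prints "apple"
```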
- Set up environment configuration:

  Copy the example environment file and configure your settings:

  ```bash
  cp .env.example .env
  # Edit .env with your preferred settings
  # Model names are backend-specific:
  # - Ollama: use the local tag (e.g., qwen3-4b-instruct-2507)
  # - vLLM/LM Studio/llama.cpp: use the served model name
  # NOTE: DOCMIND_MODEL (top-level) overrides backend-specific model vars such as DOCMIND_VLLM__MODEL at runtime.

  # Example - Ollama (local, default):
  # DOCMIND_LLM_BACKEND=ollama
  # DOCMIND_OLLAMA_BASE_URL=http://localhost:11434
  # DOCMIND_MODEL=qwen3-4b-instruct-2507

  # Example - LM Studio (local, OpenAI-compatible):
  # DOCMIND_LLM_BACKEND=lmstudio
  # DOCMIND_OPENAI__BASE_URL=http://localhost:1234/v1
  # DOCMIND_OPENAI__API_KEY=not-needed
  # DOCMIND_MODEL=your-model-name

  # Example - vLLM OpenAI-compatible server:
  # DOCMIND_LLM_BACKEND=vllm
  # DOCMIND_OPENAI__BASE_URL=http://localhost:8000/v1
  # DOCMIND_OPENAI__API_KEY=not-needed
  # DOCMIND_VLLM__MODEL=Qwen/Qwen3-4B-Instruct-2507-FP8

  # Example - llama.cpp server:
  # DOCMIND_LLM_BACKEND=llamacpp
  # DOCMIND_OPENAI__BASE_URL=http://localhost:8080/v1
  # DOCMIND_OPENAI__API_KEY=not-needed
  # DOCMIND_MODEL=your-model-name

  # Offline-first recommended:
  # HF_HUB_OFFLINE=1
  # TRANSFORMERS_OFFLINE=1

  # Optional - OpenAI-compatible cloud / gateway (breaks strict offline):
  # DOCMIND_LLM_BACKEND=openai_compatible
  # DOCMIND_OPENAI__BASE_URL=https://api.openai.com/v1
  # DOCMIND_OPENAI__API_KEY=sk-...
  # DOCMIND_OPENAI__API_MODE=responses
  # DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=true
  ```

  For a complete overview, see `docs/developers/configuration.md`. The relevant section is "LLM Backend Selection".
(Optional) Install GPU support (embeddings/reranking acceleration):
Install the repo’s GPU extras and the CUDA wheel index for PyTorch:
nvidia-smi uv sync --extra gpu --index https://download.pytorch.org/whl/cu128 --index-strategy=unsafe-best-match uv run python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"Hardware Guidance:
- CUDA-capable GPU (16 GB VRAM recommended for 128 K context)
- CUDA Toolkit 12.8+
- Driver compatible with CUDA 12.8
Notes:
- vLLM is supported via an external OpenAI-compatible server (see Troubleshooting section 6 for connectivity checks).
- Measure performance on your hardware with
uv run python scripts/performance_monitor.py.
See GPU Setup Guide (installation) and Hardware Policy (hardware/VRAM guidance).
Run locally:

```bash
uv run streamlit run app.py
```

To honor `DOCMIND_UI__STREAMLIT_PORT`, use:

```bash
./scripts/run_app.sh
```

With Docker:

```bash
docker compose up --build
```

Access the app at http://localhost:8501.
Note: GPU reservations in `docker-compose.yml` require Docker Engine with Compose V2 (Docker Compose plugin). The `deploy.resources.reservations.devices` block is ignored on older Compose versions and in swarm mode.
- Select the active provider (`ollama`, `vllm`, `lmstudio`, `llamacpp`).
- Set model, context window, timeout, and GPU acceleration toggle.
- Model IDs are backend-specific (Ollama tags vs OpenAI-compatible model names).
- OpenAI-compatible base URLs are normalized to include `/v1` (LM Studio enforces `/v1`).
- When `DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=false` (default), loopback hosts are always allowed, but non-loopback hosts must be allowlisted via `DOCMIND_SECURITY__ENDPOINT_ALLOWLIST` and must DNS-resolve to public IPs (private/link-local/reserved ranges are rejected).
- Set `DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=true` to opt out (required for private/internal endpoints like Docker service hostnames).
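The strict-mode IP range check can be sketched with the standard-library `ipaddress` module. This is a simplified illustration (the app additionally enforces the allowlist and performs DNS resolution; `ip_permitted_strict` is a hypothetical helper, not the repo's implementation):

```python
import ipaddress

def ip_permitted_strict(ip_str: str) -> bool:
    """Strict-mode sketch: loopback always passes; otherwise only public IPs pass."""
    ip = ipaddress.ip_address(ip_str)
    if ip.is_loopback:
        return True
    # Private, link-local, and reserved ranges are not globally routable.
    return ip.is_global

print(ip_permitted_strict("127.0.0.1"), ip_permitted_strict("10.0.0.5"))  # True False
```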
- Upload files in the Documents page.
- Optional toggles:
  - Build GraphRAG (beta) to create a PropertyGraphIndex when enabled.
  - Encrypt page images (AES-GCM) to store rendered PDF images as `.enc`.
- GraphRAG requires optional graph dependencies; the Settings page shows adapter status.
- Ingestion builds a vector index (Qdrant) and optional graph index, then writes a snapshot to `data/storage/`.
- Graph exports (JSONL/Parquet) are available when a graph index exists.
- The Chat page autoloads the latest snapshot per `graphrag_cfg.autoload_policy`.
- Stale snapshots trigger a warning; rebuild from the Documents page.
- Responses are generated via `MultiAgentCoordinator` and the router engine; the UI streams chunks for readability.
- Enable `DOCMIND_ANALYTICS_ENABLED=true` to use the Analytics page.
- Charts read from `data/analytics/analytics.duckdb` when query metrics are present.
```python
from pathlib import Path

from src.models.processing import IngestionConfig, IngestionInput
from src.processing.ingestion_pipeline import ingest_documents_sync

cfg = IngestionConfig(cache_dir=Path("./cache/ingestion"))
inputs = [
    IngestionInput(
        document_id="doc-1",
        source_path=Path("path/to/document.pdf"),
        metadata={"source": "local"},
    )
]
result = ingest_documents_sync(cfg, inputs)
print(result.manifest.model_dump())
```

```python
from llama_index.core import StorageContext, VectorStoreIndex

from src.agents.coordinator import MultiAgentCoordinator
from src.config import settings
from src.retrieval.router_factory import build_router_engine
from src.utils.storage import create_vector_store

# Requires Qdrant running and embeddings configured.
# Uses `result.nodes` from the ingestion example above.
store = create_vector_store(
    settings.database.qdrant_collection,
    enable_hybrid=settings.retrieval.enable_server_hybrid,
)
storage_context = StorageContext.from_defaults(vector_store=store)
vector_index = VectorStoreIndex(result.nodes, storage_context=storage_context, show_progress=False)
router = build_router_engine(vector_index, pg_index=None, settings=settings)

coord = MultiAgentCoordinator()
resp = coord.process_query(
    "Summarize the key findings and action items",
    context=None,
    settings_override={"router_engine": router, "vector": vector_index},
)
print(resp.content)
```

```python
from src.prompting import list_presets, list_templates, render_prompt

tpl = next(t for t in list_templates() if t.id == "comprehensive-analysis")
tones = list_presets("tones")
roles = list_presets("roles")
ctx = {
    "context": "Example context",
    "tone": tones["professional"],
    "role": roles["assistant"],
}
prompt = render_prompt(tpl.id, ctx)
print(prompt)
```

Templates live in `templates/prompts/*.prompt.md`. Presets are in `templates/presets/*.yaml`.

```python
import os

from src.config.settings import DocMindSettings

os.environ["DOCMIND_LLM_BACKEND"] = "vllm"
os.environ["DOCMIND_VLLM__MODEL"] = "Qwen/Qwen3-4B-Instruct-2507-FP8"
os.environ["DOCMIND_VLLM__CONTEXT_WINDOW"] = "131072"
os.environ["DOCMIND_ENABLE_GPU_ACCELERATION"] = "true"

settings = DocMindSettings()
print(settings.llm_backend, settings.vllm.model, settings.effective_context_window)
```

```python
import hashlib
from pathlib import Path

from src.models.processing import IngestionConfig, IngestionInput
from src.processing.ingestion_pipeline import ingest_documents_sync

folder = Path("/path/to/documents")
extensions = {".pdf", ".docx", ".txt", ".md", ".pptx", ".xlsx"}
paths = [p for p in folder.rglob("*") if p.suffix.lower() in extensions]

inputs = []
for path in paths:
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    inputs.append(
        IngestionInput(
            document_id=f"doc-{digest[:16]}",
            source_path=path,
            metadata={"source": path.name},
        )
    )

result = ingest_documents_sync(IngestionConfig(cache_dir=Path("./cache/ingestion")), inputs)
print(f"Processed {len(result.nodes)} nodes from {len(inputs)} files")
```

```mermaid
flowchart TD
    A["Documents page<br/>Upload files"] --> B["Ingestion pipeline<br/>UnstructuredReader or text fallback"]
    B --> C["TokenTextSplitter, spaCy enrichment (optional),<br/>TitleExtractor (optional)<br/>LlamaIndex IngestionPipeline"]
    C --> D["Nodes and metadata"]
    D --> E["VectorStoreIndex<br/>Qdrant named vectors"]
    C --> F["PDF page image exports<br/>PyMuPDF, optional AES-GCM"]
    D --> G["PropertyGraphIndex<br/>optional"]
    E --> H["RouterQueryEngine<br/>semantic_search / hybrid_search<br/>knowledge_graph"]
    G --> H
    H --> I["MultiAgentCoordinator<br/>LangGraph supervisor - 5 agents"]
    I --> J["Chat page<br/>Responses"]
    K["Snapshot manager<br/>data/storage"] <--> E
    K <--> G
    L["Ingestion cache<br/>DuckDB KV"] <--> C
```
- Parsing: Uses LlamaIndex `UnstructuredReader` when available; falls back to plain text for unsupported inputs.
- Chunking: `TokenTextSplitter` with configurable `chunk_size`/`chunk_overlap`; `TitleExtractor` is optional.
- NLP enrichment (optional): spaCy sentence segmentation + entity extraction during ingestion; outputs are stored as safe node metadata (`docmind_nlp`). See `docs/specs/spec-015-nlp-enrichment-spacy.md`.
- Caching: DuckDB KV ingestion cache with optional docstore persistence.
- PDF page images: PyMuPDF renders page images; optional AES-GCM encryption and `.enc` handling.
- Observability: OpenTelemetry spans are recorded when observability is enabled.
- Unified Text Embeddings: BGE-M3 (BAAI/bge-m3) via LlamaIndex for dense vectors (1024D); sparse query vectors via FastEmbed BM42/BM25 when available.
- Multimodal: SigLIP visual scoring by default; OpenCLIP optional. ColPali visual reranking is optional (`multimodal` extra).
- Multimodal retrieval (PDF images): `multimodal_search` fuses text hybrid with SigLIP text→image retrieval over a dedicated Qdrant image collection and returns image-bearing sources for rendering.
- Fusion: Server-side RRF via the Qdrant Query API when `DOCMIND_RETRIEVAL__ENABLE_SERVER_HYBRID=true` (DBSF optional).
- Deduplication: Configurable key via `DOCMIND_RETRIEVAL__DEDUP_KEY` (`page_id`|`doc_id`); default = `page_id`.
- Router composition: See `src/retrieval/router_factory.py` (tools: `semantic_search`, `hybrid_search`, `knowledge_graph`). Selector preference: `PydanticSingleSelector` (preferred) → `LLMSingleSelector` fallback. The `knowledge_graph` tool is activated only when a PropertyGraphIndex is present and healthy; otherwise the router uses vector/hybrid only.
- Storage: Qdrant vector database with metadata filtering and concurrent access.
- Supervisor Pattern: LangGraph `StateGraph` supervisor (repo-local implementation in `src/agents/supervisor_graph.py`) with checkpoint/store support.
- 5 Specialized Agents:
  - Query Router: Analyzes query complexity and determines the optimal retrieval strategy.
  - Query Planner: Decomposes complex queries into manageable sub-tasks for better processing.
  - Retrieval Expert: Executes optimized retrieval with server-side hybrid (Qdrant) and optional GraphRAG; supports optional DSPy query optimization when enabled.
  - Result Synthesizer: Combines and reconciles results from multiple retrieval passes with deduplication.
  - Response Validator: Validates response quality, accuracy, and completeness before final output.
- Enhanced Capabilities: Optional GraphRAG for multi-hop reasoning and optional DSPy query optimization for query rewriting.
- Workflow Coordination: The supervisor routes between agents; coordination overhead is tracked against a 200ms target threshold.
- Session State: Streamlit session state holds chat history; snapshots persist retrieval artifacts to disk.
- Async Execution: Concurrent agent operations with automatic resource management and fallback mechanisms.
- GPU Acceleration: Optional GPU extras for embeddings/reranking; vLLM runs as an external OpenAI-compatible server.
- Async processing: Asynchronous ingestion is supported; retrieval/rerank stages use bounded timeouts and fail open.
- Reranking: Text cross-encoder + SigLIP visual stage with rank-level RRF merge; ColPali optional.
- Memory Management: Device selection and VRAM checks are centralized in `src/utils/core.py`.
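Tracking a coordination step against a latency budget can be sketched as a tiny timing wrapper (`timed_step` is a hypothetical helper for illustration; the repo's actual tracking lives in `src/agents/coordinator.py`):

```python
import time

def timed_step(fn, threshold_ms: float = 200.0):
    """Run one coordination step, returning (result, elapsed_ms, over_budget)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # over_budget flags steps that exceed the 200ms coordination target.
    return result, elapsed_ms, elapsed_ms > threshold_ms

result, elapsed_ms, over_budget = timed_step(lambda: "route: hybrid_search")
print(over_budget)
```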
DocMind AI uses a unified Pydantic Settings model (`src/config/settings.py`). Environment variables use the `DOCMIND_` prefix with `__` for nested fields. The Streamlit entrypoint calls `bootstrap_settings()` to load `.env` (no import-time `.env` IO).

DocMind's `DOCMIND_*` variables configure the application (routing, security, and provider selection) and are intentionally separate from provider/server variables such as `OLLAMA_*`, `OPENAI_*`, or `VLLM_*` that control those services directly. Keeping a single, app-scoped config surface:

- avoids collisions with provider/daemon env vars on the same machine,
- keeps security policy (remote endpoint allowlisting) centralized, and
- ensures consistent behavior across backends.

Use `DOCMIND_OLLAMA_API_KEY` for Ollama Cloud access; `OLLAMA_*` remains reserved for the Ollama server/CLI itself.

Configuration is centralized and strongly typed. Prefer `.env` overrides and keep runtime toggles in one place for repeatable local runs.
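The prefix-and-delimiter convention can be illustrated with a stdlib-only sketch of what pydantic-settings does for DocMind (`load_docmind_env` is a hypothetical helper, not part of the repo):

```python
def load_docmind_env(environ: dict[str, str]) -> dict:
    """Map DOCMIND_-prefixed variables into nested config keys using '__'."""
    cfg: dict = {}
    for key, value in environ.items():
        if not key.startswith("DOCMIND_"):
            continue  # provider/daemon vars (OLLAMA_*, OPENAI_*, ...) are untouched
        parts = key[len("DOCMIND_"):].lower().split("__")
        node = cfg
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return cfg

env = {
    "DOCMIND_LLM_BACKEND": "ollama",
    "DOCMIND_RETRIEVAL__FUSION_MODE": "rrf",
    "OLLAMA_HOST": "0.0.0.0",  # ignored: belongs to the Ollama server itself
}
print(load_docmind_env(env))  # {'llm_backend': 'ollama', 'retrieval': {'fusion_mode': 'rrf'}}
```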
DocMind AI uses environment variables for configuration. Copy the example file and customize:

```bash
cp .env.example .env
```

Key configuration options in `.env`:

```bash
# LLM backend
DOCMIND_LLM_BACKEND=ollama
DOCMIND_OLLAMA_BASE_URL=http://localhost:11434

# Optional (Ollama Cloud / web search)
# DOCMIND_OLLAMA_API_KEY=
# DOCMIND_OLLAMA_ENABLE_WEB_SEARCH=false
# DOCMIND_OLLAMA_EMBED_DIMENSIONS=
# DOCMIND_OLLAMA_ENABLE_LOGPROBS=false
# DOCMIND_OLLAMA_TOP_LOGPROBS=0

# DOCMIND_LLM_BACKEND=vllm
# DOCMIND_MODEL=Qwen/Qwen3-4B-Instruct-2507-FP8  # top-level override
# DOCMIND_VLLM__VLLM_BASE_URL=http://localhost:8000
# DOCMIND_VLLM__MODEL=Qwen/Qwen3-4B-Instruct-2507-FP8
# DOCMIND_VLLM__CONTEXT_WINDOW=131072

# Embeddings
DOCMIND_EMBEDDING__MODEL_NAME=BAAI/bge-m3

# Retrieval / reranking
DOCMIND_RETRIEVAL__ENABLE_SERVER_HYBRID=false
DOCMIND_RETRIEVAL__FUSION_MODE=rrf
DOCMIND_RETRIEVAL__USE_RERANKING=true
DOCMIND_RETRIEVAL__RERANKING_TOP_K=5

# Cache
DOCMIND_CACHE__DIR=./cache
DOCMIND_CACHE__FILENAME=docmind.duckdb
DOCMIND_CACHE__MAX_SIZE_MB=1000

# GraphRAG (requires both flags)
DOCMIND_ENABLE_GRAPHRAG=false
DOCMIND_GRAPHRAG_CFG__ENABLED=false

# GPU and security toggles
DOCMIND_ENABLE_GPU_ACCELERATION=true
DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=false
```

See the complete `.env.example` file for all available configuration options.
To turn on query optimization via DSPy, enable the feature flag in your `.env`:

```bash
DOCMIND_ENABLE_DSPY_OPTIMIZATION=true
```

Optional tuning (defaults are sensible):

```bash
DOCMIND_DSPY_OPTIMIZATION_ITERATIONS=10
DOCMIND_DSPY_OPTIMIZATION_SAMPLES=20
DOCMIND_DSPY_MAX_RETRIES=3
DOCMIND_DSPY_TEMPERATURE=0.1
DOCMIND_DSPY_METRIC_THRESHOLD=0.8
DOCMIND_ENABLE_DSPY_BOOTSTRAPPING=true
```

Notes:
- DSPy runs in the agents layer and augments retrieval by refining the query; retrieval remains library-first (server-side hybrid via Qdrant + reranking).
- DSPy ships with the default dependencies; if it is unavailable or the flag is false, the system falls back gracefully to standard retrieval.
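The graceful-fallback behavior described above can be sketched as a fail-open wrapper (`rewrite_query` is a hypothetical helper for illustration, not the repo's API):

```python
def rewrite_query(query: str, optimizer=None) -> str:
    """Fail-open pattern: apply the optional optimizer; fall back to the raw query."""
    if optimizer is None:
        return query  # optimizer unavailable or flag disabled
    try:
        return optimizer(query)
    except Exception:
        # Any optimizer failure degrades gracefully to standard retrieval.
        return query

print(rewrite_query("summarize action items"))
print(rewrite_query("summarize action items", lambda q: q + " (expanded)"))
```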
Streamlit UI Configuration (optional):

Create `.streamlit/config.toml` if you want to override Streamlit defaults:

```toml
[theme]
base = "light"
primaryColor = "#FF4B4B"

[server]
maxUploadSize = 200
```

Cache Configuration:

- Ingestion cache: DuckDB KV store under `./cache/docmind.duckdb` (see `DOCMIND_CACHE__DIR` and `DOCMIND_CACHE__FILENAME`).
- PDF page images: rendered under `./cache/page_images/` and stored durably as content-addressed artifacts under `./data/artifacts/` by default.
- Model weights: cached via Hugging Face defaults (`~/.cache/huggingface`).
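Content addressing for artifacts can be sketched as follows. This is an illustrative layout, not the repo's exact storage scheme; `artifact_ref` and `artifact_relpath` are hypothetical helpers:

```python
import hashlib
from pathlib import PurePosixPath

def artifact_ref(data: bytes, suffix: str) -> tuple[str, str]:
    """Content-addressed reference: (sha256 hex digest, file suffix)."""
    return hashlib.sha256(data).hexdigest(), suffix

def artifact_relpath(ref: tuple[str, str]) -> PurePosixPath:
    """Derive a stable on-disk path from the reference alone."""
    sha, suffix = ref
    # Shard by the first two hex chars so no single directory grows unbounded.
    return PurePosixPath(sha[:2]) / f"{sha}{suffix}"

ref = artifact_ref(b"rendered-page-bytes", ".webp")
print(artifact_relpath(ref))
```

The same bytes always yield the same reference, so duplicates deduplicate for free and no host paths or base64 blobs need to live in durable stores.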
Note: Performance depends on hardware, model size, and corpus size. Use the scripts below to measure on your machine.
- Rerank timeouts: text 250ms, SigLIP 150ms, ColPali 400ms, total budget 800ms (`DOCMIND_RETRIEVAL__*`).
- Coordination overhead target: 200ms (`COORDINATION_OVERHEAD_THRESHOLD` in `src/agents/coordinator.py`).
- Context cap: 131072 by default, max 200000 (`DOCMIND_LLM_CONTEXT_WINDOW_MAX`).
- Monitoring thresholds: `DOCMIND_MONITORING__MAX_QUERY_LATENCY_MS`, `DOCMIND_MONITORING__MAX_MEMORY_GB`, `DOCMIND_MONITORING__MAX_VRAM_GB`.

```bash
uv run python scripts/performance_monitor.py --run-tests --check-regressions --report
uv run python scripts/test_gpu.py --quick
```
- Hybrid retrieval uses Qdrant named vectors `text-dense` (1024D COSINE; BGE-M3) and `text-sparse` (FastEmbed BM42/BM25 + IDF) when `DOCMIND_RETRIEVAL__ENABLE_SERVER_HYBRID=true`.
- Default fusion is RRF; DBSF is available with `DOCMIND_RETRIEVAL__FUSION_MODE=dbsf`.
- Prefetch defaults: dense 200, sparse 400; `fused_top_k=60`; `page_id` de-dup.
- Reranking is enabled by default: BGE v2-m3 (text) + SigLIP (visual), with optional ColPali; timeouts are enforced and fail open.
- Feature flags (hybrid, reranking) are env-only; RRF K and timeouts are adjustable in the Settings page.
- Router parity: RouterQueryEngine tools (vector/hybrid/KG) apply the same reranking policy via `node_postprocessors` behind `DOCMIND_RETRIEVAL__USE_RERANKING`.
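Qdrant performs the fusion server-side; the Reciprocal Rank Fusion math it implements can be sketched in a few lines (a minimal illustration, not the app's code path):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["a", "b", "c"]   # dense (BGE-M3) ranking
sparse = ["b", "d", "a"]  # sparse (BM42/BM25) ranking
print(rrf_merge([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

Documents ranked highly in both lists ("b" and "a" here) dominate documents that appear in only one, which is why RRF is a robust default when dense and sparse scores are on different scales.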
- `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` disable network egress (after predownload).
- `DOCMIND_RETRIEVAL__FUSION_MODE=rrf|dbsf` controls Qdrant fusion.
- `DOCMIND_RETRIEVAL__USE_RERANKING=true|false` (canonical env override).
- LLM base URLs are validated when `DOCMIND_SECURITY__ALLOW_REMOTE_ENDPOINTS=false`: loopback is always allowed; allowlisted non-loopback hosts are DNS-resolved and rejected if they map to private/link-local/reserved ranges.
DocMind AI is designed for complete offline operation:
- Install Ollama locally:

  ```bash
  # Download from https://ollama.com/download
  ollama serve  # Start the service
  ```

- Pull required models:

  ```bash
  ollama pull qwen3-4b-instruct-2507  # Recommended for 128 K context
  ollama pull qwen2:7b                # Alternative lightweight model
  ```

- Verify GPU setup (optional):

  ```bash
  nvidia-smi                                 # Check GPU availability
  uv run python scripts/test_gpu.py --quick  # Validate CUDA setup
  ```
Run once (online) to predownload required models for offline use:

```bash
uv run python tools/models/pull.py --all --cache_dir ./models_cache
```

DocMind snapshots persist indices atomically for reproducible retrieval.

- `manifest.meta.json` fields include `schema_version`, `persist_format_version`, `complete`, `created_at`, `index_id`, `graph_store_type`, `vector_store_type`, `corpus_hash`, `config_hash`, `versions`, and `graph_exports` when present.
- Hashing: `corpus_hash` is computed with POSIX relpaths relative to a stable base dir (the Documents UI uses `uploads/`) for OS-agnostic stability.
- Chat autoload: the Chat page loads the latest non-stale snapshot when available; otherwise it shows a staleness badge and offers to rebuild.
- Graph exports preserve relation labels when provided by `get_rel_map` (fallback label `related`). Exports: JSONL baseline (portable) and Parquet (optional, requires PyArrow). Export seeding follows a retriever-first policy: graph → vector → deterministic fallback.
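Why POSIX relpaths matter for a stable corpus hash can be shown with a small sketch (`corpus_hash` here is a hypothetical illustration, not the repo's exact algorithm):

```python
import hashlib
from pathlib import PureWindowsPath

def corpus_hash(rel_files: dict[str, bytes]) -> str:
    """Hash sorted (posix relpath, content digest) pairs into one corpus digest."""
    h = hashlib.sha256()
    for relpath in sorted(rel_files):
        posix = PureWindowsPath(relpath).as_posix()  # normalize path separators
        h.update(posix.encode("utf-8"))
        h.update(hashlib.sha256(rel_files[relpath]).digest())
    return h.hexdigest()

# The same corpus hashed from Windows-style and POSIX-style relative paths:
win = corpus_hash({r"uploads\a.pdf": b"x", r"uploads\b.pdf": b"y"})
posix = corpus_hash({"uploads/a.pdf": b"x", "uploads/b.pdf": b"y"})
print(win == posix)  # True
```

Sorting the relpaths and normalizing separators makes the digest independent of both filesystem enumeration order and operating system, so a snapshot built on Windows matches one built on Linux for the same corpus.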
Set env for offline operation:

```bash
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```

Model sizing depends on your hardware and chosen backend. See Hardware Policy for device and VRAM guidance.
```bash
# Check if Ollama is running
curl http://localhost:11434/api/version
# If not running, start it
ollama serve
```

```bash
# Install GPU dependencies
uv sync --extra gpu --index https://download.pytorch.org/whl/cu128 --index-strategy=unsafe-best-match
# Verify CUDA installation
nvidia-smi
uv run python -c "import torch; print(torch.cuda.is_available())"
```

```bash
# Pull models manually
ollama pull qwen3-4b-instruct-2507  # For 128 K context
ollama pull qwen2:7b                # Alternative
ollama list                         # Verify installation
```

- Reduce context size in Settings (131072 → 65536 → 32768 → 4096)
- Use smaller models (4B instead of 7B/14B for lower VRAM)
- Adjust chunking via `DOCMIND_PROCESSING__CHUNK_SIZE` and `DOCMIND_PROCESSING__CHUNK_OVERLAP`
- Close other applications to free RAM
```bash
# Smoke test ingestion (no external services)
uv run python scripts/run_ingestion_demo.py
# If a specific file fails in the UI, reproduce via a targeted ingest:
uv run python -c "from pathlib import Path; from src.models.processing import IngestionConfig, IngestionInput; from src.processing.ingestion_pipeline import ingest_documents_sync; p=Path('path/to/problem-file.pdf'); r=ingest_documents_sync(IngestionConfig(cache_dir=Path('./cache/ingestion-debug')), [IngestionInput(document_id='debug', source_path=p, metadata={'source': p.name})]); print(f'nodes={len(r.nodes)} exports={len(r.exports)}')"
```

```bash
# Confirm the app is pointing at the right server
echo "$DOCMIND_LLM_BACKEND"
echo "$DOCMIND_OPENAI__BASE_URL"
# vLLM is OpenAI-compatible; this should return JSON.
curl --fail --silent "$DOCMIND_OPENAI__BASE_URL/models" | head
```

Notes:

- vLLM does not support Windows natively; use WSL2 or run vLLM on a Linux host.
- vLLM performance features (FlashInfer, FP8 KV cache) are configured on the vLLM server process, not inside this app.
This repo pins PyTorch 2.8.0 for reproducibility. If you need CUDA wheels, install with the CUDA index:

```bash
uv pip install torch==2.8.0 --extra-index-url https://download.pytorch.org/whl/cu128
uv run python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
```

```bash
# Reduce GPU memory utilization in .env
export DOCMIND_VLLM__GPU_MEMORY_UTILIZATION=0.75  # Reduce from 0.85
# Monitor GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv --loop=1
# Clear GPU memory cache
uv run python -c "import torch; torch.cuda.empty_cache()"
```

```bash
# Run performance validation script
uv run python scripts/performance_monitor.py --run-tests --check-regressions --report
```

- Enable GPU acceleration in the Settings page
- Use appropriate model sizes for your hardware
- Enable caching to speed up repeat analysis
- Adjust chunk sizes based on document complexity
- Use hybrid search for better retrieval quality
- Check logs in the `logs/` directory for detailed errors
- Review the troubleshooting FAQ
- Search existing GitHub Issues
- Open a new issue with: steps to reproduce, error logs, system info
If you use DocMind AI in your research or work, please cite it as follows:

```bibtex
@software{melin_docmind_ai_2025,
  author = {Melin, Bjorn},
  title = {DocMind AI: Local LLM for AI-Powered Document Analysis},
  url = {https://github.com/BjornMelin/docmind-ai-llm},
  version = {0.1.0},
  year = {2025}
}
```

Contributions are welcome! Please follow these steps:
- Fork the repository and create a feature branch
- Set up the development environment:

  ```bash
  git clone https://github.com/your-username/docmind-ai-llm.git
  cd docmind-ai-llm
  uv sync --group dev
  ```

- Make your changes following the established patterns
- Run tests and linting:

  ```bash
  # Lint & format
  uv run ruff format .
  uv run ruff check . --fix
  uv run pyright --threads 4
  # Fast tiered validation (unit + integration)
  uv run python scripts/run_tests.py --fast
  # Coverage gate
  uv run python scripts/run_tests.py --coverage
  # Quality gates (CI-style report)
  uv run python scripts/run_quality_gates.py --ci --report
  ```

- Submit a pull request with a clear description of changes
- Follow the PEP 8 style guide (enforced by Ruff)
- Add type hints for all functions
- Include docstrings for public APIs
- Write tests for new functionality
- Update documentation as needed
We use a tiered test strategy and keep everything offline by default:

- Unit (fast, offline): mocks only; no network/GPU.
- Integration (offline): component interactions; the router uses a session-autouse MockLLM fixture in `tests/integration/conftest.py`, preventing any Ollama/remote calls.
- System/E2E (optional): heavier flows beyond the PR quality gates.
Quick local commands:

```bash
# Fast unit + integration sweep (offline)
uv run python scripts/run_tests.py --fast
# Extras (multimodal) lane - skips automatically when optional deps are missing
uv run python scripts/run_tests.py --extras
# Full coverage gate (unit + integration)
uv run python scripts/run_tests.py --coverage
# Targeted module or pattern
uv run python scripts/run_tests.py tests/unit/persistence/test_snapshot_manager.py
```

Default pytest invocations now run without implicit coverage gates. Use the scripted `--coverage` workflow (or run `coverage report`) when you need HTML, XML, or JSON artifacts for CI or local analysis.
The CI pipeline mirrors this flow, using `uv run python scripts/run_tests.py --fast` as a quick gate followed by `--coverage` for the full report. This keeps coverage thresholds stable while still surfacing integration regressions early. See ADR-014 for quality gates/validation and ADR-029 for the boundary-first testing strategy.
See the Developer Handbook for detailed guidelines. For an overview of the unit test layout and fixture strategy, see tests/README.md.
This project is licensed under the MIT License - see the LICENSE file for details.
DocMind AI configures OpenTelemetry tracing and metrics via `configure_observability` (see SPEC-012).

- Observability is disabled by default; enable it with `DOCMIND_OBSERVABILITY__ENABLED=true`.
- OTLP exporters are used when enabled; set `DOCMIND_OBSERVABILITY__ENDPOINT` and `DOCMIND_OBSERVABILITY__PROTOCOL` as needed.
- LlamaIndex instrumentation requires the `observability` extra (`uv sync --extra observability`).
- Core spans cover ingestion runs, snapshot promotion, GraphRAG exports, router selection, and UI actions.
- Telemetry events (`router_selected`, `export_performed`, `snapshot_stale_detected`) are persisted as JSONL for local audits.
For a local metrics smoke test, run:

```bash
uv run python scripts/demo_metrics_console.py
```

Use `tests/unit/telemetry/test_observability_config.py` as a reference for wiring custom exporters in extensions.
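Because the telemetry events are plain JSONL, local audits need nothing beyond the standard library. A sketch, using hypothetical event payloads shaped around the event names listed above (the real field layout may differ):

```python
import json
from collections import Counter
from io import StringIO

# Hypothetical JSONL telemetry stream; in practice, open the JSONL file instead.
stream = StringIO(
    '{"event": "router_selected", "route": "hybrid_search"}\n'
    '{"event": "export_performed", "format": "jsonl"}\n'
    '{"event": "router_selected", "route": "semantic_search"}\n'
)
counts = Counter(json.loads(line)["event"] for line in stream)
print(dict(counts))  # {'router_selected': 2, 'export_performed': 1}
```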
Built by Bjorn Melin