Context-Engine at a Glance

Context-Engine is a plug-and-play MCP retrieval stack that unifies code indexing, hybrid search, and optional llama.cpp decoding so product teams can ship context-aware agents in minutes, not weeks.

Context-Engine Usage

Key differentiators

  • One-command bring-up delivers dual SSE/RMCP endpoints, seeded Qdrant, and live watch/reindex loops for fast local validation.
  • ReFRAG-inspired micro-chunking, token budgeting, and gate-first filtering surface precise spans while keeping prompts lean.
  • Shared memory/indexer schema and reranker tooling make it easy to mix dense, lexical, and semantic signals without bespoke glue code.
  • NEW: Performance optimizations including connection pooling, intelligent caching, request deduplication, and async subprocess management that cut redundant calls and smooth spikes under load.
  • Operational playbooks (prune, warm, health, cache) plus rich tests give teams confidence to take the stack from laptop to production.

Built for

  • AI platform and IDE tooling teams that need an MCP-compliant context layer without rebuilding indexing, embeddings, or retrieval heuristics.
  • DevEx and documentation groups standing up internal assistants that must ingest large or fast-changing codebases with minimal babysitting.

Solves

  • Slow agent onboarding caused by fractured infra—ship a consistent stack for memory, search, and decoding under one config.
  • Context drift in monorepos—automatic micro-chunking and watcher-driven reindexing keep embeddings aligned with reality.
  • Fragmented client compatibility—serve both legacy SSE and modern HTTP RMCP clients from the same deployment.
  • NEW: Performance relief via intelligent caching, connection pooling, and async I/O patterns that eliminate redundant processing.

Context-Engine

Context-Engine Quickstart (5 minutes)

This gets you from zero to “search works” in under five minutes.

  1. Prereqs
  • Docker + Docker Compose
  • make (optional but recommended)
  • Node/npm if you want to use mcp-remote (optional)
  2. One command (recommended)
# Provisions tokenizer.json, downloads a tiny llama.cpp model, reindexes, and brings all services up
INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual
# Optional: force a clean, full one-shot index of the current repo
HOST_INDEX_PATH=. COLLECTION_NAME=codebase docker compose run --rm indexer --root /work --recreate --no-skip-unchanged
  • Default ports: Memory MCP :8000 (SSE) / :8002 (RMCP), Indexer MCP :8001 (SSE) / :8003 (RMCP), Qdrant :6333, llama.cpp :8080

Seamless Setup Note:

  • The stack uses a single unified codebase collection by default
  • All your code goes into one collection for seamless cross-repo search
  • No per-workspace fragmentation - search across everything at once
  • Health checks auto-detect and fix cache/collection sync issues
  • Just run make reset-dev-dual on any machine and it works™

Make targets: SSE, RMCP, and dual-compat

  • Legacy SSE only (default):
    • Ports: 8000 (/sse), 8001 (/sse)
    • Command: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev
  • RMCP (Codex) only:
    • Ports: 8002 (/mcp), 8003 (/mcp)
    • Command: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-codex
  • Dual compatibility (SSE + RMCP together):
    • Ports: 8000/8001 (/sse) and 8002/8003 (/mcp)
    • Command: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual

Environment Configuration

Default Setup:

  • The repository includes .env.example with sensible defaults for local development
  • On first run, copy it to .env: cp .env.example .env
  • The make reset-dev* targets will use your .env settings automatically

Key Configuration Files:

  • .env — Your local environment variables (gitignored, safe to customize)
  • .env.example — Template with documented defaults (committed to repo)
  • docker-compose.yml — Service definitions that read from .env

Recommended Customizations:

  1. Enable micro-chunking (better retrieval quality):

    INDEX_MICRO_CHUNKS=1
    MAX_MICRO_CHUNKS_PER_FILE=200
  2. Enable decoder for Q&A (context_answer tool):

    REFRAG_DECODER=1              # Enable decoder (default: 1)
    REFRAG_RUNTIME=llamacpp       # Use llama.cpp (default) or glm
  3. GPU acceleration (Apple Silicon Metal):

    # Option A: Use the toggle script (recommended)
    scripts/gpu_toggle.sh gpu
    scripts/gpu_toggle.sh start
    
    # Option B: Manual .env settings
    USE_GPU_DECODER=1
    LLAMACPP_URL=http://host.docker.internal:8081
    LLAMACPP_GPU_LAYERS=32        # or -1 for all layers
  4. Alternative: GLM API (instead of local llama.cpp):

    REFRAG_RUNTIME=glm
    GLM_API_KEY=your-api-key-here
    GLM_MODEL=glm-4.6             # Optional, defaults to glm-4.6
  5. Collection name (unified by default):

    COLLECTION_NAME=codebase      # Default: single unified collection for all code
    # Only change this if you need isolated collections per project

After changing .env:

  • Restart services: docker compose restart mcp_indexer mcp_indexer_http
  • For indexing changes: make reindex or make reindex-hard
  • For decoder changes: docker compose up -d --force-recreate llamacpp (or restart native server)

Switch decoder model (llama.cpp)

  • Default tiny model: Granite 4.0 Micro (Q4_K_M GGUF)
  • Change the model by overriding Make vars (downloads to ./models/model.gguf):
LLAMACPP_MODEL_URL="https://huggingface.co/ORG/MODEL/resolve/main/model.gguf" \
  INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev-dual
  • Want GPU acceleration? Set LLAMACPP_USE_GPU=1 (optionally LLAMACPP_GPU_LAYERS=-1) in your .env before docker compose up, or simply run scripts/gpu_toggle.sh gpu (described below) to flip the switch for you.
  • Embeddings: set EMBEDDING_MODEL in .env and reindex (make reindex)

Decoder env toggles (set in .env and managed automatically by scripts/gpu_toggle.sh):

Variable | Description | Typical values
--- | --- | ---
USE_GPU_DECODER | Feature flag for the native Metal decoder | 0 (docker), 1 (native)
LLAMACPP_URL | Decoder endpoint containers should use | http://llamacpp:8080 or http://host.docker.internal:8081
LLAMACPP_GPU_LAYERS | Number of layers to offload to GPU (-1 = all) | 0, 32, -1

Alternative (compose only)

HOST_INDEX_PATH="$(pwd)" FASTMCP_INDEXER_PORT=8001 docker compose up -d qdrant mcp mcp_indexer indexer watcher

Recommended development flow

  1. Bring the stack up with the reset target that matches your client (make reset-dev, make reset-dev-codex, or make reset-dev-dual).
  2. When you need a clean ingest (after large edits or when the qdrant_status tool/make qdrant-status reports zero points), run make reindex-hard. This clears .codebase/cache.json before recreating the collection so unchanged files cannot be skipped.
  3. Confirm collection health with make qdrant-status (calls the MCP router to print counts and timestamps).
  4. Iterate using search helpers such as make hybrid ARGS="--query 'async file watcher'" or invoke the MCP tools directly from your client.

Apple Silicon Metal GPU (native) vs Docker decoder

On Apple Silicon you can run the llama.cpp decoder natively with Metal while keeping the rest of the stack in Docker:

  1. Install the Metal-enabled llama.cpp binary (e.g. brew install llama.cpp).
  2. Flip to GPU mode and start the native server:
    scripts/gpu_toggle.sh gpu
    scripts/gpu_toggle.sh start   # launches llama-server on localhost:8081
    docker compose up -d --force-recreate mcp_indexer mcp_indexer_http
    docker compose stop llamacpp   # optional once the native server is healthy
    The toggle updates .env to point at http://host.docker.internal:8081 so containers reach the host process.
  3. Run scripts/gpu_toggle.sh status to confirm the native server is healthy. All MCP context_answer calls will now use the Metal-backed decoder.

Want the original dockerised decoder (CPU-only or x86 GPU fallback)? Swap back with:

scripts/gpu_toggle.sh docker
docker compose up -d --force-recreate mcp_indexer mcp_indexer_http llamacpp

This re-enables the llamacpp container and resets .env to http://llamacpp:8080.

Make targets (quick reference)

  • reset-dev: SSE stack on 8000/8001; seeds Qdrant, downloads tokenizer + tiny llama.cpp model, reindexes, brings up memory + indexer + watcher
  • reset-dev-codex: RMCP stack on 8002/8003; same seeding + bring-up for Codex/Qodo
  • reset-dev-dual: SSE + RMCP together (8000/8001 and 8002/8003)
  • up / down / logs / ps: Docker Compose lifecycle helpers
  • index / reindex / reindex-hard: Index current repo; reindex recreates the collection; reindex-hard also clears the local cache so unchanged files are re-uploaded
  • index-here / index-path: Index arbitrary host path without cloning into this repo
  • watch: Watch-and-reindex on file changes
  • warm / health: Warm caches and run health checks
  • hybrid / rerank: Example hybrid search + reranker helper
  • setup-reranker / rerank-local / quantize-reranker: Manage ONNX reranker assets and local runs
  • prune / prune-path: Remove stale points (missing files or hash mismatch)
  • llama-model / tokenizer: Fetch tiny GGUF model and tokenizer.json
  • qdrant-status / qdrant-list / qdrant-prune / qdrant-index-root: Convenience wrappers that route through the MCP bridge to inspect or maintain collections

CLI: ctx prompt enhancer

A thin CLI that retrieves code context and rewrites your input into a better, context-aware prompt using the local LLM decoder. Works with both questions and commands/instructions. By default it prints ONLY the improved prompt.

Examples:

# Questions: Enhanced with specific details and multiple aspects
scripts/ctx.py "What is ReFRAG?"
# Output: Two detailed question paragraphs with file/line references

# Commands: Enhanced with concrete targets and implementation details
scripts/ctx.py "Refactor ctx.py"
# Output: Two detailed instruction paragraphs with specific steps

# Unicorn mode: staged 2–3 pass enhancement for best results
scripts/ctx.py "Refactor ctx.py" --unicorn

# Via Make target (default improved prompt only)
make ctx Q="Explain the caching logic to me in detail"

# Filter by language/path or adjust tokens
make ctx Q="Hybrid search details" ARGS="--language python --under scripts/ --limit 2 --rewrite-max-tokens 200"

Detail mode (short snippets)

Include compact code snippets in the retrieved context for richer rewrites (trades a bit of speed for quality):

# Enable detail mode (adds short snippets) - works with questions
scripts/ctx.py "Explain the caching logic" --detail

# Detail mode with commands - gets more specific implementation details
scripts/ctx.py "Add error handling to ctx.py" --detail

# Adjust snippet size if needed (default is 1 line when --detail is used)
make ctx Q="Explain hybrid search" ARGS="--detail --context-lines 2"

Notes:

  • Default behavior is header-only (fastest). --detail adds short snippets.
  • If --detail is set and --context-lines remains at its default (0), ctx.py automatically uses 1 line to keep snippets concise. Override with --context-lines N.
  • Detail mode is optimized for speed: automatically clamps to max 4 results and 1 result per file.

Unicorn mode (staged multi-pass for best quality)

Use --unicorn for the highest quality prompt enhancement with a staged 2-3 pass approach:

# Unicorn mode with commands - produces exceptional, detailed instructions
scripts/ctx.py "refactor ctx.py" --unicorn

# Unicorn mode with questions - produces highly intelligent, multi-faceted questions
scripts/ctx.py "what is ReFRAG and how does it work?" --unicorn

# Works with all filters
scripts/ctx.py "add error handling" --unicorn --language python

How it works:

Unicorn Usage

Unicorn mode uses multiple LLM passes with progressively richer code context:

  1. Pass 1 (Draft): Retrieves rich code snippets (8 lines of context per match) to understand the codebase and sharpen the intent
  2. Pass 2 (Refine): Retrieves even richer snippets (12 lines of context) based on the draft to ground the prompt with concrete code behaviors
  3. Pass 3 (Polish): Optional cleanup pass that runs only if the output appears generic or incomplete

Key features:

  • Code-grounded: References actual code behaviors and patterns from your codebase, not file paths or line numbers
  • No hallucinations: Only uses real code from your indexed repository - never invents references
  • Multi-paragraph output: Produces detailed, comprehensive prompts that explore multiple aspects
  • Works with both questions and commands: Enhances any type of prompt

When to use:

  • Normal mode: Quick, everyday prompts (fastest)
  • --detail: Richer context without multi-pass overhead (balanced)
  • --unicorn: When you need the absolute best prompt quality (highest quality)

Advanced Features

1. Streaming Output (Default)

All modes now stream tokens as they arrive for instant feedback:

# Streaming is enabled by default - see output appear immediately
scripts/ctx.py "refactor ctx.py" --unicorn

To disable streaming (wait for full response):

  • Set "streaming": false in ~/.ctx_config.json

2. Memory Blending

Automatically falls back to context_search with memories when repo search returns no hits:

# If no code matches, ctx.py will search design docs and ADRs
scripts/ctx.py "What is our authentication strategy?"

This ensures you get relevant context even when the query doesn't match code directly.

3. Adaptive Context Sizing

Automatically adjusts limit and context_lines based on query characteristics:

  • Short/vague queries → More context for richer grounding
  • Queries with file/function names → Lighter settings for speed
# Short query → auto-increases context
scripts/ctx.py "caching"

# Specific query → optimized for speed
scripts/ctx.py "refactor fetch_context function in ctx.py"

4. Automatic Quality Assurance

Enhanced _needs_polish() heuristic automatically triggers a third polish pass when:

  • Output is too short (< 180 chars)
  • Contains generic/vague language
  • Missing concrete code references
  • Lacks proper paragraph structure

This happens transparently in --unicorn mode - no user action needed.
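
For intuition, here is a rough approximation of such a check, assuming criteria like those listed above (the real _needs_polish() in ctx.py may differ):

import re

GENERIC_PHRASES = ("in general", "as appropriate", "best practices", "it depends")

def needs_polish(text: str) -> bool:
    """Rough approximation of the polish trigger described above."""
    if len(text) < 180:                       # too short
        return True
    lowered = text.lower()
    if any(p in lowered for p in GENERIC_PHRASES):  # generic/vague language
        return True
    # No identifier-like token (dotted name, snake_case, or a call)?
    if not re.search(r"\b\w+\.\w+\b|\b\w+_\w+\b|\b\w+\(", text):
        return True
    if "\n\n" not in text:                    # single blob, no paragraphs
        return True
    return False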

5. Personalized Templates

Create ~/.ctx_config.json to customize prompt enhancement behavior:

{
  "always_include_tests": true,
  "prefer_bullet_commands": false,
  "extra_instructions": "Always consider error handling and edge cases",
  "streaming": true
}

Available preferences:

  • always_include_tests: Add testing considerations to all prompts
  • prefer_bullet_commands: Format commands as bullet points
  • extra_instructions: Custom instructions added to every rewrite
  • streaming: Enable/disable streaming output (default: true)

See ctx_config.example.json for a template.

GPU Acceleration (Apple Silicon): For faster prompt rewriting, use the native Metal-accelerated decoder:

# 1. Set USE_GPU_DECODER=1 in your .env file (already set by default)
# 2. Start the native llama.cpp server with Metal GPU
scripts/gpu_toggle.sh start

# Now ctx.py will automatically use the GPU decoder on port 8081
make ctx Q="Explain the caching logic to me in detail"

# Stop the native GPU server
scripts/gpu_toggle.sh stop

# To use Docker decoder instead, set USE_GPU_DECODER=0 in .env and restart:
docker compose up -d llamacpp


Index another codebase (outside this repo)

You can index any local folder by mounting it at /work. Three easy ways:

  1. Make target: index a specific path
make index-path REPO_PATH=/abs/path/to/other/repo [RECREATE=1] [REPO_NAME=name] [COLLECTION=name]
  • RECREATE=1 drops and recreates the collection before indexing
  • Defaults: REPO_NAME and COLLECTION fall back to the folder name
  2. Make target: index the current working directory
cd /abs/path/to/other/repo
make -C /Users/user/Desktop/Context-Engine index-here [RECREATE=1] [REPO_NAME=name] [COLLECTION=name]
  3. Raw docker compose (one‑shot ingest without Make)
docker compose run --rm \
  -v /abs/path/to/other/repo:/work \
  indexer --root /work [--recreate]

Notes:

  • No need to bind-mount this repository; the images bake in /app/scripts and set WORK_ROOTS="/work,/app" so utilities import correctly.
  • MCP clients can connect to the running servers and operate on whichever folder is mounted at /work.

Supported IDE clients/extensions

  • Roo (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
  • Cline (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
  • Windsurf (SSE/RMCP): supports both SSE and RMCP connections; see config examples below
  • Zed (SSE): uses mcp-remote bridge via command/args; see config below
  • Kiro (SSE): uses mcp-remote bridge via command/args; see config below
  • Qodo (RMCP): connects directly to HTTP endpoints; add each tool individually
  • OpenAI Codex (RMCP): TOML config for memory/indexer URLs
  • Augment (SSE): simple JSON configs for both servers
  • AmpCode (SSE): simple URL for both legacy sse endpoints
  • Claude Code CLI (SSE): simple JSON configs for both servers
  1. Verify endpoints
# Qdrant DB
curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
# Decoder (llama.cpp sidecar)
curl -s http://localhost:8080/health
# SSE endpoints (Memory, Indexer)
curl -sI http://localhost:8000/sse | head -n1
curl -sI http://localhost:8001/sse | head -n1
# RMCP endpoints (HTTP JSON-RPC)
curl -sI http://localhost:8002/mcp | head -n1
curl -sI http://localhost:8003/mcp | head -n1

Configuration reference (env vars)

Core

  • COLLECTION_NAME: Qdrant collection to use (defaults to repo name if unset in some flows)
  • REPO_NAME: Logical name for the indexed repo; stored in payload for filtering
  • HOST_INDEX_PATH: Absolute host path to index (mounted to /work in containers)

Indexing / micro-chunks

  • INDEX_MICRO_CHUNKS: 1 to enable micro‑chunking; when off, indexing falls back to line-based chunks
  • MAX_MICRO_CHUNKS_PER_FILE: Cap micro‑chunks per file (e.g., 200 default)
  • TOKENIZER_URL, TOKENIZER_PATH: Hugging Face tokenizer.json URL and local path
  • USE_TREE_SITTER: 1 to enable tree-sitter parsing (optional; off by default)

Watcher

  • WATCH_DEBOUNCE_SECS: Debounce between change events (e.g., 1.5)
  • INDEX_UPSERT_BATCH / INDEX_UPSERT_RETRIES / INDEX_UPSERT_BACKOFF: Upsert tuning
  • QDRANT_TIMEOUT: Request timeout in seconds for upserts/queries (e.g., 60–90)
  • MCP_TOOL_TIMEOUT_SECS: Max duration for long-running MCP tools (index/prune); default 3600s
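
For illustration, a minimal asyncio sketch of the debounce pattern that WATCH_DEBOUNCE_SECS controls (the actual watcher implementation may differ):

import asyncio, os

DEBOUNCE = float(os.getenv("WATCH_DEBOUNCE_SECS", "1.5"))

async def debounced_reindex(events: asyncio.Queue, reindex):
    """Coalesce bursts of file events: wait until the stream has been
    quiet for DEBOUNCE seconds, then reindex the touched paths once."""
    pending = set()
    while True:
        pending.add(await events.get())        # block until first event
        while True:
            try:
                pending.add(await asyncio.wait_for(events.get(), DEBOUNCE))
            except asyncio.TimeoutError:
                break                          # quiet period elapsed
        await reindex(sorted(pending))
        pending.clear()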

Reranker

  • RERANKER_ONNX_PATH, RERANKER_TOKENIZER_PATH: Paths for local ONNX cross‑encoder
  • RERANKER_ENABLED: 1/true to enable, 0/false to disable; default is enabled in server
    • Timeouts/failures automatically fall back to hybrid results

Decoder (llama.cpp / GLM)

  • REFRAG_DECODER: 1 to enable decoder for context_answer; 0 to disable (default: 1)
  • REFRAG_RUNTIME: llamacpp or glm (default: llamacpp)
  • LLAMACPP_URL: llama.cpp server endpoint (default: http://llamacpp:8080 or http://host.docker.internal:8081 for GPU)
  • LLAMACPP_TIMEOUT_SEC: Decoder request timeout in seconds (default: 300)
  • DECODER_MAX_TOKENS: Max tokens for decoder responses (default: 4000)
  • REFRAG_DECODER_MODE: prompt or soft (default: prompt; soft requires patched llama.cpp)
  • GLM_API_KEY: API key for GLM provider (required when REFRAG_RUNTIME=glm)
  • GLM_MODEL: GLM model name (default: glm-4.6)
  • USE_GPU_DECODER: 1 for native Metal decoder on host, 0 for Docker (managed by gpu_toggle.sh)
  • LLAMACPP_GPU_LAYERS: Number of layers to offload to GPU, -1 for all (default: 32)

ReFRAG (micro-chunking and retrieval)

  • REFRAG_MODE: 1 to enable micro-chunking and span budgeting (default: 1)
  • REFRAG_GATE_FIRST: 1 to enable mini-vector gating before dense search (default: 1)
  • REFRAG_CANDIDATES: Number of candidates for gate-first filtering (default: 200)
  • MICRO_BUDGET_TOKENS: Global token budget for context_answer spans (default: 512)
  • MICRO_OUT_MAX_SPANS: Max number of spans to return per query (default: 3)

Ports

  • FASTMCP_PORT (SSE/RMCP): Override Memory MCP ports (defaults: 8000/8002)
  • FASTMCP_INDEXER_PORT (SSE/RMCP): Override Indexer MCP ports (defaults: 8001/8003)

Env var quick table

Name | Description | Default
--- | --- | ---
COLLECTION_NAME | Qdrant collection name (unified across all repos) | codebase
REPO_NAME | Logical repo tag stored in payload for filtering | auto-detect from git/folder
HOST_INDEX_PATH | Host path mounted at /work in containers | current repo (.)
QDRANT_URL | Qdrant base URL | container: http://qdrant:6333; local: http://localhost:6333
INDEX_MICRO_CHUNKS | Enable token-based micro-chunking | 0 (off)
HYBRID_EXPAND | Enable heuristic multi-query expansion | 0 (off)
MAX_MICRO_CHUNKS_PER_FILE | Cap micro-chunks per file | 200
TOKENIZER_URL | HF tokenizer.json URL (for Make download) | n/a (use Make target)
TOKENIZER_PATH | Local path where tokenizer is saved (Make) | models/tokenizer.json
TOKENIZER_JSON | Runtime path for tokenizer (indexer) | models/tokenizer.json
USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 0 (off)
WATCH_DEBOUNCE_SECS | Debounce between FS events (watcher) | 1.5
INDEX_UPSERT_BATCH | Upsert batch size (watcher) | 128
INDEX_UPSERT_RETRIES | Retry count (watcher) | 5
MCP_TOOL_TIMEOUT_SECS | Max duration for long-running MCP tools | 3600
INDEX_UPSERT_BACKOFF | Seconds between retries (watcher) | 0.5
QDRANT_TIMEOUT | HTTP timeout seconds | watcher: 60; search: 20
RERANKER_ONNX_PATH | Local ONNX cross-encoder model path | unset (see make setup-reranker)
RERANKER_TOKENIZER_PATH | Tokenizer path for reranker | unset
RERANKER_ENABLED | Enable reranker by default | 1 (enabled)
FASTMCP_PORT | Memory MCP server port (SSE/RMCP) | 8000 (container-internal)
FASTMCP_INDEXER_PORT | Indexer MCP server port (SSE/RMCP) | 8001 (container-internal)
FASTMCP_HTTP_PORT | Memory RMCP host port mapping | 8002
FASTMCP_INDEXER_HTTP_PORT | Indexer RMCP host port mapping | 8003
FASTMCP_HEALTH_PORT | Health port (memory/indexer) | memory: 18000; indexer: 18001
LLM_EXPAND_MAX | Max alternate queries generated via LLM | 0
REFRAG_DECODER | Enable decoder for context_answer | 1 (enabled)
REFRAG_RUNTIME | Decoder backend: llamacpp or glm | llamacpp
LLAMACPP_URL | llama.cpp server endpoint | http://llamacpp:8080 or http://host.docker.internal:8081
LLAMACPP_TIMEOUT_SEC | Decoder request timeout | 300
DECODER_MAX_TOKENS | Max tokens for decoder responses | 4000
GLM_API_KEY | API key for GLM provider | unset
GLM_MODEL | GLM model name | glm-4.6
USE_GPU_DECODER | Native Metal decoder (1) vs Docker (0) | 0 (docker)
REFRAG_MODE | Enable micro-chunking and span budgeting | 1 (enabled)
REFRAG_GATE_FIRST | Enable mini-vector gating | 1 (enabled)
REFRAG_CANDIDATES | Candidates for gate-first filtering | 200
MICRO_BUDGET_TOKENS | Token budget for context_answer | 512

Running tests

Local (recommended)

  • Python 3.11+
  • Create venv and install deps:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  • Run the full suite:
pytest -q
  • Run a single file or test:
pytest tests/test_ingest_micro_chunks.py -q
pytest tests/test_php_support.py::test_imports -q
  • Tips:
    • RERANKER_ENABLED=0 can speed up some tests locally; functionality still validated via hybrid fallback.
    • Some integration tests may start ephemeral containers via testcontainers; ensure Docker is running.

Inside Docker (optional, ad-hoc)

  • You can run tests in the indexer image by overriding the entrypoint:
docker compose run --rm --entrypoint pytest mcp-indexer -q

Note: the provided dev images focus on runtime; local venv is faster for iterative testing.

Language support

  • Python, JavaScript/TypeScript, Go, Java, Rust, Shell, Terraform, PowerShell, YAML, C#, PHP

Watcher behavior and tips

  • Handles delete and move: removes/migrates points to avoid stale entries
  • Live reloads ignore patterns: changes to .qdrantignore are applied without restart
  • path_glob matches against relative paths (e.g., src/**/*.py), not absolute /work paths
  • If upserts time out, lower INDEX_UPSERT_BATCH (e.g., 96) or raise QDRANT_TIMEOUT (e.g., 90)
  • For very large files, reduce MAX_MICRO_CHUNKS_PER_FILE (e.g., 200) during dev

Expected HTTP behaviors

  • GET /mcp may return 400 (normal): the RMCP endpoint is POST-only for JSON-RPC
  • SSE requires a session handshake; raw POST /messages without it will error (expected)
  1. Single command to index + search
# Fresh index of your repo and a quick hybrid example
make reindex-hard
make qdrant-status
make hybrid ARGS="--query 'async file watcher' --limit 5 --include-snippet"
  2. Example MCP client configurations

Kiro (SSE): Create .kiro/settings/mcp.json in your workspace:

{
  "mcpServers": {
    "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
    "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
  }
}

Zed (SSE): Add to your Zed settings.json (accessed via Command Palette → "Settings: Open Settings (JSON)"):

{
  /// The name of your MCP server
  "qdrant-indexer": {
    /// The command which runs the MCP server
    "command": "npx",
    /// The arguments to pass to the MCP server
    "args": [
      "mcp-remote",
      "http://localhost:8001/sse",
      "--transport",
      "sse-only"
    ],
    /// The environment variables to set
    "env": {}
  }
}

Notes:

  • Zed expects MCP servers at the root level of settings.json
  • Uses command/args (stdio). mcp-remote bridges to remote SSE endpoints
  • If npx prompts, add -y right after npx: "command": "npx", "args": ["-y", "mcp-remote", ...]
  • Alternative: Use direct HTTP connection if mcp-remote has issues:
    {
      "qdrant-indexer": {
        "type": "http",
        "url": "http://localhost:8001/sse"
      }
    }
  • For Qodo (RMCP) clients, see "Qodo Integration (RMCP config)" below for the direct url-based snippet.
  3. Common troubleshooting
  • Tree-sitter not found or parser errors:

    • Feature is optional. If you set USE_TREE_SITTER=1 and see errors, unset it or install tree-sitter deps, then reindex.
  • Tokenizer missing for micro-chunks:

    • Run make tokenizer or set TOKENIZER_JSON to a valid tokenizer.json; otherwise we fall back to line-based chunking.
  • SSE “Invalid session ID” when POSTing /messages directly:

    • Expected if you didn’t initiate an SSE session first. Use an MCP client (e.g., mcp-remote) to handle the handshake.
  • llama.cpp platform warning on Apple Silicon:

    • Prefer the native path above (scripts/gpu_toggle.sh gpu). If you stick with Docker, add platform: linux/amd64 to the service or ignore the warning during local dev.
  • Indexing feels stuck on very large files:

    • Use MAX_MICRO_CHUNKS_PER_FILE=200 during dev runs.
  • Watcher timeouts (-9) or Qdrant "ResponseHandlingException: timed out":

    • Set watcher-safe defaults to reduce payload size and add headroom during upserts:
    # Watcher-safe defaults (compose already applies these to the watcher service)
    QDRANT_TIMEOUT=60
    MAX_MICRO_CHUNKS_PER_FILE=200
    INDEX_UPSERT_BATCH=128
    INDEX_UPSERT_RETRIES=5
    INDEX_UPSERT_BACKOFF=0.5
    WATCH_DEBOUNCE_SECS=1.5
    • If issues persist, try lowering INDEX_UPSERT_BATCH to 96 or raising QDRANT_TIMEOUT to 90.

ReFRAG background: https://arxiv.org/abs/2509.01092

Endpoints

Component | URL
--- | ---
Memory MCP | http://localhost:8000/sse
Indexer MCP | http://localhost:8001/sse
Qdrant DB | http://localhost:6333

Streamable HTTP (RMCP) endpoints + OpenAI Codex config

OpenAI Codex config (RMCP client):

experimental_use_rmcp_client = true

[mcp_servers.memory_http]
url = "http://127.0.0.1:8002/mcp"

[mcp_servers.qdrant_indexer_http]
url = "http://127.0.0.1:8003/mcp"

Kiro Integration (workspace config)

Add this to your workspace-level Kiro config at .kiro/settings/mcp.json (restart Kiro after saving):

{
  "mcpServers": {
    "qdrant-indexer": { "command": "npx", "args": ["mcp-remote", "http://localhost:8001/sse", "--transport", "sse-only"] },
    "memory": { "command": "npx", "args": ["mcp-remote", "http://localhost:8000/sse", "--transport", "sse-only"] }
  }
}

Notes:

  • Kiro expects command/args (stdio). mcp-remote bridges to remote SSE endpoints.
  • If npx prompts in your environment, add -y right after npx.
  • Workspace config overrides user-level config (~/.kiro/settings/mcp.json).

Troubleshooting:

  • Error: “Enabled MCP Server must specify a command, ignoring.”
    • Fix: Use the command/args form above; do not use type:url in Kiro.
  • ImportError: No module named 'scripts' when calling memory_store on the indexer MCP
    • Fix applied: server now adds /work and /app to sys.path. Restart mcp_indexer.

Available MCP tools

Memory MCP (8000 SSE, 8002 RMCP):

  • store(information, metadata?, collection?) — write a memory entry into the default collection (dual vectors: dense + lexical)
  • find(query, limit=5, collection?, top_k?) — hybrid memory search over memory-like entries

Indexer/Search MCP (8001 SSE, 8003 RMCP):

  • repo_search — hybrid code search (dense + lexical + optional reranker)
  • context_search — search that can also blend memory results (include_memories)
  • context_answer — natural-language Q&A with retrieval + local LLM (llama.cpp or GLM)
  • code_search — alias of repo_search
  • repo_search_compat — permissive wrapper that normalizes q/text/queries/top_k payloads
  • context_answer_compat — permissive wrapper for context_answer with lenient argument handling
  • expand_query(query, max_new?) — LLM-assisted query expansion (generates 1-2 alternates)
  • qdrant_index_root — index /work (mounted repo root) with safe defaults
  • qdrant_index(subdir?, recreate?, collection?) — index a subdir or recreate collection
  • qdrant_prune — remove points for missing files or file_hash mismatch
  • qdrant_list — list Qdrant collections
  • qdrant_status — collection counts and recent ingestion timestamps
  • workspace_info(workspace_path?) — read .codebase/state.json and resolve default collection
  • list_workspaces(search_root?) — scan for multiple workspaces in multi-repo environments
  • memory_store — convenience memory store from the indexer (uses default collection)
  • search_tests_for — intent wrapper for test files
  • search_config_for — intent wrapper for likely config files
  • search_callers_for — intent wrapper for probable callers/usages
  • search_importers_for — intent wrapper for files importing a module/symbol
  • change_history_for_path(path) — summarize recent changes using stored metadata
  • collection_map — return collection↔repo mappings
  • default_collection — set the collection to use for the session

Notes:

  • Most search tools accept filters like language, under, path_glob, kind, symbol, ext.
  • Reranker enabled by default; timeouts fall back to hybrid results.
  • context_answer requires decoder enabled (REFRAG_DECODER=1) with llama.cpp or GLM backend.

Qodo Integration (RMCP config)

Add this to your Qodo MCP settings to target the RMCP (HTTP) endpoints:

{
  "mcpServers": {
    "memory": { "url": "http://localhost:8002/mcp" },
    "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
  }
}

Note: Qodo can talk to the RMCP endpoints directly, so no mcp-remote wrapper is required.

Architecture overview

flowchart LR
  subgraph Host/IDE
    A[IDE Agents]
  end
  subgraph Docker Network
    B(Memory MCP :8000)
    C(MCP Indexer :8001)
    D[Qdrant DB :6333]
    G[[llama.cpp Decoder :8080]]
    E[(One-shot Indexer)]
    F[(Watcher)]
  end
  A -- SSE /sse --> B
  A -- SSE /sse --> C
  B -- HTTP 6333 --> D
  C -- HTTP 6333 --> D
  E -- HTTP 6333 --> D
  F -- HTTP 6333 --> D
  C -. HTTP 8080 .-> G
  classDef opt stroke-dasharray: 5 5
  class G opt

Production-ready local development

One-line bring-up (ship-ready)

Start Qdrant, the Memory MCP (8000), the Indexer MCP (8001), and run a fresh index of your current repo:

HOST_INDEX_PATH="$(pwd)" FASTMCP_INDEXER_PORT=8001 docker compose up -d qdrant mcp mcp_indexer indexer watcher

Then wire your MCP-aware IDE/tooling to the endpoints above (Memory: http://localhost:8000/sse, Indexer: http://localhost:8001/sse).

Tip: the command above already includes watcher for live reindex-on-save; drop it from the service list if you don't need that.

SSE Memory Server (port 8000)

  • URL: http://localhost:8000/sse
  • Tools: store, find
  • Env (used by the indexer to blend memory):
    • MEMORY_SSE_ENABLED=true
    • MEMORY_MCP_URL=http://mcp:8000/sse
    • MEMORY_MCP_TIMEOUT=6

IDE/Agent config (recommended):

{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}

Blended search: see “Enable memory blending (for context_search)” below.

Memory usage patterns (how to get the most from memories)

When to use memories vs code search

  • Use memories when the information isn’t in your repository or is transient/user-authored: conventions, runbooks, decisions, links, known issues, FAQs, “how we do X here”.
  • Use code search for facts that live in the repo: APIs, functions/classes, configuration, and cross-file relationships.
  • Blend both for tasks like “how to run E2E tests” where instructions (memory) reference scripts in the repo (code).
  • Rule of thumb: if you’d write it in a team wiki or ticket comment, store it as a memory; if you’d grep for it, use code search.

Recommended metadata schema (best practices)

We store memory entries as points in Qdrant with a small, consistent payload. Recommended keys:

  • kind: "memory" (string) – required. Enables filtering and blending.
  • topic: short category string (e.g., "dev-env", "release-process").
  • tags: list of strings (e.g., ["qdrant", "indexing", "prod"]).
  • source: where this came from (e.g., "chat", "manual", "tool", "issue-123").
  • author: who added it (e.g., username or email).
  • created_at: ISO8601 timestamp (UTC).
  • expires_at: ISO8601 timestamp if this memory should be pruned later.
  • repo: optional repo identifier if sharing a Qdrant instance across repos.
  • link: optional URL to docs, tickets, or dashboards.
  • priority: 0.0–1.0 weight that clients can use to bias ranking when blending.

Notes:

  • Keep values small (short strings, small lists). Don’t store large blobs in payload; put details in the information text.
  • Use lowercase snake_case keys for consistency.
  • For secrets/PII: do not store plaintext. Store references or vault paths instead.
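
As an example of using the priority key on the client side, here is a small blending sketch; the hit shape (score plus metadata.priority) is assumed for illustration:

def blend(code_hits, memory_hits, memory_weight=0.3):
    """Client-side blending sketch: bias each memory hit by its payload
    `priority` (0.0–1.0) and interleave with code hits by adjusted score."""
    ranked = [(h["score"], h) for h in code_hits]
    for h in memory_hits:
        prio = float(h.get("metadata", {}).get("priority", 0.5))
        ranked.append((h["score"] * (1 - memory_weight) + prio * memory_weight, h))
    return [h for _, h in sorted(ranked, key=lambda t: t[0], reverse=True)]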

Example memory operations

Store a memory (via MCP Memory server tool store – use your MCP client):

{
  "information": "Run full reset: INDEX_MICRO_CHUNKS=1 MAX_MICRO_CHUNKS_PER_FILE=200 make reset-dev",
  "metadata": {
    "kind": "memory",
    "topic": "dev-env",
    "tags": ["make", "reset"],
    "source": "chat"
  }
}

Find memories (via MCP Memory server tool find):

{
  "query": "reset-dev",
  "limit": 5
}

Blend memories into code search (Indexer MCP context_search):

{
  "query": "async file watcher",
  "include_memories": true,
  "limit": 5,
  "include_snippet": true
}

Tips:

  • Use precise queries (2–5 tokens). Add a couple of synonyms if needed; the server supports multiple phrasings.
  • Combine topic/tags in your memory text to make them easier to find (they also live in payload for filtering).

Backup and migration (advanced)

For production-grade backup/migration strategies, see the official Qdrant documentation for snapshots and export/import. For local development, we recommend relying on Docker volumes and reindexing when needed.

Operational notes:

  • Collection name comes from COLLECTION_NAME (see .env). This stack defaults to a single collection for both code and memories; filtering uses metadata.kind.

  • If you switch to a dedicated memory collection, update the MCP Memory server and the Indexer's memory blending env to point at it.

  • Consider pruning expired memories by filtering expires_at < now.

  • Call context_search on :8001 (SSE) or :8003 (RMCP) with { "include_memories": true } to return both memory and code results.

Collection Naming Strategies

Different hash lengths are used for different workspace types:

Local Workspaces: repo-name-8charhash

  • Example: Anesidara-e8d0f5fc
  • Used by local indexer/watcher
  • Assumes unique repo names within workspace

Remote Uploads: folder-name-16charhash-8charhash

  • Example: testupload2-04e680d5939dd035-b8b8d4cc
  • Avoids collisions when different codebases share the same folder name
  • 16-char hash identifies workspace, 8-char hash identifies collection
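
A hypothetical reconstruction of the local scheme, assuming the 8-char suffix is a truncated SHA-256 of the workspace path (the actual hash input and algorithm may differ):

import hashlib

def local_collection_name(repo_name: str, workspace_path: str) -> str:
    """Hypothetical reconstruction: repo name plus the first 8 hex chars
    of a hash of the workspace path."""
    digest = hashlib.sha256(workspace_path.encode()).hexdigest()
    return f"{repo_name}-{digest[:8]}"

print(local_collection_name("Anesidara", "/Users/dev/Anesidara"))
# e.g. Anesidara-3f9c1a2b  (hash input and algorithm are assumptions)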

Enable memory blending (for context_search)

  1. Ensure the Memory MCP is running on :8000 (default in compose).
  2. Enable SSE memory blending on the Indexer MCP by setting these env vars for the mcp_indexer service (docker-compose.yml):
services:
  mcp_indexer:
    environment:
      - MEMORY_SSE_ENABLED=true
      - MEMORY_MCP_URL=http://mcp:8000/sse
      - MEMORY_MCP_TIMEOUT=6
  3. Restart the indexer service:
docker compose up -d mcp_indexer
  4. Validate by calling context_search with include_memories=true for a query that matches a stored memory:
{
  "query": "your test memory text",
  "include_memories": true,
  "limit": 5
}

Expected: non-zero results with blended items; memory hits will have memory-like payloads (e.g., metadata.kind = "memory").

  • Idempotent + incremental indexing out of the box:

    • Skips unchanged files automatically using a file content hash stored in payload (metadata.file_hash)
    • De-duplicates per-file points by deleting prior entries for the same path before insert
    • Payload indexes are auto-created on first run (metadata.language, metadata.path_prefix, metadata.repo, metadata.kind, metadata.symbol, metadata.symbol_path, metadata.imports, metadata.calls)
  • Commands:

    • Full rebuild: make reindex
    • Fast incremental: make index (skips unchanged files)
    • Health check: make health (verifies collection vector name/dim, HNSW, and filtered queries with kind/symbol)
    • Hybrid search: make hybrid (dense + lexical bump with RRF)
  • Bootstrap all services + index + checks: make bootstrap

  • Discover commands: make help lists all targets and descriptions

  • Ingest Git history: make history (messages + file lists)

    • If the repo has no local commits yet, the history ingester will shallow-fetch from the remote (default: origin) and use its HEAD. Configure with --remote and --fetch-depth.
  • Local reranker (ONNX): make rerank-local (set RERANKER_ONNX_PATH and RERANKER_TOKENIZER_PATH)

  • Setup ONNX reranker quickly: make setup-reranker ONNX_URL=... TOKENIZER_URL=... (updates .env paths)

  • Enable Tree-sitter parsing (more accurate symbols/scopes): set USE_TREE_SITTER=1 in .env then reindex

  • Flags (advanced):

    • Disable de-duplication: docker compose run --rm indexer --root /work --no-dedupe
    • Disable unchanged skipping: docker compose run --rm indexer --root /work --no-skip-unchanged

Notes:

  • Named vector remains aligned with the MCP server (fast-bge-base-en-v1.5). If you change EMBEDDING_MODEL, run make reindex to recreate the collection.
  • For very large repos, consider running make index on a schedule (or pre-commit) to keep Qdrant warm without full reingestion.

Multi-repo indexing (unified search)

The stack uses a single unified codebase collection by default, making multi-repo search seamless:

Index another repo into the same collection:

# From your qdrant directory
make index-here HOST_INDEX_PATH=/path/to/other/repo REPO_NAME=other-repo

# Or with full control:
HOST_INDEX_PATH=/path/to/other/repo \
COLLECTION_NAME=codebase \
REPO_NAME=other-repo \
docker compose run --rm indexer --root /work

What happens:

  • Files from the other repo get indexed into the unified codebase collection
  • Each file is tagged with metadata.repo = "other-repo" for filtering
  • Search across all repos by default, or filter by specific repo

Search examples:

# Search across all indexed repos
make hybrid QUERY="authentication logic"

# Filter by specific repo
python scripts/hybrid_search.py \
  --query "authentication logic" \
  --repo other-repo

# Filter by repo + language
python scripts/hybrid_search.py \
  --query "authentication logic" \
  --repo other-repo \
  --language python

Benefits:

  • One collection = unified search across all your code
  • No fragmentation or collection management overhead
  • Filter by repo when you need isolation
  • All repos share the same vector space for better semantic search

Multi-query re-ranker (no new deps)

  • Run a fused query with several phrasings and metadata-aware boosts:
make rerank
  • Customize:
    • Add more --query flags
    • Prefer language: --language python
    • Prefer under path: --under /work/scripts

Watch mode (incremental indexing)

  • Reindex changed files on save (runs until Ctrl+C):
make watch

HNSW recall tuning

  • Collection creation is tuned for higher recall: m=16, ef_construct=256.
  • If you change embeddings, run make reindex to recreate the collection with the tuned HNSW settings.

Warm start (optional)

  • Preload the embedding model and warm Qdrant's HNSW search path to reduce first-query latency and improve recall:
make warm

Client note: since this stack already exposes SSE, you can point MCP clients at http://localhost:8000/sse directly (recommended for Cursor/Windsurf).

Search filters (repo_search/context_search)

Most MCP clients let you pass structured tool arguments. The Indexer/search MCP supports applying server-side filters in repo_search/context_search when these keys are present:

  • language: value matches metadata.language
  • path_prefix: value matches metadata.path_prefix (e.g., /work/src)
  • kind: value matches metadata.kind (e.g., function, class, method)

Tip: Combine multiple query phrasings and apply these filters for best precision on large codebases.


Index your repository (code search quality)

We added a dockerized indexer that chunks code, embeds with BAAI/bge-base-en-v1.5, and stores metadata (path, path_prefix, language, start_line, end_line, code) in Qdrant. This boosts recall and relevance for the MCP tools.

# Index current workspace (does not drop data)
make index

# Full reindex (drops existing points in the collection)
make reindex

Companion MCP: Index/Prune/List (Option B)

A second MCP server runs alongside the search MCP and exposes tools:

  • qdrant-list: list collections
  • qdrant-index: index the mounted path (/work or subdir)
  • qdrant-prune: prune stale points for the mounted path

Configuration

  • FASTMCP_INDEXER_PORT (default 8001)
  • HOST_INDEX_PATH bind-mounts the target repo into /work (read-only)

Add to your agent as a separate MCP endpoint (SSE):

  • URL: http://localhost:8001/sse

Example calls (semantics vary by client):

  • qdrant-index with args {"subdir":"scripts","recreate":true}

MCP client configuration examples

Roo (SSE/RMCP):

{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}

Cline (SSE/RMCP):

{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}

Windsurf (SSE/RMCP):

{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}

Windsurf/Cursor (stdio for search + SSE for indexer):

{
  "mcpServers": {
    "qdrant": {
      "command": "uvx",
      "args": ["mcp-server-qdrant"],
      "env": {
        "QDRANT_URL": "http://localhost:6333",
        "COLLECTION_NAME": "my-collection",
        "EMBEDDING_MODEL": "BAAI/bge-base-en-v1.5"
      },
      "disabled": false
    }
  }
}

Augment (SSE for both servers – recommended):

{
  "mcpServers": {
    "memory": { "type": "sse", "url": "http://localhost:8000/sse", "disabled": false },
    "qdrant-indexer": { "type": "sse", "url": "http://localhost:8001/sse", "disabled": false }
  }
}

Qodo (RMCP; add each tool individually):

Note: In Qodo, you must add each MCP tool separately through the UI, not as a single JSON config.

For each tool, use this format:

Tool 1 - memory:

{
  "memory": { "url": "http://localhost:8002/mcp" }
}

Tool 2 - qdrant-indexer:

{
  "qdrant-indexer": { "url": "http://localhost:8003/mcp" }
}

Important for IDE agents (Cursor/Windsurf/Augment)

  • Do not send null values to MCP tools. Omit the field or pass an empty string "" instead.
  • qdrant-index examples:
    • {"subdir":"","recreate":false,"collection":"my-collection","repo_name":"workspace"}
    • {"subdir":"scripts","recreate":true}
  • For indexing the repo root with no params, use the zero-arg tool qdrant_index_root (new) or call qdrant-index with subdir:"".

Zero-config search tool (new)

  • repo_search: run code search without filters or config.

    • Structured fields supported (parity with DSL): language, under, kind, symbol, ext, not_, case, path_regex, path_glob, not_glob

    • Response shaping: compact (bool) returns only path/start_line/end_line

    • Smart default: compact=true when query is an array with multiple queries (unless explicitly set)

    • If include_snippet is true, compact is forced off so snippet fields are returned

    • Glob fields accept a single string or an array; you can also pass a comma-separated string which will be split

    • Query parsing: accepts query or queries; JSON arrays, JSON-stringified arrays, comma-separated strings; also supports q/text aliases

    • Parity note: path_glob/not_glob list handling works in both modes — in-process and subprocess — with OR semantics for path_glob and reject-on-any for not_glob.

    • Examples:

      • {"query": "semantic chunking"}
      • {"query": ["function to split code", "overlapping chunks"], "limit": 15, "per_path": 3}
      • {"query": "watcher debounce", "language": "python", "under": "scripts/", "include_snippet": true, "context_lines": 2}
      • {"query": "parser", "ext": "ts", "path_regex": "/services/.+", "compact": true}
      • {"query": "adapter", "path_glob": ["/src/", "/pkg/"], "not_glob": "/tests/"}
    • Returns structured results: score, path, symbol, start_line, end_line, and optional snippet; or compact form.

  • code_search: alias of repo_search (same args) for easier discovery in some clients.

  • qdrant_status: return collection size and last index times (safe, read-only).

    • {"collection": "my-collection"}

Verification:

  • You should see tools from both servers (e.g., store, find, repo_search, code_search, context_search, qdrant_list, qdrant_index, qdrant_prune, qdrant_status).

  • Call qdrant_list to confirm Qdrant connectivity.

  • Call qdrant_index with args like { "subdir": "scripts", "recreate": true } to (re)index the mounted repo.

  • Call context_search with { "include_memories": true } to blend memory+code (requires enabling MEMORY_SSE_ENABLED on the indexer service).

  • qdrant_list with no args

  • qdrant_prune with no args

Notes:

  • The indexer reads env from .env (QDRANT_URL, COLLECTION_NAME, EMBEDDING_MODEL).
  • Default chunking: ~120 lines with 20-line overlap.
  • Skips typical build/venv directories.
  • Populates metadata.kind, metadata.symbol, and metadata.symbol_path for Python/JS/TS/Go/Java/Rust/Terraform (best-effort), per chunk.
  • Uses the same collection as the MCP server.
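
A minimal sketch of that default chunking (fixed-size line windows with overlap); the parameters mirror INDEX_CHUNK_LINES/INDEX_CHUNK_OVERLAP, but the real logic lives in the ingest scripts:

def chunk_lines(text: str, chunk: int = 120, overlap: int = 20):
    """Yield (start_line, end_line, chunk_text) windows; overlapping lines
    keep definitions that straddle a boundary visible in two chunks."""
    lines = text.splitlines()
    if not lines:
        return
    step = chunk - overlap
    for start in range(0, len(lines), step):
        end = min(start + chunk, len(lines))
        yield start + 1, end, "\n".join(lines[start:end])  # 1-based lines
        if end == len(lines):
            break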

Exclusions (.qdrantignore) and defaults

  • The indexer now supports a .qdrantignore file at the repo root (similar to .gitignore). Use it to exclude directories/files from indexing.
  • Sensible defaults are excluded automatically (overridable): /models, /node_modules, /dist, /build, /.venv, /venv, /__pycache__, /.git, and files matching *.onnx, *.bin, *.safetensors, tokenizer.json, *.whl, *.tar.gz.
  • Override via env or flags:
    • Env: QDRANT_DEFAULT_EXCLUDES=0 to disable defaults; QDRANT_IGNORE_FILE=.myignore; QDRANT_EXCLUDES='tokenizer.json,*.onnx,/third_party'
    • CLI examples:
      • docker compose run --rm indexer --root /work --ignore-file .qdrantignore
      • docker compose run --rm indexer --root /work --no-default-excludes --exclude '/vendor' --exclude '*.bin'
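
For intuition, an fnmatch-based approximation of how such patterns can be applied (the real matcher may handle more gitignore syntax):

import fnmatch

DEFAULT_EXCLUDES = ["/models", "/node_modules", "/.git", "*.onnx", "*.bin"]

def is_excluded(rel_path: str, patterns=DEFAULT_EXCLUDES) -> bool:
    """Approximation of ignore handling: leading-slash patterns anchor to
    a top-level directory, other patterns match the file's basename."""
    for pat in patterns:
        if pat.startswith("/"):
            if rel_path == pat[1:] or rel_path.startswith(pat[1:] + "/"):
                return True
        elif fnmatch.fnmatch(rel_path.rsplit("/", 1)[-1], pat):
            return True
    return False

print(is_excluded("models/model.gguf"))   # True (default exclude)
print(is_excluded("scripts/ctx.py"))      # False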

Scaling and tuning (small → large codebases)

  • Chunking and batching are tunable via env or flags:
    • INDEX_CHUNK_LINES (default 120), INDEX_CHUNK_OVERLAP (default 20)
    • INDEX_BATCH_SIZE (default 64)
    • INDEX_PROGRESS_EVERY (default 200 files; 0 disables)
  • CLI equivalents: --chunk-lines, --chunk-overlap, --batch-size, --progress-every.
  • Recommendations:
    • Small repos (<100 files): chunk 80–120, overlap 16–24, batch-size 32–64
    • Medium (100s–1k files): chunk 120–160, overlap ~20, batch-size 64–128
    • Large monorepos (1k+): start with defaults; consider INDEX_PROGRESS_EVERY=200 for visibility and INDEX_BATCH_SIZE=128 if RAM allows

Prune stale points (optional)

If files were deleted or significantly changed outside the indexer, remove stale points safely:

make prune

ReFRAG micro-chunking (retrieval-side, production-ready)

ReFRAG-lite is enabled in this repo and can be toggled via env. It provides:

  • Token-level micro-chunking at ingest (tiny k-token windows with stride)
  • Compact vector gating and optional gate-first candidate restriction
  • Span compaction and a global token budget at search time

Enable and tune:

# Enable compressed retrieval with micro-chunks
REFRAG_MODE=1
INDEX_MICRO_CHUNKS=1

# Micro windowing
MICRO_CHUNK_TOKENS=16
MICRO_CHUNK_STRIDE=8

# Output shaping and budget
MICRO_OUT_MAX_SPANS=3
MICRO_MERGE_LINES=4
MICRO_BUDGET_TOKENS=512
MICRO_TOKENS_PER_LINE=32

# Optional: gate-first using mini vectors to prefilter dense search
REFRAG_GATE_FIRST=0
REFRAG_CANDIDATES=200
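
A sketch of the token-level windowing these MICRO_CHUNK_* knobs describe (illustrative; the indexer operates on tokenizer.json output):

def micro_windows(tokens, k=16, stride=8):
    """Token-level micro-chunking sketch: overlapping k-token windows,
    advancing by `stride` tokens (k=MICRO_CHUNK_TOKENS, stride=MICRO_CHUNK_STRIDE)."""
    for start in range(0, max(len(tokens) - k + 1, 1), stride):
        yield start, tokens[start:start + k]

toks = "def chunk ( text ) : return text . split ( )".split()
for offset, window in micro_windows(toks, k=8, stride=4):
    print(offset, " ".join(window))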

Reindex after changing chunking:

# Recreate collection (safe for local dev)
docker compose exec mcp_indexer python -c "from scripts.mcp_indexer_server import qdrant_index_root; qdrant_index_root(recreate=True)"

What results look like (context_search / code_search return shape):

{
  "score": 0.9234,
  "path": "scripts/ingest_code.py",
  "start_line": 120,
  "end_line": 148,
  "span_budgeted": true,
  "budget_tokens_used": 224,
  "components": { "dense": 0.78, "lex": 0.35, "mini": 0.81 },
  "why": ["dense", "mini"]
}

Notes:

  • span_budgeted=true indicates adjacent micro hits were merged and counted toward the global token budget.
  • Tune MICRO_* to control prompt footprint. Increase MICRO_MERGE_LINES to merge looser spans; reduce MICRO_OUT_MAX_SPANS for more file diversity.
  • Gate-first reduces dense search compute on large collections; keep off for tiny repos.

Decoder-path ReFRAG (feature-flagged)

This stack ships a feature-flagged decoder integration path via a llama.cpp sidecar. It is production-safe by default (off) and can run in a fallback “prompt” mode that uses a compressed textual context. A future “soft” mode will inject projected chunk embeddings into a patched llama.cpp server.

Decoder-path dataflow (compress → sense → expand)

flowchart LR
  %% Retrieval side
  Q[Query] --> R[Hybrid search + span budgeting]
  R --> S[Selected micro-spans]

  %% Projection (φ) and modes
  S -->|project via φ| P[(Soft embeddings)]
  S -. prompt compress .-> C[Compressed prompt]

  %% Decoder service
  subgraph Decoder
    G[[llama.cpp :8080]]
  end

  %% Mode routing
  P -->|soft mode| G
  C -->|prompt mode| G

  %% Output
  G --> O[Completion]

  %% Notes
  classDef opt stroke-dasharray: 5 5
  class C opt

Enable (safe default is off):

REFRAG_DECODER=1
REFRAG_RUNTIME=llamacpp
LLAMACPP_URL=http://llamacpp:8080
REFRAG_DECODER_MODE=prompt  # prompt|soft (soft requires patched llama.cpp)
REFRAG_ENCODER_MODEL=BAAI/bge-base-en-v1.5
REFRAG_PHI_PATH=/work/models/refrag_phi_768_to_dmodel.json

Bring up llama.cpp sidecar (optional):

docker compose up -d llamacpp

Make-based provisioning (recommended):

# downloads a tiny GGUF to ./models/model.gguf (override URL via LLAMACPP_MODEL_URL)
make llamacpp-up
# or just fetch the model without starting the service
make llama-model

Optional: bake the model into the image (no host volume required):

# builds an image that includes the model specified by MODEL_URL
make llamacpp-build-image LLAMACPP_MODEL_URL=https://huggingface.co/.../tiny.gguf
# then in docker-compose.yml, either remove the ./models volume for llamacpp
# or override the service to use image: context-llamacpp:tiny

Programmatic use:

from scripts.refrag_llamacpp import LlamaCppRefragClient
c = LlamaCppRefragClient()  # uses LLAMACPP_URL
text = c.generate_with_soft_embeddings("Question: ...\n", soft_embeddings=None, max_tokens=128)

Notes:

  • φ file format: JSON 2D array with shape (d_in, d_model). See scripts/refrag_phi.py. Set REFRAG_PHI_PATH to your JSON file.

  • In prompt mode, the client calls /completion on the llama.cpp server with a compressed prompt.

  • In soft mode, the client will require a patched server to accept soft embeddings. The flag ensures no breakage.
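
A minimal numpy sketch of loading and applying such a φ matrix (see scripts/refrag_phi.py for the real loader):

import json
import numpy as np

def load_phi(path: str) -> np.ndarray:
    """Load the projection matrix: a JSON 2D array of shape (d_in, d_model)."""
    with open(path) as f:
        phi = np.asarray(json.load(f), dtype=np.float32)
    assert phi.ndim == 2, "phi must be a 2D array (d_in, d_model)"
    return phi

def project(chunk_embeddings: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """Map (n, d_in) retrieval embeddings into the decoder's d_model space."""
    return chunk_embeddings @ phi   # -> (n, d_model)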

Alternative: GLM API Provider

Instead of running llama.cpp locally, you can use the GLM API (ZhipuAI) as your decoder backend:

Setup:

# In .env
REFRAG_DECODER=1
REFRAG_RUNTIME=glm          # Switch from llamacpp to glm
GLM_API_KEY=your-api-key    # Required
GLM_MODEL=glm-4.6           # Optional, defaults to glm-4.6

How it works:

  • Uses OpenAI SDK with base_url="https://api.z.ai/api/paas/v4/"
  • Supports prompt mode only (soft embeddings ignored)
  • Handles GLM-4.6's reasoning mode (reasoning_content field)
  • Drop-in replacement for llama.cpp—same interface, no code changes needed

Switch back to llama.cpp:

REFRAG_RUNTIME=llamacpp

The GLM provider is implemented in scripts/refrag_glm.py and automatically selected when REFRAG_RUNTIME=glm.
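
Under the hood this amounts to an OpenAI-SDK call like the sketch below (illustrative; refrag_glm.py is the authoritative implementation, and the reasoning_content access is a GLM-specific assumption):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GLM_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",  # GLM's OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model=os.getenv("GLM_MODEL", "glm-4.6"),
    messages=[{"role": "user", "content": "Summarize this code context: ..."}],
)
msg = resp.choices[0].message
# GLM-4.6 may return its chain of thought in a non-standard field;
# getattr guards against SDK versions that don't expose it.
print(getattr(msg, "reasoning_content", None))
print(msg.content)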

How context_answer works (with decoder)

The context_answer MCP tool answers natural-language questions using retrieval + a decoder sidecar.

  • Inputs (most relevant): query, limit, per_path, budget_tokens, include_snippet, collection, language, path_glob/not_glob
  • Outputs:
    • answer (string)
    • citations: [ { path, start_line, end_line, container_path? }, ... ]
    • query: list of query strings actually used
    • used: { "gate_first": true|false, "refrag": true|false }

Pipeline

  1. Hybrid search (gate-first): Uses MINI-vector gating when REFRAG_GATE_FIRST=1 to prefilter candidates, then runs dense+lexical fusion
  2. Micro-span budgeting: Merges adjacent micro hits and applies a global token budget (REFRAG_MODE=1, MICRO_BUDGET_TOKENS, MICRO_OUT_MAX_SPANS)
  3. Prompt assembly: Builds compact context blocks and a “Sources” footer
  4. Decoder call: When REFRAG_DECODER=1, calls the configured runtime (REFRAG_RUNTIME=llamacpp or glm) to synthesize the final answer
  5. Return: Answer + citations + usage flags; errors keep citations for debugging
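
Conceptually, step 1's gate-first pass looks like the numpy sketch below (illustrative; the real prefiltering happens inside Qdrant queries):

import numpy as np

def gate_first(query_mini, query_dense, mini_vecs, dense_vecs, candidates=200, k=10):
    """Gate-first sketch: rank everything by the cheap mini vector first,
    then score only the top `candidates` with the full dense vector."""
    gate_scores = mini_vecs @ query_mini            # cheap pass over all points
    gated = np.argsort(-gate_scores)[:candidates]   # REFRAG_CANDIDATES
    dense_scores = dense_vecs[gated] @ query_dense  # expensive pass, small set
    return gated[np.argsort(-dense_scores)[:k]]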

Environment toggles

  • Retrieval: REFRAG_MODE=1, REFRAG_GATE_FIRST=1, REFRAG_CANDIDATES=200
  • Budgeting/output: MICRO_BUDGET_TOKENS, MICRO_OUT_MAX_SPANS
  • Decoder: REFRAG_DECODER=1, LLAMACPP_URL=http://localhost:8080

Fallbacks and safety

  • If gate-first yields 0 items and no strict language filter is set, the tool automatically retries without gating
  • If the decoder call fails, the response contains { "error": "..." } plus citations, so you can still inspect sources

Quick health + example

# Decoder health (llama.cpp sidecar)
curl -s http://localhost:8080/health

# Qdrant
curl -sSf http://localhost:6333/readyz >/dev/null && echo "Qdrant OK"
# Minimal local call (uses the running MCP indexer server code)
import os, asyncio
os.environ.update(
  QDRANT_URL="http://localhost:6333",
  COLLECTION_NAME="my-collection",
  REFRAG_MODE="1", REFRAG_GATE_FIRST="1",
  REFRAG_DECODER="1", LLAMACPP_URL="http://localhost:8080",
)
from scripts import mcp_indexer_server as srv
async def t():
    out = await srv.context_answer(query="How does hybrid search work?", limit=5)
    print(out["used"], len(out.get("citations", [])), len(out.get("answer", "")))
asyncio.run(t())

Implementation

  • See scripts/mcp_indexer_server.py (context_answer tool) for the full pipeline, env knobs, and debug flags (DEBUG_CONTEXT_ANSWER=1).

MCP search filtering (language, path, kind)

  • The indexer creates payload indexes for efficient filtering.
  • When querying (via MCP client or scripts), you can filter by:
    • metadata.language (e.g., python, typescript, javascript, go, rust)
    • metadata.path_prefix (e.g., /work/src)
    • metadata.kind (e.g., function, class, method)
  • Example: in the provided reranker script you can do:
make rerank ARGS="--language python --under /work/scripts"

Operational safeguards and troubleshooting

  • Tokenizer for micro-chunking: set TOKENIZER_JSON to a valid tokenizer.json path (default: models/tokenizer.json). If missing, the indexer falls back to line-based chunking.
  • Cap micro-chunks per file: MAX_MICRO_CHUNKS_PER_FILE (default 2000) to prevent runaway chunk counts on very large files.
  • Qdrant client timeout: QDRANT_TIMEOUT (seconds, default 20) applies to all MCP Qdrant calls.
  • Memory auto-detect caching: MEMORY_AUTODETECT=1 by default with MEMORY_COLLECTION_TTL_SECS (default 300s) to avoid repeatedly sampling all collections.
  • Schema repair: ensure_collection now repairs missing named vectors (lex, and mini when REFRAG_MODE=1) on existing collections.
  • Most MCP clients allow passing tool args that map to server-side filters; if your client supports adding structured args to qdrant-find, prefer these filters to reduce noise.

Payload indexes (created for you)

We create payload indexes to accelerate filtered searches:

  • metadata.language (keyword)
  • metadata.path_prefix (keyword)
  • metadata.repo (keyword)
  • metadata.kind (keyword)
  • metadata.symbol (keyword)
  • metadata.symbol_path (keyword)
  • metadata.imports (keyword)
  • metadata.calls (keyword)
  • metadata.file_hash (keyword)
  • metadata.ingested_at (keyword)
  • Git history fields available in payload: commit_id, author_name, authored_date, message, files

Payload indexes enable fast server-side filters (e.g., language, path_prefix, kind, symbol). Prefer using the MCP tools repo_search/context_search with filter arguments rather than raw Qdrant REST/Python snippets. See the Qdrant documentation if you need low-level API examples.

Best-practice querying

  • Use precise intent + language: “python chunking function for Qdrant indexing”
  • Add path hints when you know the area: “under scripts or ingestion code”
  • Try 2–3 alternative phrasings (multi-query) and pick the consensus
  • Prefer results where metadata.language matches your target file
  • For navigation, prefer results where metadata.path_prefix matches your directory
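
The make hybrid helper fuses rankings with RRF, and picking the consensus across phrasings can be done the same way. A self-contained sketch:

from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each ranking contributes 1/(k + rank),
    so items that appear near the top of several phrasings win."""
    scores = defaultdict(float)
    for ranking in rankings:                 # one ranked list per phrasing
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

runs = [
    ["a.py", "b.py", "c.py"],    # "async file watcher"
    ["b.py", "a.py", "d.py"],    # "watch files for changes"
]
print(rrf(runs))  # consensus hits a.py/b.py float to the top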

Client tips:

  • MCP tools: issue multiple finds with variant phrasings and re-rank by score + metadata match
  • Direct Qdrant: use vector={name: ..., vector: ...} with the named vector above
  • Data persists in the qdrant_storage Docker volume.
  • The MCP server uses SSE transport and will auto-create the collection if it doesn't exist.
  • Only FastEmbed models are supported at this time.

Troubleshooting

Collection Health & Cache Sync

The stack includes automatic health checks that detect and fix cache/collection sync issues:

Check collection health:

python scripts/collection_health.py --workspace . --collection codebase

Auto-heal cache issues:

python scripts/collection_health.py --workspace . --collection codebase --auto-heal

What it detects:

  • Empty collection with cached files (cache thinks files are indexed but they're not)
  • Significant mismatch between cached files and actual collection contents
  • Missing metadata in collection points

When to use:

  • After manually deleting collections
  • If searches return no results despite indexing
  • After Qdrant crashes or data loss
  • When switching between collection names

Automatic healing:

  • Health checks run automatically on watcher and indexer startup
  • Cache is cleared when sync issues are detected
  • Files are reindexed on next run
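
A simplified sketch of that check using qdrant-client (the cache format is assumed to be a JSON map; collection_health.py is the authoritative implementation):

import json, os
from qdrant_client import QdrantClient

def check_and_heal(workspace=".", collection="codebase", auto_heal=False):
    """If the cache says files are indexed but the collection is empty,
    the cache is stale: clear it so the next run reindexes everything."""
    cache_path = os.path.join(workspace, ".codebase", "cache.json")
    cached = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cached = json.load(f)   # assumed: map of path -> file hash
    client = QdrantClient(url=os.getenv("QDRANT_URL", "http://localhost:6333"))
    points = client.count(collection_name=collection, exact=True).count
    if cached and points == 0:
        print(f"stale cache: {len(cached)} cached files, empty collection")
        if auto_heal:
            os.remove(cache_path)   # cleared cache -> full reindex next run
    return points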

General Issues

  • If the MCP servers can’t reach Qdrant, confirm both containers are up: make ps.
  • If the SSE port collides, change FASTMCP_PORT in .env and the mapped port in docker-compose.yml.
  • If you customize tool descriptions, restart: make restart.
  • If searches return no results, check collection health (see above).
