This document defines the unyielding principles that guide the development and design of sift. When making architectural decisions, evaluating PRs, or planning new features, these rules must be satisfied.
sift is a CLI tool, not a service stack.
- No Daemons: There will be no long-running background processes or resident services.
- No Databases: We do not require users to install, configure, or manage external databases (e.g., Postgres, Redis, external Vector DBs).
- Stateless UX: From the user's perspective, sift operates on directories immediately. Any caching or indexing must happen transparently in standard user cache directories (e.g., ~/.cache/sift) without requiring explicit lifecycle management commands (sift start, sift stop).
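
As an illustration of transparent cache placement, the following minimal sketch resolves a per-user cache directory. It assumes the XDG convention (XDG_CACHE_HOME first, then $HOME/.cache), which the text above does not mandate; the function name `cache_dir` is hypothetical, and the environment values are passed in as arguments so the logic stays testable.

```rust
use std::path::PathBuf;

/// Resolve the cache directory for sift from explicit environment values.
/// Assumption: a non-empty XDG_CACHE_HOME wins; otherwise fall back to
/// $HOME/.cache. Returns None when neither value is usable.
fn cache_dir(xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        // `home?` returns None from the function when HOME is unset.
        _ => PathBuf::from(home?).join(".cache"),
    };
    Some(base.join("sift"))
}
```

A caller at the edge of the program might pass `std::env::var("XDG_CACHE_HOME").ok().as_deref()` and `std::env::var("HOME").ok().as_deref()`, then create the directory lazily on first use so that no explicit setup command is needed.
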
sift follows the Zig-style "Search Asset Pipeline" model.
- File-Level Heuristics: We use filesystem metadata (mtime, inode, size) to avoid redundant work (text extraction, hashing, embedding) on unchanged files.
- Content-Addressable Storage: Cached assets are keyed by BLAKE3 content hashes for cross-project deduplication.
- Advisory Locking: All cache manifests are protected by filesystem advisory locks to ensure safe concurrent operations from multiple sift processes.
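
The file-level heuristic can be sketched as a comparison between a stored metadata stamp and the current one. This is a minimal sketch, not sift's actual manifest schema: `FileStamp`, `Action`, and `decide` are hypothetical names. On Unix, the three values could be read through `std::os::unix::fs::MetadataExt`.

```rust
/// Filesystem metadata recorded the last time a file was processed.
/// Hypothetical layout for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileStamp {
    mtime_ns: i64,
    inode: u64,
    size: u64,
}

/// What the pipeline should do with a file on this run.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// Metadata is unchanged: skip extraction, hashing, and embedding.
    Reuse,
    /// File is new or its metadata changed: redo the work.
    Reprocess,
}

/// Compare the cached stamp (if any) with the current one.
fn decide(cached: Option<&FileStamp>, current: &FileStamp) -> Action {
    match cached {
        Some(previous) if previous == current => Action::Reuse,
        _ => Action::Reprocess,
    }
}
```

Only files that fall through to `Action::Reprocess` would need their content hashed again, which keeps the content hash off the hot path for unchanged files.
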
sift must remain easily distributable.
- No C++ Toolchains: We strictly avoid C++ dependencies (like RocksDB, ProtoBuf, or Arrow) to ensure a clean Rust-only build.
- Static Distribution: sift must be capable of building as a fully static executable for easy installation without shared library conflicts.
Search results and evaluations must be reproducible.
- Tie-breaking in ranking must be stable (e.g., falling back to lexicographical path sorting).
- File tree traversal must be deterministic.
- Benchmarks must record the exact state of the world: git SHA, command used, corpus size, model parameters, and hardware environment.
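
Stable tie-breaking could look like the sketch below: sort by score descending, then by path in lexicographic order. The `Hit` type and `rank` function are hypothetical names introduced for illustration.

```rust
/// A scored search result. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    path: String,
    score: f32,
}

/// Order hits by score (highest first). Equal scores fall back to
/// ascending path order, so the result does not depend on input order.
fn rank(hits: &mut [Hit]) {
    hits.sort_by(|a, b| {
        // total_cmp defines a total order on floats, including NaN.
        b.score
            .total_cmp(&a.score)
            .then_with(|| a.path.cmp(&b.path))
    });
}
```

Because the comparison never reports two distinct paths as equal, the same set of hits yields the same ranking regardless of the order in which the traversal or retrievers produced them.
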
Search is a pipeline, not a single algorithm.
- We do not hardcode "hybrid" or "agentic" as single functions. We compose them via explicit plans, graphs, and turns over Query Expansion -> Retrieval -> Fusion -> Reranking.
- Strategies are defined as data (Presets/Plans) and executed by an orchestrator, allowing for rapid experimentation and objective benchmarking.
- Agentic controllers must remain inspectable and replayable; hidden background state is not an acceptable substitute for an explicit search trace.
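
One way to picture "strategies as data" is a plan that is just a list of stages, run by an orchestrator that records a trace entry for every step. This is a sketch under assumptions: `Stage`, `Plan`, `TraceEntry`, and `execute` are hypothetical names, and the stage behavior is supplied by the caller as a closure rather than by real retrievers or rerankers.

```rust
/// The stages named in the pipeline above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    QueryExpansion,
    Retrieval,
    Fusion,
    Reranking,
}

/// A strategy expressed as data: a named, ordered list of stages.
struct Plan {
    name: &'static str,
    stages: Vec<Stage>,
}

/// One inspectable record per executed stage.
#[derive(Debug, PartialEq, Eq)]
struct TraceEntry {
    step: usize,
    stage: Stage,
    items_out: usize,
}

/// Run every stage of the plan in order, threading the item list
/// through `run_stage` and recording what each step produced.
fn execute<F>(plan: &Plan, query: &str, mut run_stage: F) -> (Vec<String>, Vec<TraceEntry>)
where
    F: FnMut(Stage, Vec<String>) -> Vec<String>,
{
    let mut items = vec![query.to_string()];
    let mut trace = Vec::new();
    for (step, &stage) in plan.stages.iter().enumerate() {
        items = run_stage(stage, items);
        trace.push(TraceEntry { step, stage, items_out: items.len() });
    }
    (items, trace)
}
```

Since the plan is a plain value, two presets can be compared by swapping the `stages` vector, and the returned trace can be stored alongside benchmark results to replay or inspect a run.
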
sift is built for local development and agentic workflows.
- Code stays on the machine. We do not send source code or documents to external APIs for embedding or search by default.
- Machine learning models run locally via pure-Rust implementations (e.g., candle), utilizing CPU and local accelerators.
The core search logic must remain pure.
- The terminal (CLI arguments, printing), the filesystem (walking directories), and the network (downloading datasets) must stay at the edge of the architecture.
- Domain models and ports (Retriever, Fuser) define the behavior; adapters implement the details.
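
The ports-and-adapters split might be sketched as follows. Only the names Retriever and Fuser come from the text above; the trait signatures, the `Scored` alias, `CannedRetriever`, `SumFuser`, and `search` are assumptions made for illustration, and score summation is a deliberately simple stand-in for a real fusion method.

```rust
use std::collections::HashMap;

/// A (path, score) pair. Alias introduced for this sketch only.
type Scored = (String, f32);

/// Port: anything that can produce scored candidates for a query.
trait Retriever {
    fn retrieve(&self, query: &str) -> Vec<Scored>;
}

/// Port: anything that can merge several candidate lists into one.
trait Fuser {
    fn fuse(&self, lists: Vec<Vec<Scored>>) -> Vec<Scored>;
}

/// Core logic: depends only on the ports and performs no I/O.
fn search(retrievers: &[&dyn Retriever], fuser: &dyn Fuser, query: &str) -> Vec<Scored> {
    let lists: Vec<Vec<Scored>> = retrievers.iter().map(|r| r.retrieve(query)).collect();
    fuser.fuse(lists)
}

/// Adapter: an in-memory retriever that returns canned results.
struct CannedRetriever(Vec<Scored>);

impl Retriever for CannedRetriever {
    fn retrieve(&self, _query: &str) -> Vec<Scored> {
        self.0.clone()
    }
}

/// Adapter: sums scores per path, then sorts by score (descending)
/// with the path as a deterministic tie-breaker.
struct SumFuser;

impl Fuser for SumFuser {
    fn fuse(&self, lists: Vec<Vec<Scored>>) -> Vec<Scored> {
        let mut totals: HashMap<String, f32> = HashMap::new();
        for (path, score) in lists.into_iter().flatten() {
            *totals.entry(path).or_insert(0.0) += score;
        }
        let mut fused: Vec<Scored> = totals.into_iter().collect();
        fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        fused
    }
}
```

Because `search` sees only the traits, the same core function can be exercised in tests with canned adapters and in production with adapters that touch the filesystem or load local models.
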
Implementation does not end at a clean compile.
- Every functional change must be verified against a test, a benchmark, or an empirical CLI proof.
- If a change degrades benchmark quality relative to the BM25 baseline or the champion preset, it must be justified with evidence.
sift is built as a modern Hybrid and Agentic Information Retrieval (IR) system
that captures user intent, not just keyword matches. The hybrid core uses
lexical retrieval, semantic retrieval, and reranking to bridge the vocabulary
gap between a user's question and the technical implementation in source code.
The agentic layer decomposes search into explicit turns that can manage context, reuse local models, and emit results to humans or other tools without violating the local-first contract.