Commit bba0458

feat(docs): update CLAUDE.md and README.md to reflect support for 30+ document formats and improve clarity on features
feat(docs): enhance documentation in DEVELOPMENT.md and kbignore.md for better understanding of file management and indexing
feat(docs): add detailed extraction methods in extract.py for various document formats
feat(ingest): refactor index_directory to support indexing across multiple document formats and implement file size checks
feat(cli): introduce new commands for allowing large files and listing indexed documents in cli.py
fix(config): add configuration options for indexing code files and managing file size limits in config.py
test: update tests to reflect changes in indexing and extraction functionalities, ensuring compatibility with new features
1 parent 6112327 commit bba0458

File tree

11 files changed: +1008 -112 lines changed


CLAUDE.md

Lines changed: 38 additions & 17 deletions

@@ -1,17 +1,22 @@
 # CLAUDE.md
 
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
 ## Project: kb
 
-CLI knowledge base tool. Indexes markdown + PDFs, hybrid search (sqlite-vec + FTS5), RAG answers with LLM rerank.
+CLI knowledge base tool. Indexes 30+ document formats (markdown, PDF, DOCX, EPUB, HTML, ODT, RTF, plain text, email, subtitles, and more). Hybrid search (sqlite-vec + FTS5), RAG answers with LLM rerank.
 
 ## Build & Test
 
 ```bash
-uv sync --all-extras   # install with all optional deps
-uv run kb --help       # run locally
-uv run pytest          # run tests
-uv run ruff check .    # lint
-uv run ruff format .   # format
+uv sync --all-extras                              # install with all optional deps
+uv run kb --help                                  # run locally
+uv run pytest                                     # run all tests
+uv run pytest tests/test_chunk.py -v              # single test file
+uv run pytest tests/test_chunk.py::test_name -v   # single test
+uv run ruff check .                               # lint
+uv run ruff format .                              # format
+make check                                        # lint + format check + tests (CI equivalent)
 ```
 
 ## Install globally

@@ -27,18 +32,34 @@ Two modes — global (default) and project-local:
 - **Global**: config at `~/.config/kb/config.toml`, DB at `~/.local/share/kb/kb.db`. Sources are absolute paths. `kb init` creates global config.
 - **Project**: config at `.kb.toml` (walk-up from cwd), DB next to config. Sources are relative paths. `kb init --project` creates project config.
 
-Project `.kb.toml` takes precedence over global config when both exist.
+Project `.kb.toml` takes precedence over global config when both exist. Config walk-up works like `.gitignore` — `kb` works from any subdirectory.
 
-Source management: `kb add <dir>`, `kb remove <dir>`, `kb sources`.
+Path resolution: `Config.doc_path_for_db()` computes stored paths — relative to config_dir in project mode, `source_dir.name/relative` in global mode.
 
 ## Architecture
 
-- `src/kb/cli.py` — entry point, command dispatch (init, add, remove, sources, index, search, ask, stats, reset)
-- `src/kb/config.py` — config loading (project .kb.toml + global ~/.config/kb/config.toml), Config dataclass, save_config
-- `src/kb/db.py` — schema, sqlite-vec connection
-- `src/kb/chunk.py` — markdown + plain text chunking (chonkie or regex fallback)
-- `src/kb/embed.py` — OpenAI embedding helpers
-- `src/kb/search.py` — hybrid search, RRF fusion
-- `src/kb/rerank.py` — LLM reranking (RankGPT pattern)
-- `src/kb/filters.py` — pre-search filter parsing + application
-- `src/kb/ingest.py` — file indexing (markdown + PDF), uses Config.doc_path_for_db() for path resolution
+- `cli.py` — entry point, command dispatch via `sys.argv` (no argparse). Each `cmd_*` function handles one command.
+- `config.py` — `Config` dataclass with all tunables, `find_config()` walk-up loader, `save_config()` minimal TOML serializer (only writes non-default values)
+- `extract.py` — text extraction registry. `_register()` maps extensions to `(extractor_fn, doc_type, available, install_hint, is_code)`. Stdlib formats always available; optional deps (pymupdf, python-docx, etc.) probed at import time.
+- `ingest.py` — indexing pipeline: discover files → `.kbignore` filtering → size guard → `extract_text()` → content-hash diff → chunk → diff chunks by hash → batch embed new → store
+- `db.py` — schema creation + `SCHEMA_VERSION` migration (drops all tables on version bump). Tables: `documents`, `chunks`, `vec_chunks` (vec0 virtual table), `fts_chunks` (FTS5 content-sync'd from chunks)
+- `chunk.py` — markdown (heading-aware with ancestry tracking) + plain text chunking. Uses chonkie with overlap refinery when available, regex fallback otherwise. `embedding_text()` enriches chunks with file path + heading ancestry before embedding.
+- `search.py` — hybrid search: vector (vec0 MATCH) + FTS5, fused with Reciprocal Rank Fusion. `fill_fts_only_results()` backfills metadata for FTS-only hits.
+- `rerank.py` — RankGPT pattern: presents numbered passages to LLM, parses comma-separated ranking response
+- `filters.py` — inline filter syntax (`file:`, `dt>`, `dt<`, `+"kw"`, `-"kw"`) parsed from query string, applied post-search
+- `embed.py` — thin OpenAI embedding wrapper, `serialize_f32()` for sqlite-vec binary format
+
+### Data Flow
+
+**Index**: files → extract_text → content-hash check (skip unchanged) → chunk → diff chunks by hash (reuse unchanged) → batch embed new → store in vec0 + rebuild FTS5
+
+**Search**: query → parse_filters → embed → vec0 MATCH + FTS5 MATCH → RRF fusion → apply_filters → display
+
+**Ask**: same as search but over-fetches (rerank_fetch_k=20) → LLM rerank → top rerank_top_k → confidence threshold (min_similarity) → LLM generates answer
+
+### Key Design Decisions
+
+- **vec0 auxiliary columns** — `vec_chunks` stores chunk_text, doc_path, heading alongside embeddings, avoiding JOINs at search time
+- **Content-hash at two levels** — file-level hash skips unchanged files entirely; chunk-level hash avoids re-embedding unchanged chunks within modified files
+- **FTS5 content-sync** — `fts_chunks` uses `content='chunks'` with manual rebuild after indexing
+- **Schema versioning** — `SCHEMA_VERSION` in `meta` table; version bump drops and recreates all tables (simple migration for alpha-stage tool)

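The hybrid search described in the diff above fuses the vector and FTS5 rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step (the function name and the conventional k=60 constant are illustrative assumptions, not kb's actual code):

```python
def rrf_fuse(vec_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each result earns 1 / (k + rank) per
    ranked list it appears in; summed scores decide the final order."""
    scores: dict[str, float] = {}
    for ranking in (vec_ids, fts_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk found by both retrievers outranks a top hit from only one:
fused = rrf_fuse(["a", "b", "c"], ["b", "d"])  # "b" ranks first
```

RRF needs no score normalization, which is why it is a common choice for fusing vector distances with BM25-style FTS5 ranks.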
README.md

Lines changed: 48 additions & 11 deletions

@@ -3,7 +3,7 @@
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
 
-CLI RAG tool for your docs. Index markdown + PDFs, hybrid search (semantic + keyword), ask questions and get sourced answers. Built on [sqlite-vec](https://github.com/asg017/sqlite-vec).
+CLI RAG tool for your docs. Index 30+ document formats (markdown, PDF, DOCX, EPUB, HTML, ODT, RTF, plain text, email, and more), hybrid search (semantic + keyword), ask questions and get sourced answers. Built on [sqlite-vec](https://github.com/asg017/sqlite-vec).
 
 ## Features
 
@@ -12,7 +12,8 @@ CLI RAG tool for your docs. Index markdown + PDFs, hybrid search (semantic + key
 - **Incremental indexing** — content-hash per chunk, only re-embeds changes
 - **LLM rerank** — `ask` over-fetches candidates, LLM ranks by relevance, keeps the best
 - **Pre-search filters** — file globs, date ranges, keyword inclusion/exclusion
-- **PDF support** — install with `kb[pdf]` or `kb[all]`
+- **30+ formats** — markdown, PDF, DOCX, PPTX, XLSX, EPUB, HTML, ODT, ODS, ODP, RTF, email (.eml), subtitles (.srt/.vtt), and plain text variants (.txt, .rst, .org, .csv, .json, .yaml, .tex, etc.)
+- **Optional code indexing** — set `index_code = true` to also index source code files (.py, .js, .ts, .go, .rs, etc.)
 - **Pluggable chunking** — uses [chonkie](https://github.com/bhavnicksm/chonkie) when available, regex fallback otherwise
 
 ## Install

@@ -21,8 +22,16 @@ CLI RAG tool for your docs. Index markdown + PDFs, hybrid search (semantic + key
 # One-liner (installs uv if needed)
 curl -LsSf https://gitlab.com/ariel-frischer/kb/-/raw/main/install.sh | sh
 
-# Or with uv directly
+# Or with uv directly (all optional deps: PDF, Office, RTF, chunking)
 uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[all]"
+
+# Minimal (markdown, HTML, plain text, email, EPUB, ODT — no extra deps)
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" kb
+
+# Pick extras individually
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[pdf]"     # + PDF
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[office]"  # + DOCX, PPTX, XLSX
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[rtf]"     # + RTF
 ```
 
 Requires an OpenAI-compatible API. Set `OPENAI_API_KEY` in your environment (or in `~/.config/kb/secrets.toml`).

@@ -93,6 +102,7 @@ sources = [
 # min_similarity = 0.25
 # rerank_fetch_k = 20
 # rerank_top_k = 5
+# index_code = false  # set true to also index source code files
 ```
 
 ### .kbignore

@@ -147,16 +157,43 @@ kb ask 'file:briefs/*.pdf dt>"2026-02-13" what are the costs?'
 | Must contain | `+"keyword"` | `+"docker"` |
 | Must not contain | `-"keyword"` | `-"kubernetes"` |
 
+## Supported Formats
+
+**Always available (no extra deps):**
+
+| Category | Extensions |
+|----------|-----------|
+| Markdown | `.md`, `.markdown` |
+| Plain text | `.txt`, `.text`, `.rst`, `.org`, `.log`, `.csv`, `.tsv`, `.json`, `.yaml`, `.yml`, `.toml`, `.xml`, `.ini`, `.cfg`, `.tex`, `.latex`, `.bib`, `.nfo`, `.adoc`, `.asciidoc`, `.properties` |
+| HTML | `.html`, `.htm`, `.xhtml` |
+| Subtitles | `.srt`, `.vtt` |
+| Email | `.eml` |
+| OpenDocument | `.odt`, `.ods`, `.odp` |
+| EPUB | `.epub` |
+
+**Optional (install with extras):**
+
+| Category | Extensions | Install |
+|----------|-----------|---------|
+| PDF | `.pdf` | `kb[pdf]` or `kb[all]` |
+| Office | `.docx`, `.pptx`, `.xlsx` | `kb[office]` or `kb[all]` |
+| RTF | `.rtf` | `kb[rtf]` or `kb[all]` |
+
+**Code files (opt-in):** Set `index_code = true` in config to also index source code — `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.c`, `.cpp`, and 60+ more extensions.
+
+Run `kb stats` to see which formats are available in your installation.
+
 ## How It Works
 
 ```
 kb index
-1. Find .md + .pdf files (respecting .kbignore)
-2. Content-hash check — skip unchanged files
-3. Chunk (chonkie or regex fallback)
-4. Diff chunks by hash — only embed new/changed
-5. Batch embed via OpenAI
-6. Store in sqlite-vec (vec0) + FTS5
+1. Find files matching supported formats (respecting .kbignore)
+2. Extract text (format-specific: markdown, PDF, DOCX, HTML, etc.)
+3. Content-hash check — skip unchanged files
+4. Chunk (chonkie or regex fallback)
+5. Diff chunks by hash — only embed new/changed
+6. Batch embed via OpenAI
+7. Store in sqlite-vec (vec0) + FTS5
 
 kb search "query"
 1. Parse filters, strip from query

@@ -178,7 +215,7 @@ kb ask "question"
 
 | Tool | What it is | Local-only | CLI | Setup |
 |------|-----------|:----------:|:---:|-------|
-| **kb** | CLI RAG tool — hybrid search + Q&A over your markdown/PDFs | Yes | Yes | `uv tool install`, single SQLite file |
+| **kb** | CLI RAG tool — hybrid search + Q&A over 30+ document formats | Yes | Yes | `uv tool install`, single SQLite file |
 | [Khoj](https://github.com/khoj-ai/khoj) | Self-hosted AI second brain with web UI, mobile, Obsidian/Emacs plugins | Optional | No | Docker or pip, runs a web server |
 | [Reor](https://github.com/reorproject/reor) | Desktop note-taking app with auto-linking and local LLM | Yes | No | Electron app, uses LanceDB + Ollama |
 | [LlamaIndex](https://github.com/run-llama/llama_index) | Framework for building RAG pipelines | Depends | No | Python library, you build the app |

@@ -187,7 +224,7 @@ kb ask "question"
 
 **When to use what:**
 
-- **kb** — you want a CLI RAG tool that indexes docs (markdown, PDFs) and answers questions from them
+- **kb** — you want a CLI RAG tool that indexes docs (markdown, PDFs, DOCX, EPUB, HTML, and more) and answers questions from them
 - **grepai** — you want semantic search over code (find by intent, trace call graphs), no RAG
 - **Khoj** — you want a full-featured app with web UI, phone access, Obsidian integration, and agent capabilities
 - **Reor** — you want an Obsidian-like desktop editor that auto-links notes using local AI

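The incremental indexing the README diff describes (content-hash per chunk, only re-embed changes) can be sketched as follows; the helper names are hypothetical, not kb's actual API:

```python
import hashlib

def chunk_hash(text: str) -> str:
    # Content hash identifies a chunk regardless of its position in the file.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_chunks(stored_hashes: set[str], chunks: list[str]) -> tuple[list[str], list[str]]:
    """Split chunks into (unchanged, to_embed) by content hash,
    so only new or modified text is sent to the embedding API."""
    unchanged, to_embed = [], []
    for text in chunks:
        (unchanged if chunk_hash(text) in stored_hashes else to_embed).append(text)
    return unchanged, to_embed

stored = {chunk_hash("intro"), chunk_hash("usage")}
unchanged, to_embed = diff_chunks(stored, ["intro", "usage", "new section"])
# only "new section" needs a fresh embedding
```

The same idea applies at file level: hash the whole file first and skip extraction entirely when it matches the stored hash.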
docs/DEVELOPMENT.md

Lines changed: 3 additions & 2 deletions

@@ -36,15 +36,16 @@ src/kb/
 ├── db.py — SQLite schema, sqlite-vec connection, migrations
 ├── chunk.py — Markdown + plain text chunking (chonkie or regex fallback)
 ├── embed.py — OpenAI embedding helpers, batching
+├── extract.py — Text extraction registry for 30+ formats (PDF, DOCX, EPUB, HTML, ODT, etc.)
 ├── search.py — Hybrid search (vector + FTS5), RRF fusion
 ├── rerank.py — LLM reranking (RankGPT pattern)
 ├── filters.py — Pre-search filter parsing + application
-└── ingest.py — File indexing pipeline (markdown + PDF)
+└── ingest.py — File indexing pipeline (unified loop over all supported formats)
 ```
 
 ### Data flow
 
-**Indexing** (`kb index`): files → chunking → content-hash diff → embed new chunks → store in sqlite-vec (vec0) + FTS5
+**Indexing** (`kb index`): find files by extension → extract text (format-specific) → chunking → content-hash diff → embed new chunks → store in sqlite-vec (vec0) + FTS5
 
 **Search** (`kb search`): query → parse filters → embed → vector search + FTS5 → RRF fusion → apply filters → results

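The `extract.py` registry this commit adds maps extensions to extractor functions, probing optional dependencies at import time. A simplified sketch of that pattern (the real tuple also carries an `is_code` flag per the CLAUDE.md diff; all names here are illustrative, not kb's actual code):

```python
from typing import Callable

# extension -> (extractor_fn, doc_type, available, install_hint)
REGISTRY: dict[str, tuple[Callable[[str], str], str, bool, str]] = {}

def _register(exts: list[str], fn: Callable[[str], str], doc_type: str,
              available: bool = True, hint: str = "") -> None:
    for ext in exts:
        REGISTRY[ext] = (fn, doc_type, available, hint)

def _read_text(path: str) -> str:
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

# Stdlib-backed formats are always available.
_register([".md", ".markdown"], _read_text, "markdown")
_register([".txt", ".rst", ".org"], _read_text, "text")

# Optional deps are probed once at import time; unavailable formats
# stay in the registry so the CLI can report an install hint.
try:
    import fitz  # pymupdf
    _register([".pdf"], lambda p: "".join(pg.get_text() for pg in fitz.open(p)), "pdf")
except ImportError:
    _register([".pdf"], _read_text, "pdf", available=False, hint="kb[pdf]")
```

Keeping unavailable formats registered (with `available=False`) is what lets a command like `kb stats` list which formats the current installation supports.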
docs/kbignore.md

Lines changed: 3 additions & 1 deletion

@@ -53,13 +53,15 @@ api-reference/
 _generated/
 ```
 
-### Large files that chunk poorly
+### Large / generated files
 
 ```
 *.min.js
 *.bundle.js
 package-lock.json
 yarn.lock
+*.min.css
+*.map
 ```
 
 ### Private / sensitive

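`.kbignore` patterns like the ones in this diff are gitignore-style globs. A minimal matcher built on stdlib `fnmatch`; this is a deliberate simplification of the gitignore spec, not kb's actual implementation:

```python
import fnmatch

def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Return True if rel_path matches any ignore pattern.

    Directory patterns like '_generated/' match anything under that
    directory at any depth; bare globs like '*.min.js' match the
    file name or the full relative path."""
    name = rel_path.rsplit("/", 1)[-1]
    for pat in patterns:
        if pat.endswith("/"):
            if rel_path.startswith(pat) or ("/" + pat) in ("/" + rel_path):
                return True
        elif fnmatch.fnmatch(name, pat) or fnmatch.fnmatch(rel_path, pat):
            return True
    return False

ignored = is_ignored("dist/app.min.js", ["*.min.js", "_generated/"])
# matches via the "*.min.js" glob
```

A fuller implementation would also handle negation (`!pattern`) and anchored patterns, which the gitignore format supports.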
pyproject.toml

Lines changed: 6 additions & 0 deletions

@@ -28,9 +28,15 @@ dependencies = [
 all = [
     "chonkie[hub]",
     "pymupdf",
+    "python-docx",
+    "python-pptx",
+    "openpyxl",
+    "striprtf",
 ]
 chunking = ["chonkie[hub]"]
 pdf = ["pymupdf"]
+office = ["python-docx", "python-pptx", "openpyxl"]
+rtf = ["striprtf"]
 
 [project.scripts]
 kb = "kb.cli:main"

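The `serialize_f32()` helper mentioned in the CLAUDE.md diff packs an embedding for sqlite-vec, which accepts vectors as compact little-endian float32 BLOBs. A sketch of that serialization (a common pattern with sqlite-vec, shown here as an assumption about kb's version):

```python
import struct

def serialize_f32(vector: list[float]) -> bytes:
    """Pack a float vector as little-endian float32 bytes, the BLOB
    format sqlite-vec's vec0 virtual tables accept for embeddings."""
    return struct.pack(f"<{len(vector)}f", *vector)

blob = serialize_f32([0.1, 0.2, 0.3])  # 3 floats * 4 bytes = 12 bytes
```

Storing raw float32 halves the size of a float64 representation and lets vec0 read vectors without any per-row parsing.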