Commit bba0458

feat(docs): update CLAUDE.md and README.md to reflect support for 30+ document formats and improve clarity on features
feat(docs): enhance documentation in DEVELOPMENT.md and kbignore.md for better understanding of file management and indexing
feat(docs): add detailed extraction methods in extract.py for various document formats
feat(ingest): refactor index_directory to support indexing across multiple document formats and implement file size checks
feat(cli): introduce new commands for allowing large files and listing indexed documents in cli.py
fix(config): add configuration options for indexing code files and managing file size limits in config.py
test: update tests to reflect changes in indexing and extraction functionalities, ensuring compatibility with new features
1 parent 6112327 commit bba0458

File tree

11 files changed: +1008 -112 lines changed


CLAUDE.md

Lines changed: 38 additions & 17 deletions

@@ -1,17 +1,22 @@
 # CLAUDE.md
 
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
 ## Project: kb
 
-CLI knowledge base tool. Indexes markdown + PDFs, hybrid search (sqlite-vec + FTS5), RAG answers with LLM rerank.
+CLI knowledge base tool. Indexes 30+ document formats (markdown, PDF, DOCX, EPUB, HTML, ODT, RTF, plain text, email, subtitles, and more). Hybrid search (sqlite-vec + FTS5), RAG answers with LLM rerank.
 
 ## Build & Test
 
 ```bash
-uv sync --all-extras   # install with all optional deps
-uv run kb --help       # run locally
-uv run pytest          # run tests
-uv run ruff check .    # lint
-uv run ruff format .   # format
+uv sync --all-extras                              # install with all optional deps
+uv run kb --help                                  # run locally
+uv run pytest                                     # run all tests
+uv run pytest tests/test_chunk.py -v              # single test file
+uv run pytest tests/test_chunk.py::test_name -v   # single test
+uv run ruff check .                               # lint
+uv run ruff format .                              # format
+make check                                        # lint + format check + tests (CI equivalent)
 ```
 
 ## Install globally

@@ -27,18 +32,34 @@ Two modes — global (default) and project-local:
 - **Global**: config at `~/.config/kb/config.toml`, DB at `~/.local/share/kb/kb.db`. Sources are absolute paths. `kb init` creates global config.
 - **Project**: config at `.kb.toml` (walk-up from cwd), DB next to config. Sources are relative paths. `kb init --project` creates project config.
 
-Project `.kb.toml` takes precedence over global config when both exist.
+Project `.kb.toml` takes precedence over global config when both exist. Config walk-up works like `.gitignore` — `kb` works from any subdirectory.
 
-Source management: `kb add <dir>`, `kb remove <dir>`, `kb sources`.
+Path resolution: `Config.doc_path_for_db()` computes stored paths — relative to config_dir in project mode, `source_dir.name/relative` in global mode.
 
 ## Architecture
 
-- `src/kb/cli.py` — entry point, command dispatch (init, add, remove, sources, index, search, ask, stats, reset)
-- `src/kb/config.py` — config loading (project .kb.toml + global ~/.config/kb/config.toml), Config dataclass, save_config
-- `src/kb/db.py` — schema, sqlite-vec connection
-- `src/kb/chunk.py` — markdown + plain text chunking (chonkie or regex fallback)
-- `src/kb/embed.py` — OpenAI embedding helpers
-- `src/kb/search.py` — hybrid search, RRF fusion
-- `src/kb/rerank.py` — LLM reranking (RankGPT pattern)
-- `src/kb/filters.py` — pre-search filter parsing + application
-- `src/kb/ingest.py` — file indexing (markdown + PDF), uses Config.doc_path_for_db() for path resolution
+- `cli.py` — entry point, command dispatch via `sys.argv` (no argparse). Each `cmd_*` function handles one command.
+- `config.py` — `Config` dataclass with all tunables, `find_config()` walk-up loader, `save_config()` minimal TOML serializer (only writes non-default values)
+- `extract.py` — text extraction registry. `_register()` maps extensions to `(extractor_fn, doc_type, available, install_hint, is_code)`. Stdlib formats always available; optional deps (pymupdf, python-docx, etc.) probed at import time.
+- `ingest.py` — indexing pipeline: discover files → `.kbignore` filtering → size guard → `extract_text()` → content-hash diff → chunk → diff chunks by hash → batch embed new → store
+- `db.py` — schema creation + `SCHEMA_VERSION` migration (drops all tables on version bump). Tables: `documents`, `chunks`, `vec_chunks` (vec0 virtual table), `fts_chunks` (FTS5 content-sync'd from chunks)
+- `chunk.py` — markdown (heading-aware with ancestry tracking) + plain text chunking. Uses chonkie with overlap refinery when available, regex fallback otherwise. `embedding_text()` enriches chunks with file path + heading ancestry before embedding.
+- `search.py` — hybrid search: vector (vec0 MATCH) + FTS5, fused with Reciprocal Rank Fusion. `fill_fts_only_results()` backfills metadata for FTS-only hits.
+- `rerank.py` — RankGPT pattern: presents numbered passages to LLM, parses comma-separated ranking response
+- `filters.py` — inline filter syntax (`file:`, `dt>`, `dt<`, `+"kw"`, `-"kw"`) parsed from query string, applied post-search
+- `embed.py` — thin OpenAI embedding wrapper, `serialize_f32()` for sqlite-vec binary format
+
+### Data Flow
+
+**Index**: files → extract_text → content-hash check (skip unchanged) → chunk → diff chunks by hash (reuse unchanged) → batch embed new → store in vec0 + rebuild FTS5
+
+**Search**: query → parse_filters → embed → vec0 MATCH + FTS5 MATCH → RRF fusion → apply_filters → display
+
+**Ask**: same as search but over-fetches (rerank_fetch_k=20) → LLM rerank → top rerank_top_k → confidence threshold (min_similarity) → LLM generates answer
+
+### Key Design Decisions
+
+- **vec0 auxiliary columns** — `vec_chunks` stores chunk_text, doc_path, heading alongside embeddings, avoiding JOINs at search time
+- **Content-hash at two levels** — file-level hash skips unchanged files entirely; chunk-level hash avoids re-embedding unchanged chunks within modified files
+- **FTS5 content-sync** — `fts_chunks` uses `content='chunks'` with manual rebuild after indexing
+- **Schema versioning** — `SCHEMA_VERSION` in `meta` table; version bump drops and recreates all tables (simple migration for alpha-stage tool)

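The hybrid search described in the diff above fuses the vector and FTS5 rankings with Reciprocal Rank Fusion. A minimal sketch of that fusion step (the function name and the conventional k=60 constant are illustrative assumptions, not kb's actual code):

```python
def rrf_fuse(vec_ids: list[str], fts_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each result earns 1 / (k + rank) per
    ranked list it appears in; summed scores decide the final order."""
    scores: dict[str, float] = {}
    for ranking in (vec_ids, fts_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk found by both retrievers outranks a top hit from only one:
fused = rrf_fuse(["a", "b", "c"], ["b", "d"])  # "b" ranks first
```

RRF needs no score normalization, which is why it is a common choice for fusing vector distances with BM25-style FTS5 ranks.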
README.md

Lines changed: 48 additions & 11 deletions

@@ -3,7 +3,7 @@
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
 
-CLI RAG tool for your docs. Index markdown + PDFs, hybrid search (semantic + keyword), ask questions and get sourced answers. Built on [sqlite-vec](https://github.com/asg017/sqlite-vec).
+CLI RAG tool for your docs. Index 30+ document formats (markdown, PDF, DOCX, EPUB, HTML, ODT, RTF, plain text, email, and more), hybrid search (semantic + keyword), ask questions and get sourced answers. Built on [sqlite-vec](https://github.com/asg017/sqlite-vec).
 
 ## Features
 
@@ -12,7 +12,8 @@ CLI RAG tool for your docs. Index markdown + PDFs, hybrid search (semantic + key
 - **Incremental indexing** — content-hash per chunk, only re-embeds changes
 - **LLM rerank** — `ask` over-fetches candidates, LLM ranks by relevance, keeps the best
 - **Pre-search filters** — file globs, date ranges, keyword inclusion/exclusion
-- **PDF support** — install with `kb[pdf]` or `kb[all]`
+- **30+ formats** — markdown, PDF, DOCX, PPTX, XLSX, EPUB, HTML, ODT, ODS, ODP, RTF, email (.eml), subtitles (.srt/.vtt), and plain text variants (.txt, .rst, .org, .csv, .json, .yaml, .tex, etc.)
+- **Optional code indexing** — set `index_code = true` to also index source code files (.py, .js, .ts, .go, .rs, etc.)
 - **Pluggable chunking** — uses [chonkie](https://github.com/bhavnicksm/chonkie) when available, regex fallback otherwise
 
 ## Install

@@ -21,8 +22,16 @@ CLI RAG tool for your docs. Index markdown + PDFs, hybrid search (semantic + key
 # One-liner (installs uv if needed)
 curl -LsSf https://gitlab.com/ariel-frischer/kb/-/raw/main/install.sh | sh
 
-# Or with uv directly
+# Or with uv directly (all optional deps: PDF, Office, RTF, chunking)
 uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[all]"
+
+# Minimal (markdown, HTML, plain text, email, EPUB, ODT — no extra deps)
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" kb
+
+# Pick extras individually
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[pdf]"     # + PDF
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[office]"  # + DOCX, PPTX, XLSX
+uv tool install --from "git+https://gitlab.com/ariel-frischer/kb.git" "kb[rtf]"     # + RTF
 ```
 
 Requires an OpenAI-compatible API. Set `OPENAI_API_KEY` in your environment (or in `~/.config/kb/secrets.toml`).

@@ -93,6 +102,7 @@ sources = [
 # min_similarity = 0.25
 # rerank_fetch_k = 20
 # rerank_top_k = 5
+# index_code = false  # set true to also index source code files
 ```
 
 ### .kbignore

@@ -147,16 +157,43 @@ kb ask 'file:briefs/*.pdf dt>"2026-02-13" what are the costs?'
 | Must contain | `+"keyword"` | `+"docker"` |
 | Must not contain | `-"keyword"` | `-"kubernetes"` |
 
+## Supported Formats
+
+**Always available (no extra deps):**
+
+| Category | Extensions |
+|----------|-----------|
+| Markdown | `.md`, `.markdown` |
+| Plain text | `.txt`, `.text`, `.rst`, `.org`, `.log`, `.csv`, `.tsv`, `.json`, `.yaml`, `.yml`, `.toml`, `.xml`, `.ini`, `.cfg`, `.tex`, `.latex`, `.bib`, `.nfo`, `.adoc`, `.asciidoc`, `.properties` |
+| HTML | `.html`, `.htm`, `.xhtml` |
+| Subtitles | `.srt`, `.vtt` |
+| Email | `.eml` |
+| OpenDocument | `.odt`, `.ods`, `.odp` |
+| EPUB | `.epub` |
+
+**Optional (install with extras):**
+
+| Category | Extensions | Install |
+|----------|-----------|---------|
+| PDF | `.pdf` | `kb[pdf]` or `kb[all]` |
+| Office | `.docx`, `.pptx`, `.xlsx` | `kb[office]` or `kb[all]` |
+| RTF | `.rtf` | `kb[rtf]` or `kb[all]` |
+
+**Code files (opt-in):** Set `index_code = true` in config to also index source code — `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.c`, `.cpp`, and 60+ more extensions.
+
+Run `kb stats` to see which formats are available in your installation.
+
 ## How It Works
 
 ```
 kb index
-1. Find .md + .pdf files (respecting .kbignore)
-2. Content-hash check — skip unchanged files
-3. Chunk (chonkie or regex fallback)
-4. Diff chunks by hash — only embed new/changed
-5. Batch embed via OpenAI
-6. Store in sqlite-vec (vec0) + FTS5
+1. Find files matching supported formats (respecting .kbignore)
+2. Extract text (format-specific: markdown, PDF, DOCX, HTML, etc.)
+3. Content-hash check — skip unchanged files
+4. Chunk (chonkie or regex fallback)
+5. Diff chunks by hash — only embed new/changed
+6. Batch embed via OpenAI
+7. Store in sqlite-vec (vec0) + FTS5
 
 kb search "query"
 1. Parse filters, strip from query

@@ -178,7 +215,7 @@ kb ask "question"
 
 | Tool | What it is | Local-only | CLI | Setup |
 |------|-----------|:----------:|:---:|-------|
-| **kb** | CLI RAG tool — hybrid search + Q&A over your markdown/PDFs | Yes | Yes | `uv tool install`, single SQLite file |
+| **kb** | CLI RAG tool — hybrid search + Q&A over 30+ document formats | Yes | Yes | `uv tool install`, single SQLite file |
 | [Khoj](https://github.com/khoj-ai/khoj) | Self-hosted AI second brain with web UI, mobile, Obsidian/Emacs plugins | Optional | No | Docker or pip, runs a web server |
 | [Reor](https://github.com/reorproject/reor) | Desktop note-taking app with auto-linking and local LLM | Yes | No | Electron app, uses LanceDB + Ollama |
 | [LlamaIndex](https://github.com/run-llama/llama_index) | Framework for building RAG pipelines | Depends | No | Python library, you build the app |

@@ -187,7 +224,7 @@ kb ask "question"
 
 **When to use what:**
 
-- **kb** — you want a CLI RAG tool that indexes docs (markdown, PDFs) and answers questions from them
+- **kb** — you want a CLI RAG tool that indexes docs (markdown, PDFs, DOCX, EPUB, HTML, and more) and answers questions from them
 - **grepai** — you want semantic search over code (find by intent, trace call graphs), no RAG
 - **Khoj** — you want a full-featured app with web UI, phone access, Obsidian integration, and agent capabilities
 - **Reor** — you want an Obsidian-like desktop editor that auto-links notes using local AI

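The incremental indexing the README diff describes (content-hash per chunk, only re-embed changes) can be sketched as follows; the helper names are hypothetical, not kb's actual API:

```python
import hashlib

def chunk_hash(text: str) -> str:
    # Content hash identifies a chunk regardless of its position in the file.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_chunks(stored_hashes: set[str], chunks: list[str]) -> tuple[list[str], list[str]]:
    """Split chunks into (unchanged, to_embed) by content hash,
    so only new or modified text is sent to the embedding API."""
    unchanged, to_embed = [], []
    for text in chunks:
        (unchanged if chunk_hash(text) in stored_hashes else to_embed).append(text)
    return unchanged, to_embed

stored = {chunk_hash("intro"), chunk_hash("usage")}
unchanged, to_embed = diff_chunks(stored, ["intro", "usage", "new section"])
# only "new section" needs a fresh embedding
```

The same idea applies at file level: hash the whole file first and skip extraction entirely when it matches the stored hash.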
docs/DEVELOPMENT.md

Lines changed: 3 additions & 2 deletions

@@ -36,15 +36,16 @@ src/kb/
 ├── db.py — SQLite schema, sqlite-vec connection, migrations
 ├── chunk.py — Markdown + plain text chunking (chonkie or regex fallback)
 ├── embed.py — OpenAI embedding helpers, batching
+├── extract.py — Text extraction registry for 30+ formats (PDF, DOCX, EPUB, HTML, ODT, etc.)
 ├── search.py — Hybrid search (vector + FTS5), RRF fusion
 ├── rerank.py — LLM reranking (RankGPT pattern)
 ├── filters.py — Pre-search filter parsing + application
-└── ingest.py — File indexing pipeline (markdown + PDF)
+└── ingest.py — File indexing pipeline (unified loop over all supported formats)
 ```
 
 ### Data flow
 
-**Indexing** (`kb index`): files → chunking → content-hash diff → embed new chunks → store in sqlite-vec (vec0) + FTS5
+**Indexing** (`kb index`): find files by extension → extract text (format-specific) → chunking → content-hash diff → embed new chunks → store in sqlite-vec (vec0) + FTS5
 
 **Search** (`kb search`): query → parse filters → embed → vector search + FTS5 → RRF fusion → apply filters → results

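The `extract.py` registry this commit adds maps extensions to extractor functions, probing optional dependencies at import time. A simplified sketch of that pattern (the real tuple also carries an `is_code` flag per the CLAUDE.md diff; all names here are illustrative, not kb's actual code):

```python
from typing import Callable

# extension -> (extractor_fn, doc_type, available, install_hint)
REGISTRY: dict[str, tuple[Callable[[str], str], str, bool, str]] = {}

def _register(exts: list[str], fn: Callable[[str], str], doc_type: str,
              available: bool = True, hint: str = "") -> None:
    for ext in exts:
        REGISTRY[ext] = (fn, doc_type, available, hint)

def _read_text(path: str) -> str:
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

# Stdlib-backed formats are always available.
_register([".md", ".markdown"], _read_text, "markdown")
_register([".txt", ".rst", ".org"], _read_text, "text")

# Optional deps are probed once at import time; unavailable formats
# stay in the registry so the CLI can report an install hint.
try:
    import fitz  # pymupdf
    _register([".pdf"], lambda p: "".join(pg.get_text() for pg in fitz.open(p)), "pdf")
except ImportError:
    _register([".pdf"], _read_text, "pdf", available=False, hint="kb[pdf]")
```

Keeping unavailable formats registered (with `available=False`) is what lets a command like `kb stats` list which formats the current installation supports.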
docs/kbignore.md

Lines changed: 3 additions & 1 deletion

@@ -53,13 +53,15 @@ api-reference/
 _generated/
 ```
 
-### Large files that chunk poorly
+### Large / generated files
 
 ```
 *.min.js
 *.bundle.js
 package-lock.json
 yarn.lock
+*.min.css
+*.map
 ```
 
 ### Private / sensitive

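`.kbignore` patterns like the ones in this diff are gitignore-style globs. A minimal matcher built on stdlib `fnmatch`; this is a deliberate simplification of the gitignore spec, not kb's actual implementation:

```python
import fnmatch

def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Return True if rel_path matches any ignore pattern.

    Directory patterns like '_generated/' match anything under that
    directory at any depth; bare globs like '*.min.js' match the
    file name or the full relative path."""
    name = rel_path.rsplit("/", 1)[-1]
    for pat in patterns:
        if pat.endswith("/"):
            if rel_path.startswith(pat) or ("/" + pat) in ("/" + rel_path):
                return True
        elif fnmatch.fnmatch(name, pat) or fnmatch.fnmatch(rel_path, pat):
            return True
    return False

ignored = is_ignored("dist/app.min.js", ["*.min.js", "_generated/"])
# matches via the "*.min.js" glob
```

A fuller implementation would also handle negation (`!pattern`) and anchored patterns, which the gitignore format supports.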
pyproject.toml

Lines changed: 6 additions & 0 deletions

@@ -28,9 +28,15 @@ dependencies = [
 all = [
     "chonkie[hub]",
     "pymupdf",
+    "python-docx",
+    "python-pptx",
+    "openpyxl",
+    "striprtf",
 ]
 chunking = ["chonkie[hub]"]
 pdf = ["pymupdf"]
+office = ["python-docx", "python-pptx", "openpyxl"]
+rtf = ["striprtf"]
 
 [project.scripts]
 kb = "kb.cli:main"

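The `serialize_f32()` helper mentioned in the CLAUDE.md diff packs an embedding for sqlite-vec, which accepts vectors as compact little-endian float32 BLOBs. A sketch of that serialization (a common pattern with sqlite-vec, shown here as an assumption about kb's version):

```python
import struct

def serialize_f32(vector: list[float]) -> bytes:
    """Pack a float vector as little-endian float32 bytes, the BLOB
    format sqlite-vec's vec0 virtual tables accept for embeddings."""
    return struct.pack(f"<{len(vector)}f", *vector)

blob = serialize_f32([0.1, 0.2, 0.3])  # 3 floats * 4 bytes = 12 bytes
```

Storing raw float32 halves the size of a float64 representation and lets vec0 read vectors without any per-row parsing.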