Commit d8fe8c4

docs: slim down README and move details to docs/benchmarks.md
Move regex optimization, medium/small benchmarks, output formats, environment variables, exit codes, and technical details to docs/benchmarks.md. Keep README focused on features, quick start, key benchmarks (Linux kernel + next.js), and MCP setup. 282 lines → 145 lines.
1 parent b2c1618 commit d8fe8c4

2 files changed, +162 -164 lines changed

README.md

Lines changed: 28 additions & 164 deletions
Original file line number | Diff line number | Diff line change
@@ -30,7 +30,7 @@ Pre-builds a trigram inverted index, then searches in milliseconds. Designed for
3030
| AI agent integration | None | None | MCP server built-in |
3131
| Memory (search) | 11MB | 288MB | 208MB |
3232

33-
xgrep is not a ripgrep replacement. Use ripgrep for one-off searches. Use xgrep when you search the same codebase repeatedly — the index pays for itself after ~12 searches.
33+
xgrep is not a ripgrep replacement. Use ripgrep for one-off searches. Use xgrep when you search the same codebase repeatedly — the index pays for itself after ~2 searches.
3434

3535
## Quick Start
3636

@@ -56,39 +56,19 @@ cp target/release/xg ~/.local/bin/
5656

5757
```bash
5858
xg "pattern" # Fixed string search
59-
xg "pattern" /path/to/repo # Search a specific directory (no cd needed)
60-
xg "pattern" /path/to/file.rs # Search a single file directly
59+
xg "pattern" /path/to/repo # Search a specific directory
6160
xg -e "handle_\w+" # Regex search
62-
xg "pattern" -i # Case-insensitive
63-
xg "pattern" --type rs # Filter by file type
61+
xg "pattern" -t rs # Filter by file type
6462
xg "pattern" -C 3 # Context lines
6563
xg "pattern" --format llm # Markdown output for LLMs
6664
xg "pattern" --changed # Only git changed files
67-
xg "pattern" --since 1h # Recently changed files
68-
xg "pattern" --fresh # Check index freshness (slower but up-to-date)
69-
xg "pattern" --absolute-paths # Show absolute paths in output
70-
xg "pattern" --exclude vendor # Exclude paths containing "vendor"
71-
xg "pattern" --no-hints # Suppress regex pattern hints
72-
xg --find "*.rs" /path/to/repo # Find files by glob pattern
73-
xg --find config /path/to/repo # Find files by substring
74-
xg --find "*.rs" --changed # Find changed .rs files
75-
xg --find "*" -t toml # Find all .toml files (--find + -t)
76-
xg --list-types # Show supported file types
77-
xg status # Show index status
65+
xg --find "*.rs" # Find files by glob pattern
7866
xg init # Explicitly rebuild index
79-
xg init /path/to/repo # Build index for a specific directory
80-
xg --version # Show version
8167
```
8268

83-
### Environment Variables
69+
Run `xg --help` for all options.
8470

85-
| Variable | Description | Default |
86-
|----------|-------------|---------|
87-
| `XGREP_LLM_CONTEXT` | Default context lines for `--format llm` | `3` |
88-
| `XGREP_ABSOLUTE_PATHS` | Set to `1` to always use absolute paths | unset |
89-
| `XGREP_NO_HINTS` | Set to `1` to suppress regex pattern hints | unset |
90-
91-
## MCP Server for AI Agents
71+
## MCP Server
9272

9373
xgrep runs as an [MCP](https://modelcontextprotocol.io/) server, giving AI coding tools fast indexed search.
9474

@@ -99,8 +79,6 @@ xg serve --root /path/to/repo # Specific directory
9979

10080
### Claude Code
10181

102-
Add to settings:
103-
10482
```json
10583
{
10684
"mcpServers": {
@@ -112,165 +90,51 @@ Add to settings:
11290
}
11391
```
11492

115-
### Available Tools
116-
117-
| Tool | Description |
118-
|------|-------------|
119-
| `search` | Text/regex search with context. Auto-builds index. Max 4000 tokens by default. |
120-
| `find_definitions` | Find likely definitions by regex heuristics (may include false positives) |
121-
| `read_file` | Read file contents with optional line range |
122-
| `index_status` | Check index freshness and stats |
123-
| `build_index` | Explicitly rebuild index |
93+
**Available tools:** `search`, `find_definitions`, `read_file`, `index_status`, `build_index`
12494

12595
## Performance
12696

127-
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) on Apple M4, 32GB RAM, macOS. **All numbers are warm cache, after index build.** First run includes a one-time index build (~6s for Linux kernel). See [Index Cost](#index-cost) for details.
97+
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) on Apple M4, 32GB RAM, macOS. All numbers are warm cache, after index build.
12898

129-
### Large: Linux kernel (92,947 files, 2.0GB)
99+
### Search: Linux kernel (92,947 files, 2.0GB)
130100

131101
| Query | xg | ripgrep | vs ripgrep |
132102
|-------|-----|---------|------------|
133103
| `struct file_operations` | 38ms | 2,236ms | **59x faster** |
134104
| `printk` | 54ms | 1,795ms | **33x faster** |
135105
| `EXPORT_SYMBOL` | 70ms | 1,900ms | **27x faster** |
136106

137-
### Medium: ripgrep source (248 files, 4.3MB)
138-
139-
| Query | xg | ripgrep | vs ripgrep |
140-
|-------|-----|---------|------------|
141-
| `fn main` | 2.5ms | 7.9ms | **3.1x faster** |
142-
| `Options` | 2.3ms | 7.7ms | **3.3x faster** |
143-
| `pub struct` | 2.6ms | 7.8ms | **3.1x faster** |
144-
145-
### Small: xgrep source (17 files)
146-
147-
| Query | xg | ripgrep | vs ripgrep |
148-
|-------|-----|---------|------------|
149-
| `fn main` | 2.1ms | 5.2ms | **2.5x faster** |
150-
| `SearchResult` | 1.6ms | 4.7ms | **2.9x faster** |
151-
| `Matcher` | 2.2ms | 5.0ms | **2.3x faster** |
152-
153-
### Index Cost
107+
### File discovery: next.js (26,424 files)
154108

155-
| Metric | xgrep | zoekt | ripgrep |
156-
|--------|-------|-------|---------|
157-
| Build time | 6s | 46s | N/A |
158-
| Index size | 175MB (8%) | 3.0GB (155%) | N/A |
159-
| Breakeven | ~2 searches | - | - |
109+
| Query | xg --find | fd | vs fd |
110+
|-------|-----------|-----|-------|
111+
| `*.ts` (4,639 files) | 12.9ms | 289.7ms | **22x faster** |
112+
| `config` (substring) | 6.4ms | 228.9ms | **36x faster** |
160113

161-
> zoekt numbers are CLI mode. In server mode, zoekt search latency is significantly lower.
114+
### Index cost
162115

163-
### File Discovery: `--find` vs fd vs find
116+
| Metric | xgrep | zoekt |
117+
|--------|-------|-------|
118+
| Build time (Linux kernel) | 6s | 46s |
119+
| Index size | 175MB (8% of source) | 3.0GB (155%) |
120+
| Breakeven | ~2 searches ||
164121

165-
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) (`-N --warmup 5 --min-runs 50`). Repos are shallow-cloned to a temp directory for reproducibility.
166-
167-
**tokio** (825 files, Rust async runtime):
168-
169-
| Query | xg --find | fd | find | vs fd |
170-
|-------|-----------|-----|------|-------|
171-
| `*.rs` (769 files) | 2.4ms | 8.9ms | 7.9ms | **3.7x faster** |
172-
| `config` (substring) | 1.9ms | 8.1ms | 8.3ms | **4.3x faster** |
173-
174-
**next.js** (26,424 files, React framework):
175-
176-
| Query | xg --find | fd | find | vs fd |
177-
|-------|-----------|-----|------|-------|
178-
| `*.ts` (4,639 files) | 12.9ms | 289.7ms | 606.5ms | **22x faster** |
179-
| `config` (substring) | 6.4ms | 228.9ms | 637.0ms | **36x faster** |
180-
181-
`xg --find` reads file paths from the in-memory index (mmap), while fd/find walk the filesystem. The gap widens with repository size.
182-
183-
### Reproduce Benchmarks
184-
185-
```bash
186-
./bench/run.sh small # xgrep source (~20 files, 30s)
187-
./bench/run.sh medium # ripgrep source (~250 files, auto-downloads)
188-
./bench/run.sh large # Linux kernel (~92K files, requires manual download)
189-
./benchmarks/bench_find.sh # --find vs fd vs find (auto-clones repos)
190-
```
191-
192-
## Output Formats
193-
194-
**Default** (ripgrep-compatible):
195-
```
196-
src/main.rs:42:fn handle_auth() {}
197-
```
198-
199-
**LLM** (`--format llm`): Markdown code blocks with language tags and context lines.
200-
201-
**JSON** (`--json`): Structured output for programmatic use.
202-
203-
### Regex Performance Notes
204-
205-
xgrep extracts trigram literals from regex patterns to narrow search candidates before full regex matching. This works well for patterns with literal substrings but falls back to full scan for purely abstract patterns.
206-
207-
**Fast (trigram-optimized):**
208-
209-
| Pattern | Why | Trigrams extracted |
210-
|---------|-----|--------------------|
211-
| `handle_\w+` | Literal prefix "handle_" | `han`, `and`, `ndl`, `dle`, `le_` |
212-
| `fn\s+main` | Literal parts "fn" and "main" | `mai`, `ain` |
213-
| `error.*timeout` | Literals "error" and "timeout" | Both sets |
214-
215-
**Slow (full scan fallback):**
216-
217-
| Pattern | Why |
218-
|---------|-----|
219-
| `.*` | No literals |
220-
| `[a-z]+` | Only character classes |
221-
| `\d{3}-\d{4}` | No literal strings |
222-
| `.+error` | Leading `.+` prevents extraction |
223-
224-
For patterns that fall back to full scan, xgrep will show a warning: `warning: regex cannot be optimized with trigram index (full scan)`.
225-
226-
**Tip:** Include at least 3 literal characters in your regex for best performance. `handle_\w+` is much faster than `\w+_auth`.
122+
> First run includes a one-time index build. See [docs/benchmarks.md](docs/benchmarks.md) for full results including medium/small repos.
227123
228124
## Limitations
229125

230-
xgrep uses a [trigram inverted index](https://swtch.com/~rsc/regexp/regexp4.html), the same technique as Google Code Search (2006) and zoekt. This approach has inherent trade-offs:
231-
232-
- **Short queries (< 3 chars) bypass the index**: Patterns like `if`, `fn`, `go` fall back to full file scan with no speed advantage over ripgrep.
233-
- **Common trigrams reduce filtering**: Queries containing frequent trigrams (`the`, `int`, `return`) produce many candidate files, narrowing the speed gap with ripgrep.
234-
- **Scaling limits not yet determined**: Tested up to 92K files (Linux kernel, 2.0GB) where performance is excellent. Behavior on larger codebases (Chromium-scale, 350K+ files) has not been benchmarked.
235-
- **Index staleness**: Background rebuild runs every ~30 seconds. Recently saved files may not appear until the next rebuild completes.
236-
- **find_definitions is regex-based**: Uses heuristic patterns (`fn`/`struct`/`class`/`def`), not AST analysis. False positives are expected.
237-
- **ASCII-only case folding**: Case-insensitive search (`-i`) handles ASCII letters only. Unicode case folding is not supported.
238-
239-
### When to use ripgrep instead
240-
241-
- One-off searches on a codebase you won't search again
242-
- Very small codebases (< 100 files, where index overhead outweighs benefit)
243-
- Queries shorter than 3 characters
244-
- When you need results from files saved within the last 30 seconds
245-
246-
### Why trigrams?
247-
248-
xgrep prioritizes **simplicity and small index size** over search precision. Alternative approaches:
249-
250-
| Approach | Index size | Precision | Trade-off |
251-
|----------|-----------|-----------|-----------|
252-
| **Trigram** (xgrep, zoekt) | ~8% of source | Moderate (false positives) | Simple, small, fast to build |
253-
| **Suffix array** (Livegrep) | 2-5x source | High | Large index, slow to build |
254-
| **AST/Symbol** (Searkt, LSP) | Varies | Exact | Language-specific, complex |
255-
256-
Trigrams are the right choice when you want a single binary that works on any codebase without language-specific setup.
257-
258-
## Exit Codes
259-
260-
| Code | Meaning |
261-
|------|---------|
262-
| `0` | Matches found |
263-
| `1` | No matches found (not an error) |
264-
| `2` | Error (invalid pattern, missing index, I/O error, usage error) |
126+
- **Short queries (< 3 chars)** bypass the index — no speed advantage over ripgrep
127+
- **Index staleness** — background rebuild runs every ~30s. Use `--fresh` for up-to-date results
128+
- **find_definitions** uses regex heuristics, not AST analysis — false positives expected
265129

266-
Follows the same convention as ripgrep.
130+
When to use ripgrep instead: one-off searches, very small codebases (< 100 files), or queries shorter than 3 characters.
267131

268132
## How It Works
269133

270-
1. **Index Build**: Walks the codebase, extracts 3-byte trigrams from each file, builds an inverted index (trigram -> file IDs) with delta+varint compression
134+
1. **Index Build**: Walks the codebase, extracts 3-byte trigrams from each file, builds an inverted index with delta+varint compression
271135
2. **Search**: Extracts trigrams from query, intersects posting lists to find candidate files, verifies matches
272-
3. **Hybrid Mode**: When the index is stale, combines index results with direct scanning of changed files — no rebuild needed
273-
4. **MCP Server**: Exposes search via JSON-RPC over stdio, with LLM-optimized output and token-aware truncation
136+
3. **Hybrid Mode**: Combines index results with direct scanning of changed files when index is stale
137+
4. **MCP Server**: Exposes search via JSON-RPC over stdio, with token-aware truncation
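
The index build and search steps in the list above can be sketched in a few lines of Python. This is a minimal illustration only, assuming an in-memory dict from trigram to a set of file IDs. The names `trigrams`, `build_index`, and `search` and the sample files are hypothetical and do not reflect xgrep's actual implementation, which the README describes as a compressed index.

```python
from collections import defaultdict


def trigrams(data: bytes) -> set[bytes]:
    """Return every 3-byte window in data (empty for inputs under 3 bytes)."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


def build_index(files: dict[str, bytes]):
    """Map each trigram to the set of file IDs that contain it."""
    paths = sorted(files)
    postings = defaultdict(set)
    for file_id, path in enumerate(paths):
        for gram in trigrams(files[path]):
            postings[gram].add(file_id)
    return paths, postings


def search(query: bytes, files, paths, postings) -> list[str]:
    """Intersect posting lists, then verify each candidate with a substring check."""
    grams = trigrams(query)
    if not grams:
        # Queries shorter than 3 bytes yield no trigrams, so every file is a candidate.
        candidates = set(range(len(paths)))
    else:
        candidates = set.intersection(*(postings.get(g, set()) for g in grams))
    # Verification removes false positives: files that contain all the
    # trigrams but not the query itself.
    return [paths[i] for i in sorted(candidates) if query in files[paths[i]]]


# Hypothetical sample corpus.
files = {
    "auth.rs": b"fn handle_auth() {}",
    "main.rs": b"fn main() { handle_auth(); }",
    "util.rs": b"fn helper() {}",
}
paths, postings = build_index(files)
print(search(b"handle_auth", files, paths, postings))  # ['auth.rs', 'main.rs']
```

The fallback branch for short queries mirrors the limitation stated in the README: with fewer than 3 characters there is nothing to look up, so the index gives no advantage.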
274138

275139
## Contributing
276140

docs/benchmarks.md

Lines changed: 134 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,134 @@
1+
# Benchmarks
2+
3+
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) on Apple M4, 32GB RAM, macOS. All numbers are warm cache, after index build.
4+
5+
## Search benchmarks
6+
7+
### Large: Linux kernel (92,947 files, 2.0GB)
8+
9+
| Query | xg | ripgrep | vs ripgrep |
10+
|-------|-----|---------|------------|
11+
| `struct file_operations` | 38ms | 2,236ms | **59x faster** |
12+
| `printk` | 54ms | 1,795ms | **33x faster** |
13+
| `EXPORT_SYMBOL` | 70ms | 1,900ms | **27x faster** |
14+
15+
### Medium: ripgrep source (248 files, 4.3MB)
16+
17+
| Query | xg | ripgrep | vs ripgrep |
18+
|-------|-----|---------|------------|
19+
| `fn main` | 2.5ms | 7.9ms | **3.1x faster** |
20+
| `Options` | 2.3ms | 7.7ms | **3.3x faster** |
21+
| `pub struct` | 2.6ms | 7.8ms | **3.1x faster** |
22+
23+
### Small: xgrep source (17 files)
24+
25+
| Query | xg | ripgrep | vs ripgrep |
26+
|-------|-----|---------|------------|
27+
| `fn main` | 2.1ms | 5.2ms | **2.5x faster** |
28+
| `SearchResult` | 1.6ms | 4.7ms | **2.9x faster** |
29+
| `Matcher` | 2.2ms | 5.0ms | **2.3x faster** |
30+
31+
## File discovery benchmarks
32+
33+
### tokio (825 files, Rust async runtime)
34+
35+
| Query | xg --find | fd | find | vs fd |
36+
|-------|-----------|-----|------|-------|
37+
| `*.rs` (769 files) | 2.4ms | 8.9ms | 7.9ms | **3.7x faster** |
38+
| `config` (substring) | 1.9ms | 8.1ms | 8.3ms | **4.3x faster** |
39+
40+
### next.js (26,424 files, React framework)
41+
42+
| Query | xg --find | fd | find | vs fd |
43+
|-------|-----------|-----|------|-------|
44+
| `*.ts` (4,639 files) | 12.9ms | 289.7ms | 606.5ms | **22x faster** |
45+
| `config` (substring) | 6.4ms | 228.9ms | 637.0ms | **36x faster** |
46+
47+
`xg --find` reads file paths from the in-memory index (mmap), while fd/find walk the filesystem. The gap widens with repository size.
48+
49+
## Index cost
50+
51+
| Metric | xgrep | zoekt | ripgrep |
52+
|--------|-------|-------|---------|
53+
| Build time | 6s | 46s | N/A |
54+
| Index size | 175MB (8%) | 3.0GB (155%) | N/A |
55+
| Breakeven | ~2 searches | - | - |
56+
57+
> zoekt numbers are CLI mode. In server mode, zoekt search latency is significantly lower.
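
The README's "How It Works" section says the inverted index is stored with delta+varint compression. The sketch below shows that general technique, assuming sorted file IDs and LEB128-style varints (7 payload bits per byte). xgrep's real on-disk layout is not described in this diff, so the function names and byte format here are illustrative assumptions.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_postings(file_ids: list[int]) -> bytes:
    """Store gaps between ascending file IDs instead of the IDs themselves."""
    out = bytearray()
    prev = 0
    for fid in file_ids:  # assumed sorted ascending
        out += encode_varint(fid - prev)
        prev = fid
    return bytes(out)


def decode_postings(data: bytes) -> list[int]:
    """Reverse encode_postings: read varints and accumulate the gaps."""
    ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids


ids = [3, 7, 8, 300, 92946]
blob = encode_postings(ids)
print(len(blob))              # 8 bytes, versus 20 for five fixed 32-bit IDs
print(decode_postings(blob))  # [3, 7, 8, 300, 92946]
```

Small gaps dominate when a trigram appears in many files, which is the case where posting lists are longest and the savings matter most.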
58+
59+
## Reproduce
60+
61+
```bash
62+
./bench/run.sh small # xgrep source (~20 files, 30s)
63+
./bench/run.sh medium # ripgrep source (~250 files, auto-downloads)
64+
./bench/run.sh large # Linux kernel (~92K files, requires manual download)
65+
./benchmarks/bench_find.sh # --find vs fd vs find (auto-clones repos)
66+
```
67+
68+
## Regex optimization
69+
70+
xgrep extracts trigram literals from regex patterns to narrow search candidates before full regex matching.
71+
72+
**Fast (trigram-optimized):**
73+
74+
| Pattern | Why | Trigrams extracted |
75+
|---------|-----|--------------------|
76+
| `handle_\w+` | Literal prefix "handle_" | `han`, `and`, `ndl`, `dle`, `le_` |
77+
| `fn\s+main` | Literal parts "fn" and "main" | `mai`, `ain` |
78+
| `error.*timeout` | Literals "error" and "timeout" | Both sets |
79+
80+
**Slow (full scan fallback):**
81+
82+
| Pattern | Why |
83+
|---------|-----|
84+
| `.*` | No literals |
85+
| `[a-z]+` | Only character classes |
86+
| `\d{3}-\d{4}` | No literal strings |
87+
| `.+error` | Leading `.+` prevents extraction |
88+
89+
For patterns that fall back to full scan, xgrep will show a warning: `warning: regex cannot be optimized with trigram index (full scan)`.
90+
91+
**Tip:** Include at least 3 literal characters in your regex for best performance.
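
The "Trigrams extracted" column and the three-character tip both follow from how a literal is cut into overlapping 3-character windows. A small sketch, assuming the required literal has already been pulled out of the regex (that extraction step is not shown, and `literal_trigrams` is a hypothetical helper, not an xgrep function):

```python
def literal_trigrams(literal: str) -> list[str]:
    """Overlapping 3-character windows, in order, without duplicates."""
    seen, out = set(), []
    for i in range(len(literal) - 2):
        gram = literal[i:i + 3]
        if gram not in seen:
            seen.add(gram)
            out.append(gram)
    return out


print(literal_trigrams("handle_"))  # ['han', 'and', 'ndl', 'dle', 'le_']
print(literal_trigrams("main"))     # ['mai', 'ain']
print(literal_trigrams("fn"))       # []
```

A literal of length n yields at most n - 2 trigrams, which is why `fn` in `fn\s+main` contributes nothing and only `main` narrows the candidate set.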
92+
93+
## Technical details
94+
95+
### Why trigrams?
96+
97+
xgrep prioritizes simplicity and small index size over search precision.
98+
99+
| Approach | Index size | Precision | Trade-off |
100+
|----------|-----------|-----------|-----------|
101+
| **Trigram** (xgrep, zoekt) | ~8% of source | Moderate (false positives) | Simple, small, fast to build |
102+
| **Suffix array** (Livegrep) | 2-5x source | High | Large index, slow to build |
103+
| **AST/Symbol** (Searkt, LSP) | Varies | Exact | Language-specific, complex |
104+
105+
Trigrams are the right choice when you want a single binary that works on any codebase without language-specific setup.
106+
107+
### Output formats
108+
109+
**Default** (ripgrep-compatible):
110+
```
111+
src/main.rs:42:fn handle_auth() {}
112+
```
113+
114+
**LLM** (`--format llm`): Markdown code blocks with language tags and context lines.
115+
116+
**JSON** (`--json`): Structured output for programmatic use.
117+
118+
### Environment variables
119+
120+
| Variable | Description | Default |
121+
|----------|-------------|---------|
122+
| `XGREP_LLM_CONTEXT` | Default context lines for `--format llm` | `3` |
123+
| `XGREP_ABSOLUTE_PATHS` | Set to `1` to always use absolute paths | unset |
124+
| `XGREP_NO_HINTS` | Set to `1` to suppress regex pattern hints | unset |
125+
126+
### Exit codes
127+
128+
| Code | Meaning |
129+
|------|---------|
130+
| `0` | Matches found |
131+
| `1` | No matches found (not an error) |
132+
| `2` | Error (invalid pattern, missing index, I/O error, usage error) |
133+
134+
Follows the same convention as ripgrep.
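
Because the codes follow ripgrep's convention, a wrapper script should treat exit code 1 as "no matches" rather than as a failure. A hedged sketch in Python: the `xg "pattern" /path/to/repo` invocation comes from the README's Quick Start, while `classify_exit` and `run_search` are hypothetical helper names.

```python
import subprocess


def classify_exit(code: int) -> str:
    """Map an xg exit code to a coarse status, per the table above."""
    if code == 0:
        return "matches"
    if code == 1:
        return "no-matches"
    # The table defines 2 as the error code; any other value is treated the same way here.
    return "error"


def run_search(pattern: str, root: str = ".") -> tuple[str, list[str]]:
    """Run a fixed-string search and return (status, output lines)."""
    proc = subprocess.run(["xg", pattern, root], capture_output=True, text=True)
    status = classify_exit(proc.returncode)
    if status == "error":
        raise RuntimeError(proc.stderr.strip() or f"xg exited with {proc.returncode}")
    return status, proc.stdout.splitlines()
```

With this split, a caller can distinguish an empty result from a broken invocation without parsing stderr.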
