@@ -30,7 +30,7 @@ Pre-builds a trigram inverted index, then searches in milliseconds. Designed for
30
30
| AI agent integration | None | None | MCP server built-in |
31
31
| Memory (search) | 11MB | 288MB | 208MB |
32
32
33
-
xgrep is not a ripgrep replacement. Use ripgrep for one-off searches. Use xgrep when you search the same codebase repeatedly — the index pays for itself after ~12 searches.
33
+
xgrep is not a ripgrep replacement. Use ripgrep for one-off searches. Use xgrep when you search the same codebase repeatedly — the index pays for itself after ~2 searches.
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) on Apple M4, 32GB RAM, macOS. **All numbers are warm cache, after index build.** First run includes a one-time index build (~6s for Linux kernel). See [Index Cost](#index-cost) for details.
97
+
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) on Apple M4, 32GB RAM, macOS. All numbers are warm cache, after index build.
> zoekt numbers are CLI mode. In server mode, zoekt search latency is significantly lower.
114
+
### Index cost
162
115
163
-
### File Discovery: `--find` vs fd vs find
116
+
| Metric | xgrep | zoekt |
117
+
|--------|-------|-------|
118
+
| Build time (Linux kernel) | 6s | 46s |
119
+
| Index size | 175MB (8% of source) | 3.0GB (155%) |
120
+
| Breakeven |~2 searches | — |
164
121
165
-
Benchmarked with [hyperfine](https://github.com/sharkdp/hyperfine) (`-N --warmup 5 --min-runs 50`). Repos are shallow-cloned to a temp directory for reproducibility.
`xg --find` reads file paths from the in-memory index (mmap), while fd/find walk the filesystem. The gap widens with repository size.
182
-
183
-
### Reproduce Benchmarks
184
-
185
-
```bash
186
-
./bench/run.sh small # xgrep source (~20 files, 30s)
187
-
./bench/run.sh medium # ripgrep source (~250 files, auto-downloads)
188
-
./bench/run.sh large # Linux kernel (~92K files, requires manual download)
189
-
./benchmarks/bench_find.sh # --find vs fd vs find (auto-clones repos)
190
-
```
191
-
192
-
## Output Formats
193
-
194
-
**Default** (ripgrep-compatible):
195
-
```
196
-
src/main.rs:42:fn handle_auth() {}
197
-
```
198
-
199
-
**LLM** (`--format llm`): Markdown code blocks with language tags and context lines.
200
-
201
-
**JSON** (`--json`): Structured output for programmatic use.
202
-
203
-
### Regex Performance Notes
204
-
205
-
xgrep extracts trigram literals from regex patterns to narrow search candidates before full regex matching. This works well for patterns with literal substrings but falls back to full scan for purely abstract patterns.
| `fn\s+main` | Literal parts "fn" and "main" | `mai`, `ain` |
213
-
| `error.*timeout` | Literals "error" and "timeout" | Both sets |
214
-
215
-
**Slow (full scan fallback):**
216
-
217
-
| Pattern | Why |
218
-
|---------|-----|
219
-
| `.*` | No literals |
220
-
| `[a-z]+` | Only character classes |
221
-
| `\d{3}-\d{4}` | No literal strings |
222
-
| `.+error` | Leading `.+` prevents extraction |
223
-
224
-
For patterns that fall back to full scan, xgrep will show a warning: `warning: regex cannot be optimized with trigram index (full scan)`.
225
-
226
-
**Tip:** Include at least 3 literal characters in your regex for best performance. `handle_\w+` is much faster than `\w+_auth`.
122
+
> First run includes a one-time index build. See [docs/benchmarks.md](docs/benchmarks.md) for full results including medium/small repos.
227
123
228
124
## Limitations
229
125
230
-
xgrep uses a [trigram inverted index](https://swtch.com/~rsc/regexp/regexp4.html), the same technique as Google Code Search (2006) and zoekt. This approach has inherent trade-offs:
231
-
232
-
- **Short queries (< 3 chars) bypass the index**: Patterns like `if`, `fn`, `go` fall back to full file scan with no speed advantage over ripgrep.
233
-
- **Common trigrams reduce filtering**: Queries containing frequent trigrams (`the`, `int`, `return`) produce many candidate files, narrowing the speed gap with ripgrep.
234
-
- **Scaling limits not yet determined**: Tested up to 92K files (Linux kernel, 2.0GB) where performance is excellent. Behavior on larger codebases (Chromium-scale, 350K+ files) has not been benchmarked.
235
-
- **Index staleness**: Background rebuild runs every ~30 seconds. Recently saved files may not appear until the next rebuild completes.
236
-
- **find_definitions is regex-based**: Uses heuristic patterns (`fn`/`struct`/`class`/`def`), not AST analysis. False positives are expected.
237
-
- **ASCII-only case folding**: Case-insensitive search (`-i`) handles ASCII letters only. Unicode case folding is not supported.
238
-
239
-
### When to use ripgrep instead
240
-
241
-
- One-off searches on a codebase you won't search again
242
-
- Very small codebases (< 100 files, where index overhead outweighs benefit)
243
-
- Queries shorter than 3 characters
244
-
- When you need results from files saved within the last 30 seconds
245
-
246
-
### Why trigrams?
247
-
248
-
xgrep prioritizes **simplicity and small index size** over search precision. Alternative approaches:
249
-
250
-
| Approach | Index size | Precision | Trade-off |
251
-
|----------|-----------|-----------|-----------|
252
-
| **Trigram** (xgrep, zoekt) | ~8% of source | Moderate (false positives) | Simple, small, fast to build |
253
-
| **Suffix array** (Livegrep) | 2-5x source | High | Large index, slow to build |
When to use ripgrep instead: one-off searches, very small codebases (< 100 files), or queries shorter than 3 characters.
267
131
268
132
## How It Works
269
133
270
-
1. **Index Build**: Walks the codebase, extracts 3-byte trigrams from each file, builds an inverted index (trigram -> file IDs) with delta+varint compression
134
+
1. **Index Build**: Walks the codebase, extracts 3-byte trigrams from each file, builds an inverted index with delta+varint compression
271
135
2. **Search**: Extracts trigrams from query, intersects posting lists to find candidate files, verifies matches
272
-
3. **Hybrid Mode**: When the index is stale, combines index results with direct scanning of changed files, so no rebuild is needed
273
-
4. **MCP Server**: Exposes search via JSON-RPC over stdio, with LLM-optimized output and token-aware truncation
136
+
3. **Hybrid Mode**: Combines index results with direct scanning of changed files when index is stale
137
+
4. **MCP Server**: Exposes search via JSON-RPC over stdio, with token-aware truncation
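
Steps 1 and 2 above (index build and search) can be illustrated with a minimal Python sketch. This is not xgrep's actual implementation: the function names (`trigrams`, `build_index`, `search`, `encode_postings`, `decode_postings`), the in-memory `dict` layout, and the LEB128-style varint format are all assumptions made for illustration. It only shows the general technique of 3-byte trigram extraction, delta+varint compressed posting lists, posting-list intersection, and a final verification pass over the candidate files.

```python
from collections import defaultdict


def trigrams(text: str) -> set[bytes]:
    """Return the set of overlapping 3-byte windows of the UTF-8 bytes."""
    data = text.encode("utf-8")
    return {data[i:i + 3] for i in range(len(data) - 2)}


def encode_postings(ids: list[int]) -> bytes:
    """Delta-encode ascending file IDs, then varint-encode each delta."""
    out = bytearray()
    prev = 0
    for file_id in ids:
        delta = file_id - prev
        prev = file_id
        while delta >= 0x80:          # 7 payload bits per byte, high bit = "more"
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_postings(data: bytes) -> list[int]:
    """Inverse of encode_postings."""
    ids, prev, cur, shift = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids


def build_index(files: dict[str, str]) -> tuple[list[str], dict[bytes, bytes]]:
    """Map each trigram to a compressed posting list of file IDs."""
    paths = sorted(files)
    postings = defaultdict(list)
    for file_id, path in enumerate(paths):   # IDs are appended in ascending order
        for tri in trigrams(files[path]):
            postings[tri].append(file_id)
    return paths, {tri: encode_postings(ids) for tri, ids in postings.items()}


def search(query: str, files: dict[str, str],
           paths: list[str], index: dict[bytes, bytes]) -> list[str]:
    """Intersect posting lists for the query's trigrams, then verify matches."""
    tris = trigrams(query)
    if not tris:
        # Queries shorter than 3 bytes have no trigrams: scan every file.
        candidates = set(range(len(paths)))
    else:
        candidates = None
        for tri in tris:
            ids = set(decode_postings(index.get(tri, b"")))
            candidates = ids if candidates is None else candidates & ids
    # The index only narrows candidates; confirm with a real substring check.
    return [paths[i] for i in sorted(candidates) if query in files[paths[i]]]
```

For example, with `files = {"a.rs": "fn handle_auth() {}", "b.rs": "fn main() {}"}`, calling `build_index(files)` and then `search("handle_auth", files, paths, index)` touches only the files whose posting lists contain every trigram of the query, and the final substring check filters out any false positives.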