Semantic search and document parsing tools for the command line
A collection of high-performance CLI tools for document processing and semantic search, built with Rust for speed and reliability.
- parse: Parse documents (PDF, DOCX, etc.) into markdown format, using the LlamaParse API by default
- search: Local semantic keyword search using multilingual embeddings with cosine similarity matching and per-line context matching
NOTE: By default, parse uses LlamaParse as a backend. Get your API key today for free at https://cloud.llamaindex.ai. The search tool remains local-only.
- Fast semantic search using model2vec embeddings from minishlab/potion-multilingual-128M
- Reliable document parsing with caching and error handling
- Unix-friendly design with proper stdin/stdout handling
- Configurable distance thresholds and returned chunk sizes
- Multi-format support for parsing documents (PDF, DOCX, PPTX, etc.)
- Concurrent processing for better parsing performance
Prerequisites:
- For the parse tool: a LlamaIndex Cloud API key
Install:
You can install semtools via npm:
npm i -g @llamaindex/semtools
Or via cargo:
# install entire crate
cargo install semtools
# install only parse
cargo install semtools --no-default-features --features=parse
# install only search
cargo install semtools --no-default-features --features=search
Note: Installing from npm builds the Rust binaries locally if a prebuilt binary is not available, which requires Rust and Cargo in your environment. If needed, install them via rustup: https://www.rust-lang.org/tools/install
Basic Usage:
# Parse some files
parse my_dir/*.pdf
# Search some (text-based) files
search "some keywords" *.txt --max-distance 0.3 --n-lines 5
# Combine parsing and search
parse my_docs/*.pdf | xargs search "API endpoints"
Advanced Usage:
# Combine with grep for exact-match pre-filtering and distance thresholding
parse *.pdf | xargs cat | grep -i "error" | search "network error" --max-distance 0.3
# Pipeline with content search (note the 'cat')
find . -name "*.md" | xargs parse | xargs cat | search "installation"
# Combine with grep for filtering (grep could be before or after parse/search!)
parse docs/*.pdf | xargs search "API" | grep -A5 "authentication"
# Save search results
parse report.pdf | xargs cat | search "summary" > results.txt
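The pipelines above pass paths through plain xargs, which splits its input on whitespace, so a path containing a space would be broken into several arguments. A minimal sketch of a workaround follows. The each_path helper is hypothetical (it is not part of semtools), and it assumes parse prints one output path per line, as the xargs examples suggest.

```shell
# Hypothetical helper: run a command once per newline-delimited path on stdin,
# so paths containing spaces stay intact (plain `xargs` splits on whitespace).
each_path() {
  while IFS= read -r path; do
    # </dev/null keeps the command from consuming the remaining paths.
    "$@" "$path" </dev/null
  done
}

# Intended use with semtools (assumes parse prints one output path per line):
#   parse my_docs/*.pdf | each_path search "API endpoints"
printf '%s\n' "my report.md" "notes.md" | each_path echo "would search:"
```

Because this runs the command once per file, an option such as --top-k would presumably apply to each file separately and not across the whole set.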
parse --help
A CLI tool for parsing documents using various backends
Usage: parse [OPTIONS] <FILES>...
Arguments:
<FILES>... Files to parse
Options:
-c, --parse-config <PARSE_CONFIG> Path to the config file. Defaults to ~/.parse_config.json
-b, --backend <BACKEND> The backend type to use for parsing. Defaults to `llama-parse` [default: llama-parse]
-v, --verbose Verbose output while parsing
-h, --help Print help
-V, --version Print version
search --help
A CLI tool for fast semantic keyword search
Usage: search [OPTIONS] <QUERY> [FILES]...
Arguments:
<QUERY> Query to search for (positional argument)
[FILES]... Files or directories to search
Options:
-n, --n-lines <N_LINES> How many lines before/after to return as context [default: 3]
--top-k <TOP_K> The top-k files or texts to return (ignored if max_distance is set) [default: 3]
-m, --max-distance <MAX_DISTANCE> Return all results with distance below this threshold (0.0+)
-i, --ignore-case Perform case-insensitive search (default is false)
-h, --help Print help
-V, --version Print version
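To give a feel for what a --max-distance value means, here is a small sketch that computes a cosine distance between two toy vectors. It assumes distance is defined as 1 minus cosine similarity; the exact metric used by search may differ. The cosine_distance helper is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: cosine distance between two space-separated vectors.
# Assumes distance = 1 - cosine similarity; the tool's exact metric may differ.
cosine_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " "); m = split(b, y, " ")
    if (n != m || n == 0) { print "error: vectors must have equal, non-zero length" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= n; i++) { dot += x[i] * y[i]; na += x[i] * x[i]; nb += y[i] * y[i] }
    if (na == 0 || nb == 0) { print "error: zero vector" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", 1 - dot / (sqrt(na) * sqrt(nb))
  }'
}

cosine_distance "1 0" "1 0"   # same direction: 0.0000
cosine_distance "1 0" "0 1"   # orthogonal: 1.0000
cosine_distance "1 1" "1 0"   # 45 degrees apart: 0.2929
```

Under this definition, a threshold of 0.3 would keep the first and third pairs and drop the second.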
By default, the parse tool uses the LlamaParse API to parse documents.
It looks for a ~/.parse_config.json file to configure the API key and other parameters. Otherwise, it falls back to the LLAMA_CLOUD_API_KEY environment variable and a set of default parameters.
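The lookup order described above can be checked before running parse. The sketch below is a hypothetical pre-flight helper, not part of semtools. It assumes that a config file without a non-empty api_key entry falls through to the environment variable, which the documentation does not state explicitly, and it uses a simple pattern match in place of a real JSON parser.

```shell
# Hypothetical pre-flight check mirroring the lookup order described above:
# config file first, then the LLAMA_CLOUD_API_KEY environment variable.
# It only reports where a key would come from; it never prints the key.
api_key_source() {
  config="${1:-$HOME/.parse_config.json}"
  # Naive match for a non-empty "api_key" string (not a full JSON parser).
  if [ -f "$config" ] && grep -Eq '"api_key"[[:space:]]*:[[:space:]]*"[^"]+"' "$config"; then
    echo "config:$config"
  elif [ -n "${LLAMA_CLOUD_API_KEY:-}" ]; then
    echo "env:LLAMA_CLOUD_API_KEY"
  else
    echo "none"
    return 1
  fi
}

api_key_source || echo "No API key found; parse will likely fail." >&2
```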
To configure the parse tool, create a ~/.parse_config.json file with the following content (defaults are shown below):
{
  "api_key": "your_llama_cloud_api_key_here",
  "num_ongoing_requests": 10,
  "base_url": "https://api.cloud.llamaindex.ai",
  "check_interval": 5,
  "max_timeout": 3600,
  "max_retries": 10,
  "retry_delay_ms": 1000,
  "backoff_multiplier": 2.0,
  "parse_kwargs": {
    "parse_mode": "parse_page_with_agent",
    "model": "openai-gpt-4-1-mini",
    "high_res_ocr": "true",
    "adaptive_long_table": "true",
    "outlined_table_extraction": "true",
    "output_tables_as_HTML": "true"
  }
}
Or just set the API key via an environment variable:
export LLAMA_CLOUD_API_KEY="your_api_key_here"
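When tuning max_retries, retry_delay_ms, and backoff_multiplier in the config file, it can help to see the delays they imply. The sketch below assumes plain exponential backoff, where each delay is the previous one multiplied by backoff_multiplier, with no jitter or cap; the tool's actual retry logic may differ. The backoff_schedule helper is hypothetical.

```shell
# Hypothetical helper: print the delay (in ms) before each retry, assuming
# delay_n = retry_delay_ms * backoff_multiplier^(n-1) with no jitter or cap.
# Usage: backoff_schedule <retry_delay_ms> <backoff_multiplier> <max_retries>
backoff_schedule() {
  awk -v base="$1" -v mult="$2" -v n="$3" 'BEGIN {
    d = base
    for (i = 1; i <= n; i++) {
      printf "%s%d", (i > 1 ? " " : ""), d
      d *= mult
    }
    printf "\n"
  }'
}

backoff_schedule 1000 2.0 5   # prints: 1000 2000 4000 8000 16000
```

Under the same assumption, the default values (1000 ms, multiplier 2.0, 10 retries) would add up to 1,023,000 ms of waiting, roughly 17 minutes, if every retry were used.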
- More parsing backends (something local-only would be great!)
- Improved search algorithms
- (optional) Persistence for speedups on repeat searches on the same files
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- LlamaIndex/LlamaParse for document parsing capabilities
- model2vec-rs for fast embedding generation
- minishlab/potion-multilingual-128M for an amazing default static embedding model
- simsimd for efficient similarity computation