This repository contains a fully local, privacy-preserving LLM setup for querying personal notes using Retrieval-Augmented Generation (RAG).
All inference and embeddings run entirely on this machine (no OpenAI, no Google, no cloud APIs).
The system is optimized for:
- Privacy-first: No telemetry, no cloud APIs, all data stays local
- Markdown notes: Structure-aware indexing with heading context
- Source attribution: Every answer cites source files
- Hybrid retrieval: Vector (MMR) + Lexical (BM25) + RRF fusion (see the sketch after this list)
- Unified CLI: Modern `local-llm` command with rich features
- macOS & Linux: Works on Apple Silicon and x86_64
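Both retrievers return a ranked list, and Reciprocal Rank Fusion combines them by rank rather than raw score. A minimal sketch of the idea (the function name and `k = 60` constant are illustrative, not the repository's actual code):

```python
# Minimal RRF sketch: fuse two ranked lists of chunk ids (vector/MMR and BM25).
# Illustrative only; the real fusion lives in src/local_llm/retrieval.py.
def rrf_fuse(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (vector_hits, bm25_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every chunk it ranked.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```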
```bash
git clone https://github.com/alex-thorne/local-llm.git
cd local-llm
python3.12 -m venv venv
source venv/bin/activate
pip install -e ".[bm25]"
```

```bash
# Install Ollama
brew install ollama

# Start Ollama server
ollama serve &

# Pull required models
ollama pull llama3.1:8b-instruct
ollama pull bge-m3
```

```bash
local-llm doctor
```

This checks:
- ✅ Ollama server is running
- ✅ Required models are available
- ✅ Notes directory exists
- ✅ All dependencies are working
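A rough illustration of the server and model check, assuming Ollama's standard `/api/tags` endpoint (this is not the actual `doctor` implementation):

```python
# Illustrative connectivity check: ping Ollama and list locally pulled models.
import httpx

def ollama_is_up(base_url: str = "http://127.0.0.1:11434") -> bool:
    try:
        resp = httpx.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
    except httpx.HTTPError:
        print("Ollama server is not reachable")
        return False
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is running; pulled models:", models)
    return True
```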
```bash
local-llm index
```

First-time indexing will:
- Read all `.md` files from `./notes/`
- Split by headings (structure-aware; sketched below)
- Generate embeddings locally
- Build vector and BM25 indices
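Heading-aware splitting can be pictured roughly like this, using the `langchain-text-splitters` package the project depends on; the real chunking logic lives in `src/local_llm/index.py` and may differ:

```python
# Sketch of structure-aware Markdown splitting with heading metadata.
from pathlib import Path
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)

for md_file in Path("./notes").rglob("*.md"):
    for chunk in splitter.split_text(md_file.read_text(encoding="utf-8")):
        # chunk.metadata carries the heading context later used for attribution.
        print(md_file, chunk.metadata, len(chunk.page_content))
```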
```bash
local-llm chat
```

Available commands:
- Normal questions: Ask anything about your notes
- `:find <query>` - Search without LLM synthesis
- `:sources` - Show sources from last answer
- `:open N` - Open source file in `$EDITOR`
- `:debug on/off` - Toggle retrieval debug mode
- `:help` - Show all commands
- `exit` / `quit` / `:q` - Exit
| Command | Description |
|---|---|
| `local-llm --help` | Show all commands |
| `local-llm --version` | Show version |
| `local-llm doctor` | Check prerequisites and system health |
| `local-llm index` | Build or rebuild the vector index |
| `local-llm index --full` | Force full rebuild (wipe existing index) |
| `local-llm chat` | Interactive RAG chat session |
| `local-llm find <query>` | Search notes without LLM synthesis |
| `local-llm config show` | Display current configuration |
| `local-llm config edit` | Open config file in `$EDITOR` |
| `local-llm stt transcribe <file>` | Transcribe audio files |
Create ~/.config/local-llm/config.toml or ./config.toml:
```toml
[notes]
source = "~/notes"
mirror = "./notes"

[ollama]
base_url = "http://127.0.0.1:11434"
embed_model = "bge-m3"
chat_model = "llama3.1:8b-instruct-q4_K_M-16k"

[retrieval]
answer_k = 10
find_k = 20
candidate_k = 40
```

See `config.example.toml` for all options.
Environment variable overrides:
```bash
export LOCAL_LLM_OLLAMA_BASE_URL="http://localhost:11434"
export LOCAL_LLM_RETRIEVAL_ANSWER_K=15
```
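The examples follow a `LOCAL_LLM_<SECTION>_<KEY>` pattern, with the environment variable taking precedence over the TOML value. A hypothetical sketch of that precedence rule (not the package's actual config loader):

```python
# Illustrative override resolution for [retrieval].answer_k.
import os

def resolve_answer_k(toml_value: int = 10) -> int:
    raw = os.environ.get("LOCAL_LLM_RETRIEVAL_ANSWER_K")
    return int(raw) if raw is not None else toml_value
```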
```text
local-llm/
├── src/local_llm/           # Python package (installable)
│   ├── __init__.py
│   ├── cli.py               # Main CLI dispatcher
│   ├── config.py            # Configuration system
│   ├── index.py             # Indexing logic
│   ├── chat.py              # Chat session logic
│   ├── find.py              # Search functionality
│   ├── retrieval.py         # Hybrid retrieval (MMR + BM25)
│   ├── embeddings.py        # Ollama embeddings client
│   ├── doctor.py            # System diagnostics
│   ├── stt.py               # Speech-to-text wrapper
│   └── utils.py             # Shared helpers
├── tests/                   # Test suite
├── docs/                    # Documentation
├── stt/                     # Speech-to-text subsystem
├── notes/                   # Your Markdown notes (indexed input)
├── index/                   # Persisted Chroma DB (generated)
├── chunks.jsonl             # BM25 index (generated)
├── pyproject.toml           # Package metadata
├── config.example.toml      # Configuration template
└── README.md
```
- Runs a local LLM server using Ollama
- Indexes Markdown notes into a local Chroma vector store
- Uses local embedding models (no external calls)
- Provides a unified CLI with:
  - System diagnostics (`doctor`)
  - Index management (`index`)
  - Interactive chat (`chat`)
  - Direct search (`find`)
  - Configuration management
  - Speech-to-text integration
- Keeps all data strictly local
- Apple Silicon Mac or x86_64 Linux (recommended for optimal performance)
- Minimum 8GB RAM (16GB+ recommended for larger models)
- Python 3.10+ (3.12 recommended)
- Ollama (native, not Docker)
- ffmpeg (for speech-to-text features)
Create ~/.config/local-llm/config.toml:
```toml
[notes]
source = "~/Documents/notes"
mirror = "./notes"

[ollama]
base_url = "http://127.0.0.1:11434"
embed_model = "bge-m3"
chat_model = "llama3.1:8b-instruct-q4_K_M-16k"
timeout = 120

[retrieval]
answer_k = 10
find_k = 20
candidate_k = 40
max_context_chars = 22000
```

```bash
pip install -e ".[all]"  # Includes BM25, STT, and dev tools
```

```bash
# Build index incrementally (default)
local-llm index

# Force full rebuild (wipe and recreate)
local-llm index --full

# Check index status
local-llm doctor
```

```bash
# Direct search (no LLM synthesis)
local-llm find "vector embeddings hybrid retrieval"

# Interactive chat (uses LLM for synthesis)
local-llm chat
```

```bash
# Transcribe an audio file
local-llm stt transcribe /path/to/audio.wav

# With advanced options
local-llm stt transcribe audio.wav \
  --model large-v3 \
  --align \
  --diarize \
  --output-dir ./transcripts
```

Note: This section documents the original installation method. For new installations, use the Quick Start guide above.
```bash
brew install ollama
```

Start the Ollama daemon:

```bash
ollama serve
```

Verify it is listening:

```bash
lsof -nP -iTCP:11434 | grep LISTEN
```

```bash
ollama pull llama3.1:8b-instruct
ollama pull nomic-embed-text
```

```bash
cd local-llm
python3.12 -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install --upgrade pip
pip install \
  langchain \
  langchain-community \
  langchain-text-splitters \
  chromadb \
  httpx \
  tqdm \
  rich
```

All Markdown notes live in:

```text
./notes/**/*.md
```
Only .md files are indexed.
```bash
source venv/bin/activate
python index.py
```

Expected behavior:
- Progress bar over files
- Progress bar over chunks
- Final confirmation message:
```text
Index built. Notes: ./notes Index: ./index
```
If you change:
- chunk size
- embedding model
- metadata logic
👉 delete ./index/ and rebuild.
```bash
source venv/bin/activate
python chat.py
```

| Command | Description |
|---|---|
| normal text | Ask a question using RAG |
| `:find <term>` | Keyword search across notes |
| `:debug on/off` | Show retrieved chunks + scores |
| `:quit` | Exit |
All LLM calls go to:
`http://127.0.0.1:11434`
(IPv4 enforced to avoid intermittent IPv6 issues.)
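In practice every client builds its URL from the explicit IPv4 address rather than `localhost`. A minimal sketch assuming Ollama's standard `/api/embeddings` endpoint (not the repository's actual client code):

```python
# Local-only embedding call pinned to IPv4 to avoid ::1 resolution issues.
import httpx

OLLAMA_BASE_URL = "http://127.0.0.1:11434"  # explicit IPv4, never "localhost"

def embed(text: str, model: str = "bge-m3") -> list[float]:
    resp = httpx.post(
        f"{OLLAMA_BASE_URL}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```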
Each chunk stores:
- Source file path
- Markdown heading context
- Chunk text
- File modification timestamp (`mtime`)
- File change timestamp (`ctime`)
This enables:
- Better attribution
- Recency-aware reasoning
- Future rerank improvements
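The per-chunk metadata described above could look roughly like this (field names are assumptions, not the repository's exact schema):

```python
# Illustrative shape of the metadata stored alongside each chunk.
from pathlib import Path

def chunk_metadata(path: Path, heading: str) -> dict:
    stat = path.stat()
    return {
        "source": str(path),     # source file path, used for citations
        "heading": heading,      # Markdown heading context
        "mtime": stat.st_mtime,  # file modification time
        "ctime": stat.st_ctime,  # file change time
    }
```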
.git is intentionally NOT indexed.
Git metadata may be used later for timestamps, but raw .git content is noise for embeddings.
| Decision | Rationale |
|---|---|
| Ollama native | Lowest friction, stable, fully local |
| Chroma | Simple, persistent, debuggable |
| Markdown-aware chunking | Notes retain structure |
| Explicit progress bars | No silent hangs |
| No Docker | Avoids Apple Silicon friction |
| IPv4 only | Prevents ::1 refusal issues |
| Deterministic rebuilds | Trustworthy indexing |
Rebuild the index after:

- adding notes
- editing many notes
- changing chunking or embeddings

```bash
rm -rf ./index
python index.py
```

To pull or update a model:

```bash
ollama pull <model>
```

- No incremental indexing yet
- No semantic date reasoning
- Reranking is heuristic-based
- No UI (terminal only by design)
- [ ] Add Git commit timestamp metadata to index
- [ ] Incremental indexing (hash-based change detection)
- [ ] Query-aware recency boosting in reranker
- [ ] Section-level embeddings (per heading)
- [ ] Containerization and isolation for improved deployment and security
- [ ] CI/CD and end-to-end testing for updates to the base project
- [ ] Improved STT functionality and usability
- [ ] Better long-answer synthesis prompts
- [ ] Optional Obsidian URI deep-links in citations
- [ ] Add `:sources` command to inspect index metadata
- [ ] Optional local UI (TUI or WebUI, still offline)
- [ ] Add regression test queries for answer quality