Local LLM + Notes RAG (Private, Offline-First)

This repository contains a fully local, privacy-preserving LLM setup for querying personal notes using Retrieval-Augmented Generation (RAG).
All inference and embeddings run entirely on your machine (no OpenAI, no Google, no cloud APIs).

The system is optimized for:

  • Privacy-first: No telemetry, no cloud APIs, all data stays local
  • Markdown notes: Structure-aware indexing with heading context
  • Source attribution: Every answer cites source files
  • Hybrid retrieval: Vector (MMR) + Lexical (BM25) + RRF fusion (sketched below)
  • Unified CLI: Modern local-llm command with rich features
  • macOS & Linux: Works on Apple Silicon and x86_64

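The fusion step can be pictured as reciprocal-rank fusion over the two ranked lists. A minimal sketch, with illustrative function and variable names rather than the package's actual API:

from collections import defaultdict

def rrf_fuse(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs by summing 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (vector_hits, bm25_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c2" ranks well in both lists, so it wins after fusion.
print(rrf_fuse(["c1", "c2", "c3"], ["c2", "c4", "c1"]))  # ['c2', 'c1', 'c4', 'c3']

Chunks that score well in either the semantic or the keyword ranking surface near the top, which is why both paraphrased and keyword-heavy queries work.
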
Quick Start

1. Install

git clone https://github.com/alex-thorne/local-llm.git
cd local-llm
python3.12 -m venv venv
source venv/bin/activate
pip install -e ".[bm25]"

2. Install Ollama and Models

# Install Ollama
brew install ollama

# Start Ollama server
ollama serve &

# Pull required models
ollama pull llama3.1:8b-instruct
ollama pull bge-m3

3. Verify Setup

local-llm doctor

This checks:

  • ✅ Ollama server is running
  • ✅ Required models are available
  • ✅ Notes directory exists
  • ✅ All dependencies are working
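
A simplified sketch of the model-availability part of that check, assuming only Ollama's public HTTP API (GET /api/tags lists pulled models); the actual doctor command covers more than this:

import httpx

REQUIRED = {"llama3.1:8b-instruct", "bge-m3"}  # model names from the Quick Start above

def check_ollama(base_url: str = "http://127.0.0.1:11434") -> None:
    resp = httpx.get(f"{base_url}/api/tags", timeout=5)
    resp.raise_for_status()
    available = {m["name"] for m in resp.json().get("models", [])}
    missing = {m for m in REQUIRED if not any(name.startswith(m) for name in available)}
    print("Ollama server: OK")
    print("Missing models:", ", ".join(sorted(missing)) or "none")

check_ollama()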

4. Build Index

local-llm index

First-time indexing will:

  • Read all .md files from ./notes/
  • Split by headings (structure-aware; see the sketch after this list)
  • Generate embeddings locally
  • Build vector and BM25 indices
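
The splitting step can be pictured with the MarkdownHeaderTextSplitter from langchain-text-splitters (one of the dependencies listed in the legacy install below); the real chunking lives in src/local_llm/index.py and may differ in detail:

from pathlib import Path
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)

for md_file in Path("./notes").rglob("*.md"):
    for chunk in splitter.split_text(md_file.read_text(encoding="utf-8")):
        # Each chunk keeps its heading trail in chunk.metadata, which is what
        # later enables heading context and per-file source attribution.
        print(md_file, chunk.metadata, len(chunk.page_content))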

5. Start Chatting

local-llm chat

Available commands:

  • Normal questions: Ask anything about your notes
  • :find <query> - Search without LLM synthesis
  • :sources - Show sources from last answer
  • :open N - Open source file in $EDITOR
  • :debug on/off - Toggle retrieval debug mode
  • :help - Show all commands
  • exit / quit / :q - Exit

CLI Commands Reference

  • local-llm --help - Show all commands
  • local-llm --version - Show version
  • local-llm doctor - Check prerequisites and system health
  • local-llm index - Build or rebuild the vector index
  • local-llm index --full - Force full rebuild (wipe existing index)
  • local-llm chat - Interactive RAG chat session
  • local-llm find <query> - Search notes without LLM synthesis
  • local-llm config show - Display current configuration
  • local-llm config edit - Open config file in $EDITOR
  • local-llm stt transcribe <file> - Transcribe audio files

Configuration

Create ~/.config/local-llm/config.toml or ./config.toml:

[notes]
source = "~/notes"
mirror = "./notes"

[ollama]
base_url = "http://127.0.0.1:11434"
embed_model = "bge-m3"
chat_model = "llama3.1:8b-instruct-q4_K_M-16k"

[retrieval]
answer_k = 10
find_k = 20
candidate_k = 40

See config.example.toml for all options.

Environment variable overrides:

export LOCAL_LLM_OLLAMA_BASE_URL="http://localhost:11434"
export LOCAL_LLM_RETRIEVAL_ANSWER_K=15
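
A rough sketch of how this layering might resolve (built-in defaults, then config files, then environment variables), assuming Python 3.11+'s tomllib and the LOCAL_LLM_<SECTION>_<KEY> naming shown above; the real logic lives in src/local_llm/config.py:

import os
import tomllib
from pathlib import Path

def load_config() -> dict:
    # Later sources win: built-in defaults < user config < project config < environment.
    config: dict = {
        "ollama": {"base_url": "http://127.0.0.1:11434"},
        "retrieval": {"answer_k": 10},
    }
    for path in (Path.home() / ".config/local-llm/config.toml", Path("config.toml")):
        if path.exists():
            with path.open("rb") as fh:
                for section, values in tomllib.load(fh).items():
                    config.setdefault(section, {}).update(values)
    # LOCAL_LLM_RETRIEVAL_ANSWER_K -> config["retrieval"]["answer_k"] (values stay strings here).
    for env_key, value in os.environ.items():
        if env_key.startswith("LOCAL_LLM_"):
            section, _, name = env_key[len("LOCAL_LLM_"):].lower().partition("_")
            if name:
                config.setdefault(section, {})[name] = value
    return config

print(load_config()["retrieval"]["answer_k"])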

Architecture Overview

Directory Structure

local-llm/
├── src/local_llm/        # Python package (installable)
│   ├── __init__.py
│   ├── cli.py            # Main CLI dispatcher
│   ├── config.py         # Configuration system
│   ├── index.py          # Indexing logic
│   ├── chat.py           # Chat session logic
│   ├── find.py           # Search functionality
│   ├── retrieval.py      # Hybrid retrieval (MMR + BM25)
│   ├── embeddings.py     # Ollama embeddings client
│   ├── doctor.py         # System diagnostics
│   ├── stt.py            # Speech-to-text wrapper
│   └── utils.py          # Shared helpers
├── tests/                # Test suite
├── docs/                 # Documentation
├── stt/                  # Speech-to-text subsystem
├── notes/                # Your Markdown notes (indexed input)
├── index/                # Persisted Chroma DB (generated)
├── chunks.jsonl          # BM25 index (generated)
├── pyproject.toml        # Package metadata
├── config.example.toml   # Configuration template
└── README.md

What This Setup Does

  • Runs a local LLM server using Ollama
  • Indexes Markdown notes into a local Chroma vector store
  • Uses local embedding models (no external calls)
  • Provides a unified CLI with:
    • System diagnostics (doctor)
    • Index management (index)
    • Interactive chat (chat)
    • Direct search (find)
    • Configuration management
    • Speech-to-text integration
  • Keeps all data strictly local

Prerequisites

Hardware

  • Apple Silicon Mac or x86_64 Linux (recommended for optimal performance)
  • Minimum 8GB RAM (16GB+ recommended for larger models)

Software

  • Python 3.10+ (3.12 recommended)
  • Ollama (native, not Docker)
  • ffmpeg (for speech-to-text features)

Advanced Installation

Using Custom Config

Create ~/.config/local-llm/config.toml:

[notes]
source = "~/Documents/notes"
mirror = "./notes"

[ollama]
base_url = "http://127.0.0.1:11434"
embed_model = "bge-m3"
chat_model = "llama3.1:8b-instruct-q4_K_M-16k"
timeout = 120

[retrieval]
answer_k = 10
find_k = 20
candidate_k = 40
max_context_chars = 22000

Install with All Optional Dependencies

pip install -e ".[all]"  # Includes BM25, STT, and dev tools

Advanced Usage

Index Management

# Build index incrementally (default)
local-llm index

# Force full rebuild (wipe and recreate)
local-llm index --full

# Check index status
local-llm doctor

Search and Retrieval

# Direct search (no LLM synthesis)
local-llm find "vector embeddings hybrid retrieval"

# Interactive chat (uses LLM for synthesis)
local-llm chat
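
For intuition, the lexical half of the hybrid search behaves roughly like the snippet below; it assumes the [bm25] extra provides the rank_bm25 package and uses naive whitespace tokenization:

from rank_bm25 import BM25Okapi

corpus = [
    "vector embeddings power semantic search",
    "BM25 rewards exact keyword overlap",
    "hybrid retrieval fuses both rankings",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "vector embeddings hybrid retrieval".lower().split()
print(bm25.get_top_n(query, corpus, n=2))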

Speech-to-Text

# Transcribe an audio file
local-llm stt transcribe /path/to/audio.wav

# With advanced options
local-llm stt transcribe audio.wav \
  --model large-v3 \
  --align \
  --diarize \
  --output-dir ./transcripts
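
Whisper-style models generally work best on 16 kHz mono audio; ffmpeg (listed under prerequisites) can normalize inputs beforehand. A small sketch, not part of the CLI itself, with a hypothetical file name:

import subprocess

def to_wav_16k_mono(src: str, dst: str) -> None:
    # Re-encode to 16 kHz mono PCM WAV before transcription.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)

to_wav_16k_mono("meeting.m4a", "meeting.wav")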

Initial Installation (Legacy Documentation)

Note: This section documents the original installation method. For new installations, use the Quick Start guide above.

4.1 Install Ollama

brew install ollama

Start the Ollama daemon:

ollama serve

Verify it is listening:

lsof -nP -iTCP:11434 | grep LISTEN

4.2 Pull Required Models

ollama pull llama3.1:8b-instruct
ollama pull nomic-embed-text

4.3 Python Environment

cd local-llm
python3.12 -m venv venv
source venv/bin/activate

Install dependencies:

pip install --upgrade pip
pip install \
  langchain \
  langchain-community \
  langchain-text-splitters \
  chromadb \
  httpx \
  tqdm \
  rich

5. Indexing Your Notes

5.1 Notes Location

All Markdown notes live in:

./notes/**/*.md

Only .md files are indexed.


5.2 Build the Index

source venv/bin/activate
python index.py

Expected behavior:

  • Progress bar over files
  • Progress bar over chunks
  • Final confirmation message:
Index built. Notes: ./notes  Index: ./index

If you change:

  • chunk size
  • embedding model
  • metadata logic

👉 delete ./index/ and rebuild.
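
A condensed sketch of what index.py does with the chromadb and httpx packages installed above; the chunk list, IDs, and collection name here are placeholders, not the script's exact code:

import chromadb
import httpx

OLLAMA = "http://127.0.0.1:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]} for one prompt.
    resp = httpx.post(
        f"{OLLAMA}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection("notes")

chunks = [("notes/example.md#0", "Example chunk text", {"source": "notes/example.md"})]
for chunk_id, text, metadata in chunks:
    collection.add(ids=[chunk_id], documents=[text],
                   embeddings=[embed(text)], metadatas=[metadata])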


6. Running the Chat Interface

source venv/bin/activate
python chat.py

Available Commands

  • normal text - Ask a question using RAG
  • :find <term> - Keyword search across notes
  • :debug on/off - Show retrieved chunks + scores
  • :quit - Exit

All LLM calls go to:

http://127.0.0.1:11434

(IPv4 enforced to avoid intermittent IPv6 issues.)
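
The shape of a single RAG turn inside chat.py is roughly the following; the prompt wording, result count, and metadata keys are illustrative, not the script's exact code:

import chromadb
import httpx

OLLAMA = "http://127.0.0.1:11434"
collection = chromadb.PersistentClient(path="./index").get_or_create_collection("notes")

question = "What did I write about hybrid retrieval?"
query_vec = httpx.post(
    f"{OLLAMA}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": question},
    timeout=120,
).json()["embedding"]

hits = collection.query(query_embeddings=[query_vec], n_results=5)
context = "\n\n".join(
    f"[{meta['source']}]\n{doc}"
    for doc, meta in zip(hits["documents"][0], hits["metadatas"][0])
)

reply = httpx.post(f"{OLLAMA}/api/chat", json={
    "model": "llama3.1:8b-instruct",
    "stream": False,
    "messages": [
        {"role": "system", "content": "Answer only from the provided notes and cite file paths."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
}, timeout=120)
print(reply.json()["message"]["content"])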


7. What Is Indexed (and Why)

Each chunk stores:

  • Source file path
  • Markdown heading context
  • Chunk text
  • File modification timestamp (mtime)
  • File metadata change timestamp (ctime)

This enables:

  • Better attribution
  • Recency-aware reasoning
  • Future rerank improvements

.git is intentionally NOT indexed.
Git metadata may be used later for timestamps, but raw .git content is noise for embeddings.
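
For concreteness, the per-chunk record described above looks roughly like this; the field names are assumptions (the real schema lives in the indexing code), and records of this shape are also what chunks.jsonl, the BM25 side of the index, holds:

import json
import os
from pathlib import Path

def chunk_record(path: Path, heading: str, text: str) -> dict:
    st = os.stat(path)
    return {
        "source": str(path),   # source file path, used for citations
        "heading": heading,    # Markdown heading context
        "text": text,          # chunk body fed to embeddings and BM25
        "mtime": st.st_mtime,  # last content modification
        "ctime": st.st_ctime,  # last metadata change
    }

for md_file in Path("./notes").rglob("*.md"):
    record = chunk_record(md_file, heading="(whole file)",
                          text=md_file.read_text(encoding="utf-8")[:200])
    print(json.dumps(record, ensure_ascii=False))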


8. Why This Design

  • Ollama native - Lowest friction, stable, fully local
  • Chroma - Simple, persistent, debuggable
  • Markdown-aware chunking - Notes retain structure
  • Explicit progress bars - No silent hangs
  • No Docker - Avoids Apple Silicon friction
  • IPv4 only - Prevents ::1 refusal issues
  • Deterministic rebuilds - Trustworthy indexing

9. Maintenance

Re-index after:

  • adding notes
  • editing many notes
  • changing chunking or embeddings

rm -rf ./index
python index.py

Update models:

ollama pull <model>

10. Known Limitations

  • No incremental indexing yet
  • No semantic date reasoning
  • Reranking is heuristic-based
  • No UI (terminal only by design)

11. TODO / Future Improvements

- [ ] Add Git commit timestamp metadata to index
- [ ] Incremental indexing (hash-based change detection)
- [ ] Query-aware recency boosting in reranker
- [ ] Section-level embeddings (per heading)
- [ ] Containerization and segregation for improved deployment and security
- [ ] CI/CD and end-to-end testing for updates in base project
- [ ] Improve STT functionality and usability
- [ ] Better long-answer synthesis prompts
- [ ] Optional Obsidian URI deep-links in citations
- [ ] Add `:sources` command to inspect index metadata
- [ ] Optional local UI (TUI or WebUI, still offline)
- [ ] Add regression test queries for answer quality
