Local LLM + Notes RAG (Private, Offline-First)

This repository contains a fully local, privacy-preserving LLM setup for querying personal notes using Retrieval-Augmented Generation (RAG).
All inference and embeddings run entirely on your machine (no OpenAI, no Google, no cloud APIs).

The system is optimized for:

  • Privacy-first: No telemetry, no cloud APIs, all data stays local
  • Markdown notes: Structure-aware indexing with heading context
  • Source attribution: Every answer cites source files
  • Hybrid retrieval: Vector (MMR) + Lexical (BM25) + RRF fusion (sketched below)
  • Unified CLI: Modern local-llm command with rich features
  • macOS & Linux: Works on Apple Silicon and x86_64

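The fusion step can be pictured as reciprocal-rank fusion over the two ranked lists. A minimal sketch, with illustrative function and variable names rather than the package's actual API:

from collections import defaultdict

def rrf_fuse(vector_hits: list[str], bm25_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs by summing 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (vector_hits, bm25_hits):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c2" ranks well in both lists, so it wins after fusion.
print(rrf_fuse(["c1", "c2", "c3"], ["c2", "c4", "c1"]))  # ['c2', 'c1', 'c4', 'c3']

Chunks that score well in either the semantic or the keyword ranking surface near the top, which is why both paraphrased and keyword-heavy queries work.
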
Quick Start

1. Install

git clone https://github.com/alex-thorne/local-llm.git
cd local-llm
python3.12 -m venv venv
source venv/bin/activate
pip install -e ".[bm25]"

2. Install Ollama and Models

# Install Ollama
brew install ollama

# Start Ollama server
ollama serve &

# Pull required models
ollama pull llama3.1:8b-instruct
ollama pull bge-m3

3. Verify Setup

local-llm doctor

This checks:

  • ✅ Ollama server is running
  • ✅ Required models are available
  • ✅ Notes directory exists
  • ✅ All dependencies are working
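
A simplified sketch of the model-availability part of that check, assuming only Ollama's public HTTP API (GET /api/tags lists pulled models); the actual doctor command covers more than this:

import httpx

REQUIRED = {"llama3.1:8b-instruct", "bge-m3"}  # model names from the Quick Start above

def check_ollama(base_url: str = "http://127.0.0.1:11434") -> None:
    resp = httpx.get(f"{base_url}/api/tags", timeout=5)
    resp.raise_for_status()
    available = {m["name"] for m in resp.json().get("models", [])}
    missing = {m for m in REQUIRED if not any(name.startswith(m) for name in available)}
    print("Ollama server: OK")
    print("Missing models:", ", ".join(sorted(missing)) or "none")

check_ollama()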

4. Build Index

local-llm index

First-time indexing will:

  • Read all .md files from ./notes/
  • Split by headings (structure-aware; see the sketch after this list)
  • Generate embeddings locally
  • Build vector and BM25 indices
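
The splitting step can be pictured with the MarkdownHeaderTextSplitter from langchain-text-splitters (one of the dependencies listed in the legacy install below); the real chunking lives in src/local_llm/index.py and may differ in detail:

from pathlib import Path
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)

for md_file in Path("./notes").rglob("*.md"):
    for chunk in splitter.split_text(md_file.read_text(encoding="utf-8")):
        # Each chunk keeps its heading trail in chunk.metadata, which is what
        # later enables heading context and per-file source attribution.
        print(md_file, chunk.metadata, len(chunk.page_content))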

5. Start Chatting

local-llm chat

Available commands:

  • Normal questions: Ask anything about your notes
  • :find <query> - Search without LLM synthesis
  • :sources - Show sources from last answer
  • :open N - Open source file in $EDITOR
  • :debug on/off - Toggle retrieval debug mode
  • :help - Show all commands
  • exit / quit / :q - Exit

CLI Commands Reference

  • local-llm --help - Show all commands
  • local-llm --version - Show version
  • local-llm doctor - Check prerequisites and system health
  • local-llm index - Build or rebuild the vector index
  • local-llm index --full - Force full rebuild (wipe existing index)
  • local-llm chat - Interactive RAG chat session
  • local-llm find <query> - Search notes without LLM synthesis
  • local-llm config show - Display current configuration
  • local-llm config edit - Open config file in $EDITOR
  • local-llm stt transcribe <file> - Transcribe audio files

Configuration

Create ~/.config/local-llm/config.toml or ./config.toml:

[notes]
source = "~/notes"
mirror = "./notes"

[ollama]
base_url = "http://127.0.0.1:11434"
embed_model = "bge-m3"
chat_model = "llama3.1:8b-instruct-q4_K_M-16k"

[retrieval]
answer_k = 10
find_k = 20
candidate_k = 40

See config.example.toml for all options.

Environment variable overrides:

export LOCAL_LLM_OLLAMA_BASE_URL="http://localhost:11434"
export LOCAL_LLM_RETRIEVAL_ANSWER_K=15
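
A rough sketch of how this layering might resolve (built-in defaults, then config files, then environment variables), assuming Python 3.11+'s tomllib and the LOCAL_LLM_<SECTION>_<KEY> naming shown above; the real logic lives in src/local_llm/config.py:

import os
import tomllib
from pathlib import Path

def load_config() -> dict:
    # Later sources win: built-in defaults < user config < project config < environment.
    config: dict = {
        "ollama": {"base_url": "http://127.0.0.1:11434"},
        "retrieval": {"answer_k": 10},
    }
    for path in (Path.home() / ".config/local-llm/config.toml", Path("config.toml")):
        if path.exists():
            with path.open("rb") as fh:
                for section, values in tomllib.load(fh).items():
                    config.setdefault(section, {}).update(values)
    # LOCAL_LLM_RETRIEVAL_ANSWER_K -> config["retrieval"]["answer_k"] (values stay strings here).
    for env_key, value in os.environ.items():
        if env_key.startswith("LOCAL_LLM_"):
            section, _, name = env_key[len("LOCAL_LLM_"):].lower().partition("_")
            if name:
                config.setdefault(section, {})[name] = value
    return config

print(load_config()["retrieval"]["answer_k"])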

Architecture Overview

Directory Structure

local-llm/
├── src/local_llm/        # Python package (installable)
│   ├── __init__.py
│   ├── cli.py            # Main CLI dispatcher
│   ├── config.py         # Configuration system
│   ├── index.py          # Indexing logic
│   ├── chat.py           # Chat session logic
│   ├── find.py           # Search functionality
│   ├── retrieval.py      # Hybrid retrieval (MMR + BM25)
│   ├── embeddings.py     # Ollama embeddings client
│   ├── doctor.py         # System diagnostics
│   ├── stt.py            # Speech-to-text wrapper
│   └── utils.py          # Shared helpers
├── tests/                # Test suite
├── docs/                 # Documentation
├── stt/                  # Speech-to-text subsystem
├── notes/                # Your Markdown notes (indexed input)
├── index/                # Persisted Chroma DB (generated)
├── chunks.jsonl          # BM25 index (generated)
├── pyproject.toml        # Package metadata
├── config.example.toml   # Configuration template
└── README.md

What This Setup Does

  • Runs a local LLM server using Ollama
  • Indexes Markdown notes into a local Chroma vector store
  • Uses local embedding models (no external calls)
  • Provides a unified CLI with:
    • System diagnostics (doctor)
    • Index management (index)
    • Interactive chat (chat)
    • Direct search (find)
    • Configuration management
    • Speech-to-text integration
  • Keeps all data strictly local

Prerequisites

Hardware

  • Apple Silicon Mac or x86_64 Linux (recommended for optimal performance)
  • Minimum 8GB RAM (16GB+ recommended for larger models)

Software

  • Python 3.10+ (3.12 recommended)
  • Ollama (native, not Docker)
  • ffmpeg (for speech-to-text features)

Advanced Installation

Using Custom Config

Create ~/.config/local-llm/config.toml:

[notes]
source = "~/Documents/notes"
mirror = "./notes"

[ollama]
base_url = "http://127.0.0.1:11434"
embed_model = "bge-m3"
chat_model = "llama3.1:8b-instruct-q4_K_M-16k"
timeout = 120

[retrieval]
answer_k = 10
find_k = 20
candidate_k = 40
max_context_chars = 22000

Install with All Optional Dependencies

pip install -e ".[all]"  # Includes BM25, STT, and dev tools

Advanced Usage

Index Management

# Build index incrementally (default)
local-llm index

# Force full rebuild (wipe and recreate)
local-llm index --full

# Check index status
local-llm doctor

Search and Retrieval

# Direct search (no LLM synthesis)
local-llm find "vector embeddings hybrid retrieval"

# Interactive chat (uses LLM for synthesis)
local-llm chat
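
For intuition, the lexical half of the hybrid search behaves roughly like the snippet below; it assumes the [bm25] extra provides the rank_bm25 package and uses naive whitespace tokenization:

from rank_bm25 import BM25Okapi

corpus = [
    "vector embeddings power semantic search",
    "BM25 rewards exact keyword overlap",
    "hybrid retrieval fuses both rankings",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "vector embeddings hybrid retrieval".lower().split()
print(bm25.get_top_n(query, corpus, n=2))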

Speech-to-Text

# Transcribe an audio file
local-llm stt transcribe /path/to/audio.wav

# With advanced options
local-llm stt transcribe audio.wav \
  --model large-v3 \
  --align \
  --diarize \
  --output-dir ./transcripts
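
Whisper-style models generally work best on 16 kHz mono audio; ffmpeg (listed under prerequisites) can normalize inputs beforehand. A small sketch, not part of the CLI itself, with a hypothetical file name:

import subprocess

def to_wav_16k_mono(src: str, dst: str) -> None:
    # Re-encode to 16 kHz mono PCM WAV before transcription.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)

to_wav_16k_mono("meeting.m4a", "meeting.wav")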

Initial Installation (Legacy Documentation)

Note: This section documents the original installation method. For new installations, use the Quick Start guide above.

4.1 Install Ollama

brew install ollama

Start the Ollama daemon:

ollama serve

Verify it is listening:

lsof -nP -iTCP:11434 | grep LISTEN

4.2 Pull Required Models

ollama pull llama3.1:8b-instruct
ollama pull nomic-embed-text

4.3 Python Environment

cd local-llm
python3.12 -m venv venv
source venv/bin/activate

Install dependencies:

pip install --upgrade pip
pip install \
  langchain \
  langchain-community \
  langchain-text-splitters \
  chromadb \
  httpx \
  tqdm \
  rich

5. Indexing Your Notes

5.1 Notes Location

All Markdown notes live in:

./notes/**/*.md

Only .md files are indexed.


5.2 Build the Index

source venv/bin/activate
python index.py

Expected behavior:

  • Progress bar over files
  • Progress bar over chunks
  • Final confirmation message:
Index built. Notes: ./notes  Index: ./index

If you change:

  • chunk size
  • embedding model
  • metadata logic

👉 delete ./index/ and rebuild.
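
A condensed sketch of what index.py does with the chromadb and httpx packages installed above; the chunk list, IDs, and collection name here are placeholders, not the script's exact code:

import chromadb
import httpx

OLLAMA = "http://127.0.0.1:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]} for one prompt.
    resp = httpx.post(
        f"{OLLAMA}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

client = chromadb.PersistentClient(path="./index")
collection = client.get_or_create_collection("notes")

chunks = [("notes/example.md#0", "Example chunk text", {"source": "notes/example.md"})]
for chunk_id, text, metadata in chunks:
    collection.add(ids=[chunk_id], documents=[text],
                   embeddings=[embed(text)], metadatas=[metadata])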


6. Running the Chat Interface

source venv/bin/activate
python chat.py

Available Commands

  • normal text - Ask a question using RAG
  • :find <term> - Keyword search across notes
  • :debug on/off - Show retrieved chunks + scores
  • :quit - Exit

All LLM calls go to:

http://127.0.0.1:11434

(IPv4 enforced to avoid intermittent IPv6 issues.)
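
The shape of a single RAG turn inside chat.py is roughly the following; the prompt wording, result count, and metadata keys are illustrative, not the script's exact code:

import chromadb
import httpx

OLLAMA = "http://127.0.0.1:11434"
collection = chromadb.PersistentClient(path="./index").get_or_create_collection("notes")

question = "What did I write about hybrid retrieval?"
query_vec = httpx.post(
    f"{OLLAMA}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": question},
    timeout=120,
).json()["embedding"]

hits = collection.query(query_embeddings=[query_vec], n_results=5)
context = "\n\n".join(
    f"[{meta['source']}]\n{doc}"
    for doc, meta in zip(hits["documents"][0], hits["metadatas"][0])
)

reply = httpx.post(f"{OLLAMA}/api/chat", json={
    "model": "llama3.1:8b-instruct",
    "stream": False,
    "messages": [
        {"role": "system", "content": "Answer only from the provided notes and cite file paths."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
}, timeout=120)
print(reply.json()["message"]["content"])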


7. What Is Indexed (and Why)

Each chunk stores:

  • Source file path
  • Markdown heading context
  • Chunk text
  • File modification timestamp (mtime)
  • File metadata change timestamp (ctime)

This enables:

  • Better attribution
  • Recency-aware reasoning
  • Future rerank improvements

.git is intentionally NOT indexed.
Git metadata may be used later for timestamps, but raw .git content is noise for embeddings.
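
For concreteness, the per-chunk record described above looks roughly like this; the field names are assumptions (the real schema lives in the indexing code), and records of this shape are also what chunks.jsonl, the BM25 side of the index, holds:

import json
import os
from pathlib import Path

def chunk_record(path: Path, heading: str, text: str) -> dict:
    st = os.stat(path)
    return {
        "source": str(path),   # source file path, used for citations
        "heading": heading,    # Markdown heading context
        "text": text,          # chunk body fed to embeddings and BM25
        "mtime": st.st_mtime,  # last content modification
        "ctime": st.st_ctime,  # last metadata change
    }

for md_file in Path("./notes").rglob("*.md"):
    record = chunk_record(md_file, heading="(whole file)",
                          text=md_file.read_text(encoding="utf-8")[:200])
    print(json.dumps(record, ensure_ascii=False))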


8. Why This Design

  • Ollama native - Lowest friction, stable, fully local
  • Chroma - Simple, persistent, debuggable
  • Markdown-aware chunking - Notes retain structure
  • Explicit progress bars - No silent hangs
  • No Docker - Avoids Apple Silicon friction
  • IPv4 only - Prevents ::1 refusal issues
  • Deterministic rebuilds - Trustworthy indexing

9. Maintenance

Re-index after:

  • adding notes
  • editing many notes
  • changing chunking or embeddings

rm -rf ./index
python index.py

Update models:

ollama pull <model>

10. Known Limitations

  • No incremental indexing yet
  • No semantic date reasoning
  • Reranking is heuristic-based
  • No UI (terminal only by design)

11. TODO / Future Improvements

- [ ] Add Git commit timestamp metadata to index
- [ ] Incremental indexing (hash-based change detection)
- [ ] Query-aware recency boosting in reranker
- [ ] Section-level embeddings (per heading)
- [ ] Containerization and segregation for improved deployment and security
- [ ] CI/CD and end-to-end testing for updates in base project
- [ ] Improve STT functionality and usability
- [ ] Better long-answer synthesis prompts
- [ ] Optional Obsidian URI deep-links in citations
- [ ] Add `:sources` command to inspect index metadata
- [ ] Optional local UI (TUI or WebUI, still offline)
- [ ] Add regression test queries for answer quality
