Commit 76ee8a7

Author: Daniele Briggi
Commit message: chore(readme): improve
1 parent 54d105e, commit 76ee8a7

3 files changed: +121 -107 lines changed

README.md

Lines changed: 58 additions & 97 deletions
Original file line number | Diff line number | Diff line change
@@ -1,13 +1,20 @@
1+
<img src="https://private-user-images.githubusercontent.com/6153996/490482446-6e1326c5-9009-4b2d-afc1-48b7867fa215.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTgxMDM3MjMsIm5iZiI6MTc1ODEwMzQyMywicGF0aCI6Ii82MTUzOTk2LzQ5MDQ4MjQ0Ni02ZTEzMjZjNS05MDA5LTRiMmQtYWZjMS00OGI3ODY3ZmEyMTUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDkxNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTA5MTdUMTAwMzQzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTI1NmZjOWJlNTY2NGM4ZmRhNTkzYzAyMWFlOTFmNjdmMmI3OWI2Mzk5MjY2NzFiMDE2NDk4ZGY1ZTFjMjNkOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.mnWZsUVwZRjpV2nz9WDX9OA9MvkbqT4DO8nQR5trKQI" alt="https://sqlite.ai" width="100"/>
2+
13
# SQLite RAG
24

3-
A hybrid search engine built on SQLite with AI and Vector extensions. SQLite-RAG combines vector similarity search with full-text search using Reciprocal Rank Fusion (RRF) for enhanced document retrieval.
5+
[![Run Tests](https://github.com/sqliteai/sqlite-rag/actions/workflows/test.yaml/badge.svg?branch=main&event=release)](https://github.com/sqliteai/sqlite-rag/actions/workflows/test.yaml)
6+
[![codecov](https://codecov.io/github/sqliteai/sqlite-rag/graph/badge.svg?token=30KYPY7864)](https://codecov.io/github/sqliteai/sqlite-rag)
7+
![PyPI - Version](https://img.shields.io/pypi/v/sqlite-rag?link=https%3A%2F%2Fpypi.org%2Fproject%2Fsqlite-rag%2F)
8+
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sqlite-rag?link=https%3A%2F%2Fpypi.org%2Fproject%2Fsqlite-rag)
9+
10+
A hybrid search engine built on SQLite with [SQLite AI](https://github.com/sqliteai/sqlite-ai) and [SQLite Vector](https://github.com/sqliteai/sqlite-vector) extensions. SQLite RAG combines vector similarity search with full-text search ([FTS5](https://www.sqlite.org/fts5.html) extension) using Reciprocal Rank Fusion (RRF) for enhanced document retrieval.
411

512
## Features
613

714
- **Hybrid Search**: Combines vector embeddings with full-text search for optimal results
815
- **SQLite-based**: Built on SQLite with AI and Vector extensions for reliability and performance
9-
- **Multi-format Support**: Process 25+ file formats including PDF, DOCX, Markdown, code files
10-
- **Intelligent Chunking**: Token-aware text chunking with configurable overlap
16+
- **Multi-format Text Support**: Process text-based file formats including PDF, DOCX, Markdown, and code files
17+
- **Recursive Character Text Splitter**: Token-aware text chunking with configurable overlap
1118
- **Interactive CLI**: Command-line interface with interactive REPL mode
1219
- **Flexible Configuration**: Customizable embedding models, search weights, and chunking parameters
1320

@@ -19,8 +26,16 @@ pip install sqlite-rag
1926

2027
## Quick Start
2128

29+
Download [Embedding Gemma](https://huggingface.co/unsloth/embeddinggemma-300m-GGUF), the default embedding model, from Hugging Face:
30+
2231
```bash
23-
# Initialize and add documents
32+
sqlite-rag download-model unsloth/embeddinggemma-300m-GGUF embeddinggemma-300M-Q8_0.gguf
33+
```
34+
35+
Then start with default settings:
36+
37+
```bash
38+
# Initialize sqliterag.sqlite database and add documents
2439
sqlite-rag add /path/to/documents --recursive
2540

2641
# Search your documents
@@ -33,133 +48,79 @@ sqlite-rag
3348
> exit
3449
```
3550

36-
## CLI Commands
37-
38-
### Document Management
51+
For help run:
3952

40-
**Add files or directories:**
4153
```bash
42-
sqlite-rag add <path> [--recursive] [--absolute-paths] [--metadata '{"key": "value"}']
54+
sqlite-rag --help
4355
```
4456

45-
**Add raw text:**
46-
```bash
47-
sqlite-rag add-text "your text content" [uri] [--metadata '{"key": "value"}']
48-
```
57+
## CLI Commands
4958

50-
**List all documents:**
51-
```bash
52-
sqlite-rag list
53-
```
59+
### Configuration
60+
61+
Settings are stored in the database and should be set before adding any documents.
5462

55-
**Remove documents:**
5663
```bash
57-
sqlite-rag remove <path-or-uuid> [--yes]
58-
```
64+
# Interactive configuration
65+
sqlite-rag configure
5966

60-
### Search & Query
67+
# View current settings
68+
sqlite-rag settings
6169

62-
**Hybrid search:**
63-
```bash
64-
sqlite-rag search "your query" [--limit 10] [--debug]
70+
# View available configuration options
71+
sqlite-rag configure --help
6572
```
6673

67-
Use `--debug` to see detailed ranking information including vector ranks, FTS ranks, and combined scores.
74+
To use a different database path, use the global `--database` option:
6875

69-
### Database Operations
70-
71-
**Rebuild indexes and embeddings:**
7276
```bash
73-
sqlite-rag rebuild [--remove-missing]
74-
```
77+
# Single command with custom database
78+
sqlite-rag --database mydb.db add-text "What's AI?"
7579

76-
**Clear entire database:**
77-
```bash
78-
sqlite-rag reset [--yes]
80+
# Interactive mode with custom database
81+
sqlite-rag --database mydb.db
7982
```
8083

81-
### Configuration
84+
### Model Management
8285

83-
**View current settings:**
84-
```bash
85-
sqlite-rag settings
86-
```
86+
You can experiment with other models from Hugging Face by downloading them with:
8787

88-
**Update configuration:**
8988
```bash
90-
sqlite-rag set [options]
89+
# Download GGUF models from Hugging Face
90+
sqlite-rag download-model <model-repo> <filename>
9191
```
9292

93-
Available settings:
94-
- `--model-path-or-name`: Embedding model (file path or HuggingFace model)
95-
- `--embedding-dim`: Vector dimensions
96-
- `--chunk-size`: Text chunk size (tokens)
97-
- `--chunk-overlap`: Token overlap between chunks
98-
- `--weight-fts`: Full-text search weight (0.0-1.0)
99-
- `--weight-vec`: Vector search weight (0.0-1.0)
100-
- `--quantize-scan`: Enable quantized vectors for faster search
101-
- `--quantize-preload`: Preload quantized vectors in memory
93+
## Supported File Formats
10294

103-
## Python API
95+
SQLite RAG supports the following file formats:
10496

105-
```python
106-
from sqlite_rag import SQLiteRag
97+
- **Text**: `.txt`, `.md`, `.mdx`, `.csv`, `.json`, `.xml`, `.yaml`, `.yml`
98+
- **Documents**: `.pdf`, `.docx`, `.pptx`, `.xlsx`
99+
- **Code**: `.c`, `.cpp`, `.css`, `.go`, `.h`, `.hpp`, `.html`, `.java`, `.js`, `.mjs`, `.kt`, `.php`, `.py`, `.rb`, `.rs`, `.swift`, `.ts`, `.tsx`
100+
- **Web Frameworks**: `.svelte`, `.vue`
107101

108-
# Create RAG instance
109-
rag = SQLiteRag.create("./database.sqlite")
102+
## Development
110103

111-
# Add documents
112-
rag.add("/path/to/documents", recursive=True)
113-
rag.add_text("Raw text content", uri="doc.txt")
104+
### Installation
114105

115-
# Search
116-
results = rag.search("search query", top_k=5)
117-
for result in results:
118-
    print(f"Score: {result.score}")
119-
    print(f"Content: {result.content}")
120-
    print(f"URI: {result.uri}")
106+
For development, clone the repository and install with development dependencies:
121107

122-
# List documents
123-
documents = rag.list_documents()
108+
```bash
109+
# Clone the repository
110+
git clone https://github.com/sqliteai/sqlite-rag.git
111+
cd sqlite-rag
124112

125-
# Remove document
126-
rag.remove_document("document-id-or-path")
113+
# Create virtual environment
114+
python -m venv .venv
115+
source .venv/bin/activate # On Windows: .venv\Scripts\activate
127116

128-
# Database operations
129-
rag.rebuild(remove_missing=True)
130-
rag.reset()
117+
# Install in development mode
118+
pip install -e ".[dev]"
131119
```
132-
133-
## Supported File Formats
134-
135-
SQLite-RAG supports 25+ file formats through the MarkItDown library:
136-
137-
- **Text**: `.txt`, `.md`, `.csv`, `.json`, `.xml`
138-
- **Documents**: `.pdf`, `.docx`, `.pptx`, `.xlsx`
139-
- **Code**: `.py`, `.js`, `.html`, `.css`, `.sql`
140-
- **And many more**: `.rtf`, `.odt`, `.epub`, `.zip`, etc.
141-
142120
## How It Works
143121

144122
1. **Document Processing**: Files are processed and split into overlapping chunks
145123
2. **Embedding Generation**: Text chunks are converted to vector embeddings using AI models
146124
3. **Dual Indexing**: Content is indexed for both vector similarity and full-text search
147125
4. **Hybrid Search**: Queries are processed through both search methods
148126
5. **Result Fusion**: Results are combined using Reciprocal Rank Fusion for optimal relevance
149-
150-
## Default Configuration
151-
152-
- **Model**: Qwen3-Embedding-0.6B (Q8_0 quantized, 1024 dimensions)
153-
- **Chunking**: 12,000 tokens per chunk with 1,200 token overlap
154-
- **Vectors**: FLOAT16 storage with cosine similarity
155-
- **Search**: Equal weighting (1.0) for vector and full-text results
156-
- **Database**: `./sqliterag.sqlite`
157-
158-
## Extensions Required
159-
160-
SQLite-RAG requires these SQLite extensions:
161-
162-
- **[sqlite-ai](https://github.com/sqliteai/sqlite-ai)**: LLM model loading and embedding generation
163-
- **[sqlite-vector](https://github.com/sqliteai/sqlite-vector)**: Vector storage and similarity search
164-
165-
These are automatically installed as dependencies.
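
The "Result Fusion" step in the README's How It Works list can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is not sqlite-rag's implementation: the function name, the smoothing constant `k=60` (a commonly used default), and the way the per-source weights are applied are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document ids into one list, best first.

    Hypothetical helper: each document scores weight / (k + rank) for every
    list it appears in, and the scores are summed across lists.
    """
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up result lists standing in for the vector and full-text searches.
vector_hits = ["doc_a", "doc_b", "doc_c"]
fts_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, fts_hits])
print(fused)
```

Documents that appear near the top of both lists (`doc_b`, `doc_a`) outrank documents found by only one search, which is the property that makes rank fusion useful when vector and full-text scores are not directly comparable.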

model_evaluation/README.md

Lines changed: 50 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,51 @@
1-
# 1. Process dataset
2-
python test_ms_marco.py process --config example_config.json --limit-rows 100
1+
# Model Evaluation Python Script
32

4-
# 2. Evaluate (saves to example_config_evaluation_results.txt)
5-
python test_ms_marco.py evaluate --config example_config.json --limit-rows 100
3+
A simple evaluation script for SQLite RAG using the MS MARCO dataset. Results can be compared against the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) benchmarks.
4+
5+
## MS MARCO Dataset
6+
7+
**MS MARCO**: Microsoft's question-answering dataset (MAchine Reading COmprehension), built from real web queries and passages.
8+
9+
## Evaluation Metrics
10+
11+
- **Hit Rate (HR@k)**: Percentage of queries with relevant results in top-k
12+
- **MRR**: Mean Reciprocal Rank - average of 1/rank of the first relevant result for each query
13+
- **NDCG**: Normalized Discounted Cumulative Gain - ranking quality metric
14+
15+
## Usage
16+
17+
### 1. Setup Configuration
18+
19+
Create an example config file and then edit it with your model settings:
20+
21+
```bash
22+
python ms_marco.py create-config
23+
```
24+
25+
### 2. Process Dataset
26+
27+
```bash
28+
python ms_marco.py process --config configs/my_config.json --limit-rows 100
29+
```
30+
31+
Processes MS MARCO passages into the SQLite RAG database for evaluation.
32+
33+
### 3. Evaluate Performance
34+
35+
```bash
36+
python ms_marco.py evaluate --config configs/my_config.json --limit-rows 100
37+
```
38+
39+
Runs evaluation and saves results to `results/my_config_evaluation_results.txt`.
40+
41+
> **Note**: Without capable hardware, processing and evaluating the entire dataset can take a long time.
42+
> Use `--limit-rows` to process and evaluate only the first n rows.
43+
44+
## Example Results
45+
46+
```
47+
Metric      @1      @3      @5      @10
48+
Hit Rate    0.650   0.780   0.820   0.850
49+
MRR         0.650   0.710   0.720   0.725
50+
NDCG        0.650   0.715   0.735   0.750
51+
```
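
The three metrics listed under Evaluation Metrics can be sketched as follows. This is an illustrative computation only, not the code in `ms_marco.py`: it assumes binary relevance (a passage is either relevant or not), and the function names and sample data are hypothetical.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked, relevant, k):
    """1/rank of the first relevant id in the top-k results, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(runs, k):
    """Average each metric over (ranked_ids, relevant_ids) pairs."""
    n = len(runs)
    return {
        "hit_rate": sum(hit_rate_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / n,
        "ndcg": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Made-up search results for three queries, one relevant passage each.
runs = [
    (["p3", "p1", "p7"], {"p1"}),  # relevant passage at rank 2
    (["p2", "p9", "p4"], {"p2"}),  # relevant passage at rank 1
    (["p5", "p6", "p8"], {"p0"}),  # relevant passage not retrieved
]
metrics = evaluate(runs, k=3)
print(metrics)
```

With one relevant passage per query, Hit Rate only asks whether it was retrieved, while MRR and NDCG also reward placing it higher in the list.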

model_evaluation/ms_marco.py

Lines changed: 13 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -486,8 +486,8 @@ def main():
486486
    )
487487
    parser.add_argument(
488488
        "action",
489-
        choices=["process", "evaluate"],
490-
        help="Action to perform: 'process' to add passages to database, 'evaluate' to test search quality",
489+
        choices=["process", "evaluate", "create-config"],
490+
        help="Action to perform: 'process' to add passages to database, 'evaluate' to test search quality, 'create-config' to generate example configuration",
491491
    )
492492
    parser.add_argument(
493493
        "--limit-rows",
@@ -498,16 +498,23 @@ def main():
498498
    parser.add_argument(
499499
        "--config",
500500
        type=str,
501-
        required=True,
501+
        required=False,
502502
        help="JSON configuration file with RAG settings and database path",
503503
    )
504504

505505
    args = parser.parse_args()
506506

507+
    if args.action == "create-config":
508+
        print("Creating example configuration file...")
509+
        config_file = create_example_config()
510+
        print(f"Configuration file created: {config_file}")
511+
        print("Edit the file with your settings and then run process/evaluate.")
512+
        return
513+
514+
    # Config is required for process and evaluate actions
507515
    if args.config is None:
508-
        print("Missing config file. Creating example config...")
509-
        create_example_config()
510-
        print("Please edit ms_marco_config.json with your settings and try again.")
516+
        print("Error: --config is required for process and evaluate actions")
517+
        print("Use 'create-config' action to generate an example configuration file")
511518
        return
512519

513520
    try:
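
The change above makes `--config` optional at the parser level and checks it by hand, printing a message and returning when it is missing. A possible variation, shown here only as a standalone sketch and not as part of the commit, is to report the problem through `parser.error`, which prints the usage text to stderr and exits with status 2. The `parse_args` wrapper and the parser description are hypothetical names chosen for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse the CLI; --config is only mandatory for process/evaluate."""
    parser = argparse.ArgumentParser(description="MS MARCO evaluation (sketch)")
    parser.add_argument(
        "action",
        choices=["process", "evaluate", "create-config"],
    )
    parser.add_argument("--config", type=str, required=False)
    args = parser.parse_args(argv)

    # argparse cannot express "required unless action is create-config",
    # so the check is done after parsing.
    if args.action != "create-config" and args.config is None:
        parser.error("--config is required for process and evaluate actions")
    return args


args = parse_args(["create-config"])
print(args.action, args.config)
```

The practical difference is the non-zero exit status, which lets shell scripts and CI jobs detect the misuse instead of treating the run as successful.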
