chore(readme): improve

Daniele Briggi · Daniele Briggi · commit 76ee8a7893f1 · 2025-09-17T10:11:32.000Z
diff --git a/README.md b/README.md
@@ -1,13 +1,20 @@
+<img src="https://private-user-images.githubusercontent.com/6153996/490482446-6e1326c5-9009-4b2d-afc1-48b7867fa215.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTgxMDM3MjMsIm5iZiI6MTc1ODEwMzQyMywicGF0aCI6Ii82MTUzOTk2LzQ5MDQ4MjQ0Ni02ZTEzMjZjNS05MDA5LTRiMmQtYWZjMS00OGI3ODY3ZmEyMTUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDkxNyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTA5MTdUMTAwMzQzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTI1NmZjOWJlNTY2NGM4ZmRhNTkzYzAyMWFlOTFmNjdmMmI3OWI2Mzk5MjY2NzFiMDE2NDk4ZGY1ZTFjMjNkOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.mnWZsUVwZRjpV2nz9WDX9OA9MvkbqT4DO8nQR5trKQI" alt="https://sqlite.ai" width="100"/>
+
 # SQLite RAG
 
-A hybrid search engine built on SQLite with AI and Vector extensions. SQLite-RAG combines vector similarity search with full-text search using Reciprocal Rank Fusion (RRF) for enhanced document retrieval.
+[![Run Tests](https://github.com/sqliteai/sqlite-rag/actions/workflows/test.yaml/badge.svg?branch=main&event=release)](https://github.com/sqliteai/sqlite-rag/actions/workflows/test.yaml)
+[![codecov](https://codecov.io/github/sqliteai/sqlite-rag/graph/badge.svg?token=30KYPY7864)](https://codecov.io/github/sqliteai/sqlite-rag)
+![PyPI - Version](https://img.shields.io/pypi/v/sqlite-rag?link=https%3A%2F%2Fpypi.org%2Fproject%2Fsqlite-rag%2F)
+![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sqlite-rag?link=https%3A%2F%2Fpypi.org%2Fproject%2Fsqlite-rag)
+
+A hybrid search engine built on SQLite with the [SQLite AI](https://github.com/sqliteai/sqlite-ai) and [SQLite Vector](https://github.com/sqliteai/sqlite-vector) extensions. SQLite RAG combines vector similarity search with full-text search (via the [FTS5](https://www.sqlite.org/fts5.html) extension) and merges the two result lists with Reciprocal Rank Fusion (RRF) for better document retrieval.
 
 ## Features
 
 - **Hybrid Search**: Combines vector embeddings with full-text search for optimal results
 - **SQLite-based**: Built on SQLite with AI and Vector extensions for reliability and performance
-- **Multi-format Support**: Process 25+ file formats including PDF, DOCX, Markdown, code files
-- **Intelligent Chunking**: Token-aware text chunking with configurable overlap
+- **Multi-format Text Support**: Processes text-based file formats, including PDF, DOCX, Markdown, and code files
+- **Recursive Character Text Splitter**: Token-aware text chunking with configurable overlap
 - **Interactive CLI**: Command-line interface with interactive REPL mode
 - **Flexible Configuration**: Customizable embedding models, search weights, and chunking parameters
 
@@ -19,8 +26,16 @@ pip install sqlite-rag
 
 ## Quick Start
 
+Download [Embedding Gemma](https://huggingface.co/unsloth/embeddinggemma-300m-GGUF), the default embedding model, from Hugging Face:
+
 ```bash
-# Initialize and add documents
+sqlite-rag download-model unsloth/embeddinggemma-300m-GGUF embeddinggemma-300M-Q8_0.gguf
+```
+
+Then start with default settings:
+
+```bash
+# Initialize the sqliterag.sqlite database and add documents
 sqlite-rag add /path/to/documents --recursive
 
 # Search your documents
@@ -33,133 +48,79 @@ sqlite-rag
 > exit
 ```
 
-## CLI Commands
-
-### Document Management
+For help, run:
 
-**Add files or directories:**
 ```bash
-sqlite-rag add <path> [--recursive] [--absolute-paths] [--metadata '{"key": "value"}']
+sqlite-rag --help
 ```
 
-**Add raw text:**
-```bash
-sqlite-rag add-text "your text content" [uri] [--metadata '{"key": "value"}']
-```
+## CLI Commands
 
-**List all documents:**
-```bash
-sqlite-rag list
-```
+### Configuration
+
+Settings are stored in the database and should be set before adding any documents.
 
-**Remove documents:**
 ```bash
-sqlite-rag remove <path-or-uuid> [--yes]
-```
+# Interactive configuration
+sqlite-rag configure
 
-### Search & Query
+# View current settings
+sqlite-rag settings
 
-**Hybrid search:**
-```bash
-sqlite-rag search "your query" [--limit 10] [--debug]
+# View available configuration options
+sqlite-rag configure --help
 ```
 
-Use `--debug` to see detailed ranking information including vector ranks, FTS ranks, and combined scores.
+To use a different database path, use the global `--database` option:
 
-### Database Operations
-
-**Rebuild indexes and embeddings:**
 ```bash
-sqlite-rag rebuild [--remove-missing]
-```
+# Single command with custom database
+sqlite-rag --database mydb.db add-text "What's AI?"
 
-**Clear entire database:**
-```bash
-sqlite-rag reset [--yes]
+# Interactive mode with custom database
+sqlite-rag --database mydb.db
 ```
 
-### Configuration
+### Model Management
 
-**View current settings:**
-```bash
-sqlite-rag settings
-```
+You can experiment with other models from Hugging Face by downloading them with:
 
-**Update configuration:**
 ```bash
-sqlite-rag set [options]
+# Download GGUF models from Hugging Face
+sqlite-rag download-model <model-repo> <filename>
 ```
 
-Available settings:
-- `--model-path-or-name`: Embedding model (file path or HuggingFace model)
-- `--embedding-dim`: Vector dimensions
-- `--chunk-size`: Text chunk size (tokens)
-- `--chunk-overlap`: Token overlap between chunks
-- `--weight-fts`: Full-text search weight (0.0-1.0)
-- `--weight-vec`: Vector search weight (0.0-1.0)
-- `--quantize-scan`: Enable quantized vectors for faster search
-- `--quantize-preload`: Preload quantized vectors in memory
+## Supported File Formats
 
-## Python API
+SQLite RAG supports the following file formats:
 
-```python
-from sqlite_rag import SQLiteRag
+- **Text**: `.txt`, `.md`, `.mdx`, `.csv`, `.json`, `.xml`, `.yaml`, `.yml`
+- **Documents**: `.pdf`, `.docx`, `.pptx`, `.xlsx`
+- **Code**: `.c`, `.cpp`, `.css`, `.go`, `.h`, `.hpp`, `.html`, `.java`, `.js`, `.mjs`, `.kt`, `.php`, `.py`, `.rb`, `.rs`, `.swift`, `.ts`, `.tsx`
+- **Web Frameworks**: `.svelte`, `.vue`
 
-# Create RAG instance
-rag = SQLiteRag.create("./database.sqlite")
+## Development
 
-# Add documents
-rag.add("/path/to/documents", recursive=True)
-rag.add_text("Raw text content", uri="doc.txt")
+### Installation
 
-# Search
-results = rag.search("search query", top_k=5)
-for result in results:
-    print(f"Score: {result.score}")
-    print(f"Content: {result.content}")
-    print(f"URI: {result.uri}")
+For development, clone the repository and install with development dependencies:
 
-# List documents
-documents = rag.list_documents()
+```bash
+# Clone the repository
+git clone https://github.com/sqliteai/sqlite-rag.git
+cd sqlite-rag
 
-# Remove document
-rag.remove_document("document-id-or-path")
+# Create virtual environment
+python -m venv .venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
 
-# Database operations
-rag.rebuild(remove_missing=True)
-rag.reset()
+# Install in development mode
+pip install -e ".[dev]"
 ```
-
-## Supported File Formats
-
-SQLite-RAG supports 25+ file formats through the MarkItDown library:
-
-- **Text**: `.txt`, `.md`, `.csv`, `.json`, `.xml`
-- **Documents**: `.pdf`, `.docx`, `.pptx`, `.xlsx`
-- **Code**: `.py`, `.js`, `.html`, `.css`, `.sql`
-- **And many more**: `.rtf`, `.odt`, `.epub`, `.zip`, etc.
-
 ## How It Works
 
 1. **Document Processing**: Files are processed and split into overlapping chunks
 2. **Embedding Generation**: Text chunks are converted to vector embeddings using AI models
 3. **Dual Indexing**: Content is indexed for both vector similarity and full-text search
 4. **Hybrid Search**: Queries are processed through both search methods
 5. **Result Fusion**: Results are combined using Reciprocal Rank Fusion for optimal relevance
-
-## Default Configuration
-
-- **Model**: Qwen3-Embedding-0.6B (Q8_0 quantized, 1024 dimensions)
-- **Chunking**: 12,000 tokens per chunk with 1,200 token overlap
-- **Vectors**: FLOAT16 storage with cosine similarity
-- **Search**: Equal weighting (1.0) for vector and full-text results
-- **Database**: `./sqliterag.sqlite`
-
-## Extensions Required
-
-SQLite-RAG requires these SQLite extensions:
-
-- **[sqlite-ai](https://github.com/sqliteai/sqlite-ai)**: LLM model loading and embedding generation
-- **[sqlite-vector](https://github.com/sqliteai/sqlite-vector)**: Vector storage and similarity search
-
-These are automatically installed as dependencies.
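
The result-fusion step listed under "How It Works" can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is a generic illustration of the technique, not the code SQLite RAG runs: the function name `rrf_fuse`, the smoothing constant `k=60`, the per-list weights, and the sample document ids are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    weights:  optional per-list weights (hypothetical; defaults to 1.0 each).
    k:        smoothing constant; 60 is a commonly used value, assumed here.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every id it contains.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id to keep output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical vector-search order
fts_hits = ["doc_b", "doc_d", "doc_a"]     # hypothetical full-text order
fused = rrf_fuse([vector_hits, fts_hits])
```

A document that appears near the top of both lists (`doc_b` here) outranks one that is first in only a single list, which is the behaviour that makes RRF useful for combining scores that are not directly comparable.
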
diff --git a/model_evaluation/README.md b/model_evaluation/README.md
@@ -1,5 +1,51 @@
-  # 1. Process dataset
-  python test_ms_marco.py process --config example_config.json --limit-rows 100
+# Model Evaluation Python Script
 
-  # 2. Evaluate (saves to example_config_evaluation_results.txt)
-  python test_ms_marco.py evaluate --config example_config.json --limit-rows 100
+A simple evaluation script for SQLite RAG using the MS MARCO dataset. Compares performance against the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) benchmarks.
+
+## MS MARCO Dataset
+
+**MS MARCO**: A Microsoft question-answering dataset with real web queries and passages.
+
+## Evaluation Metrics
+
+- **Hit Rate (HR@k)**: Percentage of queries with at least one relevant result in the top k
+- **MRR**: Mean Reciprocal Rank - average of 1/rank of the first relevant result, so earlier hits score higher
+- **NDCG**: Normalized Discounted Cumulative Gain - ranking quality metric
+
+## Usage
+
+### 1. Setup Configuration
+
+Create an example config file and then edit it with your model settings:
+
+```bash
+python ms_marco.py create-config
+```
+
+### 2. Process Dataset
+
+```bash
+python ms_marco.py process --config configs/my_config.json --limit-rows 100
+```
+
+Processes MS MARCO passages into the SQLite RAG database for evaluation.
+
+### 3. Evaluate Performance
+
+```bash
+python ms_marco.py evaluate --config configs/my_config.json --limit-rows 100
+```
+
+Runs evaluation and saves results to `results/my_config_evaluation_results.txt`.
+
+> **Note**: Without adequate hardware, processing and evaluating the entire dataset can take a long time.
+> Use `--limit-rows` to process and evaluate only the first n rows.
+
+## Example Results
+
+```
+Metric               @1         @3         @5         @10
+Hit Rate            0.650      0.780      0.820      0.850
+MRR                 0.650      0.710      0.720      0.725
+NDCG                0.650      0.715      0.735      0.750
+```
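
For readers unfamiliar with the three metrics in the README above, the sketch below shows one common way to compute them for binary relevance. It is not taken from `ms_marco.py`: the function names, the binary-relevance assumption, and the toy passage ids are all hypothetical.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant id appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked, relevant, k):
    """1/rank of the first relevant id within the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG with binary gains: each relevant hit is discounted by log2(rank + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy runs: (ranked passage ids, set of relevant ids) per query.
runs = [
    (["p3", "p1", "p7"], {"p1"}),  # relevant passage at rank 2
    (["p4", "p5", "p6"], {"p9"}),  # no relevant passage retrieved
    (["p2", "p8", "p0"], {"p2"}),  # relevant passage at rank 1
]
k = 3
hr = mean(hit_rate_at_k(r, rel, k) for r, rel in runs)
mrr = mean(reciprocal_rank_at_k(r, rel, k) for r, rel in runs)
ndcg = mean(ndcg_at_k(r, rel, k) for r, rel in runs)
```

With these definitions, Hit Rate is never lower than MRR or NDCG at the same k, which matches the ordering in the example results table.
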
diff --git a/model_evaluation/ms_marco.py b/model_evaluation/ms_marco.py
@@ -486,8 +486,8 @@ def main():
     )
     parser.add_argument(
         "action",
-        choices=["process", "evaluate"],
-        help="Action to perform: 'process' to add passages to database, 'evaluate' to test search quality",
+        choices=["process", "evaluate", "create-config"],
+        help="Action to perform: 'process' to add passages to database, 'evaluate' to test search quality, 'create-config' to generate example configuration",
     )
     parser.add_argument(
         "--limit-rows",
@@ -498,16 +498,23 @@ def main():
     parser.add_argument(
         "--config",
         type=str,
-        required=True,
+        required=False,
         help="JSON configuration file with RAG settings and database path",
     )
 
     args = parser.parse_args()
 
+    if args.action == "create-config":
+        print("Creating example configuration file...")
+        config_file = create_example_config()
+        print(f"Configuration file created: {config_file}")
+        print("Edit the file with your settings and then run process/evaluate.")
+        return
+
+    # Config is required for process and evaluate actions
     if args.config is None:
-        print("Missing config file. Creating example config...")
-        create_example_config()
-        print("Please edit ms_marco_config.json with your settings and try again.")
+        print("Error: --config is required for process and evaluate actions")
+        print("Use 'create-config' action to generate an example configuration file")
         return
 
     try:
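
The patch reports a missing `--config` by printing a message and returning, which leaves the process exit status at zero. A possible alternative, sketched below on the assumption that a non-zero exit status is wanted for scripting, is to route the check through `ArgumentParser.error`. The helper names `build_parser` and `parse_args` are hypothetical and the parser is reduced to the two arguments relevant here.

```python
import argparse


def build_parser():
    """Reduced parser with only the arguments relevant to this check (sketch)."""
    parser = argparse.ArgumentParser(description="MS MARCO evaluation (sketch)")
    parser.add_argument("action", choices=["process", "evaluate", "create-config"])
    parser.add_argument("--config", type=str, required=False)
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.action != "create-config" and args.config is None:
        # parser.error prints usage and the message to stderr, then exits with status 2.
        parser.error("--config is required for process and evaluate actions")
    return args
```

For example, `parse_args(["create-config"])` returns normally, while `parse_args(["evaluate"])` raises `SystemExit` with code 2.
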