ramartinez7/semantic-search

🔍 File Semantic Search Library

A standalone file semantic search library with Azure OpenAI integration, featuring intelligent file indexing, vector similarity search, and AI-powered reranking.

✨ Features

  • Setup Wizard: Interactive guided setup with rich terminal interface
  • Azure OpenAI Integration: Text embeddings + GPT-4 reranking
  • SQLite Storage: Persistent vector database with efficient retrieval
  • Smart File Indexing: Automatic file discovery, summarization, and embedding
  • File Semantic Search: Intent-aware search across your documents that goes beyond keyword matching
  • Flexible Auth: API key or Azure AD authentication
  • MCP Server: Model Context Protocol integration for AI assistants
  • Performance Optimized: sqlite-vec integration for fast vector search
  • Customizable: Configurable prompts, thresholds, and processing options
  • Comprehensive: Detailed status reporting and database statistics

🚀 Quick Start

1. Initialize (First Time Setup)

npx semsearch init

The setup wizard will guide you through:

  • Azure OpenAI endpoint configuration
  • Authentication method (API key vs Azure AD)
  • Model deployment selection
  • Database location setup
  • Connection testing

2. Index Your Document Files

npx semsearch index ./my-documents

3. Search Your Files

npx semsearch search "machine learning algorithms"

📋 Commands

| Command | Description |
| --- | --- |
| init | Run setup wizard (first time) |
| status | Show configuration and database status |
| index <path> | Index files for search |
| search <query> | Search indexed files |
| info <id> | Get detailed file information |
| reset | Reset configuration and/or database |
| prompts | Manage AI prompts for summarization and reranking |
| test-connection | Test Azure OpenAI connectivity |
| mcp | Start MCP server for AI assistant integration |

Command Options

Global Options

  • --endpoint <url> - Azure OpenAI endpoint URL
  • --api-key <key> - Azure OpenAI API key (if not using managed identity)
  • --embedding-model <model> - Azure OpenAI embedding deployment name (default: text-embedding-ada-002)
  • --llm-model <model> - Azure OpenAI LLM deployment name for reranking (default: gpt-4.1-mini)
  • -v, --verbose - Show detailed progress information

status [options]

  • --db <path> - SQLite database path

reset [options]

  • --force - Skip confirmation prompt
  • --db-only - Reset only the database, keep configuration

index <path> [options]

  • --db <path> - SQLite database path
  • --maxChars <n> - Max characters to read per file (default: 50000)
  • --force - Reprocess all files even if already indexed
  • -c, --concurrency <n> - Number of files to process in parallel (default: 3)
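
The interaction between --force and --maxChars can be pictured with a small sketch. This is not the library's code: `selectFilesToProcess` and `truncateForIndexing` are hypothetical helpers, and `indexedPaths` stands in for whatever the database uses to remember indexed files.

```typescript
// Hypothetical helper: choose which discovered files an indexing run would process.
// Assumption: the database can report the set of paths it already holds.
function selectFilesToProcess(
  discovered: string[],
  indexedPaths: ReadonlySet<string>,
  force: boolean
): string[] {
  // With --force, every discovered file is reprocessed.
  if (force) return [...discovered];
  // Otherwise only files that are not yet in the database are processed.
  return discovered.filter((path) => !indexedPaths.has(path));
}

// Hypothetical helper: apply the --maxChars limit (documented default: 50000).
function truncateForIndexing(content: string, maxChars = 50000): string {
  return content.length > maxChars ? content.slice(0, maxChars) : content;
}
```

For example, with `a.md` already indexed, a plain run over `['a.md', 'b.md']` would select only `b.md`, while a --force run would select both.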

search <query> [options]

  • --db <path> - SQLite database path
  • --topK <n> - Maximum number of results to return (default: 5)
  • --min-similarity <n> - Minimum cosine similarity threshold 0.0-1.0, filters vector matches (default: 0.1)
  • --min-score <n> - Minimum final score threshold 0-100, filters reranked results (default: 0)
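
The two thresholds act at different stages: --min-similarity filters vector matches, and --min-score filters reranked results. The sketch below illustrates that ordering under stated assumptions. `searchSketch` and its types are hypothetical, the `rerank` callback stands in for the LLM reranking step, and the point at which topK is applied is an assumption rather than documented behavior.

```typescript
interface Candidate {
  id: string;
  embedding: number[];
}

interface SearchOptions {
  topK?: number;          // documented default: 5
  minSimilarity?: number; // documented default: 0.1 (cosine similarity, 0.0-1.0)
  minScore?: number;      // documented default: 0 (final score, 0-100)
}

interface Result {
  id: string;
  similarity: number;
  score: number;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical pipeline: similarity filter, rerank, score filter, sort, topK.
function searchSketch(
  queryEmbedding: number[],
  candidates: Candidate[],
  rerank: (id: string, similarity: number) => number,
  options: SearchOptions = {}
): Result[] {
  const { topK = 5, minSimilarity = 0.1, minScore = 0 } = options;
  return candidates
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter((c) => c.similarity >= minSimilarity)
    .map((c) => ({ ...c, score: rerank(c.id, c.similarity) }))
    .filter((r) => r.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Filtering on similarity first keeps weak vector matches away from the more expensive reranking step, which is one plausible reason to expose the two thresholds separately.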

info <id> [options]

  • --db <path> - SQLite database path

prompts [options]

  • --edit - Edit prompts interactively
  • --reset - Reset prompts to defaults

test-connection [options]

  • --endpoint <url> - Azure OpenAI endpoint URL
  • --api-key <key> - Azure OpenAI API key

Example Usage

# Basic commands
npx semsearch init                              # Interactive setup
npx semsearch index ./documents                 # Index documents
npx semsearch search "machine learning"         # Basic search

# Advanced indexing options
npx semsearch index ./docs --maxChars 100000 --concurrency 5 --force

# Advanced search with filtering
npx semsearch search "database design" --topK 10 --min-similarity 0.7 --min-score 80

# Use custom database location
npx semsearch search "query" --db ./custom/index.db

# Reset database only (keep configuration)
npx semsearch reset --db-only --force

# Edit AI prompts
npx semsearch prompts --edit

# Test connection with specific credentials
npx semsearch test-connection --endpoint https://your-endpoint.com --api-key your-key

🤖 MCP Server Integration

The MCP (Model Context Protocol) server allows AI assistants to access your semantic search functionality directly.

Start MCP Server

npx semsearch mcp
# OR use the direct binary
npx semsearch-mcp

This starts a JSON-RPC server using stdio transport that provides:

🛠️ Tools

  • search: Perform semantic search across indexed files
    • Parameters: query (string), topK (number, default: 5), minScore (number, default: 0)
  • get_stats: Get database statistics (file count, size, etc.)
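
For orientation, a client call to the search tool might look like the sketch below. The `tools/call` method and the `name`/`arguments` parameter layout come from the Model Context Protocol specification, and the argument names are the ones listed above. The `buildSearchCall` helper is hypothetical, and the shape of the server's response is not documented here, so it is left out.

```typescript
interface JsonRpcRequest {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Hypothetical helper: build an MCP tools/call request for the `search` tool.
function buildSearchCall(id: number, query: string, topK = 5, minScore = 0): JsonRpcRequest {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: {
      name: 'search',
      arguments: { query, topK, minScore },
    },
  };
}

// Over the stdio transport, each message is written as one line of JSON.
const requestLine = JSON.stringify(buildSearchCall(1, 'database design')) + '\n';
```

In practice an MCP client library handles this framing, so the sketch is mainly useful for debugging the server by hand.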

📚 Resources

  • file://<id>: Access individual indexed files by their UUID
  • Dynamic resource list based on your indexed files

AI Assistant Integration

Configure your AI assistant (Claude Desktop, VS Code Copilot, etc.) to connect to the MCP server:

{
  "mcpServers": {
    "semsearch": {
      "command": "npx",
      "args": ["semsearch@latest", "mcp"],
      "type": "stdio"
    }
  }
}

Note: The MCP server automatically uses your configured database and settings from the default locations. No environment variables needed!

The AI assistant can then:

  • Search your files semantically
  • Retrieve specific files by ID
  • Get database statistics
  • Access your knowledge base contextually

🔧 Configuration

The setup wizard creates two configuration files, .env and .semsearch.json, both described below.

Supported Endpoints

The library supports two Azure endpoint types:

  • Azure OpenAI: https://your-resource.openai.azure.com/
  • Azure AI Foundry: https://your-resource.cognitiveservices.azure.com/
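
The two endpoint types can be told apart by hostname suffix. The following is a minimal sketch of such a check, not the library's own validation: `classifyEndpoint` and the `EndpointKind` labels are hypothetical names.

```typescript
type EndpointKind = 'azure-openai' | 'azure-ai-foundry' | 'unknown';

// Hypothetical helper: classify an endpoint URL by its hostname suffix.
function classifyEndpoint(endpoint: string): EndpointKind {
  // Extract the host part of an https URL without relying on runtime-specific APIs.
  const match = /^https:\/\/([^/:?#]+)/i.exec(endpoint.trim());
  const host = match?.[1]?.toLowerCase();
  if (!host) return 'unknown';
  if (host.endsWith('.openai.azure.com')) return 'azure-openai';
  if (host.endsWith('.cognitiveservices.azure.com')) return 'azure-ai-foundry';
  return 'unknown';
}
```

Matching on the host rather than the whole string avoids misclassifying URLs that merely contain one of the suffixes in their path.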

.env (Environment Variables)

# Use one of the following endpoint formats:
# - Azure OpenAI: https://your-resource.openai.azure.com/
# - Azure AI Foundry: https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_EMBED_DEPLOYMENT=text-embedding-ada-002
AZURE_OPENAI_RERANK_DEPLOYMENT=gpt-4.1-mini
SEMSEARCH_DEFAULT_DB=./.data/index.db
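
A typed view of these variables can make missing settings fail early. The sketch below is an illustration only: `loadEnv` and `SemsearchEnv` are hypothetical, the deployment fallbacks reuse the documented CLI defaults, and falling back to ./.data/index.db for the database path is an assumption based on the example above.

```typescript
interface SemsearchEnv {
  endpoint: string;
  apiKey: string | undefined; // absent when Azure AD authentication is used
  embedDeployment: string;
  rerankDeployment: string;
  defaultDb: string;
}

// Hypothetical loader: reads the variables shown above from a plain key/value map.
function loadEnv(env: Record<string, string | undefined>): SemsearchEnv {
  const endpoint = env['AZURE_OPENAI_ENDPOINT'];
  if (!endpoint) {
    throw new Error('AZURE_OPENAI_ENDPOINT is required');
  }
  return {
    endpoint,
    apiKey: env['AZURE_OPENAI_API_KEY'],
    embedDeployment: env['AZURE_OPENAI_EMBED_DEPLOYMENT'] ?? 'text-embedding-ada-002',
    rerankDeployment: env['AZURE_OPENAI_RERANK_DEPLOYMENT'] ?? 'gpt-4.1-mini',
    defaultDb: env['SEMSEARCH_DEFAULT_DB'] ?? './.data/index.db', // assumed fallback
  };
}
```

In a Node.js program the map would typically be `process.env`, after the .env file has been loaded by whatever mechanism the application uses.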

.semsearch.json (Configuration Metadata)

{
  "version": "0.3.0",
  "setupDate": "2025-09-14T08:24:01.945Z",
  "authentication": "api-key",
  "endpoints": {
    "azure": "https://your-resource.openai.azure.com/"
  },
  "deployments": {
    "embedding": "text-embedding-ada-002",
    "rerank": "gpt-4.1-mini"
  },
  "storage": {
    "defaultDatabase": "./.data/index.db"
  }
}
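
The metadata file above maps naturally onto a TypeScript type. The sketch below mirrors only the fields shown in the example. `SemsearchMetadata`, `requireString`, and `parseMetadata` are hypothetical names, and since "api-key" is the only authentication value shown, the field is typed as a plain string.

```typescript
interface SemsearchMetadata {
  version: string;
  setupDate: string; // ISO 8601 timestamp
  authentication: string;
  endpoints: { azure: string };
  deployments: { embedding: string; rerank: string };
  storage: { defaultDatabase: string };
}

// Walk a nested object along `path` and return the string found there, or throw.
function requireString(obj: unknown, path: string[]): string {
  let current: unknown = obj;
  for (const key of path) {
    if (typeof current !== 'object' || current === null || !(key in current)) {
      throw new Error(`missing field: ${path.join('.')}`);
    }
    current = (current as Record<string, unknown>)[key];
  }
  if (typeof current !== 'string') {
    throw new Error(`field is not a string: ${path.join('.')}`);
  }
  return current;
}

// Hypothetical parser: validates the fields shown in the example file.
function parseMetadata(json: string): SemsearchMetadata {
  const raw: unknown = JSON.parse(json);
  return {
    version: requireString(raw, ['version']),
    setupDate: requireString(raw, ['setupDate']),
    authentication: requireString(raw, ['authentication']),
    endpoints: { azure: requireString(raw, ['endpoints', 'azure']) },
    deployments: {
      embedding: requireString(raw, ['deployments', 'embedding']),
      rerank: requireString(raw, ['deployments', 'rerank']),
    },
    storage: { defaultDatabase: requireString(raw, ['storage', 'defaultDatabase']) },
  };
}
```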

Note: For Azure AI Foundry, use https://your-resource.cognitiveservices.azure.com/ as the endpoint.

🎨 CLI Examples

Status Check

$ npx semsearch status
📊 Semantic Search Status

Configuration:
  Endpoint: https://your-resource.openai.azure.com/
  Auth: API Key
  Embedding: text-embedding-ada-002
  Rerank: gpt-4.1-mini

Database:
  Path: ./.data/index.db
  Size: 60.0 KB
  Modified: 2025-09-14T08:25:13.398Z
  Documents: 5

Config Files:
  .env: ✓
  .semsearch.json: ✓

Search Results

$ npx semsearch search "database design"
🔍 Searching for: database design

Found 5 results:

1. 100.0 database-design.md (eded72e8...)
   The text outlines fundamental database design principles, including 
   normalization, data integrity, performance optimization, and security...

2. 15.0 semantic-search.md (a9408b5e...)
   The text discusses semantic search technology, which leverages natural
   language processing and machine learning...

# Search with minimum similarity threshold
$ npx semsearch search "machine learning" --min-similarity 0.7
🔍 Searching for: machine learning
   Minimum cosine similarity: 0.70

Found 2 results:

# Search with minimum final score threshold  
$ npx semsearch search "database" --min-score 80
🔍 Searching for: database
   Minimum final score: 80

Found 1 result:

🏗️ Architecture

  • Azure OpenAI: Text embeddings (text-embedding-ada-002) + reranking (gpt-4.1-mini)
  • SQLite Database: Vector storage with sqlite-vec extension and WAL mode
  • TypeScript: Fully typed with strict compilation
  • CLI Framework: Commander.js with inquirer, chalk, ora, boxen for UX
  • Binary Entry Points:
    • semsearch - Main CLI interface
    • semsearch-mcp - Direct MCP server launcher
  • MCP Integration: JSON-RPC over stdio transport for AI assistant integration

🔐 Authentication Options

API Key

  • Simple setup through init wizard
  • Stored securely in .env file
  • Immediate access without additional configuration

Azure AD

  • Uses DefaultAzureCredential
  • Supports managed identities
  • Enhanced security for enterprise environments

📊 Performance

  • Embedding Model: text-embedding-ada-002 (1536 dimensions)
  • Search Strategy: Vector similarity + AI reranking
  • Storage: SQLite with sqlite-vec extension for optimized vector operations
  • Retrieval: Sub-linear vector search with fallback to brute-force when needed
  • Optimization: Dynamic vector table creation with dimension detection
  • Concurrency: Configurable parallel processing for indexing operations

💻 Library Usage

import { SemanticStore } from 'semantic-search';
import { DefaultAzureCredential } from '@azure/identity';

const store = new SemanticStore({
  azure: {
    endpoint: process.env.AZURE_OPENAI_ENDPOINT || 'https://your-resource.openai.azure.com/',
    embeddingDeployment: process.env.AZURE_OPENAI_EMBED_DEPLOYMENT || 'text-embedding-ada-002',
    rerankDeployment: process.env.AZURE_OPENAI_RERANK_DEPLOYMENT || 'gpt-4.1-mini'
  },
  credential: new DefaultAzureCredential(),
  sqlite: { path: './.data/index.db' },
  summarizer: { maxChars: 50000 }
});

await store.indexPath('./docs');  // Index all files in ./docs directory
const results = await store.search('How to configure X?', { topK: 10 });

🛠️ Development

Build from Source

npm install
npm run build

Project Structure

src/
├── cli.ts                  # Command-line interface with all commands
├── init-wizard.ts          # Setup wizard with interactive UX
├── store.ts                # Main SemanticStore class
├── azure.ts                # Azure OpenAI client management
├── sqlite.ts               # SQLite vector database with sqlite-vec
├── nlp.ts                  # NLP operations (embed, summarize, rerank)
├── mcp-server-simple.ts    # MCP server implementation (JSON-RPC over stdio)
├── config-manager.ts       # Configuration file management
├── indexing-progress.ts    # Progress tracking for indexing operations
├── prompts-manager.ts      # AI prompt management and customization
├── types.ts                # TypeScript type definitions
└── utils.ts                # Utility functions

📝 Notes

  • Embeddings are normalized Float32 arrays stored as SQLite BLOBs
  • Uses sqlite-vec extension for optimized vector search with brute-force fallback
  • Indexing skips files that are already in the database; there is no automatic change detection
  • Use --force flag to reprocess all files regardless of database state
  • WAL mode enabled for better SQLite performance
  • Dual storage: main files table + virtual vec_files table for vector operations
  • Supports text files with automatic MIME type detection
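
To make the storage note concrete, here is a minimal sketch of normalizing an embedding and round-tripping it through a byte array, which is the form a SQLite BLOB column takes. The helper names are hypothetical and the library's actual serialization may differ, for example in byte order handling.

```typescript
// Scale a vector to unit length and store it as 32-bit floats.
function normalize(values: number[]): Float32Array {
  let sumSquares = 0;
  for (const v of values) sumSquares += v * v;
  const norm = Math.sqrt(sumSquares);
  const out = new Float32Array(values.length);
  if (norm === 0) return out; // zero vector stays zero
  for (let i = 0; i < values.length; i++) out[i] = values[i] / norm;
  return out;
}

// Copy the raw bytes of a Float32Array (4 bytes per dimension, platform byte order).
function toBlob(vector: Float32Array): Uint8Array {
  return new Uint8Array(vector.buffer, vector.byteOffset, vector.byteLength).slice();
}

// Rebuild a Float32Array from BLOB bytes.
function fromBlob(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) throw new Error('BLOB length is not a multiple of 4');
  const copy = blob.slice(); // fresh buffer, so the view below is 4-byte aligned
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}
```

With unit-length vectors, cosine similarity reduces to a plain dot product, and a 1536-dimension embedding occupies 6144 bytes per row.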

📄 License

MIT
