A standalone file semantic search library with Azure OpenAI integration, featuring intelligent file indexing, vector similarity search, and AI-powered reranking.
- Setup Wizard: Interactive guided setup with rich terminal interface
- Azure OpenAI Integration: Text embeddings + GPT-4 reranking
- SQLite Storage: Persistent vector database with efficient retrieval
- Smart File Indexing: Automatic file discovery, summarization, and embedding
- File Semantic Search: Intent-aware search across your documents that goes beyond keyword matching
- Flexible Auth: API key or Azure AD authentication
- MCP Server: Model Context Protocol integration for AI assistants
- Performance Optimized: sqlite-vec integration for fast vector search
- Customizable: Configurable prompts, thresholds, and processing options
- Comprehensive: Detailed status reporting and database statistics
```bash
npx semsearch init
```

The setup wizard will guide you through:
- Azure OpenAI endpoint configuration
- Authentication method (API key vs Azure AD)
- Model deployment selection
- Database location setup
- Connection testing
```bash
npx semsearch index ./my-documents
npx semsearch search "machine learning algorithms"
```

| Command | Description |
|---|---|
| `init` | Run setup wizard (first time) |
| `status` | Show configuration and database status |
| `index <path>` | Index files for search |
| `search <query>` | Search indexed files |
| `info <id>` | Get detailed file information |
| `reset` | Reset configuration and/or database |
| `prompts` | Manage AI prompts for summarization and reranking |
| `test-connection` | Test Azure OpenAI connectivity |
| `mcp` | Start MCP server for AI assistant integration |

Global options:

- `--endpoint <url>` - Azure OpenAI endpoint URL
- `--api-key <key>` - Azure OpenAI API key (if not using managed identity)
- `--embedding-model <model>` - Azure OpenAI embedding deployment name (default: `text-embedding-ada-002`)
- `--llm-model <model>` - Azure OpenAI LLM deployment name for reranking (default: `gpt-4.1-mini`)
- `-v, --verbose` - Show detailed progress information

`status` options:

- `--db <path>` - SQLite database path

`reset` options:

- `--force` - Skip confirmation prompt
- `--db-only` - Reset only the database, keep configuration

`index` options:

- `--db <path>` - SQLite database path
- `--maxChars <n>` - Max characters to read per file (default: 50000)
- `--force` - Reprocess all files even if already indexed
- `-c, --concurrency <n>` - Number of files to process in parallel (default: 3)

`search` options:

- `--db <path>` - SQLite database path
- `--topK <n>` - Maximum number of results to return (default: 5)
- `--min-similarity <n>` - Minimum cosine similarity threshold (0.0-1.0); filters vector matches (default: 0.1)
- `--min-score <n>` - Minimum final score threshold (0-100); filters reranked results (default: 0)

`info` options:

- `--db <path>` - SQLite database path

`prompts` options:

- `--edit` - Edit prompts interactively
- `--reset` - Reset prompts to defaults

`test-connection` options:

- `--endpoint <url>` - Azure OpenAI endpoint URL
- `--api-key <key>` - Azure OpenAI API key

```bash
# Basic commands
npx semsearch init # Interactive setup
npx semsearch index ./documents # Index documents
npx semsearch search "machine learning" # Basic search
# Advanced indexing options
npx semsearch index ./docs --maxChars 100000 --concurrency 5 --force
# Advanced search with filtering
npx semsearch search "database design" --topK 10 --min-similarity 0.7 --min-score 80
# Use custom database location
npx semsearch search "query" --db ./custom/index.db
# Reset database only (keep configuration)
npx semsearch reset --db-only --force
# Edit AI prompts
npx semsearch prompts --edit
# Test connection with specific credentials
npx semsearch test-connection --endpoint https://your-endpoint.com --api-key your-key
```

The MCP (Model Context Protocol) server allows AI assistants to access your semantic search functionality directly.

```bash
npx semsearch mcp
# OR use the direct binary
npx semsearch-mcp
```

This starts a JSON-RPC server using stdio transport that provides:

Tools:

- `search`: Perform semantic search across indexed files (see the example request below)
  - Parameters: `query` (string), `topK` (number, default: 5), `minScore` (number, default: 0)
- `get_stats`: Get database statistics (file count, size, etc.)

Resources:

- `file://<id>`: Access individual indexed files by their UUID
- Dynamic resource list based on your indexed files
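
The sketch below shows what a raw tool call looks like on that transport. It is illustrative only: it assumes the server accepts the standard MCP `initialize`/`tools/call` message shapes, and the client name, protocol version, and query values are placeholders.

```typescript
// Illustrative sketch: drive the MCP server over stdio with newline-delimited
// JSON-RPC. A real MCP client (Claude Desktop, VS Code, an MCP SDK) performs
// this handshake and framing for you.
import { spawn } from 'node:child_process';

const server = spawn('npx', ['semsearch', 'mcp'], { stdio: ['pipe', 'pipe', 'inherit'] });
const send = (msg: object) => server.stdin!.write(JSON.stringify(msg) + '\n');

// Standard MCP handshake (a careful client waits for the initialize response
// before sending the initialized notification).
send({
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2024-11-05',
    capabilities: {},
    clientInfo: { name: 'example-client', version: '0.0.1' },
  },
});
send({ jsonrpc: '2.0', method: 'notifications/initialized' });

// Call the `search` tool with the parameters listed above.
send({
  jsonrpc: '2.0',
  id: 2,
  method: 'tools/call',
  params: { name: 'search', arguments: { query: 'machine learning algorithms', topK: 5, minScore: 0 } },
});

server.stdout!.on('data', (chunk) => process.stdout.write(chunk));
```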
Configure your AI assistant (Claude Desktop, VS Code Copilot, etc.) to connect to the MCP server:
```json
{
"mcpServers": {
"semsearch": {
"command": "npx",
"args": ["semsearch@latest", "mcp"],
"type": "stdio"
}
}
}
```

Note: The MCP server automatically uses your configured database and settings from the default locations. No environment variables needed!
The AI assistant can then:
- Search your files semantically
- Retrieve specific files by ID
- Get database statistics
- Access your knowledge base contextually
The setup wizard creates two configuration files:
The library supports two Azure endpoint types:
- Azure OpenAI: `https://your-resource.openai.azure.com/`
- Azure AI Foundry: `https://your-resource.cognitiveservices.azure.com/`

`.env`:

```bash
# Use one of the following endpoint formats:
# - Azure OpenAI: https://your-resource.openai.azure.com/
# - Azure AI Foundry: https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_EMBED_DEPLOYMENT=text-embedding-ada-002
AZURE_OPENAI_RERANK_DEPLOYMENT=gpt-4.1-mini
SEMSEARCH_DEFAULT_DB=./.data/index.db
```

`.semsearch.json`:

```json
{
"version": "0.3.0",
"setupDate": "2025-09-14T08:24:01.945Z",
"authentication": "api-key",
"endpoints": {
"azure": "https://your-resource.openai.azure.com/"
},
"deployments": {
"embedding": "text-embedding-ada-002",
"rerank": "gpt-4.1-mini"
},
"storage": {
"defaultDatabase": "./.data/index.db"
}
}
```

Note: For Azure AI Foundry, use `https://your-resource.cognitiveservices.azure.com/` as the endpoint.

```
$ npx semsearch status
🔍 Semantic Search Status
Configuration:
Endpoint: https://your-resource.openai.azure.com/
Auth: API Key
Embedding: text-embedding-ada-002
Rerank: gpt-4.1-mini
Database:
Path: ./.data/index.db
Size: 60.0 KB
Modified: 2025-09-14T08:25:13.398Z
Documents: 5
Config Files:
.env: ✅
.semsearch.json: ✅
```

```
$ npx semsearch search "database design"
🔍 Searching for: database design
Found 5 results:
1. 100.0 database-design.md (eded72e8...)
The text outlines fundamental database design principles, including
normalization, data integrity, performance optimization, and security...
2. 15.0 semantic-search.md (a9408b5e...)
The text discusses semantic search technology, which leverages natural
language processing and machine learning...
# Search with minimum similarity threshold
$ npx semsearch search "machine learning" --min-similarity 0.7
🔍 Searching for: machine learning
Minimum cosine similarity: 0.70
Found 2 results:
# Search with minimum final score threshold
$ npx semsearch search "database" --min-score 80
🔍 Searching for: database
Minimum final score: 80
Found 1 result:
```

Architecture:

- Azure OpenAI: Text embeddings (`text-embedding-ada-002`) + reranking (`gpt-4.1-mini`)
- SQLite Database: Vector storage with sqlite-vec extension and WAL mode
- TypeScript: Fully typed with strict compilation
- CLI Framework: Commander.js with inquirer, chalk, ora, boxen for UX
- Binary Entry Points:
  - `semsearch` - Main CLI interface
  - `semsearch-mcp` - Direct MCP server launcher
- MCP Integration: JSON-RPC over stdio transport for AI assistant integration

API key authentication:

- Simple setup through the init wizard
- Stored securely in the `.env` file
- Immediate access without additional configuration

Azure AD authentication:

- Uses DefaultAzureCredential
- Supports managed identities (see the sketch below)
- Enhanced security for enterprise environments
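
If you run on Azure with a user-assigned managed identity, `DefaultAzureCredential` can be pointed at it explicitly. This is a minimal sketch using only the `@azure/identity` API; the `AZURE_CLIENT_ID` variable is a placeholder, and the resulting credential is passed to the library exactly as in the usage example further below.

```typescript
// Sketch: DefaultAzureCredential tries environment credentials, managed
// identity, Azure CLI, and other sources in order. For a user-assigned
// managed identity, pass its client ID explicitly.
import { DefaultAzureCredential } from '@azure/identity';

const credential = new DefaultAzureCredential({
  managedIdentityClientId: process.env.AZURE_CLIENT_ID, // omit for system-assigned identities
});
```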
How it works:

- Embedding Model: `text-embedding-ada-002` (1536 dimensions)
- Search Strategy: Vector similarity + AI reranking (see the filtering sketch after this list)
- Storage: SQLite with sqlite-vec extension for optimized vector operations
- Retrieval: Sub-linear vector search with fallback to brute-force when needed
- Optimization: Dynamic vector table creation with dimension detection
- Concurrency: Configurable parallel processing for indexing operations
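
As a conceptual sketch of the two-stage filtering above (not the library's internal code): cosine similarity gates the vector candidates via `--min-similarity`, and the LLM rerank score gates the final results via `--min-score`. The `Candidate` type and `rerank` callback are placeholders.

```typescript
// Conceptual sketch of the search pipeline; types and the rerank() helper are
// placeholders rather than the library's API.
type Candidate = { id: string; embedding: Float32Array };

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function searchSketch(
  queryEmbedding: Float32Array,
  candidates: Candidate[],
  rerank: (id: string) => Promise<number>, // returns a 0-100 relevance score from the LLM
  opts = { topK: 5, minSimilarity: 0.1, minScore: 0 },
) {
  // Stage 1: vector similarity filter (--min-similarity), keep the best topK
  const vectorMatches = candidates
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(queryEmbedding, c.embedding) }))
    .filter((m) => m.similarity >= opts.minSimilarity)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, opts.topK);

  // Stage 2: AI rerank filter (--min-score)
  const reranked = await Promise.all(
    vectorMatches.map(async (m) => ({ ...m, score: await rerank(m.id) })),
  );
  return reranked.filter((r) => r.score >= opts.minScore).sort((a, b) => b.score - a.score);
}
```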
Library usage:

```typescript
import { SemanticStore } from 'semantic-search';
import { DefaultAzureCredential } from '@azure/identity';
const store = new SemanticStore({
azure: {
endpoint: process.env.AZURE_OPENAI_ENDPOINT || 'https://your-resource.openai.azure.com/',
embeddingDeployment: process.env.AZURE_OPENAI_EMBED_DEPLOYMENT || 'text-embedding-ada-002',
rerankDeployment: process.env.AZURE_OPENAI_RERANK_DEPLOYMENT || 'gpt-4.1-mini'
},
credential: new DefaultAzureCredential(),
sqlite: { path: './.data/index.db' },
summarizer: { maxChars: 50000 }
});
await store.indexPath('./docs'); // Index all files in ./docs directory
const results = await store.search('How to configure X?', { topK: 10 });
```

Development:

```bash
npm install
npm run build
```

Project structure:

```
src/
├── cli.ts                  # Command-line interface with all commands
├── init-wizard.ts          # Setup wizard with interactive UX
├── store.ts                # Main SemanticStore class
├── azure.ts                # Azure OpenAI client management
├── sqlite.ts               # SQLite vector database with sqlite-vec
├── nlp.ts                  # NLP operations (embed, summarize, rerank)
├── mcp-server-simple.ts    # MCP server implementation (JSON-RPC over stdio)
├── config-manager.ts       # Configuration file management
├── indexing-progress.ts    # Progress tracking for indexing operations
├── prompts-manager.ts      # AI prompt management and customization
├── types.ts                # TypeScript type definitions
└── utils.ts                # Utility functions
```
- Embeddings are normalized Float32 arrays stored as SQLite BLOBs
- Uses sqlite-vec extension for optimized vector search with brute-force fallback
- Files are only reprocessed if they are not found in the database (no automatic change detection)
- Use the `--force` flag to reprocess all files regardless of database state
- WAL mode is enabled for better SQLite performance
- Dual storage: a main `files` table plus a virtual `vec_files` table for vector operations (see the sketch below)
- Supports text files with automatic MIME type detection
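
A minimal sketch of that dual-table layout, assuming `better-sqlite3` and the `sqlite-vec` npm package; the `files` and `vec_files` table names follow the notes above, while the columns and helper are illustrative rather than the actual schema in `sqlite.ts`.

```typescript
// Sketch only: dual-table vector storage with better-sqlite3 + sqlite-vec.
import Database from 'better-sqlite3';
import * as sqliteVec from 'sqlite-vec';

const db = new Database('./.data/index.db');
db.pragma('journal_mode = WAL'); // WAL mode for better concurrent performance
sqliteVec.load(db);              // load the sqlite-vec extension

db.exec(`
  CREATE TABLE IF NOT EXISTS files (
    id TEXT PRIMARY KEY,
    path TEXT NOT NULL,
    summary TEXT,
    embedding BLOB                -- normalized Float32 vector stored as a BLOB
  );
  CREATE VIRTUAL TABLE IF NOT EXISTS vec_files USING vec0(embedding float[1536]);
`);

// Mirror an embedding into the vector table as a Float32 BLOB.
function insertVector(rowid: number, vector: Float32Array): void {
  const blob = Buffer.from(vector.buffer, vector.byteOffset, vector.byteLength);
  db.prepare('INSERT INTO vec_files (rowid, embedding) VALUES (?, ?)').run(rowid, blob);
}

// KNN query through sqlite-vec; a brute-force cosine scan over files.embedding
// is the fallback when the extension cannot be loaded (omitted here).
const knn = db.prepare(
  'SELECT rowid, distance FROM vec_files WHERE embedding MATCH ? AND k = ? ORDER BY distance',
);
```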
License: MIT