CustomKB transforms your documents into AI-powered, searchable knowledgebases. It combines state-of-the-art embedding models, vector search, and language models to deliver contextually relevant answers from your data.
- Semantic Search: Find information by meaning, not just keywords
- Multi-Provider AI: OpenAI, Anthropic, Google, xAI, and local models via Ollama
- Universal Document Support: Process Markdown, HTML, code, PDFs, and plain text
- 27+ Language Support: Multi-language processing with automatic detection
- Hybrid Search: Combines vector similarity with BM25 keyword matching
- Cross-Encoder Reranking: Boosts accuracy by 20-40% with advanced models
- Enterprise Security: Input validation, path protection, API key security
- Memory-Optimized Tiers: Automatically adapts from 4GB to 128GB+ systems
- GPU Acceleration: CUDA support for faster reranking
- Concurrent Processing: Batch operations with configurable thread pools
- Smart Caching: Two-tier cache system with LRU eviction
- Production Ready: Checkpoint saving, automatic retries, graceful error handling
- Python 3.12 or higher
- SQLite 3.45+
- 4GB+ RAM (8GB+ recommended)
- NVIDIA GPU with CUDA (optional, for acceleration)
- API keys for chosen providers
git clone https://github.com/Open-Technology-Foundation/customkb.git
cd customkb
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
sudo ./setup/nltk_setup.py download cleanup
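# Confirm the CLI is available (assumes the standard --help flag is exposed;
# adjust if your install invokes the tool differently)
customkb --help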
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key" # Optional
export XAI_API_KEY="your-xai-key" # Optional
export VECTORDBS="/var/lib/vectordbs" # KB storage location
# 1. Create knowledgebase directory
mkdir -p /var/lib/vectordbs/myproject
# 2. Create configuration
cat > /var/lib/vectordbs/myproject/myproject.cfg << 'EOF'
[DEFAULT]
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
db_min_tokens = 200
db_max_tokens = 400
EOF
# 3. Process documents
customkb database myproject docs/*.md *.txt
# 4. Generate embeddings
customkb embed myproject
# 5. Query your knowledgebase
customkb query myproject "What are the main features?"
customkb database <kb_name> [files...] [options]
Process and store text files in the knowledgebase.
Options:
-l, --language: Stopwords language (en, fr, de, etc.)
--detect-language: Auto-detect language per file
-f, --force: Reprocess existing files
-v, --verbose: Detailed output
Example:
customkb database myproject ~/docs/**/*.md --detect-language
customkb embed <kb_name> [options]
Create vector embeddings for all text chunks.
Options:
-r, --reset-database: Reset embedding status
-v, --verbose: Show progress
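Example:
customkb embed myproject --verbose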
customkb query <kb_name> "<question>" [options]
Perform semantic search and generate AI responses.
Options:
-c, --context-only: Return only context, no AI response
-m, --model: AI model to use
-k, --top-k: Number of results (default: 50)
-t, --temperature: Response creativity (0-2)
-f, --format: Output format (xml, json, markdown, plain)
-p, --prompt-template: Response style template
Examples:
# Simple query
customkb query myproject "How does authentication work?"
# Advanced query with options
customkb query myproject "Explain the architecture" \
--model claude-3-5-sonnet-latest \
--format json \
--prompt-template technical
# Analyze and show recommendations
customkb optimize --analyze
# Apply optimizations
customkb optimize myproject
# Preview changes
customkb optimize myproject --dry-run
CustomKB automatically configures based on available memory:
| Memory | Tier | Features |
|---|---|---|
| <16GB | Low | Conservative settings, no hybrid search |
| 16-64GB | Medium | Balanced performance, moderate caching |
| 64-128GB | High | Large batches, hybrid search enabled |
| >128GB | Very High | Maximum performance, all features |
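If auto-detection misjudges available memory (common inside containers), you can override the detected value, as also shown in the deployment notes later in this document:
customkb optimize --memory-gb 8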
# Verify performance indexes
customkb verify-indexes myproject
# Build BM25 hybrid search index
customkb bm25 myproject
OpenAI
- GPT-4o, GPT-4o-mini (128k context)
- o3, o3-mini, o3-pro (reasoning models)
- o4-mini (multimodal reasoning)
Anthropic
- Claude 4.0 Opus/Sonnet (200k context)
- Claude 3.7 Sonnet (extended thinking)
- Claude 3.5 Sonnet/Haiku
Google
- Gemini 2.5 Pro/Flash/Lite (thinking models)
- Gemini 2.0 Pro/Flash (2M context)
xAI
- Grok 4.0, Grok 4.0-heavy (PhD-level reasoning)
Local (Ollama)
- Llama 3.3 (8B-70B)
- Gemma 3 (4B-27B)
- DeepSeek R1
- Qwen 2.5, Mistral, Phi-4
OpenAI
- text-embedding-3-large (3072 dims, best quality)
- text-embedding-3-small (1536 dims, cost-effective)
- text-embedding-ada-002 (1536 dims, legacy)
Google
- gemini-embedding-001 (768/1536/3072 dims)
- 68% MTEB score vs 64.6% for OpenAI
- 30k token context vs 8k
- Matryoshka Representation Learning
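Matryoshka Representation Learning is what allows gemini-embedding-001 to run at reduced dimensions without retraining. For example, to select the 1536-dimension variant in a KB config (matching the complete workflow later in this document):
vector_model = gemini-embedding-001
embedding_dimensions = 1536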
CustomKB uses INI-style configuration with environment variable overrides. Settings are resolved in the following order of precedence:
- Environment variables (highest)
- Configuration file
- Default values (lowest)
[DEFAULT]
# Models
vector_model = text-embedding-3-small
query_model = gpt-4o-mini
# Text Processing
db_min_tokens = 200
db_max_tokens = 400
# Query Settings
query_max_tokens = 4096
query_top_k = 30
query_temperature = 0.1
query_role = You are a helpful expert assistant.
# Output Format
reference_format = json # xml, json, markdown, plain
query_prompt_template = technical # default, scholarly, concise, etc.
[ALGORITHMS]
# Search Configuration
similarity_threshold = 0.6
enable_hybrid_search = true
bm25_weight = 0.5
bm25_max_results = 1000
# Reranking
enable_reranking = true
reranking_model = cross-encoder/ms-marco-MiniLM-L-6-v2
reranking_top_k = 30
[PERFORMANCE]
# Optimization
embedding_batch_size = 100
cache_thread_pool_size = 4
memory_cache_size = 10000
checkpoint_interval = 10
[API]
# Rate Limiting
api_call_delay_seconds = 0.05
api_max_concurrency = 8
api_max_retries = 20
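The bm25_weight setting in [ALGORITHMS] above balances the two retrieval signals in hybrid search. As a rough illustration only (a sketch, not CustomKB's actual scoring code), a weighted blend of normalized scores might look like this:

# Illustrative sketch only -- the real scoring logic lives inside CustomKB.
def hybrid_score(vector_sim, bm25_score, max_bm25, bm25_weight=0.5):
    """Blend a vector similarity in [0, 1] with a BM25 score that is
    normalized against the best BM25 score in the candidate set."""
    bm25_norm = bm25_score / max_bm25 if max_bm25 > 0 else 0.0
    return (1.0 - bm25_weight) * vector_sim + bm25_weight * bm25_norm

# Candidates below similarity_threshold (0.6 above) would typically be
# filtered out before blending; reranking then reorders the survivors.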
Customize response styles:
customkb query myproject "question" --prompt-template <template>
Templates: default, instructive, scholarly, concise, analytical, conversational, technical
Control how results are formatted:
# JSON for APIs
customkb query myproject "search" --format json
# Markdown for documentation
customkb query myproject "search" --format markdown
Filter results by categories:
# Categorize documents
customkb categorize myproject --import
# Query with filters
customkb query myproject "query" --categories "Technical,Legal"
# Process with specific language
customkb database myproject docs/*.txt --language french
# Auto-detect languages
customkb database myproject docs/ --detect-language
All knowledgebases live in $VECTORDBS (default: /var/lib/vectordbs):
/var/lib/vectordbs/
├── myproject/
│ ├── myproject.cfg # Configuration (required)
│ ├── myproject.db # SQLite database
│ ├── myproject.faiss # Vector index
│ ├── myproject.bm25 # BM25 index (optional)
│ └── logs/ # Runtime logs
The system intelligently resolves KB names:
# All resolve to the same KB:
customkb query myproject "test"
customkb query myproject.cfg "test"
customkb query /path/to/myproject "test"
# → Uses /var/lib/vectordbs/myproject/myproject.cfg
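A simplified sketch of what this resolution could look like (illustrative only; the resolve_kb_config helper below is hypothetical, not CustomKB's API):

import os

def resolve_kb_config(name, vectordbs="/var/lib/vectordbs"):
    # Hypothetical helper: strip any directory path and .cfg suffix,
    # then locate the config under $VECTORDBS/<kb>/<kb>.cfg.
    kb = os.path.basename(name)
    if kb.endswith(".cfg"):
        kb = kb[: -len(".cfg")]
    return os.path.join(vectordbs, kb, kb + ".cfg")

# resolve_kb_config("myproject"), resolve_kb_config("myproject.cfg"), and
# resolve_kb_config("/path/to/myproject") all yield
# /var/lib/vectordbs/myproject/myproject.cfg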
Located in scripts/:
optimize_kb_performance.py - Apply memory tiers
performance_analyzer.py - Analyze metrics
emergency_optimize.py - Conservative recovery
benchmark_gpu.py - GPU vs CPU benchmarks
gpu_monitor.sh - Real-time monitoring
rebuild_bm25_filtered.py - Filtered BM25 indexes
upgrade_bm25_tokens.py - Database upgrades
diagnose_crashes.py - Crash diagnostics
# Install test dependencies
pip install -r requirements-test.txt
# Run tests
python run_tests.py # All tests
python run_tests.py --unit # Unit only
python run_tests.py --safe # With memory limits
python run_tests.py --coverage # Coverage report
"Knowledgebase not found"
- Verify KB exists in
$VECTORDBS
- Check error message for available KBs
"API rate limit"
- Increase
api_call_delay_seconds
- Reduce
api_max_concurrency
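For example, in the KB's .cfg file (doubling the delay and halving the concurrency relative to the defaults shown earlier):
[API]
api_call_delay_seconds = 0.1
api_max_concurrency = 4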
"Out of memory"
- Run
customkb optimize --analyze
- Reduce
embedding_batch_size
"Low similarity scores"
- Check language match
- Try stronger embedding model
- Adjust
similarity_threshold
# Enable debug logging
customkb query myproject "test" --debug
# Check logs
tail -f /var/lib/vectordbs/myproject/logs/myproject.log
# Use as dv2 knowledgebase
dv2 -k /var/lib/vectordbs/myproject/myproject.cfg "question"
# Create custom agents
dv2 --edit-templates
# From development server
yes | ./push-to-okusi 3 -N
# Container deployment
customkb optimize --memory-gb 8 # Override for containers
# 1. Prepare data
mkdir -p /var/lib/vectordbs/techbase
cd /var/lib/vectordbs/techbase
# 2. Create configuration
cat > techbase.cfg << 'EOF'
[DEFAULT]
vector_model = gemini-embedding-001
embedding_dimensions = 1536
query_model = claude-3-5-sonnet-latest
db_min_tokens = 250
db_max_tokens = 500
[ALGORITHMS]
enable_hybrid_search = true
enable_reranking = true
similarity_threshold = 0.65
[PERFORMANCE]
embedding_batch_size = 150
memory_cache_size = 20000
EOF
# 3. Process documents
customkb database techbase ~/docs/**/*.md --detect-language
# 4. Generate embeddings
customkb embed techbase --verbose
# 5. Build indexes
customkb bm25 techbase
customkb optimize techbase
# 6. Verify setup
customkb verify-indexes techbase
# 7. Test queries
customkb query techbase "What are the best practices?" \
--prompt-template technical \
--format markdown
OPENAI_API_KEY # OpenAI API key
ANTHROPIC_API_KEY # Anthropic API key
GOOGLE_API_KEY # Google/Gemini API key
XAI_API_KEY # xAI API key
VECTORDBS # KB base directory
NLTK_DATA # NLTK data location
# Model shortcuts
-m gpt4o → gpt-4o
-m sonnet → claude-3-5-sonnet-latest
-m gemini2 → gemini-2.0-flash
-m list → Show all models
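Shortcuts work anywhere a model is accepted, for example:
customkb query myproject "test" -m sonnet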
- Use batch processing for large datasets
- Enable hybrid search for technical content
- Configure GPU acceleration when available
- Monitor cache hit rates in logs (see the example below)
- Run optimize after major changes
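For example, to spot cache activity in the runtime log (assuming cache statistics are written there; the exact log wording may differ):
grep -i cache /var/lib/vectordbs/myproject/logs/myproject.log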
GPL-3.0 License - see LICENSE file.
Actively maintained by the Indonesian Open Technology Foundation.