Skip to content

Production-ready LLM toolkit with unified interface for multiple providers (OpenAI, Anthropic, Google, Ollama)

License

Notifications You must be signed in to change notification settings

leebeanbin/beanllm

Repository files navigation

πŸš€ beanllm

Production-ready LLM toolkit with Clean Architecture and unified interface for multiple providers

PyPI version Python 3.11+ License: MIT Downloads Tests GitHub Stars

beanllm is a comprehensive, production-ready toolkit for building LLM applications with a unified interface across OpenAI, Anthropic, Google, DeepSeek, Perplexity, and Ollama. Built with Clean Architecture and SOLID principles for maintainability and scalability.


πŸ“š Documentation


✨ Key Features

🎯 Core Features

  • πŸ”„ Unified Interface - Single API for 7 LLM providers (OpenAI, Claude, Gemini, DeepSeek, Perplexity, Ollama)
  • πŸŽ›οΈ Intelligent Adaptation - Automatic parameter conversion between providers
  • πŸ“Š Model Registry - Auto-detect available models from API keys
  • πŸ” CLI Tools - Inspect models and capabilities from command line
  • πŸ’° Cost Tracking - Accurate token counting and cost estimation
  • πŸ—οΈ Clean Architecture - Layered architecture with clear separation of concerns

πŸ“„ RAG & Document Processing

  • πŸ“‘ Document Loaders - PDF, DOCX, XLSX, PPTX (Docling), Jupyter Notebooks, HTML, CSV, TXT
  • πŸš€ beanPDFLoader - Advanced PDF processing with 3-layer architecture
    • ⚑ Fast Layer (PyMuPDF): ~2s/100 pages, image extraction
    • 🎯 Accurate Layer (pdfplumber): 95% accuracy, table extraction
    • πŸ€– ML Layer (marker-pdf): 98% accuracy, structure-preserving Markdown
  • βœ‚οΈ Smart Text Splitters - Semantic chunking with tiktoken
  • πŸ—„οΈ Vector Search - Chroma, FAISS, Pinecone, Qdrant, Weaviate, Milvus, LanceDB, pgvector
  • 🎯 RAG Pipeline - Complete question-answering system in one line
  • πŸ“Š RAG Evaluation - TruLens integration, context recall metrics

🧠 Embeddings

  • πŸ“ Text Embeddings - OpenAI, Gemini, Voyage, Jina, Mistral, Cohere, HuggingFace, Ollama
  • 🌏 Multilingual - Qwen3-Embedding-8B (top multilingual model)
  • πŸ’» Code Embeddings - Specialized embeddings for code search
  • πŸ–ΌοΈ Vision Embeddings - CLIP, SigLIP, MobileCLIP for image-text matching
  • 🎨 Advanced Features - Matryoshka (dimension reduction), MMR search, hard negative mining

πŸ‘οΈ Vision AI

  • βœ‚οΈ Segmentation - SAM 3 (zero-shot segmentation)
  • 🎯 Object Detection - YOLOv12 (latest detection/segmentation)
  • πŸ€– Vision-Language - Qwen3-VL (VQA, OCR, captioning, 128K context)
  • πŸ–ΌοΈ Image Understanding - Florence-2 (detection, captioning, VQA)
  • πŸ” Vision RAG - Image-based question answering with CLIP embeddings

πŸŽ™οΈ Audio Processing

  • 🎀 Speech-to-Text - 8 STT engines with multilingual support
    • ⚑ SenseVoice-Small: 15x faster than Whisper-Large, emotion recognition, ν•œκ΅­μ–΄ 지원
    • 🏒 Granite Speech 8B: Open ASR Leaderboard #2 (WER 5.85%), enterprise-grade
    • πŸ”₯ Whisper V3 Turbo, Distil-Whisper, Parakeet TDT, Canary, Moonshine
  • πŸ”Š Text-to-Speech - Multi-provider TTS (OpenAI, Azure, Google)
  • 🎧 Audio RAG - Search and QA across audio files

πŸ€– Advanced LLM Features

  • πŸ› οΈ Tools & Agents - Function calling with ReAct pattern
  • 🧠 Memory Systems - Buffer, window, token-based, summary memory
  • ⛓️ Chains - Sequential, parallel, and custom chain composition
  • πŸ“Š Output Parsers - Pydantic, JSON, datetime, enum parsing
  • πŸ’« Streaming - Real-time response streaming
  • 🎯 Structured Outputs - 100% schema accuracy (OpenAI strict mode)
  • πŸ’Ύ Prompt Caching - 85% latency reduction, 10x cost savings (Anthropic)
  • ⚑ Parallel Tool Calling - Concurrent function execution

πŸ•ΈοΈ Graph & Multi-Agent

  • πŸ“Š Graph Workflows - LangGraph-style DAG execution
  • 🀝 Multi-Agent - Sequential, parallel, hierarchical, debate patterns
  • πŸ’Ύ State Management - Automatic state threading and checkpoints
  • πŸ“ž Communication - Inter-agent message passing

🏭 Production Features

  • πŸ“ˆ Evaluation - BLEU, ROUGE, LLM-as-Judge, RAG metrics, context recall
  • πŸ‘€ Human-in-the-Loop - Feedback collection and hybrid evaluation
  • πŸ”„ Continuous Evaluation - Scheduled evaluation and tracking
  • πŸ“‰ Drift Detection - Model performance monitoring
  • 🎯 Fine-tuning - OpenAI fine-tuning API integration
  • πŸ›‘οΈ Error Handling - Retry, circuit breaker, rate limiting
  • πŸ“Š Tracing - Distributed tracing with OpenTelemetry

⚑ Performance Optimizations (v0.2.1)

Algorithm Optimizations:

  • πŸš€ Model Parameter Lookup: 100Γ— speedup (O(n) β†’ O(1)) - Pre-cached dictionary lookup
  • πŸ” Hybrid Search: 10-50% faster top-k selection (O(n log n) β†’ O(n log k)) - heapq.nlargest() optimization
  • πŸ“ Directory Loading: 1000Γ— faster pattern matching (O(nΓ—mΓ—p) β†’ O(nΓ—m)) - Pre-compiled regex patterns

Code Quality:

  • 🧹 Duplicate Code: ~100+ lines eliminated via helper methods (CSV loader, cache consolidation)
  • πŸ›‘οΈ Error Handling: Standardized utilities in base provider (reduces boilerplate across all providers)
  • πŸ—οΈ Architecture: Single Responsibility, DRY principle, Template Method pattern

Impact:

  • Model-heavy workflows: 10-30% faster
  • Large-scale RAG: 20-50% faster
  • Directory scanning: 50-90% faster

πŸ—οΈ Project Structure Improvements (v0.2.1)

Phase 1: Configuration & Cleanup:

  • βœ… MANIFEST.in: Fixed package name bug (llmkit β†’ beanllm)
  • βœ… Dependencies: Moved pytest to dev, added version caps (prevents breaking changes)
  • βœ… .env.example: Created template with all required API keys
  • βœ… Cleanup: Removed ~396MB of unnecessary files (caches, build artifacts, bytecode)
  • βœ… Simplified: Eliminated duplicate re-export layers (vector_stores/, embeddings.py)

Phase 2: Code Quality & Utilities:

  • ✨ DependencyManager: Centralized dependency checking (261 duplicates β†’ 1)
  • ✨ LazyLoadMixin: Deferred initialization pattern (23 duplicates β†’ 1)
  • ✨ StructuredLogger: Consistent logging (510+ calls unified)
  • ✨ Module Naming: _source_providers/ β†’ providers/, _source_models/ β†’ models/

Phase 3: God Class Decomposition (5,930 lines β†’ 23 files):

  • πŸ“¦ vision/models.py (1,845 lines) β†’ 4 files (sam, florence, yolo, + 4 more models)
  • πŸ“¦ vector_stores/implementations.py (1,650 lines) β†’ 9 files (8 stores + re-exports)
  • πŸ“¦ loaders/loaders.py (1,435 lines) β†’ 8 files (7 loaders + re-exports)

Impact:

  • Disk space: -396MB (-99%)
  • Code duplication: -90% (794 β†’ ~80)
  • God classes: 5 β†’ 0 (all decomposed βœ…)
  • Average file size: ~200 lines (was 1,500+)
  • New modules: +21 focused files
  • Utility modules: +3 (reusable)
  • Configuration bugs: 0 (all fixed)
  • Module naming: 100% consistent
  • Backward compatibility: Maintained (re-exports)

πŸ“¦ Installation

Using pip

# Basic installation
pip install beanllm

# Specific providers
pip install beanllm[openai]
pip install beanllm[anthropic]
pip install beanllm[gemini]
pip install beanllm[all]

# ML-based PDF processing
pip install beanllm[ml]

# Development tools
pip install beanllm[dev,all]

Using Poetry (ꢌμž₯)

git clone https://github.com/yourusername/beanllm.git
cd beanllm
poetry install --extras all
poetry shell

πŸš€ Quick Start

Environment Setup

Create .env file in project root:

# LLM Providers
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=sk-...
PERPLEXITY_API_KEY=pplx-...
OLLAMA_HOST=http://localhost:11434

πŸ’¬ Basic Chat

import asyncio
from beanllm import Client

async def main():
    # Unified interface - works with any provider
    client = Client(model="gpt-4o")
    response = await client.chat(
        messages=[{"role": "user", "content": "Explain quantum computing"}]
    )
    print(response.content)

    # Switch providers seamlessly
    client = Client(model="claude-sonnet-4-20250514")
    response = await client.chat(
        messages=[{"role": "user", "content": "Same question, different provider"}]
    )

    # Streaming
    async for chunk in client.stream_chat(
        messages=[{"role": "user", "content": "Tell me a story"}]
    ):
        print(chunk, end="", flush=True)

asyncio.run(main())

πŸ“š RAG in One Line

import asyncio
from beanllm import RAGChain

async def main():
    # Create RAG system from documents
    rag = RAGChain.from_documents("docs/")

    # Ask questions
    answer = await rag.query("What is this document about?")
    print(answer)

    # With sources
    result = await rag.query("Explain the main concept", include_sources=True)
    print(result.answer)
    for source in result.sources:
        print(f"πŸ“„ Source: {source.metadata.get('source', 'unknown')}")

    # Streaming query
    async for chunk in rag.stream_query("Tell me more"):
        print(chunk, end="", flush=True)

asyncio.run(main())

πŸ› οΈ Tools & Agents

import asyncio
from beanllm import Agent, Tool

async def main():
    # Define tools
    @Tool.from_function
    def calculator(expression: str) -> str:
        """Evaluate a math expression"""
        return str(eval(expression))

    @Tool.from_function
    def get_weather(city: str) -> str:
        """Get weather for a city"""
        return f"Sunny, 22Β°C in {city}"

    # Create agent
    agent = Agent(
        model="gpt-4o-mini",
        tools=[calculator, get_weather],
        max_iterations=10
    )

    # Run agent
    result = await agent.run("What is 25 * 17? Also what's the weather in Seoul?")
    print(result.answer)
    print(f"⏱️ Steps: {result.total_steps}")

asyncio.run(main())

πŸ•ΈοΈ Graph Workflows

import asyncio
from beanllm import StateGraph, Client

async def main():
    client = Client(model="gpt-4o-mini")

    # Create graph
    graph = StateGraph()

    async def analyze(state):
        response = await client.chat(
            messages=[{"role": "user", "content": f"Analyze: {state['input']}"}]
        )
        state["analysis"] = response.content
        return state

    async def improve(state):
        response = await client.chat(
            messages=[{"role": "user", "content": f"Improve: {state['input']}"}]
        )
        state["improved"] = response.content
        return state

    def decide(state):
        score = 0.9 if "excellent" in state["analysis"].lower() else 0.5
        return "good" if score > 0.8 else "bad"

    # Build graph
    graph.add_node("analyze", analyze)
    graph.add_node("improve", improve)
    graph.add_conditional_edges("analyze", decide, {
        "good": "END",
        "bad": "improve"
    })
    graph.add_edge("improve", "END")
    graph.set_entry_point("analyze")

    # Run
    result = await graph.invoke({"input": "Draft proposal"})
    print(result)

asyncio.run(main())

🎨 Advanced Features

🎯 Structured Outputs (100% Schema Accuracy)

from openai import AsyncOpenAI

client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract: John Doe, 30, [email protected]"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "strict": True,  # βœ… 100% accuracy
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "email": {"type": "string"}
                },
                "required": ["name", "age", "email"]
            }
        }
    }
)

πŸ’Ύ Prompt Caching (10x Cost Savings)

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

response = await client.messages.create(
    model="claude-sonnet-4-20250514",
    system=[{
        "type": "text",
        "text": "Long system prompt..." * 1000,
        "cache_control": {"type": "ephemeral"}  # πŸ’° 10x cheaper
    }],
    messages=[{"role": "user", "content": "Question"}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
)

# Check cache savings
print(f"πŸ’Ύ Cache created: {response.usage.cache_creation_input_tokens}")
print(f"⚑ Cache read: {response.usage.cache_read_input_tokens}")

See Advanced Features Guide for more details.


🎯 Model Support

πŸ€– LLM Providers (7 providers)

  • OpenAI: GPT-5, GPT-4o, GPT-4.1, GPT-4o-mini
  • Anthropic: Claude Opus 4, Claude Sonnet 4.5, Claude Haiku 3.5
  • Google: Gemini 2.5 Pro, Gemini 2.5 Flash
  • DeepSeek: DeepSeek-V3 (671B MoE, open-source top performance)
  • Perplexity: Sonar (real-time web search + LLM)
  • Meta: Llama 3.3 70B (via Ollama)
  • Ollama: Local LLM support

🎀 Speech-to-Text (8 engines)

  • SenseVoice-Small: 15x faster than Whisper-Large, emotion recognition
  • Granite Speech 8B: Open ASR Leaderboard #2 (WER 5.85%)
  • Whisper V3 Turbo: Latest OpenAI model
  • Distil-Whisper: 6x faster with similar accuracy
  • Parakeet TDT: Real-time optimized (RTFx >2000)
  • Canary: Multilingual + translation
  • Moonshine: On-device optimized

πŸ‘οΈ Vision Models

  • SAM 3: Zero-shot segmentation
  • YOLOv12: Latest object detection
  • Qwen3-VL: Vision-language model (VQA, OCR, captioning)
  • Florence-2: Microsoft multimodal model

🧠 Embeddings

  • Qwen3-Embedding-8B: Top multilingual model
  • Code Embeddings: Specialized for code search
  • CLIP/SigLIP: Vision-text embeddings
  • OpenAI: text-embedding-3-small/large
  • Voyage, Jina, Cohere, Mistral: Alternative providers

πŸ—οΈ Architecture

beanllm follows Clean Architecture with SOLID principles.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Facade Layer                       β”‚
β”‚  μ‚¬μš©μž μΉœν™”μ  API (Client, RAGChain, Agent)       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Handler Layer                       β”‚
β”‚  Controller μ—­ν•  (μž…λ ₯ 검증, μ—λŸ¬ 처리)             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Service Layer                       β”‚
β”‚  λΉ„μ¦ˆλ‹ˆμŠ€ 둜직 (μΈν„°νŽ˜μ΄μŠ€ + κ΅¬ν˜„μ²΄)                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 Domain Layer                        β”‚
β”‚  핡심 λΉ„μ¦ˆλ‹ˆμŠ€ (μ—”ν‹°ν‹°, μΈν„°νŽ˜μ΄μŠ€, κ·œμΉ™)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Infrastructure Layer                     β”‚
β”‚  μ™ΈλΆ€ μ‹œμŠ€ν…œ (Provider, Vector Store κ΅¬ν˜„)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

μžμ„Έν•œ μ•„ν‚€ν…μ²˜ μ„€λͺ…은 **ARCHITECTURE.md**λ₯Ό μ°Έκ³ ν•˜μ„Έμš”.


πŸ”§ CLI Usage

# List available models
beanllm list

# Show model details
beanllm show gpt-4o

# Check providers
beanllm providers

# Quick summary
beanllm summary

# Export model info
beanllm export > models.json

πŸ§ͺ Testing

# Run all tests
pytest

# With coverage
pytest --cov=src/beanllm --cov-report=html

# Specific module
pytest tests/test_facade/ -v

Test Coverage: 61% (624 tests, 593 passed)


πŸ› οΈ Development

Using Makefile (ꢌμž₯)

# Install dev tools
make install-dev

# Quick auto-fix
make quick-fix

# Type check
make type-check

# Lint check
make lint

# Run all checks
make all

Manual

# Install in editable mode
pip install -e ".[dev,all]"

# Format code
ruff format src/beanllm

# Lint
ruff check src/beanllm

# Type check
mypy src/beanllm

πŸ—ΊοΈ Roadmap

βœ… Completed (2024-2025)

  • βœ… Clean Architecture & SOLID principles
  • βœ… Unified multi-provider interface (7 providers)
  • βœ… RAG pipeline & document processing
  • βœ… beanPDFLoader with 3-layer architecture
  • βœ… Vision AI (SAM 3, YOLOv12, Qwen3-VL)
  • βœ… Audio processing (8 STT engines)
  • βœ… Embeddings (Qwen3-Embedding-8B, Matryoshka, Code)
  • βœ… Vector stores (Milvus, LanceDB, pgvector)
  • βœ… RAG evaluation (TruLens, HyDE)
  • βœ… Advanced features (Structured Outputs, Prompt Caching, Parallel Tool Calling)
  • βœ… Tools, agents, graph workflows
  • βœ… Multi-agent systems
  • βœ… Production features (evaluation, monitoring, cost tracking)

πŸ“‹ Planned

  • ⬜ Benchmark system
  • ⬜ Advanced agent frameworks integration

πŸ“„ License

MIT License - see LICENSE file for details.


πŸ™ Acknowledgments

Inspired by:

Special thanks to:

  • OpenAI, Anthropic, Google, DeepSeek, Perplexity for APIs
  • Ollama team for local LLM support
  • Open-source AI community

πŸ“§ Contact


Built with ❀️ for the LLM community

Transform your LLM applications from prototype to production with beanllm.

About

Production-ready LLM toolkit with unified interface for multiple providers (OpenAI, Anthropic, Google, Ollama)

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages